DexH2R: A Benchmark for Dynamic Dexterous Grasping in Human-to-Robot Handover

¹ShanghaiTech University, ²The University of Hong Kong
ICCV 2025

*Indicates Equal Contribution
Indicates Corresponding Author

Abstract

Handover between a human and a dexterous robotic hand is a fundamental yet challenging task in human-robot collaboration. It requires handling dynamic environments and a wide variety of objects, and demands robust and adaptive grasping strategies. However, progress in developing effective dynamic dexterous grasping methods is limited by the absence of high-quality, real-world human-to-robot handover datasets. Existing datasets primarily focus on grasping static objects or rely on synthesized handover motions, which differ significantly from real-world robot motion patterns, creating a substantial gap in applicability. In this paper, we introduce DexH2R, a comprehensive real-world dataset for human-to-robot handovers, built with a dexterous robotic hand. Our dataset captures a diverse range of interactive objects, dynamic motion patterns, rich visual sensor data, and detailed annotations. To ensure natural, human-like dexterous motions, we collect data via teleoperation, aligning the robot's movements with human behaviors and habits, a crucial characteristic for intelligent humanoid robots. Furthermore, we propose an effective solution, DynamicGrasp, for human-to-robot handover, and evaluate various state-of-the-art approaches, including auto-regressive models and diffusion-policy methods, providing a thorough comparison and analysis. We believe our benchmark will drive advances in human-to-robot handover research by offering a high-quality dataset, effective solutions, and comprehensive evaluation metrics.

Real Robot Results

Overview

We introduce DexH2R, the first real-world human-to-robot handover dataset, and establish DynamicGrasp, an effective and practical solution, along with a comprehensive benchmark for human-to-robot handover. The benchmark can benefit a broad range of real-world applications and can be extended to robot-to-robot handover tasks.

Hardware System Setup

Overview of our hardware system. The upper part shows the system setup during dataset recording, while the lower part highlights the key components: RealSense D455, Azure Kinect, ZCAM E2, and the teleoperation glove.
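As a concrete illustration of reading one sensor in such a rig, the sketch below streams aligned RGB-D frames from a RealSense D455 via the pyrealsense2 SDK. The resolution, frame rate, and color-aligned depth are illustrative assumptions, not the configuration used to record DexH2R.

```python
# Minimal sketch: streaming RGB-D frames from a RealSense D455 with pyrealsense2.
# Resolution, frame rate, and color-aligned depth are illustrative assumptions,
# not the actual DexH2R recording configuration.
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 1280, 720, rs.format.z16, 30)
pipeline.start(config)

# Align depth to the color frame so pixels correspond across modalities.
align = rs.align(rs.stream.color)

try:
    for _ in range(300):  # capture ~10 s at 30 fps
        frames = align.process(pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())
        # ... timestamp and write both arrays to disk here ...
finally:
    pipeline.stop()
```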

Dataset

Comparison with existing handover datasets. Entries marked with * denote datasets that can serve as givers to generate synthetic handover data or act as benchmarks. H2H denotes human-to-human, H2R human-to-robot, H2X human-to-any, and R2X robot-to-any. "#" indicates an attribute count; "-" marks unavailable data.

Dataset Visualization

Our Solution Pipeline

Our dynamic grasping solution for human-to-robot handover consists of three stages: (a) grasping pose preparation, (b) approaching motion generation, and (c) goal pose alignment. In the first stage, we pre-train a grasping pose generation model on large-scale simulation data and fine-tune it on real-world grasping data. The generated pose candidates are refined with physical and geometric filters to guarantee stable, practical grasping poses. For approaching motion generation, we explore both diffusion-policy and autoregressive methods to generate sequential poses. Once the hand reaches a predefined proximity to the object, the system transitions to the final stage, where a simple yet effective linear interpolation aligns the hand with the goal pose, ensuring precise, physically plausible grasping that respects environmental constraints.
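The goal pose alignment stage admits a very compact implementation. Below is a minimal sketch of the linear interpolation it describes, blending the robot's current joint configuration into the selected grasp pose over a fixed number of control steps; the joint-space parameterization, DoF count, and step count are our assumptions.

```python
# Minimal sketch of the goal-pose-alignment stage: linearly interpolate from the
# robot's current configuration to the selected grasp pose in joint space.
# The joint-space parameterization and step count are illustrative assumptions.
import numpy as np

def align_to_grasp(q_current: np.ndarray,
                   q_grasp: np.ndarray,
                   num_steps: int = 50) -> np.ndarray:
    """Return a (num_steps, dof) trajectory from q_current to q_grasp."""
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - alphas) * q_current + alphas * q_grasp

# Example: a 7-DoF arm plus a 16-DoF dexterous hand (23 joints, assumed).
q_now = np.zeros(23)
q_goal = np.random.uniform(-0.5, 0.5, size=23)
for q in align_to_grasp(q_now, q_goal):
    pass  # send q to the robot's joint-position controller at a fixed rate
```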

Grasp Pose Preparation Performance

Evaluation results on the DexH2R dataset. We compare a cVAE baseline with DexGraspAnything and conduct an ablation study on pretraining with large-scale synthetic data. Training directly on DexH2R achieves performance comparable to pretraining on DexGraspNet, with higher succ1 (89.38 vs. 88.75) and succ6 (35.56 vs. 35.00). This highlights the quality of DexH2R's grasping poses, despite its focus on dynamic human-to-robot handover tasks.
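For context on the cVAE baseline referenced above, here is a minimal sketch of a conditional VAE that decodes grasp-pose candidates from an object feature. The feature dimension, pose dimension, and network widths are placeholders, not the benchmarked architecture.

```python
# Minimal sketch of a conditional VAE for grasp-pose generation: encode a hand
# pose conditioned on an object feature, then decode pose candidates from the
# latent. All dimensions and widths are placeholder assumptions.
import torch
import torch.nn as nn

OBJ_DIM, POSE_DIM, LATENT_DIM = 256, 23, 32  # assumed sizes

class GraspCVAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: (hand pose, object feature) -> latent mean and log-variance.
        self.encoder = nn.Sequential(
            nn.Linear(POSE_DIM + OBJ_DIM, 256), nn.ReLU(),
            nn.Linear(256, 2 * LATENT_DIM),
        )
        # Decoder: (latent, object feature) -> hand pose.
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + OBJ_DIM, 256), nn.ReLU(),
            nn.Linear(256, POSE_DIM),
        )

    def forward(self, pose, obj_feat):
        mu, logvar = self.encoder(torch.cat([pose, obj_feat], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        recon = self.decoder(torch.cat([z, obj_feat], -1))
        return recon, mu, logvar

    @torch.no_grad()
    def sample(self, obj_feat, n):
        # Draw n latents and decode n grasp-pose candidates for one object.
        z = torch.randn(n, LATENT_DIM)
        return self.decoder(torch.cat([z, obj_feat.expand(n, -1)], -1))

model = GraspCVAE()
candidates = model.sample(torch.randn(1, OBJ_DIM), n=6)  # filter downstream
```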

Our Solution Performance

Comparison of dynamic grasping solutions on the test set. To comprehensively evaluate the models' robustness and precision, we define two task modes: Easy Mode and Hard Mode. Easy Mode evaluates the model's ability to understand the global position of the object, while Hard Mode tests fine-grained pose alignment and collision avoidance in close proximity, which is crucial for safe and precise interaction. Together, the two modes provide a balanced evaluation of model robustness, covering real-world scenarios where both global localization and local precision are essential.
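To make the two-mode protocol concrete, the sketch below shows the shape such an evaluation could take: Easy Mode checks coarse reaching of the object, while Hard Mode additionally requires a valid final grasp and no collisions. The thresholds and checks are placeholders; the benchmark's actual success criteria are specified in the paper.

```python
# Purely illustrative two-mode evaluation harness. The coarse radius and the
# grasp/collision checks are placeholders, not the benchmark's criteria.
import numpy as np

def evaluate_rollout(wrist_traj, object_pos, final_grasp_ok, collided,
                     coarse_radius=0.10):
    """Score one handover rollout under both task modes (placeholder criteria)."""
    # Easy Mode: did the wrist ever enter a coarse neighborhood of the object?
    reached = np.min(np.linalg.norm(wrist_traj - object_pos, axis=-1)) < coarse_radius
    # Hard Mode: additionally require a valid final grasp and zero collisions.
    return {"easy": bool(reached),
            "hard": bool(reached and final_grasp_ok and not collided)}

# Example rollout: 100 wrist positions approaching an object at the origin.
traj = np.linspace([0.5, 0.0, 0.3], [0.02, 0.0, 0.01], 100)
print(evaluate_rollout(traj, np.zeros(3), final_grasp_ok=True, collided=False))
```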

Visualization of our solution's results
