The Robot-Human demonstration in 20TB (RH20T) dataset contains over 110,000 contact-rich real-world robot manipulation sequences, each with a corresponding human demonstration video and a natural language description of the task performed. Every manipulation sequence includes timestamped visual (RGB, depth, and binocular IR images from three types of cameras), audio (from both in-hand and global sources), force (6 DoF F/T measurements at the robot’s wrist, joint torques, and for some sequences, fingertip tactile data), and action (6 DoF TCP/end-effector Cartesian poses, joint angles, and gripper states) information from multiple, manually calibrated sensors and multi-view cameras. In total, there are over 40 million image frames of robotic manipulation sequences and over 10 million image frames of human demonstrations.
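As a rough illustration of how the force and action modalities listed above might be grouped per timestep, here is a minimal Python sketch. All class names, field names, and types (`Timestep`, `ManipulationSequence`, `gripper_state` as a float, millisecond timestamps, a video path string) are hypothetical and do not reflect the dataset's actual file layout or any official loader; image and audio streams are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Timestep:
    """One timestamped sample (hypothetical layout, not the official format)."""
    timestamp_ms: int                       # assumed unit: milliseconds
    wrist_force_torque: Tuple[float, ...]   # 6 DoF F/T measured at the wrist
    joint_torques: Tuple[float, ...]
    tcp_pose: Tuple[float, ...]             # 6 DoF TCP/end-effector Cartesian pose
    joint_angles: Tuple[float, ...]
    gripper_state: float                    # assumed scalar encoding
    # Fingertip tactile data exists only for some sequences.
    fingertip_tactile: Optional[Tuple[float, ...]] = None


@dataclass
class ManipulationSequence:
    """A robot sequence paired with its task description and human demo."""
    task_description: str
    human_demo_video: str                   # hypothetical path to the paired video
    steps: List[Timestep] = field(default_factory=list)

    def has_tactile(self) -> bool:
        """True if any timestep carries fingertip tactile readings."""
        return any(s.fingertip_tactile is not None for s in self.steps)


if __name__ == "__main__":
    step = Timestep(
        timestamp_ms=0,
        wrist_force_torque=(0.0,) * 6,
        joint_torques=(0.0,) * 7,           # 7 joints is an arbitrary example
        tcp_pose=(0.0,) * 6,
        joint_angles=(0.0,) * 7,
        gripper_state=1.0,
    )
    seq = ManipulationSequence("example task", "demo.mp4", [step])
    print(seq.has_tactile())
```

Keeping the tactile field optional mirrors the statement that only some sequences include fingertip tactile data.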
RH20T was designed for learning complex robot manipulation skills from multi-modal perception data in a one-shot setting. The diversity of tasks, environments, robot configurations, and camera viewpoints is intended to promote generalization to different scenarios.
@inproceedings{fang2024rh20t,
title = {RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot},
author = {Fang, Hao-Shu and Fang, Hongjie and Tang, Zhenyu and Liu, Jirong and Wang, Chenxi and Wang, Junbo and Zhu, Haoyi and Lu, Cewu},
booktitle = {2024 IEEE International Conference on Robotics and Automation (ICRA)},
pages = {653--660},
year = {2024},
organization = {IEEE}
}