The CrowdHuman dataset is large, richly annotated, and highly diverse. The train and validation subsets contain a total of 470K human instances, with an average of 22.6 persons per image and various kinds of occlusion. Each human instance is annotated with a head bounding box, a human visible-region bounding box, and a human full-body bounding box. Cross-dataset generalization experiments show that models pre-trained on CrowdHuman achieve state-of-the-art performance on previous datasets, including Caltech-USA, CityPersons, and Brainwash, without bells and whistles.
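The three box types can be read straight from the annotation files. Below is a minimal Python sketch, assuming the .odgt JSON-lines layout distributed with the dataset; the field names (`ID`, `gtboxes`, `tag`, `hbox`, `vbox`, `fbox`) and the file name `annotation_train.odgt` are assumptions based on the released files and may differ in your copy.

```python
import json

def load_crowdhuman_annotations(odgt_path):
    """Parse a CrowdHuman .odgt file: one JSON object per line,
    each with an image ID and a list of per-person ground-truth boxes
    given as [x, y, w, h]."""
    records = []
    with open(odgt_path, "r") as f:
        for line in f:
            img = json.loads(line)
            persons = []
            for box in img.get("gtboxes", []):
                # Skip ignore-region entries (tagged "mask" in the released files).
                if box.get("tag") != "person":
                    continue
                persons.append({
                    "head": box.get("hbox"),     # head bounding box
                    "visible": box.get("vbox"),  # visible-region bounding box
                    "full": box.get("fbox"),     # full-body bounding box
                })
            records.append({"id": img["ID"], "persons": persons})
    return records

if __name__ == "__main__":
    anns = load_crowdhuman_annotations("annotation_train.odgt")  # hypothetical path
    total = sum(len(r["persons"]) for r in anns)
    print(f"{len(anns)} images, {total} person instances, "
          f"{total / max(len(anns), 1):.1f} persons per image")
```

Run against the train split, such a script should roughly reproduce the persons-per-image statistic quoted above.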
Images were obtained by crawling Google Image Search with ∼150 query keywords, such as “Pedestrians on the Fifth Avenue”, “people crossing the roads”, “students playing basketball”, and “friends at a party”. These keywords cover more than 40 cities around the world, various activities (e.g., party, traveling, and sports), and numerous viewpoints (e.g., surveillance and horizontal viewpoints). To keep the image distribution balanced, at most 500 images were crawled per keyword, yielding ∼60,000 candidate images in total. Images with only a small number of persons, or with little overlap between persons, were filtered out, leaving ∼25,000 images in the final CrowdHuman dataset.
@article{shao2018crowdhuman,
author = {Shao, Shuai and Zhao, Zijian and Li, Boxun and Xiao, Tete and Yu, Gang and Zhang, Xiangyu and Sun, Jian},
journal = {arXiv preprint arXiv:1805.00123},
title = {CrowdHuman: A Benchmark for Detecting Human in a Crowd},
year = {2018}
}