The WebVision dataset is designed to facilitate research on learning visual representations from noisy web data. Our goal is to free deep learning techniques from the heavy human labor of annotating large-scale vision datasets. We release this large-scale web image dataset as a benchmark to advance research on learning from web data, including weakly supervised visual representation learning, visual transfer learning, text and vision, etc. (see the recommended settings for the WebVision dataset).
Similar to the WebVision 1.0 dataset, the WebVision 2.0 dataset contains images crawled from the Flickr website and Google Images search. In this new version, we extend the number of visual concepts from 1,000 to 5,000, and the total number of training images reaches 16 million. The 5,000 visual concepts comprise the original 1,000 concepts in the WebVision 1.0 dataset plus the 4,000 additional ImageNet synsets with the largest number of images. Semantically overlapping synsets are removed, so that no synset is the parent or child of another. All 5,000 visual concepts have corresponding synsets in the ImageNet dataset, so many existing approaches can be directly investigated and compared to models trained on the human-annotated ImageNet dataset; this also makes it possible to study the dataset bias issue at large scale. The textual information accompanying these images (e.g., caption, user tags, or description) is also provided as additional meta information. A validation set containing around 250K images (up to 50 images per category) is provided to facilitate algorithmic development.
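As a minimal sketch of how the per-image metadata might be consumed, the snippet below counts crawled images per visual concept (synset). The record fields (`image_id`, `synset`, `tags`) are illustrative assumptions, not the official WebVision metadata schema:

```python
from collections import defaultdict

# Hypothetical per-image metadata records; the actual release ships
# captions, user tags, and descriptions in separate metadata files,
# and the field names here are assumptions for illustration only.
records = [
    {"image_id": "flickr_000001", "synset": "n02084071", "tags": ["dog", "pet"]},
    {"image_id": "google_000002", "synset": "n02084071", "tags": ["puppy"]},
    {"image_id": "flickr_000003", "synset": "n02121808", "tags": ["cat"]},
]

def images_per_synset(records):
    """Count crawled images per visual concept (synset ID)."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec["synset"]] += 1
    return dict(counts)

print(images_per_synset(records))
# → {'n02084071': 2, 'n02121808': 1}
```

Grouping by synset ID in this way is also how one would cap the validation split at 50 images per category.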