Challenge Results

Rank | Team name      | Best Top-5 Acc. (%) | Entry-1       | Entry-2       | Entry-3       | Entry-4       | Entry-5
-----|----------------|---------------------|---------------|---------------|---------------|---------------|--------------
1    | Alibaba-Vision | 82.54               | 82.37 / 60.03 | 82.50 / 60.13 | 82.51 / 60.22 | 82.54 / 60.24 | 82.54 / 60.24
2    | BigVideo       | 82.05               | 82.01 / 59.77 | 82.02 / 59.73 | 82.01 / 59.73 | 81.94 / 59.66 | 82.05 / 59.80
3    | huaweicloud    | 81.15               | 80.46 / 57.74 | 81.07 / 58.54 | 81.15 / 58.60 | 81.11 / 58.60 | 81.15 / 58.63
4    | Y_Y            | 80.69               | 80.69 / 57.88 | 80.61 / 57.87 | 80.45 / 57.09 | 79.75 / 56.57 | 80.50 / 57.49
5    | PCI            | 77.92               | 76.64 / 54.85 | 77.17 / 55.57 | 77.20 / 55.51 | 77.92 / 55.88 | 75.18 / 52.82

Entry columns report Top-5 / Top-1 accuracy (%).

Team Information

Alibaba-Vision Lele Cheng, Liming Zhao, Dangwei Li, Chenwei Xie, Yun Zheng, Yingya Zhang, Pan Pan

Alibaba Group.

deeplearnweb@126.com
The main idea of our method is to learn with side information provided by the search engine, WordNet [1], and a BERT model [2]. The semantic knowledge extracted from this side information is used to generate a sampling weight for each image. In the training stage, we adopt a class-balanced sampling strategy to handle the long-tail problem; within each class, we sample images according to the weights derived from the semantic knowledge, which mitigates noisy annotations.
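A minimal sketch of how such a sampler might combine class balancing with per-image weights; the `semantic_scores` input stands in for the side-information-derived weights, whose construction the description does not detail:

```python
import numpy as np

def build_sampling_weights(labels, semantic_scores):
    """Give every class the same total sampling probability; within a
    class, distribute that probability according to each image's
    semantic-relevance score."""
    labels = np.asarray(labels)
    scores = np.asarray(semantic_scores, dtype=np.float64)
    weights = np.zeros_like(scores)
    for c in np.unique(labels):
        mask = labels == c
        weights[mask] = scores[mask] / scores[mask].sum()
    return weights / weights.sum()  # normalize to a distribution

# Draw a class-balanced, noise-aware batch of indices:
# batch_idx = np.random.choice(len(labels), size=256, p=weights)
```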

We adopt ResNeXt101 [3] as the baseline network. In addition, SE-ResNeXt101 [4], SENet154 [4], and NASNet [5] are used to improve performance. In the test stage, we apply multi-crop and multi-scale testing and multi-model fusion to produce the final submitted results.

[1] https://wordnet.princeton.edu/
[2] https://github.com/google-research/bert
[3] Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017.
[4] Squeeze-and-Excitation Networks, CVPR 2018.
[5] Learning Transferable Architectures for Scalable Image Recognition, CVPR 2018.

Entry Description:
Entry 1: fusion of all models by simple averaging
Entry 2: a randomly selected subset of the models, averaged
Entry 3: the same models as Entry 2, using a rank-based average
Entry 4: another randomly selected subset of the models, averaged
Entry 5: the same models as Entry 2, with predefined model weights
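For reference, a minimal sketch of the three fusion schemes these entries mention; the rank-based variant is one common reading of "rank-based average", since the description gives no details:

```python
import numpy as np

def average_fusion(probs):
    """probs: (n_models, n_classes) per-model class scores."""
    return probs.mean(axis=0)

def weighted_fusion(probs, model_weights):
    """Predefined per-model weights, as in Entry 5."""
    return np.average(probs, axis=0, weights=model_weights)

def rank_fusion(probs):
    """Replace each model's scores by their within-model ranks before
    averaging, so models with different score scales contribute equally."""
    ranks = probs.argsort(axis=1).argsort(axis=1)
    return ranks.mean(axis=0)
```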
BigVideo Huabin Zheng, Litong Feng, Yuming Chen, Weirong Chen, Zhe Huang, Zhanbo Sun, Wayne Zhang

SenseTime

lightedfeng@gmail.com

The WebVision 2019 challenge defines a large-scale webly-supervised image classification problem. We solve this problem from four perspectives: strong models, data filtering, training strategy, and ensemble strategy.
(1) Strong models
CNN models with strong visual description abilities are needed to handle this large-scale image classification task with 5000 classes. At the same time, we need to make a good trade-off between training time and accuracy due to limited GPU resources. Typical model structures selected are as follows:
- Network A: a variant of SEResNeXt152 [1], with only 3 SE blocks between 4 residual stages.
- Network B: a 5-stage variant of Network A.
- Network C: a variant of OctaveResNet152 [2].
- Network D: a variant of Res2Net152 [3].
- Network E: the original SEResNet152 [1].
(2) Data filtering with an NLP model
Noisy samples are widespread in WebVision because the webly-crawled images lack human annotation. Images crawled from search engines come with text data in their meta information. We use this text to filter out noisy images via BERT embeddings [4]: the WordNet description of each class is used to compute a label embedding, and each sample's meta information is used to compute a document embedding. Samples whose document embedding has a low cosine similarity to their label embedding are dropped as noise. By setting different thresholds and filtering percentages, we produce four partial training sets from the original full training set, with sizes of 8.51M, 5.31M, and 3.51M images. These partial training sets help speed up training from scratch and improve a model's accuracy through fine-tuning.
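A minimal sketch of this filtering step, assuming BERT embeddings have already been computed for each class's WordNet description and each sample's meta text; the threshold value is illustrative, as the description does not give one:

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def filter_samples(sample_embeddings, sample_labels, label_embeddings,
                   threshold=0.3):
    """Keep samples whose meta-text embedding is close enough to the
    embedding of their assigned class; return the kept indices."""
    kept = []
    for i, (emb, lab) in enumerate(zip(sample_embeddings, sample_labels)):
        if cosine_similarity(emb, label_embeddings[lab]) >= threshold:
            kept.append(i)
    return kept
```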
(3) Training strategy
Due to limited GPU resources, we obtain expanded input sizes, denoising, and model diversity through fine-tuning rather than training from scratch. The training strategy is detailed as follows:
- Phase 1: We train the networks from scratch using roughly the same training settings as [5].
- Phase 2: We fine-tune the models with expanded input image sizes, together with GEM pooling [6] (see the sketch after this list). We use the negative log-likelihood loss to self-supervise the models without ground-truth labels, which serves as denoising.
- Phase 3: Different partial training sets are selected to fine-tune the models to pursue diversity. We fine-tune only the final fully-connected layers and freeze all other layers. Ground-truth labels are used again. Cross-entropy (CE) loss and binary cross-entropy (BCE) loss are used separately for diversity.
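A minimal PyTorch sketch of the GEM (generalized-mean) pooling layer referenced in Phase 2, following [6]; the initial p = 3 is a common default rather than a value from this write-up:

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized-mean pooling: a learnable exponent p interpolates
    between average pooling (p = 1) and max pooling (p -> inf)."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):               # x: (N, C, H, W) feature map
        x = x.clamp(min=self.eps).pow(self.p)
        x = x.mean(dim=(-2, -1))        # spatial mean of x^p
        return x.pow(1.0 / self.p)      # (N, C) pooled features
```

It typically replaces the global average pooling layer before the classifier when fine-tuning at larger input sizes.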
(4) Ensemble strategy
We produce a variety of single models. Each model is tested using multiple crops, and the predictions of the different models are then combined by a weighted sum to produce the final predictions.
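A sketch of that test-time pattern; the crop generation and the actual model weights are not specified in the write-up and are assumptions here:

```python
import torch

def ensemble_predict(models, crops, model_weights):
    """crops: (n_crops, 3, H, W) test-time crops of one image.
    Average each model's softmax over crops, then weight-sum models."""
    final = 0.0
    for model, w in zip(models, model_weights):
        with torch.no_grad():
            probs = model(crops).softmax(dim=-1)  # (n_crops, n_classes)
        final = final + w * probs.mean(dim=0)     # average over crops
    return final                                   # (n_classes,)
```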
References
[1] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[2] Chen, Yunpeng, et al. "Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution." arXiv preprint arXiv:1904.05049 (2019).
[3] Gao, Shang-Hua, et al. "Res2Net: A New Multi-scale Backbone Architecture." arXiv preprint arXiv:1904.01169 (2019).
[4] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[5] Xie, Junyuan, et al. "Bag of tricks for image classification with convolutional neural networks." arXiv preprint arXiv:1812.01187 (2018).
[6] Berman, Maxim, et al. "MultiGrain: a unified image embedding for classes and instances." arXiv preprint arXiv:1902.05509 (2019).
Entry Description:
Entry 1: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list with a simple average
Entry 2: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list with a weighted average
Entry 3: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list, C-8.5M-list, D-8.5M-list, E-8.5M-list with a simple average
Entry 4: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list, C-8.5M-list, D-8.5M-list, E-8.5M-list with a weighted average
Entry 5: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list, C-8.5M-list, D-8.5M-list, E-8.5M-list, A-full-list-BCE, B-full-list-BCE with a weighted average

huaweicloud Lin Chen1, Zhikun Lin2, Anyin Song3, Yaxiong Chi3, Chenhui Qiu4, Shouping Shan4, Lixin Duan2

1Futurewei
2University of Electronic Science and Technology of China
3Huawei Cloud
4Xidian University

songanyin@huawei.com

Our work is implemented using the Huawei MoXing framework [1], which slightly improves accuracy while being much faster in training. As for the algorithms, the main idea is to leverage the meta information of each image, together with information from the search engine, to clean up the data; to use knowledge distillation to handle noisy labels; and to use a heuristic algorithm to learn an ensemble model. The details are as follows:
1. We cluster the images based on density and the confusion matrix to clean up the data, and then assign a different weight to each training image;
2. For each class, we compute the weight of each sample based on its distance to the images ranked highly by the search engine;
3. We clean up the data by matching image descriptions/tags with WordNet [3] glosses based on BERT [4] and TF-IDF;
4. We use knowledge distillation to handle noisy labels, following [2] (a sketch of the distillation objective follows this list);
5. We use different types of state-of-the-art network architectures, including ResNet, ResNeXt, DenseNet, etc.;
6. During testing, we apply multi-scale and multi-crop augmentation to each test image;
7. We also ensemble different models using different strategies, including 1) averaging logits; and 2) learning the combination weights with a heuristic algorithm.
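As a rough sketch of the distillation objective behind step 4, the temperature and mixing weight below are illustrative defaults, and the noisy-label-specific machinery of [2] is omitted:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=4.0, alpha=0.5):
    """Blend a KL term against the temperature-softened teacher with
    the usual cross-entropy on the (possibly noisy) labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```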


Entry Description:

Entry 1: single-model distillation + model ensemble (average logits)
Entry 2: fusion-model distillation + model ensemble (average logits)
Entry 3: single-model distillation + model ensemble (heuristic algorithm)
Entry 4: fusion-model distillation + model ensemble (auto drop)
Entry 5: fusion-model distillation + model ensemble (drop + heuristic algorithm)

[1] What Is MoXing? https://support.huaweicloud.com/en-us/dls_faq/en-us_topic_0105152979.html
[2] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jia Li and Jiebo Luo, “Learning from Noisy Labels with Distillation,” International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
[3] https://wordnet.princeton.edu/
[4] https://bert-as-service.readthedocs.io/en/latest/

Y_Y Jianfeng Zhu, Ying Su, PengCheng Yuan, Bin Zhang, Chenghan Fu, Zhendong Li, Shumin Han

Baidu Vis

yangchaoyue.vis@gmail.com

Our method is based on ResNeXt101……
Entry Description:
Entry 1: average vote of 8 different models (including ResNeXt101 and ResNet101) trained with different sampling strategies
Entry 2: weighted vote of the same 8 models
Entry 3: the 8 models plus 3 retrieval results, combined by weighted vote; same models as in Entries 1 and 2
Entry 4: the 8 models plus 12 other base models, most of which are checkpoints from the middle of training
Entry 5: fusion of Entries 1 to 4

PCI fyy, wzw, ssw, lkm, zr

wzw@pcitech.com

Our method is based on ResNet, including ResNet101 and ResNet152. First, we randomly select one million samples to train a coarse ResNet101 model, and we use this model to clean the samples. Second, we use the cleaned data to train ResNet101 and ResNet152 models separately. Third, we use all samples to fine-tune the ResNet101 and ResNet152 models separately. Finally, we ensemble all of the models to obtain the final result.
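The description does not say how the coarse model cleans samples; one plausible reading is to keep only samples whose crawled label appears in the coarse model's top-k predictions, sketched below. The top-k criterion and the `(indices, images, labels)` loader format are assumptions:

```python
import torch

def clean_with_coarse_model(model, loader, k=5):
    """Keep a sample if its label is among the coarse model's top-k
    predictions; the retained indices define the cleaned training set."""
    kept = []
    model.eval()
    with torch.no_grad():
        for indices, images, labels in loader:
            topk = model(images).topk(k, dim=-1).indices       # (B, k)
            hit = (topk == labels.unsqueeze(-1)).any(dim=-1)   # (B,)
            kept.extend(indices[hit].tolist())
    return kept
```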