||Lele Cheng, Liming Zhao, Dangwei Li, Chenwei Xie, Yun Zheng, Yingya Zhang, Pan Pan
The main idea of our method is to learn with side information provided by the search engine, WordNet, and a BERT model. The semantic knowledge extracted from this side information is used to generate a sampling weight for each image. In the training stage, we adopt a class-balanced sampling strategy to handle the long-tail problem, and within each class we sample images according to their semantically derived weights to handle noisy annotations.
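The report does not give implementation details for this sampling scheme; a minimal pure-Python sketch of class-balanced sampling with per-image semantic weights (all function and variable names are hypothetical, not the authors' code) could look like:

```python
import random

def sample_batch(images_by_class, weights_by_class, batch_size, rng=random):
    """Class-balanced sampling: draw a class uniformly at random, then draw
    an image within that class in proportion to its semantic weight."""
    classes = list(images_by_class)
    batch = []
    for _ in range(batch_size):
        c = rng.choice(classes)  # uniform over classes -> long tail is balanced
        imgs = images_by_class[c]
        # weighted draw within the class suppresses noisy annotations
        batch.append(rng.choices(imgs, weights=weights_by_class[c], k=1)[0])
    return batch
```

Drawing the class first makes rare classes as likely as frequent ones; the per-image weights then bias each class toward its semantically cleaner samples.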
We adopt ResNeXt101 [1] as the baseline network. SE-ResNeXt101 [2], SENet154 [2], and NASNet [3] are also used to improve performance. In the test stage, we fuse multi-crop, multi-scale, and multi-model predictions to produce the final submitted results.
[1] Aggregated Residual Transformations for Deep Neural Networks, CVPR 2017
[2] Squeeze-and-Excitation Networks, CVPR 2018
[3] Learning Transferable Architectures for Scalable Image Recognition, CVPR 2018
Entry 1: fusion of all models with simple averaging
Entry 2: a randomly selected subset of models, averaged
Entry 3: the same models as Entry 2, using rank-based averaging
Entry 4: another randomly selected subset of models, averaged
Entry 5: the same models as Entry 2, with predefined model weights
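To illustrate the difference between Entry 2's plain averaging and Entry 3's rank-based averaging, a small sketch (our own illustration, not the authors' code) of the two fusion rules:

```python
def plain_average(score_lists):
    """Element-wise mean of raw per-class scores across models (Entry 2)."""
    n = len(score_lists[0])
    return [sum(s[i] for s in score_lists) / len(score_lists) for i in range(n)]

def rank_based_average(score_lists):
    """Convert each model's scores to ranks, then average the ranks (Entry 3).
    Using ranks ignores differences in how the models are calibrated."""
    n = len(score_lists[0])
    rank_sums = [0.0] * n
    for scores in score_lists:
        order = sorted(range(n), key=lambda i: scores[i])  # ascending scores
        for rank, i in enumerate(order):
            rank_sums[i] += rank  # highest score receives rank n - 1
    return [s / len(score_lists) for s in rank_sums]
```

Rank-based fusion is more robust when one model outputs systematically larger logits than the others, since only the ordering of classes matters.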
||Huabin Zheng, Litong Feng, Yuming Chen, Weirong Chen, Zhe Huang, Zhanbo Sun, Wayne Zhang
The WebVision 2019 challenge defines a large-scale webly-supervised image classification problem. We solve this problem from four perspectives: strong models, data filtering, training strategy, and ensemble strategy.
(1) Strong models
CNN models with strong visual description abilities are needed to handle this large-scale image classification task with 5000 classes. At the same time, we need to make a good trade-off between training time and accuracy due to limited GPU resources. Typical model structures selected are as follows:
- Network A: a variant of SEResNeXt152 [1], with only 3 SE blocks between the 4 residual stages.
- Network B: a 5-stage variant of Network A.
- Network C: a variant of OctaveResNet152 [2].
- Network D: a variant of Res2Net152 [3].
- Network E: the original SEResNet152 [1].
(2) Data filtering with NLP model
Noisy samples are widespread in WebVision because the webly-crawled images come without human annotations. Images crawled from search engines do, however, carry text in their meta information. We use this text to filter out noisy images with BERT embeddings [4]: the WordNet description of each class is used to compute a label embedding, and the meta information of each sample is used to compute a document embedding. Noisy samples are filtered out by dropping samples whose document embedding does not match their label embedding under cosine similarity. By setting different thresholds and filtering percentages, we produce partial training sets of different sizes (8.51M, 5.31M, and 3.51M images) from the original full training set. These partial training sets help speed up training from scratch and improve the accuracy of a model through fine-tuning.
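A minimal sketch of this cosine-similarity filter, assuming the label and document embeddings have already been computed by BERT (function names are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def filter_class_samples(doc_embeddings, label_embedding, threshold):
    """Keep the indices of samples whose meta-text embedding is close enough
    to the class's WordNet-derived label embedding."""
    return [i for i, d in enumerate(doc_embeddings)
            if cosine(d, label_embedding) >= threshold]
```

Varying `threshold` (or keeping a fixed top percentage per class instead) yields training sets of different sizes, as in the 8.51M/5.31M/3.51M splits above.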
(3) Training strategy
Due to limited GPU resources, we introduce expanded input sizes, de-noising, and model diversity through fine-tuning rather than training every variant from scratch. The training strategy is detailed as follows:
- Phase 1: We train the networks from scratch using roughly the same training settings as [5].
- Phase 2: We fine-tune the models with expanded input image sizes and GeM pooling [6]. For de-noising, we use a negative log-likelihood loss to self-supervise the models without ground-truth labels.
- Phase 3: Different partial training sets are selected to fine-tune the models in pursuit of diversity. We fine-tune only the final fully-connected layers and freeze all other layers. Ground-truth labels are used again. Cross-Entropy (CE) loss and Binary Cross-Entropy (BCE) loss are used separately for diversity.
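Phase 2 relies on GeM pooling [6]. As a reference, generalized-mean pooling over one channel's spatial activations follows the standard formula (this sketch is our illustration, not the authors' implementation):

```python
def gem_pool(activations, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of one channel's spatial activations:
    (mean(max(x, eps) ** p)) ** (1 / p).  p = 1 recovers average pooling;
    as p grows the result approaches max pooling."""
    clamped = [max(x, eps) for x in activations]  # clamp for numerical safety
    return (sum(x ** p for x in clamped) / len(clamped)) ** (1.0 / p)
```

The exponent `p` can be fixed (often around 3) or learned; larger `p` emphasizes the strongest activations, which helps when the object occupies a small part of the image.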
(4) Ensemble strategy
We produce various single models, each tested with multiple crops. Finally, the predictions of the different models are combined with a weighted sum to produce the final predictions.
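The multi-crop testing step can be sketched as follows (helper names are hypothetical; `predict` stands for a trained model's forward pass returning class probabilities):

```python
def multi_crop_predict(predict, crops):
    """Average a model's class probabilities over several crops of one image.
    `predict` maps a crop to a probability vector; `crops` is the list of
    crops (e.g. corners + center, possibly at several scales)."""
    preds = [predict(crop) for crop in crops]
    n_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]
```

The per-model outputs of `multi_crop_predict` are then combined across models with the weighted sum described above.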
[1] Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-Excitation Networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[2] Chen, Yunpeng, et al. "Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution." arXiv preprint arXiv:1904.05049 (2019).
[3] Gao, Shang-Hua, et al. "Res2Net: A New Multi-scale Backbone Architecture." arXiv preprint arXiv:1904.01169 (2019).
[4] Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint arXiv:1810.04805 (2018).
[5] He, Tong, et al. "Bag of Tricks for Image Classification with Convolutional Neural Networks." arXiv preprint arXiv:1812.01187 (2018).
[6] Berman, Maxim, et al. "MultiGrain: A Unified Image Embedding for Classes and Instances." arXiv preprint arXiv:1902.05509 (2019).
5. Entry Description:
Entry 1: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list with averaging
Entry 2: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list with weighted averaging
Entry 3: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list, C-8.5M-list, D-8.5M-list, E-8.5M-list with averaging
Entry 4: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list, C-8.5M-list, D-8.5M-list, E-8.5M-list with weighted averaging
Entry 5: fusion of A-full-list, A-8.5M-list, B-full-list, B-3.5M-list, C-8.5M-list, D-8.5M-list, E-8.5M-list, A-full-list-BCE, B-full-list-BCE with weighted averaging
||Lin Chen1, Zhikun Lin2, Anyin Song3, Yaxiong Chi3, Chenhui Qiu4, Shouping Shan4, Lixin Duan2
2University of Electronic Science and Technology of China
Our work is implemented using the Huawei MoXing framework [1], which slightly improves accuracy while training much faster. As for the algorithms, the main idea is to leverage the meta information of each image and from the search engine to clean up the data, use knowledge distillation to handle noisy labels, and learn an ensemble model with a heuristic algorithm. The details are as follows:
1. We cluster the images based on density and the confusion matrix to clean up the data, and then assign a different weight to each training image;
2. For each class, we compute the weight of each sample based on its distance to the images ranked highest by the search engine;
3. We clean up the data by matching image descriptions/tags with WordNet glosses based on BERT and TF-IDF;
4. We use knowledge distillation to handle noisy labels, following [2];
5. We use different types of state-of-the-art network architectures, including ResNet, ResNeXt, DenseNet, etc.;
6. During testing, we apply multi-scale and multi-crop augmentation to each test image;
7. We also ensemble different models using different strategies, including 1) averaging logits; and 2) learning the combination weights with a heuristic algorithm.
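Step 4 follows [2], where the training target becomes a convex combination of the noisy web label and a teacher model's soft prediction. A one-line sketch (the mixing weight λ and the names are illustrative):

```python
def distill_targets(noisy_onehot, teacher_probs, lam=0.7):
    """Pseudo label for distillation from noisy labels:
    y' = lam * y_noisy + (1 - lam) * y_teacher.
    A wrong web label is softened by the teacher's prediction, so the
    student is not forced to fit the noise exactly."""
    return [lam * y + (1.0 - lam) * t
            for y, t in zip(noisy_onehot, teacher_probs)]
```

The student network is then trained against these soft targets instead of the raw web labels.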
5. Entry Description:
Entry 1: single-model distillation + model ensemble (avg logits)
Entry 2: fusion-model distillation + model ensemble (avg logits)
Entry 3: single-model distillation + model ensemble (heuristic algorithm)
Entry 4: fusion-model distillation + model ensemble (auto drop)
Entry 5: fusion-model distillation + model ensemble (drop + heuristic algorithm)
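The heuristic algorithm for combination weights is not specified in the report; one simple instantiation (entirely our assumption, not the authors' method) is a random search over fusion weights evaluated on a held-out set:

```python
import random

def search_ensemble_weights(model_probs, labels, trials=200, seed=0):
    """Random-search heuristic for ensemble weights: sample normalized weight
    vectors and keep the one maximizing top-1 accuracy on a held-out set.
    model_probs[m][i] is model m's probability vector for sample i."""
    rng = random.Random(seed)
    n_models = len(model_probs)
    best_w, best_acc = [1.0 / n_models] * n_models, -1.0
    for _ in range(trials):
        w = [rng.random() for _ in range(n_models)]
        total = sum(w)
        w = [x / total for x in w]  # normalize so the weights sum to 1
        correct = 0
        for i, y in enumerate(labels):
            n_classes = len(model_probs[0][i])
            fused = [sum(w[m] * model_probs[m][i][c] for m in range(n_models))
                     for c in range(n_classes)]
            if fused.index(max(fused)) == y:
                correct += 1
        acc = correct / len(labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

More elaborate heuristics (evolutionary search, coordinate ascent, or the "drop" strategy in Entries 4 and 5 that removes unhelpful models) follow the same evaluate-on-held-out-set loop.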
[1] What Is MoXing? https://support.huaweicloud.com/en-us/dls_faq/en-us_topic_0105152979.html
[2] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jia Li, and Jiebo Luo, "Learning from Noisy Labels with Distillation," International Conference on Computer Vision (ICCV), Venice, Italy, October 2017.
||Jianfeng Zhu, Ying Su, PengCheng Yuan, Bin Zhang, Chenghan Fu, Zhendong Li, Shumin Han
Our method is based on ResNeXt101…
5. Entry Description:
Entry 1: average vote over 8 different models, including ResNeXt101 and ResNet101, trained with different sampling strategies
Entry 2: weighted vote over 8 different models, including ResNeXt101 and ResNet101, trained with different sampling strategies
Entry 3: the 8 models plus 3 retrieval results, weighted vote; same models as in Entries 1 and 2
Entry 4: the 8 models plus 12 other base models, most of which are checkpoints from the middle of training
Entry 5: fusion of Entries 1 to 4