Challenge Results

WebVision Image Classification Task

Rank Team name Run1 Run2 Run3 Run4 Run5
1 Malong AI Research 0.9358 0.9467 0.9478 0.9478 0.9470
2 SHTU_SIST 0.9223 0.9225 0.9218 0.9219 0.9216
3 HG-AI 0.9189 0.9152 0.9152 0.9189 0.9189
4 VISTA 0.8979 0.9005 0.8980 0.8992 0.8980
5 LZ_NES 0.8853 0.8758 0.8723 0.8504 0.8504
6 CRCV 0.8707 0.8717 0.8701 0.8712 0.8721
7 Chahrazad 0.8705 0.8705 0.8705 0.8705 0.8705
8 Gombru (CVC and Eurecat) 0.8475 0.8374 0.8586 0.8586 0.8586

Pascal VOC Transfer Learning Task

Rank Team name mAP
1 Malong AI Research 0.90

Team Information

Team name Team member Method description
Malong AI Research Sheng Guo, Weilin Huang, Chenfan Zhuang, Dengke Dong, Haozhi Zhang, Matthew R. Scott, Dinglong Huang

Malong Technologies Co., Ltd.
We propose a semi-supervised learning method to address the problem of training large-scale deep neural networks with noisy and unbalanced data. First, we use clustering algorithms to divide the training data into two parts: clean data and noisy data. We then train a deep network model on the clean data. Afterwards, we use all of the data (both the clean and the noisy parts) to continue training the same network, starting from the first (clean-data) model. Regarding the network, notably, we use two different kernel sizes (5 and 9) in the initial convolutional layer. Regarding the training, notably, we apply data balancing to the clean data and designed a new adaptive learning-rate drop that is applied differently depending on the data part.
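As a rough illustration only, the sketch below shows one plausible way to split a class's samples into "clean" and "noisy" subsets by clustering image features; it is an assumption for illustration, not the team's actual algorithm, and the feature extractor and the two training stages are assumed to exist elsewhere.

```python
# Hypothetical sketch of a clean/noisy split by per-class feature clustering.
import numpy as np
from sklearn.cluster import KMeans

def split_clean_noisy(features, n_clusters=2):
    """Cluster one class's feature vectors; treat the tightest cluster as 'clean'."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    assignments = km.fit_predict(features)
    # Distance of each sample to its own cluster centre.
    dists = np.linalg.norm(features - km.cluster_centers_[assignments], axis=1)
    # Assume the tightest cluster (smallest mean distance) is the clean one.
    mean_dist = [dists[assignments == c].mean() for c in range(n_clusters)]
    clean_cluster = int(np.argmin(mean_dist))
    clean_idx = np.where(assignments == clean_cluster)[0]
    noisy_idx = np.where(assignments != clean_cluster)[0]
    return clean_idx, noisy_idx

# Usage idea: run per class, train first on the clean indices, then continue
# training the same network on clean + noisy indices with data balancing.
```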
SHTU_SIST Ziheng Zhang, Jia Zheng, Shenghua Gao, Yi Ma

Shanghaitech University
Our method is based on Inception-ResNet-v2. Five base models are trained using popular data augmentation techniques. The original noisy dataset is cleaned by an ensemble of base models trained directly on it, and the cleaned dataset is then used to refine all base models. A trainable ensemble layer is used to find the best ensemble weights for each base model and class. The final result is obtained using the standard 144-crop evaluation.
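A minimal PyTorch sketch of a trainable ensemble layer of the kind described above (one learned weight per base model and class, combining the base models' class probabilities); this is an assumption for illustration, not the team's code.

```python
import torch
import torch.nn as nn

class EnsembleLayer(nn.Module):
    def __init__(self, num_models, num_classes):
        super().__init__()
        # One weight per (base model, class) pair, initialised to uniform.
        self.weights = nn.Parameter(torch.zeros(num_models, num_classes))

    def forward(self, probs):
        # probs: (batch, num_models, num_classes) base-model probabilities.
        w = torch.softmax(self.weights, dim=0)      # normalise across models per class
        return (probs * w.unsqueeze(0)).sum(dim=1)  # weighted sum over models

# Usage idea: stack the five base models' softmax outputs along dim=1 and
# train only this layer (e.g. with cross-entropy) on held-out data.
```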
HG-AI Bei Hu

Random
Our method is based on the GoogLeNet, Inception-ResNet-v2, and ResNet-50 networks. We had one 4-GPU server for this challenge, and it is very difficult to train on such a large-scale dataset, so we train these models with some new techniques. Because of memory limits, we use different batch sizes for different networks (256 for ResNet-50 and GoogLeNet, 128 for Inception-ResNet-v2). We first randomly select 30% of the data from each class to train a basic model. Since there is a serious class imbalance in this dataset, we balance the training data. Then we fine-tune this basic model with all of the data. We show that this method can effectively suppress overfitting.
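One common way to implement the class balancing mentioned above is inverse-frequency sampling; the PyTorch sketch below is an assumption for illustration, not the team's implementation.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels):
    """labels: array of integer class ids, one per training image."""
    labels = np.asarray(labels)
    class_counts = np.bincount(labels)
    # Inverse-frequency weights so every class is sampled roughly equally often.
    sample_weights = 1.0 / class_counts[labels]
    return WeightedRandomSampler(
        weights=torch.as_tensor(sample_weights, dtype=torch.double),
        num_samples=len(labels),
        replacement=True)

# Pass the sampler to DataLoader(dataset, batch_size=256, sampler=sampler).
```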
VISTA Yuncheng Li, Jianchao Yang

University of Rochester and Snap Inc.
Our submissions are all based on a randomly initialized Inception-v3. The training pipeline is adopted directly from the TensorFlow Slim repository. We tried a few methods to combat label noise, including label smoothing, different ways to bootstrap, our label distillation (with the model trained from Google data as the base net), different kinds of CRFs (hoping to use the label relations), different text classifiers (hoping to learn something from the metadata), and different ways of denoising training examples (pretrain a model and remove the low-confidence examples). Surprisingly, the simplest method, described below, works best:
1. Train Inception-v3 from scratch with learning rate 0.01 (with the default learning-rate policy).
2. Train another round using the model from step 1, but with base learning rate 0.001.
3. Train another round using the model from step 2, but with base learning rate 0.0001.
4. Train Inception-v1 from scratch with learning rate 0.01, and use that model to subset the training set (using the different ratios specified in the entry descriptions); a sketch of this subsetting step follows the list.
5. Use the subset data to fine-tune the model from step 3.
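The sketch below illustrates the confidence-based subsetting of step 4: keep only the examples whose noisy label receives high softmax probability from a pre-trained model. The team's pipeline is TensorFlow Slim; this is a hypothetical PyTorch rendering, and `model` and the index-yielding `loader` are assumptions.

```python
import torch

@torch.no_grad()
def select_confident_subset(model, loader, keep_ratio=0.5, device="cuda"):
    """Score each example by the pre-trained model's probability for its
    noisy label, then keep the top keep_ratio fraction."""
    model.eval().to(device)
    scores, indices = [], []
    for idx, images, labels in loader:   # hypothetical loader yielding (index, image, label)
        probs = torch.softmax(model(images.to(device)), dim=1).cpu()
        conf = probs[torch.arange(len(labels)), labels]   # confidence of the given label
        scores.append(conf)
        indices.append(idx)
    scores, indices = torch.cat(scores), torch.cat(indices)
    k = int(keep_ratio * len(scores))
    keep = indices[scores.argsort(descending=True)[:k]]
    return keep.tolist()   # example indices kept for the fine-tuning in step 5
```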
LZ_NES Xihui Liu, Yuhang Zhao

Tsinghua University
Our method is based on the ResNet-101 architecture and NLP tools. Our main work was producing different train_list.txt files from the noisy metadata and trying different loss functions, then comparing their performance. First, we use NLP tools (mainly word2vec and gensim) to produce list_v1 and list_v2 as our training files; v1 and v2 differ in the similarity threshold used when producing the ImageNet-1k labels. Second, we use the query list (in which the queries are used as labels) to train the baseline, then continue training from the baseline with the other lists, giving different weights to the query list and the v1/v2 lists.
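A rough gensim sketch of the kind of similarity-thresholded label assignment described above; this is an assumption about the list-building step, not the team's code, and the vector file path, function names, and threshold value are hypothetical.

```python
from gensim.models import KeyedVectors

# Hypothetical pre-trained word2vec vectors.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                            binary=True)

def assign_label(query_word, class_names, threshold=0.6):
    """Return the best-matching class for a query, or None if the similarity
    is below the threshold (a stricter threshold gives a smaller, cleaner list)."""
    best_class, best_sim = None, -1.0
    for name in class_names:
        if query_word in vectors and name in vectors:
            sim = vectors.similarity(query_word, name)
            if sim > best_sim:
                best_class, best_sim = name, sim
    return best_class if best_sim >= threshold else None
```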
CRCV Yifan Ding, Muhammad Abdullah Jamal, Boqing Gong

CRCV, University of Central Florida
Baseline: We use the Inception-ResNet-v2 network as our baseline, training it from scratch on the WebVision dataset.

Residual loss: We train the Inception-ResNet-v2 network from scratch for 980k iterations and save the model. We then fine-tune this model with our residual loss. Our method assumes that the weight of the new classifier is w + u, given that the weight of our trained model's classifier is w. To fine-tune the model, we first save the predictions of the pre-trained model and call them soft labels. Our new loss becomes L = CrossEntropy(y, softmax(w + u)) - CrossEntropy(y, softmax(w)) + (lambda / 2) * L2Regularization(u), where the second term represents the cross-entropy loss between the soft labels and the ground truth.
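One possible reading of the residual loss above, written as a PyTorch sketch; this is an interpretation for illustration, not the team's code, and `pretrained_fc`, `features`, and `lam` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualClassifier(nn.Module):
    """Classifier whose weight is w + u: w is frozen, only the residual u is learned."""
    def __init__(self, pretrained_fc, lam=1e-3):
        super().__init__()
        self.register_buffer("w", pretrained_fc.weight.detach().clone())
        self.register_buffer("b", pretrained_fc.bias.detach().clone())
        self.u = nn.Parameter(torch.zeros_like(self.w))   # learned residual
        self.lam = lam

    def forward(self, features, labels):
        logits_new = F.linear(features, self.w + self.u, self.b)  # new classifier (w + u)
        logits_old = F.linear(features, self.w, self.b)           # frozen classifier (w)
        return (F.cross_entropy(logits_new, labels)
                - F.cross_entropy(logits_old, labels)
                + 0.5 * self.lam * self.u.pow(2).sum())
```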

Learning Without Forgetting + purifying: We first fine-tune the model, starting from the checkpoint trained from scratch for 980k iterations, using the Google data only. We then use this model to compute and store the softmax output for each image, which serves as a soft label. After that, we add three fully connected layers and a second softmax layer to the Inception-ResNet-v2 network and load all weights from the fine-tuned model into the modified network. Finally, we fine-tune the new network on the Flickr data. The whole process is known as learning without forgetting: while the modified network learns from the Flickr data, it does not forget the knowledge learned from the Google data, which is cleaner and gives better performance than training on the mixture of the Google and Flickr data. Also, in order to get a cleaner training set, we use our best checkpoint to predict over all training images and remove those that have the lowest prediction confidence and, at the same time, a prediction that differs from the original category.
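A minimal sketch of a learning-without-forgetting objective of the kind described above: a standard cross-entropy on the Flickr labels plus a distillation term that keeps one head close to the stored soft labels from the Google-fine-tuned model. This is an assumption for illustration, not the team's implementation, and the temperature and weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits, labels, soft_labels, T=2.0, alpha=0.5):
    """new_logits / old_logits: outputs of the new and old softmax heads;
    soft_labels: stored probabilities from the Google-fine-tuned model."""
    hard = F.cross_entropy(new_logits, labels)
    # Soft cross-entropy (distillation) against the stored soft labels at temperature T.
    log_p = F.log_softmax(old_logits / T, dim=1)
    soft = -(soft_labels * log_p).sum(dim=1).mean()
    return alpha * hard + (1.0 - alpha) * soft
```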
Chahrazad Chahrazad Essalim

Image Lab, Computer Science Engineering, Chung-Ang University, Seoul, South Korea
Our method is based on the DenseNet-BC architecture [1]; we experimented with the configuration {L = 121, k = 32}, with L the number of layers and k the growth rate. Learning parameters are similar to those used to train ImageNet in [1]. Because of GPU memory constraints, the model was trained with a mini-batch size of 128 for 100 epochs, with the initial learning rate set to 0.1 and divided by 10 after every 30 epochs. For data preprocessing, a random resized crop of 224 and a random horizontal flip were used as data augmentation for training images; for validation and test images, a resize to 256 and a center crop of 224 were used. Training, validation, and test images were normalized using the dataset mean and standard deviation. The best validation accuracy is 68.692 (top-1) and 87.928 (top-5). (The best accuracy was obtained at the last epoch, 100; accuracy might increase if the model were trained longer, but due to deadline constraints, training lasted only 100 epochs.)
References:
[1] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely Connected Convolutional Networks. arXiv preprint arXiv:1608.06993, 2016.
[2] Code was largely inspired by the PyTorch ImageNet example: https://github.com/pytorch/examples/tree/master/imagenet
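A minimal PyTorch sketch of the training configuration described in this entry (DenseNet-121, batch size 128, learning rate 0.1 divided by 10 every 30 epochs, the stated augmentation), in the spirit of the referenced PyTorch ImageNet example [2]. The normalization statistics shown are the common ImageNet values used as placeholders; the entry states that dataset-specific statistics were used.

```python
import torch
import torchvision
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],   # placeholder stats
                                 std=[0.229, 0.224, 0.225])    # (entry used dataset stats)
train_tf = transforms.Compose([transforms.RandomResizedCrop(224),
                               transforms.RandomHorizontalFlip(),
                               transforms.ToTensor(), normalize])
eval_tf = transforms.Compose([transforms.Resize(256),
                              transforms.CenterCrop(224),
                              transforms.ToTensor(), normalize])

model = torchvision.models.densenet121()   # DenseNet-BC with L = 121, k = 32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 after every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```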
Gombru (CVC and Eurecat) Raúl Gómez Brub

Computer Vision Center - Universitat Autònoma de Barcelona, and Eurecat
An LDA model has been trained using the images' associated text (title, description, and tags). Then two different strategies have been explored:
a) The mean LDA topic distribution of each class has been computed. Then the similarity of each image's associated text to the mean topic distribution of its class has been computed, and that similarity has been used to weight the sample's label. A CNN (GoogLeNet) has been trained using a softmax loss in which the contribution of each sample to the loss is weighted (a sketch of this weighted loss is given after this list).
b) A CNN (GoogLeNet) with two heads has been trained: one classification head with a softmax loss and one regression head with a sigmoid cross-entropy loss. The image labels have been used as ground truth for the classification head, and the LDA topic distribution given by the text associated with each image has been used as the regression head's ground truth.
In both methods an aggressive online data augmentation strategy has been used: mirroring, random crops, rotation, rescaling, color casting, and saturation and value jittering.
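The sketch below illustrates the per-sample weighted softmax loss of strategy (a); it is a hypothetical PyTorch rendering (the entry names GoogLeNet but not a framework), and the cosine-similarity weighting is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_softmax_loss(logits, labels, text_topics, class_mean_topics):
    """text_topics: (batch, n_topics) LDA distribution of each image's text;
    class_mean_topics: (n_classes, n_topics) mean topic distribution per class."""
    # Similarity between each sample's text and its class's mean distribution,
    # used to weight that sample's contribution to the loss.
    weights = F.cosine_similarity(text_topics,
                                  class_mean_topics[labels], dim=1).clamp(min=0)
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (weights * per_sample).mean()
```

Strategy (b) would instead add a second head and a term such as F.binary_cross_entropy_with_logits against the text's LDA topic distribution, alongside the standard classification loss.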