Big Data Mining and Analytics  2020, Vol. 03 Issue (01): 29-40    DOI: 10.26599/BDMA.2019.9020017
A Semi-Supervised Attention Model for Identifying Authentic Sneakers
Yang Yang, Nengjun Zhu, Yifeng Wu, Jian Cao, Dechuan Zhan*, Hui Xiong*
Yang Yang and Dechuan Zhan are with National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. E-mail: yangy@lamda.nju.edu.cn.
Nengjun Zhu and Jian Cao are with Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China. E-mail: zhu_nj@sjtu.edu.cn; cao-jian@sjtu.edu.cn.
Yifeng Wu is with Alibaba Company, Hangzhou 310000, China. E-mail: yixin.wyf@alibaba-inc.com.
Hui Xiong is with Rutgers University, New York, NJ 07102, USA.

Abstract  

To protect consumers and those who manufacture and sell the products they enjoy, it is important to develop convenient tools that help consumers distinguish an authentic product from a counterfeit one. The advancement of deep learning techniques for fine-grained object recognition creates new possibilities for genuine product identification. In this paper, we develop a Semi-Supervised Attention (SSA) model that works in conjunction with a large-scale multiple-source dataset named YSneaker, which consists of sneakers from various brands together with their authentication results, to identify authentic sneakers. Specifically, the SSA model has a self-attention structure over the different images of a labeled sneaker, and a novel prototypical loss is designed to exploit the structure of the unlabeled data. The model draws on the weighted average of the output feature representations, where the weights are determined by an additional shallow neural network; this allows the SSA model to focus on the images of a sneaker that matter most for identification. A unique feature of the SSA model is its ability to take advantage of unlabeled data, which helps to further minimize intra-class variation and yields a more discriminative feature embedding. To validate the model, we collect a large number of labeled and unlabeled sneaker images and perform extensive experimental studies. The results show that YSneaker together with the proposed SSA architecture identifies authentic sneakers with high accuracy.
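The attention pooling the abstract describes can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: a shallow scoring network produces per-image weights that are softmax-normalized over the bag, and the bag-level feature is the weighted average of the instance features (PyTorch assumed; all names and dimensions are illustrative).

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Weighted average of instance features; weights come from a shallow net."""
    def __init__(self, feat_dim=2048, hidden_dim=128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, instance_feats):
        # instance_feats: (num_instances, feat_dim), one row per source image
        scores = self.scorer(instance_feats)          # (num_instances, 1)
        weights = torch.softmax(scores, dim=0)        # normalized over the bag
        bag_feat = (weights * instance_feats).sum(0)  # bag-level representation
        return bag_feat, weights.squeeze(-1)

# Example: a sneaker "bag" of 7 source images, each a 2048-d CNN feature.
bag = torch.randn(7, 2048)
bag_feat, weights = AttentionPooling()(bag)
print(bag_feat.shape, weights.sum())  # torch.Size([2048]); weights sum to 1
```

The softmax over the bag dimension is what lets the model downweight uninformative images (e.g., a blurry box shot) and concentrate on the discriminative ones.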



Key words: sneaker identification; fine-grained classification; multi-instance learning; attention mechanism
Received: 21 May 2019      Published: 13 January 2020
Corresponding Authors: Dechuan Zhan, Hui Xiong
Cite this article:

Yang Yang, Nengjun Zhu, Yifeng Wu, Jian Cao, Dechuan Zhan, Hui Xiong. A Semi-Supervised Attention Model for Identifying Authentic Sneakers. Big Data Mining and Analytics, 2020, 03(01): 29-40.

URL:

http://bigdata.tsinghuajournals.com/10.26599/BDMA.2019.9020017     OR     http://bigdata.tsinghuajournals.com/Y2020/V03/I01/29

Fig. 1 Sampled examples in YSneaker for the same sneaker class. Can you distinguish them? Answer: (a) counterfeit sneakers; (b) authentic sneakers.
Fig. 2 An illustration of YSneaker with multiple source images. Each instance includes: (a) appearance; (b) tag; (c) midsole; (d) insole; (e) box logo; (f) stamp; (g) extra images (optional).
Expert | Authentic (Instance / Image) | Counterfeit (Instance / Image) | Unlabeled (Instance / Image) | All (Instance / Image)
E1 | 43 892 / 394 926 | 47 514 / 362 477 | 25 200 / 237 259 | 116 606 / 994 662
E2 | 45 647 / 417 633 | 36 488 / 274 141 | 37 314 / 328 691 | 119 449 / 1 020 465
E3 | 33 989 / 296 626 | 10 720 / 80 216 | 15 312 / 129 895 | 60 021 / 506 737
E4 | 43 680 / 378 321 | 48 220 / 373 212 | 45 200 / 390 038 | 137 100 / 1 141 571
E5 | 50 240 / 414 378 | 25 000 / 186 829 | 30 920 / 263 286 | 106 160 / 864 493
E6 | 41 440 / 378 149 | 51 120 / 394 387 | 40 120 / 369 165 | 132 680 / 1 141 701
E7 | 45 640 / 445 997 | 45 179 / 352 835 | 46 559 / 411 096 | 137 378 / 1 209 928
All | 304 528 / 2 726 030 | 264 241 / 2 024 097 | 240 625 / 2 129 430 | 809 394 / 6 879 557
Table 1 Dataset description. "Instance" is the number of sneakers; "Image" is the total number of images across all sources.
Fig. 3 Noise data: (a) anomalous data, which can be identified as clothing from the context, and (b) incomplete data missing the stamp image, which confuses the ordering of the different source images.
Fig. 4 Illustration of the proposed SSA. Specifically, each sneaker is denoted as a bag with a variable number of instances. SSA calculates the instance-level representations with a deep network, and then utilizes an additional attention-based network to obtain the final bag-level representation, which is used for semi-supervised fine-grained identification.
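The prototypical loss mentioned in the abstract is not spelled out on this page, so the following is only a hedged sketch of the general idea it names: labeled bag features define per-class prototypes, and every bag, labeled or not, is pulled toward its (nearest) prototype so that intra-class variation shrinks. PyTorch is assumed, and the function and its arguments are hypothetical stand-ins.

```python
import torch

def prototypical_loss(labeled_feats, labels, unlabeled_feats):
    # labeled_feats: (n_l, d); labels: (n_l,) with values 0/1 (counterfeit/authentic)
    # unlabeled_feats: (n_u, d)
    prototypes = torch.stack(
        [labeled_feats[labels == c].mean(0) for c in (0, 1)]
    )                                                  # (2, d) class centers
    # Supervised term: pull each labeled bag toward its own class prototype.
    sup = (labeled_feats - prototypes[labels]).pow(2).sum(1).mean()
    # Unsupervised term: pull each unlabeled bag toward its nearest prototype.
    dists = torch.cdist(unlabeled_feats, prototypes)   # (n_u, 2) distances
    unsup = dists.min(dim=1).values.pow(2).mean()
    return sup + unsup

# Toy usage with random features: 8 labeled bags (4 per class), 16 unlabeled.
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
loss = prototypical_loss(torch.randn(8, 2048), labels, torch.randn(16, 2048))
print(loss.item())
```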
Accuracy (%)
Method | Source1 | Source2 | Source3 | Source4 | Source5 | Source6 | Source7 | Ensemble
CNN | 72.13 | 78.45 | 78.12 | 77.18 | 82.08 | 73.87 | 73.31 | 82.59
MS-CNN | 74.05 | 79.10 | 78.76 | 76.82 | 82.04 | 73.84 | 71.26 | 82.36
MS-Bilinear | 74.15 | 80.24 | 79.84 | 78.05 | 84.14 | 75.06 | 71.46 | 83.57

Precision (%)
Method | Source1 | Source2 | Source3 | Source4 | Source5 | Source6 | Source7 | Ensemble
CNN | 70.10 | 76.24 | 76.21 | 74.92 | 79.07 | 71.08 | 75.04 | 82.48
MS-CNN | 73.06 | 79.78 | 79.15 | 74.15 | 81.57 | 72.67 | 72.28 | 82.98
MS-Bilinear | 73.71 | 79.62 | 79.74 | 76.04 | 83.74 | 74.23 | 72.51 | 83.33

Recall (%)
Method | Source1 | Source2 | Source3 | Source4 | Source5 | Source6 | Source7 | Ensemble
CNN | 79.30 | 83.69 | 82.82 | 77.52 | 84.29 | 75.13 | 88.22 | 81.13
MS-CNN | 77.97 | 78.84 | 79.05 | 78.01 | 82.90 | 71.38 | 90.39 | 83.57
MS-Bilinear | 76.81 | 82.16 | 80.90 | 78.04 | 82.35 | 72.11 | 90.27 | 83.27

F1-Measure (%)
Method | Source1 | Source2 | Source3 | Source4 | Source5 | Source6 | Source7 | Ensemble
CNN | 74.42 | 79.79 | 79.38 | 76.20 | 81.59 | 73.05 | 81.10 | 81.80
MS-CNN | 75.43 | 79.31 | 79.10 | 76.03 | 82.39 | 72.19 | 80.33 | 82.76
MS-Bilinear | 75.23 | 80.87 | 80.32 | 77.03 | 83.04 | 73.16 | 80.42 | 83.30

Table 2 Baseline results on YSneaker-small. All results are evaluated on the test set and reported in percent (%) as Accuracy, Precision, Recall, and F1-Measure. The best results are shown in bold in the original table.
Method | Accuracy | Precision | Recall | F1-Measure
DeepMIML | 86.95 | 81.40 | 93.51 | 87.12
MI-CNN | 85.52 | 82.12 | 88.89 | 85.37
mi-Net | 71.98 | 81.24 | 52.69 | 63.92
MI-Net | 76.91 | 75.64 | 75.23 | 75.43
MI-Net-DS | 74.52 | 79.08 | 62.44 | 69.79
MI-Net-RC | 74.28 | 71.71 | 75.02 | 73.32
MIL-Att | 87.00 | 83.63 | 91.34 | 87.31
Bilinear | 87.01 | 86.94 | 87.48 | 87.21
MA-CNN | 86.33 | 83.19 | 91.50 | 87.14
SSA-Mean | 86.82 | 80.20 | 92.78 | 86.03
SSA-Max | 88.87 | 87.41 | 90.12 | 88.73
SSA | 88.75 | 90.00 | 86.64 | 88.29
Table 3 Comparison on YSneaker-small. All results are evaluated on the test set and reported in percent (%) as Accuracy, Precision, Recall, and F1-Measure. The best results are shown in bold in the original table.
Fig. 5 An illustration of the local attention learning. The attention localization is marked with a blue shadow.
Fig. 6 t-SNE visualisation of the sampled data for (a) DeepMIML, (b) Bilinear, (c) SSA, and (d) SSA-Example, where each point in a patch corresponds to a sneaker. Each instance (sneaker) is a 2048-dimension vector, projected by t-SNE to two dimensions. For SSA, we zoom into two dense cluster margins of authentic (marked in yellow) and counterfeit (marked in green) examples, and sample 5×5 patches to show the raw images. The same examples are also displayed for DeepMIML and Bilinear.
Fig. 7 Objective function value convergence and corresponding classification accuracy vs. number of iterations of SSA.
Fig. 8 Influence of the parameter λ.
[1]   Huang G., Liu Z., van der Maaten L., and Weinberger K. Q., Densely connected convolutional networks, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 2261-2269.
[2]   Lin T., Goyal P., Girshick R. B., He K., and Dollar P., Focal loss for dense object detection, in Proceedings of the International Conference on Computer Vision, Venice, Italy, 2017, pp. 2999-3007.
[3]   He K., Gkioxari G., Dollar P., and Girshick R. B., Mask R-CNN, in Proceedings of the International Conference on Computer Vision, Venice, Italy, 2017, pp. 2980-2988.
[4]   Ronneberger O., Fischer P., and Brox T., U-Net: Convolutional networks for biomedical image segmentation, in Proceedings of Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015, pp. 234-241.
[5]   Lian J., Zhou X., Zhang F., Chen Z., Xie X., and Sun G., xDeepFM: Combining explicit and implicit feature interactions for recommender systems, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, London, UK, 2018, pp. 1754-1763.
[6]   Wang S., He L., Cao B., Lu C., Yu P. S., and Ragin A. B., Structural deep brain network mining, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, 2017, pp. 475-484.
[7]   Xu H., Yu Z., Yang J., Xiong H., and Zhu H., Dynamic talent flow analysis with deep sequence prediction modeling, IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 10, pp. 1926-1939, 2019.
[8]   Kingma D. P. and Welling M., Auto-encoding variational Bayes, in Proceedings of the International Conference on Learning Representations, Banff, Canada, 2014, pp. 34-42.
[9]   Li Y. and Ye J., Learning adversarial networks for semi-supervised text classification via policy gradient, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, London, UK, 2018, pp. 1715-1723.
[10]   Dizaji K. G., Wang X., and Huang H., Semi-supervised generative adversarial network for gene expression inference, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, London, UK, 2018, pp. 1435-1444.
[11]   Lin T., Roy Chowdhury A., and Maji S., Bilinear CNN models for fine-grained visual recognition, in Proceedings of the International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1449-1457.
[12]   Zheng H., Fu J., Mei T., and Luo J., Learning multi-attention convolutional neural network for fine-grained image recognition, in Proceedings of the International Conference on Computer Vision, Venice, Italy, 2017, pp. 5219-5227.
[13]   Wah C., Branson S., Welinder P., Perona P., and Belongie S., The Caltech-UCSD Birds-200-2011 Dataset, Report, California Institute of Technology, CA, USA, 2011.
[14]   Zhang X., Xiong H., Zhou W., Lin W., and Tian Q., Picking deep filter responses for fine-grained image recognition, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 1134-1142.
[15]   Khosla A., Jayadevaprakash N., Yao B., and Li F.-F., Novel dataset for fine-grained image categorization: Stanford dogs, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 2011, p. 1.
[16]   Krause J., Jin H., Yang J., and Li F., Fine-grained recognition without part annotations, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 5546-5555.
[17]   Zhou Z.-H., Abductive learning: Towards bridging machine learning and logical reasoning, Science China Information Sciences, vol. 62, no. 7, pp. 76101:1-76101:3, 2019.
[18]   Ilse M., Tomczak J. M., and Welling M., Attention-based deep multiple instance learning, in Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 2018, pp. 2132-2141.
[19]   Zaheer M., Kottur S., Ravanbakhsh S., Poczos B., Salakhutdinov R. R., and Smola A. J., Deep sets, in Proceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 3394-3404.
[20]   Gao Y., Beijbom O., Zhang N., and Darrell T., Compact bilinear pooling, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 317-326.
[21]   Zhang N., Donahue J., Girshick R. B., and Darrell T., Part-based R-CNNs for fine-grained category detection, in Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 2014, pp. 834-849.
[22]   Perronnin F. and Larlus D., Fisher vectors meet neural networks: A hybrid classification architecture, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 3743-3752.
[23]   Fu J., Zheng H., and Mei T., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 4476-4484.
[24]   Pinheiro P. H. O. and Collobert R., From image-level to pixel-level labeling with convolutional networks, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 1713-1721.
[25]   Feng J. and Zhou Z., Deep MIML network, in Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 2017, pp. 1884-1890.
[26]   Yang Y., Wu Y., Zhan D., Liu Z., and Jiang Y., Complex object classification: A multi-modal multi-instance multi-label deep network with optimal transport, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, London, UK, 2018, pp. 2594-2603.
[27]   Xu K., Ba J., Kiros R., Cho K., Courville A. C., Salakhutdinov R., Zemel R. S., and Bengio Y., Show, attend and tell: Neural image caption generation with visual attention, in Proceedings of the International Conference on Machine Learning, Lille, France, 2015, pp. 2048-2057.
[28]   Li H., Min M. R., Ge Y., and Kadav A., A context-aware attention network for interactive question answering, in Proceedings of the International Conference on Knowledge Discovery and Data Mining, Halifax, Canada, 2017, pp. 927-935.
[29]   Pappas N. and Popescu-Belis A., Explaining the stars: Weighted multiple-instance learning for aspect-based sentiment analysis, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 2014, pp. 455-466.
[30]   He K., Zhang X., Ren S., and Sun J., Deep residual learning for image recognition, in Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770-778.
[31]   Wang X., Yan Y., Tang P., Bai X., and Liu W., Revisiting multiple instance neural networks, Pattern Recognition, vol. 74, pp. 15-24, 2018.
[32]   van der Maaten L. and Hinton G., Visualizing data using t-SNE, Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579-2605, 2008.