Volume : 4, Issue : 4, APR 2020


Hariprabhu V, Kanmani P


Image captioning models can also be effective in electronic commerce-related settings; however, while these models are very accurate, they often rely on expensive computational hardware, making it difficult to apply them in real-time scenarios where their actual applications can be realised. In this paper, we carefully follow some of the core concepts of image captioning and its common approaches, and present a simple encoder-decoder based implementation with significant modifications and optimizations that enable us to run these models on the low-end hardware of hand-held devices. We also compare our results, evaluated using various metrics, with state-of-the-art models, and analyze why and where our model trained on the MSCOCO dataset falls short due to the trade-off between computation speed and caption quality.
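Caption quality in work like this is typically reported with n-gram overlap metrics such as BLEU. As an illustration only (not the authors' evaluation code), a minimal sentence-level BLEU with add-one smoothing and a brevity penalty might be sketched as follows; the function name, smoothing choice, and example sentences are assumptions for demonstration:

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference.

    candidate, reference: lists of tokens (words).
    Uses add-one smoothing so a missing n-gram order does not
    zero out the whole score (real evaluations use multiple
    references and other smoothing schemes).
    """
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n])
                              for i in range(len(candidate) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped n-gram matches via multiset intersection.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precision = (overlap + 1) / (total + 1)  # add-one smoothing
        log_precisions.append(math.log(precision))
    # Brevity penalty discourages overly short captions.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

cand = "a man riding a horse on a beach".split()
ref = "a man is riding a horse on the beach".split()
score = bleu(cand, ref)  # partial overlap gives a score between 0 and 1
```

A perfect match scores 1.0, and the geometric mean over n-gram orders is what makes higher-order fluency matter, which is precisely where a speed/quality trade-off on low-end hardware tends to show up.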


Keywords : Image Recognition Technology, Link Building, Local SEO Optimization.

Article No : 5


  1. A. Aker and R. Gaizauskas. Generating image descriptions using dependency relational patterns. In ACL, 2010.
  2. D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
  3. K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
  4. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
  5. D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, 2013.
  6. A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
  7. R. Gerber and H.-H. Nagel. Knowledge representation for the generation of quantified natural language descriptions of vehicle traffic in image sequences. In ICIP, 1996.
  8. Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik. Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, 2014.
  9. A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  10. S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.
  11. M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47, 2013.
  12. S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
  13. R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
  14. G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
  15. P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
  16. P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. TACL, 2(10), 2014.
  17. S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011.
  18. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv:1405.0312, 2014.
  19. J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
  20. T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
  21. M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. C. Berg, K. Yamaguchi, T. L. Berg, K. Stratos, and H. Daumé III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.