A Mask-RCNN based object detection and captioning framework for industrial videos
Manasi Namjoshi and Khushboo Khurana
Abstract
Analyzing surveillance videos manually is tedious and burdensome. Automating the analysis of surveillance videos, specifically industrial videos, could be very useful for productivity analysis, assessing the availability of raw materials and finished goods, fault detection, report generation, etc. To accomplish this task, we propose a video captioning and reporting method. Video captioning generates natural-language summaries that describe the video, based on the events and objects present in it. The method presented in this paper constructs a captioned video summary comprising frames and their descriptions. First, frames are extracted from the video by uniform sampling, which reduces the task of video captioning to image captioning. Then, a Mask Region-based Convolutional Neural Network (Mask-RCNN) is used to detect objects such as raw materials, products, and humans in the sampled video frames. Next, a template-based sentence generation method is applied to obtain the image captions. Finally, a report is generated outlining the products present and production details such as how long each product is present, the number of products detected, and the presence of an operator at the workstation. This framework can greatly help in bookkeeping, day-wise work analysis, keeping track of employees in a labor-intensive industry or factory, and remote monitoring, thereby reducing the human effort of video analysis. On the object classes of the created dataset, we obtained an average confidence score of 0.8975 and an average accuracy of 95.62%. Moreover, because the captions are template-based, the generated sentences are grammatically correct and meaningful.
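The pipeline described in the abstract (uniform sampling, object detection, template-based captioning, report generation) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The detection step is assumed to have already run, so each sampled frame is represented only by a list of class labels. The label names ("operator", "product"), the sentence template, the report fields, and all function names are hypothetical choices made for illustration.

```python
from collections import Counter


def uniform_sample_indices(total_frames, step):
    """Return every `step`-th frame index, i.e. uniform sampling."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, total_frames, step))


def caption_from_labels(labels):
    """Fill a fixed sentence template from the labels detected in one frame.

    The template wording is an assumption, not the one used in the paper.
    """
    if not labels:
        return "No objects are detected in the frame."
    counts = Counter(labels)
    parts = []
    for label, n in sorted(counts.items()):
        noun = label if n == 1 else label + "s"  # naive pluralisation
        parts.append(f"{n} {noun}")
    if len(parts) == 1:
        listing = parts[0]
    else:
        listing = ", ".join(parts[:-1]) + " and " + parts[-1]
    verb = "is" if sum(counts.values()) == 1 else "are"
    return f"{listing} {verb} present in the frame."


def build_report(frame_labels, seconds_per_sample):
    """Summarise, per class, how often and for roughly how long it appears.

    frame_labels: one list of detected labels per sampled frame.
    seconds_per_sample: video time between two consecutive sampled frames.
    """
    report = {}
    for labels in frame_labels:
        for label, n in Counter(labels).items():
            entry = report.setdefault(
                label, {"frames_present": 0, "max_count": 0}
            )
            entry["frames_present"] += 1
            entry["max_count"] = max(entry["max_count"], n)
    for entry in report.values():
        # Duration is approximated from the number of sampled frames.
        entry["approx_seconds_present"] = (
            entry["frames_present"] * seconds_per_sample
        )
    return report
```

In a full system, the label lists would come from a Mask-RCNN detector applied to the frames selected by `uniform_sample_indices`. Because every caption is produced from a fixed template, its grammar does not depend on the detector output, only its content does.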
Keywords
Object detection, Mask-RCNN, Video captioning, Video analysis, Image captioning.
Cite this article
Namjoshi M, Khurana K. A Mask-RCNN based object detection and captioning framework for industrial videos. International Journal of Advanced Technology and Engineering Exploration. 2021; 8(84):1466-1478. DOI:10.19101/IJATEE.2021.874394