Design and Implementation of a Video/Voice Process System for Recognizing Vehicle Parts Based on Artificial Intelligence
Abstract
1. Introduction
2. Related Research
2.1. YOLO Algorithm
- Fast detection is achieved with a simple processing pipeline.
- The mAP value is approximately twice that of other conventional real-time detection systems.
- YOLOv3 is expected to improve the service response speed of the platform through real-time object recognition.
- Because conventional artificial intelligence object recognition algorithms were developed in Python, their source code may be exposed to security risks.
- Because the YOLO source code was released in C, it can be secured through compilation.
- Because YOLOv3 is also implemented in C, it offers high development accessibility and easy maintenance.
- Because it is easy to improve performance and extend the recognized object types through dataset training, it adapts readily to changes in the development platform environment.
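The mAP figures cited for YOLO rest on matching predicted and ground-truth boxes by intersection over union (IoU). A minimal, self-contained sketch of that matching criterion (the function name and the corner-coordinate box convention are our illustrative choices, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection typically counts as a true positive for mAP-50 when its IoU with a ground-truth box is at least 0.5.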
2.2. Application Service Using YOLOv3
2.3. MSSIM Algorithm
3. Proposed System Design
3.1. Video Process System Design
3.2. Voice Process System Design
3.3. Video and Voice Processing System Administrator Service System Design
4. Implementation of the Proposed System
4.1. Implementation of the Video Processing System
4.1.1. Data Training for Vehicle Part Recognition
4.1.2. Vehicle Part Recognition Results of Transmission Module Development
- The module was developed to provide a RESTful API by structuring three types of output files for both vehicle and vehicle-part recognition.
- File name_yolo.json: The user receives the result as a JSON file containing information such as the recognized vehicle parts, their locations, and timestamps.
- File name_yolo.mp4: A copy of the original video overlaid with the recognized vehicle parts, so that the recognition results can be checked visually.
- File name_yolo.progress: Because the system runs asynchronously in response to web service requests, the user must be able to check the progress. The module therefore continuously updates a file that records the current progress together with error and exception messages.
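Because a client only sees the three files above, progress monitoring reduces to polling the `.progress` file. A hedged Python sketch, assuming the file holds a single status line; the paper does not specify the exact file format, so the names and the status conventions here are illustrative:

```python
import pathlib
import time

def wait_for_analysis(progress_path, poll_s=1.0, timeout_s=60.0):
    """Poll a <name>_yolo.progress file until completion or an error message.

    Assumes (illustratively) that the file contains one line that either
    starts with a percentage ("100" on completion) or contains an error text.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        p = pathlib.Path(progress_path)
        if p.exists():
            status = p.read_text().strip()
            if status.startswith("100") or "error" in status.lower():
                return status
        time.sleep(poll_s)
    raise TimeoutError(progress_path)
```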
4.1.3. Proposal and Development for Improving Performance of Vehicle Part Recognition Module
4.2. Voice Process System Implementation
4.2.1. Voice File Extraction and Division Module Development
4.2.2. TEXT Conversion Module Development
4.2.3. Vehicle Terminology and Profanity Extraction Module Development
4.3. Video/Voice Process System Manager Service System Implementation
5. Proposed System Evaluation
5.1. Evaluation of System Performance Indicator Goals and Achievements
- Object recognition accuracy: Upload "vehicle" and "non-vehicle" images to the server and check whether the objects in each image are recognized correctly.
- Video stabilization rate: Check whether the artificial intelligence correctly recognizes the recorded content despite camera shake during video recording.
- Context awareness rate: Measure the speech recognition results produced by the artificial intelligence server for the recorded voice.
- Profanity slang detection and filtering accuracy: Random users perform a function test and complete a questionnaire to measure how well content containing profanity and slang is detected and filtered from the recorded voice.
- Content upload accuracy: Test the accuracy of uploading the edited video to the server.
5.2. Video Process System Performance Evaluation
5.2.1. Video Process System Performance Evaluation Environment
5.2.2. Video Process System Speed Evaluation
5.2.3. Evaluation of Graphics Card Utilization Rate of Video Process System
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Pathak, A.R.; Pandey, M.; Rautaray, S. Application of deep learning for object detection. Procedia Comput. Sci. 2018, 132, 1706–1717. [Google Scholar] [CrossRef]
- Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Henry, C.; Ahn, S.Y.; Lee, S.W. Multinational License Plate Recognition Using Generalized Character Sequence Detection. IEEE Access 2020, 8, 35185–35199. [Google Scholar] [CrossRef]
- Kong, J. Analysis of Used Car E-Commerce Platform. In Proceedings of the 7th International Conference on Education and Management (ICEM 2017), Naples, Italy, 20–22 September 2017. [Google Scholar]
- Englmaier, F.; Schmöller, A.; Stowasser, T. Price Discontinuities in an Online Used Car Market. 2013. Available online: https://www.econstor.eu/handle/10419/79982 (accessed on 11 November 2020).
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Honolulu, HI, USA, 11–15 December 2011. [Google Scholar]
- Graves, A.; Mohamed, A.R.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649. [Google Scholar]
- Alom, M.Z.; Taha, T.M.; Yakopcic, C.; Westberg, S.; Sidike, P.; Nasrin, M.S.; Van Esesn, B.C.; Awwal, A.A.S.; Asari, V.K. The history began from alexnet: A comprehensive survey on deep learning approaches. arXiv 2018, arXiv:1803.01164. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef] [Green Version]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv 2016, arXiv:1602.07261. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2012, 60, 84–90. [Google Scholar] [CrossRef]
- Abdusalomov, A.; Whangbo, T.K. Detection and Removal of Moving Object Shadows Using Geometry and Color Information for Indoor Video Streams. Appl. Sci. 2019, 9, 5165. [Google Scholar] [CrossRef] [Green Version]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [Green Version]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Redmon, J. Darknet: Open Source Neural Networks in C. Available online: http://pjreddie.com/darknet/ (accessed on 11 November 2020).
- Ozturk, T.; Talo, M.; Yildirim, E.A.; Baloglu, U.B.; Yildirim, O.; Acharya, U.R. Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 2020, 121, 103792. [Google Scholar] [CrossRef] [PubMed]
- Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
- Xie, L.; Ahmad, T.; Jin, L.; Liu, Y.; Zhang, S. A new CNN-based method for multi-directional car license plate detection. IEEE Trans. Intell. Transp. Syst. 2018, 19, 507–517. [Google Scholar] [CrossRef]
- Kim, K.J.; Kim, P.K.; Chung, Y.S.; Choi, D.H. Multi-scale detector for accurate vehicle detection in traffic surveillance data. IEEE Access 2019, 7, 78311–78319. [Google Scholar] [CrossRef]
- Wang, H.; Lou, X.; Cai, Y.; Li, Y.; Chen, L. Real-time vehicle detection algorithm based on vision and lidar point cloud fusion. J. Sens. 2019, 2019. [Google Scholar] [CrossRef] [Green Version]
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [Green Version]
- Liu, B.; Wu, M.; Tao, M.; Wang, Q.; He, L.; Shen, G.; Chen, K.; Yan, J. Video Content Analysis for Compliance Audit in Finance and Security Industry. IEEE Access 2020, 8, 117888–117899. [Google Scholar] [CrossRef]
- Chen, M.J.; Bovik, A.C. Fast structural similarity index algorithm. J. Real-Time Image Process. 2011, 6, 281–287. [Google Scholar] [CrossRef]
- Zhang, T.; Xie, J.; Zhou, X.; Choi, C. The Effects of Depth of Field on Subjective Evaluation of Aesthetic Appeal and Image Quality of Photographs. IEEE Access 2020, 8, 13467–13475. [Google Scholar] [CrossRef]
- Gupta, S.K.; Soong, F.K.P. Speech Recognition. U.S. Patent 6,138,095, 3 September 1998. [Google Scholar]
- Addison, E.R.; Wilson, H.D.; Marple, G.; Handal, A.H.; Krebs, N. Text to Speech. U.S. Patent 6,865,533, 8 March 2005. [Google Scholar]
- Potkonjak, M. Voice to Text to Voice Processing. U.S. Patent 9,547,642, 17 January 2017. [Google Scholar]
- Yang, L.; Luo, P.; Change Loy, C.; Tang, X. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3973–3981. [Google Scholar]
- Lu, J.; Ma, C.; Li, L.; Xing, X.; Zhang, Y.; Wang, Z.; Xu, J. A vehicle detection method for aerial image based on YOLO. J. Comput. Commun. 2018, 6, 98–107. [Google Scholar] [CrossRef] [Green Version]
- Chen, R.C. Automatic License Plate Recognition via sliding-window darknet-YOLO deep learning. Image Vis. Comput. 2019, 87, 47–56. [Google Scholar]
- Tzutalin, D. LabelImg. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 11 November 2020).
- Sudha, D.; Priyadarshini, J. An intelligent multiple vehicle detection and tracking using modified vibe algorithm and deep learning algorithm. Soft Comput. 2020, 24, 17417–17429. [Google Scholar] [CrossRef]
- Sekeh, M.A.; Maarof, M.A.; Rohani, M.F.; Mahdian, B. Efficient image duplicated region detection model using sequential block clustering. Digit. Investig. 2013, 10, 73–84. [Google Scholar] [CrossRef]
- Seong, S.; Song, J.; Yoon, D.; Kim, J.; Choi, J. Determination of vehicle trajectory through optimization of vehicle bounding boxes using a convolutional neural network. Sensors 2019, 19, 4263. [Google Scholar] [CrossRef] [Green Version]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar]
Model | Train Dataset | mAP-50 (%) | FPS | GPU
---|---|---|---|---
YOLOv3-416 | COCO Dataset | 55.3 | 35 | GeForce TITAN X |
YOLOv3-608 | COCO Dataset | 57.9 | 20 | GeForce TITAN X |
SSD300 | COCO Dataset | 41.2 | 46 | GeForce TITAN X |
SSD500 | COCO Dataset | 46.5 | 19 | GeForce TITAN X |
FPN FRCN | COCO Dataset | 59.1 | 6 | GeForce TITAN X |
Item | Contents
---|---
Object recognition artificial intelligence | The system administrator trains the AI on vehicle and vehicle-part videos
 | Creates the vehicle and vehicle-part object recognition model for video
 | Analyzes video on system request
Video analysis results | Saves the object recognition analysis result log
 | Transmits the video analysis results to the system (JSON)
Others | Manages uploaded videos and training data
 | Saves uploaded videos and training data by directory
Item | Contents
---|---
Supported languages | Supports Korean speech recognition
 | Supports various foreign languages (including Southeast Asian languages)
Voice recognition results | Filters vehicle descriptions and profanity
 | Allows vehicle terms and profanity to be added after system development
 | Saves the voice filtering result log
 | Transmits the voice filtering results to the system (JSON)
Others | Uses STT modules from stable service providers for continuous service
Item | Content and Object No. |
---|---|
Vehicle external parts | (0) headlight, (1) bumper, (2) hood, (3) grills, (4) window, (5) engine, (6) trunk, (7) brake light, (8) hubcap/tire, (9) side-view mirrors, (10) windshield, (11) front door trim, (12) back door trim, (13) steering wheel
Vehicle internal parts | (14) center fascia, (15) glove box, (16) instrument board, (17) gear lever, (18) air conditioner, (19) rear seats |
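For darknet training, the class table above maps directly to a `.names` file with one class per object number. A small sketch (the file path and the exact class spellings are assumptions read off the table, not from the paper's released configuration):

```python
# The 20 vehicle-part classes from the table, in object-number order (0-19).
PART_CLASSES = [
    "headlight", "bumper", "hood", "grills", "window", "engine", "trunk",
    "brake light", "hubcap/tire", "side-view mirrors", "windshield",
    "front door trim", "back door trim", "steering wheel", "center fascia",
    "glove box", "instrument board", "gear lever", "air conditioner",
    "rear seats",
]

def write_names_file(path="data/obj.names"):
    """Write the class list in darknet's one-name-per-line format."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(PART_CLASSES) + "\n")
```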
Format Variable Name | Format Definition |
---|---|
frame_id | Current frame number |
filename | File name of input video |
total_frame | Total number of frames in the input video |
fps | Frame rate of input video |
time | Video playback time of the current frame |
objects | An array holding, for each recognized vehicle-part object, its identity (class_id, name), its position and size in the image (relative_coordinates), and the confidence of the recognition
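A consumer of the `_yolo.json` file can convert the relative coordinates into pixel boxes using the frame size from the video header. A sketch assuming darknet's standard JSON layout (one record per frame, an `objects` array with `relative_coordinates` holding `center_x`/`center_y`/`width`/`height` as fractions of the frame size); the helper name and threshold are illustrative:

```python
import json

def parts_in_frame(json_text, frame_w, frame_h, min_confidence=0.5):
    """Extract confident detections as (center, size) pixel boxes."""
    results = []
    for frame in json.loads(json_text):
        for obj in frame.get("objects", []):
            if obj["confidence"] < min_confidence:
                continue  # drop low-confidence detections
            rc = obj["relative_coordinates"]
            results.append({
                "frame_id": frame["frame_id"],
                "name": obj["name"],
                # Scale the fractional coordinates up to pixels.
                "box_px": (int(rc["center_x"] * frame_w),
                           int(rc["center_y"] * frame_h),
                           int(rc["width"] * frame_w),
                           int(rc["height"] * frame_h)),
            })
    return results
```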
Item | Description
---|---
Cause of sync error | The first and last frames of the original file are dropped
Sync error fixes | (1) Copy and encode the first frame dropped from the original file; (2) copy and encode the dropped last frame; (3) synchronize the frames of the original and the copy containing the recognized objects
Format Variable Name | Format Definition |
---|---|
index | Index of the word extracted from the audio file
fourLetter | Flag indicating whether the extracted text is profanity or slang
startTime | Time at which the speech of the extracted word begins
endTime | Time at which the speech of the extracted word ends
word | Text extracted from the audio file
carLetter | Flag indicating whether the extracted text is a vehicle-related term
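A minimal sketch of consuming these word records: rebuild the transcript, collect the time intervals flagged as profanity for muting, and gather the vehicle-related terms. The record layout follows the table above; the helper name is ours:

```python
def summarize_words(words):
    """Split word-offset records into transcript, mute intervals, car terms."""
    transcript, mute_spans, car_terms = [], [], []
    for w in sorted(words, key=lambda w: w["index"]):  # restore word order
        transcript.append(w["word"])
        if w.get("fourLetter"):
            # Interval to silence in the audio track.
            mute_spans.append((w["startTime"], w["endTime"]))
        if w.get("carLetter"):
            car_terms.append(w["word"])
    return " ".join(transcript), mute_spans, car_terms
```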
Item | Router Execution Source |
---|---|
Video Process System | $cmd = "./darknet detector demo data/obj.data data/yolo-obj.cfg backup/yolo-obj_final.weights \"upload/{$file}.mp4\" -out_filename \"upload/{$file}_yolo.mp4\""; $pid = backgroundExec($cmd);
Voice Process System | $cmd .= "java -jar speech-+-cloud.jar wordoffsets \"../yolo/upload/{$file}.mp4\" {$lencode}"; $pid = backgroundExec($cmd);
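The router launches both jobs asynchronously through a `backgroundExec` helper whose implementation is not shown in the paper. An equivalent pattern, sketched in Python under the contract implied by the usage above (start the process, return its PID immediately, capture its output in a log file):

```python
import shlex
import subprocess

def background_exec(cmd, log_path):
    """Start cmd detached from the caller's control flow; return its PID.

    stdout/stderr are appended to log_path so a web request can return
    immediately while the darknet/STT job keeps running.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(shlex.split(cmd),
                                stdout=log, stderr=subprocess.STDOUT)
    return proc.pid
```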
Evaluation Item (Main Performance Spec) | Unit | Proportion (%) | Development Target |
---|---|---|---|
1. Object recognition accuracy | % | 25 | 85% or above |
2. Video stabilization rate | % | 15 | 80% or above |
3. Context awareness rate | % | 25 | 35% or above |
4. Profanity slang detection and filtering accuracy | % | 15 | 80% or above |
5. Content upload accuracy | % | 20 | 90% or above |
Item | Specification | Remarks |
---|---|---|
CPU | Intel Core 9th Gen i7-9700K (4.90 GHz) | -
Motherboard | ASUS PRIME Z390-A STCOM (Intel Z390/ATX) | - |
RAM | DDR4 64 GB (DDR4 16GB × 4) | Samsung DDR4 16 GB PC4-21300 |
OS | Ubuntu Desktop | version: 18.04 LTS
LAN | port 1 (internal)—10/100 Mbps port 2 (external)—10/100 Mbps | - |
Storage | SSD: 512 GB/HDD: 4 TB (2 TB × 2) | Total: 4.5 TB
GPU | GPU 1—GeForce RTX 2080 Ti 11 GB | - |
Power | 1000 W (+12 V Single Rail) | Micronics Performance II HV 1000 W Bronze
Item | Total Time (35,536 Frames) | Average FPS | Total Objects (35,536 Frames)
---|---|---|---
YOLOv3 | 426.710524 s | 83.29 FPS | 56,816
YOLOv3+MSSIM Type 1 | 394.329463 s | 90.11 FPS | 57,150
YOLOv3+MSSIM Type 2 | 384.718328 s | 92.37 FPS | 57,150
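The Type 1/Type 2 speedups in the table come from skipping YOLO inference on frames that are nearly identical to their predecessor according to (M)SSIM. A simplified sketch: it computes a single global SSIM over flattened grayscale pixel values rather than the windowed mean SSIM of the original algorithm, and the skip threshold is illustrative:

```python
from statistics import mean, pvariance

def ssim_global(x, y, data_range=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2  # stabilizers
    mx, my = mean(x), mean(y)
    vx, vy = pvariance(x), pvariance(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def should_skip(prev_frame, frame, threshold=0.95):
    """Reuse the previous frame's detections when frames are near-identical."""
    return ssim_global(prev_frame, frame) >= threshold
```

Skipping redundant frames reduces both the analysis time and the GPU utilization, which matches the evaluation results reported below.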
Item | YOLOv3 | YOLOv3+MSSIM Type 1 | YOLOv3+MSSIM Type 2 | |||
---|---|---|---|---|---|---|
Analysis Time | Average FPS | Analysis Time | Average FPS | Analysis Time | Average FPS | |
1 min | 43.35 ms | 82.95 FPS | 30.83 ms | 116.63 FPS | 30.06 ms | 119.64 FPS |
2 min | 86.66 ms | 83.00 FPS | 55.35 ms | 129.96 FPS | 54.37 ms | 132.29 FPS |
3 min | 131.39 ms | 82.11 FPS | 91.61 ms | 117.76 FPS | 89.26 ms | 120.86 FPS |
4 min | 174.82 ms | 82.26 FPS | 118.79 ms | 121.06 FPS | 115.86 ms | 124.12 FPS |
5 min | 218.91 ms | 82.15 FPS | 154.14 ms | 116.66 FPS | 151.00 ms | 119.09 FPS |
6 min | 262.88 ms | 82.07 FPS | 178.13 ms | 121.11 FPS | 174.50 ms | 123.64 FPS |
7 min | 306.38 ms | 82.17 FPS | 215.24 ms | 116.96 FPS | 209.89 ms | 119.95 FPS |
8 min | 351.28 ms | 81.91 FPS | 245.38 ms | 117.26 FPS | 240.48 ms | 119.64 FPS |
9 min | 394.16 ms | 82.11 FPS | 269.05 ms | 120.29 FPS | 263.54 ms | 122.80 FPS |
10 min | 437.08 ms | 82.28 FPS | 298.31 ms | 120.55 FPS | 291.93 ms | 123.18 FPS |
Test No. | Compared Algorithm | Average Execution Time | Average FPS | Average Execution Time Difference | T-Value | p-Value
---|---|---|---|---|---|---
1 | YOLOv3 | 240.69 s | 82.30 FPS | 75.0 s | 5.7331 | 0.00028
 | YOLOv3+MSSIM Type 1 | 165.68 s | 119.82 FPS ||||
2 | YOLOv3 | 240.69 s | 82.30 FPS | 78.6 s | 5.7484 | 0.00027
 | YOLOv3+MSSIM Type 2 | 162.09 s | 122.52 FPS ||||
3 | YOLOv3+MSSIM Type 1 | 165.68 s | 119.82 FPS | 3.59 s | 5.9151 | 0.00022
 | YOLOv3+MSSIM Type 2 | 162.09 s | 122.52 FPS ||||
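The T-values above follow from a paired t-test over the per-video analysis times of the two compared pipelines. A stdlib-only sketch of the t statistic (the p-value additionally requires the t distribution, so it is omitted; the sample data here are arbitrary, not the paper's measurements):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for two equal-length samples of matched timings."""
    diffs = [x - y for x, y in zip(a, b)]
    # t = mean(d) / (s_d / sqrt(n)), with s_d the sample std. dev. of diffs.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```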
Model | Train Dataset | mAP-50 (%) | FPS | GPU
---|---|---|---|---
YOLOv3+MSSIM Type 1 | Ours | 94.23 | 120 | GeForce RTX 2080 Ti
YOLOv3+MSSIM Type 2 | Ours | 94.23 | 125 | GeForce RTX 2080 Ti
YOLOv3-416 | Ours | 94.23 | 82 | GeForce RTX 2080 Ti
YOLOv3-416 [21] | COCO Dataset | 55.3 | 35 | GeForce TITAN X |
YOLOv3-608 [21] | COCO Dataset | 57.9 | 20 | GeForce TITAN X |
SSD300 [42] | COCO Dataset | 41.2 | 46 | GeForce TITAN X |
SSD500 [42] | COCO Dataset | 46.5 | 19 | GeForce TITAN X |
SSD321 [43] | COCO Dataset | 45.4 | 16 | GeForce TITAN X |
DSSD321 [43] | COCO Dataset | 46.1 | 12 | GeForce TITAN X |
FPN FRCN [1] | COCO Dataset | 59.1 | 6 | GeForce TITAN X |
Test No. | Compared Algorithm | Average GPU Utilization | Average GPU Utilization Differences | T-Value | p-Value |
---|---|---|---|---|---|
1 | YOLOv3 | 82.34% | 22.29% | 109.09 | p < 0.0001 |
YOLOv3+MSSIM Type 1 | 60.05% | ||||
2 | YOLOv3 | 82.34% | 23.05% | 116.28 | p < 0.0001 |
YOLOv3+MSSIM Type 2 | 59.29% | ||||
3 | YOLOv3+MSSIM Type 1 | 60.05% | 0.76% | 3.6293 | 0.0002 |
YOLOv3+MSSIM Type 2 | 59.29% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, K.; Jeong, I.; Cho, J. Design and Implementation of a Video/Voice Process System for Recognizing Vehicle Parts Based on Artificial Intelligence. Sensors 2020, 20, 7339. https://doi.org/10.3390/s20247339