Assistive Robot for Automatic Sorting Using Voice-Guided Human-Machine Interaction
Abstract
This paper presents a virtual environment in which an assistive robot classifies and localizes objects and sorts them under voice-guided human-machine interaction. The goal is to support the storage or ordering of products by robotic systems, reducing the risks that excessive loads and repetitive tasks pose to human operators. The proposed approach integrates computer vision techniques and transformer-based speech recognition models within the simulation environment. Specifically, a ResNet18 neural network, selected for its low computational demand and high efficiency in classification and localization tasks, is used to identify objects accurately. As a contribution to the state of the art, a human-machine interaction environment is developed with natural language processing algorithms oriented toward industrial applications: the sorting order is specified by voice commands that are captured and transcribed by a wav2vec-based speech-to-text algorithm, allowing users to interact naturally and efficiently with the robotic system. Experimental validation demonstrates the robustness of object detection and the reliability of speech recognition, highlighting the system's effectiveness and its potential for automated industrial scenarios.
Keywords: ResNet, Speech-to-text (STT), Human-robot collaboration, Automation.
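As an illustration of how a transcribed voice command could drive the sorting order described in the abstract, the following minimal sketch maps a transcript to an ordered list of object classes and then orders the detected objects accordingly. It is not the authors' implementation. The names `Detection`, `parse_sorting_order`, and `plan_picks`, the class labels, and the keyword-matching rule are all hypothetical; the transcript is assumed to come from a wav2vec-based speech-to-text model and the labels from a ResNet18 classifier, neither of which is run here.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, not from the paper)."""
    label: str  # class name, assumed to come from the image classifier
    x: float    # assumed position of the object in the workspace
    y: float


def _candidates(token):
    """Yield the token itself plus naive singular forms ("boxes" -> "box")."""
    yield token
    if token.endswith("s"):
        yield token[:-1]
    if token.endswith("es"):
        yield token[:-2]


def parse_sorting_order(transcript, known_labels):
    """Return the known class labels in the order they are spoken.

    The transcript is lower-cased because CTC-based speech models often
    emit upper-case text; punctuation is stripped from each word.
    """
    order = []
    for word in transcript.lower().split():
        token = word.strip(".,;:!?")
        for form in _candidates(token):
            if form in known_labels:
                if form not in order:  # keep only the first mention
                    order.append(form)
                break
    return order


def plan_picks(detections, order):
    """Order detections by the spoken class order, dropping unrequested classes.

    sorted() is stable, so objects of the same class keep their
    original detection order.
    """
    rank = {label: i for i, label in enumerate(order)}
    requested = [d for d in detections if d.label in rank]
    return sorted(requested, key=lambda d: rank[d.label])
```

For example, the transcript "SORT THE BOXES THEN THE BOTTLES" with the known labels `{"bottle", "box", "cup"}` would place every detected box before every detected bottle and leave cups untouched. Simple keyword matching is used here only to keep the sketch short; a production system would need a more robust command grammar.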


