Implementation and Evaluation of a Real-Time Keyword Spotting System Based on the Google Speech API in a Simple Smart Assistant on Google Colaboratory

Authors

  • Singgih Astaguna T, Program Studi Rekayasa Elektro, Universitas Negeri Makassar
  • Muhammad Fauzan Yusuf, Program Studi Rekayasa Elektro, Universitas Negeri Makassar
  • Izza Etthiyyah RN, Program Studi Rekayasa Elektro, Universitas Negeri Makassar
  • Dessy Ana Laila Sari, Program Studi Rekayasa Elektro, Universitas Negeri Makassar

DOI:

https://doi.org/10.61220/

Keywords:

Keyword Spotting, Google Speech API, Smart Assistant, Healthcare, Indonesian Language

Abstract

The development of voice recognition technology, particularly Keyword Spotting (KWS), offers significant potential for critical applications such as emergency response systems in the healthcare sector. The use of cloud-based Application Programming Interfaces (APIs) like the Google Speech API provides an accessible development pathway for these applications, but its performance in specific contexts, such as for the Indonesian language, requires thorough evaluation. This study aims to implement and evaluate the performance of a real-time KWS system based on the Google Speech API within a simple smart assistant prototype designed for healthcare emergency scenarios. The system was implemented on Google Colaboratory using a Text-based KWS approach. Quantitative testing was conducted with two male participants who uttered five Indonesian target keywords ("Hi", "Tolong", "Aduh", "Oke", "Salam"), with 25 repetitions for each keyword, resulting in a total of 250 test samples. System performance was measured using the Success Rate metric, calculated from the classification of True Positives (TP) and False Negatives (FN). The results show that the system was functionally implemented with a combined detection success rate of 62.8%. Significant performance variation was found both among keywords (highest success rate for 'Tolong' at 72.0%) and between participants (77.6% vs. 48.0%), highlighting speaker dependency. It is concluded that the Google API-based KWS approach is feasible for rapid prototyping, but the current accuracy and performance variability indicate the need for more robust and specific model development before it can be relied upon for critical healthcare applications.
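The Text-based KWS approach described in the abstract can be sketched as follows: the Google Speech API returns a transcript, the transcript is scanned for the target keywords, and performance is scored with the Success Rate metric computed from TP and FN counts. The function names below are illustrative, not the authors' code; in the actual system the transcript would come from the SpeechRecognition package rather than a hard-coded string.

```python
# Sketch of text-based keyword spotting plus the Success Rate metric.
# In the real pipeline the transcript would be obtained with, e.g.,
#   speech_recognition.Recognizer().recognize_google(audio, language="id-ID")
# Here only the keyword-matching and scoring logic is shown.

KEYWORDS = ["hi", "tolong", "aduh", "oke", "salam"]  # the five target keywords

def spot_keyword(transcript: str, keywords=KEYWORDS):
    """Return the first target keyword found in the transcript, or None."""
    words = transcript.lower().split()  # word-level match avoids substring hits
    for kw in keywords:
        if kw in words:
            return kw
    return None

def success_rate(tp: int, fn: int) -> float:
    """Success Rate = TP / (TP + FN), as a percentage."""
    return 100.0 * tp / (tp + fn)

# Illustrative tally: 25 utterances of "tolong", 18 detected (TP), 7 missed (FN)
print(spot_keyword("tolong saya jatuh"))  # -> tolong
print(success_rate(18, 7))                # -> 72.0
```

With 25 repetitions per keyword every miss costs 4 percentage points, which is why per-keyword rates in the study fall on multiples of 4% (e.g., 72.0% for 'Tolong').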

Published

2025-06-08

Section

Articles