Exploring the Development of a Spontaneous Speech Corpus from Urban Farming Instructional Videos

Trisna Gelar, Aprianti Nanda


Urban farming instructional videos can help people understand farming activities. Providing transcripts improves video quality and makes the videos accessible to people with hearing impairments. Video transcripts can be generated automatically, but the result depends heavily on the quality of the audio signal: noise or ambient sounds that occur while the speaker is talking degrade transcription quality. In this study, DeepSpeech is used to predict letters from sound without needing to understand language rules. The multilingual Common Voice corpus helps DeepSpeech transcribe Indonesian. However, it does not cover urban farming topics, so an additional corpus is needed. Fifteen popular videos with closed captions and nine e-books on horticulture (fruits, vegetables, and medicinal plants) were curated. The video data were extracted into audio and transcripts that follow the specifications for the system's training data, while the urban farming text data were transformed into a language model and used to predict recognition results. The evaluation showed that the number of training epochs had an effect on transcription performance. The language model score could interpret domain-specific words and thereby improved the word error rate (WER). Another finding is that the model could not predict short words (one or two syllables) in the informal variety and at the end of a sentence.
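The abstract reports results in terms of WER. As a minimal sketch of how that metric is usually computed (word-level edit distance divided by the number of reference words), the following may help. The function name `word_error_rate` and the example sentences are hypothetical illustrations and are not taken from the study's corpus or evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one substitution ("cabai" -> "cabe") and one
    # dropped sentence-final word ("pot") over five reference words.
    print(word_error_rate("tanam bibit cabai di pot", "tanam bibit cabe di"))
```

In this made-up example, an informal spelling and a missing sentence-final word each count as one error, which is the kind of mistake the abstract describes for short words.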


corpus, exploration, spontaneous speech, acoustic model, language model, urban farming





DOI: https://doi.org/10.17509/seict.v3i1.44548



Copyright (c) 2022 Journal of Software Engineering, Information and Communication Technology (SEICT)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Journal of Software Engineering, Information and Communication Technology (SEICT) (e-ISSN: 2774-1699 | p-ISSN: 2744-1656), published by Program Studi Rekayasa Perangkat Lunak, Kampus UPI di Cibiru.
