Exploring the Development of a Spontaneous Speech Corpus from Urban Farming Instructional Videos

Trisna Gelar, Aprianti Nanda


Urban farming instructional videos can help people understand farming activities. Providing transcriptions improves video quality and makes the content accessible to people with hearing impairments. Transcriptions can be generated automatically, but their quality depends heavily on the quality of the audio signal: noise or ambient sounds that occur while the speaker is talking degrade the result. In this study, DeepSpeech is used to predict letters from sound without needing to understand language rules. The Common Voice multilingual corpus helps DeepSpeech transcribe Indonesian. However, it does not cover urban farming topics, so an additional corpus is needed. Fifteen popular videos with closed captions and nine e-books on horticulture (fruits, vegetables, and medicinal plants) were curated. The video data were extracted into audio and transcriptions that conform to the specifications of the system's training data, while the urban farming text data were transformed into a language model and used to predict recognition results. The evaluation showed that the number of training epochs affects transcription performance. The language model score helps interpret domain-specific words and thus improves the word error rate (WER). Another finding is that the model cannot predict short words (one or two syllables) in the informal register or at the end of a sentence.
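Because the evaluation is reported in terms of WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance between a reference transcript and the recognized text, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the study's data or tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentences for illustration only (not from the paper's corpus):
# one substituted word and one dropped sentence-final word.
reference = "tanam bibit cabai di pot"
hypothesis = "tanam bibit cabe di"
print(word_error_rate(reference, hypothesis))
```

In this illustration, a missing sentence-final word counts as one deletion, which is the kind of error the abstract reports for short words at the end of a sentence.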


corpus, exploration, spontaneous speech, acoustic model, language model, urban farming

DOI: https://doi.org/10.17509/seict.v3i1.44548


