Exploration of Spontaneous Speech Corpus Development in Urban Agriculture Instructional Videos

Video transcriptions can be obtained automatically from the video maker's speech in the original language, but transcription quality depends on the quality of the audio signal and the speaker's natural voice. In this study, Deep Speech is used to predict letters from acoustic recognition without modelling language rules. The Common Voice multilingual corpus helps Deep Speech transcribe Indonesian; however, this corpus does not cover the specialised topic of urban agriculture, so an additional corpus is needed to build acoustic and language models for the urban agriculture domain. A total of 15 popular videos with closed captions and nine e-books on horticulture (fruit, vegetables, and medicinal plants) were curated. The video data were extracted into audio and transcriptions formatted as training data, while the agricultural text data were transformed into a language model used to improve the recognition results. The evaluation shows that the number of epochs improves transcription performance, and that the language model score applied during prediction improves WER because it interprets words using agricultural terms. Another finding is that the model is unable to predict short informal words and short words at the end of a sentence.


METHODS
This research uses quantitative methods: the performance of the urban agriculture instructional video corpus model is measured empirically.

Instructional Video
There are two types of instructional video delivery: the first practises the activity in real time/spontaneously; the second uses a structured narrative, including a timeline of sub-tasks (Chang et al., 2017). The second type is reported to achieve higher satisfaction because the learner can interact step by step. Unfortunately, not all video tutorials on YouTube are of this quality; content is uploaded by both beginners and professionals, so the quality of the content delivered varies (Suryani, 2022).

Spontaneous Conversation Corpus Development
Spontaneous speech is usually uttered without planning what to say beforehand (Lapasau and Setiawati, 2020). In the context of instructional videos, spontaneous speech occurs when the producer performs activities live or records the video in one take, and it features informal vocabulary, reflexive reactions, or local language.
The SMASH corpus (Lotfian and Busso, 2017) includes spontaneous speech from two Japanese male commentators providing third-person audio commentary during a professional gaming competition. Each commentator made spontaneous comments while watching the game, not only about the fight; the commentators also entertained the audience. The SMASH corpus building process consists of four main steps: 1) Data creation: the researchers curated a Legendary Pokemon game competition commented on by two professionals and divided the match scenes into three rounds. 2) Transcription: performed automatically from audio to Japanese text using a cloud-based speech-to-text (STT) service (Google Cloud STT). 3) Transcription refinement: two volunteers corrected typos or word errors in the automatic output. 4) The audio corpus and transcriptions were collected (see Figure 3).
Figure 3. SMASH corpus creation process.

Automatic Speech Recognition
Automatic speech recognition (ASR) is a technology that enables interaction between humans and computers through voice (Dyarbirru and Hidayat, 2020). Google voice search is an example of ASR technology that converts voice into text to perform everyday commands on mobile devices. In the education domain, ASR is used to transcribe teaching and learning activities into a set of indexed text that students can easily search.
Figure 4 shows the ASR architecture (Huang et al., 2016), which comprises four processes: 1) The process starts with transformation of the audio signal using signal processing and feature extraction techniques so that it conforms to the acoustic model (AM) input format. 2) The acoustic model combines acoustic or phonetic knowledge about the sound with the features extracted in the previous stage and generates a score for each feature. 3) The language model (LM) measures the likelihood, or score, of combinations of word sequences. 4) The hypothesis generation process combines the AM and LM scores to produce a hypothesis of the words spoken in the audio signal.
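To make these four stages concrete, the following is a minimal inference sketch using the deepspeech Python package (0.9-style API); the file names model.pbmm, agri.scorer, and sample.wav are placeholders, not artefacts from this study.

```python
# Minimal ASR inference sketch with the deepspeech Python package.
# File names are placeholders; audio must be 16 kHz mono 16-bit PCM.
import wave

import numpy as np
from deepspeech import Model

ds = Model("model.pbmm")                # acoustic model (stages 1-2)
ds.enableExternalScorer("agri.scorer")  # external KenLM scorer (stage 3)

with wave.open("sample.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # stage 4: hypothesis combining AM and LM scores
```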

Audio and Language Models in Mozilla Deep Speech
Common Voice is a multilingual dataset of transcribed, community-contributed speech released under a Creative Commons Zero (CC0) licence and built by Mozilla (Handoko and Suyanto, 2019). The Common Voice Indonesian dataset consists of 54 unique voices with a total of 5 hours of speech, of which 4 hours are validated (Dyarbirru and Hidayat, 2020). The data are crowdsourced; across several languages, the Mozilla Deep Speech and Common Voice models produce an average CER improvement of 5.99 ± 5.48 (Tachbelie et al., 2022).
A common approach to statistical language modelling is to measure the probability of an n-word sequence (bigrams, trigrams, etc.) occurring in a sentence. The probability is obtained by maximum likelihood estimation, which depends on the available training data.
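As an illustration, the maximum likelihood estimate of a trigram probability is simply a ratio of counts observed in the training corpus:

```latex
% MLE of a trigram probability from corpus counts
P(w_3 \mid w_1, w_2) = \frac{\text{count}(w_1\, w_2\, w_3)}{\text{count}(w_1\, w_2)}
```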
In Deep Speech, KenLM (Ruder et al., 2019) is used to process n-grams efficiently in both time and memory. KenLM offers PROBING and TRIE, data structures designed to optimise memory and CPU usage; the trie output from KenLM is used as the basis for building the language model.
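As an illustration, the kenlm Python bindings can score candidate word sequences against such a binary model; lm.binary and the Indonesian example sentences below are placeholders, not outputs of this study.

```python
# Scoring candidate transcriptions with a KenLM binary model (sketch).
import kenlm

lm = kenlm.Model("lm.binary")  # trie binary produced by KenLM

# Log10 probability of each sentence (BOS/EOS added by default); a domain
# language model should rank in-domain agricultural phrasing higher.
for sentence in ["cara menanam cabai di pot", "cara menanam xyz di pot"]:
    print(sentence, lm.score(sentence, bos=True, eos=True))
```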

Automatic Speech Recognition Performance Calculation
Word error rate (WER) and character error rate (CER) are performance measures in automatic spontaneous speech recognition (Bang et al., 2020; Besacier et al., 2014). WER measures the performance of correctly recognised word sequences, while CER is based on character sequences. Both metrics are derived from the Levenshtein distance and are useful for evaluating improvements to acoustic models.
The word error rate (WER) formula (Andrew C et al., 2004) measures the ratio of prediction errors at the word level; a good WER value is close to zero. It is expressed as WER = (S + D + I) / N, where S is the number of substituted words, D the number of deleted words, I the number of inserted words, C the number of correct words, and N the number of words in the tested sentence (N = S + D + C). The character error rate (CER) formula measures the same ratio/percentage of prediction errors at the character level; a good CER value is also close to zero.
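A minimal sketch of both metrics as normalised Levenshtein distances (word-level for WER, character-level for CER):

```python
# WER and CER as normalised edit distances (sketch).
def edit_distance(ref, hyp):
    """Minimum substitutions, deletions and insertions (S + D + I)."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                                  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                                  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # substitution or match
    return dp[-1][-1]

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)   # (S+D+I)/N

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```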

Acoustic Model Development
Figure 5 illustrates the processing of the curated agricultural instructional videos into acoustic models. There are four main stages: 1) agricultural video curation; 2) data processing, where transcription is performed automatically with the Google Speech to Text API and corrected after listening to the corresponding audio, while the audio is downloaded in mp3 format and converted to the Common Voice dataset specifications; 3) once the audio corpus and transcriptions are collected, the file name, size, and transcription text are stored in a CSV file; 4) the acoustic model is built from the audio corpus and transcriptions using Mozilla Deep Speech.
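A minimal sketch of stage 3, writing the clip index in the CSV layout the Deep Speech importers expect (columns wav_filename, wav_filesize, transcript); the paths and the example transcript are illustrative.

```python
# Index audio clips into a DeepSpeech-style CSV (sketch; paths illustrative).
import csv
import os

def write_index(rows, out_path):
    """rows: iterable of (path_to_wav, transcript) pairs."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in rows:
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])

write_index([("clips/vid01_0001.wav", "cara menanam cabai di pot")], "train.csv")
```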

Urban Agriculture Video and Text Data Curation
The acoustic corpus was obtained from the audio and transcriptions of popular YouTube instructional videos, while the language corpus was obtained from urban agriculture-themed books.

Urban Agriculture Text Data Curation
Books on horticulture (fruit, floriculture, vegetables, and medicinal plants; see Table 1) were taken from the repository of the Ministry of Agriculture of the Republic of Indonesia (see http://repository.pertanian.go.id/handle/123456789/7076) as the text source for developing the language model. Each book is in PDF format, so further processing is needed to extract text specific to agricultural terms. In addition, the authors performed manual preprocessing, such as adding bookmarks to separate less relevant parts of each book: covers, preface, table of contents, tables that are difficult to extract directly by machine, and the bibliography (Permatasari and Linawati, 2021).

Video Data Curation and Transcription
There are three data curation activities for urban agriculture instructional videos, adopted from (Gelar and Nanda, 2020): determining urban agriculture keywords; collecting videos from YouTube playlists; and selecting fifteen videos ranked by highest views and likes using the YouTube API (Novendri et al., 2020). An additional requirement is that each video has the closed caption feature, so that the transcription can be extracted automatically using the Google Speech to Text (Google STT) API.

Data Processing
To build the corpus, acoustic model, and language model, the agricultural data must be made suitable for machine learning: the videos are converted into audio and transcriptions and divided into training, test, and validation sets, and the PDF-formatted electronic books are converted into agricultural text that can be processed into language models.

Agricultural Video to Audio Processing and Transcription
There are two main processes in this phase. 1) Extracting audio and text transcriptions from the curated videos and manually verifying the transcriptions using Google Forms. Six procedures were implemented:
a. Downloading mp3 audio files from the curated videos and retrieving transcription files with the YouTube-dl library and the YouTube Transcript API.
b. Converting audio files to the Common Voice dataset specifications (mono channel, 16 kHz sample rate, WAV file format).
c. Cutting audio files according to the transcription timestamps (start, duration, and stop).
d. Filtering audio to a speech duration of 2-4 seconds, because observation showed that outside this range the speech is very short (1-2 words) or dominated by long background music.
e. Cleaning the transcription text, including case folding and removing punctuation marks.
f. Creating a Google Form (video id, sequence, timestamp, file size, split audio, sentence, file path) to verify the transcription of each audio file and improve the automatic transcription results (converting numbers/symbols/units into words; for example, 1 kg into one kilogram).
2) Unifying the audio and transcriptions into a file structure that meets the Deep Speech library specification. Three procedures were implemented:
a. Collecting the audio files into one folder named clips.
b. Combining the final results of the transcription review into one CSV file consisting of file path, file size, and text (transcription).
c. Separating the CSV file into three files prepared for training, validation, and testing.
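A minimal sketch of steps a-d above, assuming the audio has already been downloaded as audio.mp3 (e.g. with YouTube-dl); VIDEO_ID and the output paths are placeholders.

```python
# Fetch timed captions and cut 16 kHz mono WAV clips (sketch).
import os

from pydub import AudioSegment
from youtube_transcript_api import YouTubeTranscriptApi

segments = YouTubeTranscriptApi.get_transcript("VIDEO_ID", languages=["id"])

audio = (AudioSegment.from_mp3("audio.mp3")
         .set_channels(1)          # mono, per Common Voice specification
         .set_frame_rate(16000))   # 16 kHz sample rate

os.makedirs("clips", exist_ok=True)
for i, seg in enumerate(segments):
    start_ms = int(seg["start"] * 1000)
    end_ms = start_ms + int(seg["duration"] * 1000)
    if 2000 <= end_ms - start_ms <= 4000:          # keep 2-4 s utterances
        audio[start_ms:end_ms].export(f"clips/clip_{i:04d}.wav", format="wav")
```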

Agricultural Book Text Processing
There are two main processes in this phase. 1) Text extraction from the PDF-formatted agricultural books: each page image is converted to raw text with the Indonesian-language OCR model of the easyocr library (with bounding box text parameters min_size: 0, slope_ths: 0.1, ycenter_ths: 0.5, height_ths: 0.5, width_ths: 0.7, decoder: beam search; a code sketch of this step follows Table 3). The extraction is performed for each image in the bookmarked sections and the results are aggregated back into a text collection, the agricultural corpus. 2) Text data cleaning: case folding and punctuation removal are done automatically, while converting numbers, symbols, and units into words and removing other irrelevant words (residual text that carries no meaning) is done manually. The statistics of the processed agricultural text corpus are 6,257 sentences, 35,243 words, and 252,366 characters. However, the extraction cannot detect paragraphs (Table 3), so there are truncated sentences that affect the formation of n-grams.
Table 3. Example sentences from the agricultural corpus (truncated by extraction).

No  Sentence
1   mangifera indica mango fruit is one of the fruits that
2   popular in Indonesia in the period of mango harvest area tends to be
3   up with production ranging between thousand tonnes when compared with
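A minimal sketch of the OCR step described above, using easyocr's Indonesian model with the reported bounding-box parameters; the page image path is illustrative.

```python
# OCR one page image with easyocr's Indonesian model (sketch).
import easyocr

reader = easyocr.Reader(["id"])  # loads the Indonesian recognition model

results = reader.readtext(
    "book_page_001.png",
    min_size=0,
    slope_ths=0.1,
    ycenter_ths=0.5,
    height_ths=0.5,
    width_ths=0.7,
    decoder="beamsearch",
)

# Each result is (bounding_box, text, confidence); keep the text only.
page_text = " ".join(text for _, text, _ in results)
```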
The cleaned corpus text is then turned into a scorer: 1) KenLM generates the vocabulary (vocab) and the binary model (lm.binary) from the corpus text; 2) the language model optimisation process searches for the optimal lm_alpha and lm_beta parameters over 10 trials, which are used when creating the scorer; 3) the scorer built from vocab, lm.binary, and the lm_alpha and lm_beta parameters obtained in the previous process is used during evaluation to improve, i.e. give meaning to, the character predictions.
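A hedged sketch of the optimisation step: random search over lm_alpha and lm_beta, scored by WER on a held-out set through the deepspeech inference API (Mozilla's repository ships an lm_optimizer script for this purpose; the search ranges and file names below are assumptions).

```python
# Random search for lm_alpha / lm_beta (sketch; ranges are assumed).
import random

from deepspeech import Model

ds = Model("model.pbmm")
ds.enableExternalScorer("agri.scorer")  # scorer built from vocab + lm.binary

def tune_scorer(ds, dev_set, wer_fn, n_trials=10):
    """dev_set: (int16 audio array, reference transcript) pairs;
    wer_fn: e.g. the wer() function sketched earlier."""
    best = (float("inf"), None, None)
    for _ in range(n_trials):
        alpha, beta = random.uniform(0.0, 5.0), random.uniform(0.0, 5.0)
        ds.setScorerAlphaBeta(alpha, beta)
        avg = sum(wer_fn(ref, ds.stt(a)) for a, ref in dev_set) / len(dev_set)
        if avg < best[0]:
            best = (avg, alpha, beta)
    return best  # (best WER, lm_alpha, lm_beta)
```

The best-scoring pair is then baked into the scorer used at evaluation time.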

Acoustic Data Modelling
The acoustic model training process was carried out with 2 main scenarios: using the Indonesian Common Voice dataset alone, and using a combination of Common Voice (CV) and the urban agriculture corpus. Each scenario was trained for 5, 10, and 15 epochs, giving six training runs in total.
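The following is a sketch of how one such scenario can be launched, using flags from the Mozilla Deep Speech training script; all paths and the epoch count are illustrative, and the exact flag set depends on the Deep Speech release used.

```python
# Launch one training scenario (sketch; flags per Mozilla DeepSpeech training).
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "train.csv",
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--epochs", "15",                     # scenarios used 5, 10 and 15 epochs
    "--export_dir", "models/cv_plus_agri",
], check=True)
```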

Data Evaluation
Table 6 compares the performance of the acoustic and language models by number of epochs and by use of the scorer (the urban agriculture language model). The CER and WER values on both the CV dataset and the combined dataset decrease (improve) with each increase in the number of epochs (5, 10, and 15). WER with the scorer on the combined dataset is better because the language model correctly predicts meaningful words/sentences on the theme of urban agriculture; CER, which measures pronunciation accuracy, is not better than the evaluation results without the scorer.
Table 7 is a head-to-head snapshot of the predictions for 5 sentences using the combined-dataset model at epoch 15, with and without the scorer. Despite the better CER value, the words/sentences produced by the model without the scorer do not carry the correct meaning. The model with the scorer cannot predict short informal words such as "this", "well", and "can't", or short words such as "two", "this", and "one" at the end of a sentence.

CONCLUSION
There are four main conclusions, corresponding to the stages of data curation, data processing, model development, and evaluation. First, audio data and transcriptions were curated from the 15 most popular YouTube videos with Indonesian closed captions, and agricultural text data were curated from 9 horticulture-themed PDF books from the Ministry of Agriculture repository, with each book bookmarked and stripped of irrelevant information. Second, the audio data and transcriptions from the YouTube videos were processed into the format required for the training, validation, and test data of the acoustic model, while the agricultural book data were converted into clean text for language model development. Third, the urban agriculture audio data and the Indonesian Common Voice dataset were used to build the acoustic model; a total of 6 training scenarios were implemented, varying the data used and the number of epochs. The agricultural language data were transformed into a language model score, used during evaluation to give meaning to the predicted words. Fourth, the evaluation results show that the model with the highest epoch count on each dataset is the best model; the WER and CER values always decrease with each epoch. CER performance in tests without the language model is always better, but the predicted sentences do not carry meaning, while WER in tests with the language model is better and the output meaningful. Another finding is that the model cannot predict short informal words or words at the end of a sentence.
To address these shortcomings, there are four possible improvements. First, adding spontaneous short-word speech text (such as "well", "this", "that", and others) alongside the agricultural terms. Second, improving the automatic conversion of agriculture-specific words/symbols/units using Named Entity Recognition or an agricultural dictionary lookup within each sentence. Third, improving the word and paragraph segmentation process to reduce truncated sentences when extracting PDF to text. Fourth, exploring fine-tuning of the acoustic models with hyperparameter optimisation, including the number of epochs, hidden layers, early stopping, and others.

ACKNOWLEDGEMENTS
Praise be to God Almighty that this Independent Research Programme (PM) could be carried out well. The PM scheme funds were sourced from DIPA Politeknik Negeri Bandung under activity implementation agreement letter Number: 105.78/PL1.R7/PG.00.03/2021. We thank all parties for their participation in this research.

Figure 1. Research method of urban agriculture corpus development (overview of the research stages).
Figure 2. The process of commentators commenting on game competitions.
Figure 5. Processing of agricultural videos into acoustic models.
Figure 6. Processing of agricultural text into language models.
Table 1. Curation results of urban agriculture textbooks.
Table 2. Results of the agricultural instructional video curation.
Table 4. Compiled CER and WER comparison of the urban agriculture corpus.
Table 5. Comparison of CER and WER of the urban agriculture corpus.