Towards developing colloquial Indonesian language pedagogy: A corpus analysis

This study was motivated by the observation that many students of Indonesian have difficulty understanding and communicating in spoken Indonesian. This is because Indonesian is a diglossic language in which the high and low diglossic variants use different sets of grammar and vocabulary, whereas students are usually taught only the high variant. Only the high diglossic variant, formal Indonesian, has official status; the low diglossic variant, colloquial Indonesian, does not. Sneddon observed that in everyday speech the linguistic features of the high and low variants are merging into a middle variant that Errington called Middle Indonesian. This study examines the extent to which a middle variant of spoken Indonesian has formed by quantifying the high and low linguistic elements present in a corpus of everyday spoken Indonesian derived from audio-recordings and written texts containing spoken language. We collected and classified a corpus of more than 14,000 words of spoken Indonesian. With reference to published descriptions of the high (formal) and low (colloquial) variants, each colloquial item in the corpus was counted and expressed as a ratio of the total N of the corpus. Colloquial features were found with an average proportion of 0.39 across the corpus, indicating that colloquial Indonesian lexicon and grammar may contribute as much as 39% to everyday spoken Indonesian. This result demonstrates the need to include this middle variant of spoken Indonesian in the design and resourcing of materials within the Indonesian language curriculum.


INTRODUCTION
Internationally, many students of Indonesian as a foreign language have difficulty understanding and communicating in spoken Indonesian. This may be due to a lack of appropriate learning resources for teaching informal spoken Indonesian to foreign learners. Compounding this lack of resources, the formal high diglossic variant of standard Indonesian is often misrepresented, for language teaching purposes, as the informal everyday spoken language of Indonesia. Indonesian is a diglossic language (Errington, 1986; Sneddon, 2003a) in which the high and low diglossic variants use different sets of grammar and vocabulary, yet students are usually taught only the high variant. Only the high diglossic variant, formal Indonesian (FI), has official status; the low diglossic variant, colloquial Indonesian (CI), does not (Smith-Hefner, 2007; Sneddon, 2003b). An understanding of the features of Indonesian diglossia is critical to redressing the misrepresentation of the spoken language by Indonesian language teachers and resource developers.
In a diglossic situation, alongside the everyday vernacular (labeled "L" or "low" variety), a second, highly codified variety (labeled "H" or "high") is used in certain situations such as literature, formal education, or other specific settings, but is not used for ordinary conversation (Errington, 2014; Ferguson, 1959). The reality of the Indonesian linguistic landscape is much more complex than the diglossic paradigm addressed in this article once regional languages and dialects are brought into consideration (Tamtomo, 2019). This article primarily addresses the Jakartan-origin middle variant that we hypothesise has become the common contemporary spoken language of Indonesian popular culture.
Research on Indonesian diglossia was pioneered by Errington (1986), and extensive subsequent research was carried out by Sneddon (2001). Linguistic descriptions have been undertaken by Nothofer (1995), Sneddon (2001, 2003, 2006), Djenar (2006, 2008), Djenar & Ewing (2015), Tjung et al. (2006), Smith-Hefner (2007) and Kushartanti (2014). Many of these studies concentrated on the social and grammatical functions of selected lexical items. Sneddon (2003b) raised the possibility of a future merging of FI and CI into a middle variant. The gap in the research is that this merger has yet to be investigated empirically with a contemporary sample of spoken Indonesian. The objective of the current study is to investigate Sneddon's postulated FI-CI merger using both qualitative description and quantitative measures. In this paper the postulation is referred to as 'the M (middle) hypothesis': that a middle variant has become the common spoken Indonesian (SI) language. To affirm the M hypothesis, CI must be an integral feature, alongside FI, of a corpus of informal spoken language.
Indonesian diglossia has arisen from the different Malay dialects that were spoken throughout the Malay Archipelago (Errington, 2014; Ewing, 2016; Gil, 1994; Manns, 2014). Formal Indonesian (FI) is derived from the Royal Riau Malay court language, which became the basis of Classical Malay literature and was well established as the language of literature by the time of European arrival in the 16th century (Sneddon, 2003b). There were also several varieties of Market Malay, used by commoners in everyday transactions. Some of these varieties are the antecedents of colloquial Indonesian (CI). The CI variety treated in this study is the CI of Jakarta, which is strongly influenced by Jakarta's Malay dialect, Betawi Malay (Grijns, 1991; Sneddon, 2003a). Betawi Malay itself is a form of Malay influenced by Sundanese, Javanese, Balinese, Hokkien Chinese and Dutch, and these language features have in turn been inherited by Jakartan CI.
The emergence of Jakarta as the capital of independent Indonesia led to the formation of a language hybrid that we call spoken Indonesian (SI) in this article, an everyday spoken language that consists of FI and CI. This SI was largely driven by the 'new Jakartans', the post-independence generation of the capital who began fusing CI Betawi linguistic features with FI (Sneddon, 2003b). The Jakartan population, the youth especially, created many new words and phrases, even though the linguistic patterns, grammar, phonology and morphology did not evolve beyond those of Betawi Malay. It has been noted that children in Jakarta and its surroundings grow up speaking a register of Indonesian that leans strongly towards CI (Kushartanti, 2014).
While CI originated in the Jakartan speech community and its surroundings, in time, due to the prominence of Jakarta as the capital city and as an exporter of culture through its command of the media and literature, it spread to other parts of Indonesia (Sneddon, 2006). For example, outside the capital Jakartan CI can be commonly heard in radio broadcasts in regional cities such as Bandung, Denpasar and Padang as young speakers in regional cities use it during inter-ethnic interactions, as an in-group code and to project youth identity (Manns, 2014).

The taxonomies and coding of Indonesian diglossia
The FI-SI-CI taxonomy in this article corresponds to Sneddon's High, (hypothesized) Middle, and Low varieties. The FI-SI-CI coding we propose is a categorization system that establishes well-defined boundaries for each variant and allows for qualitative and quantitative linguistic analysis. FI, also referred to as standard Indonesian and known in Indonesian as bahasa Baku, is the language of formal spoken and written communication, such as government protocols and news presentations. The everyday spoken language is known by Indonesians as bahasa Sehari-hari. Indonesians certainly recognise the differences between formal and informal forms and switch between the two as the situation demands. In practice, however, there is not always a clear distinction between the use of formal and informal language (Djenar & Ewing, 2015; Sneddon, 2001). Speakers may make their informal speech somewhat more formal by incorporating some features of formal language, and thus characteristics of FI are not excluded from informal conversation (Sneddon, 2001). Likewise, the formal language does not always conform to a standard form when used in social discourse. A politician may use less formal language in an unprepared speech to demonstrate his populist intentions when trying to connect with the masses. One problem in discussing Indonesian diglossia is the lack of universally agreed terms for the different diglossic language variants and sub-variants. The next section consolidates existing sociolinguistic terminologies into a workable coding system that allows for a systematic analysis of Indonesian diglossia.

Confusion in terminology
Firstly, it is important to clarify the terminology used in relation to CI because consensus is lacking across the literature. Sneddon (2001) and Djenar & Ewing (2015) have used the term 'informal Indonesian', Smith-Hefner (2007) used the term 'spoken informal Indonesian', and Manns (2014) used the term 'Jakartan Indonesian'. Djenar (2006, p. 22) noted that many other terms have been used at different times by different writers for the colloquial variety of Indonesian, including bahasa tak baku "non-standard language", bahasa informal "informal language", bahasa gaul "social language", bahasa ABG "teen language", bahasa remaja "youth language", 'informal Jakartan Indonesian' and 'colloquial Jakartan Indonesian' (Kushartanti, 2014; Sneddon, 2006). Our view is that the terms mentioned above are often interchangeable and, in some cases, denote sub-variants of CI. The most common recent confusion amongst student researchers of Indonesian is that bahasa gaul (social language) has been mistaken for CI. In this article, we classify bahasa gaul as a sub-variant of CI because its linguistic features do not differ from those of CI, aside from some extra lexical items created by younger speakers. Smith-Hefner (2007) stated that bahasa gaul functions within the linguistic parameters of CI with additional fad words. Like all living languages, it is constantly changing as new words or expressions become popular or fall out of use. At this point, it is worth clarifying the distinction between CI and SI. CI linguistic features pre-existed in Betawi Malay. SI, on the other hand, is a modern hybrid that we propose to be a derivative of both CI and FI. SI possesses no linguistic features of its own but is dependent on those of CI and FI. The presence of CI linguistic features in SI defines SI's function as an informal language variant.
This study analyses a corpus of everyday spoken Indonesian language derived from transcribed audio-recordings, such as interviews and films, and written texts containing spoken language, such as novels and short stories. Linguistic features were classified at the lexical and sub-lexical level as CI, FI, or neutral lexemes, and transcribed using the International Phonetic Association's (IPA) set of phonetic symbols. These linguistic features included lexis, phonology, morphology and semantics. The following questions guide this research: 1. In what ways are the linguistic features of CI unique and how can they be identified and described? 2. How prevalent are the linguistic features of CI in a corpus of everyday spoken Indonesian?

METHOD
A corpus-based analytic approach was chosen as the research method because corpus-based research assumes the validity of linguistic forms and structures derived from linguistic theory (Biber, 2015). The primary goal of this research approach is to analyse the systematic patterns of variation and use for pre-defined linguistic features. The approach allowed us to ascertain how, and to what extent, pre-defined linguistic features form part of everyday spoken Indonesian. Previous descriptions of CI (Djenar, 2008; Djenar & Ewing, 2015; Kushartanti, 2014; Sneddon, 2006) were used to classify the features of CI. These non-FI linguistic features were used to inform the qualitative description of CI using the IPA. Each CI item in the corpus was counted and expressed as a ratio, both within each data sample and against the total N of the corpus. Lexical items that are 'neutral', namely uninflected base words, are not counted as CI and make up part of the remaining total (neutral + FI). The M hypothesis of Indonesian diglossia is expressed as a null hypothesis H0: CI/SI = 0 and as an alternative hypothesis H1: CI/SI > 0. The SI in these hypotheses represents the entire N of the corpus of everyday language, and the CI/SI ratio is used as a proportional measure to gauge the extent to which CI linguistic features form part of everyday informal spoken Indonesian.
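The ratio and hypotheses described above can be written out more explicitly. The notation below is an illustrative assumption and is not taken from the study: c_i, n_i, r_i and k are symbols introduced here for convenience. Two corpus-level summaries are shown, a mean of per-sample ratios and a pooled ratio, because either could reasonably be read as "the CI/SI ratio" of a corpus.

```latex
% Assumed notation (not from the original study):
%   c_i = number of CI-marked lexical items in data sample i
%   n_i = total number of lexical items (CI + FI + neutral) in data sample i
%   k   = number of data samples in the corpus
\[
  r_i = \frac{c_i}{n_i}, \qquad
  \bar{r} = \frac{1}{k} \sum_{i=1}^{k} r_i, \qquad
  R = \frac{\sum_{i=1}^{k} c_i}{\sum_{i=1}^{k} n_i}
\]
% Hypotheses as stated in the text:
\[
  H_0 : \mathrm{CI}/\mathrm{SI} = 0, \qquad H_1 : \mathrm{CI}/\mathrm{SI} > 0
\]
```

Because 0 <= c_i <= n_i for every sample, each of r_i, the mean of the r_i and R lies between 0 (no CI items) and 1 (exclusively CI items).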

Data samples
The corpus used in this study is a sample of real-world language data and is therefore assumed to be representative (Chapman & Routledge, 2009; Stubbs, in Davies & Elder, 2008). The corpus was assembled and is available online (Nataprawira, 2017). Samples have been obtained from interview recordings with native Indonesian speakers compiled by Sneddon (2006) as well as samples of spoken texts from media, internet content, billboard advertisements and audio-visual media such as TV shows and films (Table 1).
The data samples were analysed as raw data, meaning that they were not modified from their original form. Audio-visual data samples were obtained from YouTube. The corpora were collected by transcribing parts of dialogues of films, comedies and TV shows. These text samples were selected because they provide a range of discourse registers (field, mode and tenor), including some spontaneous language use (comedies) that represents naturally occurring spoken dialogue.
Examples of audio-visual data sources include dialogues from the Opera Van Java comedy show and parts of films such as Buaya Gile and Jakarta Undercover. The billboard data samples were obtained from photographs of billboards. Table 1 shows the number of data samples, the number of lexical items each sample contained and the number of CI lexical items in each category.
As our research design used descriptive statistics, a measure of statistical power for the number of word tokens collected in the corpus was not required. Instead, we selected word tokens from a range of text types and spoken registers (14,711 words across 48 data samples) to obtain a valid representation of SI language (Table 1).

Data analysis
Three methods of data analysis were used after collecting the raw corpus data (Figure 1).

Figure 1 The Mixed-Method Design of This Research
To address the research questions, a mixed-method design consisting of qualitative and quantitative analysis was chosen. The qualitative component defines the CI linguistic features in the SI corpora (research question 1), which in turn are quantitatively measured to obtain an indication of the level of CI frequency and prevalence in SI (research question 2).

Method 1 -Differentiation: Identifying and collecting non-FI linguistic features.
The differentiation method used to investigate if CI was present in the SI corpus involved the identification of linguistic features that were not FI.
In this process, lexical items were first classified as FI or non-FI through a broad analysis of the phonological, morphological and semantic features of lexical items in the corpus. The description of FI in this study followed Sneddon (1996, 2000), Quinn (2001) and Djenar (2003).

Method 2 -Qualitative analysis: Defining CI linguistic features using IPA.
Using the findings from Method 1, the CI linguistic features were categorized more discretely using the IPA. We referred to previous use of IPA in classifying the features of CI employed by Grijns (1981) in his study of variations in Betawi Malay.
The morphological analysis follows the common system used to describe affixation in Indonesian, such as that employed by Boellstorff (2002). Using existing descriptions of CI provided by previous researchers, we devised guidelines to identify CI linguistic features. The guidelines included the following indicators, examples of which are provided in the section 'Qualitative results: CI in SI corpus':
1. Syntactical ellipsis, a common feature in daily speech (Sneddon, 2006).
2. Morphological variations that are different from FI (Fan, 1990; Kushartanti, 2014).
3. Phonological divergences from FI (Kushartanti, 2014).
4. Elisions and allomorphy (Kushartanti, 2014; Sneddon, 2006).
5. Alternative lexical items not present in FI (Djenar & Ewing, 2015; Sneddon, 2006).
6. Variation in semantic properties that fall outside of FI grammar (Djenar, 2008; Sneddon, 2006).
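The three-way coding of lexical items (CI, FI or neutral) can be pictured with a small sketch. This is a deliberate simplification and not the procedure used in the study, where items were coded by hand against the indicators above. The lookup sets `CI_FORMS` and `FI_FORMS` and both function names are hypothetical. The few word forms are taken from examples discussed in this article, written in ordinary spelling rather than IPA, and the test sentence is invented.

```python
# Hypothetical mini-lexicons for illustration only. A real coding scheme
# would need the full set of indicators, not a word list.
CI_FORMS = {"make", "nangkep", "ngapain"}   # colloquial forms cited in the article
FI_FORMS = {"memakai", "menangkap"}         # their formal counterparts


def classify_token(token: str) -> str:
    """Return 'CI', 'FI' or 'neutral' for a single token."""
    # Normalise case and strip surrounding punctuation before the lookup.
    word = token.lower().strip(".,!?\"'")
    if word in CI_FORMS:
        return "CI"
    if word in FI_FORMS:
        return "FI"
    # Anything not listed is treated as neutral in this sketch.
    return "neutral"


def classify_utterance(utterance: str) -> list[tuple[str, str]]:
    """Split on whitespace and label every token."""
    return [(tok, classify_token(tok)) for tok in utterance.split()]


# Invented example sentence, not drawn from the corpus.
labels = classify_utterance("Dia nangkep ikan")
```

A lookup like this only captures indicator 5 (alternative lexical items) and whole-word reflexes of indicators 2 to 4. Ellipsis and semantic variation depend on context and cannot be detected token by token.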

Method 3 -Quantitative analysis: Measuring the CI/SI ratio.
The aim of this research was to establish quantitatively the number of CI items in the SI corpus. Descriptive statistics were applied to test the null hypothesis that the CI/SI ratio in the corpus is equal to zero; H0: CI/SI= 0 and the alternative hypothesis that the CI/SI ratio in the corpus is greater than zero; H1: CI/SI > 0.
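As a minimal sketch of this measurement, the snippet below computes a CI/SI ratio for each data sample and then two corpus-level summaries. The function names and the counts are hypothetical and the counts are invented, not taken from Table 1. The study itself reports using SPSS, so this is an illustration of the arithmetic rather than the authors' workflow.

```python
from statistics import mean


def ci_si_ratio(ci_count: int, total_count: int) -> float:
    """Proportion of CI-marked items among all lexical items in one sample."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= ci_count <= total_count:
        raise ValueError("ci_count must lie between 0 and total_count")
    return ci_count / total_count


def summarise(samples: list[tuple[int, int]]) -> dict:
    """samples holds (ci_count, total_count) pairs, one per data sample."""
    ratios = [ci_si_ratio(ci, total) for ci, total in samples]
    # Pooled ratio: all CI items divided by all items, so long samples weigh more.
    pooled = sum(ci for ci, _ in samples) / sum(total for _, total in samples)
    return {
        "per_sample": ratios,
        "mean_ratio": mean(ratios),   # every sample weighs the same
        "pooled_ratio": pooled,
    }


# Invented counts for three imaginary data samples.
demo_samples = [(40, 100), (30, 100), (90, 200)]
summary = summarise(demo_samples)
```

The two summaries generally differ when samples vary in length, so it is worth stating which one a reported corpus-level ratio refers to.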

FINDINGS AND DISCUSSION
Qualitative results: CI in SI corpus
The first method of data analysis indicated that there was a substantial number of non-FI linguistic features in the SI corpus. These features fall into the following sub-components: non-FI lexicon, non-FI morphological features, non-FI null parameter / ellipsis, non-FI elisions, non-FI phonological realizations and non-FI semantic properties. The presence of both CI and FI in the SI corpora supports Sneddon's (2006) assertion of the existence of a middle variant in spoken Indonesian. Concurrently, the notion that a pure form of FI is used as an informal spoken language can be rejected. CI can be positively verified to be an integral part of the everyday language. The second method was then applied, which involved a discrete classification of non-FI items using the IPA.
b. The adalah copula ellipsis in nominative structures such as:
2. Morphological features. Some scholars regard the following phonemic forms as allomorphs of the active me- prefix, but they could possibly also be independent morphemes inherited from Sundanese, Javanese and Balinese.

CI lexical item: make
FI lexical item: məmakai
Syntax gloss: m-(p)-ake
English gloss/translation: to use; to wear, "to use; to wear"
Note that the base word pakai in this example also undergoes a phonological shift to [pake].

b. 'n' (/n/) -X

CI lexical item: nangkep
FI lexical item: mənangkap
CI phonology/morphology gloss: n-(t)-angkəp
English gloss/translation: catch, "to catch"
Note that a phonological change also takes place in the base word: tangkap ⇨ tangkəp.

f. 'ng' (/ŋ/) -X '-in' (/-in/) & 'nge' (/ŋə/) -X '-in' (/-in/)
This is the active form of 1.3b. It is the CI variation of FI's me-X-kan and me-X-i. The example ngapain is a predication of the WH-lexical item apa and has two semantic values:
4. An existing array of alternative lexical features different from FI, which are often preferred in speech over the FI variants (see Table 2).
5. The frequent use of discourse particles that are absent in FI, as can be seen in Table 3.
Table 4.

Quantitative results: CI/SI
The third, quantitative method of analysis involved counting every lexical item with CI markings in each data sample in the corpus and statistically analysing these counts in terms of the CI/SI ratio. SPSS produced an overall mean CI/SI ratio of 0.39. An overall mean CI/SI ratio of 0.39 means that H0: CI/SI = 0 can be rejected and that H1: CI/SI > 0 can be accepted. Figure 2 illustrates the spread of the data samples as CI/SI ratios, giving a visual representation of the ratio for all corpora in their data set categories (AV: Audio-Visual; BB: Billboard; LIT: Literature; RI: Recordings of Interviews). Figure 2 shows that most data samples contained CI at ratios below 0.39, while most data samples with CI ratios above 0.39 are found only in the AV and BB categories.
Interestingly, while the overall CI ratio in the RI category in this study is below 0.39, the RI data samples compiled by Sneddon (2006) show much higher usage of individual CI words in comparison to their FI equivalents, as shown in Table 5.

Correspondence analysis and the formal-informal spectrum
The distribution of the mean ratios for each data set (Table 6) shows that most of the data sets fall within the 0.2-0.7 range, with the 0.3-0.49 dimension holding the most entries. Only three data sets fell within the <0.2 and >0.7 dimensions. The next analysis compares the dimensions of Table 5 with the formal-informal spectrum of the SI continuum (Figure 3). The dimensions of the correspondence analysis are translated as interval variables in the formal-informal spectrum to show the spread of the data samples. The left-most 0 on the spectrum represents zero presence of CI, while the right-most 1 represents usage consisting exclusively of CI. The bottom indicator marks the share of the corpus data sets that each dimension occupies. Figure 3 represents this study's quantitative findings located along the informal language continuum of SI (Djenar & Ewing, 2015; Sneddon, 2006). Figure 3 shows that none of the corpora fell at either extreme of the interval scale (0; 1), indicating that neither FI nor CI in its pure form is used as an everyday language. The shaded range covering the dimensions from <0.2 to >0.7 is where the corpus data samples are spread, with one RI data sample falling in the <0.2 dimension and one AV and one BB falling in the >0.7 dimension (Table 2). Data sets in the 0.3-0.49 CI/SI ratio dimension occupy the largest share (0.56) of the corpus (Figure 3), suggesting that a formal-informal spectrum with a 0.3-0.49 CI/SI ratio is the most commonly encountered form of SI.
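The grouping of per-sample ratios into dimensions, and the share of data sets in each dimension, can be sketched as follows. Only the labels <0.2, 0.3-0.49 and >0.7 are named in the text. The intermediate edges (0.2-0.29 and 0.5-0.7), the function names and the demonstration ratios are assumptions made for illustration and do not reproduce Table 6.

```python
from collections import Counter

# (label, inclusive lower edge, exclusive upper edge). The two middle bins are assumed.
BINS = [
    ("<0.2", 0.0, 0.2),
    ("0.2-0.29", 0.2, 0.3),
    ("0.3-0.49", 0.3, 0.5),
    ("0.5-0.7", 0.5, 0.7),
]


def dimension(ratio: float) -> str:
    """Map a CI/SI ratio in [0, 1] to the label of its dimension."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    if ratio > 0.7:
        return ">0.7"
    for label, low, high in BINS:
        if low <= ratio < high:
            return label
    # Only a ratio of exactly 0.7 reaches this point; it closes the 0.5-0.7 bin.
    return "0.5-0.7"


def dimension_shares(ratios: list[float]) -> dict[str, float]:
    """Share of data sets falling in each dimension (shares sum to 1)."""
    counts = Counter(dimension(r) for r in ratios)
    return {label: n / len(ratios) for label, n in counts.items()}


# Invented ratios for five imaginary data sets.
demo_ratios = [0.15, 0.35, 0.42, 0.48, 0.75]
shares = dimension_shares(demo_ratios)
```

With these invented values, three of the five data sets fall in the 0.3-0.49 dimension, which mirrors the kind of concentration the article reports without reproducing its figures.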

Figure 3 The Spread of Data in the SI Formal-Informal Spectrum
There are plausible reasons why three of the data sets fell in the <0.2 and >0.7 dimensions. The two data sets below 0.2 involved 1) an interview with an academic, and 2) an after-school-lesson advertisement. In the introduction to this transcript, Sneddon (2006) noted that the interview with the academic was 'somewhat formal and courteous'. Prior to that, he had stated that it is usual amongst educated people, even when conversing in informal settings, that speech consisting of CI elements is likely to occur in only short segments and that FI will always dominate the register. The more formal register in these data samples is likely to result from the education field and the high-status tenor between the speakers, which in this case demonstrates the function of FI as a language of education and formality. This serves to remind us that foreign learners of Indonesian still need to be taught about the sociolinguistic implications of their choice of register and the need to be conscious of using FI in appropriate settings.
The audio-visual data set above 0.7 is a comedy scene from a film starring the late Betawi actor Benjamin. The heavily CI-influenced informal register reflects his Betawi cultural background. These data sets are provided in the Appendix as examples to demonstrate how CI and FI were coded in the corpus data sets. To see how all the data sets were coded, see Nataprawira (2017).
Kohler and Mahnken (2010) have noted how the complexity of Indonesian language variants has been simplified in textbooks and that, consequently, the spoken language is under-represented. This has resulted in learners of Indonesian being ill-equipped to communicate in informal settings. Many informal dialogues in Indonesian language textbooks, which are usually designed or generated by the writer(s), are presented in FI. This contrasts with the results of this study, which found that FI in its pure form is not used as an informal spoken language. The common practice of misrepresenting Indonesian as exclusively FI (Djenar, 2006) is partly due to a lack of understanding of the diglossic situation and partly due to traditional educators' perception that CI is not appropriate to be taught because it is not 'good and proper' (baik dan benar) (Sneddon, 2006).

CONCLUSION
The main finding from this study is that linguistic features of colloquial Indonesian (CI) are prevalent in everyday speech. The corpus data support Sneddon's observation that standard Indonesian (FI) has merged with CI to form an informal spoken middle variant, SI. This research shows that there are no set quantitative boundaries defining the parameters of SI. The findings suggest that CI lexicon and grammar may contribute as much as 39% to everyday spoken Indonesian (SI).
The intention of this study was primarily to investigate the validity of existing observations and assertions by other scholars of the existence of SI, a middle variant, using qualitative and quantitative methods against corpora of informal language. Questions of SI use in relation to demographics are outside the scope of this article but provide opportunities for further research. The findings of this study may inform further research on SI such as geographic and demographic variations of SI, as well as diachronic CI studies, and the impact that modernity and world languages (notably English) have on SI.
This study and other similar studies on Indonesian linguistics and sociolinguistics form part of a shifting paradigm in the understanding of the spoken Indonesian language and subsequently changes in the teaching and learning of Indonesian language. A practical outcome of this research is the development of an SI language description which may inform the inclusion of CI in Indonesian language teaching materials to benefit students studying Indonesian as a foreign language.
Research suggests that utilising authentic texts in second language acquisition aids in developing native speaker competency (Gilmore, 2007). Many language learning texts created by publishers do not reflect real-life language usage. Explicit teaching and learning of CI can provide explanations of the hitherto insufficiently understood CI lexis, speech acts, semantics and pragmatics, and allow Indonesian language teachers to understand and utilise more authentic sources (e.g., contemporary real-life materials from TV, internet and films) as teaching resources.
The findings of this study lay the linguistic foundation for the development of a colloquial spoken Indonesian pedagogy. It is outside the scope of this article to detail this pedagogy, but the reader can find such detail in the unpublished doctoral thesis on which this paper is based (Nataprawira, 2017). In future publications on this subject, the authors intend to provide pedagogic models for teaching and learning colloquial spoken Indonesian. Language aspects to include are authentic texts featuring common native speaker speech acts and explicit analysis of spoken lexis, collocation and intonation, and their semantic and pragmatic implications.