Question Generator System of Sentence Completion in TOEFL Using NLP and K-Nearest Neighbor

Test of English as a Foreign Language (TOEFL) is one form of learning evaluation that requires questions of excellent quality. Preparing TOEFL questions in the conventional way takes a great deal of time, and computer technology can be used to solve this problem. Therefore, this research was conducted to automate the creation of TOEFL questions of the sentence-completion type. The built system consists of several stages: (1) collecting input data from foreign news sites with excellent English grammar; (2) preprocessing with Natural Language Processing (NLP); (3) Part-of-Speech (POS) tagging; (4) question feature extraction; (5) separation and selection of news sentences; (6) determination and collection of the values of seven features; (7) conversion of categorical data values; (8) classification of the blank-position target word with K-Nearest Neighbor (KNN); (9) heuristic determination of rules from human experts; and (10) selection of options or distractors based on the heuristic rules. After conducting the experiment on 10 news articles, 20 questions were obtained; the evaluation showed that the generated questions had very good quality, with a percentage of 81.93% after assessment by the human experts, and 70% had the same blank position as the historical TOEFL questions. It can therefore be concluded that the generated questions have the following characteristics: their quality follows the training data from the historical TOEFL questions, and the quality of the distractors is very good because they are derived from the heuristics of human experts.


INTRODUCTION
Educational evaluation is a process of describing, obtaining, and presenting useful information for assessing decision alternatives in learning (Stufflebeam, 1971). One form of evaluation that requires quality questions is the Test of English as a Foreign Language (TOEFL), the most well-known test in the field of ELT (English Language Teaching) (Alderson and Hamp-Lyons, 1996). The TOEFL consists of several sections: (i) listening comprehension; (ii) structure and written expression; and (iii) reading comprehension. In the structure and written expression section, the parameter to be tested relates to understanding the grammar of sentences. Two types of questions appear in this section of the TOEFL test: sentence completion and error detection. In the first type, the task is to fill in the blank in a sentence, while in the second the task is to choose the underlined word that is grammatically incorrect. TOEFL questions need to be updated on a regular basis with the latest topics, and in large numbers; this motivates automating the process of producing qualified questions, especially of the sentence-completion type. By using existing techniques in machine learning, the quality of the generated questions can be kept in accordance with the standards of previous TOEFL questions. Nilsson (1998) explained that machine learning is a field of science concerned with making a machine or computer smart, and it can greatly simplify this process. Among machine learning methods, one of the most well-known algorithms is KNN, a machine learning algorithm for classification. In addition, a well-known family of data processing techniques, NLP, can help perform the text processing. NLP is a research and application area that explores how computers can be used to understand and manipulate text (Chowdhury, 2003).
This study focused on generating sentence-completion questions from news articles using a combination of techniques: NLP, KNN, and heuristics. The proposed system takes articles with good English grammar as input data and produces a selection of sentence-completion questions together with their answers. Some related works can be found in the literature. For example, Aldabe et al. (2006) introduced ArikIturri, an application for generating fill-in-the-blank questions using NLP combined with corpora, considering morphological and syntactic aspects. Text2Test, proposed by Aquino et al. (2011), utilized text processing, scoring, and question over-generation to build questions. Araki et al. (2016) generated multiple-choice questions in the subject of biology, using question templates in the wh-question format. A learning management system with automatically generated examination papers was proposed by Cen et al. (2010). Goto et al. (2010) presented a technique, with its evaluation, for generating multiple-choice cloze questions on English grammar and vocabulary.

The Method For Generating Sentence Completion Typed Questions
As shown in Figure 1, the computational model for generating questions can be divided into two processes, a learning step and a testing step, that involve different data sets (i.e., data training and data testing). The first is used to generate question templates from historical TOEFL data as the data training, while the second uses news articles as the data testing, providing candidate questions.
Both stages consist of the same processes: inputting data, preprocessing with regex, tokenization, POS tagging with Stanford CoreNLP, calculating values according to the defined features, and converting categorical values into numerical ones. After that, the results from both stages are input into KNN to determine the word position of the blank. Some heuristics are defined to select reasonable dummy answers as distractors. After completing these processes, we obtain full questions with answer options. These processes are explained in detail in the following sections.

Data Gathering for Training and Testing
There are two sets of data required in this system: one for training and one for testing. The data training is taken from historical TOEFL questions and serves as a reference in determining the blank position of a question. It is required so that the quality of the generated questions can be maintained at the level of previous TOEFL questions. In this research experiment, the data training was taken from the following book:

Longman Complete Course for the TOEFL Test: Preparation for the Computer and Paper Tests (Phillips, 2001).

Data Processing
Preprocessing was done on both types of datasets (i.e., data training and data testing). The first stage is the removal of punctuation using regex; all punctuation marks other than dots and underscores are removed. A dot is used as a marker or separator between sentences, whereas an underscore is used to mark the blank position during feature extraction. The other preprocessing stage is tokenization, which divides a sentence into words. This stage is necessary to simplify part-of-speech tagging and feature extraction in the next stages. For example, given the complete sentence: One of the most popular Indonesian products is Batik, it has been internationally recognized. After these processes, the complete sentence is separated word by word. These processes are applied to all data.
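As an illustration, the two preprocessing stages can be sketched in Python as follows (a minimal sketch; the function name and the exact regex are ours, not the paper's):

```python
import re

def preprocess(sentence):
    """Remove punctuation except dots and underscores, then tokenize."""
    # \w already covers letters, digits, and the underscore blank marker;
    # the dot is kept explicitly as the sentence separator.
    cleaned = re.sub(r"[^\w\s.]", "", sentence)
    # Tokenization: divide the sentence into individual words.
    return cleaned.rstrip(".").split()

sentence = ("One of the most popular Indonesian products is Batik, "
            "it has been internationally recognized.")
print(preprocess(sentence))
# 14 tokens, with the comma removed: ['One', 'of', ..., 'recognized']
```
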

Part of Speech (POS) Tagging by Stanford Core NLP
At this stage, every word goes through a part-of-speech tagging (POS tagging) process to obtain information about its word class, which is then needed for feature extraction. There are many computational linguistics tools; the one used in this study is Stanford CoreNLP, which can be accessed at https://stanfordnlp.github.io/CoreNLP/. It is a toolkit created for research purposes in the NLP field (Manning et al., 2014). There are eight English word classes, namely noun, verb, pronoun, preposition, adverb, conjunction, adjective, and article. Moreover, there is a popular and commonly used tag set, the Penn Treebank tag set (Marcus et al., 1993). Applying POS tagging to the sentence from the preprocessing stage gives "CD|IN|DT|RBS|JJ|JJ|NNS|VBZ|NNP|PRP|VBZ|VBN|RB|VBN", where CD means cardinal number; IN, DT, RBS, JJ, NNS, VBZ, NNP, PRP, VBN, and RB stand for preposition, determiner, superlative adverb, adjective, plural noun, present-tense verb for the 3rd person singular, proper noun, personal pronoun, past participle verb, and adverb, respectively. Thus, each word has its own part-of-speech label, which facilitates the process of generating questions in the next stages.
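To make the running example concrete, the sketch below simply pairs the tokens with the tags quoted above. In the actual system the tags are produced by Stanford CoreNLP; the hard-coded tag list here is for illustration only:

```python
# Tokens of the running example and the Penn Treebank tags reported above.
# In the real system these tags come from Stanford CoreNLP; they are
# hard-coded here only to show the resulting (word, tag) pairing.
tokens = ("One of the most popular Indonesian products is Batik "
          "it has been internationally recognized").split()
tags = "CD IN DT RBS JJ JJ NNS VBZ NNP PRP VBZ VBN RB VBN".split()

tagged = list(zip(tokens, tags))
print(tagged[:3])  # [('One', 'CD'), ('of', 'IN'), ('the', 'DT')]
```
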

Separation and Selection of Sentences from News Articles
Basically, there are two steps in this section: separation and selection. The first is the process of separating the sentences in one long news text. It is done to make question creation easier, since a question usually contains only one sentence. A large text containing thousands of sentences is separated from dot (.) to dot (.). The separation uses a regex command: it searches for the punctuation mark and then applies the split function at every occurrence found.
At the selection stage, sentences are selected to simplify and shorten the classification process. Selection is done under two conditions: (1) sentences consist of 10 to 30 words, a requirement discussed with the expert beforehand; and (2) sentences are then chosen randomly from those that satisfy the first condition.
Based on these two requirements, the news sentences that become question candidates are expected to be of higher quality.
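The two steps can be sketched as follows (the function names and the fixed random seed are ours; the paper only specifies the dot-based split, the 10-to-30-word filter, and random choice):

```python
import re
import random

def split_sentences(news_text):
    """Separate one long news text into sentences at every dot."""
    return [s.strip() for s in re.split(r"\.", news_text) if s.strip()]

def select_candidates(sentences, how_many=2, seed=0):
    """Keep 10-30-word sentences, then pick candidates at random."""
    eligible = [s for s in sentences if 10 <= len(s.split()) <= 30]
    return random.Random(seed).sample(eligible, min(how_many, len(eligible)))

news = ("Short sentence here. "
        "One of the most popular Indonesian products is Batik and it "
        "has been internationally recognized for many years now.")
print(select_candidates(split_sentences(news), how_many=1))
```
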

Determination of Seven Features on Data Sets
This stage is the process of determining the features to be used as word attributes for the blank-position classification. These features are important because they facilitate the classification process. The seven features used in this study are as follows (Hoshino and Nakagawa, 2005):
• Pos: a column filled with the POS tag of the word on that line, auto-filled using the Stanford CoreNLP library.
• Prev_Pos: a column containing the POS tag of the previous word in the sentence, filled in the same way using the Stanford CoreNLP library.
• Next_Pos: a column containing the POS tag of the next word in the sentence, automatically filled using the Stanford CoreNLP library.
• Position: a column filled with the position of the word within the sentence. For example, if a sentence has 10 words, the words are numbered 1-10, and the position column of the first word contains the number 1.
• Sentence: a column containing the number of words in the sentence. If a sentence contains 10 words, this column contains the number 10 on every word line, from the first to the last.
• Word-Length: a column filled with the number of times the word is repeated in the sentence. For example, if a sentence contains two occurrences of the word 'of', this column contains the number 2 on each of those word lines.
• Word: a column containing the words of the sentence after the tokenization process.
In addition, Target is the output feature indicating the blank position.

Value Calculation of Seven Features on Data Sets
This is the process of collecting the seven feature values for each data set, reusing the results of earlier stages. The POS, next-word POS, and previous-word POS features are taken from the tokenization and POS tagging stages. The position feature uses a count function on each sentence to determine the order of the words. The same count function calculates how many words the sentence contains, which becomes the value of the sentence feature. The word-length feature also uses the count function, but the words in the sentence are first compared with each other, so that the word length of a word increases each time it is repeated in the sentence. Meanwhile, the target determines the classification of the word. The target in the data training is obtained automatically by detecting whether there is an underscore (_) at the end of a word: if there is, that word is the blank position in the sentence, and the target is true.
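Putting the above together, the seven feature columns plus the target can be collected per word roughly as follows (a sketch under our own naming; in the real system the POS columns come from Stanford CoreNLP output):

```python
from collections import Counter

def extract_features(tokens, tags):
    """Build the seven feature values (plus the target) for each word."""
    n = len(tokens)            # 'sentence' feature: words in the sentence
    freq = Counter(t.rstrip("_") for t in tokens)   # 'word-length' counts
    rows = []
    for i, (word, pos) in enumerate(zip(tokens, tags)):
        rows.append({
            "word": word.rstrip("_"),
            "pos": pos,
            "prev_pos": tags[i - 1] if i > 0 else None,
            "next_pos": tags[i + 1] if i + 1 < n else None,
            "position": i + 1,
            "sentence": n,
            "word_length": freq[word.rstrip("_")],
            # Training target: an underscore marks the blank position.
            "target": word.endswith("_"),
        })
    return rows

# Running example; the trailing underscore marks the blank for illustration.
tokens = ("One of the most popular Indonesian products is Batik "
          "it has been internationally recognized_").split()
tags = "CD IN DT RBS JJ JJ NNS VBZ NNP PRP VBZ VBN RB VBN".split()
rows = extract_features(tokens, tags)
print(rows[0]["position"], rows[0]["sentence"], rows[-1]["target"])
```
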

Converting Categorical Data to Continuous
KNN is a distance-based algorithm that requires numerical data to find the closest distance and determine the target. Therefore, the categorical data must be converted into numerical data. Among the seven features, the categorical ones in this research are part-of-speech and word. These categorical values are converted into continuous data using the following equation, where:
- S is the initialization value of the categorical data;
- 100 is the range, which can be changed and determined as needed;
- P is the class of each categorical feature, with P1 = part of speech, P2 = part of speech of the previous word, and P3 = part of speech of the next word;
- x is the index of the categorical class among the n classes;
- V is the categorical data vector after conversion to numeric (continuous) values.
POS tags are initialized into numbers for ease of calculation. The initialization is based on proximity between tags: the closer two tags are, the greater their similarity, as shown in Table 2.
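The equation itself did not survive in this copy, so the sketch below is only an assumed reconstruction: it maps a tag's initialization index S (which in the paper comes from Table 2) onto the 0-100 range as V = S · 100 / n. The tag list and its ordering here are placeholders, not the paper's Table 2 values:

```python
# Assumed reconstruction -- the paper's equation and Table 2 values are not
# reproduced here; the tag list and ordering below are placeholders only.
TAG_ORDER = ["CC", "CD", "DT", "IN", "JJ", "NNS", "NNP",
             "PRP", "RB", "RBS", "VB", "VBD", "VBN", "VBZ"]

def to_numeric(tag, order=TAG_ORDER, span=100):
    """Map a categorical POS tag onto a numeric 0-100 range."""
    s = order.index(tag)        # initialization index S (Table 2 in the paper)
    n = len(order)              # number of tag classes
    return s * span / n         # assumed form: V = S * 100 / n

print(to_numeric("CD"), to_numeric("VBZ"))
```
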

Since the three categorical features share the same set of tag classes, the same calculation applies to each of them; the next step is to calculate V. Applying this to the sentence above, the data and their features are as shown in Table 1. For example, the conversion of the word 'one' yields (72, 0, 16, 16). After this calculation is obtained, the same conversion is performed for every word in the sentence, giving the results shown in Table 3. This calculation is applied to all data, both data training and data testing.

Determination of Blank Position with KNN
This stage determines the target of each word in the data testing by comparing each word in a testing sentence with the words in the data training. For each word, the distance over the seven features is calculated to determine the classification target of the word that will become the blank position in the sentence. The distance calculation in the KNN stage uses the Euclidean distance formula, so the distance between words from the data training and the data testing can be calculated. For example, Table 4 shows data testing whose distance to the data training will be calculated.
From Tables 3 and 4, the Euclidean distance is computed; for instance, the distance between the word 'One' (training) and the word 'Industry' (testing). This calculation is performed for all data testing against the data training, yielding the k nearest distances for each word. In this example, k = 3: the majority target among the three closest training words becomes the target of the testing word. If two of the three targets are false, then the testing target is false as well. From that target, the blank position can be determined: a word with a true target in a sentence becomes the blank position.
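The blank-position classification can be sketched as a plain k = 3 nearest-neighbour vote over Euclidean distance (our own minimal implementation, with toy two-dimensional vectors standing in for the full seven-feature vectors):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_target(test_vec, train_vecs, train_targets, k=3):
    """Majority target among the k nearest training words."""
    nearest = sorted(range(len(train_vecs)),
                     key=lambda i: euclidean(test_vec, train_vecs[i]))[:k]
    votes = Counter(train_targets[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: 2-D stand-ins for the converted seven-feature vectors.
train_vecs = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11)]
train_targets = [True, True, False, False, False]
print(knn_target((0, 0.5), train_vecs, train_targets))  # True
```
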

Determination of Heuristics and Distractors
This stage is the process of determining heuristics to produce distractors. The heuristics are structured to produce qualified distractors. Rules were defined for the following word classes: verb, preposition, pronoun, modal, determiner, conjunction, wh-pronoun, wh-determiner, wh-possessive, and wh-adverb. For example, verbs have several tag types, namely VB, VBD, VBG, VBN, VBP, and VBZ. Distractors for a verb are taken from an online English dictionary using the Application Programming Interface (API) of Ultralingua, which can be accessed at http://api.ultralingua.com/. This API generates all possible related forms of the verb with a feature called 'verb conjugation'.
After determining the heuristics, the last stage is to generate incorrect answers as distractors. Since a sentence-completion question has four options with one correct answer and three wrong answers, three distractors must be chosen. For instance, if the POS tag of the correct answer is VBZ, then the distractors could be the corresponding verb forms with the POS tags VBD, MD VB, and VB. For example, in the question: The earth spins on its axis and … 23 hours, 56 minutes and 4.09 seconds for one complete rotation.
A. Needed B. Will need C. Need D. Needs
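The verb heuristic above can be sketched as follows. The conjugation table here is hand-written for the single example word; in the actual system the surface forms come from the Ultralingua API's 'verb conjugation' feature:

```python
# Which wrong verb tags to use when the correct answer is tagged VBZ,
# taken from the example above.
DISTRACTOR_TAGS = {"VBZ": ["VBD", "MD VB", "VB"]}

# Placeholder for the Ultralingua API lookup: conjugations of one verb.
CONJUGATIONS = {
    "needs": {"VBD": "needed", "MD VB": "will need", "VB": "need"},
}

def make_distractors(answer, answer_tag):
    """Return the three incorrect options for a verb blank."""
    forms = CONJUGATIONS[answer]
    return [forms[tag] for tag in DISTRACTOR_TAGS[answer_tag]]

print(make_distractors("needs", "VBZ"))  # ['needed', 'will need', 'need']
```
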

Experimental Design
At the experimental stage, the system implements the model described above and produces TOEFL questions of the sentence-completion type. Fifty questions were generated: 30 questions from the data training (about 1% of it) and 20 questions from 10 different news articles. As mentioned earlier, 10 news websites with different topics were used as data testing, as listed in Table 5.
After conducting the experiments, three kinds of analyses were done, as follows: 1. Same blank position analysis: this analysis measures the accuracy of the system in choosing the blank position. For the questions generated from the data training, the number of blank positions equal to those of the original questions is counted to obtain the level of accuracy.

2. Consistency of the answers analysis: this analysis involves experts answering the generated questions. Two experts answer the questions, and their answers are checked against the provided answer key.
3. Evaluation and analysis of question quality by the experts: the generated questions are evaluated by two experts to determine their quality. The assessment is based on four metrics proposed by Araki et al. (2016) and covers the overall aspects of both questions and distractors. The researchers used a three-point difficulty scale: 1 (easy) means the generated question is considered easy, 2 (medium) means it is considered sufficient, and 3 (hard) means it is considered very difficult.

Experimental Results
After executing the proposed system as explained in the previous section, we obtained 50 generated questions. Table 6 contains some of the questions that were generated; the words highlighted in yellow are the correct answers.

Discussion
This section presents the analysis of the results obtained under the experimental design of the previous section: the same blank position analysis, the consistency of the answers analysis, and the evaluation and quality analysis of the questions by the experts.

The Same Blank Position Analysis
As explained previously, this analysis measures the accuracy of the blank position by comparing the sentences generated from the data training with the original questions.
Table 7 shows the results of the same blank position analysis. The number 1 indicates that the blank position in the question generated by the system was equal to the blank position in the original question. Twenty-one out of 30 questions (70%) have the same blank position. These results indicate that some questions generated by the system still have different blank positions. The difference may be caused by a smaller distance to a selected tag whose majority target differs, so that the obtained blank position is not the same as in the original question.

Consistency of the Answer Analysis
As mentioned previously, this analysis checks whether the experts answered the questions in accordance with the answer key generated by the system. The experts answered 30 questions from the data training and 20 questions from the data testing.
As shown in Table 8, the two experts answered different numbers of the data-training questions correctly. Expert 1 (E1) answered 25 out of 30 questions correctly, a percentage of 83%. Meanwhile, Expert 2 (E2) answered 21 questions correctly (70%). As shown in Table 9, of the 20 questions generated from the data testing, Expert 1 answered 19 questions according to the answer key (95%), whereas Expert 2 answered only 16 questions according to the answer key (80%).
Based on the results from these two experts, it can be concluded that not all questions have good answers or good distractors, as proved by the differences between the experts' answers and the answer key. The differences can be caused by ambiguity in a sentence, or by a distractor that yields two correct answers. From this assessment, the average answer consistency is 81.25%.

Question evaluation and quality analysis by Human Experts
Based on the experimental design, the questions were evaluated with four assessment metrics: grammatical correctness (GC), answer existence (AE), distractor quality (DQ), and difficulty index (DI). The experts' evaluation gave average assessment indices of 1.05 for grammatical correctness, 1.09 for answer existence, 1.71 for distractor quality, and 1.66 for the difficulty index. The question quality is then classified into five categories: very good (between 80 and 100%), good (between 60 and 80%), enough (between 40 and 60%), less (between 20 and 40%), and very less (less than 20%). The results of these calculations are presented in Table 10.

The Comparison with Previous Research
In this section, the model and implementation of this study are compared with previous studies of a similar type. There have been many studies related to question generation, some of which served as references for developing this system model, in terms of the algorithm, the problem attributes, and the evaluation of question quality. The comparison is shown in Table 11. The results of the question evaluation showed that the generated questions have excellent quality, with a percentage of 81.93% after analysis by the experts, 81.25% consistency of the answers, and 70% of blanks in the same position.
Based on the results and analyses, this study contributes a tool for automatically generating sentence-completion TOEFL questions from news articles.

AUTHORS' NOTE
The author(s) declare(s) that there is no conflict of interest regarding the publication of this article. The authors confirm that the data and the paper are free of plagiarism.

Figure 1. Flow model of the question generator system.

Table 1. Example of data training with the seven feature values.

Table 3. Example of the value conversion of Table 1.

Table 2. Initialization values of POS tags.

Table 4. Example of the seven feature values in data testing.

Table 5. News sites data used in the study.

Table 5 (continued). News sites data used in the study.

Table 6. Results: 50 questions and answers generated by the proposed system.

Table 7. The same blank position analysis.

Table 8. The analysis by experts on data training (fitting step).

Table 9. The analysis by experts on data testing (testing step).

Table 10. The calculation results of each parameter.

Table 11. Comparison with other systems.

Table 11 (continued). Comparison with other systems.