Washback effects of multiple-choice, cloze and metalinguistic tests on EFL students' writing

The washback effects of different test formats on students' writing performance have always been of great importance. However, this area has not been fully explored by second language testing researchers, and there is a dearth of empirical studies unravelling the effects of different types of tests on learning. To shed some light on the issue, the present study looks into the washback effects of tests on students who are learning and using specific grammatical points in writing tasks. To this end, the researchers devised and validated three tests on a target grammatical form (i.e. present perfect and present perfect continuous), one in each of three formats (cloze, multiple-choice and metalinguistic), each containing 20 questions, to be used as an activity after each of two teaching sessions. At the end of this two-session training, two focused writing tasks were implemented. The results indicated that supporting the teaching of grammatical points with metalinguistic tests yields the highest positive washback on students' writing. Finally, some practical implications are suggested.


INTRODUCTION
Regardless of the context, school or university, in which language teaching is conducted, teaching is always subdivided into the phases of planning, teaching and learning, and evaluation (Ellis, 2003). Teaching goals are set in the planning phase in order to help find activities capable of providing learners with meaningful learning processes. Then, in the teaching and learning phase, teachers must engage their learners in suitable learning strategies (Biggs, 2003). Finally, teachers need to conduct an evaluation to find out how efficiently the utilized teaching and learning strategies accomplished the teaching goals. However, successful teaching cannot be implemented unless some kind of meaningful correspondence connects these ingredients. Furthermore, aligning learning activities and assessment strategies is a critical trait that needs to evolve in language teaching. Undoubtedly, such an alignment can be achieved when teaching goals, learning strategies, teaching strategies and evaluative tests all correspond to each other.
According to Ellis (2003), the educational purpose of assessment is to provide language learners with feedback, motivation, guidance and learning support. To achieve successful assessment, there should be a clear sense of what the course is designed to accomplish (Palomba & Banta, 1999). Once the learning outcomes have been clearly defined, developing assessment methods for determining whether these outcomes have been met becomes more attainable. Teaching methods typically make only general statements about assessment (e.g., essay test, peer assessment, learning contract, oral examination); they should, however, contain details regarding the assessment method alongside a concrete set of assessment resources (e.g. tests, test items, peer assessment forms).
As far as English language teaching is concerned, assessment seems unavoidable, since there must be some method of measuring a person's language ability (Brown, 2004). As previously mentioned, maintaining correspondence or alignment among the phases of teaching is inevitable; therefore, tests must be closely associated with pedagogical purposes (Bachman & Palmer, 1996). Accordingly, a considerable portion of the language testing literature deals with the effects of tests on teaching and learning, known as the washback effect (Hughes, 2003). According to Hughes (2003), washback refers to the positive or negative influence that tests have on teaching and learning. Despite this relatively simple definition, the bulk of studies in this area suggest that washback is an extremely complex phenomenon, as there is no consensus on its effects (Green, 2006; Rea-Dickins & Scott, 2007; Spratt, 2005; Watanabe, 2004). Since studies examining the washback effect are scarce (Safa & Goodarzi, 2014), especially regarding the writing skill, and since the test format with the greatest washback effect on students' writing remains unexplored, the current study attempts to fill this gap in the literature.

Different ways of defining washback
When it comes to applied linguistics, there are several ways to define the concept of 'washback'. In its most simplified version, it refers to the positive or negative effects that tests may have on teaching, learning processes, students, teachers, policy makers and other stakeholders (Alderson & Wall, 1993; Hughes, 2003). Today, there is a growing concern for such influence among both theoreticians and practitioners in the realm of language teaching, and it is reflected in the curriculum, teaching materials, teaching methods, testing procedures and, in a nutshell, in the learning process (Spratt, 2005). Despite its seemingly straightforward definition, the literature suggests that washback is an extremely complex phenomenon, as there is no consensus on the subject (Green, 2006; Rea-Dickins & Scott, 2007; Spratt, 2005). In order to come to a better understanding of this multidimensional phenomenon, scholars have felt that washback should be studied from various aspects, such as its different effects on different stakeholders.
One of the strongest factors that can enhance the washback influence of a test is the importance of that test in making major decisions. Sometimes, tests have direct or indirect life-changing influences on test takers' careers; that is, they are high-stakes tests. A university entrance test is a good example of this notion, from which the concept of measurement-driven instruction emerges (Pearson, 1988). Some scholars believe that this phenomenon can be beneficial for teaching and learning, assuming the tests are properly constructed and implemented (Qi, 2005). On the other hand, other scholars criticize washback for its tendency to narrow down the curriculum (Madaus, 1988). They believe that test-driven instruction limits students' and teachers' creativity (Wall, 2000).
Although validity is a well-defined and thoroughly investigated concept in testing, it is still of interest to scholars concerned with the washback issue. Morrow (1986) believes that a test's validity should be measured by the degree of its beneficial influence on learning and teaching. With this in mind, validity acquires a new educational purpose that could result in curricular and instructional changes (Pan, 2009). However, this perspective suffers from a serious weakness, since scholars have not managed to introduce proper ways of empirically establishing test validity within it. To confront this problem, Alderson and Wall (1993) tried to introduce a more unified concept of validity in which washback was addressed as part of test validity: "Whereas validity is a property of a test, in relation to its use, we argue that washback, if it exists - which has yet to be established - is likely to be a complex phenomenon which cannot be related directly to a test's validity" (Alderson & Wall, 1993, p. 116). Later, Messick (1996) used the term 'consequential validity' to propose a stronger argument and place this notion within a stronger theoretical framework. He suggests investigating "validity as a likely basis for washback" rather than "seeking washback as a sign of test validity" (p. 252). He believes that consequential validity entails facets such as "evidence and rationale for evaluating the intended and unintended consequences of score interpretation and use in both the short- and long-term, especially those associated with bias in scoring and interpretation, with unfairness in test use, and with positive or negative washback effects on teaching and learning" (Messick, 1996, p. 251).

Intended vs unintended washback
There is a common misconception that one can differentiate good tests from bad ones based on their beneficial or detrimental washback effects (Heaton, 1988). However, a deeper look at the nature of tests and the washback phenomenon reveals that the correspondence between the quality of a test and positive washback does not always hold (Hughes, 2003). Of course, this perspective has not led to the omission of the terms positive and negative washback from the literature of the field. Instead, the purpose is to highlight the fact that washback may be independent of the quality of the test, and that other factors may be at play (Messick, 1996). In language testing, negative washback has usually been attributed to tests' limiting influence on content and creativity. As such, 'teaching to the test' is considered an undesirable byproduct of some tests that results in lack of motivation and lack of knowledge. On the other hand, tests are often able to enhance learners' motivation and empower them with a sense of accomplishment (Pan, 2009).

Empirical studies on washback
The washback phenomenon did not receive much attention from language testing researchers until the early 1990s. In 1993, Alderson and Wall published an article about the effects of established testing programs and are regarded as the pioneers of empirical research in the field (Green, 2013). Since then, many studies have been conducted to explore the washback effects of high-stakes tests with a focus on content, teaching and learning. In the following notes, some of these studies are briefly reviewed. Alderson and Hamp-Lyons (1996) studied the washback effects of the TOEFL (an international proficiency test) in the USA and found a widespread tendency towards teaching to the test in TOEFL classes. A few years later, Andrews, Fullilove, and Wong (2002) investigated the washback influence of Hong Kong's national advanced English oral examination, required for admission into university, and concluded that, due to the high stakes of the test, linguistic knowledge and test-oriented skills were still the main focus of instructors, contrary to the intentions of the test constructors.
In 2004, in New Zealand, Read and Hayes used interviews, questionnaires, classroom observations and test scores to study the washback effects of IELTS (an international proficiency test) for tertiary study, and came to the conclusion that the negative washback effects of such tests are more observable in intensive courses (Read & Hayes, 2004).
One year later, Qi (2005) studied the washback effect of the national matriculation English test (NMET), part of the university entrance test battery in middle schools of China, and did not detect the presence of intended washback. However, Green (2006), whose research context was the same country (China), found washback on course content in his study of IELTS academic writing for tertiary study. In 2009, Shih conducted an inquiry into the washback effects of the GEPT (a national English proficiency test) in Taiwan and found limited and teacher-specific washback on teaching practices in contexts with a GEPT requirement (Shih, 2009).

Washback and writing tests
The washback effect can work towards improving learners' writing ability when the test design accords with the identification of the ability that is supposed to be tested. Therefore, defining the construct, writing ability, is one of the most fundamental concerns in developing a test of writing. Writing is a very complex cognitive activity, and to come up with a thorough understanding of this process we need to refer to previously established models (Bachman & Palmer, 1996; Grabe & Kaplan, 1996; Hayes, 1996; Hayes & Flower, 1980).
Writing ability can be translated into two sets of features. The first set includes relevance and adequacy of content, compositional organization, cohesion, and adequacy of vocabulary; altogether, these are labelled communicative effectiveness. The second set includes grammar, punctuation, and spelling; altogether, these are labelled accuracy. Accordingly, the washback effect can be pedagogically beneficial in writing classrooms if two general results are achieved. First, we need to be able to collect, identify, describe and classify the errors students make in a writing test and statistically determine their level of writing ability. Second, we must be capable of exploring the effectiveness of adjusting the instructional program to the features of the second language which cause problems for learners in developing writing ability.

Currie and Thanyapa (2010) studied the effect of the multiple-choice item format on the measurement of knowledge of language structure. They conducted their study with a sample of one hundred and fifty-two university undergraduates. These students took a test of English structure first in constructed-response format and later in three stem-equivalent multiple-choice formats. They found a significant and substantial increase in mean, and generally in individual scores, between the two tests. However, a direct comparison of the responses to the items in the two tests showed that only 26% of the responses were the same. This means that most of what the multiple-choice items measured was directly dependent on the item format.

Different types of tests and their washback
In another study, Rauch and Hartig (2010) compared multiple-choice with open-ended response formats of reading test items. They focused on the dimensionality of a reading comprehension assessment with non-stem-equivalent multiple-choice items and open-ended items, using German test data from 8,523 ninth graders. They concluded that a two-dimensional item response theory model with within-item multidimensionality had a superior fit compared to a unidimensional model.
Mozaffari, Alavi and Rezaee (2017) investigated the impact of response format on the performance of grammar tests. They compared multiple-choice items with their constructed-response stem-equivalents in a test of grammar, using the Rasch model to compare item difficulties, fit statistics, ability estimates and reliabilities of the two tests. By means of two independent-samples t-tests, they investigated whether the differences between the item difficulty estimates and ability estimates of the two tests were statistically significant.
There have been some studies addressing the issue of different test methods and their washback effect on language learning (e.g. Brame & Biel, 2015; Hemmati & Ghaderi, 2014; In'nami & Koizumi, 2009; Khoshsima & Pourjam, 2014; Ko, 2010). Despite the importance of the issue, there is a dearth of empirical studies unravelling the effects of different types of tests on learning. To shed some light on the issue, the present study looks into the washback effects of tests on students who are learning and using specific grammatical points (i.e. present perfect and present perfect continuous) in writing tasks. To pursue this goal, tests in three different formats were provided: context-embedded (cloze test), context-reduced (multiple-choice items), and metalinguistic tests (i.e. tests that make students consciously ponder the grammatical point taught). The study was then carried out in three phases: first, the grammatical points were taught to four different groups of students. Then, three groups received treatments in the form of a test after the teaching phase, while the control group only received extended teaching time. Finally, all groups completed a focused writing task in which the target grammar forms needed to be used.
In a nutshell, this study was conducted to answer the following questions: (1) Is there any washback effect on the writing skill of students who take tests as a learning activity? (2) Which test format has the greatest washback effect on students' writing skill when used as a learning activity?

Subjects
The subjects of the current research were 120 upper-intermediate students, both male and female, ranging in age from 17 to 23, studying English as a second language at two private language institutes in Mazandaran, Iran. To ensure the homogeneity of their proficiency, an Oxford Placement Test (OPT) (Allan, 2004) was administered to the students of four different classes, in addition to the fact that all of the participants were at the same level according to the institutes' evaluation. The participants whose scores were within one standard deviation above or below the mean were selected; the rest of the students were excluded from further analyses. Thus, the number of participants decreased to 108. Having eliminated the outliers, the researchers measured the students' writing ability prior to the beginning of the study through the writing section of a TOEFL proficiency test from the Longman Preparation Course for the TOEFL Test (Phillips, 2004). In the second phase, students' writings were measured in terms of their accuracy, fluency and syntactic complexity.
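For illustration, this homogenization step (retaining only learners whose OPT scores fall within one standard deviation of the mean) can be expressed as a short computation. The sketch below is illustrative rather than part of the original study; the scores are hypothetical sample values.

    import numpy as np

    # Hypothetical OPT scores for an initial pool of test takers
    opt_scores = np.array([52, 61, 58, 47, 66, 59, 55, 70, 44, 60])

    mean, sd = opt_scores.mean(), opt_scores.std(ddof=1)
    # Keep only learners whose scores lie within one SD of the mean
    selected = opt_scores[np.abs(opt_scores - mean) <= sd]
    print(f"mean = {mean:.1f}, sd = {sd:.1f}; retained {selected.size} of {opt_scores.size}")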
According to Kuiken and Vedder (2007), accuracy can be assessed as "the number of error-free T-units, error-free T-units per T-unit and the number of errors per T-unit" (p. 266). Since the first two criteria might be difficult to find in learners' production, the last one can provide more information about the general accuracy of L2 learners' writing. In the present study, the number of morphosyntactic, lexical, and spelling errors per T-unit was counted to measure accuracy. Syntactic complexity was defined as "the number of clauses per T-unit, the number of dependent clauses per T-unit and the number of dependent clauses per total number of clauses" (Kuiken & Vedder, 2007, p. 266). In this study, the number of clauses per T-unit was considered to measure the syntactic complexity of participants' writing performance. Regarding fluency, a measure used by Ishikawa (2006) was adopted: fluency was assessed in the TOEFL writing posttest as the number of words per T-unit.
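To make the three measures concrete, the sketch below (illustrative, with hypothetical counts; in the study the T-unit, clause, error and word counts were produced by human raters) computes accuracy, syntactic complexity and fluency from the raw counts for one text.

    # Writing measures from rater-coded counts, following Kuiken and Vedder (2007)
    # and Ishikawa (2006): errors, clauses and words per T-unit.
    def writing_measures(t_units, clauses, errors, words):
        return {
            "accuracy": errors / t_units,     # errors per T-unit (lower is better)
            "complexity": clauses / t_units,  # clauses per T-unit
            "fluency": words / t_units,       # words per T-unit
        }

    # Hypothetical counts for one learner's paragraph
    print(writing_measures(t_units=12, clauses=19, errors=7, words=150))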
Two raters, both MA holders who had scored above 600 on the TOEFL and had more than ten years of teaching experience, analyzed the paragraphs; a correlation coefficient of 0.91 indicates the reliability of the assessment. Subsequently, the homogeneity of the respondents in their writing ability was confirmed through the statistical procedure described above. Through this process, a total of 80 upper-intermediate learners were chosen. The participants were then randomly assigned to three experimental groups and one control group, with 20 participants in each group.

Instruments
In order to fulfil this project, an OPT and the writing section of a TOEFL test were used to ensure the homogeneity of the participants in terms of their general proficiency level and their writing ability. A set of researcher-made questions in three formats (cloze, multiple-choice and metalinguistic) on a grammatical form (i.e. present perfect and present perfect continuous) was used as an activity after each teaching session. The researchers devised and validated three tests on the target form; each test was in a different format (cloze, multiple-choice or metalinguistic) and contained 20 questions (60 items in all). Finally, two focused writing tasks guiding participants toward using the intended grammatical forms, extracted from textbooks (Ellis, 2003; Van Den Branden, 2006), were implemented to investigate the effects of the different test formats, accompanied by teaching, on learners' writing ability and their use of the target forms.
To validate the three researcher-made sets of tests (cloze, multiple-choice and metalinguistic), the researchers piloted each format with a class of 30 learners and used classical true-score theory item analysis, through which item facility, item discrimination and point-biserial correlation were computed for each item. Regarding item facility, following Tuckman (1978), items with a p-value (i.e. the proportion of correct answers) of less than 0.33 or higher than 0.67 were considered misfitting and rejected. For item discrimination, both a point-biserial correlation and a discrimination index were calculated for each item. According to Henning (1987), a minimum of 0.25 for the point-biserial correlation and 0.40 for the discrimination index is acceptable for an item to be included in the final version of a test. Accordingly, items with lower levels of correlation and discrimination were discarded. As a result of this process, a total of 60 items were chosen out of a pool of 90 to make up the three test formats (i.e. cloze, multiple-choice and metalinguistic), each encompassing 20 items.
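A minimal sketch of this classical item analysis is given below, assuming a 0/1-scored response matrix from the 30 pilot learners; the implementation and the simulated data are illustrative, while the cut-offs follow Tuckman (1978) and Henning (1987) as stated above.

    import numpy as np

    def item_analysis(responses):
        # responses: (n_learners, n_items) matrix of 0/1 scores
        n = responses.shape[0]
        totals = responses.sum(axis=1)
        facility = responses.mean(axis=0)  # item facility (p-value)

        # Discrimination index: upper 27% minus lower 27% proportion correct
        order = np.argsort(totals)
        k = max(1, round(0.27 * n))
        lower, upper = responses[order[:k]], responses[order[-k:]]
        discrimination = upper.mean(axis=0) - lower.mean(axis=0)

        # Point-biserial correlation of each item with the total score
        # (uncorrected: the total includes the item itself)
        pbis = np.array([np.corrcoef(responses[:, j], totals)[0, 1]
                         for j in range(responses.shape[1])])
        return facility, discrimination, pbis

    rng = np.random.default_rng(0)
    responses = (rng.random((30, 30)) < 0.5).astype(int)  # simulated pilot data
    fac, disc, pb = item_analysis(responses)
    keep = (fac >= 0.33) & (fac <= 0.67) & (disc >= 0.40) & (pb >= 0.25)
    print(f"{keep.sum()} of {keep.size} items retained")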

Procedure and design
The present study was carried out in two sessions, each lasting 45 minutes. Thirty minutes of each session were devoted to teaching the target grammatical form (i.e. present perfect and present perfect continuous). Then, the three experimental groups were given a test of 10 questions (each group received a different type of test on the same subject) and 15 minutes to work on it, while the control group simply continued the routine process of teaching. In sum, the three experimental groups received 60 minutes of instruction plus 30 minutes of working with two sets of tests containing 20 items in total. In contrast, the control group received 90 minutes of teaching over the two sessions. The course was taught using one and the same method of teaching (i.e. inductive teaching) and three different approaches to testing (i.e. context embedded, context reduced, and metalinguistic) as a support to the language learning process.
After two sessions of the above-mentioned intervention, all four groups were asked to complete two grammar-focused writing tasks. The participants were told that task completion was part of the research, but they were not informed about the purpose of the study until it was finished.
Two experienced raters (both PhD holders in TEFL with more than 15 years of teaching experience) analyzed the paragraphs in terms of their accuracy and syntactic complexity (or awareness of the target grammatical form). Fluency was excluded from the current research, as the essence of the interventions was mainly grammatical. Cronbach's alpha was used to ensure inter-rater reliability; a coefficient of 0.96 shows an acceptable level of agreement between the two raters.
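As an illustration of the inter-rater reliability estimate, Cronbach's alpha across the two raters' scores can be computed as below; a minimal sketch with hypothetical ratings, treating the raters as 'items' in the standard alpha formula.

    import numpy as np

    def cronbach_alpha(scores):
        # scores: (n_subjects, n_raters) matrix of ratings
        k = scores.shape[1]
        rater_var = scores.var(axis=0, ddof=1).sum()
        total_var = scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - rater_var / total_var)

    # Hypothetical scores from the two raters for five writing samples
    ratings = np.array([[15, 16], [12, 12], [17, 18], [10, 11], [14, 14]])
    print(f"alpha = {cronbach_alpha(ratings):.2f}")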
The design of this study was quasi-experimental, including experimental and control groups with pretest and posttest. Test type was considered the independent variable (with three levels: context embedded, context reduced and metalinguistic) and writing task completion the dependent variable of the study. The learners' proficiency level was considered a moderator variable. The SPSS 19 (Statistical Package for the Social Sciences) software package was used for all statistical analyses in this study. The significance of observed differences in participants' posttest scores was investigated through an ANOVA test. The results of this analysis are presented in the following section.
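The core inferential step (a one-way ANOVA on the posttest scores followed by Tukey HSD comparisons) was run in SPSS; the sketch below reproduces the same kind of analysis in Python with simulated group scores, so the group means and spreads are hypothetical stand-ins, not the study's data.

    import numpy as np
    from scipy.stats import f_oneway
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(1)
    # Simulated posttest scores for the four groups (n = 20 each)
    groups = {
        "metalinguistic":  rng.normal(17.3, 1.8, 20),
        "cloze":           rng.normal(12.3, 2.2, 20),
        "multiple_choice": rng.normal(12.5, 2.6, 20),
        "control":         rng.normal(13.0, 2.4, 20),
    }

    # Omnibus one-way ANOVA across the four groups
    f_stat, p_val = f_oneway(*groups.values())
    print(f"F = {f_stat:.3f}, p = {p_val:.4f}")

    # Tukey HSD post hoc comparisons over all group pairs
    scores = np.concatenate(list(groups.values()))
    labels = np.repeat(list(groups.keys()), 20)
    print(pairwise_tukeyhsd(scores, labels, alpha=0.05))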

RESULTS
This study aimed to analyze the effects of an independent variable with three levels (i.e. multiple-choice, cloze, and metalinguistic testing methods) on a dependent variable (i.e. accuracy and syntactic complexity in writing). To this end, the homogenized learners were divided into three experimental groups and one control group and underwent a two-session intervention. Each of the three experimental groups worked on a test of 10 items after the teaching phase in each session, while the control group only received teaching in all sessions. Finally, all groups took part in a writing posttest, in which two raters judged their writings in terms of accuracy and syntactic complexity (or awareness of the target grammatical form). The descriptive results of the posttest are summarized in Table 1.

As Table 1 shows, participants who were treated with metalinguistic tests after the teaching phase outperformed the other groups and the control group (M = 17.33, SD = 1.82). Participants who took the cloze tests achieved a mean score of 12.26 (SD = 2.16), followed by those who took the multiple-choice items (M = 12.50, SD = 2.62). As indicated by Table 1, the mean score of students treated with multiple-choice tests was even lower than that of the control group, who received mere teaching of the intended grammatical points (i.e. present perfect and present perfect continuous).

In order to answer the first research question and test the significance of the observed differences, an ANOVA was run; the results are presented in Table 2. A one-way between-subjects ANOVA was conducted to compare the effect of the three different testing methods on learning and use of the grammatical forms in a writing task. As shown in Table 2, there was a significant effect of testing method on the writing task at the p < .05 level across the conditions [F(3, 116) = 30.851, p < .001]. Post hoc comparisons were conducted using the Tukey HSD test. Taken together, these results suggest that taking metalinguistic and cloze tests as a learning activity does have a significant effect on learning the target forms and using them in writing tasks. However, it should be noted that multiple-choice items were found not to have any significant effect on learners' uptake and output of the intended forms. Accordingly, while multiple-choice items are an objective way of assessing students' mastery of a form in a context-reduced situation, based on the results of the current research they are not a recommended method for assisting language learning, especially for boosting writing ability.

DISCUSSION
Assisting language learning through testing is not a myth; indeed, there is a consensus on the positive effects of testing on teaching and learning (Andrews et al., 2002; Chapman & Snyder, 2000). The best portrait of the issue may be that painted by Elton and Laurillard (1979), who believe that "the quickest way to change student learning is to change the assessment system". Most studies on the effects of assessment on learning have been carried out through the lens of washback research, and most of these washback studies have been concerned with teachers', learners' or stakeholders' perspectives on the concept. Washback effects resulting from different practical assessment methods and techniques have remained fairly obscure, even though they are of crucial importance for fully comprehending the concept (McNamara, 2001). Accordingly, the present study aimed to investigate the washback effects of different grammar-focused test techniques on learners' writing task completion. The results suggest that there is a positive and significant washback effect on students' writing performance as a result of assisting teaching through different testing techniques.
This finding is in line with Brame and Biel (2015), Chehrazad and Ajideh (2012), Ko (2010), Kromann et al. (2009), Talebzadeh and Bagheri (2012), and Zarei and Neya (2014), but stands in rather sharp contrast with Loch (2010). Talebzadeh and Bagheri (2012) reported a positive washback effect of cloze tests on students' vocabulary learning. Brame and Biel (2015) declared that various testing formats can enhance learning, and suggested that feedback on tests would enhance their beneficial positive washback effects. Loch (2010), while accepting the joint effects of test format with other factors like text difficulty or test-taker characteristics, mentioned that "task type and native language use as test method variables, rarely have a statistically significant effect separately" (Loch, 2010, p. 924). These rather opposing results could be partly due to "gender, language spoken at home, and school track" (Rauch & Hartig, 2010, p. 35). Test usefulness factors (i.e. reliability, construct validity, authenticity, interactiveness, impact, and practicality) may also be at play (Bachman & Palmer, 1996) and should be controlled in future studies.
As the post hoc test illustrated, metalinguistic items had the largest effect on students' writing performance, followed by cloze and multiple-choice tests. Furthermore, there was no significant difference between the multiple-choice group and the control group. Khoshsima and Pourjam (2013) and Mozaffari and Alavi (2017) reported opposing results in favour of the multiple-choice format, but in those studies the tests were the final goal, and they did not relate tests to learning, especially to skills such as writing. Alternatively, Mizumoto, Ikeda and Takeuchi (2016) accepted the significant positive effects of cloze tests on learning and proposed that "cloze tasks require greater cognitive processing than multiple-choice tasks in reading comprehension using brain imaging. Overall, brain imaging results supported this hypothesis, with greater mean cerebral activation for cloze tasks than for multiple-choice tasks and control tasks" (Mizumoto, Ikeda, & Takeuchi, 2016, p. 74).

The results indicated that supporting the teaching of grammatical points with metalinguistic tests yields the highest positive washback on students' writing. This is in line with the findings of Wang and Wang (2013), who found significant washback effects of explicit teaching and metacognitive awareness on academic writing and reading among English language learners. The superiority of metalinguistic tasks in enforcing grammaticality in writing could have several explanations. As Swain, Lapkin, Knouzi, Suzuki, and Brooks (2009) concluded: "It reflects how each test activity draws on different knowledge sources and abilities that vary across students, and it reflects the different language learning histories experienced by our learners. In the delayed posttest stage, whereas the written responses tap into the ability to produce the verb form required by the voice of the sentence, the languaging in the stimulated recall taps into the depth of understanding" (Swain et al., 2009, p. 22). On top of that, Roehr (2006, 2007, 2008), in several studies, emphasized the differences between linguistic and metalinguistic types of knowledge, suggesting that while linguistic knowledge is assumed to be "represented in terms of flexible and context-dependent categories which are subject to similarity-based processing", "explicit metalinguistic knowledge is characterized by stable and discrete Aristotelian categories which subserve conscious, rule-based processing" (Swain et al., 2009, p. 67). Likewise, the results of the current study show that tapping into students' metalinguistic knowledge through test techniques would ideally suit a foreign language learning situation and, more importantly, support the teaching of grammaticality in writing tasks. As another possible explanation, Roehr (2006, 2008) found a significant positive correlation between learners' metalinguistic knowledge and their proficiency; furthermore, it has been reported that learners with higher levels of metalinguistic awareness tend to show higher learning gains than those with less (Mitchell, Myles, & Marsden, 2013).

CONCLUSION
To put it in a nutshell, the present study addressed two research questions. Regarding the first one, we found a positive washback effect of tests on the learning of grammatical points and the production of those forms in writing tasks. We mentioned a plethora of studies in agreement and a few opposing ones; some explanations for the contrary findings may be gender, learners' mother tongue or other factors of test usefulness.
With respect to the second research question, the results of the post hoc test indicated that the experimental groups assisted by cloze and metalinguistic tests significantly outperformed the control group in grammar-focused writing tasks, while those who received multiple-choice tests did not show an improvement over the control group. This finding represents an update on former research. The results suggest that both metalinguistic and cloze tests are suitable activities to support the production of grammatically correct written forms, but the effects of multiple-choice tests require revisiting, as some contrary evidence was observed. It has been noted that cloze tests can induce higher loads of cognitive processing than multiple-choice tests, which may give them the edge. The superiority of metalinguistic tests, which are neglected or even avoided in much of language teaching, could be explained by the difference in the types of knowledge involved and their greater correlation with written modes of production. In addition, higher levels of metalinguistic knowledge are associated with higher levels of learning gain.
A number of implications are conceivable from the results of the current study. First and foremost, language teachers, students and material developers may need to reflect more on their perspectives on language testing and consider its possible negative and positive washback effects. Moreover, if we assume that the main goal of every language assessment activity is to foster learning, and if we believe that assisting language learning through judicious test types can lead to linguistic and metalinguistic development, then it is reasonable to call for a coherent and extensive effort by all teachers, material developers, and stakeholders to develop nationally and internationally validated tests. Consequently, it is suggested to keep an eye on metalinguistic and cloze tests while teaching, studying or preparing course material for writing and grammar.
Needless to say, these proposals would benefit from further investigation. In particular, more controlled studies are needed regarding test usefulness factors (i.e. reliability, construct validity, authenticity, interactiveness, impact, and practicality). Moreover, large-scale longitudinal and qualitative studies are needed to fully document the underlying mental processes of these phenomena. Finally, the effects of cultural competence, schemata and background knowledge should be investigated in relation to the washback effects of different test formats.

LIMITATIONS
One of the greatest limitations of this study was the limited number of treatment sessions; two sessions of testing cannot fully represent the washback effects of testing. It is hoped that future studies address this limitation.