Abstracts (Day 2)

A taxonomy of adaptive testing
Robert Mislevy
This talk builds on foundational work on probabilistic frames of reference and principled assessment design to explore the role of adaptation in assessment. Assessments are characterized in terms of their claim status, observation status, and locus of control. The relevant claims and observations constitute a frame of discernment for the assessment. Adaptation occurs when the frame is permitted to evolve with respect to the claims or observations (or both); adaptive features may be controlled by the examiner or the examinee. In describing the various combinations of these characteristics, I note the advantages of an online format for supporting common and emerging assessment practices in light of adaptation.

Error-coded ESL learners’ essays and automated diagnostic feedback
Yong-Won Lee (with M. Chodorow and Claudia Gentile)
Automated error-detection and feedback systems are becoming an important component of online writing practice services for ESL learners. In refining such systems, one promising idea is to identify typical error patterns for ESL learners with different first language (L1) backgrounds and use such information to improve the accuracy of detection and feedback. The main purposes of the presentation are: (a) to describe samples of error-coded TOEFL® essays obtained for our study, (b) to discuss patterns of writing errors observed in the essays for different L1 groups, and (c) to outline potential areas of application for such error-coded essays. Approximately 480 essays written for four TOEFL prompts were error-coded and corrected by human coders. Five major L1 groups (Arabic, Chinese, Japanese, Korean, and Spanish) were represented in the essays. A total of eight coders were recruited and trained to code each essay for writing errors. Each essay was initially coded by two coders using a restructured version of Dagneaux et al.’s (1996) coding scheme. A much larger set of essays was error-coded by an automated error detection system. Writing errors observed in the essays were classified into major types and counted for each type. For different L1 groups, frequencies of different types of errors were examined at different score levels. Our presentation will show patterns of unique and common errors for different L1 groups, types of errors that are more easily captured by the automated system or humans, and their implications for automated error detection and feedback in diagnostic writing assessment.
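
The tabulation step described in this abstract (counting error types for each L1 group at each score level) can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `tabulate_errors`, the record layout, and the error-type labels in the usage note are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tabulate_errors(coded_errors):
    """Count error types per (L1 group, score level).

    coded_errors: iterable of (l1, score_level, error_type) tuples, one per
    coded error. This record layout is assumed for illustration only.
    """
    table = defaultdict(Counter)
    for l1, score_level, error_type in coded_errors:
        table[(l1, score_level)][error_type] += 1
    return table


def most_common_error(table, l1, score_level):
    """Return the most frequent error type for a group, or None if no data."""
    counts = table.get((l1, score_level))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Given records such as ("Korean", 3, "article") and ("Spanish", 4, "preposition"), where the labels are hypothetical, `tabulate_errors` returns one counter per group and score level that can then be compared across L1 backgrounds.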

Machine scoring in diagnostic assessment of L2 spoken language skills
Masa Suzuki and Alistair van Moere
One of the difficulties in providing diagnostic information about L2 learners’ speech is ensuring that each dimension of oral proficiency is measured separately as intended. Even when rating scale categories are well-defined, human raters tend to transfer judgment across traits when, for example, assessing pronunciation, fluency, vocabulary usage or grammatical accuracy (Fulcher, 2003). By utilizing a computerized scoring system, these traits can be maintained separately during the evaluation process in order to provide useful diagnostic information to teachers and learners. Building on automated speech processing systems and IRT-based scoring systems, Versant automated spoken language tests (e.g. English, Spanish) report an Overall score and four diagnostic subscores: Sentence Mastery, Vocabulary, Fluency, and Pronunciation. The advantages are that reliable (r = 0.88 to 0.97) scores and score descriptors are provided via a website minutes after a test is completed. Moreover, samples of speech can be stored electronically for later retrieval. The challenge of utilizing automated speech processing systems is to train them to accurately recognize non-native utterances and then predict how human judges would evaluate those utterances. This presentation will show how speech recognizers are optimized for this purpose, and how acoustic parameters in test-takers’ responses (e.g. rate of speech and pausing) can be calibrated with human raters’ judgments of aspects of speech (such as fluency). Studies will be presented which suggest that such carefully developed automated scoring systems are sensitive to improvements in learners’ progress over time and can provide similar diagnostic judgments to those of human raters (r = 0.89 to 0.93).
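
As a rough illustration of what calibrating acoustic parameters against human judgments can mean, the Python sketch below fits an ordinary least-squares model that predicts a human fluency rating from two features and then correlates the predictions with the human ratings. It is not the Versant scoring model, which the abstract does not specify; linear regression, the function names, and the numbers in the usage section are all assumptions.

```python
import math


def fit_least_squares(features, targets):
    """Fit targets ~ b0 + b1*x1 + ... by solving the normal equations.

    Assumes the features are not collinear; a singular system would raise
    ZeroDivisionError.
    """
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coeffs, feature_row):
    """Apply the fitted intercept and weights to one response's features."""
    return coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], feature_row))


def pearson_r(xs, ys):
    """Pearson correlation between machine predictions and human ratings."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Invented illustration data: (words per second, mean pause in seconds)
    # paired with a human fluency rating. These are not study results.
    acoustic = [(1.2, 0.9), (1.8, 0.6), (2.4, 0.5), (2.9, 0.3), (3.1, 0.4)]
    human = [2.0, 3.0, 4.0, 5.0, 5.0]
    weights = fit_least_squares(acoustic, human)
    machine = [predict(weights, row) for row in acoustic]
    print("agreement with human ratings: r =", round(pearson_r(machine, human), 2))
```

In practice the correlation would be computed on responses that were not used for fitting; correlating on the fitting data, as this short demonstration does, overstates the agreement.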

Automated diagnostic writing tests
Elena Cotos and Nick Pendar
This presentation explores whether the use of natural language processing (NLP) techniques is necessary and/or viable for diagnostic computer-assisted language testing (CALT) of written learner language. We discuss the motivations for incorporating NLP in diagnostic CALT by investigating what we want to diagnose in learners’ constructed responses and identifying NLP techniques that can help yield information about the constructs of interest. Diagnostic language assessment can greatly benefit from a collaborative union of CALT and NLP. Currently, CALT can shed light on learners’ knowledge of vocabulary and grammar and, to a lesser extent, on writing mechanics and text organization skills (Chapelle & Douglas, 2006). This is achieved through multiple-choice, rearrangement, true/false, cloze, and matching tests, as well as short-response questions. Most current CALT applications mainly allow for inferences based on learners’ recognition and comprehension of linguistic input and hardly concern language production (Holland et al., 1993). Therefore, diagnostic assessment would be incomplete if it relied only on technological applications available in the current state of CALT. NLP is now at a stage where it can be used or adapted for diagnostic testing of learner production skills. For instance, robust syntactic parsers and morphological analyzers can be adapted to assess grammar; text segmentation and categorization techniques can be used to assess text organization and discourse structure; and automated lexical analyses can provide information about learners’ vocabulary (Attali & Burstein, 2006). An automated system with these capabilities, coupled with an effective student model (Heift & Schulze, forthcoming), will not only serve as a better diagnostic tool; it can also enhance learners’ motivation by providing contextualized individual feedback (Cook & Fass, 1986).
References
Attali, Y., and Burstein, J. (2006). Automated Essay Scoring With e-rater® V.2. The Journal of Technology, Learning, and Assessment, 4(3), 1-34.
Chapelle, C., and Douglas, D. (2006). Assessing Language through Computer Technology. Cambridge Language Assessment Series. Cambridge: Cambridge University Press.
Cook, V., and Fass, D. (1986). Natural Language Processing by Computer and Language Teaching. System, 14(2), 163-170.
Heift, T., and Schulze, M. (forthcoming). Errors and Intelligence in Computer-Assisted Language Learning: Parsers and Pedagogues. New York: Routledge.
Holland, V. M., Maisano, R., Alderks, C., and Martin, J. (1993). Parsers in Tutors: What Are They Good For? CALICO Journal, 11(1), 28-46.

Decisions about automated scoring: What they mean for our constructs
Nathan Carr
This presentation discusses how decisions about the scoring criteria used in the automated scoring of constructed response items can affect the constructs. It will begin with a brief overview of three general approaches to automated scoring: natural language processing (NLP), exact word matching, and keyword matching. This overview will be followed by a discussion of the benefits of automated scoring. The presentation will then focus on the use of keyword matching in scoring comprehension items, and on reasons why this approach is clearly superior to exact word matching and can be preferable in some cases to the more powerful method of NLP. Using the classification scheme developed by Carr, Pan, and Xi (2002), the presentation will consider the ways in which automated scoring can affect the constructs of a test, dividing these ways into unintended alterations, purposeful/principled refinement, and mixed cases. This will be followed by a discussion of the decisions to be made about these issues, and their implications for the construct definition(s) used in a test. In particular, the presentation will address the effects of decisions regarding partial credit, misspellings, synonyms, paraphrases, and penalizing for extraneous information. This will be accompanied by a discussion of ways in which to implement keyword matching approaches in the context of web-based testing, with an emphasis on low-budget approaches, including the author’s ongoing development of a keyword matching automated scoring program which runs in Microsoft Excel.
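
The scoring decisions named in this abstract (partial credit, misspellings, synonyms, and penalties for extraneous information) can be made concrete with a small Python sketch. It is an illustration only and does not reproduce the author's Excel program; the function name `score_response`, the one-point length penalty, and the 0.8 spelling threshold are assumptions.

```python
import difflib
import re


def score_response(response, keywords, synonyms=None, max_words=None,
                   spelling_cutoff=0.8):
    """Score a short constructed response by keyword matching.

    keywords: required key terms; each is worth one point, so the returned
        fraction gives partial credit.
    synonyms: optional dict mapping a key term to acceptable alternatives.
    max_words: if set, a longer response loses one point (a hypothetical
        penalty for extraneous information).
    spelling_cutoff: difflib similarity ratio at or above which a misspelled
        word still counts (an assumed threshold, to be tuned on real data).
    """
    synonyms = synonyms or {}
    words = re.findall(r"[a-z']+", response.lower())
    points = 0
    for key in keywords:
        accepted = [key.lower()] + [s.lower() for s in synonyms.get(key, [])]
        # get_close_matches tolerates small misspellings of an accepted form.
        if any(difflib.get_close_matches(form, words, n=1,
                                         cutoff=spelling_cutoff)
               for form in accepted):
            points += 1
    if max_words is not None and len(words) > max_words:
        points = max(0, points - 1)
    return points / len(keywords) if keywords else 0.0
```

Each parameter corresponds to one construct decision: omitting `synonyms` narrows the construct toward recall of exact wording, raising `spelling_cutoff` toward 1.0 makes spelling part of what is measured, and setting `max_words` penalizes responses that bury the answer in copied text.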

Towards Cognitive Response Theory in diagnostic language assessment
Quan Zhang
This paper focuses on how the reliability of diagnostic inferences can be achieved and monitored during computer-based language testing. The author also expresses some doubts about the current practice of CAT and puts forward some tentative improvements informed by principles from cognitive science. The author argues that, with advances in computer programming and multimedia production, the jumbled word (JW) test form will be a promising alternative to the multiple-choice (MC) question format in large-scale computer-based tests. Once this is done, the practice of language testing should be guided by Cognitive Response Theory. This will undoubtedly bring an innovative change to computer-based language testing, one that will dramatically alter the face of its ongoing practice. Implementing and managing such a significant change is believed to require advanced computer programming under the guidance of cognitive science. The research presented here, though not yet fully mature, calls for feedback from the larger community of language testing experts and can hopefully serve as a good basis for further research towards computerized cognitive testing.

What and how much evidence do we need?
Xiaoming Xi
As has been established in the literature, establishing the validity of an automated scoring system requires evidence beyond human-automated score agreement. Earlier validation work targeting automated scoring has addressed one or more of three areas: (1) demonstrating the correspondence between scores produced by automated scoring systems and by human scorers, (2) examining the relationship between automated scores and criterion measures external to the assessment, and (3) understanding the construct represented within the scoring processes that automated scoring systems use (Yang et al., 2002). Clauser, Kane and Swanson (2002) have articulated an argument-based approach which subsumes and coherently integrates these three areas of investigation to position automated scoring in a larger validity argument for the whole assessment. They argue that the decision to use automated scoring will impact not only the strength of the evaluation inference, which links test performance to observed test scores, but also the subsequent inferences in the validity argument. These inferences pertain to score generalizability, score interpretation, score-based decisions, and consequences of score use. Clauser et al. (2002) describe these as the “ripple effects” of automated scoring that “extend through each step in the argument”. In my talk, I will discuss how this approach could be applied and extended to anticipate the potential threats that may be introduced by an automated speech scoring system. I will also talk about how it can guide the development of the relevant evidence necessary to reduce each potential threat in support of the whole validity argument.
I will compare and contrast priorities in efforts to validate automated speech scoring systems that support low-stakes diagnostic and practice uses and high-stakes decisions, focusing on the critical validity threats for each instance of use and the types and strengths of evidence required to discount or reduce such threats. A real example will be used to demonstrate the application of this approach in validating a speaking assessment that uses an automated scoring system.

NLP-based CALL
Maja Grgurovic and Nick Pendar
A number of CALL scholars (e.g., Heift & Schulze, forthcoming; Nagata, 1995; Amaral & Meurers, 2006) agree that there is a great need for empirical evaluation of the effectiveness of NLP-CALL systems for language learning. However, the published literature has seen few of these evaluations because authors often focus solely on system development and do not report on learners’ performance and system use. Moreover, some NLP-CALL projects are discontinued shortly after their inception, or their implementation is short-lived, resulting in a lack of lasting published research (Dodigovic, 2005). In order to address the effectiveness of NLP-CALL programs, this presentation reviews studies on three fully developed and mature parser-based CALL systems (Nihongo-CALI/Banzai, Nagata (1995); German Tutor/E-Tutor, Heift (2001); and GLOSSER, Nerbonne et al. (1998)), which have been empirically evaluated in at least two peer-reviewed CALL publications. The systems are surveyed based on the type of study (comparative or non-comparative) and their focus (system as a whole, a feature of the system, or learners). The research methods used in the studies are also discussed. Overall, the findings indicate that (a) NLP-CALL systems are more effective than traditional tools and non-NLP-based CALL programs, (b) more explicit feedback is more beneficial for learners, and (c) students generally take advantage of system features. These results are encouraging for the development of NLP-CALL applications; however, improvements are called for. In conclusion, the presentation provides specific suggestions on how systems and their evaluation can be improved in the future.
References
Amaral, L., and Meurers, D. (2006). Where does ICALL fit into foreign language teaching? Paper presented at CALICO Conference, Honolulu, Hawaii.
Dodigovic, M. (2005). Artificial Intelligence in Second Language Learning: Raising Error Awareness. New York: Multilingual Matters.
Heift, T. (2001). Error-specific and individualized feedback in a web-based language tutoring system: Do they read it? ReCALL, 13(1), 99-109.
Heift, T., and Schulze, M. (forthcoming). Errors and Intelligence in Computer-Assisted Language Learning: Parsers and Pedagogues. New York: Routledge.
Nagata, N. (1995). An effective application of natural language processing in second language instruction. CALICO Journal, 13(1), 47-67.
Nagata, N. (1996). Computer vs. workbook instruction in second language acquisition. CALICO Journal, 14(1), 53-75.
Nerbonne, J., Dokter, D., and Smith, P. (1998). Morphological processing and computer-assisted language learning. Computer Assisted Language Learning, 11(5), 543-559.

Study on the quantitative analyses of learner data and their implications for designing CALL tasks
Jinhee Choo and Doe Hyung Kim
Learner data can reveal important information about student performance and the effectiveness of a CALL program in terms of its tasks and feedback. In our presentation, we first demonstrate what sources of interaction from the CALL task interface can be collected to develop a more informative student model. Student models can be an important contribution of CALL to SLA because both the process (the interactions recorded on the computer) and the product (the results students achieve) have been suggested to be integral to understanding how students learn a language through CALL instruction. We then discuss quantitative analyses using regression and categorical data analysis models from two separate empirical studies conducted with Korean ESL learners in the context of a CALL program developed to help ESL learners increase their awareness of consistent errors in academic writing. Although there was no significant linear relationship between time spent on the program and improvement between the tests, a marginal correlation between these two variables was found, and other variables such as gender appeared to affect learners’ performance and improvement to varying degrees. Furthermore, a survival analysis conducted with data from a particular task resulted in a model that described how students reacted differentially to three different feedback types. Examples from this application and other ongoing projects will be shown to present ideas for a variety of informative process data collection methods within the task interface, in addition to some limitations and implications of process data use.
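
The abstract does not say which survival model was fitted. As one possible illustration, the Python sketch below computes a Kaplan-Meier estimate (a common nonparametric survival method, swapped in here as an assumption) separately for each feedback type. The record layout, the function names, and the reading of "duration" as attempts until an error is corrected are all hypothetical.

```python
from collections import defaultdict


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with an event.

    durations: time (or number of attempts) until the event or until the
        learner left the task.
    observed: True if the event happened, False if the record is censored.
    """
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = 1.0
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = 0
        leaving = 0
        # Collect every record that shares this time point.
        while i < len(records) and records[i][0] == t:
            events += 1 if records[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def curves_by_feedback(records):
    """records: iterable of (feedback_type, duration, observed) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for feedback_type, duration, was_observed in records:
        grouped[feedback_type][0].append(duration)
        grouped[feedback_type][1].append(was_observed)
    return {name: kaplan_meier(d, o) for name, (d, o) in grouped.items()}
```

Comparing the resulting curves shows whether learners who receive one feedback type tend to resolve an error in fewer attempts than those who receive another; a formal comparison would need a test such as the log-rank test, which is not shown here.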

An innovative record-keeping of learners’ language performance:
Melissa Baralt
It has recently been demonstrated that interaction within the CMC (computer-mediated communication) modality can provide many of the same benefits as face-to-face interaction (De la Fuente, 2003; Smith, 2004, 2005; Shekary & Tahririan, 2006; Sachs & Suh, in press). Resembling ‘oral chat’, CMC chat promotes negotiation for meaning, the provision of recasts, question formation, and opportunities for restructured output. One of the main premises behind the incorporation of tasks within CMC chat is that any development acquired might eventually be transferred to the oral mode. These chat dialogues can be saved and reviewed for later analysis. In this way, CMC chat can serve as a unique tool for both L2 conversational practice and assessment. To begin, I will identify several of the benefits highlighted by empirical studies that have employed CMC chat. I will then briefly describe my study, in which I analyzed beginning as well as advanced learners’ dialogues within mixed-proficiency dyads in iChat. After students carried out a task together, their conversations were saved and ‘stored’ as documents for later review. In reviewing their ‘conversations’, students were able to identify errors, recognize reasons for instances of nonunderstanding that took place with their partner, and spot as well as correct any problems in their interlanguage. By saving iChat conversations as a “record-keeping” of learners’ interactional abilities, assessment is placed in the hands of the learners. Furthermore, these records are ideal for students to see how they have progressed over time.

Modeling SLA processes using NLP
Mathias Schulze
Larsen-Freeman (1997), de Bot et al. (2005), Ellis & Larsen-Freeman (2006), and others have advocated and attempted the application of dynamic systems and chaos theoretical approaches to the description of second-language learning processes. These two theories have been used in other disciplines as a (philosophical) metaphor and/or as an analytical tool: multilingualism (Herdina & Jessner, 2002), developmental psychology including L1 acquisition (van Geert, 1994, 2000; van Geert et al., 2004), and general education (Haggis, 2005). In applying these theoretical approaches to an empirical analysis of existing longitudinal learner data, we borrow from these earlier studies and combine this approach with an NLP analysis of learner texts. In order to model student activity in L2 learning, we propose to analyze the complexity of texts produced by learners over time. We measure complexity at various levels of the text: word structure (categorial morphology (Hoeksma, 1985; Schulze, 2001)/multiple lexical inheritance (Briscoe et al., 1993)), phrase and sentence levels (HPSG phrase descriptors (Heift, 1998; Heift & Schulze, 2007; Pollard & Sag, 1994; Sag et al., 2003)). Learner discourse complexity will be approximated through a number of simple statistical measures (Foster & Skehan, 1996; Skehan & Foster, 2005; Tavakoli & Skehan, 2005). When the existing learner texts are parsed, a variety of features at each level are retrieved and converted to (binary) numbers. The numbers of features at any given point in time are aggregated into a complexity measure of the learner text. The complexity measures are then plotted as time series and phase spaces and analyzed using approaches from chaos and dynamic systems theory.
References
Briscoe, E. J., Copestake, Ann, & De Paiva, Valeria. (1993). Inheritance, defaults, and the lexicon. Cambridge, England; New York: Cambridge University Press.
de Bot, Kees, Verspoor, Marjolijn, & Lowie, Wander. (2005). Dynamic Systems Theory and Applied Linguistics: The Ultimate "so what"? International Journal of Applied Linguistics, 15(1), 116-118.
Ellis, Nick C., & Larsen-Freeman, Diane (Eds.). (2006). Language Emergence - Implications for Applied Linguistics. Special Issue of Applied Linguistics (27/4). Oxford: Oxford University Press.
Foster, Pauline, & Skehan, Peter. (1996). The Influence of Planning and Task Type on Second Language Performance. Studies in Second Language Acquisition, 18(3), 299-323.
Haggis, Tamsin. (2005). ‘Knowledge Must Be Contextual’: Some Possible Implications of Complexity and Dynamic Systems Theories for Educational Research. Paper presented at Complexity, Science and Society Conference, Liverpool.
Heift, Trude. (1998). Designed Intelligence: A Language Teacher Model. Unpublished PhD Thesis, Simon Fraser University, Burnaby.
Heift, Trude, & Schulze, Mathias. (2007). Errors and Intelligence in CALL. Parsers and Pedagogues. New York: Routledge.
Herdina, Philip, & Jessner, Ulrike. (2002). A Dynamic Model of Multilingualism: Perspectives of Change in Psycholinguistics. Clevedon; Buffalo; Toronto: Multilingual Matters.
Hoeksma, Jack. (1985). Categorial Morphology. New York: Garland.
Larsen-Freeman, Diane. (1997). Chaos/Complexity Science and Second Language Acquisition. Applied Linguistics, 18(2), 141-165.
Pollard, Carl J., & Sag, Ivan A. (1994). Head-Driven Phrase Structure Grammar. Chicago: University of Chicago Press.
Sag, Ivan A., Wasow, Thomas, & Bender, Emily M. (2003). Syntactic Theory: A Formal Introduction (2nd ed.). Stanford, Calif.: Center for the Study of Language and Information.
Schulze, Mathias. (2001). Textana - Grammar and Grammar Checking in Parser-Based CALL. Unpublished PhD Thesis, UMIST, Manchester.
Skehan, Peter, & Foster, Pauline. (2005). Strategic and on-line planning: The influence of surprise information and task time on second language performance. In Rod Ellis (Ed.), Planning and task performance in a second language (pp. 193-216). Amsterdam, Netherlands: John Benjamins Publishing Company.
Tavakoli, Parvaneh, & Skehan, Peter. (2005). Strategic planning, task structure, and performance testing. In Rod Ellis (Ed.), Planning and task performance in a second language (pp. 239-273). Amsterdam, Netherlands: John Benjamins Publishing Company.
van Geert, Paul. (1994). Dynamic systems of development: Change between complexity and chaos. Hertfordshire, England: Harvester Wheatsheaf.
van Geert, Paul. (2000). The dynamics of general developmental mechanisms: From Piaget and Vygotsky to dynamic systems models. Current Directions in Psychological Science, 9(2), 64-68.
van Geert, Paul, Verhoeven, Ludo, & van Balkom, Hans. (2004). A dynamic systems approach to diagnostic measurement of SLI. In Classification of developmental language disorders: Theoretical issues and clinical implications (pp. 327-348). Mahwah, NJ: Lawrence Erlbaum Associates.
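
The pipeline outlined in the "Modeling SLA processes using NLP" abstract (binary features per text, aggregated into a complexity measure, then viewed as a time series and a phase space) can be sketched in Python as follows. The abstract does not give the aggregation rule or the phase-space construction, so the simple feature count and the two-dimensional delay embedding used here are assumptions, as are the function names.

```python
def complexity_score(feature_flags):
    """Aggregate binary feature indicators (0 or 1) into a single count."""
    return sum(1 for flag in feature_flags if flag)


def complexity_series(texts_features):
    """One complexity score per learner text, in chronological order."""
    return [complexity_score(flags) for flags in texts_features]


def phase_space(series, lag=1):
    """Two-dimensional delay embedding: pairs (x_t, x_{t+lag})."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return list(zip(series[:-lag], series[lag:]))
```

Plotting the output of `complexity_series` against text order gives the time series, and plotting the pairs from `phase_space` shows whether a learner's complexity settles around a stable region or keeps shifting between states.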