Abstract

1. Introduction

While writing is an essential part of the educational process, many instructors find it difficult to incorporate large numbers of writing assignments in their courses because of the effort required to evaluate them. However, the ability to convey information verbally is an important educational achievement in its own right, and one that is not sufficiently well assessed by other kinds of tests. In addition, essay-based testing is thought to encourage a better conceptual understanding of the material and to reflect a deeper, more useful level of knowledge and application by students. Thus, grading and commenting on written texts is important not only as an assessment method, but also as a feedback device to help students better learn both content and the skills of thinking and writing. Nevertheless, essays have been neglected in many computer-based assessment applications, since few techniques exist to score essays directly by computer.

In this paper we describe a method for performing automated scoring of the conceptual content of essays. Based on a statistical approach to analyzing the essays and on content information from the domain, the technique provides scores that prove to be an accurate measure of essay quality.

The text analysis underlying the essay grading schemes is based on Latent Semantic Analysis (LSA). Detailed treatments of LSA, both as a theory of aspects of human knowledge acquisition and representation and as a method for extracting the semantic content of text, are beyond the scope of this article; these topics are fully presented elsewhere (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). LSA has been successfully applied to a number of simulations of cognitive and psycholinguistic phenomena. These simulations have shown that LSA captures a great deal of the similarity of the meaning of words expressed in discourse. For example, LSA can account for the effects of semantic priming (Landauer & Dumais, 1997), can be used to select the appropriate level of text for readers (Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch, & Landauer, 1998), and can model the effects of text coherence on readers' comprehension (Foltz, Kintsch, & Landauer, 1998).

Based on a statistical analysis of a large amount of text (typically thousands to millions of words), LSA derives a high-dimensional semantic space that permits comparisons of the semantic similarity of words and passages. Words and passages are represented as vectors in this space, and their similarity is measured by the cosine of the angle between them. The similarities measured by LSA have been shown to closely mimic human judgments of meaning similarity, and human performance based on such similarity, in a variety of ways. For example, after training on about 2,000 pages of English text, LSA scored as well as average test-takers on the synonym portion of the TOEFL (the ETS Test of English as a Foreign Language) (Landauer & Dumais, 1997). After training on an introductory psychology textbook, it achieved passing scores on two different multiple-choice exams used in introductory psychology courses (Landauer, Foltz, & Laham, in preparation). This similarity comparison made by LSA is the basis for automated essay scoring: essays are scored by comparing the similarity of meaning between them.
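To make the vector-space machinery concrete, the following minimal sketch builds a small LSA-style space with a truncated singular value decomposition and compares passages by the cosine of their vectors. It is an illustration under stated assumptions, not the IEA's implementation: the toy corpus, the simple log weighting (production LSA spaces typically use log-entropy weighting), and the choice of two retained dimensions are all assumptions made for demonstration.

```python
# Minimal LSA-style sketch: word-by-passage counts -> truncated SVD ->
# cosine similarity of passages in the reduced space.
import numpy as np

passages = [
    "the heart pumps blood through the body",
    "blood is pumped by the heart",
    "ancient civilizations built cities along rivers",
]

# Word-by-passage count matrix over a toy vocabulary.
vocab = sorted({w for p in passages for w in p.split()})
counts = np.array([[p.split().count(w) for p in passages] for w in vocab],
                  dtype=float)

# Simple log weighting (a stand-in for the usual log-entropy weighting).
weighted = np.log(counts + 1.0)

# Truncated SVD: keep k dimensions of the "semantic space".
U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def passage_vector(text):
    """Fold a new passage into the k-dimensional space."""
    freq = np.array([text.split().count(w) for w in vocab], dtype=float)
    return np.log(freq + 1.0) @ U_k / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

essay = passage_vector("the heart moves blood around the body")
print(cosine(essay, passage_vector(passages[0])))  # expected high: same topic
print(cosine(essay, passage_vector(passages[2])))  # expected near zero: different topic
```

In an actual application the space would be trained on a textbook or on thousands of essays, and the number of retained dimensions is typically in the hundreds rather than two.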
Figure 1. Summary of reliability results (N = 1,205 essays on 12 diverse topics).
While the holistic technique relies on comparing essays against a set of pre-graded essays, other techniques have been developed that also effectively characterize the quality of essays. A second technique is to compare essays to an ideal essay, or "gold standard" (Wolfe et al., 1998). In this case, a teacher can write his or her ideal essay, and all student essays are then judged on the basis of how closely they resemble the teacher's essay. In two further techniques, essays can be compared to portions of the original text, or to sub-components of texts or essays (Foltz, 1996; Foltz, Britt, & Perfetti, 1996). In this componential approach, individual sentences from a student's essay can be compared against a set of predetermined subtopics or sections of a textbook, permitting a determination of whether the essay sufficiently covers those subtopics. By determining whether a subtopic or section of a textbook is sufficiently covered in an essay, the computer can provide feedback about which subtopics have been sufficiently understood and/or which sections of the textbook a student might need to review. The LSA-derived scores based on this measure of subtopic coverage correlate with human graders as highly as the graders correlate with each other.

2.2. Anomalous Essay Checking

While it is important to verify the effectiveness of computer-based essay grading, it is also important that such a grader be able to determine when it cannot grade an essay reliably. Thus, a number of additional techniques have been developed to detect "anomalous" essays. If an essay is sufficiently different from the essays on which the system was trained, the computer can flag it for human evaluation. The IEA can flag essays that are highly creative, off topic, use unusual syntax, or violate standard formats or structures for essays. In addition, the computer can determine whether any essay is too similar to other essays or to the textbook; the program is thus able to detect different levels of plagiarism. If an essay is detected as anomalous for any reason, it can be automatically forwarded to the instructor for additional evaluation. This permits teachers to remain aware of students who may be having difficulties or may be cheating.
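The following sketch illustrates how a holistic score and simple anomaly flags could be derived from LSA vectors of the kind produced above. It is a hypothetical reconstruction, not the IEA's actual procedure: the number of neighbors, the similarity weighting, and the off-topic and plagiarism thresholds are invented for demonstration.

```python
# Illustrative holistic scoring and anomaly flagging from LSA vectors.
# Assumes essay vectors come from a semantic space such as the one above.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def holistic_score(essay_vec, graded_vecs, grades, n_neighbors=10):
    """Score an essay from the human grades of its most similar pre-graded essays."""
    sims = np.array([cosine(essay_vec, v) for v in graded_vecs])
    top = np.argsort(sims)[::-1][:n_neighbors]          # most similar pre-graded essays
    weights = np.clip(sims[top], 0.0, None) + 1e-12     # similarity-weighted average
    return float(np.average(np.asarray(grades)[top], weights=weights))

def anomaly_flags(essay_vec, graded_vecs, other_essay_vecs,
                  off_topic_below=0.2, plagiarism_above=0.95):
    """Flag essays that should be routed to the instructor instead of auto-graded."""
    best_match = max(cosine(essay_vec, v) for v in graded_vecs)
    flags = []
    if best_match < off_topic_below:      # unlike anything the grader was trained on
        flags.append("off-topic or highly unusual essay")
    if any(cosine(essay_vec, v) > plagiarism_above for v in other_essay_vecs):
        flags.append("too similar to another essay: possible plagiarism")
    return flags
```

The thresholds here simply mark the two failure modes described above: an essay too far from everything the grader knows, and an essay suspiciously close to another one.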
3. Experiences in the Classroom: Grading and Feedback

Over the past two years, the Intelligent Essay Assessor has been used in a course in Psycholinguistics at New Mexico State University. Designed as a web-based application, it permits students to submit essays on a particular topic from their web browsers. Within about 20 seconds, students receive feedback with an estimated grade for their essay and a set of questions and statements about additional subtopics that are missing from their essays. Students can revise their essays immediately and resubmit. A demonstration is available at http://psych.nmsu.edu/essay.

To create this system, LSA was trained on a portion (four chapters) of the psycholinguistics textbook used in the class. The holistic grading method was used to provide an overall grade for each essay. In this method, each essay was compared against 40 essays from previous years that had been graded by three different graders, and a score was assigned based on the grades of the previous essays to which the essay was most similar. This approach proved to provide accurate grades: the average correlation among the three human graders was 0.73, while the average correlation of LSA's holistic grade with the individual graders was 0.80.

To provide feedback about missing information in each essay, two teaching assistants first identified the seven primary subtopics that should be covered for that essay topic. They then provided example sentences corresponding to each of the subtopics, and these example sentences were represented in the semantic space. When a student submitted an essay, the individual sentences of the essay were compared to the subtopic sentences. If no sentence in the essay closely matched the sentences for a particular subtopic, the computer provided additional feedback in the form of a question or comment to the student about that subtopic. For example, in their essays, students needed to describe whether a particular psycholinguistic model was a serial or a parallel processing model. If the computer did not detect any mention of this, students received feedback asking, "Does the processing occur serially or in parallel?" A schematic version of this subtopic check is sketched at the end of this section.

Students were permitted to use the system independently to write their essays and were encouraged to revise and resubmit them to the computer until they were satisfied with their grades. After students received their score and feedback, they also received a text box in which they could edit and resubmit their essay for grading. In addition, the web page had links to the class notes so that students could get additional help with writing their essays. The average grade for the students' first essays was 85 (out of 100); by the last revision, the average grade was 92. Students' essays all improved from revision to revision, with improvements in scores ranging from 0 to 33 points over an average of 3 revisions.

An additional trial with a similar system took place in a Boulder, Colorado middle school (Kintsch, Steinhart, Matthews, & Lamb, in press). In this system, called Summary Street, students were asked to summarize texts on alternative sources of energy, ancient civilizations, and the functioning of the heart. Students received feedback about whether they had sufficiently covered subtopics from the texts they were to summarize. They could then revise and resubmit their essays.
The results indicated that LSA's scores of the summaries were comparable to those given by the teacher, and that LSA performed almost as well as the teachers at detecting which texts were the sources of the students' knowledge.

In both the undergraduate and middle school trials, students and teachers enjoyed and valued using the system. A survey of usability and preferences for the system in the psycholinguistics course showed that 98 percent of the students indicated that they would definitely or probably use such a system if it were available for their other classes. Overall, the results show that the IEA is successful at helping students improve the quality of their essays by providing immediate and accurate feedback.
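The per-sentence subtopic check described above can be sketched as follows. This is an illustrative reconstruction rather than the classroom system's code: the coverage threshold and the mapping from feedback questions to example-sentence vectors are assumptions, and the sentence vectors are presumed to come from an LSA space such as the one sketched earlier.

```python
# Sketch of subtopic-coverage feedback: each essay sentence is compared to
# the example sentences for each subtopic; uncovered subtopics yield a question.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def missing_subtopic_feedback(essay_sentence_vecs, subtopics, threshold=0.6):
    """Return feedback questions for subtopics that no essay sentence covers.

    `subtopics` maps a feedback question (str) to a list of LSA vectors of the
    example sentences written for that subtopic.
    """
    feedback = []
    for question, example_vecs in subtopics.items():
        covered = any(cosine(s, e) >= threshold
                      for s in essay_sentence_vecs for e in example_vecs)
        if not covered:
            # e.g. "Does the processing occur serially or in parallel?"
            feedback.append(question)
    return feedback
```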
5. Conclusions

To summarize, the IEA is first trained on some domain-representative text, such as a textbook, samples of writing, or even a large number (300 or more) of essays on a topic. Second, the IEA is provided with one or more pre-graded essays on the topic to be graded. Once these pieces of information are processed, the IEA provides accurate characterizations of the quality of essays on the same topic. Domain-representative texts are easily available in electronic form, and essays can now be easily converted to electronic form or entered directly by students. Thus, it is highly feasible to develop automated essay graders for a large number of topics across many domains. Although the IEA requires a large amount of computer processing power, it can be run from a centralized server, allowing students to access it from any web browser.

The IEA is currently available as a web-based grading service through Knowledge Analysis Technologies, LLC (KATech). Organizations wishing to have essays automatically scored or annotated with evaluative comments can work with KATech personnel to have essay graders developed for their particular needs. The IEA is not sold as stand-alone software because the hardware requirements are beyond the capabilities of standard desktop systems and some expertise is involved in developing the semantic spaces. Instead, the graders are set up on secure KATech web servers that permit students and teachers to submit essays and receive back scores and comments. This approach further permits instructional designers to incorporate essay grading into their existing web-based software simply by adding a link that forwards essays to the essay grader server. The servers can return scores and comments within a few seconds. Demonstrations of the IEA and information about having graders developed may be found at www.knowledge-technologies.com.

The Intelligent Essay Assessor presents a novel technique for assessing the quality of content knowledge expressed in essays. It permits the automatic scoring of short essays of the kind used in any content-based course. Based on evaluations of the IEA over a wide range of essay topics and levels, the IEA proves as reliable at evaluating essay quality as human graders. Its scores are also highly consistent and objective. While the IEA is not designed for assessing creative writing, it can detect whether an essay is sufficiently different that it should instead be graded by the instructor. The IEA can be applied both for distance education and for training in the classroom. In both cases it may be used for assessing students' content knowledge as well as for providing feedback to help students improve their learning. Overall, the IEA permits increasing the number of writing assignments without overly increasing the grading load on the teacher. Because writing is a critical component in helping students acquire both better content knowledge and better critical thinking skills, the IEA can serve as an effective tool for increasing students' exposure to writing.
6. Acknowledgements

The web site http://www.knowledge-technologies.com provides essay scoring demonstrations as well as information on the commercial availability of the IEA for educational software. The web site http://lsa.colorado.edu provides other demonstrations of the essay grading, additional applications of LSA, and links to many of the articles cited. This research was supported in part by a contract from the Defense Advanced Research Projects Agency Computer Aided Education and Training Initiative to Thomas Landauer and Walter Kintsch, a grant from the McDonnell Foundation's Cognitive Science in Educational Practice program to W. Kintsch, T. Landauer, and G. Fischer, and an NSF KDI Learning and Intelligent Systems grant to Peter Foltz and Adrienne Lee. The "Intelligent Essay Assessor" has a patent pending: Methods for Analysis and Evaluation of the Semantic Content of Writing, by P. W. Foltz, D. Laham, T. K. Landauer, W. Kintsch, and B. Rehder, held by the University of Colorado.
7. References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.

Foltz, P. W. (1996). Latent Semantic Analysis for text-based research. Behavior Research Methods, Instruments and Computers, 28(2), 197-202.

Foltz, P. W., Britt, M. A., & Perfetti, C. A. (1996). Reasoning from multiple texts: An automatic analysis of readers' situation models. In G. W. Cottrell (Ed.), Proceedings of the 18th Annual Cognitive Science Conference (pp. 110-115). Hillsdale, NJ: Lawrence Erlbaum Associates.

Kintsch, E., Steinhart, D., Matthews, C., & Lamb, R. (in press). Developing summarization skills through the use of LSA-based feedback. Interactive Learning Environments.

Laham, D. (1997). Automated holistic scoring of the quality of content in directed student essays through Latent Semantic Analysis. Unpublished master's thesis, University of Colorado, Boulder, Colorado.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Landauer, T. K., Foltz, P. W., & Laham, D. (in preparation). Latent Semantic Analysis passes the test: Knowledge representation and multiple-choice testing. Manuscript in preparation.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25(2&3), 259-284.

Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In M. G. Shafto & P. Langley (Eds.), Proceedings of the 19th Annual Meeting of the Cognitive Science Society (pp. 412-417). Mahwah, NJ: Erlbaum.

Page, E. B. (1994). Computer grading of student prose using modern concepts and software. Journal of Experimental Education, 62, 127-142.

Wolfe, M. B. W., Schreiner, M. E., Rehder, B., Laham, D., Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). Learning from text: Matching readers and texts by Latent Semantic Analysis. Discourse Processes, 25(2&3), 309-336.