IMEJ Article - The Intelligent Essay Assessor: Applications to Educational Technology

The Intelligent Essay Assessor: Applications to Educational Technology
Peter W. Foltz, New Mexico State University
Darrell Laham, Knowledge Analysis Technologies
Thomas K. Landauer, University of Colorado

Abstract
The Intelligent Essay Assessor (IEA) is a set of software tools for scoring the quality of essay content. The IEA uses Latent Semantic Analysis (LSA), which is both a computational model of human knowledge representation and a method for extracting semantic similarity of words and passages from text. Simulations of psycholinguistic phenomena show that LSA reflects similarities of human meaning effectively. To assess essay quality, LSA is first trained on domain-representative text. Then student essays are characterized by LSA representations of the meaning of the words used, and they are compared with essays of known quality in regard to their degree of conceptual relevance and the amount of relevant content. Over many diverse topics, the IEA scores agreed with human experts as accurately as expert scores agreed with each other. Implications are discussed for incorporating automatic essay scoring in more general forms of educational technology.

1. Introduction

While writing is an essential part of the educational process, many instructors find it difficult to incorporate large numbers of writing assignments in their courses due to the effort required to evaluate them. However, the ability to convey information verbally is an important educational achievement in its own right, and one that is not sufficiently well assessed by other kinds of tests. In addition, essay-based testing is thought to encourage a better conceptual understanding of the material and to reflect a deeper, more useful level of knowledge and application by students. Thus grading and commenting on written texts is important not only as an assessment method, but also as a feedback device to help students better learn both content and the skills of thinking and writing. Nevertheless, essays have been neglected in many computer-based assessment applications since there exist few techniques to score essays directly by computer. In this paper we describe a method for performing automated essay scoring of the conceptual content of essays. Based on a statistical approach to analyzing the essays and content information from the domain, the technique can provide scores that prove to be an accurate measure of the quality of essays.

The text analysis underlying the essay grading schemes is based on Latent Semantic Analysis (LSA). Detailed treatments of LSA, both as a theory of aspects of human knowledge acquisition and representation, and as a method for the extraction of semantic content of text are beyond the scope of this article. These topics are fully presented elsewhere (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Landauer & Dumais, 1997; Landauer, Foltz, & Laham, 1998). LSA has been successfully applied towards a number of simulations of cognitive and psycholinguistic phenomena. These simulations have shown that LSA captures a great deal of the similarity of the meaning of words expressed in discourse. For example, LSA can account for the effects of semantic priming (Landauer & Dumais, 1997), can be used to select the appropriate level of text for readers (Wolfe, Schreiner, Rehder, Laham, Foltz, Kintsch, & Landauer, 1998), and can model the effects of the coherence of texts on readers' comprehension (Foltz, Kintsch & Landauer, 1998).

Based on a statistical analysis of a large amount of text (typically thousands to millions of words), LSA derives a high-dimensional semantic space that permits comparisons of the semantic similarity of words and passages. Words and passages are represented as vectors in the space, and their similarity is measured by the cosine of the angle between them. The LSA-measured similarities have been shown to closely mimic human judgments of meaning similarity and human performance based on such similarity in a variety of ways. For example, after training on about 2,000 pages of English text, LSA scored as well as average test-takers on the synonym portion of the TOEFL, the ETS Test of English as a Foreign Language (Landauer & Dumais, 1997). After training on an introductory psychology textbook, it achieved passing scores on two different multiple-choice exams used in introductory psychology courses (Landauer, Foltz, & Laham, in preparation). This similarity comparison made by LSA is the basis for performing automated scoring of essays by comparing the similarity of meaning between essays.
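
As a rough illustration of how such a space can be derived, the sketch below follows the general recipe described by Deerwester et al. (1990): build a term-by-document count matrix, reduce it with a singular value decomposition, and compare words by cosine. It is a minimal sketch, not the IEA's implementation. The four-document corpus, the names DOCS, TERMS, count_rows, jacobi_eigen and lsa_word_vectors, and the choice of two dimensions are all assumptions made for illustration; the term weighting normally applied before the decomposition is omitted, and a small Jacobi eigen-solver stands in for a library SVD routine.

```python
import math

# Hypothetical toy corpus: each "document" is a bag of words.
DOCS = [
    ["heart", "blood"],
    ["blood", "artery"],
    ["verb", "noun"],
    ["noun", "syntax"],
]
TERMS = sorted({word for doc in DOCS for word in doc})


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def count_rows(docs, terms):
    """Raw term-by-document counts: one row per term, one column per document."""
    return {t: [float(doc.count(t)) for doc in docs] for t in terms}


def jacobi_eigen(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues and eigenvectors (as columns) of a symmetric matrix,
    computed with cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    vecs = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for m in (a, vecs):  # rotate columns p and q
                    for i in range(n):
                        mp, mq = m[i][p], m[i][q]
                        m[i][p], m[i][q] = c * mp - s * mq, s * mp + c * mq
                for j in range(n):  # rotate rows p and q of a
                    ap, aq = a[p][j], a[q][j]
                    a[p][j], a[q][j] = c * ap - s * aq, s * ap + c * aq
    return [a[i][i] for i in range(n)], vecs


def lsa_word_vectors(docs, terms, k):
    """Word vectors in a k-dimensional space (left singular vectors scaled
    by their singular values)."""
    rows = count_rows(docs, terms)
    counts = [rows[t] for t in terms]
    # Term-by-term matrix A * A^T; its eigenvectors are the left singular
    # vectors of A and its eigenvalues are the squared singular values.
    gram = [[sum(x * y for x, y in zip(r1, r2)) for r2 in counts] for r1 in counts]
    values, vecs = jacobi_eigen(gram)
    top = sorted(range(len(terms)), key=lambda i: values[i], reverse=True)[:k]
    return {
        t: [vecs[row][i] * math.sqrt(max(values[i], 0.0)) for i in top]
        for row, t in enumerate(terms)
    }


raw = count_rows(DOCS, TERMS)
space = lsa_word_vectors(DOCS, TERMS, k=2)
print(cosine(raw["heart"], raw["artery"]))      # raw counts: the words never share a document
print(cosine(space["heart"], space["artery"]))  # reduced space: both co-occur with "blood"
print(cosine(space["heart"], space["syntax"]))  # reduced space: unrelated topics
```

In this toy corpus "heart" and "artery" never appear in the same document, so their raw count vectors are orthogonal, yet in the reduced space they point in nearly the same direction because both co-occur with "blood".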

2. Automated Scoring with LSA

While much of the emphasis on evaluating written work has examined mechanical features, such as grammar, spelling and punctuation, there are other factors involved in writing a good essay. For example, at an abstract level, one can distinguish three properties of a student essay that are desirable to assess: the correctness and completeness of its conceptual knowledge, the soundness of the arguments that it presents in discussion of issues, and the fluency, elegance, and comprehensibility of its writing. Evaluation of superficial mechanical and syntactical features is fairly easy to separate from the other factors, but the rest (content, argument, comprehensibility, and aesthetic style) are likely to be difficult to distinguish because each influences the others, if only because each depends on the choice of words.

Previous attempts to develop computational techniques for scoring essays have focused primarily on measures of style (e.g., Page, 1994), while indices of content have remained secondary, indirect and superficial. In contrast to these earlier approaches, LSA methods concentrate on the conceptual content, the knowledge conveyed in an essay, rather than its style, or even its syntax or argument structure.

To assess the quality of essays, LSA is first trained on domain-representative text. Examples of domain-representative texts include textbooks, articles or samples of writing that a student would encounter during learning in that domain. Based on this training, LSA derives a high-dimensional semantic representation of the information contained in the domain. Words from the text are represented as vectors in this semantic space, with the semantic similarity between words characterized by the cosine of the angle between the vectors for those words. Similarly, student essays can be characterized by LSA vectors based on the combination of all their words. These vectors can then be compared with vectors for essays or for texts of known content quality. The angle between the two vectors represents the degree to which the two essays discuss information in a similar manner. For example, an ungraded essay can be compared to essays that have already been graded. If the angle between two essays is small then those essays should be similar in content. Thus, the semantic or conceptual content of two essays can be compared and a score derived based on their similarity. Note that two essays can be considered to have almost identical content, even if they contain few or none of the same words, as long as they convey the same meaning. This feature of the comparison illustrates that LSA is doing more than just keyword matching; it is matching based on the conceptual content in the essays.
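
The comparison described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the two-dimensional word vectors in WORD_VECTORS are invented for illustration (a trained LSA space would supply them, with many more dimensions), and forming an essay vector by summing its word vectors is one plausible reading of "the combination of all their words", not a confirmed detail of the IEA.

```python
import math

# Hypothetical 2-dimensional word vectors; a trained LSA space would supply
# these, with far more dimensions.
WORD_VECTORS = {
    "heart": (0.90, 0.10), "pumps": (0.80, 0.20), "blood": (0.85, 0.15),
    "cardiac": (0.88, 0.12), "muscle": (0.70, 0.30),
    "circulates": (0.80, 0.10), "plasma": (0.75, 0.20),
    "verbs": (0.10, 0.90), "objects": (0.20, 0.80),
}


def essay_vector(text):
    """Sum the vectors of all known words in the essay; unknown words are skipped."""
    total = [0.0, 0.0]
    for word in text.lower().split():
        vec = WORD_VECTORS.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


essay_a = "the heart pumps blood"
essay_b = "cardiac muscle circulates plasma"
essay_c = "verbs take objects"

sim_ab = cosine(essay_vector(essay_a), essay_vector(essay_b))
sim_ac = cosine(essay_vector(essay_a), essay_vector(essay_c))
shared = set(essay_a.split()) & set(essay_b.split())

print("words shared by A and B:", shared)
print("similarity A-B:", round(sim_ab, 2))
print("similarity A-C:", round(sim_ac, 2))
```

Essays A and B share no words, yet their vectors are separated by a small angle, while essay C, on a different topic, is far from both.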

2.1. Evaluating the Effectiveness of Automated Scoring

Based on comparing conceptual content, several techniques have been developed for assessing essays. Details of these techniques have been published elsewhere and summaries of particular results will be provided below. One technique is to compare essays to ones that have been previously graded. A score for each essay is determined by comparing the essay against all previously graded essays. The scores from some number of those pre-graded essays (typically 10) that are most similar to the essay are weighted by their cosine similarity to the essay and used to determine the score for the essay. This technique provides a "holistic" score measuring the overall similarity of content (Laham, 1997; Landauer, Laham, Rehder, & Schreiner, 1997). The score is holistic because the grade for an essay is determined based on how well the overall meaning matches that of previously graded essays. In this way, it is similar to the holistic scoring approach used by human graders who judge each essay as a whole, rather than providing scores on the essay's individual components.
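
A minimal sketch of this holistic scoring step follows. The text states only that the grades of the most similar pre-graded essays are weighted by cosine similarity; the cosine-weighted mean used here, the handling of non-positive similarities, and the name holistic_score are assumptions made for illustration. The essay vectors are taken as already computed in the semantic space.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def holistic_score(essay_vec, graded, k=10):
    """Cosine-weighted mean grade of the k pre-graded essays most similar to the essay.

    `graded` is a list of (vector, grade) pairs. Essays with non-positive
    similarity receive no weight (an assumption of this sketch).
    """
    nearest = sorted(
        ((cosine(essay_vec, vec), grade) for vec, grade in graded),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = [max(sim, 0.0) for sim, _ in nearest]
    total = sum(weights)
    if total == 0:
        return None  # nothing similar enough to base a score on
    return sum(w * grade for w, (_, grade) in zip(weights, nearest)) / total


# Hypothetical pre-graded essays as (vector, grade) pairs.
graded_essays = [((1.0, 0.0), 90), ((0.9, 0.1), 80), ((0.0, 1.0), 40)]
score = holistic_score((1.0, 0.0), graded_essays, k=2)
print(round(score, 1))  # lands between the grades of the two nearest essays
```

Returning None when no pre-graded essay is similar leaves the decision to a human grader rather than inventing a score.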

The holistic method has been tested on a large number of essays over a diverse set of topics. The essays have ranged in grade level from middle school and high school through college to graduate level. The topics have included essays from classes in introductory psychology, biology, and history, as well as essays from standardized tests, such as analyses of arguments and analyses of issues from the Educational Testing Service (ETS) Graduate Management Admission Test (GMAT). For each of these sets of essays, LSA is first trained on a set of texts related to the domain. Then the content of each of the new essays is compared against the content of a set of pre-graded essays on the topic. In each case, the essays were also graded by at least two course instructors or expert graders, for example professional readers from the Educational Testing Service or other national testing organizations. Across the datasets, LSA's reliabilities fell within the range of the comparable inter-rater reliabilities and within the generally accepted guidelines for minimum reliability coefficients. For example, in a set of 188 essays written on the functioning of the human heart, the average correlation between two graders was 0.83, while the correlation of LSA's scores with the graders was 0.80. A summary of the performance of LSA's scoring compared to the grader-to-grader performance across a diverse set of 1205 essays on 12 topics is shown in Figure 1. The results indicate that LSA's reliability in scoring is equivalent to that of human graders.

In a more recent study, the holistic method was used to grade two additional questions from the GMAT standardized test. The performance was compared against two trained ETS graders. For one question, a set of 695 opinion essays, the correlation between the two graders was 0.86, while LSA's correlation with the ETS grades was also 0.86. For the second question, a set of 668 analyses of argument essays, the correlation between the two graders was 0.87, while LSA's correlation to the ETS grades was 0.86. Thus, LSA was able to perform at the same reliability levels as the trained ETS graders.
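
The reliability comparisons reported here rest on correlating two sets of scores for the same essays. The sketch below shows that computation, assuming the correlations are Pearson coefficients (the text does not name the statistic). The score lists are invented for illustration and are not data from the studies.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when all scores are identical")
    return cov / (spread_x * spread_y)


# Hypothetical scores for six essays (not data from the studies).
grader_1 = [3, 4, 2, 5, 4, 1]
grader_2 = [3, 5, 2, 4, 4, 2]
system = [3, 4, 3, 5, 4, 1]

human_human = pearson(grader_1, grader_2)
system_human = (pearson(system, grader_1) + pearson(system, grader_2)) / 2

print("grader-to-grader:", round(human_human, 2))
print("system-to-grader (mean):", round(system_human, 2))
```

Comparing the system-to-grader figure with the grader-to-grader figure is what allows the claim that automated scores are as reliable as a second human reader.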

Figure 1. Summary of reliability results (N = 1205 Essays on 12 Diverse Topics)

While the holistic technique relies on comparing essays against a set of pre-graded essays, other techniques have been developed that also effectively characterize the quality of essays. A second technique is to compare essays to an ideal essay, or "gold standard" (Wolfe, et al., 1998). In this case, a teacher can write his or her ideal essay and all student essays are then judged on the basis of how closely they resemble the teacher's essay. In two further techniques, essays can be compared to portions of the original text, or compared to sub-components of texts or essays (Foltz, 1996; Foltz, Britt & Perfetti, 1996). In this componential approach, individual sentences from a student's essay can be compared against a set of predetermined subtopics or sections of a textbook. This permits the determination of whether an essay sufficiently covers those subtopics. By determining whether a subtopic or section of a textbook is sufficiently covered in an essay, the computer can then provide feedback about which subtopics have been sufficiently understood and/or which sections of the textbook a student might need to review. The LSA-derived scores based on this measure of the degree of coverage of subtopics correlate with human graders as well as the graders correlate with each other.
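
The "gold standard" comparison can be sketched as a ranking of student essays by their similarity to the teacher's ideal essay. This is a minimal sketch: the three-dimensional vectors, the essay names, and the function rank_against_gold_standard are hypothetical, and how a similarity is converted into a grade is left open because the text does not specify it.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_against_gold_standard(ideal_vec, student_vecs):
    """Return (essay name, similarity) pairs, most similar to the ideal essay first."""
    sims = {name: cosine(ideal_vec, vec) for name, vec in student_vecs.items()}
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)


# Hypothetical vectors for the teacher's ideal essay and three student essays.
ideal = (0.9, 0.3, 0.1)
students = {
    "essay_1": (0.8, 0.4, 0.1),
    "essay_2": (0.1, 0.2, 0.9),
    "essay_3": (0.5, 0.5, 0.5),
}

ranking = rank_against_gold_standard(ideal, students)
for name, similarity in ranking:
    print(name, round(similarity, 2))
```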

2.2. Anomalous Essay Checking

While it is important to verify the effectiveness of computer-based essay grading, it is also important that such a grader be able to determine when it cannot grade an essay reliably. Thus, a number of additional techniques have been developed that can detect "anomalous" essays. If an essay is sufficiently different from the essays on which the system has been trained, the computer can flag it for human evaluation. The IEA can flag essays that are highly creative, off topic, use unusual syntax, or violate standard formats or structures for essays. In addition, the computer can determine whether any essay is too similar to other essays or to the textbook. The program is thus able to detect different levels of plagiarism. If an essay is detected as being anomalous for any reason, the essay can be automatically forwarded to the instructor for additional evaluation. This permits teachers to remain aware of students who may be having difficulties or are cheating.
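
Two of these checks, essays unlike anything the system has seen and essays suspiciously close to existing text, can be sketched as thresholds on the highest cosine similarity to a set of reference vectors. The thresholds 0.2 and 0.98, the labels, and the function check_essay are hypothetical; the text does not describe how the IEA sets its limits, and the checks for unusual syntax or format are not covered here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def check_essay(essay_vec, reference_vecs, low=0.2, high=0.98):
    """Classify an essay by its best match among the reference vectors.

    Returns "off_topic" if nothing is similar, "too_similar" if the best
    match is nearly identical, and "ok" otherwise. Both thresholds are
    illustrative assumptions.
    """
    best = max((cosine(essay_vec, ref) for ref in reference_vecs), default=0.0)
    if best < low:
        return "off_topic"
    if best > high:
        return "too_similar"
    return "ok"


# Hypothetical vectors for previously seen essays or textbook sections.
references = [(1.0, 0.0, 0.0), (0.8, 0.6, 0.0)]

print(check_essay((0.0, 0.0, 1.0), references))  # unlike every reference
print(check_essay((0.8, 0.6, 0.0), references))  # same direction as a reference
print(check_essay((0.9, 0.3, 0.3), references))  # related but not identical
```

Anything other than "ok" would be forwarded to the instructor rather than graded automatically.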

3. Experiences in the Classroom: Grading and Feedback

Over the past two years, the Intelligent Essay Assessor has been used in a course in Psycholinguistics at New Mexico State University. Designed as a web-based application, it permits students to submit essays on a particular topic from their web browsers. Within about 20 seconds, students receive feedback with an estimated grade for their essay and a set of questions and statements of additional subtopics that are missing from their essays. Students can revise their essays immediately and resubmit. A demonstration is available at http://psych.nmsu.edu/essay.

To create this system, LSA was trained on a portion (four chapters) of the psycholinguistics textbook used in the class. The holistic grading method was used to provide an overall grade for any essay. In this method, each essay was compared against 40 essays from previous years that had been graded by three different graders. A score was assigned based on the grades of the previous essays to which an essay was most similar. To verify that this approach provides accurate grades, LSA’s holistic grades were compared with those of the human graders: the average correlation among the three human graders was 0.73, while the average correlation of LSA’s holistic grade to the individual graders was 0.80.

To provide feedback about missing information in each essay, two teaching assistants first identified the seven primary subtopics that should be covered for that essay topic. They then provided example sentences that corresponded to each of the subtopics. These example sentences were then represented in the semantic space. When a student submitted an essay, the individual sentences of the essay were compared to the subtopic sentences. If no sentence in the essay closely matched the sentences for a particular subtopic, the computer provided additional feedback in the form of a question or comment to the student about that subtopic. For example, in their essays, students needed to describe whether a particular psycholinguistic model was a serial or parallel processing model. If the computer did not detect any mention of this, the student would receive feedback asking, "Does the processing occur serially or in parallel?"
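
The subtopic feedback loop can be sketched as follows. This is a minimal sketch under stated assumptions: the sentence and example vectors are invented, the second subtopic and its question are hypothetical, and the 0.5 threshold for a "close match" is an arbitrary placeholder, since the text does not report the value used in the course.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Each subtopic has example-sentence vectors and a feedback question.
# The first question comes from the text; the second is hypothetical.
SUBTOPICS = [
    {
        "question": "Does the processing occur serially or in parallel?",
        "examples": [(1.0, 0.0, 0.0)],
    },
    {
        "question": "What evidence supports the model?",
        "examples": [(0.0, 1.0, 0.0)],
    },
]


def missing_subtopic_feedback(sentence_vecs, subtopics, threshold=0.5):
    """Return the questions for subtopics that no essay sentence closely matches."""
    feedback = []
    for topic in subtopics:
        best = max(
            (cosine(s, ex) for s in sentence_vecs for ex in topic["examples"]),
            default=0.0,
        )
        if best < threshold:
            feedback.append(topic["question"])
    return feedback


# Hypothetical vectors for the sentences of one submitted essay.
essay_sentences = [(0.1, 0.9, 0.2)]
for question in missing_subtopic_feedback(essay_sentences, SUBTOPICS):
    print(question)
```

Here the single essay sentence matches the second subtopic but not the first, so only the serial-versus-parallel question is returned.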

Students were permitted to use the system independently to write their essays and were encouraged to revise and resubmit them to the computer until they were satisfied with their grades. After students received their score and feedback, they also received a text box in which they could edit and resubmit their essay for grading. In addition, the web page had links to the class notes so that students could get additional help with writing their essays. The average grade for the students’ first essays was 85 (out of 100). By the last revision, the average grade was 92. Students’ essays all improved from revision to revision, with the improvements in scores ranging from 0 to 33 points over an average of 3 revisions.

An additional trial with a similar system took place in a Boulder, Colorado middle school (Kintsch, Steinhart, Matthews & Lamb, in press). In this system, called Summary Street, students were asked to summarize texts on alternative sources of energy, ancient civilizations, and the functioning of the heart. Students received feedback about whether they had sufficiently covered subtopics from the texts they were to summarize. They could then revise and resubmit their essays. The results indicated that LSA's scores of the summaries were comparable to those given by the teacher and that LSA performed almost as well as the teachers at detecting which texts were the sources of the students' knowledge.

In both the undergraduate and middle school trials, students and teachers enjoyed and valued using the system. A survey of usability and preferences for the system in the psycholinguistics course showed that 98 percent of the students indicated that they would definitely or probably use such a system if it were available for their other classes. Overall, the results show that the IEA is successful at helping students improve the quality of their essays through providing immediate and accurate feedback.

4. Implications of Computer-Based Essay Grading for Education

IEA can be applied to a variety of applications within education. At a minimal level, it can be used as a consistency checker, in which the teacher grades the essays and then the IEA re-grades them and indicates discrepancies between the two grades. Because the IEA is not influenced by fatigue, deadlines, or biases, it can provide a consistent and objective view of the quality of the essays. The IEA can further be used in large scale standardized testing or large classes, by either providing consistency checks or serving as an automatic grader.
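
The consistency-checking use can be sketched as a comparison of two sets of grades for the same essays. The essay identifiers, the grades, the 10-point tolerance, and the function grade_discrepancies are hypothetical and serve only to illustrate the idea.

```python
def grade_discrepancies(teacher_grades, system_grades, tolerance=10):
    """Essays whose teacher and system grades differ by more than `tolerance` points.

    Both arguments map an essay identifier to a grade. Essays without a
    system grade are skipped.
    """
    flagged = {}
    for essay_id, teacher in teacher_grades.items():
        system = system_grades.get(essay_id)
        if system is not None and abs(teacher - system) > tolerance:
            flagged[essay_id] = (teacher, system)
    return flagged


# Hypothetical grades on a 100-point scale.
teacher = {"essay_1": 85, "essay_2": 70, "essay_3": 92}
system = {"essay_1": 88, "essay_2": 90, "essay_3": 91}

for essay_id, (t, s) in grade_discrepancies(teacher, system).items():
    print(essay_id, "teacher:", t, "system:", s)
```

Flagged essays would simply be returned to the teacher for a second look; the sketch makes no judgment about which grade is correct.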

At a more interactive level, the IEA can help students improve their writing through assessing and commenting on their essays. The IEA can provide comments regarding missing information, questions that a student should address, or suggestions about where to go back in a text to find the appropriate information. Because the IEA provides instantaneous feedback about the quality of their essays, as well as directed comments, students can use it as a tool to practice writing content-based essays. Due to a lack of sufficient teachers and aides in many large section courses, writing assignments and essay exams have often been replaced by multiple choice questions. The IEA permits students to receive writing practice without requiring all essays to be evaluated by instructors. Because the IEA's evaluations are immediate, students can receive feedback and make multiple revisions over the course of one session. This overall approach is consistent with the goals of "Writing Across the Curriculum" because it makes it easier for faculty outside of English or the humanities to introduce more writing assignments in their courses. Thus, the IEA can serve to improve learning through writing.

Finally, the IEA can be integrated with other software tools for education. In most software for distance education, tools for administering writing assignments are often neglected in favor of tools for creating and grading multiple choice exams. The IEA permits writing to be a more central focus. For example, in web-based training systems, the IEA can be incorporated as an additional module, which would permit teachers to add writing assignments for their web courses. Essays can be evaluated on a secure server and scores and comments can be returned directly either to the students or to the teachers. Textbook supplements can similarly use the IEA for automated study guides. At the end of chapters, students can be asked to answer essay questions addressing topics covered in the chapter. Based on an analysis of their essays, the software can suggest sections of the textbook that the students need to review before they take their exams.

5. Conclusions

To summarize, the IEA is first trained on some domain-representative text such as a textbook, samples of writings, or even a large number (300 or more) of essays on a topic. Second, the IEA is provided with one or more pre-graded essays on the topic to be graded. Once these pieces of information are processed, the IEA provides accurate characterizations of the quality of essays on the same topic. Domain-representative texts are easily available in electronic form, and essays can now be easily converted to electronic form or entered directly in that form by students. Thus, it is highly feasible to develop automated essay graders for a large number of topics across many domains. Although the IEA requires a large amount of computer processing power, it can be run from a centralized server, allowing students to access it from any web browser.

The IEA is currently available as a web-based grading service through Knowledge Analysis Technologies, LLC (KATech). Organizations wishing to have automated scoring of or evaluative comments added to essays can work with KATech personnel to have essay graders developed for their particular needs. The IEA is not sold as stand-alone software due to the fact that the hardware requirements are beyond the capabilities of standard desktop systems and some expertise is involved in the development of the semantic spaces. Instead, the graders are set up on secure KATech web servers which permit students and teachers to submit essays and receive back scores and comments. This approach further permits instructional designers to incorporate essay grading into their existing web-based software by just adding a link which forwards essays to the essay grader server. The servers can return scores and comments within just a few seconds. Demonstrations of the IEA and information about having graders developed may be found at: www.knowledge-technologies.com.

The Intelligent Essay Assessor presents a novel technique for assessing the quality of content knowledge expressed in essays. It permits the automatic scoring of short essays that would be used in any kind of content-based courses. Based on evaluations of the IEA over a wide range of essay topics and levels, the IEA proves as reliable at evaluating essay quality as human graders. Its scores are also highly consistent and objective. While the IEA is not designed for assessing creative writing, it can detect whether an essay is sufficiently different and, therefore, should be graded by the instructor. The IEA can be applied for both distance education and for training in the classroom. In both cases it may be used for assessing students’ content knowledge as well as providing feedback to students to help improve their learning. Overall, the IEA permits increasing the number of writing assignments without overly increasing the grading load on the teacher. Because writing is a critical component in helping students acquire both better content knowledge and better critical thinking skills, the IEA can serve as an effective tool for increasing students' exposure to writing.

6. Acknowledgements

The web site http://www.knowledge-technologies.com provides essay scoring demonstrations as well as information on the commercial availability of the IEA for educational software. The web site http://lsa.colorado.edu provides other demonstrations of the essay grading, additional applications of LSA and links to many of the articles cited. This research was supported in part by a contract from the Defense Advanced Research Projects Agency-Computer Aided Education and Training Initiative to Thomas Landauer and Walter Kintsch, a grant from the McDonnell Foundation’s Cognitive Science in Educational Practice program to W. Kintsch, T. Landauer, & G. Fischer, and an NSF KDI Learning and Intelligent Systems grant to Peter Foltz and Adrienne Lee.

The "Intelligent Essay Assessor" has a patent pending: Methods for Analysis and Evaluation of the Semantic Content of Writing by P. W. Foltz, D. Laham, T. K. Landauer, W. Kintsch & B. Rehder, held by the University of Colorado.

7. References

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.

Foltz, P. W. (1996). Latent Semantic Analysis for text-based research. Behavior Research Methods, Instruments and Computers, 28(2), 197-202.

Foltz, P. W., Britt, M. A., & Perfetti, C. A. (1996). Reasoning from multiple texts: An automatic analysis of readers' situation models. In G. W. Cottrell (Ed.), Proceedings of the 18th Annual Cognitive Science Conference (pp. 110-115). Hillsdale, NJ: Lawrence Erlbaum Associates.

Kintsch, E., Steinhart, D., Matthews, C., & Lamb, R. (in press). Developing summarization skills through the use of LSA-based feedback. Interactive Learning Environments.

Laham, D. (1997). Automated holistic scoring of the quality of content in directed student essays through Latent Semantic Analysis. Unpublished master’s thesis, University of Colorado, Boulder, Colorado.

Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.

Landauer, T. K., Foltz, P. W., & Laham, D. (in preparation). Latent Semantic Analysis passes the test: Knowledge representation and multiple-choice testing. Manuscript in preparation.

Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25(2&3), 259-284.

Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In M. G. Shafto & P. Langley (Eds.), Proceedings of the 19th Annual Meeting of the Cognitive Science Society (pp. 412-417). Mahwah, NJ: Erlbaum.

Page, E. B. (1994). Computer grading of student prose using modern concepts and software. Journal of Experimental Education, 62, 127-142.

Wolfe, M. B., Schreiner, M. E., Rehder, B., Laham, D., Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). Learning from text: Matching readers and texts by Latent Semantic Analysis. Discourse Processes, 25(2&3), 309-336.


IMEJ multimedia team member assigned to this paper
Daniel Pfeifer