On the Limitations of Human-Computer Agreement in Automated Essay Scoring

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Scoring essays is generally an exhausting and time-consuming task for teachers. Automated Essay Scoring (AES) makes the scoring process faster and more consistent. The most natural way to assess the performance of an automated scorer is to measure its score agreement with human raters. However, we provide empirical evidence that a scorer that performs well on this quantitative evaluation can still be too risky to deploy. We propose several input scenarios to evaluate the reliability and validity of the system, such as off-topic essays, gibberish, and paraphrased answers. We demonstrate that automated scoring models with high human-computer agreement fail to perform well on two of the three test scenarios. We also discuss strategies for improving the performance of the system.
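A minimal sketch of the kind of probing the abstract describes, under the assumption that a trained AES model exposes some prediction function. The names `score_essay`, `make_gibberish`, and `probe_scorer` are illustrative placeholders, not the authors' implementation; the gibberish probe here simply shuffles the words of a valid answer so surface statistics are preserved while meaning is destroyed.

```python
# Hypothetical sketch of the gibberish and off-topic probe scenarios.
# `score_essay` stands in for any trained AES model's prediction function;
# it is an assumption, not the method from the paper.
import random


def make_gibberish(reference_essay: str, seed: int = 0) -> str:
    """Shuffle the words of a real essay so it keeps surface statistics
    (length, vocabulary) but loses all meaning."""
    words = reference_essay.split()
    rng = random.Random(seed)
    rng.shuffle(words)
    return " ".join(words)


def probe_scorer(score_essay, real_essays, off_topic_essays, max_score: float):
    """Compare scores on valid answers vs. gibberish and off-topic inputs.
    A reliable scorer should assign clearly lower scores to the probes."""
    results = {"valid": [], "gibberish": [], "off_topic": []}
    for essay in real_essays:
        results["valid"].append(score_essay(essay))
        results["gibberish"].append(score_essay(make_gibberish(essay)))
    for essay in off_topic_essays:
        results["off_topic"].append(score_essay(essay))
    # Report mean normalized scores per scenario.
    return {k: sum(v) / len(v) / max_score for k, v in results.items() if v}


if __name__ == "__main__":
    # A deliberately naive length-based scorer: it may agree well with humans
    # on ordinary essays, yet it scores gibberish exactly like a valid answer.
    naive_scorer = lambda essay: min(len(essay.split()) / 300, 1.0) * 10
    essays = ["The experiment shows that plant growth depends on light " * 10]
    off_topic = ["My favourite football team won the championship last year " * 10]
    print(probe_scorer(naive_scorer, essays, off_topic, max_score=10))
```

In this toy example the length-based scorer gives identical normalized scores to the valid essay and its shuffled counterpart, which is the failure mode the proposed test scenarios are meant to expose.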
Original language: English
Title of host publication: EDM
Publication status: Published - 2021

Keywords

  • Automated Essay Scoring
  • Testing Scenarios
  • Reliability and Validity
