Scoring essays is an exhausting and time-consuming task for teachers. Automated Essay Scoring (AES) makes the scoring process faster and more consistent. The most logical way to assess the performance of an automated scorer is to measure how closely its scores agree with those of human raters. However, we provide empirical evidence that an essay scorer that performs well under this quantitative evaluation is still too risky to deploy. We propose several input scenarios for evaluating the reliability and validity of the system, such as off-topic essays, gibberish, and paraphrased answers. We demonstrate that automated scoring models with high human-computer agreement fail to perform well on two out of three test scenarios. We also discuss strategies for improving the performance of the system.
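To make "score agreement with the human raters" concrete, the following is a minimal sketch of one agreement statistic often used in AES work, quadratic weighted kappa. The abstract does not name the metric used in the paper, so the choice of statistic, the function name `quadratic_weighted_kappa`, and its parameters are assumptions made for illustration only.

```python
from collections import Counter


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores.

    Illustrative only: the metric and this signature are assumptions,
    not taken from the paper.
    """
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_categories = max_score - min_score + 1
    if n_categories < 2:
        raise ValueError("the score range needs at least two categories")

    n = len(human)
    observed = Counter(zip(human, machine))  # joint counts of (human, machine)
    human_hist = Counter(human)              # marginal counts per rater
    machine_hist = Counter(machine)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Disagreement penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected joint frequency if the two raters were independent.
            weighted_expected += weight * human_hist[i] * machine_hist[j] / n**2

    if weighted_expected == 0:
        # Only happens when both raters assign one identical score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


if __name__ == "__main__":
    human_scores = [1, 2, 3, 4, 3, 2]    # made-up example data
    machine_scores = [1, 2, 3, 3, 3, 2]
    print(quadratic_weighted_kappa(human_scores, machine_scores, 1, 4))
```

A value near 1 indicates close agreement. A statistic like this is computed only over well-formed essays, so it says nothing about how a scorer responds to off-topic, gibberish, or paraphrased input.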
|Title of host publication|EDM|
|Publication status|Published - 2021|
- Automated Essay Scoring
- Testing Scenarios
- Reliability and Validity