> Gemini lowered its grades by an average of 2 points after seeing Claude's and OpenAI's more rigorous assessments. It couldn't justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.
This is to be expected. The big commercial LLMs generally respond with text that agrees with the user.
> But here's what's interesting: the disagreement wasn't random. Problem Framing and Metrics had 100% agreement within 1 point. Experimentation? Only 57%.
> Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.
The disagreement between the LLMs is interesting. I would hesitate to conclude that "low agreement on experimentation reflects genuine ambiguity in student responses." It could equally reflect genuine ambiguity on the part of the graders: the LLMs themselves disagreeing about how such a response should be scored. The data alone can't distinguish ambiguous answers from an ambiguous rubric.
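For concreteness, the "agreement within 1 point" figure quoted above is presumably computed per item across grader pairs. A minimal sketch, using hypothetical scores from three LLM graders (the function name and score values are illustrative, not from the post):

```python
from itertools import combinations

def agreement_within(scores_by_grader, tol=1):
    """Fraction of items on which every pair of graders
    differs by at most `tol` points."""
    items = list(zip(*scores_by_grader.values()))
    agree = sum(
        all(abs(a - b) <= tol for a, b in combinations(item, 2))
        for item in items
    )
    return agree / len(items)

# Hypothetical scores from three graders on one rubric area
scores = {
    "claude": [17, 12, 15, 9, 14, 11, 16],
    "openai": [16, 13, 14, 10, 13, 12, 16],
    "gemini": [17, 12, 16, 9, 15, 10, 17],
}
print(round(agreement_within(scores), 2))  # → 0.57
```

Note that this metric can't tell you *why* a pair disagreed, which is exactly the point: a 57% figure is consistent with vague student answers, with an underspecified rubric, or with both.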