What biases affect LLM-as-a-judge evaluations, and how do you mitigate position bias?
LLM judges suffer from position bias (favoring an answer because of where it appears in the prompt), verbosity bias (preferring longer answers), and self-enhancement bias (favoring their own style); position bias alone can flip the verdict when the same two answers are swapped. Mitigations include evaluating both orderings and counting only consistent wins, using rubrics, averaging multiple judges, and calibrating against human labels.
How to think about it
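Treat the three biases as separate failure modes. With position bias, the verdict depends on where an answer appears in the prompt rather than on its content. With verbosity bias, a longer answer wins even when it is not better. With self-enhancement bias, the judge favors answers written in its own style. Position bias is the most direct to test for: present the same pair in both orders, and if the verdict changes, the ordering rather than the content decided it. That is why the usual mitigation is to judge both orderings and count a win only when the same answer is preferred both times; inconsistent pairs are typically treated as ties or discarded. Rubrics, averaging over multiple judges, and calibration against human labels are complementary checks that help with the biases a swap cannot detect.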
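The both-orderings check can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `judge` callable and its return values (`"first"`, `"second"`, `"tie"`) are a hypothetical interface assumed here, and the two toy judges stand in for a real LLM call.

```python
from typing import Callable

# Hypothetical judge interface (an assumption for this sketch):
# judge(question, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def consistent_verdict(judge: Judge, question: str,
                       answer_a: str, answer_b: str) -> str:
    """Return "A" or "B" only if that answer wins in both orderings."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    if forward == "first" and backward == "second":
        return "A"  # A preferred regardless of position
    if forward == "second" and backward == "first":
        return "B"  # B preferred regardless of position
    return "tie"    # verdict changed with ordering, or the judge said tie


# Toy judges standing in for an LLM call.
def always_first(question: str, first: str, second: str) -> str:
    """Fully position-biased: always picks whatever is shown first."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Verbosity-biased but position-independent: picks the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


position_biased_result = consistent_verdict(always_first, "q", "x", "y")
verbosity_biased_result = consistent_verdict(
    prefers_longer, "q", "long answer", "short"
)
```

With the fully position-biased toy judge, the two orderings disagree and the pair is scored as a tie. The verbosity-biased toy judge still returns a consistent win for the longer answer, which shows that swapping addresses position bias only; the other mitigations are still needed for the remaining biases.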