How do you evaluate LLM outputs, and what is LLM-as-a-judge?
LLM evaluation combines reference-based metrics like BLEU and ROUGE, task benchmarks like MMLU and HumanEval, and human or model-based judgment of qualities like helpfulness and faithfulness. LLM-as-a-judge uses a strong model to score or compare outputs against a rubric, scaling human-like evaluation cheaply but requiring care because the judge can be unreliable.
How to think about it
LLM evaluation combines reference-based metrics like BLEU and ROUGE, task benchmarks like MMLU and HumanEval, and human or model-based judgment of qualities like helpfulness and faithfulness. LLM-as-a-judge uses a strong model to score or compare outputs against a rubric, scaling human-like evaluation cheaply but requiring care because the judge can be unreliable.