
Four Key Insights

MULTIPLE-CHOICE
Multiple-choice questions (MCQs) provide a strong but incomplete baseline: leading models score ~83-86% on MCQs.

OPEN-WEIGHT
Open-weight (self-hostable) models are improving quickly, approaching frontier performance on MCQs (~81% in EPRI’s tests).

OPEN-ENDED
Open-ended questions expose a reliability gap: accuracy drops by ~27 percentage points compared to multiple-choice questions, and top models score only ~46-71% on expert-level items.

WEB SEARCH
Web search improves scores modestly (about 2-4%) but introduces risks from irrelevant or misleading retrieved content.