Four Key Insights

MULTIPLE-CHOICE
Multiple-choice questions (MCQs) provide a strong but incomplete baseline: leading models score ~83-86% on MCQs.

OPEN-ENDED
Open-ended questions expose a reliability gap: accuracy drops by ~27 percentage points compared to multiple-choice questions, and top models score only ~46-71% on expert-level items.

OPEN-WEIGHT
Open-weight (self-hostable) models are improving quickly, approaching frontier performance on MCQs (~81% in EPRI’s tests).

WEB SEARCH
Web search improves scores modestly (about 2-4%) but introduces risks from irrelevant or misleading retrieved content.