Validation & Accuracy · Lewis 1.0
How accurate is it, really?
Every number on this page is 5-fold cross-validated out-of-sample. Raw and calibrated MAE are published side-by-side.
- Texas UT/TPP and California PPIC state panels
- Columbus / Atlanta / Dallas legacy set (Pew, Gallup, election canvass)
- 22-Q pre-registered held-out sourced after training froze
- 460 benchmark questions · pooled production MAE 7.47%
Public Pew benchmarks (Predictions): timestamped Lewsearch runs and artifacts, published next to the cited Pew releases. These are synthetic directional reads, not a replication of probability-sample polling.
4.88%
Calibrated MAE, Texas state panel
UT/Texas Politics Project · 97 Q ex-electoral · 5-fold CV
3.43%
Best category, TX political approval
Live-poll margin-of-error territory (n=18)
7.47%
Pooled MAE across all panels
460 questions · ex-electoral · single calibrator · 5-fold CV
460
Independent benchmark questions
UT/TPP · PPIC · Pew / Gallup / canvass · pre-registered held-out
Plain English FAQ
Full tables and audit detail are below for due diligence.
“Under 5% error”: what does that actually mean?
If the UT/Texas Politics Project says 48% of Texans approve of a given policy, Lewis 1.0 will land within roughly 4–5 percentage points on average across the answer options (4.88% calibrated MAE on the Texas panel), and within 3.4 percentage points on political approval questions specifically (n=18, 5-fold CV). A live telephone poll typically carries a margin of error of about ±3 percentage points from sampling alone, with nonresponse bias on top. On our strongest question types, we are competitive with live fielding.
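For context on the ±3-point sampling figure, the textbook margin of error for a simple random sample can be worked out directly. The sketch below assumes a 95% confidence level, the worst-case proportion of 50%, and a sample of 1,000; the function name is illustrative, and real polls carry design effects that widen the interval.

```python
import math

def sampling_margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval, in percentage points."""
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)

# Assumed sample size of 1,000 respondents (not a figure from any specific poll).
moe_1000 = sampling_margin_of_error(1000)
print(f"n=1,000: +/-{moe_1000:.1f} pp")
```

This covers sampling error only. MAE against a published poll is a different quantity, so the two are comparable in scale rather than interchangeable.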
Could you have just trained Lewis to memorize the answers?
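This is the test to worry about, and the reason we moved to 5-fold cross-validation. Every published number is evaluated on questions the calibrator was not fit on in that fold. The 22-Q held-out panel was sourced after training and calibration were frozen, and scored once. If Lewis had memorized the benchmark, error on that held-out set would blow up relative to the cross-validated figures. It doesn't: the held-out set scores 9.97% ex-electoral against 7.47% pooled, a gap that reflects noisier edge-case questions rather than memorization.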
Why is California harder than Texas?
PPIC covers broad state-specific policy framings where options are compound and exposure varies widely. UT/TPP is closer to classic approval and directional questions. We publish both: Texas is the well-grounded customer baseline; California is the stretch-case ceiling.
What if I ask something Lewis wasn't benchmarked on?
Every result carries a confidence tier: High (<6 pp MAE on comparable question types), Medium (6–10 pp), Low (weak topline signal on an otherwise High or Medium type), or Flag (≥10 pp, or novel territory). Custom brand and ad-concept questions extrapolate from agent profiles; treat absolute percentages as estimates.
Is this useful for financial or strategic decisions?
Yes, with appropriate framing. Lewsearch is a research signal, not a census. Strong for directional decisions and message testing. For regulatory filings or published editorial polling, commission a live panel.
U.S. National Cohort
Pooled live agents, weighted to national targets
Our U.S. National cohort is a stratified sample drawn from Lewsearch's 11 live panels: Columbus, Cleveland, and Ashtabula plus statewide Ohio; Atlanta and statewide Georgia; Dallas and statewide Texas; statewide California, New York, and Virginia. The sample is raked to ACS adult-population marginals on age, sex, race/ethnicity, education, region, and income, with party targets modeled from public political-identification benchmarks because ACS does not measure party ID.
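Raking (iterative proportional fitting) can be sketched in a few lines. The example below is a toy illustration, not the production weighting code: it rakes eight hypothetical agents on two variables only, and the category codes and target shares are made up rather than taken from ACS.

```python
import numpy as np

def rake(weights, codes_by_var, targets_by_var, max_iters=100, tol=1e-10):
    """Rescale weights until each variable's weighted shares match its targets."""
    w = np.asarray(weights, dtype=float).copy()
    for _ in range(max_iters):
        worst = 0.0
        for var, codes in codes_by_var.items():
            target = np.asarray(targets_by_var[var], dtype=float)
            current = np.bincount(codes, weights=w, minlength=len(target)) / w.sum()
            factor = target / current  # assumes every category has at least one agent
            w *= factor[codes]
            worst = max(worst, float(np.abs(factor - 1.0).max()))
        if worst < tol:  # every adjustment factor is ~1, so the margins agree
            break
    return w / w.sum()

# Hypothetical agents: integer category codes per variable.
sex = np.array([0, 0, 0, 0, 0, 1, 1, 1])
age = np.array([0, 0, 0, 0, 1, 0, 1, 1])
targets = {"sex": [0.49, 0.51], "age": [0.60, 0.40]}  # illustrative targets

w = rake(np.ones(8), {"sex": sex, "age": age}, targets)
sex_share = np.bincount(sex, weights=w, minlength=2)
age_share = np.bincount(age, weights=w, minlength=2)
print(sex_share.round(3), age_share.round(3))
```

Raking matches each marginal separately; it does not guarantee that joint cells (for example, age within region) match the population.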
This is more informative than a blank demographic-only audience: the underlying agents are persistent panelists with demographic profiles, media diets, accumulated beliefs, and recent local/state/news exposure. It is still not a probability sample of U.S. adults, and it is not currently benchmark-calibrated as a national panel.
Known limitation
While the weighted demographic composition is designed to match the U.S. adult population, the source-agent geography is skewed toward the Midwest and South because that is where our live panel coverage is strongest today. For broad national attitudes where local media environment is unlikely to dominate variation — technology adoption, national policy attitudes, broad cultural attitudes — this limitation is usually acceptable. For questions with strong regional drivers, use live city/state panels where available, or treat ACS-modeled regional reads as directional screens rather than precise regional substitutes.
We do not count the U.S. National cohort as additional unique panel inventory. It is a national sampling frame built from existing live agents, not 10,000 or 20,000 newly seeded respondents.
Message Testing Methodology
Qualitative signal, not a calibrated polling benchmark
Message Testing is built for open-ended feedback on ads, landing pages, pitch decks, product copy, and A/B messaging. Unlike structured Lewsearch polls, there is no external ground-truth percentage for “is this too technical?” or “what feels confusing?” The MAE benchmarks on this page apply to multiple-choice studies with known human-survey ground truth. Message Testing should be read as directional qualitative research: strengths, watchouts, sentiment, and illustrative synthetic respondent quotes.
For PDFs, images, and public URLs, Lewsearch first converts the material into a text description. PDFs and images are summarized by a vision-capable model before the Lewis agents respond; URLs are fetched for visible page text and summarized into a website brief. This lets the text-only Lewis panel react to visual hierarchy, copy, tone, CTAs, trust signals, and likely points of confusion. It is not a pixel-perfect usability lab or a human eye-tracking study.
The respondent quotes in Message Testing reports are generated by synthetic agents sampled from the selected market and optional demographic filter. They are useful for understanding likely reactions and language, but they are not verbatim human transcripts. Demographic notes in Message Testing are synthesized from the open-ended responses and agent profiles; they should be treated as pattern-finding, not audited subgroup crosstabs.
Appropriate use
Use Message Testing to decide what to clarify, which concept to iterate, which audience may be confused, and what to preserve before spending on media or live research. Use structured Lewsearch studies when you need calibrated percentage estimates against benchmarked multiple-choice question types.
Technical Detail
Seven audit points.
01 Pre-registered held-out: sourced after training was frozen
Our strictest honesty check is a 22-question benchmark drawn on April 18, 2026 from Emerson, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council, spanning five states and four categories. These questions were selected and written down after training and calibration were frozen. 8 of 22 were dropped by an automated pre-filter (past-election ground truths, extreme prior-delta outliers) and 14 were scored. Calibrated MAE on that truly held-out set: 10.68% overall, 9.97% ex-electoral, 7.24% on political approval. That is higher than our best in-panel numbers, and we publish both because the gap shows the real-world spread between well-specified panels and noisier edge cases.
02 Cross-validated: every number is out-of-sample
Every published MAE is a 5-fold cross-validation result. Calibration parameters are fit on 4 of 5 folds and evaluated on the 5th, never on data the calibrator saw. The Lewsearch production inference stack achieves 4.88% ex-electoral MAE on 97 Texas state-panel questions (UT/Texas Politics Project, 2024–25 waves), 7.77% on 148 California questions (PPIC 2024–25), 7.86% on the 151-Q Pew/Gallup/canvass tri-metro benchmark, and 9.97% on the pre-registered held-out. A single pooled calibrator trained on all 460 questions lands at 7.47% ex-electoral, the number customers actually get in production.
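The fit-on-four, score-on-one loop can be sketched as follows. This is a minimal illustration under stated assumptions: the benchmark data is synthetic, every question has three options, and the calibrator is a one-parameter temperature rescaling used as a stand-in (it is not the production calibrator). All function names are hypothetical.

```python
import numpy as np

def mae_pp(pred, truth):
    """Mean absolute error across all options, in percentage points."""
    return 100.0 * np.abs(pred - truth).mean()

def apply_temperature(p, t):
    """Stand-in calibrator: raise probabilities to 1/t and renormalize."""
    scaled = p ** (1.0 / t)
    return scaled / scaled.sum(axis=1, keepdims=True)

def fit_temperature(p, truth, grid=np.linspace(0.5, 3.0, 26)):
    """Pick the temperature with the lowest MAE on the training questions."""
    return min(grid, key=lambda t: mae_pp(apply_temperature(p, t), truth))

def cross_validated_mae(raw, truth, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(raw)), k)
    per_question = np.empty(len(raw))
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        t = fit_temperature(raw[train_idx], truth[train_idx])  # fit on 4 folds
        calibrated = apply_temperature(raw[test_idx], t)       # score on the 5th
        per_question[test_idx] = 100.0 * np.abs(calibrated - truth[test_idx]).mean(axis=1)
    return per_question.mean()

# Synthetic benchmark: "raw" predictions are an over-sharpened copy of the truth.
rng = np.random.default_rng(42)
truth = rng.dirichlet([2.0, 2.0, 2.0], size=100)
raw = truth ** 2
raw /= raw.sum(axis=1, keepdims=True)

raw_mae = mae_pp(raw, truth)
cv_mae = cross_validated_mae(raw, truth)
print(f"raw {raw_mae:.2f} pp, cross-validated calibrated {cv_mae:.2f} pp")
```

Every question is scored exactly once, by a calibrator fit without it, which is what makes the averaged figure out-of-sample.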
03 Independent state-level panels, not just our own survey
Beyond the legacy 151-Q Pew/Gallup/canvass benchmark, Lewis is evaluated on two independent academic state panels: the Public Policy Institute of California (PPIC) statewide survey and the University of Texas / Texas Politics Project panel. Both are fielded by third-party research organizations on YouGov infrastructure. Every benchmark question carries an exact field-date window and a full demographic target vector, so the agent pool is demographically matched question-by-question, not just city-by-city.
04 Verified against official election canvass data
Election-context questions use official canvass results as ground truth: Georgia Secretary of State (2022 Senate runoff, 2024 Presidential), Dallas County Elections (2024 Presidential, 2024 Senate), Franklin County Board of Elections (2023 mayoral), California Secretary of State (statewide propositions), and Texas Secretary of State (2022/2024 statewide). These are exact tallies, not polls, which makes this the only category in our benchmark with zero ground-truth noise.
05 Post-hoc calibration: disclosed, reproducible, not per-customer
Raw Lewis outputs pass through a published post-hoc calibration technique (Dirichlet calibration, Kull, Perelló-Nieto, Filipović et al., NeurIPS 2019) fit on held-out folds of our benchmark pool. The same calibrator ships to every customer; there is no per-panel retuning after the fact. Raw and calibrated MAE are published side-by-side. On Texas the calibrator recovers 2.2 pp of raw error (8.56 → 6.35 across all 120 questions, or 4.88 once electoral questions are excluded); on California, 5.7 pp; on the legacy 151-Q set, 2.7 pp. These are empirical reductions on independently-fielded panels.
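In the cited paper, Dirichlet calibration maps a probability vector through a linear transform of its log-probabilities followed by a softmax. The sketch below shows only that forward map. The matrix `W` and bias `b` are made-up values for illustration; the fitted production parameters are not published on this page.

```python
import numpy as np

def dirichlet_calibrate(p, W, b, eps=1e-12):
    """Apply q = softmax(W @ log(p) + b) to each row of p."""
    logits = np.log(np.clip(p, eps, 1.0)) @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for stability
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)

raw = np.array([[0.70, 0.20, 0.10]])  # hypothetical raw 3-option output

# Identity weights and zero bias leave the distribution unchanged.
unchanged = dirichlet_calibrate(raw, np.eye(3), np.zeros(3))

# Illustrative parameters that soften an over-confident prediction.
softened = dirichlet_calibrate(raw, 0.5 * np.eye(3), np.zeros(3))
print(unchanged.round(3), softened.round(3))
```

With a full matrix and bias, the map can also shift probability between specific options, which a single temperature parameter cannot do.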
06 Scale-validated: apples-to-apples at n=1,000 vs n=10,000
On the legacy 151-Q Pew/Gallup/canvass panel we ran the identical 5-fold CV Dirichlet calibrator on n=1,000 and n=10,000 raw outputs across 148 shared questions. Pooled calibrated MAE: 7.75% at n=1,000 vs 7.90% at n=10,000, a difference of 0.15 percentage points. Political approval (our largest and most-demanded category, n=47) improved by 0.56pp at n=10,000 (9.05% → 8.49%). Civic/legal improved by 0.81pp (n=7), and electoral held steady at ~4.3pp (n=6). Directional and brand rose by 0.2–0.5pp (inside noise); policy regressed by 1.4pp. We are analyzing that category in our internal QA cycle and will publish updates here. The takeaway: accuracy is calibrator-driven, not sample-driven. Larger paid-tier studies mainly reduce simulation noise and improve crosstab stability; model uncertainty and category fit still dominate the final read.
07 Living agents: persistent memory and period-appropriate context
Lewis agents are not one-shot LLM personas. Each agent carries a persistent demographic profile and a memory of prior studies, so two customers asking related questions on the same panel get internally consistent behavior rather than independent re-sampled noise. Evaluation is anchored to the midpoint of the source survey's field-date window, with period-appropriate context injected at inference — a February 2024 PPIC question is answered from a February 2024 state of the world, not from today's headlines. That temporal discipline is what prevents benchmark leakage and keeps the MAE numbers honest.
Benchmark Panels
Calibrated accuracy by independent panel
Lewis 1.0 is evaluated on four independently-fielded panels. Each is fielded by a different organization — UT/Texas Politics Project (YouGov, 97 Qs), PPIC California (148 Qs), the Pew/Gallup/election-canvass tri-metro legacy set (151 Qs), and a pre-registered 22-Q held-out sourced after training froze (14 scored after filter). Different sponsors, different fieldwork methods, different time windows — so the benchmark as a whole is not a single-source artifact.
Raw vs. Calibrated MAE (5-fold CV)
lower is better
Calibration drop: UT/Texas -2.2pp · PPIC -5.7pp · Pew -2.7pp · Pre-registered -2.2pp
| Panel | Scope | n | Raw MAE | Calibrated MAE | Excl. Electoral | Status |
|---|---|---|---|---|---|---|
| UT/Texas Politics Project | Texas statewide · 2024–25 | 120 | 8.56% | 6.35% | 4.88% | LIVE |
| PPIC California Statewide | California · 2024–25 | 175 | 13.69% | 8.01% | 7.77% | LIVE |
| Pew / Gallup / Canvass (Legacy) | Columbus · Atlanta · Dallas | 151 | 10.58% | 7.86% | 7.86% | LIVE |
| Pre-registered 22-Q held-out | OH · GA · TX · NY · CA | 22 | 12.86% | 10.68% | 9.97% | LIVE |
All figures are 5-fold cross-validated out-of-sample MAE. “Excl. electoral” removes tight-margin electoral contests, where voter-turnout uncertainty dominates MAE on any modeling approach. The UT/TPP panel covers Texas statewide · the PPIC panel covers California statewide · the legacy panel covers Columbus / Atlanta / Dallas.
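The calibration-drop figures follow directly from the Raw MAE and Calibrated MAE columns of the table. A short arithmetic check, with shortened panel labels used as dictionary keys:

```python
# (raw MAE, calibrated MAE) in percent, copied from the table above.
panels = {
    "UT/Texas": (8.56, 6.35),
    "PPIC": (13.69, 8.01),
    "Pew / Gallup / Canvass": (10.58, 7.86),
    "Pre-registered": (12.86, 10.68),
}

drops = {name: round(cal - raw, 2) for name, (raw, cal) in panels.items()}

for name, drop in drops.items():
    raw = panels[name][0]
    print(f"{name}: {drop:+.2f} pp ({100 * drop / raw:+.0f}% of raw error)")
```

These drops compare raw and calibrated MAE over all questions in each panel; the "Excl. Electoral" column is a separate cut and is not part of this subtraction.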
Scale Validation · n=1,000 vs n=10,000
Larger N reduces simulation noise; headline MAE stays flat
We compared calibrated MAE on the legacy 151-Q panel at n=1,000 vs n=10,000 to answer a single structural question: does sample size move our headline MAE? We ran the identical 5-fold CV Dirichlet calibrator on both raw outputs across the 148 shared questions — fully apples-to-apples. The answer: pooled calibrated MAE went from 7.75% at n=1,000 to 7.90% at n=10,000, within 0.15pp and well inside the bootstrap envelope. Political approval (our largest category and primary customer use case) actually improved by 0.56pp at n=10k.
| Category | n | n=1k cal | n=10k cal | Δ | Note |
|---|---|---|---|---|---|
| Political approval | 47 | 9.05% | 8.49% | -0.56pp | Category improved at n=10k; pooled headline MAE unchanged. |
| Civic / legal | 7 | 11.32% | 10.51% | -0.81pp | Small n, but moves in the right direction. |
| Electoral | 6 | 4.37% | 4.25% | -0.12pp | Already our cleanest category. Held steady and nudged lower. |
| Directional | 27 | 8.85% | 9.08% | +0.23pp | Inside bootstrap noise. |
| Brand | 33 | 6.68% | 7.18% | +0.50pp | Slight regression, still under 8% MAE. |
| Policy | 28 | 5.37% | 6.77% | +1.40pp | Only honest regression. Raw policy MAE more than doubled (6.30→13.38), calibration absorbs most of it but not all. Under investigation. |
| Pooled | 148 | 7.75% | 7.90% | +0.15pp | Within bootstrap noise. Scale-stable. |
Same Lewis 1.0 model, same 148 questions, same 5-fold CV Dirichlet calibrator applied to raw outputs at both sample sizes. The honest conclusion: accuracy is calibrator-driven, not sample-driven. The real upside of paid-tier n=10,000 studies is lower simulation noise and more stable crosstabs; model uncertainty and category fit still dominate the final interpretation.
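One common way to build a bootstrap envelope for a paired comparison like this is to resample questions with replacement and recompute the mean difference each time. The sketch below assumes that approach. Per-question errors are not published on this page, so the arrays are synthetic placeholders and the function name is hypothetical.

```python
import numpy as np

def paired_bootstrap_ci(err_a, err_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of (err_b - err_a)."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(err_b, dtype=float) - np.asarray(err_a, dtype=float)
    # Resample question indices with replacement, one row per bootstrap draw.
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diff.mean(), lo, hi

# Synthetic per-question errors for 148 shared questions (illustrative only).
rng = np.random.default_rng(1)
err_1k = rng.gamma(shape=2.0, scale=4.0, size=148)
err_10k = err_1k + rng.normal(loc=0.15, scale=3.0, size=148)

mean_diff, lo, hi = paired_bootstrap_ci(err_1k, err_10k)
print(f"mean difference {mean_diff:+.2f} pp, 95% interval [{lo:+.2f}, {hi:+.2f}]")
```

If the interval contains zero, the observed difference between the two sample sizes is not distinguishable from resampling noise.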
Category Breakdown · State panels
Where Lewis is strongest — and where it isn't
A single headline hides the fact that Lewis is strongest on well-exposed, demographic-driven questions (political approval, directional indicators, brand preference) and weaker on compound policy framings and tight electoral margins. Below is the calibrated per-category MAE on both state panels — Texas (UT/TPP) and California (PPIC) — so you can match the tool to the question you're asking. The legacy 151-Q tri-metro panel and the pre-registered held-out sit in the same category shape (approval-type lowest, electoral-margin highest).
| Category | TX (UT/TPP) n | TX (UT/TPP) MAE | CA (PPIC) n | CA (PPIC) MAE |
|---|---|---|---|---|
| Political approval | 18 | 3.43% | 15 | 7.94% |
| Directional / economic | 7 | 4.27% | 13 | 6.40% |
| Policy positions | 63 | 5.07% | 118 | 7.75% |
| Brand / reputation | 4 | 2.44% | 2 | 16.98% |
| Civic / legal | 5 | 6.92% | 2 | 12.44% |
| Electoral margin (hardest) | 23 | 12.23% | 25 | 9.30% |
Source: UT/Texas Politics Project (2024–25 waves) and PPIC California Statewide (2024–25 waves). 5-fold cross-validated, calibrated. Cells with small n (CA brand n=2, CA civic n=2) are shown with their point estimate but should be read with caveat. TX brand 2.44% is on n=4 and within the sampling envelope of the TX approval 3.43% result.
Top Single-Question Results
The tail of the distribution, not the mean
Aggregate calibrated panel MAE (6.35% TX, 8.01% CA, 7.86% legacy 151-Q, 10.68% held-out) is the honest average, including the hardest electoral-margin outliers; the ex-electoral figures are 4.88%, 7.77%, 7.86%, and 9.97%. But the modal customer study sits in the lower tail: well-framed approval or policy questions on a demographically-matched panel. Below are the 10 individual questions where Lewis 1.0’s calibrated prediction landed closest to the truth. They are ranked by error, not hand-picked: this is the natural tail of a fully-fit calibration pipeline applied to every question. All out-of-sample, 5-fold CV.
| # | MAE | Panel | Category | Question |
|---|---|---|---|---|
| 01 | 0.40 pp | TX (UT/TPP) | Policy · 5-opt | “How concerned are you about the cost of higher education?” |
| 02 | 0.47 pp | CA (PPIC 2025) | Policy · 3-opt | California making its own climate policy separate from the federal government |
| 03 | 0.49 pp | CA (PPIC 2024) | Electoral · 3-opt | Prop 1 — $6.38B mental-health facilities bond |
| 04 | 0.60 pp | CA (PPIC 2025) | Policy · 3-opt | $7.1B Budget Stabilization proposal |
| 05 | 0.72 pp | CA (PPIC 2024) | Policy · 3-opt | Higher taxes + more services vs. lower taxes + fewer services |
| 06 | 0.79 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval on Venezuela |
| 07 | 0.94 pp | CA (PPIC 2025) | Policy · 3-opt | Taxes/services tradeoff (2025 replication) |
| 08 | 0.97 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval on health care |
| 09 | 1.02 pp | CA (PPIC 2025) | Policy · 3-opt | CA state actions protecting legal rights of undocumented immigrants |
| 10 | 1.06 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval (6-option grid) |
#1, Option-by-option
“How concerned are you about the cost of higher education in Texas?”
UT/Texas Politics Project · February 2026 · n=1,300 registered voters
| Option | Lewis prediction | UT/TPP truth | Error |
|---|---|---|---|
| Very concerned | 39.1% | 39.2% | 0.1 pp |
| Somewhat concerned | 31.0% | 30.0% | 1.0 pp |
| Not too concerned | 15.9% | 16.6% | 0.7 pp |
| Not at all concerned | 9.5% | 9.4% | 0.1 pp |
| Don’t know | 4.6% | 4.8% | 0.2 pp |
Five options, every one within 1.0 percentage point of a YouGov-fielded survey of 1,300 Texans. This is the regime customers are actually in when they specify a real demographic target: not an average, but the floor of what a well-specified panel can do.
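Per-question MAE is the mean of the absolute option-level errors. Recomputing it from the rounded table values gives about 0.42 pp, slightly above the published 0.40 pp, presumably because the published figure was computed on unrounded shares.

```python
# Shares in percent, copied from the option-by-option table above.
prediction = {
    "Very concerned": 39.1,
    "Somewhat concerned": 31.0,
    "Not too concerned": 15.9,
    "Not at all concerned": 9.5,
    "Don't know": 4.6,
}
truth = {
    "Very concerned": 39.2,
    "Somewhat concerned": 30.0,
    "Not too concerned": 16.6,
    "Not at all concerned": 9.4,
    "Don't know": 4.8,
}

errors = {opt: abs(prediction[opt] - truth[opt]) for opt in truth}
mae = sum(errors.values()) / len(errors)
worst = max(errors.values())
print(f"MAE {mae:.2f} pp, largest single-option error {worst:.1f} pp")
```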
Full per-question CV rankings available on request. Customer studies don’t carry ground truth (that’s why they’re run), so we can’t quote a per-study MAE; these benchmark tails are the proxy for what well-matched customer panels look like.
Case Study · Ultra-Specific Neighborhood
Short North Arts District, Columbus — 8.71% average MAE on a 6-block neighborhood
The Short North Arts District is roughly six blocks of High Street in downtown Columbus — not a city, not a metro, a neighborhood. In August 2023 the Short North Alliance commissioned JS&A Consulting to run a consumer study of 500+ visitors and Columbus-area residents. We scored Lewis 1.0 against five of those questions, with agents drawn from our Columbus panel and a neighborhood-level prompt tag. Ground truth below is the JS&A study. Lewis predictions use the production Dirichlet calibrator fit on the 460-Q benchmark pool — this neighborhood study was not part of that pool, so it functions as an external held-out spot check.
“Do you dine out more, about the same, or less often than before the pandemic?”
MAE 2.47pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| More often | 20.4% | 20% | 0.4pp |
| About the same | 43.3% | 40% | 3.3pp |
| Less often | 36.3% | 40% | 3.7pp |
“Do you believe Short North businesses are mostly locally owned, a mix, or mostly chains?”
MAE 7.40pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| Mostly locally owned | 48.9% | 60% | 11.1pp |
| A mix of both | 37.4% | 30% | 7.4pp |
| Mostly chains/franchises | 13.7% | 10% | 3.7pp |
“Compared to a few years ago, are you more likely or less likely to visit the Short North?”
MAE 12.93pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| More likely | 19.4% | 15% | 4.4pp |
| About the same | 36.0% | 21% | 15.0pp |
| Less likely | 44.6% | 64% | 19.4pp |
“How would you describe the Short North’s economic impact on Columbus?”
MAE 10.78pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| Major economic driver | 38.8% | 55% | 16.2pp |
| Moderate contributor | 45.4% | 35% | 10.4pp |
| Minor / negligible | 15.7% | 10% | 5.7pp |
“Biggest reason you might visit the Short North less often?”
MAE 9.95pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| Safety concerns | 28.2% | 35% | 6.8pp |
| Parking difficulty | 25.3% | 15% | 10.3pp |
| Too expensive | 24.0% | 10% | 14.0pp |
| I go out less in general | 12.0% | 30% | 18.0pp |
| Other reasons | 10.6% | 10% | 0.6pp |
Takeaway
Strong on "dining frequency" (2.47pp) and "local vs. chain perception" (7.40pp): both are classic consumer-sentiment questions well-anchored in demographic priors and period-appropriate local context. Weaker on "economic impact" (10.78pp), and on "likelihood to visit" (12.93pp) and "biggest reason to visit less" (9.95pp), where ground truth itself was estimated from narrative coverage rather than direct tabulation. This is the real neighborhood-level regime: customers who want to test a brand question at a 6-block scale, demographically-matched to real residents, with honest error bars on what we do and don't nail. Average calibrated MAE across the five: 8.71pp.
Source: Short North Alliance & JS&A Consulting Market & Consumer Study, Aug 2023 (n=500+). Lewis 1.0 predictions calibrated on held-out folds of the benchmark pool, 5-fold CV. One further binary-option SNA question (“defining role in Columbus's identity”) is in-benchmark but sits in a calibration bucket outside the 5 shown — it will be published when that bucket has full n-of-bucket coverage.
Case Studies
CASE STUDY · POLITICAL APPROVAL
TX political approval, 18 questions, aggregate 3.43% MAE
Setup
UT/Texas Politics Project approval-rating block across 2024–25 field waves: Governor, Lt. Governor, Legislature, US Senators, economic direction. 3 options (approve / disapprove / don't know). 1,000 TX residents per question, demographically matched to UT/TPP target margins (party ID × age × race × education × region).
Result
Aggregate calibrated MAE across all 18 questions: 3.43 pp (5-fold CV, Dirichlet calibration). Best individual question: 0.8 pp. Worst: 6.2 pp. Baseline (pre-calibration) MAE: 7.02 pp, so calibration recovered more than half the raw error.
Takeaway
On the categorical backbone of most real survey research (approval ratings of well-known public figures and institutions), Lewis sits squarely in live-poll margin-of-error territory. This is the strongest signal for customers running messaging / approval / perception studies.
CASE STUDY · BRAND / ADVERTISING
TX grocery brand preference, late 2024
Setup
UT/TPP brand-reputation block. 6 options including 'Don't know / none'. Lewis 1.0 agent pool: 1,000 TX shoppers, income-tilted per retail demographics.
Result
Calibrated MAE across all 6 options: 4.14 pp. Lewis correctly rank-ordered top-3 brands and flagged the two long-tail options as below-10% territory.
Takeaway
Brand benchmarks reliably pick the leader and runner-up. For A/B ad concept testing, directional accuracy across demographics is the decision-grade signal.
CASE STUDY · PRE-REGISTERED HELD-OUT
22-Q independent benchmark, 9.97% ex-electoral
Setup
On April 18, 2026 we pre-registered a fresh 22-question benchmark drawn from Emerson College, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council, spanning five states (OH/GA/TX/NY/CA) and four categories (political approval / policy / civic trust / electoral). These questions were sourced AFTER training and calibration fits were frozen. An automated filter excluded 8 questions (past-election ground truths and extreme prior-delta outliers), leaving 14 scored.
Result
Calibrated MAE on the 14 scored questions: 10.68% overall, 9.97% ex-electoral. Political approval: 7.24 pp (n=5). Policy: 9.53 pp (n=4). Civic trust: 15.10 pp (n=3). Electoral: 14.90 pp (n=2).
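The overall and ex-electoral figures can be approximately recovered by weighting each category MAE by its question count. This sketch assumes each category figure is a plain mean of per-question MAE. It reproduces the published numbers to within rounding of the category values.

```python
# (number of scored questions, calibrated MAE in pp) per category, from the result above.
categories = {
    "Political approval": (5, 7.24),
    "Policy": (4, 9.53),
    "Civic trust": (3, 15.10),
    "Electoral": (2, 14.90),
}

def pooled_mae(cats):
    """Question-count-weighted mean of the category MAEs."""
    total_n = sum(n for n, _ in cats.values())
    return sum(n * mae for n, mae in cats.values()) / total_n

overall = pooled_mae(categories)
ex_electoral = pooled_mae({k: v for k, v in categories.items() if k != "Electoral"})
print(f"overall {overall:.2f} pp, ex-electoral {ex_electoral:.2f} pp")
```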
Takeaway
This is our strictest possible honesty check: a small, truly held-out set with no fitting after the fact. The gap between 4.88% on the Texas panel and 9.97% on the 22-Q held-out is the real-world variance between core use cases (familiar territory, in-distribution) and edge cases (novel state-specific policy, tight electoral margins). We publish both.
Disclosures
AI-generated data
All Lewsearch panel responses are generated by AI, not collected from human participants. Every report carries a mandatory disclosure to this effect.
Not a replacement for live surveys
Synthetic results are not appropriate for legal, regulatory, or journalistic contexts requiring probability sampling from live populations.
Post-hoc calibration applied
Raw Lewis outputs pass through a published post-hoc calibration technique (Dirichlet calibration) fit on held-out folds of the benchmark pool. The same calibrator ships to every customer; there is no per-panel retuning after the fact. Raw and calibrated MAE are published side-by-side on the methodology page so reviewers can see what the calibration is and isn't doing.
Geographic coverage
11 live panels · U.S. national · 4 census regions (16 audiences). Benchmark panels drove calibration; additional audiences use the same production calibrator.
Model versioning
Reports indicate the Lewis version at time of fielding. Lewis 1.0 is the current production model across all benchmark panels. Weights update on a continuous news-and-simulation retrain cycle; every update re-runs the full CV pipeline before promotion.
Confidence scoring
Every result ships with a confidence tier (High / Medium / Low / Flag). Tier is derived from two signals: (1) calibrated CV MAE for the matching question-type bucket, and (2) topline signal strength for this study (lead margin, subgroup consistency). We take the more conservative of the two. High = <6 pp bucket MAE and strong signal. Medium = 6–10 pp bucket MAE or moderate signal. Low = weak signal within a medium/high bucket (low lead margin, within-MoE topline). Flag = bucket MAE ≥10 pp, or the question fell outside any analogous calibrated bucket; treat as directional only.
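One way to read the tier rules as code is sketched below. The three signal labels ("strong", "moderate", "weak") and the function name are assumptions made for illustration. How signal strength is scored from lead margin and subgroup consistency is not specified here, so it is taken as an input.

```python
from typing import Optional

SIGNALS = {"strong", "moderate", "weak"}  # hypothetical signal-strength labels

def confidence_tier(bucket_mae_pp: Optional[float], signal: str) -> str:
    """Return the more conservative tier implied by bucket MAE and topline signal."""
    if signal not in SIGNALS:
        raise ValueError(f"unknown signal label: {signal!r}")
    # No analogous calibrated bucket, or bucket MAE at or above 10 pp.
    if bucket_mae_pp is None or bucket_mae_pp >= 10.0:
        return "Flag"
    # Weak topline signal inside an otherwise medium/high bucket.
    if signal == "weak":
        return "Low"
    if bucket_mae_pp < 6.0 and signal == "strong":
        return "High"
    return "Medium"

print(confidence_tier(3.43, "strong"))   # TX political approval bucket
print(confidence_tier(12.23, "strong"))  # TX electoral margin bucket
```

Checking Flag first and Low second is what implements "the more conservative of the two": a strong topline cannot rescue a poorly calibrated bucket.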
Why we publish raw + calibrated side-by-side
Raw MAE shows the base model's honesty. Calibrated MAE shows the production system's accuracy. Publishing both ensures customers and investors can judge whether our calibration is doing legitimate work or papering over structural bias. It is the former.
Try it yourself
Skeptical? Run a study and check the numbers.
High-intent buyers can validate accuracy in Sandbox, or request the full whitepaper and per-question CV breakdown for due diligence.