Validation & Accuracy · Lewis 1.0
How accurate is it, really?
Lewis 1.0 was evaluated across four independently-fielded polling panels covering Texas (UT/TPP), California (PPIC), the original Columbus / Atlanta / Dallas legacy set (Pew, Gallup, official election canvass), and a 22-Q pre-registered held-out sourced from Emerson, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council after training and calibration were frozen. Every number on this page is 5-fold cross-validated.
Public Pew benchmarks (Predictions) — timestamped Lewsearch runs and artifacts published next to cited Pew releases (synthetic directional reads; not a replication of probability-sample polling).
4.88%
Calibrated MAE — Texas state panel
UT/Texas Politics Project · 97 Q ex-electoral · 5-fold CV
3.43%
Best category — TX political approval
Live-poll margin-of-error territory (n=18)
7.47%
Pooled MAE across all panels
460 questions · ex-electoral · single calibrator · 5-fold CV
460
Independent benchmark questions
UT/TPP · PPIC · Pew / Gallup / canvass · pre-registered held-out
Plain English FAQ
“Under 5% error” — what does that actually mean?
If the UT/Texas Politics Project says 48% of Texans approve of a given policy, Lewis 1.0 will predict within roughly 4–5 percentage points on average across every answer option, and within 3.4 percentage points on political approval questions specifically (n=18, 5-fold CV). A live telephone poll typically has a margin of error of about ±3 percentage points from sampling alone, plus additional nonresponse bias on top. We are competitive with live fielding on our strongest question types.
Could you have just trained Lewis to memorize the answers?
This is the test to worry about, and the reason we moved to 5-fold cross-validation. Every published number is evaluated on questions the calibrator was not fit on in that fold: the calibrator is fit on the four training folds and scored on the fifth. The base model itself was trained with cutoffs that pre-date the panel field windows, so panel questions are not in the pretraining corpus. The 22-Q held-out panel goes further: it was sourced after training and calibration were frozen, and scored once. If Lewis had memorized answers, held-out error would sit far above the cross-validated figures. It is higher (9.97% ex-electoral against 7.47% pooled), but it stays within the range of the harder in-panel categories.
Why is California harder than Texas?
PPIC covers broad state-specific policy framings (water rights, SB-series housing bills, niche education funding tradeoffs) where options are compound and the population's exposure to the framing varies widely. UT/TPP is closer to classic approval / brand / directional questions on well-covered topics. We publish both because the gap between them brackets real-world performance: the Texas number is what well-grounded customer research looks like; the California number is what stretch questions look like.
What if I ask something Lewis wasn't benchmarked on?
Every result carries a confidence tier: High (Lewis lands under 6 pp MAE on comparable benchmark question types: approval, directional, well-exposed policy), Medium (6–10 pp: broader policy, brand), Low (weak topline signal within a High or Medium bucket), or Flag (10 pp or above: tight electoral margins, rapidly changing events, novel territory with no benchmark analogue). For custom brand, product, or ad-concept questions, Lewis extrapolates from the demographic profiles and news-fed context of each agent. Directional accuracy is typically strong, but treat absolute percentages as estimates.
Is this useful for financial or strategic decisions?
Yes, with appropriate framing. Lewsearch is a research signal, not a census. For directional decisions — which message lands better, which segment is most receptive, whether a policy has moved — the accuracy is strong enough to act on. For regulatory filings or published editorial polling, commission a live panel. For everything in between: minutes instead of weeks, at a fraction of the cost, is a strong starting point.
Creative Review Methodology
Qualitative signal, not a calibrated polling benchmark
Creative Review is built for open-ended feedback on ads, landing pages, pitch decks, product copy, and A/B messaging. Unlike structured Lewsearch polls, there is no external ground-truth percentage for “is this too technical?” or “what feels confusing?” The MAE benchmarks on this page apply to multiple-choice studies with known human-survey ground truth. Creative Review should be read as directional qualitative research: strengths, watchouts, sentiment, and illustrative synthetic respondent quotes.
For PDFs, images, and public URLs, Lewsearch first converts the material into a text description. PDFs and images are summarized by a vision-capable model before the Lewis agents respond; URLs are fetched for visible page text and summarized into a website brief. This lets the text-only Lewis panel react to visual hierarchy, copy, tone, CTAs, trust signals, and likely points of confusion. It is not a pixel-perfect usability lab or a human eye-tracking study.
The respondent quotes in Creative Review reports are generated by synthetic agents sampled from the selected market and optional demographic filter. They are useful for understanding likely reactions and language, but they are not verbatim human transcripts. Demographic notes in Creative Review are synthesized from the open-ended responses and agent profiles; they should be treated as pattern-finding, not audited subgroup crosstabs.
Appropriate use
Use Creative Review to decide what to clarify, which concept to iterate, which audience may be confused, and what to preserve before spending on media or live research. Use structured Lewsearch studies when you need calibrated percentage estimates against benchmarked multiple-choice question types.
Technical Detail
Pre-registered held-out: sourced after training was frozen
Our strictest honesty check is a 22-question benchmark drawn on April 18, 2026 from Emerson, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council, spanning five states and four categories. These questions were selected and written down after training and calibration were frozen. 8 of 22 were dropped by an automated pre-filter (past-election ground truths, extreme prior-delta outliers) and 14 were scored. Calibrated MAE on that truly held-out set: 10.68% overall, 9.97% ex-electoral, 7.24% on political approval. It is higher than our best in-panel numbers, and we publish both on purpose: the gap shows the real-world spread between well-specified panels and noisier edge cases.
Cross-validated — every number is out-of-sample
Every published MAE is a 5-fold cross-validation result. Calibration parameters are fit on 4 of 5 folds and evaluated on the 5th — never on data the calibrator saw. The Lewsearch production inference stack achieves 4.88% ex-electoral MAE on 97 Texas state-panel questions (UT/Texas Politics Project, 2024–25 waves), 7.77% on 148 California questions (PPIC 2024–25), 7.86% on the 151-Q Pew/Gallup/canvass tri-metro benchmark, and 9.97% on the pre-registered held-out. A single pooled calibrator trained on all 460 questions lands at 7.47% ex-electoral — the number customers actually get in production.
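The fold logic described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the calibrator here is a toy one-parameter shrinkage toward the uniform distribution (a stand-in, not Dirichlet calibration), and the questions are made-up data. All function names are hypothetical.

```python
import random

def mae(pred, truth):
    """Mean absolute error across answer options, in percentage points."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)

def shrink(pred, alpha):
    """Toy stand-in calibrator: blend a prediction with the uniform distribution."""
    uniform = 100.0 / len(pred)
    return [(1 - alpha) * p + alpha * uniform for p in pred]

def fit_alpha(questions):
    """Pick the shrinkage weight that minimises total MAE on the training folds."""
    grid = [i / 20 for i in range(21)]
    return min(grid, key=lambda a: sum(mae(shrink(p, a), t) for p, t in questions))

def cross_validated_mae(questions, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, pool the out-of-fold errors."""
    order = list(range(len(questions)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [questions[i] for i in order if i not in held]
        alpha = fit_alpha(train)  # the calibrator never sees the held-out fold
        errors += [mae(shrink(questions[i][0], alpha), questions[i][1]) for i in held_out]
    return sum(errors) / len(errors)

# Hypothetical (raw prediction, ground truth) pairs; each vector sums to 100.
rng = random.Random(1)
questions = []
for _ in range(25):
    a = rng.uniform(20, 70)
    truth = [a, 90 - a, 10.0]
    raw = [a + 8, 90 - a - 6, 8.0]  # deliberately biased raw output
    questions.append((raw, truth))

cv_mae = cross_validated_mae(questions)
print(f"out-of-fold MAE: {cv_mae:.2f} pp")
```

Every question is scored exactly once, by a calibrator fit without it, which is the property the published figures rely on.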
Independent state-level panels — not just our own survey
Beyond the legacy 151-Q Pew/Gallup/canvass benchmark, Lewis is evaluated on two independent academic state panels: the Public Policy Institute of California (PPIC) statewide survey and the University of Texas / Texas Politics Project panel. Both are fielded by third-party research organizations on YouGov infrastructure. Every benchmark question carries an exact field-date window and a full demographic target vector, so the agent pool is demographically matched question-by-question, not just city-by-city.
Verified against official election canvass data
Election-context questions use official canvass results as ground truth — Georgia Secretary of State (2022 Senate runoff, 2024 Presidential), Dallas County Elections (2024 Presidential, 2024 Senate), Franklin County Board of Elections (2023 mayoral), California Secretary of State (statewide propositions), Texas Secretary of State (2022/2024 statewide). These are exact tallies, not polls — the only category in our benchmark with zero ground-truth noise.
Post-hoc calibration — disclosed, reproducible, not per-customer
Raw Lewis outputs pass through a published post-hoc calibration technique (Dirichlet calibration, Kull, Perelló-Nieto, Filipović et al., NeurIPS 2019) fit under 5-fold cross-validation on our benchmark pool. The same calibrator ships to every customer; there is no per-panel retuning after the fact. Raw and calibrated MAE are published side-by-side. On Texas the calibrator recovers 2.2 pp of raw error across all 120 questions (8.56 → 6.35); on California, 5.7 pp; on the legacy 151-Q set, 2.7 pp. These are empirical reductions on independently-fielded panels.
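For readers unfamiliar with the technique, Dirichlet calibration maps a raw probability vector q to softmax(W ln q + b). The sketch below shows only that functional form. The weight matrices, bias vectors, and raw shares are hypothetical values chosen for illustration; the production parameters are not shown on this page.

```python
import math

def dirichlet_calibrate(probs, weights, bias, eps=1e-12):
    """Apply softmax(W @ log(probs) + b), the parametric form of Dirichlet calibration."""
    logs = [math.log(max(p, eps)) for p in probs]
    scores = [sum(w * x for w, x in zip(row, logs)) + b for row, b in zip(weights, bias)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw = [0.55, 0.35, 0.10]  # hypothetical raw shares for three answer options

# Identity weights and zero bias leave the prediction unchanged.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unchanged = dirichlet_calibrate(raw, identity, [0.0, 0.0, 0.0])

# Hypothetical parameters: diagonal weights below 1 flatten an overconfident output.
damped = [[0.7, 0.0, 0.0], [0.0, 0.7, 0.0], [0.0, 0.0, 0.7]]
flatter = dirichlet_calibrate(raw, damped, [0.0, 0.0, 0.0])
print([round(p, 3) for p in flatter])
```

Off-diagonal weights let the calibrator move mass between specific options, which is what distinguishes this family from simple temperature scaling.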
Scale-validated — apples-to-apples at n=1,000 vs n=10,000
In April 2026 we re-fielded the original 151-Q Pew/Gallup/canvass legacy panel at n=10,000 agents per question and ran the identical 5-fold CV Dirichlet calibrator on both n=1,000 and n=10,000 raw outputs across the 148 shared questions. Pooled calibrated MAE: 7.75% at n=1,000 vs 7.90% at n=10,000, a difference of 0.15 percentage points. Political approval (our largest and most-demanded category, n=47) improved by 0.56pp at n=10,000 (9.05% → 8.49%). Civic/legal improved by 0.81pp (n=7), and electoral held steady at ~4.3pp (n=6). Brand and directional moved up by 0.2–0.5pp (inside noise); policy regressed by 1.4pp. We are analyzing that category in our internal QA cycle and will publish updates here. The takeaway: accuracy is calibrator-driven, not sample-driven. Larger paid-tier studies mainly reduce simulation noise and improve crosstab stability; model uncertainty and category fit still dominate the final read.
Living agents — persistent memory and period-appropriate context
Lewis agents are not one-shot LLM personas. Each agent carries a persistent demographic profile and a memory of prior studies, so two customers asking related questions on the same panel get internally consistent behavior rather than independent re-sampled noise. Evaluation is anchored to the midpoint of the source survey's field-date window, with period-appropriate context injected at inference — a February 2024 PPIC question is answered from a February 2024 state of the world, not from today's headlines. That temporal discipline is what prevents benchmark leakage and keeps the MAE numbers honest.
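The midpoint anchoring can be illustrated with a few lines of date arithmetic. This is a sketch under stated assumptions: the function name and the example field window are hypothetical, and how the resulting date is used to select context is not shown here.

```python
from datetime import date, timedelta

def field_window_midpoint(start: date, end: date) -> date:
    """Return the calendar midpoint of a survey's field-date window (rounded down)."""
    if end < start:
        raise ValueError("field window ends before it starts")
    return start + timedelta(days=(end - start).days // 2)

# Hypothetical field window for a February 2024 survey wave.
anchor = field_window_midpoint(date(2024, 2, 6), date(2024, 2, 13))
print(anchor.isoformat())
```

Under this reading, context dated after the anchor would be withheld from the agents answering that question.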
Benchmark Panels
Calibrated accuracy by independent panel
Lewis 1.0 is evaluated on four independently-fielded panels, each from a different source: UT/Texas Politics Project (YouGov, 97 ex-electoral Qs), PPIC California (148 ex-electoral Qs), the Pew/Gallup/election-canvass tri-metro legacy set (151 Qs), and a pre-registered 22-Q held-out sourced after training froze (14 scored after filter). Different sponsors, different fieldwork methods, and different time windows mean the benchmark as a whole is not a single-source artifact.
Raw vs. Calibrated MAE (5-fold CV)
lower is better
Calibration drop (raw minus calibrated, all questions): UT/Texas -2.2pp · PPIC -5.7pp · Pew/Gallup/canvass legacy -2.7pp · Pre-registered -2.2pp · Legacy n=10,000 replication -2.9pp
| Panel | Scope | n | Raw MAE | Calibrated MAE | Excl. Electoral | Status |
|---|---|---|---|---|---|---|
| UT/Texas Politics Project | Texas statewide · 2024–25 | 120 | 8.56% | 6.35% | 4.88% | LIVE |
| PPIC California Statewide | California · 2024–25 | 175 | 13.69% | 8.01% | 7.77% | LIVE |
| Pew / Gallup / Canvass (Legacy) | Columbus · Atlanta · Dallas | 151 | 10.58% | 7.86% | 7.86% | LIVE |
| Pre-registered 22-Q held-out | OH · GA · TX · NY · CA | 22 (14 scored) | 12.86% | 10.68% | 9.97% | LIVE |
| Legacy 151-Q · n=10,000 replication | Columbus · Atlanta · Dallas | 148 | 10.78% | 7.90% | 7.90% | LIVE |
All figures are 5-fold cross-validated out-of-sample MAE. “Excl. electoral” removes tight-margin electoral contests, where voter-turnout uncertainty dominates MAE on any modeling approach. The UT/TPP panel covers Texas statewide · the PPIC panel covers California statewide · the legacy panel covers Columbus / Atlanta / Dallas.
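The calibration drop for each panel can be recomputed directly from the Raw MAE and Calibrated MAE columns of the table above. The snippet only restates the published figures; the dictionary layout is an illustrative choice.

```python
# (raw MAE, calibrated MAE) in percent, copied from the table above.
panels = {
    "UT/Texas Politics Project": (8.56, 6.35),
    "PPIC California Statewide": (13.69, 8.01),
    "Pew / Gallup / Canvass (Legacy)": (10.58, 7.86),
    "Pre-registered 22-Q held-out": (12.86, 10.68),
    "Legacy 151-Q, n=10,000 replication": (10.78, 7.90),
}

# Drop = raw minus calibrated, both measured on the same (all-question) set.
drops = {name: round(raw - cal, 2) for name, (raw, cal) in panels.items()}
for name, drop in drops.items():
    print(f"{name}: -{drop:.2f} pp")
```

Both columns cover all questions including electoral ones, so the subtraction compares like with like.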
Scale Validation · n=1,000 vs n=10,000
10× the agents, same accuracy, 3.2× sharper confidence
In April 2026 we spent ~$400 re-fielding the 151-Q legacy panel at n=10,000 agents per question (vs the n=1,000 standard) to answer a single structural question: does sample size move our headline MAE? We ran the identical 5-fold CV Dirichlet calibrator on both raw outputs across the 148 shared questions — fully apples-to-apples. The answer: pooled calibrated MAE went from 7.75% at n=1,000 to 7.90% at n=10,000, within 0.15pp and well inside the bootstrap envelope. Political approval (our largest category and primary customer use case) actually improved by 0.56pp at n=10k.
| Category | n | n=1k cal | n=10k cal | Δ | Note |
|---|---|---|---|---|---|
| Political approval | 47 | 9.05% | 8.49% | -0.56pp | Largest category. n=10k improves on our primary use case. |
| Civic / legal | 7 | 11.32% | 10.51% | -0.81pp | Small n, but moves in the right direction. |
| Electoral | 6 | 4.37% | 4.25% | -0.12pp | Already our cleanest category. Held steady and nudged lower. |
| Directional | 27 | 8.85% | 9.08% | +0.23pp | Inside bootstrap noise. |
| Brand | 33 | 6.68% | 7.18% | +0.50pp | Slight regression, still under 8% MAE. |
| Policy | 28 | 5.37% | 6.77% | +1.40pp | Only honest regression. Raw policy MAE more than doubled (6.30→13.38) — calibration absorbs most of it but not all. Under investigation. |
| Pooled | 148 | 7.75% | 7.90% | +0.15pp | Within bootstrap noise. Scale-stable. |
Same Lewis 1.0 model, same 148 questions, same 5-fold CV Dirichlet calibrator applied to raw outputs at both sample sizes. The honest conclusion: accuracy is calibrator-driven, not sample-driven. The real upside of paid-tier n=10,000 studies is lower simulation noise and more stable crosstabs; model uncertainty and category fit still dominate the final interpretation.
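The "3.2× sharper confidence" figure is consistent with the usual square-root scaling of sampling noise. The sketch below assumes each agent response can be treated as an independent draw, which is a simplification made here for illustration and not a claim about how Lewis agents are sampled; the function name is hypothetical.

```python
import math

def simulation_standard_error(share, n_agents):
    """Standard error, in percentage points, of a simulated share under independent draws."""
    return 100 * math.sqrt(share * (1 - share) / n_agents)

se_1k = simulation_standard_error(0.5, 1_000)
se_10k = simulation_standard_error(0.5, 10_000)
ratio = se_1k / se_10k  # equals sqrt(10) regardless of the share chosen

print(f"n=1,000: ±{se_1k:.2f} pp, n=10,000: ±{se_10k:.2f} pp, ratio {ratio:.2f}x")
```

This noise term shrinks with more agents, while the calibrated MAE reported above does not, which matches the conclusion that accuracy is calibrator-driven.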
Category Breakdown · State panels
Where Lewis is strongest — and where it isn't
A single headline hides the fact that Lewis is strongest on well-exposed, demographic-driven questions (political approval, directional indicators, brand preference) and weaker on compound policy framings and tight electoral margins. Below is the calibrated per-category MAE on both state panels, Texas (UT/TPP) and California (PPIC), so you can match the tool to the question you're asking. The pre-registered held-out follows a similar shape (political approval lowest, civic trust and electoral highest). The legacy 151-Q tri-metro panel does not: there, electoral is the lowest-error category (4.37%) and political approval sits at 9.05%.
| Category | TX (UT/TPP) n | TX (UT/TPP) MAE | CA (PPIC) n | CA (PPIC) MAE |
|---|---|---|---|---|
| Political approval | 18 | 3.43% | 15 | 7.94% |
| Directional / economic | 7 | 4.27% | 13 | 6.40% |
| Policy positions | 63 | 5.07% | 118 | 7.75% |
| Brand / reputation | 4 | 2.44% | 2 | 16.98% |
| Civic / legal | 5 | 6.92% | 2 | 12.44% |
| Electoral margin (hardest) | 23 | 12.23% | 25 | 9.30% |
Source: UT/Texas Politics Project (2024–25 waves) and PPIC California Statewide (2024–25 waves). 5-fold cross-validated, calibrated. Cells with small n (CA brand n=2, CA civic n=2) are shown with their point estimate but should be read with caution. TX brand 2.44% is on n=4 and within the sampling envelope of the TX approval 3.43% result.
Top Single-Question Results
The tail of the distribution, not the mean
Aggregate panel MAE (4.88% TX, 7.77% CA, 7.86% legacy 151-Q, 9.97% held-out; ex-electoral for the TX, CA, and held-out figures) is the honest average across every scored question, hard ones included. But the typical customer study sits in the lower tail: well-framed approval or policy questions on a demographically-matched panel. Below are the 10 individual questions where Lewis 1.0’s calibrated prediction landed closest to the truth. They are ranked mechanically from the full per-question results of the same calibration pipeline, not hand-picked. All out-of-sample, 5-fold CV.
| # | MAE | Panel | Category | Question |
|---|---|---|---|---|
| 01 | 0.40 pp | TX (UT/TPP) | Policy · 5-opt | “How concerned are you about the cost of higher education?” |
| 02 | 0.47 pp | CA (PPIC 2025) | Policy · 3-opt | California making its own climate policy separate from the federal government |
| 03 | 0.49 pp | CA (PPIC 2024) | Electoral · 3-opt | Prop 1 — $6.38B mental-health facilities bond |
| 04 | 0.60 pp | CA (PPIC 2025) | Policy · 3-opt | $7.1B Budget Stabilization proposal |
| 05 | 0.72 pp | CA (PPIC 2024) | Policy · 3-opt | Higher taxes + more services vs. lower taxes + fewer services |
| 06 | 0.79 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval on Venezuela |
| 07 | 0.94 pp | CA (PPIC 2025) | Policy · 3-opt | Taxes/services tradeoff (2025 replication) |
| 08 | 0.97 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval on health care |
| 09 | 1.02 pp | CA (PPIC 2025) | Policy · 3-opt | CA state actions protecting legal rights of undocumented immigrants |
| 10 | 1.06 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval (6-option grid) |
#1 — Option-by-option
“How concerned are you about the cost of higher education in Texas?”
UT/Texas Politics Project · February 2026 · n=1,300 registered voters
| Option | Lewis prediction | UT/TPP truth | Error |
|---|---|---|---|
| Very concerned | 39.1% | 39.2% | 0.1 pp |
| Somewhat concerned | 31.0% | 30.0% | 1.0 pp |
| Not too concerned | 15.9% | 16.6% | 0.7 pp |
| Not at all concerned | 9.5% | 9.4% | 0.1 pp |
| Don’t know | 4.6% | 4.8% | 0.2 pp |
Five options, every one within 1.0 percentage point of a YouGov-fielded survey of 1,300 Texans. This is the regime customers are actually in when they specify a real demographic target: not an average, but the floor of what a well-specified panel can do.
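Per-question MAE is the mean of the absolute option-level errors. A short check using the rounded figures printed in the table above gives roughly 0.42 pp; the published 0.40 pp presumably comes from unrounded inputs, which is an assumption on our part.

```python
# (Lewis prediction, UT/TPP truth) in percent, as printed in the table above.
options = {
    "Very concerned": (39.1, 39.2),
    "Somewhat concerned": (31.0, 30.0),
    "Not too concerned": (15.9, 16.6),
    "Not at all concerned": (9.5, 9.4),
    "Don't know": (4.6, 4.8),
}

# Absolute error per option, then the mean and the worst case across options.
errors = {name: abs(pred - truth) for name, (pred, truth) in options.items()}
question_mae = sum(errors.values()) / len(errors)
worst_option_error = max(errors.values())

print(f"MAE {question_mae:.2f} pp, worst option {worst_option_error:.1f} pp")
```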
Full per-question CV rankings available on request. Customer studies don’t carry ground truth (that’s why they’re run), so we can’t quote a per-study MAE; these benchmark tails are the proxy for what well-matched customer panels look like.
Case Study · Ultra-Specific Neighborhood
Short North Arts District, Columbus — 8.71% average MAE on a 6-block neighborhood
The Short North Arts District is roughly six blocks of High Street in downtown Columbus — not a city, not a metro, a neighborhood. In August 2023 the Short North Alliance commissioned JS&A Consulting to run a consumer study of 500+ visitors and Columbus-area residents. We scored Lewis 1.0 against five of those questions, with agents drawn from our Columbus panel and a neighborhood-level prompt tag. Ground truth below is the JS&A study. Lewis predictions use the production Dirichlet calibrator fit on the 460-Q benchmark pool — this neighborhood study was not part of that pool, so it functions as an external held-out spot check.
“Do you dine out more, about the same, or less often than before the pandemic?”
MAE 2.47pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| More often | 20.4% | 20% | 0.4pp |
| About the same | 43.3% | 40% | 3.3pp |
| Less often | 36.3% | 40% | 3.7pp |
“Do you believe Short North businesses are mostly locally owned, a mix, or mostly chains?”
MAE 7.40pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| Mostly locally owned | 48.9% | 60% | 11.1pp |
| A mix of both | 37.4% | 30% | 7.4pp |
| Mostly chains/franchises | 13.7% | 10% | 3.7pp |
“Compared to a few years ago, are you more likely or less likely to visit the Short North?”
MAE 12.93pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| More likely | 19.4% | 15% | 4.4pp |
| About the same | 36.0% | 21% | 15.0pp |
| Less likely | 44.6% | 64% | 19.4pp |
“How would you describe the Short North’s economic impact on Columbus?”
MAE 10.78pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| Major economic driver | 38.8% | 55% | 16.2pp |
| Moderate contributor | 45.4% | 35% | 10.4pp |
| Minor / negligible | 15.7% | 10% | 5.7pp |
“Biggest reason you might visit the Short North less often?”
MAE 9.95pp
| Option | Lewis 1.0 | JS&A truth | Err |
|---|---|---|---|
| Safety concerns | 28.2% | 35% | 6.8pp |
| Parking difficulty | 25.3% | 15% | 10.3pp |
| Too expensive | 24.0% | 10% | 14.0pp |
| I go out less in general | 12.0% | 30% | 18.0pp |
| Other reasons | 10.6% | 10% | 0.6pp |
Takeaway
Strong on "dining frequency" (2.47pp) and "local vs. chain perception" (7.40pp); both are classic consumer-sentiment questions well-anchored in demographic priors and period-appropriate local context. Weaker on "likelihood to visit" (12.93pp), "economic impact" (10.78pp), and "biggest reason to visit less" (9.95pp); for the first and last of these, ground truth itself was estimated from narrative coverage rather than direct tabulation. This is the real neighborhood-level regime: customers who want to test a brand question at a 6-block scale, demographically matched to real residents, with honest error bars on what we do and don't nail. Average calibrated MAE across the five: 8.71pp.
Source: Short North Alliance & JS&A Consulting Market & Consumer Study, Aug 2023 (n=500+). Lewis 1.0 predictions calibrated on held-out folds of the benchmark pool, 5-fold CV. One further binary-option SNA question (“defining role in Columbus's identity”) is in-benchmark but sits in a calibration bucket outside the 5 shown — it will be published when that bucket has full n-of-bucket coverage.
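The five per-question figures and their average can be rebuilt from the option tables above. The sketch uses the rounded values as printed, so it lands within about 0.01 pp of the published numbers (for example, roughly 8.70 against 8.71 for the average); the short question labels are ours.

```python
# (Lewis 1.0, JS&A truth) pairs in percent for each question, as printed above.
short_north = {
    "dining frequency": [(20.4, 20), (43.3, 40), (36.3, 40)],
    "local vs. chain": [(48.9, 60), (37.4, 30), (13.7, 10)],
    "likelihood to visit": [(19.4, 15), (36.0, 21), (44.6, 64)],
    "economic impact": [(38.8, 55), (45.4, 35), (15.7, 10)],
    "reason to visit less": [(28.2, 35), (25.3, 15), (24.0, 10), (12.0, 30), (10.6, 10)],
}

def question_mae(pairs):
    """Mean absolute error across the answer options of one question."""
    return sum(abs(pred - truth) for pred, truth in pairs) / len(pairs)

per_question = {name: question_mae(pairs) for name, pairs in short_north.items()}
average = sum(per_question.values()) / len(per_question)  # unweighted mean of question MAEs

for name, value in per_question.items():
    print(f"{name}: {value:.2f} pp")
print(f"average: {average:.2f} pp")
```

The average weights each question equally, regardless of how many answer options it has.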
Case Studies
CASE STUDY · POLITICAL APPROVAL
TX political approval — 18 questions, aggregate 3.43% MAE
Setup
UT/Texas Politics Project approval-rating block across 2024–25 field waves — Governor, Lt. Governor, Legislature, US Senators, economic direction. 3 options (approve / disapprove / don't know). 1,000 TX residents per question, demographically matched to UT/TPP target margins (party ID × age × race × education × region).
Result
Aggregate calibrated MAE across all 18 questions: 3.43 pp (5-fold CV, Dirichlet calibration). Best individual question: 0.8 pp. Worst: 6.2 pp. Baseline (pre-calibration) MAE: 7.02 pp — calibration recovered more than half the raw error.
Takeaway
On the categorical backbone of most real survey research — approval ratings of well-known public figures and institutions — Lewis sits squarely in live-poll margin-of-error territory. This is the strongest signal for customers running messaging / approval / perception studies.
CASE STUDY · BRAND / ADVERTISING
TX grocery brand preference, late 2024
Setup
UT/TPP brand-reputation block. 6 options including 'Don't know / none'. Lewis 1.0 agent pool: 1,000 TX shoppers, income-tilted per retail demographics.
Result
Calibrated MAE across all 6 options: 4.14 pp. Lewis correctly rank-ordered top-3 brands and flagged the two long-tail options as below-10% territory.
Takeaway
Brand benchmarks reliably pick the leader and runner-up. For A/B ad concept testing, directional accuracy across demographics is the decision-grade signal.
CASE STUDY · PRE-REGISTERED HELD-OUT
22-Q independent benchmark — 9.97% ex-electoral
Setup
On April 18, 2026 we pre-registered a fresh 22-question benchmark drawn from Emerson College, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council — five states (OH/GA/TX/NY/CA) and four categories (political approval / policy / civic trust / electoral). These questions were sourced AFTER training and calibration fits were frozen. An automated filter excluded 8 questions (past-election ground truths and extreme prior-delta outliers), leaving 14 scored.
Result
Calibrated MAE on the 14 scored questions: 10.68% overall, 9.97% ex-electoral. Political approval: 7.24 pp (n=5). Policy: 9.53 pp (n=4). Civic trust: 15.10 pp (n=3). Electoral: 14.90 pp (n=2).
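The overall and ex-electoral figures can be approximately recovered from the category results, assuming the pooled number is a question-weighted mean of the category MAEs (an assumption; the exact pooling method is not stated here). With the rounded category values this gives about 10.67 and 9.97.

```python
# category: (number of scored questions, calibrated MAE in pp), from the result above.
categories = {
    "political approval": (5, 7.24),
    "policy": (4, 9.53),
    "civic trust": (3, 15.10),
    "electoral": (2, 14.90),
}

def pooled_mae(cats):
    """Question-weighted mean of category MAEs."""
    total_questions = sum(n for n, _ in cats.values())
    return sum(n * value for n, value in cats.values()) / total_questions

overall = pooled_mae(categories)
ex_electoral = pooled_mae({k: v for k, v in categories.items() if k != "electoral"})

print(f"overall {overall:.2f} pp, ex-electoral {ex_electoral:.2f} pp")
```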
Takeaway
This is our strictest possible honesty check — a small, truly held-out set with no fitting after the fact. The gap between 4.88% on the Texas panel and 9.97% on the 22-Q held-out is the real-world variance between core use cases (familiar territory, in-distribution) and edge cases (novel state-specific policy, tight electoral margins). We publish both.
Disclosures
AI-generated data
All Lewsearch panel responses are generated by AI, not collected from human participants. Every report carries a mandatory disclosure to this effect.
Not a replacement for live surveys
Synthetic results are not appropriate for legal, regulatory, or journalistic contexts requiring probability sampling from live populations.
Post-hoc calibration applied
Raw Lewis outputs pass through a published post-hoc calibration technique (Dirichlet calibration) fit on held-out folds of the benchmark pool. The same calibrator ships to every customer — there is no per-panel retuning after the fact. Raw and calibrated MAE are published side-by-side on the methodology page so reviewers can see what the calibration is and isn't doing.
Geographic coverage
10 live markets across 5 states: Ohio (Columbus · Cleveland · Ashtabula · statewide), Georgia (Atlanta · statewide), Texas (Dallas · statewide UT/TPP), California statewide (PPIC), New York statewide. Benchmark panels drove calibration; additional live markets use the same production calibrator.
Model versioning
Reports indicate the Lewis version at time of fielding. Lewis 1.0 is the current production model across all benchmark panels. Weights update on a continuous news-and-simulation retrain cycle; every update re-runs the full CV pipeline before promotion.
Confidence scoring
Every result ships with a confidence tier (High / Medium / Low / Flag). Tier is derived from two signals: (1) calibrated CV MAE for the matching question-type bucket, and (2) topline signal strength for this study (lead margin, subgroup consistency). We take the more conservative of the two. High = <6 pp bucket MAE and strong signal. Medium = 6–10 pp bucket MAE or moderate signal. Low = weak signal within a medium/high bucket (low lead margin, within-MoE topline). Flag = bucket MAE ≥10 pp, or the question fell outside any analogous calibrated bucket — treat as directional only.
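One way to read the tier rules above is as a small decision function. This is a sketch of our interpretation, not the production scoring code: the function name is hypothetical, and the signal strength is taken as an already-classified label because the thresholds for lead margin and subgroup consistency are not given here.

```python
from typing import Optional

def confidence_tier(bucket_mae_pp: Optional[float], signal: str) -> str:
    """Combine bucket accuracy and topline signal, keeping the more conservative read.

    bucket_mae_pp is None when no analogous calibrated bucket exists.
    signal is "strong", "moderate", or "weak" (a hypothetical pre-classified label).
    """
    if signal not in {"strong", "moderate", "weak"}:
        raise ValueError(f"unknown signal label: {signal!r}")
    if bucket_mae_pp is None or bucket_mae_pp >= 10:
        return "Flag"    # poor or missing bucket: directional only
    if signal == "weak":
        return "Low"     # usable bucket, but the topline is within noise
    if bucket_mae_pp < 6 and signal == "strong":
        return "High"
    return "Medium"      # 6-10 pp bucket, or only a moderate signal

print(confidence_tier(3.43, "strong"))    # TX political approval bucket
print(confidence_tier(7.75, "strong"))    # CA policy bucket
print(confidence_tier(12.23, "strong"))   # TX electoral bucket
```

Checking the Flag condition first means a strong topline can never rescue a question type the benchmark handles poorly.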
Why we publish raw + calibrated side-by-side
Raw MAE shows the base model's honesty. Calibrated MAE shows the production system's accuracy. Publishing both ensures customers and investors can judge whether our calibration is doing legitimate work or papering over structural bias. It is the former.
Due Diligence
Need the full methodology?
We share our complete calibration pipeline, full per-question CV breakdown, and benchmark dataset with researchers, investors, and enterprise clients.