Validation & Accuracy · Lewis 1.0

How accurate is it, really?

Lewis 1.0 was evaluated across four independently fielded polling panels: Texas (UT/TPP), California (PPIC), the original Columbus / Atlanta / Dallas legacy set (Pew, Gallup, official election canvass), and a 22-Q pre-registered held-out set sourced from Emerson, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council after training and calibration were frozen. Every number on this page is 5-fold cross-validated.

Public Pew benchmarks (Predictions): timestamped Lewsearch runs and artifacts published next to cited Pew releases (synthetic directional reads; not a replication of probability-sample polling).

4.88%

Calibrated MAE — Texas state panel

UT/Texas Politics Project · 97 Q ex-electoral · 5-fold CV

3.43%

Best category — TX political approval

Live-poll margin-of-error territory (n=18)

7.47%

Pooled MAE across all panels

460 questions · ex-electoral · single calibrator · 5-fold CV

460

Independent benchmark questions

UT/TPP · PPIC · Pew / Gallup / canvass · pre-registered held-out

Plain English FAQ

“Under 5% error” — what does that actually mean?

If the UT/Texas Politics Project says 48% of Texans approve of a given policy, Lewis 1.0 will predict within roughly 4–5 percentage points on average across every answer option, and within 3.4 percentage points on political approval questions specifically (n=18, 5-fold CV). A live telephone poll typically carries a margin of error of about ±3 percentage points from sampling alone, plus nonresponse bias on top. We are competitive with live fielding on our strongest question types.

Could you have just trained Lewis to memorize the answers?

This is the test to worry about, and the reason we moved to 5-fold cross-validation. Every published number is evaluated on questions the calibrator was not fit on in that fold — the calibrator sees fold-in questions and is scored on fold-out questions. The base model itself was trained with cutoffs that pre-date the panel field windows, so panel questions are not in the pretraining corpus. The 22-Q held-out panel goes further: it was sourced after training and calibration were frozen, and scored once. If Lewis had memorized, held-out numbers would diverge from fold-in numbers. They don't.

Why is California harder than Texas?

PPIC covers broad state-specific policy framings — water rights, SB-series housing bills, niche education funding tradeoffs — where options are compound and the population's exposure to the framing varies widely. UT/TPP is closer to classic approval / brand / directional questions on well-covered topics. We publish both because the gap between them is the real-world ceiling: the Texas number is what well-grounded customer research looks like; the California number is what stretch questions look like.

What if I ask something Lewis wasn't benchmarked on?

Every result carries a confidence tier: High (Lewis lands under 6 pp MAE on comparable benchmark question types: approval, directional, well-exposed policy), Medium (6–10 pp: broader policy, brand), Low (a weak topline signal within a High or Medium bucket), or Flag (10 pp or above: tight electoral margins, rapidly changing events, novel territory with no benchmark analogue). For custom brand / product / ad-concept questions Lewis extrapolates from the demographic profiles and news-fed context of each agent; directional accuracy is typically strong, but treat absolute percentages as estimates.

Is this useful for financial or strategic decisions?

Yes, with appropriate framing. Lewsearch is a research signal, not a census. For directional decisions — which message lands better, which segment is most receptive, whether a policy has moved — the accuracy is strong enough to act on. For regulatory filings or published editorial polling, commission a live panel. For everything in between: minutes instead of weeks, at a fraction of the cost, is a strong starting point.

Creative Review Methodology

Qualitative signal, not a calibrated polling benchmark

Creative Review is built for open-ended feedback on ads, landing pages, pitch decks, product copy, and A/B messaging. Unlike structured Lewsearch polls, there is no external ground-truth percentage for “is this too technical?” or “what feels confusing?” The MAE benchmarks on this page apply to multiple-choice studies with known human-survey ground truth. Creative Review should be read as directional qualitative research: strengths, watchouts, sentiment, and illustrative synthetic respondent quotes.

For PDFs, images, and public URLs, Lewsearch first converts the material into a text description. PDFs and images are summarized by a vision-capable model before the Lewis agents respond; URLs are fetched for visible page text and summarized into a website brief. This makes the text-only Lewis panel able to react to visual hierarchy, copy, tone, CTAs, trust signals, and likely points of confusion. It is not a pixel-perfect usability lab or a human eye-tracking study.

The respondent quotes in Creative Review reports are generated by synthetic agents sampled from the selected market and optional demographic filter. They are useful for understanding likely reactions and language, but they are not verbatim human transcripts. Demographic notes in Creative Review are synthesized from the open-ended responses and agent profiles; they should be treated as pattern-finding, not audited subgroup crosstabs.

Appropriate use

Use Creative Review to decide what to clarify, which concept to iterate, which audience may be confused, and what to preserve or change before spending on media or live research. Use structured Lewsearch studies when you need calibrated percentage estimates against benchmarked multiple-choice question types.

Technical Detail

01

Pre-registered held-out: sourced after training was frozen

Our strictest honesty check is a 22-question benchmark drawn on April 18, 2026 from Emerson, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council — five states and four categories. These questions were selected and written down after training and calibration were frozen. 8 of 22 were dropped by an automated pre-filter (past-election ground truths, extreme prior-delta outliers) and 14 were scored. Calibrated MAE on that truly held-out set: 10.68% overall, 9.97% ex-electoral, 7.24% on political approval. It is higher than our best in-panel numbers on purpose: we publish both because the gap shows the real-world spread between well-specified panels and noisier edge cases.

02

Cross-validated — every number is out-of-sample

Every published MAE is a 5-fold cross-validation result. Calibration parameters are fit on 4 of 5 folds and evaluated on the 5th — never on data the calibrator saw. The Lewsearch production inference stack achieves 4.88% ex-electoral MAE on 97 Texas state-panel questions (UT/Texas Politics Project, 2024–25 waves), 7.77% on 148 California questions (PPIC 2024–25), 7.86% on the 151-Q Pew/Gallup/canvass tri-metro benchmark, and 9.97% on the pre-registered held-out. A single pooled calibrator trained on all 460 questions lands at 7.47% ex-electoral — the number customers actually get in production.
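The fold discipline described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the calibrator here is a hypothetical stand-in (a single additive bias correction) rather than Dirichlet calibration, the fold assignment is a simple interleave, and the toy data is invented for the example.

```python
from statistics import mean


def kfold_indices(n, k=5):
    """Split range(n) into k interleaved folds (deterministic, for illustration)."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_bias(pairs):
    """Stand-in calibrator: mean signed error (raw prediction minus truth)."""
    return mean(pred - truth for pred, truth in pairs)


def out_of_fold_mae(pairs, k=5):
    """Fit on k-1 folds, score on the remaining fold, and pool the errors."""
    errors = []
    for held_out in kfold_indices(len(pairs), k):
        held = set(held_out)
        train = [pairs[i] for i in range(len(pairs)) if i not in held]
        bias = fit_bias(train)  # the calibrator never sees the held-out fold
        errors.extend(abs((pairs[i][0] - bias) - pairs[i][1]) for i in held_out)
    return mean(errors)


# Hypothetical toy data: each raw prediction is 5 points above the truth.
truths = (40.0, 45.0, 50.0, 55.0, 60.0, 35.0, 48.0, 52.0, 58.0, 42.0)
toy = [(t + 5.0, t) for t in truths]
cv_mae = out_of_fold_mae(toy)
print(f"out-of-fold MAE: {cv_mae:.2f} pp")
```

Because every error is computed on a question the calibrator did not see in that fold, the pooled figure is an out-of-sample estimate by construction.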

03

Independent state-level panels — not just our own survey

Beyond the legacy 151-Q Pew/Gallup/canvass benchmark, Lewis is evaluated on two independent academic state panels: the Public Policy Institute of California (PPIC) statewide survey and the University of Texas / Texas Politics Project panel. Both are fielded by third-party research organizations on YouGov infrastructure. Every benchmark question carries an exact field-date window and a full demographic target vector, so the agent pool is demographically matched question-by-question, not just city-by-city.

04

Verified against official election canvass data

Election-context questions use official canvass results as ground truth — Georgia Secretary of State (2022 Senate runoff, 2024 Presidential), Dallas County Elections (2024 Presidential, 2024 Senate), Franklin County Board of Elections (2023 mayoral), California Secretary of State (statewide propositions), Texas Secretary of State (2022/2024 statewide). These are exact tallies, not polls — the only category in our benchmark with zero ground-truth noise.

05

Post-hoc calibration — disclosed, reproducible, not per-customer

Raw Lewis outputs pass through a published post-hoc calibration technique (Dirichlet calibration, Kull, Perelló-Nieto, Kängsepp et al., NeurIPS 2019) fit under 5-fold cross-validation on our benchmark pool. The same calibrator ships to every customer; there is no per-panel retuning after the fact. Raw and calibrated MAE are published side-by-side. On Texas the calibrator recovers 2.2 pp of raw error across all questions (8.56 → 6.35; 4.88 ex-electoral); on California, 5.7 pp; on the legacy 151-Q set, 2.7 pp. These are empirical reductions on independently fielded panels.
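For readers unfamiliar with the technique, the forward map of Dirichlet calibration in its linear parametrization is a softmax over a linear transform of log-probabilities, q = softmax(W · ln p + b). The sketch below shows only that map. The weight matrix and bias values are hypothetical placeholders, not fitted production parameters, and fitting W and b is out of scope here.

```python
import math


def dirichlet_calibrate(probs, weights, bias):
    """Apply q = softmax(W . ln p + b) to one vector of option shares."""
    eps = 1e-12  # guard against log(0) for options with zero raw share
    log_p = [math.log(max(p, eps)) for p in probs]
    logits = [
        sum(w * lp for w, lp in zip(row, log_p)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical parameters: identity weights and zero bias leave shares unchanged.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
raw = [0.5, 0.3, 0.2]
unchanged = dirichlet_calibrate(raw, identity, [0.0, 0.0, 0.0])

# A positive bias on the second option shifts share toward it.
shifted = dirichlet_calibrate(raw, identity, [0.0, 0.5, 0.0])
```

With identity weights and zero bias the map returns the input shares, which is a convenient sanity check before plugging in fitted parameters.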

06

Scale-validated — apples-to-apples at n=1,000 vs n=10,000

In April 2026 we re-fielded the original 151-Q Pew/Gallup/canvass legacy panel at n=10,000 agents per question and ran the identical 5-fold CV Dirichlet calibrator on both n=1,000 and n=10,000 raw outputs across the 148 shared questions. Pooled calibrated MAE: 7.75% at n=1,000 vs 7.90% at n=10,000, a difference of 0.15 percentage points. Political approval (our largest and most-demanded category, n=47) improved by 0.56pp at n=10,000 (9.05% → 8.49%). Civic/legal improved by 0.81pp (n=7); electoral held steady at ~4.3pp (n=6). Brand and directional rose by 0.2–0.5pp (inside noise); policy regressed by 1.4pp. We are analyzing that category in our internal QA cycle and will publish updates here. The takeaway: accuracy is calibrator-driven, not sample-driven. Larger paid-tier studies mainly reduce simulation noise and improve crosstab stability; model uncertainty and category fit still dominate the final read.

07

Living agents — persistent memory and period-appropriate context

Lewis agents are not one-shot LLM personas. Each agent carries a persistent demographic profile and a memory of prior studies, so two customers asking related questions on the same panel get internally consistent behavior rather than independent re-sampled noise. Evaluation is anchored to the midpoint of the source survey's field-date window, with period-appropriate context injected at inference — a February 2024 PPIC question is answered from a February 2024 state of the world, not from today's headlines. That temporal discipline is what prevents benchmark leakage and keeps the MAE numbers honest.
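As a small illustration of the anchoring rule, the midpoint of a field-date window can be computed as below. The function name and the example dates are hypothetical; for windows with an odd number of days this sketch rounds down to the earlier day, which is an assumption rather than a documented production rule.

```python
from datetime import date, timedelta


def field_window_midpoint(start: date, end: date) -> date:
    """Return the calendar day halfway through a survey field window."""
    if end < start:
        raise ValueError("field window ends before it starts")
    half_span = timedelta(days=(end - start).days // 2)  # whole days, rounded down
    return start + half_span


# Hypothetical window for illustration only.
anchor = field_window_midpoint(date(2024, 2, 6), date(2024, 2, 13))
print(anchor.isoformat())
```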

Benchmark Panels

Calibrated accuracy by independent panel

Lewis 1.0 is evaluated on four independently-fielded panels. Each is fielded by a different organization — UT/Texas Politics Project (YouGov, 97 Qs), PPIC California (148 Qs), the Pew/Gallup/election-canvass tri-metro legacy set (151 Qs), and a pre-registered 22-Q held-out sourced after training froze (14 scored after filter). Different sponsors, different fieldwork methods, different time windows — so the benchmark as a whole is not a single-source artifact.

Raw vs. Calibrated MAE (5-fold CV)

lower is better

Calibration drop (raw to calibrated, all questions): UT/Texas -2.2pp · PPIC -5.7pp · Pew / Gallup / Canvass (Legacy) -2.7pp · Pre-registered -2.2pp · Legacy n=10,000 replication -2.9pp

Panel | Scope | n | Raw MAE | Calibrated MAE | Excl. Electoral | Status
UT/Texas Politics Project | Texas statewide · 2024–25 | 120 | 8.56% | 6.35% | 4.88% | LIVE
PPIC California Statewide | California · 2024–25 | 175 | 13.69% | 8.01% | 7.77% | LIVE
Pew / Gallup / Canvass (Legacy) | Columbus · Atlanta · Dallas | 151 | 10.58% | 7.86% | 7.86% | LIVE
Pre-registered 22-Q held-out | OH · GA · TX · NY · CA | 22 | 12.86% | 10.68% | 9.97% | LIVE
Legacy 151-Q · n=10,000 replication | Columbus · Atlanta · Dallas | 148 | 10.78% | 7.90% | 7.90% | LIVE

All figures are 5-fold cross-validated out-of-sample MAE. “Excl. electoral” removes tight-margin electoral contests, where voter-turnout uncertainty dominates MAE on any modeling approach. The UT/TPP panel covers Texas statewide · the PPIC panel covers California statewide · the legacy panel covers Columbus / Atlanta / Dallas.

Scale Validation · n=1,000 vs n=10,000

10× the agents, same accuracy, 3.2× sharper confidence

In April 2026 we spent ~$400 re-fielding the 151-Q legacy panel at n=10,000 agents per question (vs the n=1,000 standard) to answer a single structural question: does sample size move our headline MAE? We ran the identical 5-fold CV Dirichlet calibrator on both raw outputs across the 148 shared questions — fully apples-to-apples. The answer: pooled calibrated MAE went from 7.75% at n=1,000 to 7.90% at n=10,000, within 0.15pp and well inside the bootstrap envelope. Political approval (our largest category and primary customer use case) actually improved by 0.56pp at n=10k.

Category | n | n=1k cal | n=10k cal | Δ | Note
Political approval | 47 | 9.05% | 8.49% | -0.56pp | Largest category. Improves at n=10k on our primary use case.
Civic / legal | 7 | 11.32% | 10.51% | -0.81pp | Small n, but moves in the right direction.
Electoral | 6 | 4.37% | 4.25% | -0.12pp | Already our cleanest category. Held steady and nudged lower.
Directional | 27 | 8.85% | 9.08% | +0.23pp | Inside bootstrap noise.
Brand | 33 | 6.68% | 7.18% | +0.50pp | Slight regression, still under 8% MAE.
Policy | 28 | 5.37% | 6.77% | +1.40pp | Only honest regression. Raw policy MAE more than doubled (6.30→13.38); calibration absorbs most of it but not all. Under investigation.
Pooled | 148 | 7.75% | 7.90% | +0.15pp | Within bootstrap noise. Scale-stable.

Same Lewis 1.0 model, same 148 questions, same 5-fold CV Dirichlet calibrator applied to raw outputs at both sample sizes. The honest conclusion: accuracy is calibrator-driven, not sample-driven. The real upside of paid-tier n=10,000 studies is lower simulation noise and more stable crosstabs; model uncertainty and category fit still dominate the final interpretation.
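The "3.2× sharper confidence" figure is consistent with the textbook square-root rule for sampling noise: a 10× larger sample narrows the sampling interval by √10 ≈ 3.16. The sketch below assumes simulated agents behave like independent simple-random draws, which is a simplification; the function name and the 95% z-value are illustrative choices, not production settings.

```python
import math


def sampling_moe_pp(share, n, z=1.96):
    """Approximate margin of error, in percentage points, for a share at sample size n."""
    return z * math.sqrt(share * (1.0 - share) / n) * 100.0


moe_1k = sampling_moe_pp(0.5, 1_000)    # worst case: a 50/50 split
moe_10k = sampling_moe_pp(0.5, 10_000)
ratio = moe_1k / moe_10k                # equals sqrt(10) under these assumptions

print(f"n=1,000: ±{moe_1k:.2f} pp, n=10,000: ±{moe_10k:.2f} pp, ratio {ratio:.2f}x")
```

This only describes simulation noise. It says nothing about model error, which is why the calibrated MAE barely moves between the two sample sizes.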

Category Breakdown · State panels

Where Lewis is strongest — and where it isn't

A single headline hides the fact that Lewis is strongest on well-exposed, demographic-driven questions (political approval, directional indicators, brand preference) and weaker on compound policy framings and tight electoral margins. Below is the calibrated per-category MAE on both state panels, Texas (UT/TPP) and California (PPIC), so you can match the tool to the question you're asking. The pre-registered held-out has a similar shape (political approval lowest at 7.24 pp, civic trust and electoral highest). The legacy 151-Q tri-metro panel differs: its electoral questions are scored against official canvass tallies and form its cleanest category (4.37%).

Category | TX (UT/TPP) n | TX (UT/TPP) MAE | CA (PPIC) n | CA (PPIC) MAE
Political approval | 18 | 3.44% | 15 | 7.94%
Directional / economic | 7 | 4.27% | 13 | 6.40%
Policy positions | 63 | 5.07% | 118 | 7.75%
Brand / reputation | 4 | 2.44% | 2 | 16.98%
Civic / legal | 5 | 6.92% | 2 | 12.44%
Electoral margin (hardest) | 23 | 12.23% | 25 | 9.30%

Source: UT/Texas Politics Project (2024–25 waves) and PPIC California Statewide (2024–25 waves). 5-fold cross-validated, calibrated. Cells with small n (CA brand n=2, CA civic n=2) are shown with their point estimate but should be read with caveat. TX brand 2.44% is on n=4 and within the sampling envelope of the TX approval 3.44% result.

Top Single-Question Results

The tail of the distribution, not the mean

Aggregate panel MAE (4.88% TX, 7.77% CA, 7.86% legacy 151-Q, 9.97% held-out) is the honest average — including the hardest electoral-margin outliers. But the modal customer study sits in the lower tail: well-framed approval or policy questions on a demographically-matched panel. Below are the 10 individual questions where Lewis 1.0’s calibrated prediction landed closest to the truth — not hand-picked, these are the natural tail of a fully-fit calibration pipeline applied to every question. All out-of-sample, 5-fold CV.

# | MAE | Panel | Category | Question
01 | 0.40 pp | TX (UT/TPP) | Policy · 5-opt | “How concerned are you about the cost of higher education?”
02 | 0.47 pp | CA (PPIC 2025) | Policy · 3-opt | California making its own climate policy separate from the federal government
03 | 0.49 pp | CA (PPIC 2024) | Electoral · 3-opt | Prop 1: $6.38B mental-health facilities bond
04 | 0.60 pp | CA (PPIC 2025) | Policy · 3-opt | $7.1B Budget Stabilization proposal
05 | 0.72 pp | CA (PPIC 2024) | Policy · 3-opt | Higher taxes + more services vs. lower taxes + fewer services
06 | 0.79 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval on Venezuela
07 | 0.94 pp | CA (PPIC 2025) | Policy · 3-opt | Taxes/services tradeoff (2025 replication)
08 | 0.97 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval on health care
09 | 1.02 pp | CA (PPIC 2025) | Policy · 3-opt | CA state actions protecting legal rights of undocumented immigrants
10 | 1.06 pp | TX (UT/TPP) | Political approval · 6-opt | Presidential approval (6-option grid)

#1 — Option-by-option

“How concerned are you about the cost of higher education in Texas?”

UT/Texas Politics Project · February 2026 · n=1,300 registered voters

Option | Lewis prediction | UT/TPP truth | Error
Very concerned | 39.1% | 39.2% | 0.1 pp
Somewhat concerned | 31.0% | 30.0% | 1.0 pp
Not too concerned | 15.9% | 16.6% | 0.7 pp
Not at all concerned | 9.5% | 9.4% | 0.1 pp
Don’t know | 4.6% | 4.8% | 0.2 pp

Five options, every one within 1.0 percentage point of a YouGov-fielded survey of 1,300 Texans. This is the regime customers are actually in when they specify a real demographic target: not an average, but the floor of what a well-specified panel can do.

Full per-question CV rankings available on request. Customer studies don’t carry ground truth (that’s why they’re run), so we can’t quote a per-study MAE; these benchmark tails are the proxy for what well-matched customer panels look like.

Case Study · Ultra-Specific Neighborhood

Short North Arts District, Columbus — 8.71% average MAE on a 6-block neighborhood

The Short North Arts District is roughly six blocks of High Street in downtown Columbus — not a city, not a metro, a neighborhood. In August 2023 the Short North Alliance commissioned JS&A Consulting to run a consumer study of 500+ visitors and Columbus-area residents. We scored Lewis 1.0 against five of those questions, with agents drawn from our Columbus panel and a neighborhood-level prompt tag. Ground truth below is the JS&A study. Lewis predictions use the production Dirichlet calibrator fit on the 460-Q benchmark pool — this neighborhood study was not part of that pool, so it functions as an external held-out spot check.

“Do you dine out more, about the same, or less often than before the pandemic?”

MAE 2.47pp

Option | Lewis 1.0 | JS&A truth | Err
More often | 20.4% | 20% | 0.4pp
About the same | 43.3% | 40% | 3.3pp
Less often | 36.3% | 40% | 3.7pp

“Do you believe Short North businesses are mostly locally owned, a mix, or mostly chains?”

MAE 7.40pp

Option | Lewis 1.0 | JS&A truth | Err
Mostly locally owned | 48.9% | 60% | 11.1pp
A mix of both | 37.4% | 30% | 7.4pp
Mostly chains/franchises | 13.7% | 10% | 3.7pp

“Compared to a few years ago, are you more likely or less likely to visit the Short North?”

MAE 12.93pp

Option | Lewis 1.0 | JS&A truth | Err
More likely | 19.4% | 15% | 4.4pp
About the same | 36.0% | 21% | 15.0pp
Less likely | 44.6% | 64% | 19.4pp

“How would you describe the Short North’s economic impact on Columbus?”

MAE 10.78pp

Option | Lewis 1.0 | JS&A truth | Err
Major economic driver | 38.8% | 55% | 16.2pp
Moderate contributor | 45.4% | 35% | 10.4pp
Minor / negligible | 15.7% | 10% | 5.7pp

“Biggest reason you might visit the Short North less often?”

MAE 9.95pp

Option | Lewis 1.0 | JS&A truth | Err
Safety concerns | 28.2% | 35% | 6.8pp
Parking difficulty | 25.3% | 15% | 10.3pp
Too expensive | 24.0% | 10% | 14.0pp
I go out less in general | 12.0% | 30% | 18.0pp
Other reasons | 10.6% | 10% | 0.6pp

Takeaway

Strong on "dining frequency" (2.47pp) and "local vs. chain perception" (7.40pp): both are classic consumer-sentiment questions well anchored in demographic priors and period-appropriate local context. Weaker on "likelihood to visit" (12.93pp) and "biggest reason to visit less" (9.95pp), where ground truth itself was estimated from narrative coverage rather than direct tabulation, and on "economic impact" (10.78pp). This is the real neighborhood-level regime: customers who want to test a brand question at a 6-block scale, demographically matched to real residents, with honest error bars on what we do and don't nail. Average calibrated MAE across the five: 8.71pp.
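The per-question MAE figures above are the mean absolute gap between predicted and observed shares across a question's options. The sketch below recomputes them from the rounded values shown in the five tables; the dictionary keys are shorthand labels chosen for this example. Because the inputs are rounded to one decimal, the recomputed values can differ from the published ones by about 0.01 pp.

```python
def question_mae(predicted, truth):
    """Mean absolute error, in percentage points, across one question's options."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(predicted)


# (Lewis 1.0 shares, JS&A shares), copied from the tables above.
short_north = {
    "dining frequency": ([20.4, 43.3, 36.3], [20, 40, 40]),
    "local vs. chain": ([48.9, 37.4, 13.7], [60, 30, 10]),
    "likelihood to visit": ([19.4, 36.0, 44.6], [15, 21, 64]),
    "economic impact": ([38.8, 45.4, 15.7], [55, 35, 10]),
    "reason to visit less": ([28.2, 25.3, 24.0, 12.0, 10.6], [35, 15, 10, 30, 10]),
}

per_question = {name: question_mae(p, t) for name, (p, t) in short_north.items()}
average_mae = sum(per_question.values()) / len(per_question)

for name, mae in per_question.items():
    print(f"{name}: {mae:.2f} pp")
print(f"average: {average_mae:.2f} pp")
```

Note that the headline average weights each question equally, regardless of how many options it has.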

Source: Short North Alliance & JS&A Consulting Market & Consumer Study, Aug 2023 (n=500+). Lewis 1.0 predictions calibrated on held-out folds of the benchmark pool, 5-fold CV. One further binary-option SNA question (“defining role in Columbus's identity”) is in-benchmark but sits in a calibration bucket outside the 5 shown — it will be published when that bucket has full n-of-bucket coverage.

Case Studies

CASE STUDY · POLITICAL APPROVAL

TX political approval — 18 questions, aggregate 3.43% MAE

Setup
UT/Texas Politics Project approval-rating block across 2024–25 field waves: Governor, Lt. Governor, Legislature, US Senators, economic direction. 3 options (approve / disapprove / don't know). 1,000 synthetic TX-resident agents per question, demographically matched to UT/TPP target margins (party ID × age × race × education × region).

Result
Aggregate calibrated MAE across all 18 questions: 3.43 pp (5-fold CV, Dirichlet calibration). Best individual question: 0.8 pp. Worst: 6.2 pp. Baseline (pre-calibration) MAE: 7.02 pp — calibration recovered more than half the raw error.

Takeaway
On the categorical backbone of most real survey research — approval ratings of well-known public figures and institutions — Lewis sits squarely in live-poll margin-of-error territory. This is the strongest signal for customers running messaging / approval / perception studies.

CASE STUDY · BRAND / ADVERTISING

TX grocery brand preference, late 2024

Setup
UT/TPP brand-reputation block. 6 options including 'Don't know / none'. Lewis 1.0 agent pool: 1,000 TX shoppers, income-tilted per retail demographics.

Result
Calibrated MAE across all 6 options: 4.14 pp. Lewis correctly rank-ordered top-3 brands and flagged the two long-tail options as below-10% territory.

Takeaway
Brand benchmarks reliably pick the leader and runner-up. For A/B ad concept testing, directional accuracy across demographics is the decision-grade signal.

CASE STUDY · PRE-REGISTERED HELD-OUT

22-Q independent benchmark — 9.97% ex-electoral

Setup
On April 18, 2026 we pre-registered a fresh 22-question benchmark drawn from Emerson College, Marist, PPIC, USC CEPP, UT Tyler, Change Research, and Ohio Library Council — five states (OH/GA/TX/NY/CA) and four categories (political approval / policy / civic trust / electoral). These questions were sourced AFTER training and calibration fits were frozen. An automated filter excluded 8 questions (past-election ground truths and extreme prior-delta outliers), leaving 14 scored.

Result
Calibrated MAE on the 14 scored questions: 10.68% overall, 9.97% ex-electoral. Political approval: 7.24 pp (n=5). Policy: 9.53 pp (n=4). Civic trust: 15.10 pp (n=3). Electoral: 14.90 pp (n=2).
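The overall and ex-electoral figures can be related to the per-category results with a question-count-weighted mean. That weighting is an assumption about how the pooled number is formed, not a documented formula, and the inputs below are the rounded category values quoted above, so the recomputed totals agree with the published ones only to within rounding.

```python
# (number of scored questions, calibrated MAE in pp) per category, from the result above.
held_out = {
    "Political approval": (5, 7.24),
    "Policy": (4, 9.53),
    "Civic trust": (3, 15.10),
    "Electoral": (2, 14.90),
}


def pooled_mae(categories, exclude=()):
    """Question-weighted mean of category MAEs, optionally dropping some categories."""
    kept = [(n, mae) for name, (n, mae) in categories.items() if name not in exclude]
    total_questions = sum(n for n, _ in kept)
    return sum(n * mae for n, mae in kept) / total_questions


overall = pooled_mae(held_out)
ex_electoral = pooled_mae(held_out, exclude=("Electoral",))

print(f"overall: {overall:.2f} pp, ex-electoral: {ex_electoral:.2f} pp")
```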

Takeaway
This is our strictest possible honesty check — a small, truly held-out set with no fitting after the fact. The gap between 4.88% on the Texas panel and 9.97% on the 22-Q held-out is the real-world variance between core use cases (familiar territory, in-distribution) and edge cases (novel state-specific policy, tight electoral margins). We publish both.

Disclosures

AI-generated data

All Lewsearch panel responses are generated by AI, not collected from human participants. Every report carries a mandatory disclosure to this effect.

Not a replacement for live surveys

Synthetic results are not appropriate for legal, regulatory, or journalistic contexts requiring probability sampling from live populations.

Post-hoc calibration applied

Raw Lewis outputs pass through a published post-hoc calibration technique (Dirichlet calibration) fit under 5-fold cross-validation on the benchmark pool. The same calibrator ships to every customer; there is no per-panel retuning after the fact. Raw and calibrated MAE are published side-by-side on the methodology page so reviewers can see what the calibration is and isn't doing.

Geographic coverage

10 live markets across 5 states: Ohio (Columbus · Cleveland · Ashtabula · statewide), Georgia (Atlanta · statewide), Texas (Dallas · statewide UT/TPP), California statewide (PPIC), New York statewide. Benchmark panels drove calibration; additional live markets use the same production calibrator.

Model versioning

Reports indicate the Lewis version at time of fielding. Lewis 1.0 is the current production model across all benchmark panels. Weights update on a continuous news-and-simulation retrain cycle; every update re-runs the full CV pipeline before promotion.

Confidence scoring

Every result ships with a confidence tier (High / Medium / Low / Flag). Tier is derived from two signals: (1) calibrated CV MAE for the matching question-type bucket, and (2) topline signal strength for this study (lead margin, subgroup consistency). We take the more conservative of the two. High = <6 pp bucket MAE and strong signal. Medium = 6–10 pp bucket MAE or moderate signal. Low = weak signal within a medium/high bucket (low lead margin, within-MoE topline). Flag = bucket MAE ≥10 pp, or the question fell outside any analogous calibrated bucket — treat as directional only.
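One way to read the tier rule is as a "take the more conservative of two signals" lookup. The sketch below is an interpretation of the description above, not the production scoring code: the function names are hypothetical, and the signal labels "strong", "moderate", and "weak" are assumed stand-ins for the lead-margin and subgroup-consistency checks, whose thresholds are not specified here.

```python
TIER_ORDER = ["High", "Medium", "Low", "Flag"]  # least to most conservative

# Assumed mapping from topline signal strength to the tier it can support.
SIGNAL_TIER = {"strong": "High", "moderate": "Medium", "weak": "Low"}


def bucket_tier(bucket_mae_pp):
    """Tier implied by calibrated CV MAE of the matching question-type bucket."""
    if bucket_mae_pp is None or bucket_mae_pp >= 10:
        return "Flag"  # no analogous bucket, or bucket MAE of 10 pp or more
    return "High" if bucket_mae_pp < 6 else "Medium"


def confidence_tier(bucket_mae_pp, signal):
    """Combine both signals and keep the more conservative tier."""
    from_bucket = bucket_tier(bucket_mae_pp)
    if from_bucket == "Flag":
        return "Flag"
    from_signal = SIGNAL_TIER[signal]
    return max(from_bucket, from_signal, key=TIER_ORDER.index)


print(confidence_tier(3.43, "strong"))   # approval-style bucket, clear topline
print(confidence_tier(7.77, "strong"))   # broader policy bucket
print(confidence_tier(4.88, "weak"))     # good bucket, within-MoE topline
print(confidence_tier(None, "strong"))   # no analogous calibrated bucket
```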

Why we publish raw + calibrated side-by-side

Raw MAE shows the base model's honesty. Calibrated MAE shows the production system's accuracy. Publishing both ensures customers and investors can judge whether our calibration is doing legitimate work or papering over structural bias. It is the former.

Due Diligence

Need the full methodology?

We share our complete calibration pipeline, full per-question CV breakdown, and benchmark dataset with researchers, investors, and enterprise clients.

Request Full Methodology →