Can AI Tell If Analytics Data Is Synthetic? 10 New LLMs Tested

Updated 2026-06-18

Alex Schlee

Founder & CEO

The Synthetic Data Test

Key Takeaways

9 of the 10 originally selected models completed successfully; Claude Fable 5, which was unavailable after a U.S. export-control directive, was replaced with GPT-5.5, bringing the final leaderboard to 10 completed models
Best evidence-backed audit: Gemini 3.5 Flash -- perfect GA4 field accuracy, actual data requests, clear synthetic-data reasoning
Fastest classifier: Grok 4.3 -- excellent answer in 8 seconds, but it did not make data requests
Best budget result: Gemini 3.1 Flash Lite -- $0.027 per run, excellent quality, but lower schema accuracy
Most expensive: Claude Opus 4.8 Fast ($1.64) and GPT-5.5 ($1.45) were useful but not cost leaders
Main finding: Most models correctly identified the dataset as synthetic, but they varied widely in how much evidence they actually gathered

In Round 1, we tested whether LLMs could avoid hallucinating when attribution data was broken.

In Round 2, we tested whether newer models could repeat a good analytics answer three times in a row.

For Round 3, we changed the question:

Determine whether the connected analytics dataset appears to be real production data, synthetic demo data, or inconclusive. Use only evidence available from the connected data sources.

That is a different kind of analytics task. It is less about campaign recommendations and more about evidence quality. A good model needs to inspect the data, notice statistical and structural tells, avoid overclaiming, and explain what evidence would change its mind.

This matters because demo data is not just a sales prop. Good synthetic data should help people understand a product, test workflows, write content, and catch model weaknesses. Bad synthetic data teaches models the wrong lesson.

The Prompt

We asked each model to return:

Conclusion: real / synthetic / inconclusive
Confidence: 0-100
Evidence supporting the conclusion
Evidence against the conclusion
Specific synthetic-data tells, if any
Specific real-data tells, if any
What additional data would change its mind
Recommendations to improve the demo data so it looks more realistic without becoming misleading
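
One way to make that response format checkable is to parse it into a small typed structure. The sketch below is not the benchmark runner's code; the class name, field names, and validation rules are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_CONCLUSIONS = ("real", "synthetic", "inconclusive")


@dataclass
class AuditResponse:
    """Hypothetical container for the eight items the prompt asks for."""
    conclusion: str
    confidence: int
    evidence_for: List[str] = field(default_factory=list)
    evidence_against: List[str] = field(default_factory=list)
    synthetic_tells: List[str] = field(default_factory=list)
    real_tells: List[str] = field(default_factory=list)
    would_change_mind: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of format problems; an empty list means the response is well formed."""
        found = []
        if self.conclusion not in ALLOWED_CONCLUSIONS:
            found.append(f"unknown conclusion: {self.conclusion!r}")
        if not 0 <= self.confidence <= 100:
            found.append("confidence must be between 0 and 100")
        # Assumed rule: a definite answer should cite at least one piece of evidence.
        if self.conclusion != "inconclusive" and not self.evidence_for:
            found.append("a definite conclusion needs supporting evidence")
        return found
```

A check like this only confirms the shape of the answer. It says nothing about whether the evidence was actually gathered from the connected data.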

Each model ran 3 times against the same connected GA4 property and the same Anamap analytics context.

Full Leaderboard

Rank	Model	Runs	Quality	Accuracy (/100)	Avg Time	Best / Worst Time	Tokens	Cost	Notes
1	Gemini 3.5 Flash	3/3	🏆 excellent	100	53s	41s / 62s	98K	$0.23	Best evidence-backed audit
2	Qwen3.7 Max	3/3	🏆 excellent	100	158s	128s / 198s	80K	$0.12	Detailed realism audit
3	MiniMax M3	3/3	🏆 excellent	100	250s	186s / 289s	141K	$0.06	Nuanced conclusion, slow
4	Grok 4.3	3/3	🏆 excellent	100	8s	5s / 10s	33K	$0.04	Fastest, but no data requests
5	Claude Opus 4.8	3/3	🏆 excellent	94	81s	73s / 87s	135K	$0.81	Strong structural analysis
6	Claude Opus 4.8 Fast	3/3	🏆 excellent	93	32s	31s / 34s	137K	$1.64	Fast but expensive
7	Gemini 3.1 Flash Lite	3/3	🏆 excellent	90	8s	7s / 9s	101K	$0.03	Cheapest successful run
8	GPT-5.5	3/3	🏆 excellent	88	148s	105s / 172s	224K	$1.45	Useful but schema issues
9	GLM 5.2	3/3	✅ good	90	84s	47s / 133s	126K	$0.20	Correct but less complete
10	Qwen3.7 Plus	3/3	⚠️ fair	100	146s	140s / 153s	149K	$0.05	Accurate syntax, thinner evidence

Note: The auto-generated leaderboard ranked Grok 4.3 first because it was fast, cheap, successful, and had perfect field accuracy. For this article, I rank Gemini 3.5 Flash higher because the task explicitly required evidence from connected data, and Grok completed without making data requests.
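
The difference between the two orderings can be sketched as two sort keys. This is an illustration only, not the scoring used by the leaderboard: the `Result` class and both key functions are hypothetical, and the rows include only the two models whose data-request behavior is stated in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    accuracy: int          # field accuracy, out of 100
    avg_seconds: float
    cost_usd: float
    made_data_requests: bool


def speed_first_key(r: Result):
    # Roughly what an automated ranking might reward: accuracy, then speed, then cost.
    return (-r.accuracy, r.avg_seconds, r.cost_usd)


def evidence_first_key(r: Result):
    # Models that queried the connected data sort ahead of those that did not.
    return (not r.made_data_requests, -r.accuracy, r.avg_seconds, r.cost_usd)


results = [
    Result("Grok 4.3", 100, 8, 0.04, made_data_requests=False),
    Result("Gemini 3.5 Flash", 100, 53, 0.23, made_data_requests=True),
]

speed_ranked = sorted(results, key=speed_first_key)
evidence_ranked = sorted(results, key=evidence_first_key)
```

With these two rows, `speed_ranked` puts Grok 4.3 first and `evidence_ranked` puts Gemini 3.5 Flash first, which matches the manual adjustment described in the note.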

What the Models Found

The models mostly converged on the right answer: the dataset looks synthetic.

The strongest evidence repeated across runs:

The GA4 property is explicitly labeled as a test property.
The requested historical window only returned a short populated date range.
Traffic attribution is suspiciously incomplete or classified as unassigned.
Geography, browser, and device distributions are too tidy.
Event names and attributes match a curated tracking plan too closely.
Conversion-like product events fire at high volume while GA4 conversions and revenue stay at zero.
Some ratios are too smooth or too clean for production traffic.
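
Several of these signals can be turned into simple checks. The sketch below covers three of them; the function names, thresholds, and sample numbers are assumptions for illustration, not values from the benchmark, apart from the 8-of-90 populated days that GPT-5.5 reported.

```python
from typing import Dict, List


def populated_day_share(daily_sessions: List[int]) -> float:
    """Fraction of days in the requested window that have any sessions."""
    if not daily_sessions:
        return 0.0
    return sum(1 for s in daily_sessions if s > 0) / len(daily_sessions)


def unassigned_share(sessions_by_channel: Dict[str, int]) -> float:
    """Fraction of sessions whose channel group is 'Unassigned'."""
    total = sum(sessions_by_channel.values())
    if total == 0:
        return 0.0
    return sessions_by_channel.get("Unassigned", 0) / total


def synthetic_tells(daily_sessions: List[int],
                    sessions_by_channel: Dict[str, int],
                    product_events: int,
                    conversions: int,
                    revenue: float) -> List[str]:
    """Collect tells using arbitrary illustrative thresholds (0.5 and 0.9)."""
    tells = []
    if populated_day_share(daily_sessions) < 0.5:
        tells.append("short populated date range")
    if unassigned_share(sessions_by_channel) > 0.9:
        tells.append("traffic is almost entirely unassigned")
    if product_events > 0 and conversions == 0 and revenue == 0:
        tells.append("product events fire but conversions and revenue are zero")
    return tells


# Illustrative input: 8 populated days in a 90-day window, all traffic unassigned.
window = [0] * 82 + [120] * 8
print(synthetic_tells(window, {"Unassigned": 960}, 400, 0, 0.0))
```

None of these checks is conclusive alone. Broken tracking on a real property could trip all three, which is why the stronger answers separated structural tells from tracking failures.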

The best models did not stop at "synthetic." They explained which signals were structural, which could be caused by broken tracking, and which additions would make the dataset more realistic.

Model Breakdowns

1. Gemini 3.5 Flash

Cost: $0.23 | Avg Time: 53s | Accuracy: 100/100 | Provider: Google

Gemini 3.5 Flash was the best-balanced result for this specific task. It made data requests, kept perfect GA4 field accuracy, and clearly explained why the dataset was synthetic.

Its final summary called out the lack of weekly seasonality, 100% missing traffic source attribution, mathematically suspicious event ratios, and alignment with static documentation benchmarks.

Why it ranked first:

Perfect field accuracy
Evidence-backed answer from queried data
Good speed for a multi-turn audit
Strong explanation of synthetic tells

2. Qwen3.7 Max

Cost: $0.12 | Avg Time: 158s | Accuracy: 100/100 | Provider: Qwen

Qwen3.7 Max gave one of the clearest realism audits. It identified exactly bounded geo values, limited browser diversity, a short temporal window, clean conversion funnels, and missing long-tail behavior.

It was slower than Gemini 3.5 Flash, but its answer was easy to turn into a demo-data improvement checklist: add more geography, more browser/device tail, more temporal seasonality, and more realistic outliers.

3. MiniMax M3

Cost: $0.06 | Avg Time: 250s | Accuracy: 100/100 | Provider: MiniMax

MiniMax M3 was the most nuanced low-cost result. It concluded the dataset was likely synthetic, but it also noted counter-signals: work-hour skew, weekday dominance, event drift beyond the documented plan, realistic new/returning ratios, and plausible session-to-user ratios.

That nuance is useful. A weaker model simply says "synthetic" and moves on. MiniMax described why the generator is already doing some things well.

The tradeoff: it was very slow.

4. Grok 4.3

Cost: $0.04 | Avg Time: 8s | Accuracy: 100/100 | Provider: xAI

Grok 4.3 was the fastest successful model by a wide margin. It correctly classified the dataset as synthetic and gave a concise explanation.

But the benchmark recorded zero data requests, which means it appears to have relied on provided context rather than actively inspecting the connected GA4 data. For a quick classifier, that is impressive. For an evidence audit, it is a limitation.

5. Claude Opus 4.8

Cost: $0.81 | Avg Time: 81s | Accuracy: 94/100 | Provider: Anthropic

Claude Opus 4.8 produced one of the strongest structural explanations. It identified geography collapsing to exact country/city pairs, browser distribution missing the normal long tail, a tidy Cartesian grid across geo/device/browser, and zero new users across days.

This was a rich answer, but it was not cheap. It also took a small accuracy hit from GA4 field issues.

6. Claude Opus 4.8 Fast

Cost: $1.64 | Avg Time: 32s | Accuracy: 93/100 | Provider: Anthropic

Claude Opus 4.8 Fast lived up to the name on latency, but not on cost. It was fast and thoughtful, identifying smooth traffic, all-unassigned channels, a closed geography set, and suspicious product-event rates.

The problem is economic: it cost more than GPT-5.5 and roughly 60x Gemini 3.1 Flash Lite.
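
The arithmetic behind that comparison, using the unrounded $0.027 figure for Gemini 3.1 Flash Lite from the key takeaways (variable names are illustrative):

```python
opus_fast_cost = 1.64    # Claude Opus 4.8 Fast
gpt_cost = 1.45          # GPT-5.5
flash_lite_cost = 0.027  # Gemini 3.1 Flash Lite, unrounded

ratio_vs_flash_lite = opus_fast_cost / flash_lite_cost
extra_vs_gpt = opus_fast_cost - gpt_cost

print(f"{ratio_vs_flash_lite:.1f}x Flash Lite, ${extra_vs_gpt:.2f} more than GPT-5.5")
```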

7. Gemini 3.1 Flash Lite

Cost: $0.03 | Avg Time: 8s | Accuracy: 90/100 | Provider: Google

Gemini 3.1 Flash Lite was the cheapest successful Round 3 model. It correctly identified synthetic/test-data patterns and was extremely fast.

The answer was thinner than Gemini 3.5 Flash, and the field-accuracy score was lower. I would use it for cheap smoke tests, not as the final judge for a data-quality audit.

8. GPT-5.5

Cost: $1.45 | Avg Time: 148s | Accuracy: 88/100 | Provider: OpenAI

GPT-5.5 was added as a one-off replacement after Claude Fable 5 failed all attempts through OpenRouter. Public reporting and Anthropic's own statement indicate Fable 5 had been disabled after a U.S. export-control directive, so we treated those failures as access-related rather than model-quality evidence.

It correctly concluded the dataset was synthetic demo data. Its strongest evidence: the test-property label, only 8 populated days in a requested 90-day window, a curated event taxonomy, and conversion-like events with zero GA4 conversions/revenue.

The caveat: GPT-5.5 hit repeated GA4 compatibility issues around traffic dimensions with eventCount and screenPageViews, giving it the lowest accuracy score among successful Round 3 models.
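
Compatibility problems like this can be caught before running a report, because the GA4 Data API exposes a `checkCompatibility` method that reports which dimensions and metrics can be requested together. The sketch below only builds the request and sends nothing. The helper name is hypothetical, and `sessionSource` is an assumed example of a traffic dimension, since the article does not name the exact dimensions involved.

```python
import json
from typing import Dict, List

API_ROOT = "https://analyticsdata.googleapis.com/v1beta"


def build_compatibility_check(property_id: str,
                              dimensions: List[str],
                              metrics: List[str]) -> Dict[str, object]:
    """Build the URL and JSON body for a checkCompatibility call (no network access)."""
    return {
        "url": f"{API_ROOT}/properties/{property_id}:checkCompatibility",
        "body": {
            "dimensions": [{"name": name} for name in dimensions],
            "metrics": [{"name": name} for name in metrics],
        },
    }


request = build_compatibility_check(
    "509106858",                      # the test property from the methodology
    ["sessionSource"],                # assumed example dimension
    ["eventCount", "screenPageViews"],
)
print(json.dumps(request, indent=2))
```

Sending the request requires an authenticated POST, which is out of scope here. The response lists each dimension and metric along with whether it is compatible with the rest of the request.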

9. GLM 5.2

Cost: $0.20 | Avg Time: 84s | Accuracy: 90/100 | Provider: Z.ai

GLM 5.2 correctly identified the dataset as synthetic with high confidence. It called out 100% engagement, zero new users, unassigned traffic, zero conversions, weekend traffic oddities, and generation-like documentation parameters.

It was useful, but not as complete or as clean as the top models.

10. Qwen3.7 Plus

Cost: $0.05 | Avg Time: 146s | Accuracy: 100/100 | Provider: Qwen

Qwen3.7 Plus had perfect field accuracy, but the final output was less evidence-rich than Qwen3.7 Max. It identified synthetic fingerprints, but did not provide the same level of audit depth.

This is a good reminder that valid queries are not the same thing as a good analysis.

What Failed

Claude Fable 5 was selected by the newest-model automation, but failed all 3 attempts through OpenRouter. Public reporting and Anthropic's own statement indicate Fable 5 had been disabled after a U.S. export-control directive, so we treated this as an access failure rather than a useful model-quality result and replaced it with GPT-5.5 as a one-off run.

That replacement is now included in the Round 3 results.

Demo Data Lessons

The models gave surprisingly useful feedback for improving synthetic analytics data.

If the goal is a demo dataset that feels realistic without pretending to be production data, the biggest improvements are:

Add long-tail geography instead of a small closed set of cities and countries.
Add realistic browser and device tails: Edge, Firefox, Samsung Internet, bots, odd devices.
Add seasonality: weekday/weekend dips, launch spikes, quiet periods, holidays.
Add more acquisition mess: referrals, organic search, paid campaigns, direct traffic, spam.
Add conversion inconsistencies that mirror real tracking: partial revenue, missing events, delayed conversions.
Avoid perfect ratios and overly clean funnels.
Keep explicit demo/test labeling so users are not misled.

That last point matters. The goal is not to fool users. The goal is to create enough realism that the product, the AI, and the workflow are all tested honestly.
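
A minimal sketch of two of those improvements, weekend dips and a browser long tail, with the demo label kept on every row. This is not Anamap's generator; the function names, base traffic level, noise range, and browser weights are all made-up assumptions.

```python
import random
from collections import Counter
from datetime import date, timedelta
from typing import Dict, List

# Illustrative weights only, not measured market share.
BROWSER_WEIGHTS = {
    "Chrome": 62, "Safari": 22, "Edge": 7,
    "Firefox": 5, "Samsung Internet": 3, "Other": 1,
}


def daily_sessions(day: date, rng: random.Random, base: int = 1000,
                   weekend_factor: float = 0.6, noise: float = 0.1) -> int:
    """Session count with a weekend dip and +/-10% random noise."""
    level = base * (weekend_factor if day.weekday() >= 5 else 1.0)
    return max(0, round(level * (1 + rng.uniform(-noise, noise))))


def generate(start: date, days: int, seed: int = 0) -> List[Dict[str, object]]:
    """Generate labeled demo rows; the seed keeps the output reproducible."""
    rng = random.Random(seed)
    names = list(BROWSER_WEIGHTS)
    weights = list(BROWSER_WEIGHTS.values())
    rows = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        sessions = daily_sessions(day, rng)
        browsers = Counter(rng.choices(names, weights=weights, k=sessions))
        rows.append({
            "date": day,
            "is_demo": True,  # explicit labeling so nobody mistakes this for production
            "sessions": sessions,
            "browsers": dict(browsers),
        })
    return rows


rows = generate(date(2026, 6, 1), days=28)
```

The same pattern extends to geography, acquisition channels, and conversion gaps: sample from a weighted distribution with a tail instead of a fixed grid, and add noise so ratios stop being exact.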

Methodology

Test setup:

GA4 property ID: 509106858
Same Anamap analytics benchmark runner as prior rounds
Same system prompt and data-source tooling pattern
3 runs per model
Max 4 model turns per run
OpenRouter model selection limited to models created in the last 3 months
Previous benchmark models excluded
GPT-5.5 added as a one-off replacement for Claude Fable 5 after access to Fable 5 was disabled following a U.S. export-control directive

Prompt: Determine whether the connected dataset is real, synthetic, or inconclusive, and explain the evidence.

Evaluation criteria:

Completion rate across 3 runs
Quality of final analysis
GA4 field accuracy / hallucination score
Whether the model queried connected data
Cost
Latency
Specificity of evidence
Usefulness of demo-data improvement recommendations

Total cost: $4.62 across the initial 10-model run plus the GPT-5.5 replacement. The replacement leaderboard excludes the unavailable Claude Fable 5 row.
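
As a sanity check, the Cost column of the leaderboard can be summed directly. The snippet works in whole cents to avoid floating-point drift and assumes the rounded values shown in the table, so the result can differ from the reported total by a cent.

```python
# Cost column from the leaderboard, in cents, using the rounded table values.
cost_cents = {
    "Gemini 3.5 Flash": 23,
    "Qwen3.7 Max": 12,
    "MiniMax M3": 6,
    "Grok 4.3": 4,
    "Claude Opus 4.8": 81,
    "Claude Opus 4.8 Fast": 164,
    "Gemini 3.1 Flash Lite": 3,
    "GPT-5.5": 145,
    "GLM 5.2": 20,
    "Qwen3.7 Plus": 5,
}

total_cents = sum(cost_cents.values())
print(f"${total_cents / 100:.2f}")  # $4.63
```

The rounded rows add up to $4.63, which agrees with the reported $4.62 to within rounding.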

What This Means

For synthetic-data detection, the best model is not necessarily the fastest or most expensive one.

Gemini 3.5 Flash was the best evidence-backed result. Grok 4.3 was the fastest classifier. MiniMax M3 gave the most useful low-cost nuanced answer. GPT-5.5 was directionally right but expensive and less precise with GA4 query construction.

The broader pattern is encouraging: most frontier models can identify synthetic analytics data when the evidence is available. The harder problem is making them show their work reliably.

Frequently Asked Questions

Which model was best at detecting synthetic analytics data?

Gemini 3.5 Flash was the best evidence-backed model in this benchmark. It completed all 3 runs, kept perfect GA4 field accuracy, made data requests, and produced a clear synthetic-data audit.

Why did Grok 4.3 not rank first if it was fastest?

Grok 4.3 was the fastest successful model and produced a correct answer, but it made zero data requests. For a task that asks the model to use evidence from connected data sources, that matters.

How did GPT-5.5 perform?

GPT-5.5 completed all 3 runs and correctly classified the dataset as synthetic demo data. It averaged 148 seconds and $1.45 per run, with an 88/100 field-accuracy score due to GA4 compatibility issues.

Was the synthetic data too obvious?

Some signals were intentionally obvious, such as the test-property label. But the better models went beyond that and inspected distributions, attribution gaps, temporal coverage, event taxonomy, and conversion inconsistencies.

Should demo data try to fool AI models?

No. Demo data should be clearly labeled. The goal is not deception; it is realism. A good demo dataset should contain enough realistic messiness to test workflows and model judgment without misleading users.

ABOUT THE AUTHOR

Alex Schlee is the founder of Anamap and has experience spanning the full gamut of analytics from implementation engineering to warehousing and insight generation. He's a great person to connect with about anything related to analytics or technology.
