Anamap Blog
Can AI Tell If Analytics Data Is Synthetic? 10 New LLMs Tested
AI & Analytics
Updated 2026-06-18
The Synthetic Data Test
- 9 of the 10 originally selected models completed all runs; Claude Fable 5, which was unavailable after a U.S. export-control directive, was replaced with GPT-5.5, which also completed
- Best evidence-backed audit: Gemini 3.5 Flash -- perfect GA4 field accuracy, actual data requests, clear synthetic-data reasoning
- Fastest classifier: Grok 4.3 -- excellent answer in 8 seconds, but it did not make data requests
- Best budget result: Gemini 3.1 Flash Lite -- $0.027 per run, excellent quality, but lower schema accuracy
- Most expensive: Claude Opus 4.8 Fast ($1.64) and GPT-5.5 ($1.45) were useful but not cost leaders
- Main finding: Most models correctly identified the dataset as synthetic, but they varied a lot in how much evidence they actually gathered
In Round 1, we tested whether LLMs could avoid hallucinating when attribution data was broken.
In Round 2, we tested whether newer models could repeat a good analytics answer three times in a row.
For Round 3, we changed the question:
Determine whether the connected analytics dataset appears to be real production data, synthetic demo data, or inconclusive. Use only evidence available from the connected data sources.
That is a different kind of analytics task. It is less about campaign recommendations and more about evidence quality. A good model needs to inspect the data, notice statistical and structural tells, avoid overclaiming, and explain what evidence would change its mind.
This matters because demo data is not just a sales prop. Good synthetic data should help people understand a product, test workflows, write content, and catch model weaknesses. Bad synthetic data teaches models the wrong lesson.
The Prompt
We asked each model to return:
- Conclusion: real / synthetic / inconclusive
- Confidence: 0-100
- Evidence supporting the conclusion
- Evidence against the conclusion
- Specific synthetic-data tells, if any
- Specific real-data tells, if any
- What additional data would change its mind
- Recommendations to improve the demo data so it looks more realistic without becoming misleading
Each model ran 3 times against the same connected GA4 property and the same Anamap analytics context.
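To make the requested output concrete, here is a minimal sketch of how those fields could be captured and sanity-checked. The class name, field names, and the `validate` helper are illustrative assumptions, not the schema the benchmark runner actually uses.

```python
from dataclasses import dataclass, field
from typing import List

# The three conclusions the prompt allows.
ALLOWED_CONCLUSIONS = {"real", "synthetic", "inconclusive"}


@dataclass
class AuditResponse:
    """Hypothetical container for one model answer (names are assumptions)."""
    conclusion: str
    confidence: int  # 0-100, as requested in the prompt
    evidence_for: List[str] = field(default_factory=list)
    evidence_against: List[str] = field(default_factory=list)
    synthetic_tells: List[str] = field(default_factory=list)
    real_tells: List[str] = field(default_factory=list)
    would_change_mind: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)


def validate(resp: AuditResponse) -> List[str]:
    """Return a list of problems; an empty list means the answer is well-formed."""
    problems = []
    if resp.conclusion not in ALLOWED_CONCLUSIONS:
        problems.append("conclusion must be real, synthetic, or inconclusive")
    if not 0 <= resp.confidence <= 100:
        problems.append("confidence must be between 0 and 100")
    if not resp.evidence_for:
        problems.append("no supporting evidence listed")
    return problems
```

A check like this only confirms the answer has the right shape. It says nothing about whether the evidence came from queried data, which is what separated the top models in this round.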
Full Leaderboard
| Rank | Model | Runs | Quality | Accuracy | Avg Time | Best/Worst | Tokens | Cost | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.5 Flash | 3/3 | 🏆 excellent | 100 | 53s | 41s / 62s | 98K | $0.23 | Best evidence-backed audit |
| 2 | Qwen3.7 Max | 3/3 | 🏆 excellent | 100 | 158s | 128s / 198s | 80K | $0.12 | Detailed realism audit |
| 3 | MiniMax M3 | 3/3 | 🏆 excellent | 100 | 250s | 186s / 289s | 141K | $0.06 | Nuanced conclusion, slow |
| 4 | Grok 4.3 | 3/3 | 🏆 excellent | 100 | 8s | 5s / 10s | 33K | $0.04 | Fastest, but no data requests |
| 5 | Claude Opus 4.8 | 3/3 | 🏆 excellent | 94 | 81s | 73s / 87s | 135K | $0.81 | Strong structural analysis |
| 6 | Claude Opus 4.8 Fast | 3/3 | 🏆 excellent | 93 | 32s | 31s / 34s | 137K | $1.64 | Fast but expensive |
| 7 | Gemini 3.1 Flash Lite | 3/3 | 🏆 excellent | 90 | 8s | 7s / 9s | 101K | $0.03 | Cheapest successful run |
| 8 | GPT-5.5 | 3/3 | 🏆 excellent | 88 | 148s | 105s / 172s | 224K | $1.45 | Useful but schema issues |
| 9 | GLM 5.2 | 3/3 | ✅ good | 90 | 84s | 47s / 133s | 126K | $0.20 | Correct but less complete |
| 10 | Qwen3.7 Plus | 3/3 | ⚠️ fair | 100 | 146s | 140s / 153s | 149K | $0.05 | Accurate syntax, thinner evidence |
Note: The auto-generated leaderboard ranked Grok 4.3 first because it was fast, cheap, successful, and had perfect field accuracy. For this article, I rank Gemini 3.5 Flash higher because the task explicitly required evidence from connected data, and Grok completed without making data requests.
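The difference between the two rankings can be sketched as a change in sort key. The actual formula behind the auto-generated leaderboard is not shown here, so `auto_key` is an assumption (accuracy first, then speed, then cost), and `evidence_first_key` simply puts "made data requests" ahead of everything else. The two rows use figures from the leaderboard above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    accuracy: int
    avg_time_s: int
    cost_usd: float
    made_data_requests: bool


rows = [
    Row("Gemini 3.5 Flash", 100, 53, 0.23, True),
    Row("Grok 4.3", 100, 8, 0.04, False),
]


def auto_key(r: Row):
    # Assumed automatic ordering: higher accuracy, then faster, then cheaper.
    return (-r.accuracy, r.avg_time_s, r.cost_usd)


def evidence_first_key(r: Row):
    # Models that queried the connected data sort ahead of those that did not.
    return (not r.made_data_requests,) + auto_key(r)


auto_rank = [r.model for r in sorted(rows, key=auto_key)]
evidence_rank = [r.model for r in sorted(rows, key=evidence_first_key)]
```

Under these assumptions `auto_rank` starts with Grok 4.3 and `evidence_rank` starts with Gemini 3.5 Flash, which mirrors the editorial choice described in the note.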
What the Models Found
The models mostly converged on the right answer: the dataset looks synthetic.
The strongest evidence repeated across runs:
- The GA4 property is explicitly labeled as a test property.
- The requested historical window only returned a short populated date range.
- Traffic attribution is suspiciously incomplete or classified as unassigned.
- Geography, browser, and device distributions are too tidy.
- Event names and attributes match a curated tracking plan too closely.
- Conversion-like product events fire at high volume while GA4 conversions and revenue stay at zero.
- Some ratios are too smooth or too clean for production traffic.
The best models did not stop at "synthetic." They explained which signals were structural, which could be caused by broken tracking, and which additions would make the dataset more realistic.
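A few of these tells are simple enough to check mechanically. The sketch below is a rough illustration, not what any model ran: the function names and thresholds (25% populated, 95% unassigned, a weekend ratio within 10% of 1.0) are assumptions chosen for the example, and the input values are hypothetical apart from the 8-of-90-days window and fully unassigned traffic reported in this round.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def populated_share(daily: Dict[date, int], window_days: int) -> float:
    """Share of the requested window that has any sessions at all."""
    populated = sum(1 for sessions in daily.values() if sessions > 0)
    return populated / window_days


def unassigned_share(by_channel: Dict[str, int]) -> float:
    """Share of sessions in the 'Unassigned' channel group."""
    total = sum(by_channel.values())
    return by_channel.get("Unassigned", 0) / total if total else 0.0


def weekend_ratio(daily: Dict[date, int]) -> Optional[float]:
    """Mean weekend sessions divided by mean weekday sessions."""
    weekend = [s for d, s in daily.items() if d.weekday() >= 5]
    weekday = [s for d, s in daily.items() if d.weekday() < 5]
    if not weekend or not weekday or sum(weekday) == 0:
        return None
    return (sum(weekend) / len(weekend)) / (sum(weekday) / len(weekday))


def collect_tells(daily: Dict[date, int], by_channel: Dict[str, int],
                  window_days: int) -> List[str]:
    tells = []
    if populated_share(daily, window_days) < 0.25:  # assumed threshold
        tells.append("short populated date range")
    if unassigned_share(by_channel) >= 0.95:  # assumed threshold
        tells.append("traffic almost entirely unassigned")
    ratio = weekend_ratio(daily)
    if ratio is not None and 0.9 <= ratio <= 1.1:  # assumed threshold
        tells.append("no weekday/weekend difference")
    return tells


# Hypothetical input: 8 populated days in a 90-day window, perfectly flat traffic.
start = date(2026, 6, 1)
daily = {start + timedelta(days=i): 100 for i in range(8)}
tells = collect_tells(daily, {"Unassigned": 800}, window_days=90)
```

Each of these signals could also come from broken tracking in a real property, which is why the stronger answers treated them as evidence to weigh rather than proof.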
Model Breakdowns
1. Gemini 3.5 Flash
Cost: $0.23 | Avg Time: 53s | Accuracy: 100/100 | Provider: Google
Gemini 3.5 Flash was the best-balanced result for this specific task. It made data requests, kept perfect GA4 field accuracy, and clearly explained why the dataset was synthetic.
Its final summary called out the lack of weekly seasonality, 100% missing traffic source attribution, mathematically suspicious event ratios, and alignment with static documentation benchmarks.
Why it ranked first:
- Perfect field accuracy
- Evidence-backed answer from queried data
- Good speed for a multi-turn audit
- Strong explanation of synthetic tells
2. Qwen3.7 Max
Cost: $0.12 | Avg Time: 158s | Accuracy: 100/100 | Provider: Qwen
Qwen3.7 Max gave one of the clearest realism audits. It identified a closed, exactly bounded set of geo values, limited browser diversity, a short temporal window, clean conversion funnels, and missing long-tail behavior.
It was slower than Gemini 3.5 Flash, but its answer was easy to turn into a demo-data improvement checklist: add more geography, more browser/device tail, more temporal seasonality, and more realistic outliers.
3. MiniMax M3
Cost: $0.06 | Avg Time: 250s | Accuracy: 100/100 | Provider: MiniMax
MiniMax M3 was the most nuanced low-cost result. It concluded the dataset was likely synthetic, but it also noted counter-signals: work-hour skew, weekday dominance, event drift beyond the documented plan, realistic new/returning ratios, and plausible session-to-user ratios.
That nuance is useful. A weaker model simply says "synthetic" and moves on. MiniMax described why the generator is already doing some things well.
The tradeoff: it was very slow.
4. Grok 4.3
Cost: $0.04 | Avg Time: 8s | Accuracy: 100/100 | Provider: xAI
Grok 4.3 was the fastest successful model by a wide margin. It correctly classified the dataset as synthetic and gave a concise explanation.
But the benchmark recorded zero data requests, which means it appears to have relied on provided context rather than actively inspecting the connected GA4 data. For a quick classifier, that is impressive. For an evidence audit, it is a limitation.
5. Claude Opus 4.8
Cost: $0.81 | Avg Time: 81s | Accuracy: 94/100 | Provider: Anthropic
Claude Opus 4.8 produced one of the strongest structural explanations. It identified geography collapsing to exact country/city pairs, browser distribution missing the normal long tail, a tidy Cartesian grid across geo/device/browser, and zero new users across days.
This was a rich answer, but it was not cheap. It also took a small accuracy hit from GA4 field issues.
6. Claude Opus 4.8 Fast
Cost: $1.64 | Avg Time: 32s | Accuracy: 93/100 | Provider: Anthropic
Claude Opus 4.8 Fast lived up to the name on latency, but not on cost. It was fast and thoughtful, identifying smooth traffic, all-unassigned channels, a closed geography set, and suspicious product-event rates.
The problem is economic: it cost more than GPT-5.5 and roughly 60x Gemini 3.1 Flash Lite.
7. Gemini 3.1 Flash Lite
Cost: $0.03 | Avg Time: 8s | Accuracy: 90/100 | Provider: Google
Gemini 3.1 Flash Lite was the cheapest successful Round 3 model. It correctly identified synthetic/test-data patterns and was extremely fast.
The answer was thinner than Gemini 3.5 Flash, and the field-accuracy score was lower. I would use it for cheap smoke tests, not as the final judge for a data-quality audit.
8. GPT-5.5
Cost: $1.45 | Avg Time: 148s | Accuracy: 88/100 | Provider: OpenAI
GPT-5.5 was added as a one-off replacement after Claude Fable 5 failed all attempts through OpenRouter. Public reporting and Anthropic's own statement indicate Fable 5 had been disabled after a U.S. export-control directive, so we treated those failures as access-related rather than model-quality evidence.
It correctly concluded the dataset was synthetic demo data. Its strongest evidence: the test-property label, only 8 populated days in a requested 90-day window, a curated event taxonomy, and conversion-like events with zero GA4 conversions/revenue.
The caveat: GPT-5.5 hit repeated GA4 compatibility issues around traffic dimensions with eventCount and screenPageViews, giving it the lowest accuracy score among successful Round 3 models.
9. GLM 5.2
Cost: $0.20 | Avg Time: 84s | Accuracy: 90/100 | Provider: Z.ai
GLM 5.2 correctly identified the dataset as synthetic with high confidence. It called out 100% engagement, zero new users, unassigned traffic, zero conversions, weekend traffic oddities, and generation-like documentation parameters.
It was useful, but not as complete or as clean as the top models.
10. Qwen3.7 Plus
Cost: $0.05 | Avg Time: 146s | Accuracy: 100/100 | Provider: Qwen
Qwen3.7 Plus had perfect field accuracy, but the final output was less evidence-rich than Qwen3.7 Max. It identified synthetic fingerprints, but did not provide the same level of audit depth.
This is a good reminder that valid queries are not the same thing as a good analysis.
What Failed
Claude Fable 5 was selected by the newest-model automation, but failed all 3 attempts through OpenRouter. Public reporting and Anthropic's own statement indicate Fable 5 had been disabled after a U.S. export-control directive, so we treated this as an access failure rather than a useful model-quality result and replaced it with GPT-5.5 as a one-off run.
That replacement is now included in the Round 3 results.
Demo Data Lessons
The models gave surprisingly useful feedback for improving synthetic analytics data.
If the goal is a demo dataset that feels realistic without pretending to be production data, the biggest improvements are:
- Add long-tail geography instead of a small closed set of cities and countries.
- Add realistic browser and device tails: Edge, Firefox, Samsung Internet, bots, odd devices.
- Add seasonality: weekday/weekend dips, launch spikes, quiet periods, holidays.
- Add more acquisition mess: referrals, organic search, paid campaigns, direct traffic, spam.
- Add conversion inconsistencies that mirror real tracking: partial revenue, missing events, delayed conversions.
- Avoid perfect ratios and overly clean funnels.
- Keep explicit demo/test labeling so users are not misled.
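Two of these ideas, weekly seasonality and long-tail distributions, are easy to sketch. This is a minimal illustration under assumed parameters (base volume, weekend factor, noise level, tail exponent), not the generator behind the demo property. Every row keeps an explicit `is_demo` flag, in line with the labeling point above.

```python
import random
from datetime import date, timedelta
from typing import Dict, List


def generate_daily_sessions(start: date, days: int, base: int = 1000,
                            weekend_factor: float = 0.6, noise: float = 0.15,
                            seed: int = 42) -> List[Dict]:
    """Daily session counts with a weekend dip and Gaussian noise."""
    rng = random.Random(seed)  # seeded so demo data is reproducible
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        level = base * (weekend_factor if d.weekday() >= 5 else 1.0)
        sessions = max(0, round(rng.gauss(level, level * noise)))
        rows.append({"date": d.isoformat(), "sessions": sessions, "is_demo": True})
    return rows


def long_tail_weights(n: int, exponent: float = 1.2) -> List[float]:
    """Zipf-like weights: a few big entries, then a long tail of small ones."""
    raw = [1 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


rows = generate_daily_sessions(date(2026, 1, 5), days=56)
country_weights = long_tail_weights(50)
```

The weights could feed `random.choices` to assign a country or browser per session, so a handful of values dominate while dozens of rare ones still appear.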
That last point matters. The goal is not to fool users. The goal is to create enough realism that the product, the AI, and the workflow are all tested honestly.
Methodology
Test setup:
- GA4 property ID: 509106858
- Same Anamap analytics benchmark runner as prior rounds
- Same system prompt and data-source tooling pattern
- 3 runs per model
- Max 4 model turns per run
- OpenRouter model selection limited to models created in the last 3 months
- Previous benchmark models excluded
- GPT-5.5 added as a one-off replacement for Claude Fable 5 after access to Fable 5 was disabled following a U.S. export-control directive
Prompt (abbreviated; full text above): Determine whether the connected dataset is real, synthetic, or inconclusive, and explain the evidence.
Evaluation criteria:
- Completion rate across 3 runs
- Quality of final analysis
- GA4 field accuracy / hallucination score
- Whether the model queried connected data
- Cost
- Latency
- Specificity of evidence
- Usefulness of demo-data improvement recommendations
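Several of these criteria reduce to simple aggregation over the three runs. The sketch below shows one way the per-model leaderboard cells could be derived; `RunResult`, `summarize`, and the sample values are hypothetical and not taken from the benchmark runner.

```python
from statistics import mean
from typing import Dict, List, NamedTuple


class RunResult(NamedTuple):
    succeeded: bool
    seconds: float
    data_requests: int  # number of queries made against the connected data


def summarize(runs: List[RunResult]) -> Dict:
    """Collapse per-run results into leaderboard-style cells."""
    ok = [r for r in runs if r.succeeded]
    return {
        "runs": f"{len(ok)}/{len(runs)}",
        "avg_time_s": round(mean(r.seconds for r in ok)) if ok else None,
        "best_worst_s": (min(r.seconds for r in ok), max(r.seconds for r in ok)) if ok else None,
        "queried_data": any(r.data_requests > 0 for r in ok),
    }


# Hypothetical runs for illustration only.
summary = summarize([
    RunResult(True, 40, 3),
    RunResult(True, 50, 2),
    RunResult(True, 60, 4),
])
```

Quality of analysis and specificity of evidence still need a human or model judge; only the mechanical columns fall out of a summary like this.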
Total cost: $4.62 across the initial 10-model run plus the GPT-5.5 replacement. The replacement leaderboard excludes the unavailable Claude Fable 5 row.
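As a quick arithmetic check, the rounded Cost column from the leaderboard can be summed directly. This assumes the reported total is the sum of that column. The rounded values add up to $4.63, one cent above the reported figure, which is within what rounding of individual rows would explain (Gemini 3.1 Flash Lite is $0.027 in the summary and $0.03 in the table).

```python
from decimal import Decimal

# Cost column copied from the leaderboard (rounded to cents).
costs = {
    "Gemini 3.5 Flash": Decimal("0.23"),
    "Qwen3.7 Max": Decimal("0.12"),
    "MiniMax M3": Decimal("0.06"),
    "Grok 4.3": Decimal("0.04"),
    "Claude Opus 4.8": Decimal("0.81"),
    "Claude Opus 4.8 Fast": Decimal("1.64"),
    "Gemini 3.1 Flash Lite": Decimal("0.03"),
    "GPT-5.5": Decimal("1.45"),
    "GLM 5.2": Decimal("0.20"),
    "Qwen3.7 Plus": Decimal("0.05"),
}

total = sum(costs.values(), Decimal("0"))

# Ratio behind the "roughly 60x" comparison, using the unrounded $0.027 figure.
fast_vs_lite = costs["Claude Opus 4.8 Fast"] / Decimal("0.027")

print(f"sum of rounded rows: ${total}")
print(f"Opus 4.8 Fast vs Flash Lite: {fast_vs_lite:.1f}x")
```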
What This Means
For synthetic-data detection, the best model is not necessarily the fastest or most expensive one.
Gemini 3.5 Flash was the best evidence-backed result. Grok 4.3 was the fastest classifier. MiniMax M3 gave the most useful low-cost nuanced answer. GPT-5.5 was directionally right but expensive and less precise with GA4 query construction.
The broader pattern is encouraging: most frontier models can identify synthetic analytics data when the evidence is available. The harder problem is making them show their work reliably.
Frequently Asked Questions
Which model was best at detecting synthetic analytics data?
Gemini 3.5 Flash was the best evidence-backed model in this benchmark. It completed all 3 runs, kept perfect GA4 field accuracy, made data requests, and produced a clear synthetic-data audit.
Why did Grok 4.3 not rank first if it was fastest?
Grok 4.3 was the fastest successful model and produced a correct answer, but it made zero data requests. For a task that asks the model to use evidence from connected data sources, that matters.
How did GPT-5.5 perform?
GPT-5.5 completed all 3 runs and correctly classified the dataset as synthetic demo data. It averaged 148 seconds and $1.45 per run, with an 88/100 field-accuracy score due to GA4 compatibility issues.
Was the synthetic data too obvious?
Some signals were intentionally obvious, such as the test-property label. But the better models went beyond that and inspected distributions, attribution gaps, temporal coverage, event taxonomy, and conversion inconsistencies.
Should demo data try to fool AI models?
No. Demo data should be clearly labeled. The goal is not deception; it is realism. A good demo dataset should contain enough realistic messiness to test workflows and model judgment without misleading users.