Anamap Blog

Can AI Tell If Analytics Data Is Synthetic? 10 New LLMs Tested

AI & Analytics

Updated 2026-06-18

Alex Schlee

Founder & CEO

The Synthetic Data Test

Key Takeaways
  • 9 of the 10 originally selected models completed successfully; the tenth, Claude Fable 5, was unavailable after a U.S. export-control directive and was replaced with GPT-5.5
  • Best evidence-backed audit: Gemini 3.5 Flash -- perfect GA4 field accuracy, actual data requests, clear synthetic-data reasoning
  • Fastest classifier: Grok 4.3 -- excellent answer in 8 seconds, but it did not make data requests
  • Best budget result: Gemini 3.1 Flash Lite -- $0.027 per run, excellent quality, but lower schema accuracy
  • Most expensive: Claude Opus 4.8 Fast ($1.64) and GPT-5.5 ($1.45) were useful but not cost leaders
  • Main finding: Most models correctly identified the dataset as synthetic, but they varied widely in how much evidence they actually gathered

In Round 1, we tested whether LLMs could avoid hallucinating when attribution data was broken.

In Round 2, we tested whether newer models could repeat a good analytics answer three times in a row.

For Round 3, we changed the question:

Determine whether the connected analytics dataset appears to be real production data, synthetic demo data, or inconclusive. Use only evidence available from the connected data sources.

That is a different kind of analytics task. It is less about campaign recommendations and more about evidence quality. A good model needs to inspect the data, notice statistical and structural tells, avoid overclaiming, and explain what evidence would change its mind.

This matters because demo data is not just a sales prop. Good synthetic data should help people understand a product, test workflows, write content, and catch model weaknesses. Bad synthetic data teaches models the wrong lesson.

The Prompt

We asked each model to return:

  1. Conclusion: real / synthetic / inconclusive
  2. Confidence: 0-100
  3. Evidence supporting the conclusion
  4. Evidence against the conclusion
  5. Specific synthetic-data tells, if any
  6. Specific real-data tells, if any
  7. What additional data would change its mind
  8. Recommendations to improve the demo data so it looks more realistic without becoming misleading

Each model ran 3 times against the same connected GA4 property and the same Anamap analytics context.
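
The eight requested fields map onto a small record type, which makes it easier to check whether a run actually answered everything. The following is a minimal sketch, not the benchmark runner's real schema: the class name, field names, and the checks in problems() are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuditResponse:
    """Hypothetical container for the eight fields the prompt asks for."""

    conclusion: str            # "real", "synthetic", or "inconclusive"
    confidence: int            # 0-100
    evidence_for: List[str] = field(default_factory=list)
    evidence_against: List[str] = field(default_factory=list)
    synthetic_tells: List[str] = field(default_factory=list)
    real_tells: List[str] = field(default_factory=list)
    would_change_mind: List[str] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return structural problems; an empty list means well-formed."""
        issues = []
        if self.conclusion not in ("real", "synthetic", "inconclusive"):
            issues.append("conclusion must be real, synthetic, or inconclusive")
        if not 0 <= self.confidence <= 100:
            issues.append("confidence must be between 0 and 100")
        if not self.evidence_for:
            issues.append("no supporting evidence given")
        if not self.would_change_mind:
            issues.append("no statement of what would change the conclusion")
        return issues
```

A check like this only validates the shape of an answer. It says nothing about whether the evidence was gathered from the connected data, which is the part the rest of this round focuses on.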

Full Leaderboard

Rank | Model | Runs | Quality | Accuracy | Avg Time | Best / Worst | Tokens | Cost | Notes
-----|-------|------|---------|----------|----------|--------------|--------|------|------
1 | Gemini 3.5 Flash | 3/3 | 🏆 excellent | 100 | 53s | 41s / 62s | 98K | $0.23 | Best evidence-backed audit
2 | Qwen3.7 Max | 3/3 | 🏆 excellent | 100 | 158s | 128s / 198s | 80K | $0.12 | Detailed realism audit
3 | MiniMax M3 | 3/3 | 🏆 excellent | 100 | 250s | 186s / 289s | 141K | $0.06 | Nuanced conclusion, slow
4 | Grok 4.3 | 3/3 | 🏆 excellent | 100 | 8s | 5s / 10s | 33K | $0.04 | Fastest, but no data requests
5 | Claude Opus 4.8 | 3/3 | 🏆 excellent | 94 | 81s | 73s / 87s | 135K | $0.81 | Strong structural analysis
6 | Claude Opus 4.8 Fast | 3/3 | 🏆 excellent | 93 | 32s | 31s / 34s | 137K | $1.64 | Fast but expensive
7 | Gemini 3.1 Flash Lite | 3/3 | 🏆 excellent | 90 | 8s | 7s / 9s | 101K | $0.03 | Cheapest successful run
8 | GPT-5.5 | 3/3 | 🏆 excellent | 88 | 148s | 105s / 172s | 224K | $1.45 | Useful but schema issues
9 | GLM 5.2 | 3/3 | ✅ good | 90 | 84s | 47s / 133s | 126K | $0.20 | Correct but less complete
10 | Qwen3.7 Plus | 3/3 | ⚠️ fair | 100 | 146s | 140s / 153s | 149K | $0.05 | Accurate syntax, thinner evidence

Note: The auto-generated leaderboard ranked Grok 4.3 first because it was fast, cheap, successful, and had perfect field accuracy. For this article, I rank Gemini 3.5 Flash higher because the task explicitly required evidence from connected data, and Grok completed without making data requests.
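
The difference between the two orderings can be illustrated with a sort key. This is a sketch under assumptions: the actual formula behind the auto-generated leaderboard is not described here, so auto_rank is a stand-in, and evidence_rank is just one possible way to put "queried the data" ahead of speed and cost. Only the two rows whose data-request behavior is stated in this article are included.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str
    accuracy: int        # GA4 field accuracy, 0-100
    avg_seconds: float
    cost_usd: float
    queried_data: bool   # did the runs make any data requests?


def auto_rank(results: List[Result]) -> List[Result]:
    """Stand-in for an automatic ranking: accuracy, then speed, then cost."""
    return sorted(results, key=lambda r: (-r.accuracy, r.avg_seconds, r.cost_usd))


def evidence_rank(results: List[Result]) -> List[Result]:
    """Same ordering, but models that queried the data sort first."""
    return sorted(
        results,
        key=lambda r: (not r.queried_data, -r.accuracy, r.avg_seconds, r.cost_usd),
    )


# Figures taken from the leaderboard rows for these two models.
rows = [
    Result("Gemini 3.5 Flash", 100, 53, 0.23, queried_data=True),
    Result("Grok 4.3", 100, 8, 0.04, queried_data=False),
]

print([r.model for r in auto_rank(rows)])
print([r.model for r in evidence_rank(rows)])
```

With these two rows, the stand-in automatic ranking puts Grok 4.3 first and the evidence-aware one puts Gemini 3.5 Flash first. The published ranking also weighs answer quality, so it does not reduce to a single key like this.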

What the Models Found

The models mostly converged on the right answer: the dataset looks synthetic.

The strongest evidence repeated across runs:

  • The GA4 property is explicitly labeled as a test property.
  • The requested historical window only returned a short populated date range.
  • Traffic attribution is suspiciously incomplete or classified as unassigned.
  • Geography, browser, and device distributions are too tidy.
  • Event names and attributes match a curated tracking plan too closely.
  • Conversion-like product events fire at high volume while GA4 conversions and revenue stay at zero.
  • Some ratios are too smooth or too clean for production traffic.

The best models did not stop at "synthetic." They explained which signals were structural, which could be caused by broken tracking, and which additions would make the dataset more realistic.
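
Several of the tells listed above can be expressed as simple checks on aggregated data. The sketch below is illustrative only: the function names and thresholds are assumptions, not what any model or the benchmark actually ran, and the sample numbers are made up except for the shape of 8 populated days in a 90-day window, which matches what GPT-5.5 reported.

```python
from statistics import mean, pstdev
from typing import Dict, List


def populated_share(daily_sessions: List[int]) -> float:
    """Fraction of days in the requested window that returned any sessions."""
    if not daily_sessions:
        return 0.0
    return sum(1 for n in daily_sessions if n > 0) / len(daily_sessions)


def unassigned_share(sessions_by_channel: Dict[str, int]) -> float:
    """Fraction of sessions in the 'Unassigned' channel group."""
    total = sum(sessions_by_channel.values())
    if total == 0:
        return 0.0
    return sessions_by_channel.get("Unassigned", 0) / total


def smoothness(daily_sessions: List[int]) -> float:
    """Coefficient of variation over populated days; low means very smooth."""
    populated = [n for n in daily_sessions if n > 0]
    if len(populated) < 2:
        return float("inf")  # not enough data to judge
    return pstdev(populated) / mean(populated)


def synthetic_tells(daily: List[int], channels: Dict[str, int]) -> List[str]:
    """Collect tells using hypothetical thresholds chosen for illustration."""
    tells = []
    if populated_share(daily) < 0.5:
        tells.append("short populated date range")
    if unassigned_share(channels) > 0.9:
        tells.append("traffic almost entirely unassigned")
    if smoothness(daily) < 0.05:
        tells.append("daily traffic unusually smooth")
    return tells


# Made-up sample: 90-day window, only the last 8 days populated.
daily = [0] * 82 + [1000, 1010, 990, 1005, 995, 1000, 1002, 998]
channels = {"Unassigned": 8000}
print(synthetic_tells(daily, channels))
```

None of these checks is conclusive on its own. A short date range or all-unassigned traffic can also come from a new property or broken tracking, which is why the stronger answers separated structural signals from ones a tracking bug could explain.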

Model Breakdowns

1. Gemini 3.5 Flash

Cost: $0.23 | Avg Time: 53s | Accuracy: 100/100 | Provider: Google

Gemini 3.5 Flash was the best balanced result for this specific task. It made data requests, kept perfect GA4 field accuracy, and clearly explained why the dataset was synthetic.

Its final summary called out the lack of weekly seasonality, 100% missing traffic source attribution, mathematically suspicious event ratios, and alignment with static documentation benchmarks.

Why it ranked first:

  • Perfect field accuracy
  • Evidence-backed answer from queried data
  • Good speed for a multi-turn audit
  • Strong explanation of synthetic tells

2. Qwen3.7 Max

Cost: $0.12 | Avg Time: 158s | Accuracy: 100/100 | Provider: Qwen

Qwen3.7 Max gave one of the clearest realism audits. It identified exactly bounded geo values, limited browser diversity, a short temporal window, clean conversion funnels, and missing long-tail behavior.

It was slower than Gemini 3.5 Flash, but its answer was easy to turn into a demo-data improvement checklist: add more geography, more browser/device tail, more temporal seasonality, and more realistic outliers.

3. MiniMax M3

Cost: $0.06 | Avg Time: 250s | Accuracy: 100/100 | Provider: MiniMax

MiniMax M3 was the most nuanced low-cost result. It concluded the dataset was likely synthetic, but it also noted counter-signals: work-hour skew, weekday dominance, event drift beyond the documented plan, realistic new/returning ratios, and plausible session-to-user ratios.

That nuance is useful. A weaker model simply says "synthetic" and moves on. MiniMax described why the generator is already doing some things well.

The tradeoff: it was very slow.

4. Grok 4.3

Cost: $0.04 | Avg Time: 8s | Accuracy: 100/100 | Provider: xAI

Grok 4.3 was the fastest successful model by a wide margin. It correctly classified the dataset as synthetic and gave a concise explanation.

But the benchmark recorded zero data requests, which means it appears to have relied on provided context rather than actively inspecting the connected GA4 data. For a quick classifier, that is impressive. For an evidence audit, it is a limitation.

5. Claude Opus 4.8

Cost: $0.81 | Avg Time: 81s | Accuracy: 94/100 | Provider: Anthropic

Claude Opus 4.8 produced one of the strongest structural explanations. It identified geography collapsing to exact country/city pairs, browser distribution missing the normal long tail, a tidy Cartesian grid across geo/device/browser, and zero new users across days.

This was a rich answer, but it was not cheap. It also took a small accuracy hit from GA4 field issues.

6. Claude Opus 4.8 Fast

Cost: $1.64 | Avg Time: 32s | Accuracy: 93/100 | Provider: Anthropic

Claude Opus 4.8 Fast lived up to the name on latency, but not on cost. It was fast and thoughtful, identifying smooth traffic, all-unassigned channels, a closed geography set, and suspicious product-event rates.

The problem is economic: it cost more than GPT-5.5 and roughly 60x Gemini 3.1 Flash Lite.
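
The ratio is easy to verify from the published figures. A small worked calculation, where cost_ratio is a hypothetical helper and the inputs are the costs quoted in this article:

```python
def cost_ratio(cost_a: float, cost_b: float) -> float:
    """How many times more expensive cost_a is than cost_b."""
    if cost_b <= 0:
        raise ValueError("cost_b must be positive")
    return cost_a / cost_b


OPUS_FAST = 1.64
GPT_55 = 1.45
FLASH_LITE_EXACT = 0.027    # figure from Key Takeaways
FLASH_LITE_ROUNDED = 0.03   # figure as shown in the leaderboard

print(f"{cost_ratio(OPUS_FAST, FLASH_LITE_EXACT):.1f}x")    # 60.7x
print(f"{cost_ratio(OPUS_FAST, FLASH_LITE_ROUNDED):.1f}x")  # 54.7x
print(f"{cost_ratio(OPUS_FAST, GPT_55):.2f}x")              # 1.13x
```

The "roughly 60x" figure follows from the unrounded $0.027 cost. Using the leaderboard's rounded $0.03 gives about 55x instead.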

7. Gemini 3.1 Flash Lite

Cost: $0.03 | Avg Time: 8s | Accuracy: 90/100 | Provider: Google

Gemini 3.1 Flash Lite was the cheapest successful Round 3 model. It correctly identified synthetic/test-data patterns and was extremely fast.

Its answer was thinner than Gemini 3.5 Flash's, and its field-accuracy score was lower. I would use it for cheap smoke tests, not as the final judge for a data-quality audit.

8. GPT-5.5

Cost: $1.45 | Avg Time: 148s | Accuracy: 88/100 | Provider: OpenAI

GPT-5.5 was added as a one-off replacement after Claude Fable 5 failed all attempts through OpenRouter. Public reporting and Anthropic's own statement indicate Fable 5 had been disabled after a U.S. export-control directive, so we treated those failures as access-related rather than model-quality evidence.

GPT-5.5 correctly concluded the dataset was synthetic demo data. Its strongest evidence: the test-property label, only 8 populated days in a requested 90-day window, a curated event taxonomy, and conversion-like events with zero GA4 conversions/revenue.

The caveat: GPT-5.5 hit repeated GA4 compatibility issues around traffic dimensions with eventCount and screenPageViews, giving it the lowest accuracy score among successful Round 3 models.

9. GLM 5.2

Cost: $0.20 | Avg Time: 84s | Accuracy: 90/100 | Provider: Z.ai

GLM 5.2 correctly identified the dataset as synthetic with high confidence. It called out 100% engagement, zero new users, unassigned traffic, zero conversions, weekend traffic oddities, and generation-like documentation parameters.

It was useful, but not as complete or as clean as the top models.

10. Qwen3.7 Plus

Cost: $0.05 | Avg Time: 146s | Accuracy: 100/100 | Provider: Qwen

Qwen3.7 Plus had perfect field accuracy, but its final output was less evidence-rich than Qwen3.7 Max's. It identified synthetic fingerprints, but did not provide the same level of audit depth.

This is a good reminder that valid queries are not the same thing as a good analysis.

What Failed

Claude Fable 5 was selected by the newest-model automation, but failed all 3 attempts through OpenRouter. Public reporting and Anthropic's own statement indicate Fable 5 had been disabled after a U.S. export-control directive, so we treated this as an access failure rather than a useful model-quality result and replaced it with GPT-5.5 as a one-off run.

That replacement is now included in the Round 3 results.

Demo Data Lessons

The models gave surprisingly useful feedback for improving synthetic analytics data.

If the goal is a demo dataset that feels realistic without pretending to be production data, the biggest improvements are:

  • Add long-tail geography instead of a small closed set of cities and countries.
  • Add realistic browser and device tails: Edge, Firefox, Samsung Internet, bots, odd devices.
  • Add seasonality: weekday/weekend dips, launch spikes, quiet periods, holidays.
  • Add more acquisition mess: referrals, organic search, paid campaigns, direct traffic, spam.
  • Add conversion inconsistencies that mirror real tracking: partial revenue, missing events, delayed conversions.
  • Avoid perfect ratios and overly clean funnels.
  • Keep explicit demo/test labeling so users are not misled.

That last point matters. The goal is not to fool users. The goal is to create enough realism that the product, the AI, and the workflow are all tested honestly.
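
A few of the improvements listed above (weekend dips, noisy ratios, long-tail geography, explicit demo labeling) can be sketched in a small generator. This is an assumption-laden illustration, not Anamap's demo-data generator: every function name, parameter, and default value here is hypothetical.

```python
import random
from datetime import date, timedelta
from typing import Dict, List


def daily_sessions(start: date, days: int, base: int = 1000,
                   weekend_factor: float = 0.6, noise: float = 0.15,
                   seed: int = 0) -> List[Dict]:
    """Generate daily session counts with a weekend dip and random noise."""
    rng = random.Random(seed)
    rows = []
    for i in range(days):
        day = start + timedelta(days=i)
        # Saturday and Sunday (weekday() 5 and 6) get reduced traffic.
        level = base * (weekend_factor if day.weekday() >= 5 else 1.0)
        # Gaussian noise keeps day-to-day ratios from being perfectly clean.
        sessions = max(0, round(rng.gauss(level, level * noise)))
        # Every row stays explicitly labeled as demo data.
        rows.append({"date": day, "sessions": sessions, "is_demo": True})
    return rows


def long_tail_weights(n: int, exponent: float = 1.0) -> List[float]:
    """Zipf-like weights: a few large values and a long tail of small ones."""
    raw = [1 / (rank ** exponent) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_countries(countries: List[str], k: int, seed: int = 0) -> List[str]:
    """Draw k session countries so low-ranked countries still appear rarely."""
    rng = random.Random(seed)
    return rng.choices(countries, weights=long_tail_weights(len(countries)), k=k)
```

The same weighting idea would apply to browsers and devices. Launch spikes, holidays, and messy acquisition channels would need additional logic that this sketch leaves out.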

Methodology

Test setup:

  • GA4 property ID: 509106858
  • Same Anamap analytics benchmark runner as prior rounds
  • Same system prompt and data-source tooling pattern
  • 3 runs per model
  • Max 4 model turns per run
  • OpenRouter model selection limited to models created in the last 3 months
  • Previous benchmark models excluded
  • GPT-5.5 added as a one-off replacement for Claude Fable 5 after access to Fable 5 was disabled following a U.S. export-control directive

Prompt: Determine whether the connected dataset is real, synthetic, or inconclusive, and explain the evidence.

Evaluation criteria:

  • Completion rate across 3 runs
  • Quality of final analysis
  • GA4 field accuracy / hallucination score
  • Whether the model queried connected data
  • Cost
  • Latency
  • Specificity of evidence
  • Usefulness of demo-data improvement recommendations

Total cost: $4.62 across the initial 10-model run plus the GPT-5.5 replacement. The leaderboard above includes the replacement and excludes the unavailable Claude Fable 5 row.

What This Means

For synthetic-data detection, the best model is not necessarily the fastest or most expensive one.

Gemini 3.5 Flash was the best evidence-backed result. Grok 4.3 was the fastest classifier. MiniMax M3 gave the most useful low-cost nuanced answer. GPT-5.5 was directionally right but expensive and less precise with GA4 query construction.

The broader pattern is encouraging: most frontier models can identify synthetic analytics data when the evidence is available. The harder problem is making them show their work reliably.

Frequently Asked Questions

Which model was best at detecting synthetic analytics data?

Gemini 3.5 Flash was the best evidence-backed model in this benchmark. It completed all 3 runs, kept perfect GA4 field accuracy, made data requests, and produced a clear synthetic-data audit.

Why did Grok 4.3 not rank first if it was fastest?

Grok 4.3 was the fastest successful model and produced a correct answer, but it made zero data requests. For a task that asks the model to use evidence from connected data sources, that matters.

How did GPT-5.5 perform?

GPT-5.5 completed all 3 runs and correctly classified the dataset as synthetic demo data. It averaged 148 seconds and $1.45 per run, with an 88/100 field-accuracy score due to GA4 compatibility issues.

Was the synthetic data too obvious?

Some signals were intentionally obvious, such as the test-property label. But the better models went beyond that and inspected distributions, attribution gaps, temporal coverage, event taxonomy, and conversion inconsistencies.

Should demo data try to fool AI models?

No. Demo data should be clearly labeled. The goal is not deception; it is realism. A good demo dataset should contain enough realistic messiness to test workflows and model judgment without misleading users.

ABOUT THE AUTHOR

Alex Schlee

Founder & CEO

Alex Schlee is the founder of Anamap and has experience spanning the full gamut of analytics from implementation engineering to warehousing and insight generation. He's a great person to connect with about anything related to analytics or technology.