Anamap Blog

I Benchmarked 10 LLMs on Broken Analytics Data. Only 30% Delivered Value.

AI & Analytics

1/28/2026

Alex Schlee

Founder & CEO

Key Takeaways
  • All 10 models achieved perfect API syntax and valid GA4 queries
  • Only 30% provided actionable insights when the data was broken
  • Best overall: Claude Opus 4.5 ($1.30) found workarounds and extracted real value
  • Best value: Grok 4.1 Fast ($0.03) delivered solid analysis at 1/40th the cost
  • Dangerous: 30% hallucinated or gave misleading recommendations

The Setup: A Test Most Models Would Fail

I benchmarked 10 leading LLMs on a deceptively simple analytics question using the Anamap AI library. But the data had a critical flaw: 100% of traffic sources showed as "(not set)", with zero conversion attribution.

This is a scenario every analytics professional dreads, and one that happens more often than you'd think. The question I asked: "Which traffic sources and landing pages are driving our highest-value users, and where should we double down our marketing investment?"

Every model got the technical part right: perfect API syntax, valid GA4 field names, clean query structures. But technical accuracy is table stakes. The real test was: Would they catch the broken data before giving recommendations?
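To make that technical bar concrete, here is a minimal sketch of the kind of GA4 report every model was able to specify correctly. It uses Google's GA4 Data API Python client directly rather than the Anamap library, the property ID is a placeholder, and it is not the exact query any model produced in the benchmark:

```python
# Sketch only: roughly the report all 10 models constructed without errors.
# Assumes the google-analytics-data package and Application Default Credentials;
# "properties/123456789" is a placeholder, not a real property.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
    dimensions=[
        Dimension(name="sessionSourceMedium"),  # where sessions came from
        Dimension(name="landingPage"),          # where they entered the site
    ],
    metrics=[
        Metric(name="sessions"),
        Metric(name="engagementRate"),
        Metric(name="conversions"),  # "keyEvents" on newer GA4 properties
    ],
    limit=50,
)

for row in client.run_report(request).rows:
    source_medium, landing_page = (d.value for d in row.dimension_values)
    sessions, engagement_rate, conversions = (m.value for m in row.metric_values)
    print(source_medium, landing_page, sessions, engagement_rate, conversions)
```

On healthy data this answers the question directly. On this dataset, every sessionSourceMedium value comes back as "(not set)", which is exactly the trap.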

The Results: Technical Accuracy ≠ Analytical Value

⚠️ The Real Test
All 10 models achieved perfect API syntax (100% accuracy scores on field names). But the data had 100% attribution failures — every traffic source was "(not set)" with zero conversion attribution.

This benchmark tested: Will they catch the broken data and provide value anyway?
  • Models tested: 10
  • Actionable analysis: 30%
  • Hallucinated/weak: 30%
  • API accuracy: 100%
🏆 Models That Delivered Value
30% provided actionable insights despite 100% broken attribution data

  1. Claude Opus 4.5 (Anthropic) | Analysis: Exceptional | Time: 96s | Cost: $1.30 | Workarounds + guidance
  2. Claude Sonnet 4.5 (Anthropic) | Analysis: Clear | Time: 124s | Cost: $0.66 | Critical issue + pivot
  3. Grok 4.1 Fast (xAI) | Analysis: Solid | Time: 83s | Cost: $0.03 | Found signal anyway
⚠️ Identified But No Next Steps
40% correctly diagnosed the problem but left users stuck without guidance

  • GPT-5 (OpenAI) | Issue: Diagnostic | Time: 163s | Cost: $0.24 | Thorough but no action
  • Gemini 2.5 Flash (Google) | Issue: Clear ID | Time: 27s | Cost: $0.15 | No workarounds
  • DeepSeek V3.2 (DeepSeek) | Issue: Accurate | Time: 199s | Cost: $0.03 | No next steps
  • Grok Code Fast 1 (xAI) | Issue: Clear ID | Time: 28s | Cost: $0.02 | Identified only
❌ Hallucinated or Weak Caveats
30% gave confident recommendations from broken data or buried critical caveats

  • Gemini 2.5 Flash Lite (Google) | Issue: Fabricated | Time: 48s | Cost: $0.02 | "Organic search primary"
  • GPT-5 Mini (OpenAI) | Issue: Misleading | Time: 141s | Cost: $0.05 | "72% direct traffic"
  • Gemini 3 Flash Preview (Google) | Issue: Weak caveat | Time: 11s | Cost: $0.05 | Buried warning
Want AI analytics that handles broken data gracefully?
See how Anamap detects data quality issues before they mislead your team.

The Full Leaderboard

Here's how each model performed across all metrics:

Rank | Model | Quality | Time | Cost | Tokens | Data Quality Handling
1 | Claude Opus 4.5 | 🏆 excellent | 96s | $1.30 | 240K | ✅✅ Best analysis
2 | Claude Sonnet 4.5 | 🏆 excellent | 124s | $0.66 | 191K | ✅✅ Clear pivot
3 | Grok 4.1 Fast | 🏆 excellent | 83s | $0.03 | 122K | ✅ Found signal
4 | GPT-5 | 🏆 excellent | 163s | $0.24 | 123K | ✅ Diagnostic only
5 | Gemini 2.5 Flash | 🏆 excellent | 27s | $0.15 | 449K | ✅ Identified issue
6 | DeepSeek V3.2 | 🏆 excellent | 199s | $0.03 | 125K | ✅ No next steps
7 | Grok Code Fast 1 | 🏆 excellent | 28s | $0.02 | 78K | ✅ Identified issue
8 | Gemini 3 Flash Preview | 🏆 excellent | 11s | $0.05 | 87K | ⚠️ Weak caveat
9 | GPT-5 Mini | 🏆 excellent | 141s | $0.05 | 127K | ❌ Misleading
10 | Gemini 2.5 Flash Lite | 🏆 excellent | 48s | $0.02 | 170K | ❌ Hallucinated

Note: All models achieved "excellent" quality ratings for technical execution (valid GA4 queries, proper API syntax). The differentiation came entirely from how they handled the data quality problem.

Category Breakdown

🏆 Models That Delivered Value (30%)

These models didn't just identify the problem. They provided actionable workarounds and next steps.

1. Claude Opus 4.5

Cost: $1.30 | Time: 96s

Claude Opus didn't just flag the attribution failure. It pivoted to extract genuine insights from available data:

"Google organic search and LinkedIn are your two highest-value external traffic sources, driving engaged users to key conversion pages."

What made it exceptional:

  • Called out 100% "(not set)" attribution immediately
  • Pivoted to pageReferrer analysis as a workaround
  • Found signal in landing page performance and funnel paths
  • Provided specific, actionable fixes for the tracking implementation
  • Generated 6 detailed charts including traffic flow analysis

Key insight discovered: The /features page had 99.2% engagement rate and served as the primary gateway to pricing, a critical finding despite broken source attribution.
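For illustration, here is a hypothetical reconstruction of that pageReferrer fallback (a sketch, not Opus's actual query): when sessionSourceMedium is unusable, the raw referrer still hints at where traffic originates. The property ID is a placeholder.

```python
# Hypothetical sketch of a pageReferrer fallback when source/medium is all
# "(not set)". Referrer URLs are rolled up to their domain as a rough proxy.
from collections import Counter
from urllib.parse import urlparse

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()
response = client.run_report(RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
    dimensions=[Dimension(name="pageReferrer")],
    metrics=[Metric(name="sessions")],
    limit=250,
))

# Roll referrer URLs up to their domain as a stand-in for traffic source.
sessions_by_domain: Counter[str] = Counter()
for row in response.rows:
    referrer = row.dimension_values[0].value
    domain = urlparse(referrer).netloc or "(no referrer)"
    sessions_by_domain[domain] += int(row.metric_values[0].value)

for domain, sessions in sessions_by_domain.most_common(10):
    print(f"{domain}: {sessions} sessions")
```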

2. Claude Sonnet 4.5

Cost: $0.66 | Time: 124s

Sonnet led with the data quality issue but immediately pivoted to what IS working:

"Critical data quality issue detected: 100% of traffic shows '(not set)' for source/medium attribution... However, landing page data reveals strong engagement patterns."

Standout analysis:

  • Identified 73% signup funnel abandonment rate (signup_started → sign_up)
  • Found 414K subscription upgrades that couldn't be attributed
  • Provided concrete implementation fixes
  • Suggested proxy approaches while tracking is fixed
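The 73% abandonment figure above is simple arithmetic once you have event counts. A minimal sketch, assuming a placeholder property ID and treating raw event counts as a rough proxy rather than a true user-scoped funnel:

```python
# Sketch of the funnel math behind the 73% abandonment finding: compare
# signup_started to sign_up event counts from a GA4 report.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()
response = client.run_report(RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
    dimensions=[Dimension(name="eventName")],
    metrics=[Metric(name="eventCount")],
))

counts = {
    row.dimension_values[0].value: int(row.metric_values[0].value)
    for row in response.rows
}

started = counts.get("signup_started", 0)
completed = counts.get("sign_up", 0)
if started:
    print(f"Signup funnel abandonment: {1 - completed / started:.0%}")
```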

3. Grok 4.1 Fast

Cost: $0.03 | Time: 83s

The best value player. Despite the lowest cost among actionable models, Grok found real signal:

"Traffic sources are untrackable (100% '(not set)'), masking external marketing ROI... Subscription upgrades occur almost exclusively from in-app landing pages like /dashboard (28%) and /features (8%), signaling strong product-led growth."

What it found:

  • Recognized product-led growth patterns from in-app upgrade paths
  • Provided specific UTM discipline recommendations
  • Calculated that 50%+ of upgrades came from in-app pages
  • Identified /pricing and /features as high-conversion leverage points
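A hypothetical sketch of that in-app share calculation (not Grok's actual logic; the in-app path prefixes and property ID are assumptions): group subscription_upgrade events by landing page and measure how much conversion starts inside the product.

```python
# Hypothetical sketch: share of subscription_upgrade volume that starts on
# in-app pages vs. marketing pages. IN_APP_PREFIXES is an assumed route list.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

IN_APP_PREFIXES = ("/dashboard", "/settings", "/account")  # assumed app routes

client = BetaAnalyticsDataClient()
response = client.run_report(RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
    dimensions=[Dimension(name="eventName"), Dimension(name="landingPage")],
    metrics=[Metric(name="eventCount")],
    limit=10000,
))

total = in_app = 0
for row in response.rows:
    event_name, landing_page = (d.value for d in row.dimension_values)
    if event_name != "subscription_upgrade":
        continue
    count = int(row.metric_values[0].value)
    total += count
    if landing_page.startswith(IN_APP_PREFIXES):
        in_app += count

if total:
    print(f"In-app share of upgrades: {in_app / total:.0%}")
```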

⚠️ Models That Identified But Stopped There (40%)

These models correctly diagnosed the problem but left users without guidance.

GPT-5

Cost: $0.24 | Time: 163s

The most thorough diagnostician, but that's where it stopped:

"No revenue-expansion events (subscription_upgrade, add_on_purchased) were recorded in the GA4 property over the last 30 days."

GPT-5 provided exhaustive metadata analysis confirming the tracking gap but offered minimal actionable workarounds. If you needed confirmation that something was wrong, this was your model. If you needed help moving forward, you'd be stuck.

Gemini 2.5 Flash

Cost: $0.15 | Time: 27s

Fast and accurate identification, minimal workarounds:

"An attempt to identify top traffic sources... revealed significant data quality issues. All 'sign_up' events were attributed to '(not set)' for both traffic source/medium and landing page."

Good for quick diagnostics but didn't pivot to available data.

DeepSeek V3.2

Cost: $0.03 | Time: 199s

Slowest model, accurate diagnosis:

"No high-value conversion events (subscription_upgrade, add_on_purchased) were recorded in the last 30 days."

Correctly identified tracking gaps and suggested verification steps, but provided no analysis of what data WAS available.

Grok Code Fast 1

Cost: $0.02 | Time: 28s

Fast and cheap, correctly flagged the issue but stopped short of extracting value:

"Traffic source attribution is entirely missing. All sessions show '(not set)' for source/medium."

Provided a clear diagnosis but no workarounds or analysis of available data.

❌ Models That Hallucinated or Gave Weak Caveats (30%)

These responses ranged from misleading to dangerously wrong.

Gemini 2.5 Flash Lite

Cost: $0.02 | Time: 48s

The cheapest model. And it showed:

"Organic search from Google is the primary driver of high-value users, contributing the most sessions and engagement time."

The problem: All traffic data showed "(not set)". There was NO organic search data. This model fabricated specific session numbers (15,420 sessions from google/organic) and engagement durations that didn't exist in the dataset.

If a stakeholder acted on this analysis, they'd be optimizing for phantom traffic sources.

GPT-5 Mini

Cost: $0.05 | Time: 141s

Technically accurate, completely misleading:

"Users whose firstUserSource is recorded as '(direct)' account for ~9,043 of 12,596 subscription_upgrade users (≈71.8%)."

The problem: It's technically true that "(direct)", which here includes "(not set)", was the largest bucket. But presenting that as an actionable insight about "direct traffic" performance misses that the data reflects an attribution failure, not a channel insight. Following this advice would lead to meaningless "direct channel optimization."
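The fix is mechanical: treat unattributed buckets as missing data rather than as a channel. A minimal sketch with illustrative numbers (the row values below are stand-ins, not the benchmark dataset):

```python
# Sketch of the step GPT-5 Mini skipped: exclude unattributed buckets before
# reporting channel shares. In this benchmark's dataset, "(direct)" absorbed
# the unattributed users, so it is excluded alongside "(not set)".
UNATTRIBUTED = {"(not set)", "(direct)"}

def channel_shares(rows: list[tuple[str, int]]) -> dict[str, float]:
    """Share of conversions per source, ignoring unattributed buckets."""
    attributed = [(source, n) for source, n in rows if source not in UNATTRIBUTED]
    total = sum(n for _, n in attributed)
    if total == 0:
        raise ValueError("No attributed conversions; fix tracking before optimizing channels.")
    return {source: n / total for source, n in attributed}

rows = [("(direct)", 9_043), ("(not set)", 3_553)]  # illustrative stand-in data
try:
    print(channel_shares(rows))
except ValueError as err:
    print(err)  # the honest answer: there is nothing to optimize yet
```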

Gemini 3 Flash Preview

Cost: $0.05 | Time: 11s

The fastest model buried its warning:

"A significant portion of traffic is attributed to '(not set)' sources and landing pages, suggesting a need for UTM parameter enforcement."

This caveat appeared well into the analysis, after providing landing page recommendations. The structure implied the analysis was valid with a minor data quality note, rather than flagging that the core question couldn't be answered.

What This Means for Analytics Products

The Speed-Cost-Quality Tradeoff

Metric | Winner | Loser
Fastest | Gemini 3 Flash Preview (11s) | DeepSeek V3.2 (199s)
Cheapest | Gemini 2.5 Flash Lite ($0.02) | Claude Opus 4.5 ($1.30)
Best Value | Grok 4.1 Fast ($0.03, actionable) | N/A
Best Analysis | Claude Opus 4.5 | Gemini 2.5 Flash Lite
Most Thorough | GPT-5 (diagnostics) | N/A

For analytics products, optimizing purely for speed or cost risks deploying models that either:

  1. Hallucinate insights from broken data (dangerous)
  2. Leave users stuck without guidance (frustrating)

The Real Benchmark

Technical accuracy is table stakes; all 10 models passed. What separates useful from dangerous:

  1. Data quality detection - Does the model recognize when data is broken?
  2. Clear communication - Is the issue prominently flagged, not buried?
  3. Analytical pivot - Can it extract value from available data?
  4. Actionable guidance - Does it help users move forward?

A $0.02 wrong answer costs more than a $1.30 right one.
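As a concrete illustration of the first criterion, here is a minimal guard (a sketch, not Anamap's implementation) that checks how much of the attribution data is unusable before any recommendations are generated:

```python
# Sketch of a pre-analysis attribution health check: if most sessions are
# unattributed, surface the tracking problem instead of channel advice.
# The 50% threshold is an arbitrary assumption for illustration.
def attribution_health(rows: list[tuple[str, int]], threshold: float = 0.5) -> str:
    """rows: (sessionSourceMedium, sessions) pairs from a GA4 report."""
    total = sum(sessions for _, sessions in rows)
    broken = sum(sessions for source, sessions in rows if "(not set)" in source)
    share = broken / total if total else 1.0
    if share >= threshold:
        return (f"BLOCKED: {share:.0%} of sessions are unattributed. "
                "Flag the tracking failure before recommending channels.")
    return f"OK: {share:.0%} unattributed. Proceed with channel analysis."

# The benchmark scenario: every session lands in the failure bucket.
print(attribution_health([("(not set) / (not set)", 120_000)]))
```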

Ready to try AI-powered analytics?
Anamap catches data quality issues before giving recommendations. No hallucinations, no misleading insights.

Methodology

Test Setup:

  • GA4 property with intentionally broken attribution tracking
  • 100% of sessions showing "(not set)" for source/medium
  • Valid conversion events that couldn't be attributed to channels
  • Standard marketing ROI question

Models Tested:

  • Anthropic: Claude Opus 4.5, Claude Sonnet 4.5
  • OpenAI: GPT-5, GPT-5 Mini
  • Google: Gemini 3 Flash Preview, Gemini 2.5 Flash, Gemini 2.5 Flash Lite
  • xAI: Grok 4.1 Fast, Grok Code Fast 1
  • DeepSeek: V3.2

Evaluation Criteria:

  • Query structure quality (all passed)
  • GA4 field name accuracy (97% average, one model at 75%)
  • Data quality issue detection
  • Actionable workaround provision
  • User value delivered

This benchmark was conducted using the Anamap AI analytics library, which provides unified analytics querying across GA4 and Amplitude. The test specifically evaluated model behavior when encountering data quality issues, a common real-world scenario that purely technical benchmarks miss.

Frequently Asked Questions

Which LLM is best for analytics?

Based on our benchmark, Claude Opus 4.5 delivered the best overall analysis, correctly identifying data quality issues while still extracting actionable insights from available data. For budget-conscious users, Grok 4.1 Fast at $0.03 per query provided the best value with solid analytical capabilities.

Do AI models hallucinate with analytics data?

Yes. In our test with intentionally broken GA4 data (100% attribution failure), 30% of models either hallucinated insights or provided misleadingly framed results. Gemini 2.5 Flash Lite fabricated traffic source data, while GPT-5 Mini presented "(not set)" data as actionable "direct traffic" insights.

How much does it cost to run AI analytics queries?

Costs varied significantly across our benchmark:

  • Cheapest: Gemini 2.5 Flash Lite ($0.02) and Grok Code Fast 1 ($0.02)
  • Most Expensive: Claude Opus 4.5 ($1.30)
  • Best Value: Grok 4.1 Fast ($0.03) delivered actionable insights at low cost

Can AI detect broken analytics tracking?

Our benchmark specifically tested this. Results varied:

  • 70% detected the issue (attribution data showing "(not set)")
  • 30% either missed it or buried warnings in their analysis
  • Only 30% both detected AND provided actionable workarounds

Which is faster: Claude, GPT-5, or Gemini?

Gemini models were fastest in our benchmark:

  • Fastest: Gemini 3 Flash Preview (11 seconds)
  • Slowest: DeepSeek V3.2 (199 seconds)
  • Claude Opus 4.5 took 96 seconds; GPT-5 took 163 seconds

Should I use cheap AI models for analytics?

It depends on your risk tolerance. The cheapest model (Gemini 2.5 Flash Lite at $0.02) hallucinated data in our test. A wrong answer that leads to misallocated marketing budget costs far more than the price difference between models. For critical decisions, invest in models with better analytical judgment.



ABOUT THE AUTHOR

Alex Schlee

Founder & CEO

Alex Schlee is the founder of Anamap and has experience spanning the full gamut of analytics from implementation engineering to warehousing and insight generation. He's a great person to connect with about anything related to analytics or technology.