Anamap Blog
I Benchmarked 10 LLMs on Broken Analytics Data. Only 30% Delivered Value.
AI & Analytics
1/28/2026
Key Takeaways
- All 10 models achieved perfect API syntax and valid GA4 queries
- Only 30% provided actionable insights when the data was broken
- Best overall: Claude Opus 4.5 ($1.30) found workarounds and extracted real value
- Best value: Grok 4.1 Fast ($0.03) delivered solid analysis at roughly 1/40th the cost of the top model
- Dangerous: 30% hallucinated or gave misleading recommendations
The Setup: A Test Most Models Would Fail
I benchmarked 10 leading LLMs on a deceptively simple analytics question using the Anamap AI library. But the data had a critical flaw: 100% of traffic showed "(not set)" for source/medium, with zero conversion attribution.
This is a scenario every analytics professional dreads, and one that happens more often than you'd think. The question I asked: "Which traffic sources and landing pages are driving our highest-value users, and where should we double down our marketing investment?"
Every model got the technical part right: perfect API syntax, valid GA4 field names, clean query structures. But technical accuracy is table stakes. The real test was: Would they catch the broken data before giving recommendations?
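To make the failure mode concrete, here's the kind of query and sanity check involved. This is a hedged sketch using the plain GA4 Data API Python client rather than the Anamap library the benchmark ran through, and the property ID is a placeholder:
```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Sketch only: a plain GA4 Data API report, not the Anamap library's interface.
client = BetaAnalyticsDataClient()
response = client.run_report(RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="sessionSourceMedium"), Dimension(name="landingPage")],
    metrics=[Metric(name="sessions"), Metric(name="totalUsers")],
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
))

# Measure how much traffic actually carries attribution before trusting
# any channel-level recommendation.
total = sum(int(r.metric_values[0].value) for r in response.rows)
broken = sum(
    int(r.metric_values[0].value)
    for r in response.rows
    if r.dimension_values[0].value == "(not set)"
)
share = broken / total if total else 0.0
print(f"Sessions with no source/medium attribution: {share:.0%}")
```
On this test property, that final check comes back at 100%, which is exactly the signal every model should have surfaced before recommending anything.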
The Results: Technical Accuracy ≠ Analytical Value
All ten passed on syntax; the leaderboard below reflects what each model did once the data fell apart.
The Full Leaderboard
Here's how each model performed across all metrics:
| Rank | Model | Technical Quality | Time | Cost | Tokens | Data Quality Handling |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.5 | 🏆 excellent | 96s | $1.30 | 240K | ✅✅ Best analysis |
| 2 | Claude Sonnet 4.5 | 🏆 excellent | 124s | $0.66 | 191K | ✅✅ Clear pivot |
| 3 | Grok 4.1 Fast | 🏆 excellent | 83s | $0.03 | 122K | ✅ Found signal |
| 4 | GPT-5 | 🏆 excellent | 163s | $0.24 | 123K | ✅ Diagnostic only |
| 5 | Gemini 2.5 Flash | 🏆 excellent | 27s | $0.15 | 449K | ✅ Identified issue |
| 6 | DeepSeek V3.2 | 🏆 excellent | 199s | $0.03 | 125K | ✅ No next steps |
| 7 | Grok Code Fast 1 | 🏆 excellent | 28s | $0.02 | 78K | ✅ Identified issue |
| 8 | Gemini 3 Flash Preview | 🏆 excellent | 11s | $0.05 | 87K | ⚠️ Weak caveat |
| 9 | GPT-5 Mini | 🏆 excellent | 141s | $0.05 | 127K | ❌ Misleading |
| 10 | Gemini 2.5 Flash Lite | 🏆 excellent | 48s | $0.02 | 170K | ❌ Hallucinated |
Note: All models achieved "excellent" quality ratings for technical execution (valid GA4 queries, proper API syntax). The differentiation came entirely from how they handled the data quality problem.
Category Breakdown
🏆 Models That Delivered Value (30%)
These models didn't just identify the problem. They provided actionable workarounds and next steps.
1. Claude Opus 4.5
Cost: $1.30 | Time: 96s
Claude Opus didn't just flag the attribution failure. It pivoted to extract genuine insights from available data:
"Google organic search and LinkedIn are your two highest-value external traffic sources, driving engaged users to key conversion pages."
What made it exceptional:
- Called out 100% "(not set)" attribution immediately
- Pivoted to pageReferrer analysis as a workaround (sketched below)
- Found signal in landing page performance and funnel paths
- Provided specific, actionable fixes for the tracking implementation
- Generated 6 detailed charts including traffic flow analysis
Key insight discovered: The /features page had 99.2% engagement rate and served as the primary gateway to pricing, a critical finding despite broken source attribution.
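For readers who want to reproduce that kind of pivot themselves, here's a rough sketch of a pageReferrer fallback using the plain GA4 Data API (not Anamap's interface; the property ID is a placeholder). Referrer domains are only an approximation of source/medium, but they recover signal that "(not set)" erases:
```python
from urllib.parse import urlparse

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Dimension, Metric, RunReportRequest

# When source/medium is all "(not set)", the raw pageReferrer dimension can
# still recover a rough picture of where sessions came from.
client = BetaAnalyticsDataClient()
response = client.run_report(RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pageReferrer"), Dimension(name="landingPage")],
    metrics=[Metric(name="sessions")],
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
))

# Aggregate sessions by referrer domain as a proxy for traffic source.
by_domain: dict[str, int] = {}
for row in response.rows:
    domain = urlparse(row.dimension_values[0].value).netloc or "(direct/unknown)"
    by_domain[domain] = by_domain.get(domain, 0) + int(row.metric_values[0].value)

for domain, sessions in sorted(by_domain.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{domain:35s} {sessions:>10,}")
```
A domain-level referrer table is crude, but it's essentially the pivot Opus made to surface Google organic and LinkedIn despite the broken channel grouping.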
2. Claude Sonnet 4.5
Cost: $0.66 | Time: 124s
Sonnet led with the data quality issue but immediately pivoted to what IS working:
"Critical data quality issue detected: 100% of traffic shows '(not set)' for source/medium attribution... However, landing page data reveals strong engagement patterns."
Standout analysis:
- Identified a 73% signup funnel abandonment rate (signup_started → sign_up; see the sketch below)
- Found 414K subscription upgrades that couldn't be attributed
- Provided concrete implementation fixes
- Suggested proxy approaches while tracking is fixed
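That abandonment number is simple arithmetic over two event counts. A hedged sketch of the check, approximating the funnel with event counts for the signup_started and sign_up events (placeholder property ID):
```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Pull counts for the two funnel steps in one filtered report.
client = BetaAnalyticsDataClient()
response = client.run_report(RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="eventName")],
    metrics=[Metric(name="eventCount")],
    date_ranges=[DateRange(start_date="30daysAgo", end_date="today")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="eventName",
            in_list_filter=Filter.InListFilter(values=["signup_started", "sign_up"]),
        )
    ),
))

counts = {r.dimension_values[0].value: int(r.metric_values[0].value) for r in response.rows}
started, completed = counts.get("signup_started", 0), counts.get("sign_up", 0)
if started:
    # Abandonment = signups started that never completed.
    print(f"Signup funnel abandonment: {1 - completed / started:.0%}")
```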
3. Grok 4.1 Fast
Cost: $0.03 | Time: 83s
The best-value pick. Despite being the cheapest of the three models that delivered actionable analysis, Grok found real signal:
"Traffic sources are untrackable (100% '(not set)'), masking external marketing ROI... Subscription upgrades occur almost exclusively from in-app landing pages like /dashboard (28%) and /features (8%), signaling strong product-led growth."
What it found:
- Recognized product-led growth patterns from in-app upgrade paths
- Provided specific UTM discipline recommendations
- Calculated that 50%+ of upgrades came from in-app pages
- Identified /pricing and /features as high-conversion leverage points
⚠️ Models That Identified But Stopped There (40%)
These models correctly diagnosed the problem but left users without guidance.
GPT-5
Cost: $0.24 | Time: 163s
The most thorough diagnostician, but that's where it stopped:
"No revenue-expansion events (subscription_upgrade, add_on_purchased) were recorded in the GA4 property over the last 30 days."
GPT-5 provided exhaustive metadata analysis confirming the tracking gap but offered minimal actionable workarounds. If you needed confirmation that something was wrong, this was your model. If you needed help moving forward, you'd be stuck.
Gemini 2.5 Flash
Cost: $0.15 | Time: 27s
Fast and accurate identification, minimal workarounds:
"An attempt to identify top traffic sources... revealed significant data quality issues. All 'sign_up' events were attributed to '(not set)' for both traffic source/medium and landing page."
Good for quick diagnostics but didn't pivot to available data.
DeepSeek V3.2
Cost: $0.03 | Time: 199s
Slowest model, accurate diagnosis:
"No high-value conversion events (subscription_upgrade, add_on_purchased) were recorded in the last 30 days."
Correctly identified tracking gaps and suggested verification steps, but provided no analysis of what data WAS available.
Grok Code Fast 1
Cost: $0.02 | Time: 28s
Fast and cheap, correctly flagged the issue but stopped short of extracting value:
"Traffic source attribution is entirely missing. All sessions show '(not set)' for source/medium."
Provided a clear diagnosis but no workarounds or analysis of available data.
❌ Models That Hallucinated or Gave Weak Caveats (30%)
These responses ranged from misleading to dangerously wrong.
Gemini 2.5 Flash Lite
Cost: $0.02 | Time: 48s
The cheapest model. And it showed:
"Organic search from Google is the primary driver of high-value users, contributing the most sessions and engagement time."
The problem: All traffic data showed "(not set)". There was NO organic search data. This model fabricated specific session numbers (15,420 sessions from google/organic) and engagement durations that didn't exist in the dataset.
If a stakeholder acted on this analysis, they'd be optimizing for phantom traffic sources.
GPT-5 Mini
Cost: $0.05 | Time: 141s
Technically accurate, completely misleading:
"Users whose firstUserSource is recorded as '(direct)' account for ~9,043 of 12,596 subscription_upgrade users (≈71.8%)."
The problem: While it's technically true that "(direct)" (which here includes "(not set)") was the largest bucket, presenting it as an actionable insight about "direct traffic" performance misses that the data reflects an attribution failure, not a channel signal. Following this advice would lead to meaningless "direct channel optimization."
Gemini 3 Flash Preview
Cost: $0.05 | Time: 11s
The fastest model buried its warning:
"A significant portion of traffic is attributed to '(not set)' sources and landing pages, suggesting a need for UTM parameter enforcement."
This caveat appeared well into the analysis, after providing landing page recommendations. The structure implied the analysis was valid with a minor data quality note, rather than flagging that the core question couldn't be answered.
What This Means for Analytics Products
The Speed-Cost-Quality Tradeoff
| Metric | Winner | Loser |
|---|---|---|
| Fastest | Gemini 3 Flash Preview (11s) | DeepSeek V3.2 (199s) |
| Cheapest | Gemini 2.5 Flash Lite ($0.02) | Claude Opus 4.5 ($1.30) |
| Best Value | Grok 4.1 Fast ($0.03, actionable) | N/A |
| Best Analysis | Claude Opus 4.5 | Gemini 2.5 Flash Lite |
| Most Thorough | GPT-5 (diagnostics) | N/A |
For analytics products, optimizing purely for speed or cost risks deploying models that either:
- Hallucinate insights from broken data (dangerous)
- Leave users stuck without guidance (frustrating)
The Real Benchmark
Technical accuracy is table stakes; all 10 models passed. What separates useful from dangerous:
- Data quality detection - Does the model recognize when data is broken?
- Clear communication - Is the issue prominently flagged, not buried?
- Analytical pivot - Can it extract value from available data?
- Actionable guidance - Does it help users move forward?
A $0.02 wrong answer costs more than a $1.30 right one.
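Baking those four checks into a product doesn't require anything exotic. Here's a minimal guard, assuming report rows shaped like the GA4 Data API output used earlier and an arbitrary 20% threshold (both are my assumptions, not Anamap's implementation):
```python
NOT_SET_THRESHOLD = 0.20  # assumed cutoff; tune per property

def not_set_share(rows) -> float:
    """Share of sessions whose source/medium resolved to "(not set)".

    `rows` are assumed to be GA4 Data API report rows with sessionSourceMedium
    as the first dimension and sessions as the first metric.
    """
    total = sum(int(r.metric_values[0].value) for r in rows)
    broken = sum(
        int(r.metric_values[0].value)
        for r in rows
        if r.dimension_values[0].value == "(not set)"
    )
    return broken / total if total else 0.0

def answer_or_flag(rows) -> str:
    """Refuse to build channel recommendations on top of broken attribution."""
    share = not_set_share(rows)
    if share > NOT_SET_THRESHOLD:
        return (
            f"{share:.0%} of sessions have no source/medium attribution, so the "
            "channel ROI question cannot be answered directly. Flag the tracking "
            "issue first, then pivot to landing-page and referrer analysis."
        )
    return "Attribution looks healthy; proceed with channel-level analysis."
```
The specific threshold matters less than the ordering: the "can this question be answered at all?" check runs before any recommendation is generated.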
Methodology
Test Setup:
- GA4 property with intentionally broken attribution tracking (one way this state arises is sketched below)
- 100% of sessions showing "(not set)" for source/medium
- Valid conversion events that couldn't be attributed to channels
- Standard marketing ROI question
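For context, one common way a property ends up in this state (an assumption about the mechanism, not a description of the exact test fixture): conversion events sent server-side through the GA4 Measurement Protocol carry no session source/medium unless you attach attribution yourself, so they pile up under "(not set)". The measurement ID and API secret below are placeholders:
```python
import requests  # third-party HTTP client (pip install requests)

MEASUREMENT_ID = "G-XXXXXXXXXX"   # placeholder
API_SECRET = "your-api-secret"    # placeholder

# A server-side conversion hit like this has no session source/medium attached,
# so GA4 typically reports it under "(not set)" unless attribution is stitched in.
requests.post(
    "https://www.google-analytics.com/mp/collect",
    params={"measurement_id": MEASUREMENT_ID, "api_secret": API_SECRET},
    json={
        "client_id": "555.1234567890",
        "events": [{"name": "subscription_upgrade", "params": {"plan": "pro"}}],
    },
    timeout=10,
)
```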
Models Tested:
- Anthropic: Claude Opus 4.5, Claude Sonnet 4.5
- OpenAI: GPT-5, GPT-5 Mini
- Google: Gemini 3 Flash Preview, Gemini 2.5 Flash, Gemini 2.5 Flash Lite
- xAI: Grok 4.1 Fast, Grok Code Fast 1
- DeepSeek: V3.2
Evaluation Criteria:
- Query structure quality (all passed)
- GA4 field name accuracy (97% average, one model at 75%)
- Data quality issue detection
- Actionable workaround provision
- User value delivered
This benchmark was conducted using the Anamap AI analytics library, which provides unified analytics querying across GA4 and Amplitude. The test specifically evaluated model behavior when encountering data quality issues, a common real-world scenario that purely technical benchmarks miss.
Frequently Asked Questions
Which LLM is best for analytics?
Based on our benchmark, Claude Opus 4.5 delivered the best overall analysis, correctly identifying data quality issues while still extracting actionable insights from available data. For budget-conscious users, Grok 4.1 Fast at $0.03 per query provided the best value with solid analytical capabilities.
Do AI models hallucinate with analytics data?
Yes. In our test with intentionally broken GA4 data (100% attribution failure), 30% of models either hallucinated insights or provided misleadingly framed results. Gemini 2.5 Flash Lite fabricated traffic source data, while GPT-5 Mini presented "(not set)" data as actionable "direct traffic" insights.
How much does it cost to run AI analytics queries?
Costs varied significantly across our benchmark:
- Cheapest: Gemini 2.5 Flash Lite ($0.02) and Grok Code Fast 1 ($0.02)
- Most Expensive: Claude Opus 4.5 ($1.30)
- Best Value: Grok 4.1 Fast ($0.03) delivered actionable insights at low cost
Can AI detect broken analytics tracking?
Our benchmark specifically tested this. Results varied:
- 70% detected the issue (attribution data showing "(not set)")
- 30% either missed it or buried warnings in their analysis
- Only 30% both detected AND provided actionable workarounds
Which is faster: Claude, GPT-5, or Gemini?
Gemini models were fastest in our benchmark:
- Fastest: Gemini 3 Flash Preview (11 seconds)
- Slowest: DeepSeek V3.2 (199 seconds)
- Claude Opus 4.5 took 96 seconds; GPT-5 took 163 seconds
Should I use cheap AI models for analytics?
It depends on your risk tolerance. The cheapest model (Gemini 2.5 Flash Lite at $0.02) hallucinated data in our test. A wrong answer that leads to misallocated marketing budget costs far more than the price difference between models. For critical decisions, invest in models with better analytical judgment.
Related Reading
- The Best LLM for Analytics in 2026: Our top recommendations by use case, based on 16 models tested
- Round 2: 6 New LLMs, 3 Runs Each: We tested 6 more models 3 times each. MiniMax M2.5 at $0.06 beat Claude Opus 4.6 at $1.35.
- LLM Analytics Benchmark: The Definitive Leaderboard: Combined results from all rounds -- 16 models compared in one place
- Common Analytics Mistakes and How to Avoid Them: Learn about the data quality issues that trip up both humans and AI
- Data Planning for Better Data Quality: How to set up your analytics implementation to avoid "(not set)" nightmares
- Analytics Maps Will Turbocharge Your Insights: Why visual documentation helps teams catch tracking issues faster