Anamap Blog
LLM Analytics Benchmark: The Definitive Leaderboard
AI & Analytics
2/12/2026
The Most Comprehensive Real-World LLM Analytics Benchmark
Most LLM benchmarks test coding puzzles or trivia questions. We test something different: Can this AI model actually help you understand your analytics data?
This leaderboard combines results from every round of our ongoing benchmark series. Each round tests a new batch of models against real Google Analytics 4 data with a standard marketing analytics question. We evaluate not just technical accuracy (API syntax, field names) but analytical judgment: Can the model detect data quality issues, provide actionable insights, and help users make better decisions?
- Real GA4 data -- not synthetic or sanitized. Our test property has intentionally broken attribution tracking.
- Standard query -- "Which traffic sources and landing pages are driving our highest-value users?"
- Holistic evaluation -- technical accuracy, data quality detection, analytical depth, and actionable guidance
- Multi-run testing -- newer rounds test each model 3 times to measure consistency
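To ground what "technical accuracy (API syntax, field names)" means in practice: the models are graded on whether they can issue well-formed GA4 Data API queries. Below is a minimal sketch of the kind of runReport call being evaluated, written against Google's official google-analytics-data Python client. The property ID is a placeholder, and this is an illustration of the API surface, not the exact harness Anamap runs.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()

# A naive first pass at the benchmark question: break down users and
# conversions by traffic source/medium and landing page. In the test
# property, source/medium comes back as "(not set)" -- exactly the data
# quality issue the models are expected to catch and call out.
request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[
        Dimension(name="sessionSource"),
        Dimension(name="sessionMedium"),
        Dimension(name="landingPage"),
    ],
    metrics=[
        Metric(name="totalUsers"),
        Metric(name="conversions"),
    ],
    date_ranges=[DateRange(start_date="28daysAgo", end_date="yesterday")],
)
response = client.run_report(request)

for row in response.rows:
    print([v.value for v in row.dimension_values],
          [v.value for v in row.metric_values])
```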
Combined Leaderboard
| # | Model | Provider | Round | Runs | Quality | Accuracy | Avg Time | Cost | $/1K tok | Key Takeaway |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | MiniMax M2.5 | MiniMax | R2 | 3/3 | 🏆 excellent | 100 | 70s | $0.06 | $0.0003 | Fastest & cheapest, excellent quality |
| 2 | Kimi K2.5 | MoonshotAI | R2 | 3/3 | 🏆 excellent | 100 | 125s | $0.07 | $0.0005 | 98.5% engagement insight |
| 3 | Claude Opus 4.6 | Anthropic | R2 | 3/3 | 🏆 excellent | 100 | 143s | $1.35 | $0.0056 | Most comprehensive analysis |
| 4 | GLM 5 | Z.ai | R2 | 3/3 | 🏆 excellent | 100 | 205s | $0.16 | $0.0009 | Actionable conversion rates |
| 5 | Qwen3 Max Thinking | Qwen | R2 | 3/3 | 🏆 excellent | 96 | 89s | $0.44 | $0.0012 | Fast deep thinking |
| 6 | Claude Opus 4.5 | Anthropic | R1 | 1/1 | 🏆 excellent | 100 | 96s | $1.30 | $0.0054 | Best workarounds for broken data |
| 7 | Claude Sonnet 4.5 | Anthropic | R1 | 1/1 | 🏆 excellent | 100 | 124s | $0.66 | $0.0034 | Clear pivot to actionable data |
| 8 | Grok 4.1 Fast | xAI | R1 | 1/1 | 🏆 excellent | 100 | 83s | $0.03 | $0.0002 | Best value in Round 1 |
| 9 | GPT-5 | OpenAI | R1 | 1/1 | 🏆 excellent | 100 | 163s | $0.24 | $0.0020 | Thorough diagnostics |
| 10 | Gemini 2.5 Flash | Google | R1 | 1/1 | 🏆 excellent | 100 | 27s | $0.15 | $0.0003 | Fast identification |
| 11 | DeepSeek V3.2 | DeepSeek | R1 | 1/1 | 🏆 excellent | 100 | 199s | $0.03 | $0.0002 | Accurate low-cost diagnosis |
| 12 | Grok Code Fast 1 | xAI | R1 | 1/1 | 🏆 excellent | 100 | 28s | $0.02 | $0.0003 | Ultra-fast identification |
| 13 | Gemini 3 Flash Preview | Google | R1 | 1/1 | 🏆 excellent | 100 | 11s | $0.05 | $0.0006 | Fastest overall (11s) |
| 14 | GPT-5 Mini | OpenAI | R1 | 1/1 | ⚠️ misleading | 100 | 141s | $0.05 | $0.0004 | Misleading framing of broken data |
| 15 | Gemini 2.5 Flash Lite | Google | R1 | 1/1 | ❌ hallucinated | 75 | 48s | $0.02 | $0.0001 | Fabricated traffic source data |
| - | Aurora Alpha | Stealth (OpenRouter) | R2 | 0/3 | 💥 error | - | - | - | - | Context window too small (128K) |
Round-by-Round Results
Round 2: Consistency Test (February 2026)
6 models, 3 runs each, 18 total test runs
The second round focused on consistency, testing whether AI models deliver reliable results across multiple runs. We also expanded to include models from Chinese AI labs and a stealth OpenRouter release.
Key findings:
- MiniMax M2.5 dominated on every efficiency metric -- fastest, cheapest, excellent quality
- 5 of 6 models achieved excellent quality, an improvement over Round 1, where quality varied more widely
- Aurora Alpha (stealth OpenRouter release) failed all 3 runs due to context window limitations
- Quality was consistent -- 14 of 15 successful runs scored "excellent"
- Speed varied significantly -- GLM 5 ranged from 145s to 275s across runs
Read the full Round 2 analysis
Round 1: The Broken Data Test (January 2026)
10 models, 1 run each, 10 total test runs
The original benchmark tested how leading AI models handle a common real-world scenario: broken analytics data. All traffic attribution showed as "(not set)" with zero conversion tracking.
Key findings:
- All 10 models achieved perfect API syntax -- technical accuracy is table stakes
- Only 30% provided actionable insights despite broken data
- 30% hallucinated -- fabricating traffic source data or presenting broken data as insights
- Claude Opus 4.5 delivered the best analysis with workarounds and next steps
- Grok 4.1 Fast was the Round 1 best value at $0.03 with solid analysis
Read the full Round 1 analysis
How to Use This Data
Choosing by Budget
| Budget | Best Choice | Why |
|---|---|---|
| Under $0.10/query | Grok 4.1 Fast ($0.03, R1) or MiniMax M2.5 ($0.06, R2) | Both delivered excellent quality at rock-bottom prices |
| Under $0.25/query | Kimi K2.5 ($0.07, R2) or Gemini 2.5 Flash ($0.15, R1) | Strong analysis with good speed |
| No budget limit | Claude Opus 4.6 ($1.35, R2) or Claude Opus 4.5 ($1.30, R1) | Most comprehensive, thorough investigation |
Choosing by Use Case
| Use Case | Recommended Model | Reason |
|---|---|---|
| Daily automated queries | MiniMax M2.5 | Best combination of low cost, speed, and excellent quality |
| Executive dashboards | Claude Opus 4.6 | Most thorough, catches nuances |
| Quick diagnostics | Gemini 3 Flash Preview | 11s response time |
| Budget analytics teams | Grok 4.1 Fast or Kimi K2.5 | Excellent analysis under $0.10 |
| Data quality audits | Claude Opus 4.5 or 4.6 | Best at finding and working around issues |
Models to Avoid
- Gemini 2.5 Flash Lite -- hallucinated traffic source data in our test. A wrong answer is worse than no answer.
- GPT-5 Mini -- presented broken data as actionable "direct traffic" insights without adequate caveats.
- Aurora Alpha -- failed to complete analysis due to context window limitations.
Methodology
Test Environment
- Data source: Real Google Analytics 4 property
- Data condition: Intentionally broken attribution tracking (100% "(not set)" for source/medium)
- Query: "Which traffic sources and landing pages are driving our highest-value users, and where should we double down our marketing investment?"
- Valid conversion events: sign_up, subscription_upgrade, add_on_purchased (properly tracked)
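To make the "valid conversion events" concrete: because source/medium is unusable, the strongest responses pivot to dimensions that still work, such as landing page plus the three properly tracked events. A hedged sketch of that workaround query, again using the google-analytics-data Python client with a placeholder property ID, might look like this:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Filter,
    FilterExpression,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()

# Attribution is broken, but landing pages and the three properly tracked
# conversion events are not -- so report value by landing page instead.
request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="landingPage"), Dimension(name="eventName")],
    metrics=[Metric(name="eventCount"), Metric(name="totalUsers")],
    date_ranges=[DateRange(start_date="28daysAgo", end_date="yesterday")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="eventName",
            in_list_filter=Filter.InListFilter(
                values=["sign_up", "subscription_upgrade", "add_on_purchased"]
            ),
        )
    ),
)
report = client.run_report(request)
```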
What We Evaluate
- Technical accuracy -- Valid GA4 field names, correct API syntax, proper query structure
- Accuracy score (0-100) -- How accurately the model uses real GA4 dimensions and metrics
- Data quality detection -- Does the model identify attribution tracking issues?
- Analytical depth -- How many data requests? How thorough is the investigation?
- Actionable output -- Does the model provide guidance, workarounds, and next steps?
- Consistency (multi-run rounds) -- Does the model deliver similar quality every time?
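As a rough illustration of the accuracy score in particular, the check reduces to validating every field a model references against the GA4 Data API schema. The sketch below uses a hypothetical `accuracy_score` helper and a deliberately truncated allow-list; it shows the idea, not our exact grading code.

```python
# Hypothetical illustration of the 0-100 accuracy score: the share of
# referenced GA4 Data API field names that are actually valid. The sets
# below are small samples, not the full GA4 schema.
VALID_DIMENSIONS = {"sessionSource", "sessionMedium", "sessionDefaultChannelGroup",
                    "landingPage", "eventName"}
VALID_METRICS = {"totalUsers", "conversions", "eventCount", "engagementRate"}

def accuracy_score(dimensions: list[str], metrics: list[str]) -> float:
    """Return the percentage of referenced field names that are valid."""
    checks = [d in VALID_DIMENSIONS for d in dimensions] + \
             [m in VALID_METRICS for m in metrics]
    if not checks:
        return 0.0
    return 100.0 * sum(checks) / len(checks)

# "trafficSource" is not a GA4 Data API dimension, so 3 of 4 fields are valid.
print(accuracy_score(["sessionSource", "trafficSource"],
                     ["totalUsers", "conversions"]))  # 75.0
```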
How Models Are Ranked
The leaderboard ranks models by a composite score weighing:
- Quality rating (40%) -- overall analytical value delivered
- Cost efficiency (25%) -- $/1K tokens
- Accuracy score (20%) -- GA4 field name correctness
- Speed (15%) -- average response time
Models that hallucinate or provide misleading results are ranked below models that correctly identify limitations, regardless of other metrics.
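To make the weighting concrete, here is a minimal sketch of how such a composite could be computed. The 40/25/20/15 weights come from the list above; normalizing cost and speed against the best observed values, and the penalty for hallucination, are assumptions for illustration rather than our exact formula.

```python
def composite_score(quality: float, accuracy: float,
                    cost_per_1k: float, avg_seconds: float,
                    best_cost_per_1k: float, best_seconds: float,
                    penalized: bool = False) -> float:
    """Illustrative composite: quality and accuracy are 0-100 scores;
    cost and speed are normalized so the cheapest/fastest model gets 100."""
    cost_component = 100.0 * best_cost_per_1k / cost_per_1k
    speed_component = 100.0 * best_seconds / avg_seconds
    score = (0.40 * quality             # quality rating
             + 0.25 * cost_component    # cost efficiency ($/1K tokens)
             + 0.20 * accuracy          # GA4 field-name correctness
             + 0.15 * speed_component)  # average response time
    # Hallucinating or misleading models rank below models that correctly
    # flag limitations, regardless of their numeric score.
    return score - 1000.0 if penalized else score

# Example with combined-leaderboard values for MiniMax M2.5
# (quality "excellent" mapped to 100 for illustration):
print(composite_score(100, 100, cost_per_1k=0.0003, avg_seconds=70,
                      best_cost_per_1k=0.0001, best_seconds=11))
```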
This leaderboard is updated with each new benchmark round. All testing is conducted using the Anamap AI analytics library. Want to suggest a model for the next round? Let us know.
Frequently Asked Questions
What is the best LLM for Google Analytics?
Based on our benchmark of 16 models across 28 test runs, MiniMax M2.5 offers the best combination of quality, speed, and cost at $0.06 per query. For maximum analytical depth, Claude Opus 4.5/4.6 delivers the most comprehensive analysis at roughly $1.30-$1.35 per query.
How often is this leaderboard updated?
We run new benchmark rounds periodically, testing fresh batches of models as they release. Each round is documented in a detailed blog post, and results are added to this combined leaderboard. The current data reflects 2 rounds (January and February 2026).
Why test on broken analytics data?
Broken attribution is one of the most common real-world analytics problems. Testing on clean, well-structured data only measures technical capability. Our benchmark measures analytical judgment: can the AI detect problems, communicate them clearly, and still extract value?
Can I trust cheap AI models for analytics?
Yes, with caveats. Our Round 2 results show that MiniMax M2.5 ($0.06/query) and Kimi K2.5 ($0.07/query) both delivered excellent quality consistently. However, Round 1 showed that the cheapest model (Gemini 2.5 Flash Lite at $0.02) hallucinated data. Always validate that a model handles edge cases before relying on it for production analytics.
Which AI providers make the best analytics models?
Based on our data: Anthropic (Claude) leads on analytical depth but is the most expensive. MiniMax and MoonshotAI offer the best value in Round 2. xAI (Grok) was the best value in Round 1. Google's results were mixed, with Gemini 2.5 Flash performing well but Flash Lite hallucinating. OpenAI's GPT-5 was solid but GPT-5 Mini was misleading.
How do you measure LLM accuracy in analytics?
We track an accuracy score (0-100) that measures how correctly each model uses valid GA4 API field names. A score of 100 means every dimension and metric used was valid. We also evaluate whether models fabricate data or present broken data as reliable insights, which is a higher-level form of hallucination that pure syntax checks miss.
Related Reading
- The Best LLM for Analytics in 2026: Our top recommendations by use case, based on all benchmark data
- Round 1: I Benchmarked 10 LLMs on Broken Analytics Data: The original benchmark testing 10 established models
- Round 2: The Consistency Test -- 6 Models, 3 Runs Each: Testing newer models for consistency and cost efficiency