
LLM Analytics Benchmark: The Definitive Leaderboard

AI & Analytics

2/12/2026

Alex Schlee

Founder & CEO

The Most Comprehensive Real-World LLM Analytics Benchmark

Most LLM benchmarks test coding puzzles or trivia questions. We test something different: Can this AI model actually help you understand your analytics data?

This leaderboard combines results from every round of our ongoing benchmark series. Each round tests a new batch of models against real Google Analytics 4 data with a standard marketing analytics question. We evaluate not just technical accuracy (API syntax, field names) but analytical judgment: Can the model detect data quality issues, provide actionable insights, and help users make better decisions?

How This Benchmark Works
  • Real GA4 data -- not synthetic or sanitized. Our test property has intentionally broken attribution tracking.
  • Standard query -- "Which traffic sources and landing pages are driving our highest-value users?"
  • Holistic evaluation -- technical accuracy, data quality detection, analytical depth, and actionable guidance
  • Multi-run testing -- newer rounds test each model 3 times to measure consistency

Combined Leaderboard

๐Ÿ† Combined LLM Analytics Leaderboard
16 models tested across 2 rounds ยท 28 total runs
#Model โ†•ProviderRoundRunsQualityAccuracy โ†•Avg Time โ†•Cost โ†•$/1K tok โ†•Key Strength
1MiniMax M2.5MiniMaxR23/3๐Ÿ† excellent10070s$0.06$0.0003Fastest & cheapest, excellent quality
2Kimi K2.5MoonshotAIR23/3๐Ÿ† excellent100125s$0.07$0.000598.5% engagement insight
3Claude Opus 4.6AnthropicR23/3๐Ÿ† excellent100143s$1.35$0.0056Most comprehensive analysis
4GLM 5Z.aiR23/3๐Ÿ† excellent100205s$0.16$0.0009Actionable conversion rates
5Qwen3 Max ThinkingQwenR23/3๐Ÿ† excellent9689s$0.44$0.0012Fast deep thinking
6Claude Opus 4.5AnthropicR11/1๐Ÿ† excellent10096s$1.30$0.0054Best workarounds for broken data
7Claude Sonnet 4.5AnthropicR11/1๐Ÿ† excellent100124s$0.66$0.0034Clear pivot to actionable data
8Grok 4.1 FastxAIR11/1๐Ÿ† excellent10083s$0.03$0.0002Best value in Round 1
9GPT-5OpenAIR11/1๐Ÿ† excellent100163s$0.24$0.0020Thorough diagnostics
10Gemini 2.5 FlashGoogleR11/1๐Ÿ† excellent10027s$0.15$0.0003Fast identification
11DeepSeek V3.2DeepSeekR11/1๐Ÿ† excellent100199s$0.03$0.0002Accurate low-cost diagnosis
12Grok Code Fast 1xAIR11/1๐Ÿ† excellent10028s$0.02$0.0003Ultra-fast identification
13Gemini 3 Flash PreviewGoogleR11/1๐Ÿ† excellent10011s$0.05$0.0006Fastest overall (11s)
14GPT-5 MiniOpenAIR11/1โš ๏ธ misleading100141s$0.05$0.0004Misleading framing of broken data
15Gemini 2.5 Flash LiteGoogleR11/1โŒ hallucinated7548s$0.02$0.0001Fabricated traffic source data
-Aurora AlphaStealth (OpenRouter)R20/3๐Ÿ’ฅ error----Context window too small (128K)

  • R1 = Round 1: 10 models, 1 run each, broken data quality test (Jan 2026)
  • R2 = Round 2: 6 models, 3 runs each, marketing attribution test (Feb 2026)

Round-by-Round Results

Round 2: Consistency Test (February 2026)

6 models, 3 runs each, 18 total test runs

The second round focused on consistency, testing whether AI models deliver reliable results across multiple runs. We also expanded to include models from Chinese AI labs and a stealth OpenRouter release.

Key findings:

  • MiniMax M2.5 dominated on every efficiency metric -- fastest, cheapest, excellent quality
  • 5 of 6 models achieved excellent quality, a significant improvement over Round 1's quality variance
  • Aurora Alpha (stealth OpenRouter release) failed all 3 runs due to context window limitations
  • Quality was consistent -- 14 of 15 successful runs scored "excellent"
  • Speed varied significantly -- GLM 5 ranged from 145s to 275s across runs

Read the full Round 2 analysis

Round 1: The Broken Data Test (January 2026)

10 models, 1 run each, 10 total test runs

The original benchmark tested how leading AI models handle a common real-world scenario: broken analytics data. All traffic attribution showed as "(not set)" with zero conversion tracking.

Key findings:

  • All 10 models achieved perfect API syntax -- technical accuracy is table stakes
  • Only 30% provided actionable insights despite broken data
  • 30% hallucinated -- fabricating traffic source data or presenting broken data as insights
  • Claude Opus 4.5 delivered the best analysis with workarounds and next steps
  • Grok 4.1 Fast was the Round 1 best value at $0.03 with solid analysis

Read the full Round 1 analysis

How to Use This Data

Choosing by Budget

| Budget | Best Choice | Why |
|--------|-------------|-----|
| Under $0.05/query | Grok 4.1 Fast ($0.03, R1) or MiniMax M2.5 ($0.02/run, R2) | Both delivered excellent quality at rock-bottom prices |
| Under $0.25/query | Kimi K2.5 ($0.07, R2) or Gemini 2.5 Flash ($0.15, R1) | Strong analysis with good speed |
| No budget limit | Claude Opus 4.6 ($1.35, R2) or Claude Opus 4.5 ($1.30, R1) | Most comprehensive, thorough investigation |

Choosing by Use Case

| Use Case | Recommended Model | Reason |
|----------|-------------------|--------|
| Daily automated queries | MiniMax M2.5 | Cheapest + fastest at excellent quality |
| Executive dashboards | Claude Opus 4.6 | Most thorough, catches nuances |
| Quick diagnostics | Gemini 3 Flash Preview | 11s response time |
| Budget analytics teams | Grok 4.1 Fast or Kimi K2.5 | Excellent analysis under $0.10 |
| Data quality audits | Claude Opus 4.5 or 4.6 | Best at finding and working around issues |

Models to Avoid

  • Gemini 2.5 Flash Lite -- hallucinated traffic source data in our test. A wrong answer is worse than no answer.
  • GPT-5 Mini -- presented broken data as actionable "direct traffic" insights without adequate caveats.
  • Aurora Alpha -- failed to complete analysis due to context window limitations.

Methodology

Test Environment

  • Data source: Real Google Analytics 4 property
  • Data condition: Intentionally broken attribution tracking (100% "(not set)" for source/medium)
  • Query: "Which traffic sources and landing pages are driving our highest-value users, and where should we double down our marketing investment?" (the sketch after this list shows the kind of API call this question elicits)
  • Valid conversion events: sign_up, subscription_upgrade, add_on_purchased (properly tracked)
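
To make the setup concrete, here is a minimal sketch of the kind of GA4 Data API request a model typically issues for this question. It uses the official google-analytics-data Python client; the property ID is a placeholder, and the specific dimensions and metrics vary from model to model.

```python
# Minimal sketch of a GA4 Data API query for the benchmark question.
# Requires the google-analytics-data package and application default credentials;
# "properties/123456789" is a placeholder property ID.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",
    dimensions=[
        Dimension(name="sessionSourceMedium"),  # returns "(not set)" in our broken test property
        Dimension(name="landingPage"),
    ],
    metrics=[
        Metric(name="totalUsers"),
        Metric(name="conversions"),  # sign_up, subscription_upgrade, add_on_purchased are tracked
    ],
    date_ranges=[DateRange(start_date="28daysAgo", end_date="yesterday")],
)

response = client.run_report(request)
for row in response.rows:
    print([v.value for v in row.dimension_values], [v.value for v in row.metric_values])
```

A model with good analytical judgment notices that every sessionSourceMedium row comes back as "(not set)", flags the attribution problem, and adjusts its follow-up queries instead of reporting the broken breakdown as insight.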

What We Evaluate

  1. Technical accuracy -- Valid GA4 field names, correct API syntax, proper query structure
  2. Accuracy score (0-100) -- How accurately the model uses real GA4 dimensions and metrics (a simplified scoring sketch follows this list)
  3. Data quality detection -- Does the model identify attribution tracking issues?
  4. Analytical depth -- How many data requests? How thorough is the investigation?
  5. Actionable output -- Does the model provide guidance, workarounds, and next steps?
  6. Consistency (multi-run rounds) -- Does the model deliver similar quality every time?
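
As an illustration of how the accuracy score in item 2 can be computed, here is a simplified sketch. The valid-field sets are abbreviated stand-ins; in practice the full lists come from the GA4 Data API metadata, and our harness's exact scoring logic may differ.

```python
# Simplified sketch of the 0-100 accuracy score: the share of dimensions and
# metrics in a model's queries that are valid GA4 Data API field names.
# The sets below are abbreviated examples, not the full GA4 field catalogue.
VALID_DIMENSIONS = {"sessionSourceMedium", "landingPage", "sessionDefaultChannelGroup", "eventName"}
VALID_METRICS = {"totalUsers", "sessions", "conversions", "engagementRate", "eventCount"}

def accuracy_score(used_dimensions: list[str], used_metrics: list[str]) -> float:
    """Return the percentage (0-100) of requested field names that are valid."""
    valid = sum(1 for d in used_dimensions if d in VALID_DIMENSIONS)
    valid += sum(1 for m in used_metrics if m in VALID_METRICS)
    total = len(used_dimensions) + len(used_metrics)
    return 100.0 * valid / total if total else 0.0

# e.g. accuracy_score(["sessionSourceMedium", "landingPages"], ["totalUsers"])
# returns about 66.7, because "landingPages" is not a valid GA4 dimension name.
```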

How Models Are Ranked

The leaderboard ranks models by a composite score that weights the following factors (a simplified sketch of the calculation follows the list):

  • Quality rating (40%) -- overall analytical value delivered
  • Cost efficiency (25%) -- $/1K tokens
  • Accuracy score (20%) -- GA4 field name correctness
  • Speed (15%) -- average response time
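
For readers who want to reproduce the ranking, here is a simplified sketch of that weighting. The weights are the ones listed above; the normalization of cost and speed (scoring each model relative to the most expensive and slowest model in the comparison) is an illustrative assumption, not the exact formula behind the leaderboard.

```python
# Simplified sketch of the composite ranking. Weights match the post; the
# cost/speed normalization (relative to the worst model in the comparison)
# is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class ModelResult:
    quality: float             # 0-100 quality rating
    accuracy: float            # 0-100 GA4 field-name accuracy
    cost_per_1k_tokens: float  # USD per 1K tokens (lower is better)
    avg_seconds: float         # average response time (lower is better)

def composite_score(r: ModelResult, worst_cost: float, worst_seconds: float) -> float:
    """Higher is better; cost and speed are scored relative to the worst model tested."""
    cost_efficiency = 1.0 - (r.cost_per_1k_tokens / worst_cost)
    speed = 1.0 - (r.avg_seconds / worst_seconds)
    return (
        0.40 * (r.quality / 100.0)
        + 0.25 * cost_efficiency
        + 0.20 * (r.accuracy / 100.0)
        + 0.15 * speed
    )
```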

Models that hallucinate or provide misleading results are ranked below models that correctly identify limitations, regardless of other metrics.

Want AI analytics that works?
Anamap uses rigorously benchmarked models to deliver reliable analytics insights. No hallucinations, no misleading results.

This leaderboard is updated with each new benchmark round. All testing is conducted using the Anamap AI analytics library. Want to suggest a model for the next round? Let us know.

Frequently Asked Questions

What is the best LLM for Google Analytics?

Based on our benchmark of 16 models across 28 test runs, MiniMax M2.5 offers the best combination of quality, speed, and cost at $0.02 per query. For maximum analytical depth, Claude Opus 4.5/4.6 delivers the most comprehensive analysis at approximately $1.30 per query.

How often is this leaderboard updated?

We run new benchmark rounds periodically, testing fresh batches of models as they release. Each round is documented in a detailed blog post, and results are added to this combined leaderboard. The current data reflects 2 rounds (January and February 2026).

Why test on broken analytics data?

Broken attribution is one of the most common real-world analytics problems. Testing on clean, well-structured data only measures technical capability. Our benchmark measures analytical judgment: can the AI detect problems, communicate them clearly, and still extract value?

Can I trust cheap AI models for analytics?

Yes, with caveats. Our Round 2 results show that MiniMax M2.5 ($0.02/query) and Kimi K2.5 ($0.02/query) both delivered excellent quality consistently. However, Round 1 showed that the cheapest model (Gemini 2.5 Flash Lite at $0.02) hallucinated data. Always validate that a model handles edge cases before relying on it for production analytics.

Which AI providers make the best analytics models?

Based on our data: Anthropic (Claude) leads on analytical depth but is the most expensive. MiniMax and MoonshotAI offer the best value in Round 2. xAI (Grok) was the best value in Round 1. Google's results were mixed, with Gemini 2.5 Flash performing well but Flash Lite hallucinating. OpenAI's GPT-5 was solid but GPT-5 Mini was misleading.

How do you measure LLM accuracy in analytics?

We track an accuracy score (0-100) that measures how correctly each model uses valid GA4 API field names. A score of 100 means every dimension and metric used was valid. We also evaluate whether models fabricate data or present broken data as reliable insights, which is a higher-level form of hallucination that pure syntax checks miss.


Want to stay up to date with our latest blog posts?

Sign up for our email list to receive updates on new blog posts and product releases.

ABOUT THE AUTHOR

Alex Schlee

Founder & CEO

Alex Schlee is the founder of Anamap and has experience spanning the full gamut of analytics from implementation engineering to warehousing and insight generation. He's a great person to connect with about anything related to analytics or technology.