Anamap Blog

6 New LLMs, 3 Runs Each: The Best AI for Analytics Costs $0.06

AI & Analytics

2/12/2026

Alex Schlee

Founder & CEO

Why We Ran Each Model 3 Times

Key Takeaways
  • 5 of 6 models delivered excellent quality; 14 of 15 successful runs were rated "excellent"
  • Best overall: MiniMax M2.5 -- fastest (70s avg), cheapest ($0.06), excellent quality
  • Most thorough: Claude Opus 4.6 ($1.35) -- comprehensive analysis but 22x the cost
  • Complete failure: Aurora Alpha (stealth OpenRouter release) couldn't start -- context window too small
  • Consistency matters: Run times varied by up to 90% across runs of the same model

In Round 1, we tested 10 LLMs on broken GA4 data with a single run each. The results were revealing, but they left an important question unanswered: How consistent are these models?

LLMs are probabilistic systems. The same prompt can produce different results every time. A model that gives a brilliant answer once might stumble the next time. For analytics products, this is a critical concern. You can't ship an AI assistant that's excellent 60% of the time and mediocre the rest.

So for Round 2, we changed two things:

  1. 3 runs per model -- every model ran the same query 3 separate times
  2. 6 new models -- MiniMax M2.5, Kimi K2.5, Claude Opus 4.6, GLM 5, Qwen3 Max Thinking, and Aurora Alpha

The test query remained the same: "Which traffic sources and landing pages are driving our highest-value users, and where should we double down our marketing investment?" And the same GA4 property with broken attribution data.
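
The harness behind this is simple: send the same prompt to each model three times and record timing and token usage for every run. A minimal sketch of that loop, assuming OpenRouter's OpenAI-compatible chat completions endpoint and hypothetical model slugs (the real harness also wires in GA4 tool calls, which are omitted here):

```python
import os
import time

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

# Hypothetical model slugs -- check OpenRouter for the exact identifiers.
MODELS = ["minimax/minimax-m2.5", "moonshotai/kimi-k2.5", "anthropic/claude-opus-4.6"]
RUNS_PER_MODEL = 3
QUERY = (
    "Which traffic sources and landing pages are driving our highest-value users, "
    "and where should we double down our marketing investment?"
)

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

results = []
for model in MODELS:
    for run in range(1, RUNS_PER_MODEL + 1):
        start = time.monotonic()
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": QUERY}],
        )
        results.append({
            "model": model,
            "run": run,
            "seconds": round(time.monotonic() - start, 1),
            "total_tokens": response.usage.total_tokens,
        })
```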

The Results: Consistency Meets Cost Efficiency

🔄 The Consistency Test

LLMs are probabilistic -- the same question can yield different results each time. We ran 6 models 3 times each on the same marketing attribution query against a real GA4 property.

5 of 6 models delivered excellent results consistently. The cheapest one won.

At a glance: 6 models tested, 3 runs each, an 83% success rate, and a 99 average accuracy score. The five models that completed (MiniMax M2.5, Kimi K2.5, Claude Opus 4.6, GLM 5, Qwen3 Max Thinking) all earned "excellent" overall quality ratings, at total costs ranging from $0.06 (MiniMax M2.5, also the fastest at a 70s average) to $1.35 (Claude Opus 4.6). Aurora Alpha, a stealth OpenRouter release, failed all 3 attempts: its 128K context window fell short of the ~144K tokens required.

Full Leaderboard

Rank | Model | Runs | Quality | Accuracy | Avg Time | Best/Worst | Tokens | Cost | $/1K tok
1 | MiniMax M2.5 | 3/3 | 🏆 excellent | 100 | 70s | 52s / 88s | 190K | $0.06 | $0.0003
2 | Kimi K2.5 | 3/3 | 🏆 excellent | 100 | 125s | 108s / 145s | 129K | $0.07 | $0.0005
3 | Claude Opus 4.6 | 3/3 | 🏆 excellent | 100 | 143s | 113s / 167s | 238K | $1.35 | $0.0056
4 | GLM 5 | 3/3 | 🏆 excellent | 100 | 205s | 145s / 275s | 184K | $0.16 | $0.0009
5 | Qwen3 Max Thinking | 3/3 | 🏆 excellent | 96 | 89s | 75s / 114s | 359K | $0.44 | $0.0012
- | Aurora Alpha | 0/3 | 💥 error | - | - | - | - | - | -

Note: All 5 successful models achieved "excellent" quality ratings and near-perfect accuracy scores (96-100). The ranking is based on the combination of cost efficiency, speed, and analytical depth.

Consistency Spotlight

One of the most interesting findings from running 3 tests per model was the variance in execution time, even when quality remained consistent.

Model | Run 1 | Run 2 | Run 3 | Variance | Quality Consistency
MiniMax M2.5 | 52s | 69s | 88s | 69% spread | Excellent all 3
Kimi K2.5 | 108s | 123s | 145s | 34% spread | Excellent all 3
Claude Opus 4.6 | 113s | 149s | 167s | 48% spread | Excellent all 3
Qwen3 Max Thinking | 75s | 78s | 114s | 52% spread | Excellent all 3
GLM 5 | 145s | 196s | 275s | 90% spread | 2 excellent, 1 good

Key finding: Quality was remarkably stable across runs. Even where execution time varied significantly, the analytical output stayed consistent. The one exception was GLM 5: its run times ranged from 145s to 275s, and one of its runs was rated "good" rather than "excellent" -- the only quality variance observed across all 15 successful runs.
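
For reference, the "spread" figures above are the gap between the slowest and fastest run, relative to the fastest run. A quick sketch of that calculation, which reproduces the table's numbers:

```python
# Spread = (slowest run - fastest run) / fastest run, per model.
run_times = {
    "MiniMax M2.5":       [52, 69, 88],
    "Kimi K2.5":          [108, 123, 145],
    "Claude Opus 4.6":    [113, 149, 167],
    "Qwen3 Max Thinking": [75, 78, 114],
    "GLM 5":              [145, 196, 275],
}
for model, times in run_times.items():
    spread = (max(times) - min(times)) / min(times)
    print(f"{model}: {spread:.0%} spread")
```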

Model Breakdowns

🏆 Models That Delivered Excellent Results

1. MiniMax M2.5

Cost: $0.06 | Avg Time: 70s | Accuracy: 100/100 | Provider: MiniMax

The runaway winner on every efficiency metric. MiniMax M2.5 was both the fastest and cheapest model while still delivering excellent analysis across all 3 runs.

"GA4 property has limited source tracking configured (sessionSourceMedium shows '(not set)' for all traffic), making traditional traffic source analysis unreliable. However, valuable conversion signals exist: subscription_upgrade (69,121 events), add_on_purchased (65,561), and sign_up (1,309)."

What made it exceptional:

  • Immediately identified the attribution tracking gap
  • Pivoted to conversion event analysis as the key value driver
  • Found the top landing pages: /dashboard (206K sessions), /features (86K), /docs (35K), /pricing (32K)
  • Flagged that traffic source tracking needs improvement for proper ROI analysis
  • Did all of this in about 70 seconds on average at $0.02 per run

Consistency across runs: Quality remained excellent across all 3 runs. Time ranged from 52s to 88s, but the analytical output was consistently thorough and well-structured.

Cost comparison: At $0.0003 per 1K tokens, MiniMax M2.5 is 19x cheaper than Claude Opus 4.6 ($0.0056/1K). For teams running hundreds of queries per day, this difference is enormous.
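
The attribution gap MiniMax flagged is easy to verify directly, without any LLM in the loop. A minimal sketch using the official google-analytics-data Python client (PROPERTY_ID and the date range are placeholders, not values from our benchmark):

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

PROPERTY_ID = "123456789"  # placeholder GA4 property ID

client = BetaAnalyticsDataClient()
report = client.run_report(RunReportRequest(
    property=f"properties/{PROPERTY_ID}",
    dimensions=[Dimension(name="sessionSourceMedium")],
    metrics=[Metric(name="sessions")],
    date_ranges=[DateRange(start_date="28daysAgo", end_date="today")],
))

# Share of sessions whose source/medium was never attributed.
total = sum(int(row.metric_values[0].value) for row in report.rows)
not_set = sum(
    int(row.metric_values[0].value)
    for row in report.rows
    if row.dimension_values[0].value == "(not set)"
)
print(f"{not_set / total:.1%} of sessions have no source/medium attribution")
```

If a check like this comes back near 100% "(not set)", the fix lives in tagging and configuration, not in smarter analysis -- exactly the conclusion the top models reached.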

2. Kimi K2.5

Cost: $0.07 | Avg Time: 125s | Accuracy: 100/100 | Provider: MoonshotAI

Kimi K2.5, from Chinese AI lab MoonshotAI, delivered an insight that no other model surfaced.

"Landing page analysis reveals strong engagement on product marketing pages, with /landing/serverless achieving 98.5% engagement rate -- far outperforming the homepage."

What stood out:

  • The only model to highlight the /landing/serverless page's exceptional engagement rate (98.5%)
  • Noted that traffic source attribution data was unavailable, limiting full ROI analysis
  • Maintained perfect accuracy scores across all runs
  • Consistently solid at $0.07 total (just $0.01 more than MiniMax)

Consistency across runs: Most consistent timing of all models (108s-145s, 34% spread). Every run was excellent quality.

3. Claude Opus 4.6

Cost: $1.35 | Avg Time: 143s | Accuracy: 100/100 | Provider: Anthropic

The most comprehensive analysis by far, but at 22x the cost of MiniMax M2.5. Claude Opus 4.6 (a newer release than the Opus 4.5 we tested in Round 1) delivered the deepest investigation.

"The Dashboard (/dashboard) dominates session volume (979K sessions, 99.9% engagement rate), confirming strong product stickiness. For marketing-facing pages, /features leads with 1,004 active users and 428K sessions at 99.4% engagement."

What made it thorough:

  • Analyzed the entire conversion funnel: 6,337 sign_ups, 340,798 subscription_upgrades, 321,895 add-on purchases
  • Quantified engagement rates per page: /pricing at 96.7%, /landing/serverless at 97.8%
  • Explicitly called out the UTM tracking gap as a critical limitation
  • Generated 4 data requests per run, investigating multiple angles
  • Provided specific recommendations for fixing attribution tracking

Consistency across runs: Slightly wider time range (113s-167s) but quality was consistently excellent with near-perfect accuracy scores.

The cost question: Is Claude Opus 4.6's depth worth 22x the cost of MiniMax M2.5? For high-stakes strategy decisions, possibly. For routine daily queries, almost certainly not.

4. GLM 5

Cost: $0.16 | Avg Time: 205s | Accuracy: 100/100 | Provider: Z.ai

GLM 5 was the slowest model but provided the most actionable conversion rate analysis.

"Critical data quality issue discovered: all traffic source attribution is missing. However, landing page analysis reveals clear high-value drivers: /features page leads with 8,029 subscription upgrades (9.2% rate), /docs drives 4,943 upgrades (14.1% rate), and /pricing generates 3,317 upgrades (10.2% rate)."

What it found:

  • Calculated specific conversion rates per landing page -- the only model to do this
  • Identified /docs at 14.1% upgrade rate as the highest-converting page
  • Called out the UTM fix as "immediate action required"
  • Provided a clear, actionable framework for marketing investment
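
As a sanity check, the session bases implied by those rates roughly match the landing-page volumes MiniMax reported earlier (/features ~86K, /docs ~35K, /pricing ~32K):

```python
# Implied sessions = subscription upgrades / upgrade rate, using GLM 5's figures.
pages = {
    "/features": (8_029, 0.092),
    "/docs":     (4_943, 0.141),
    "/pricing":  (3_317, 0.102),
}
for page, (upgrades, rate) in pages.items():
    print(f"{page}: ~{upgrades / rate:,.0f} sessions implied")
```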

Consistency across runs: This was the least consistent model. Run times ranged from 145s to 275s (90% spread), and one run scored "good" instead of "excellent" -- the only quality variance across all 15 successful runs in this benchmark.

5. Qwen3 Max Thinking

Cost: $0.44 | Avg Time: 89s | Accuracy: 96/100 | Provider: Qwen

Qwen3 Max Thinking was the second-fastest model and used the most tokens (359K), reflecting its "thinking" approach that processes internally before responding.

"The data shows that most traffic (100%) is coming from '(not set)' source, indicating a significant tracking issue with UTM parameters or referral data."

What stood out:

  • Fast analysis despite high token usage (4,032 tokens/sec processing speed)
  • 4 data requests per run, matching Claude and MiniMax in investigation depth
  • Correctly identified the core tracking issue

The accuracy caveat: Qwen3 scored 96/100 on accuracy rather than a perfect 100. It attempted to use 2 invalid GA4 dimensions (audience_segment and audienceName) that don't exist in the GA4 API. While these didn't derail the analysis, they indicate slightly less precise understanding of the GA4 schema compared to the top 4 models.
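
Schema slips like this can be caught before a query ever runs: the GA4 Data API exposes a metadata endpoint that lists every dimension and metric available on a property. A minimal validation sketch (PROPERTY_ID is a placeholder; the requested names are the ones discussed above):

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient

PROPERTY_ID = "123456789"  # placeholder GA4 property ID

client = BetaAnalyticsDataClient()
metadata = client.get_metadata(name=f"properties/{PROPERTY_ID}/metadata")
valid_dimensions = {d.api_name for d in metadata.dimensions}

# Dimension names an LLM proposed for a report -- validate before building the request.
requested = ["sessionSourceMedium", "landingPage", "audience_segment", "audienceName"]
invalid = [name for name in requested if name not in valid_dimensions]
if invalid:
    print(f"Rejecting dimensions not in this property's schema: {invalid}")
```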

โŒ Failed Model

Aurora Alpha (Failed)

Cost: N/A | Runs: 0/3 | Provider: Stealth release on OpenRouter (unknown backing provider)

Aurora Alpha failed all 3 attempts before generating a single response. The system prompt (~126K tokens) plus the query exceeded its 128K context window limit.

"Context limit exceeded: 128000 tokens vs ~144K tokens required"

Why this matters: Aurora Alpha appeared on OpenRouter as a stealth release with no publicly identified backing company and limited documentation. That lack of transparency made it impossible to verify its stated capabilities or troubleshoot the failure. And while a 128K context window technically accommodates the ~126K-token system prompt, it leaves essentially no headroom for the query or subsequent conversation -- a fundamental limitation for any multi-turn analytics workflow.

Lesson for AI product builders: Always test context window limits under realistic conditions. A model that advertises 128K context but can't handle a typical analytics system prompt is not viable for production use, regardless of how well it performs on smaller prompts.
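
A pre-flight check would have caught this before the first call. The sketch below uses tiktoken purely as a rough estimator (every model ships its own tokenizer, so the count is approximate) and compares the prompt against the advertised context window, leaving headroom for a response:

```python
import tiktoken  # rough estimate only -- each model's own tokenizer will differ

ADVERTISED_CONTEXT = 128_000  # Aurora Alpha's stated limit
RESPONSE_HEADROOM = 8_000     # leave room for the model's answer

encoding = tiktoken.get_encoding("cl100k_base")

def fits_in_context(system_prompt: str, user_query: str) -> bool:
    """Return True if the prompt plausibly fits, with headroom for a response."""
    prompt_tokens = len(encoding.encode(system_prompt)) + len(encoding.encode(user_query))
    budget = ADVERTISED_CONTEXT - RESPONSE_HEADROOM
    print(f"~{prompt_tokens:,} prompt tokens vs {budget:,} usable")
    return prompt_tokens <= budget
```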

What This Tells Us

The Cost Efficiency Revolution

Metric | Winner | Runner-Up
Fastest | MiniMax M2.5 (70s) | Qwen3 Max Thinking (89s)
Cheapest | MiniMax M2.5 ($0.06) | Kimi K2.5 ($0.07)
Best Value | MiniMax M2.5 ($0.0003/1K) | Kimi K2.5 ($0.0005/1K)
Most Thorough | Claude Opus 4.6 | GLM 5
Most Consistent | Kimi K2.5 (34% spread) | Claude Opus 4.6 (48% spread)
Unique Insight | Kimi K2.5 (98.5% engagement) | GLM 5 (conversion rates)

Round 1 vs Round 2

Aspect | Round 1 | Round 2
Models | 10 (established players) | 6 (newer/niche models)
Runs | 1 per model | 3 per model
Top performer | Claude Opus 4.5 ($1.30) | MiniMax M2.5 ($0.06)
Quality spread | 30% excellent, 40% diagnostic, 30% hallucinated | 83% excellent, 17% failed
Accuracy range | 75-100 | 96-100 (excluding failures)
Key finding | Data quality judgment separates models | Cost efficiency doesn't sacrifice quality

The analytics AI market is maturing fast. In Round 1, model quality varied dramatically -- from fabricating data to providing genius-level analysis. In Round 2, quality has largely converged at the top. The differentiation is now cost, speed, and consistency.

A model that costs $0.06 and delivers excellent results 3 out of 3 times changes the economics of AI-powered analytics entirely.

See the combined leaderboard
16 models tested across 2 rounds. Find the right AI model for your analytics needs.

Methodology

Test Setup:

  • Same GA4 property as Round 1 with broken attribution tracking
  • 100% of sessions showing "(not set)" for source/medium
  • Valid conversion events that couldn't be attributed to channels
  • Standard marketing ROI question
  • 3 runs per model to test consistency

Models Tested:

  • MiniMax: MiniMax M2.5
  • MoonshotAI: Kimi K2.5
  • Anthropic: Claude Opus 4.6
  • Z.ai: GLM 5
  • Qwen: Qwen3 Max Thinking
  • Aurora Alpha (stealth OpenRouter release, unknown provider)

Evaluation Criteria:

  • Quality rating (excellent/good/fair/poor) across all 3 runs
  • Accuracy score (GA4 field name accuracy, 0-100)
  • Run time consistency (variance across 3 runs)
  • Cost efficiency ($/1K tokens)
  • Analytical depth and actionability
  • Data quality issue detection and handling

Total benchmark cost: $2.08 across 18 runs (6 models x 3 runs)


This is Round 2 of our ongoing LLM analytics benchmark series conducted using the Anamap AI analytics library. See Round 1: 10 Models on Broken Data for the original benchmark, or visit the combined leaderboard for all results across rounds.

Frequently Asked Questions

Which AI model is best for marketing analytics on a budget?

Based on Round 2 results, MiniMax M2.5 is the best budget option at $0.06 total across 3 runs ($0.02/query). It delivered excellent quality, perfect accuracy scores (100/100), and was the fastest model at 70 seconds average. At $0.0003 per 1K tokens, it's 19x cheaper than Claude Opus 4.6.

How consistent are LLM analytics results?

Remarkably consistent for quality, but variable for speed. In our 15 successful runs across 5 models, 14 out of 15 (93%) achieved "excellent" quality. However, execution time varied significantly -- GLM 5 ranged from 145s to 275s across its 3 runs. When choosing a model, expect the quality to be reliable but plan for timing variance.

Is Claude Opus 4.6 worth the premium price?

It depends on the stakes. Claude Opus 4.6 ($1.35) delivered the most comprehensive analysis, investigating multiple angles with 4 data requests per run. But MiniMax M2.5 ($0.06) achieved the same "excellent" quality rating at 1/22nd the cost. For routine queries, the cheaper option wins. For high-stakes strategic decisions where depth matters, Claude's thoroughness justifies the premium.

What happened to Aurora Alpha?

Aurora Alpha is a stealth release on OpenRouter with no publicly identified backing company. It failed all 3 runs because its 128K context window couldn't accommodate the system prompt (~126K tokens) plus the query. This highlights the importance of testing real-world context requirements before adopting new models, especially those with limited public documentation.

How does Round 2 compare to Round 1?

Round 1 tested 10 established models (Claude, GPT-5, Gemini, Grok, DeepSeek) with a single run each. Quality varied dramatically: 30% delivered value, 40% stopped at diagnosis, 30% hallucinated. Round 2 tested 6 newer models 3 times each. Quality was much more consistent: 83% excellent. The key shift is that cost efficiency no longer means sacrificing quality.

Should I use Chinese AI models for analytics?

Three of our top 5 models were from Chinese AI labs: MiniMax M2.5 (#1), Kimi K2.5 (#2), and Qwen3 Max Thinking (#5). They delivered excellent quality at competitive prices. For analytics tasks that don't involve sensitive data, these models offer outstanding value. Consider data residency requirements if your organization has compliance constraints.



ABOUT THE AUTHOR

Alex Schlee

Founder & CEO

Alex Schlee is the founder of Anamap and has experience spanning the full gamut of analytics from implementation engineering to warehousing and insight generation. He's a great person to connect with about anything related to analytics or technology.