same input. N models. real data.

Test every AI model
against your real use case.

Define your test once — prompts, voices, audio files — and run against every model in parallel. Compare side by side with latency, accuracy, cost and an automatic ranking.

arena · my-test · LLM
Model 1 · Provider 1
The product has mixed reviews. Users praise battery life and build quality, but flag Bluetooth inconsistency as a recurring issue.
latency 1.2s · tokens 312 · cost $0.0014
Model 2 · Provider 2
Reviews are split. Strong consensus on durability and long battery. Main friction point: Bluetooth drops when moving between rooms.
latency 0.9s · tokens 287 · cost $0.0011
Model 3 · Provider 3
Positive sentiment on hardware quality and battery. Negative pattern: wireless connectivity issues mentioned in 23% of reviews.
latency 0.7s · tokens 298 · cost $0.0003

Benchmarks test generic data.
Your product isn't generic.

The model that tops MMLU or HumanEval may be the worst choice for classifying your support tickets, transcribing your audio, or reading your domain-specific documents.

Public benchmarks
  • Generic datasets, generic prompts
  • No cost data for your volume
  • No latency for your region
  • Can't test your system prompt
  • No audio or voice evaluation
x-n.dev arenas
  • Your prompts, your scenarios, your data
  • Real cost with your API keys, no markup
  • Latency measured on actual runs
  • Test any system prompt before shipping
  • LLM, STT and TTS — same platform

One setup.
Every model.

Three steps that keep your tests structured, repeatable, and shareable — for LLM, STT and TTS benchmarks.

01
Providers
Choose which providers are available for comparing LLM, STT and TTS.
provider layer
02
Scenarios
The full test setup. Define input data, prompts, expected output, language and voice settings in one place — or generate test items with AI from a short description. Create once and reuse across multiple arenas.
test setup
03
Arenas
The comparison layer. Pick a Scenario, connect N models from OpenAI, Anthropic, Google, AWS, Azure and more. Run in parallel. Results side by side with accuracy, latency, cost and an automatic ranking.
comparison layer

Your test setup.
Once. Reuse everywhere.

Define prompts, test data, expected outputs, audio files or voice settings in one structured Scenario — then run it across any arena. Or describe what you need and let AI generate test items for you.

scenario · my-test
1 · Prompts & type
2 · Test items
LLM · Summary
System prompt
You summarize product reviews into one concise paragraph. Focus on recurring themes.
User prompt
Summarize these reviews: {{reviews}}
Test items · 3 items
Create with AI
Describe what you want

Customer support messages in English — mix frustrated, neutral and polite tones. Different issues: late delivery, wrong item, refund…

1. Battery lasts two days. Bluetooth drops in the kitchen.
   expected: mixed · battery good · BT issues
2. Great build quality. App sync is unreliable on Android.
   expected: positive hardware · app friction
3. Comfortable fit. Noise cancelling works well on flights.
   expected: comfort + ANC praised
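
As a rough illustration of how a `{{reviews}}` placeholder could be filled from each test item, here is a minimal Python sketch. The `render_prompt` helper is hypothetical and is not part of x-n.dev; how the platform substitutes variables internally is an assumption.

```python
import re

def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each {{name}} placeholder with its value; fail on unknown names."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for placeholder {name!r}")
        return variables[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# Prompt and test items copied from the scenario shown above.
USER_PROMPT = "Summarize these reviews: {{reviews}}"
TEST_ITEMS = [
    "Battery lasts two days. Bluetooth drops in the kitchen.",
    "Great build quality. App sync is unreliable on Android.",
    "Comfortable fit. Noise cancelling works well on flights.",
]

# One rendered prompt per test item; each would be sent to every model.
prompts = [render_prompt(USER_PROMPT, {"reviews": item}) for item in TEST_ITEMS]
print(prompts[0])
```

Failing loudly on an unknown placeholder is a design choice in this sketch: a silently unfilled `{{...}}` would skew every model's output in the same run.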

Pick N models.
Run in parallel.

Connect providers from OpenAI, Anthropic, Google, Deepgram, ElevenLabs and more. Select multiple models per provider and run them all at once.

arena · select models
Provider 1
2/4
Model 1
Model 2
Model 3
Model 4
Provider 2
1/2
Model 1
Model 2
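
To make "run in parallel" concrete, the sketch below fans one prompt out to several models with `asyncio.gather`. It is an illustration only: `call_model` is a hypothetical stand-in that sleeps instead of calling a provider, and the model names and delays are made up.

```python
import asyncio
import time

async def call_model(model: str, prompt: str, simulated_delay: float) -> dict:
    """Hypothetical stand-in for a provider call; sleeps instead of hitting an API."""
    started = time.perf_counter()
    await asyncio.sleep(simulated_delay)
    return {
        "model": model,
        "output": f"[{model}] response to: {prompt[:24]}",
        "latency_s": time.perf_counter() - started,
    }

async def run_arena(prompt: str, models: dict[str, float]) -> list[dict]:
    """Start every model call at once and wait for all of them to finish."""
    tasks = [call_model(name, prompt, delay) for name, delay in models.items()]
    # gather returns results in the same order the tasks were passed in.
    return await asyncio.gather(*tasks)

# Illustrative model names mapped to simulated response times in seconds.
MODELS = {"model-1": 0.20, "model-2": 0.15, "model-3": 0.10}

started = time.perf_counter()
results = asyncio.run(run_arena("Summarize these reviews: ...", MODELS))
wall_time = time.perf_counter() - started

for result in results:
    print(result["model"], f"{result['latency_s']:.2f}s")
print(f"total wall time: {wall_time:.2f}s")
```

Because the calls overlap, the total wall time tracks the slowest model rather than the sum of all latencies.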

Decide with data.
Not gut feeling.

Every run gets a composite score from accuracy, speed and cost. Favorite outputs, copy everything, or export a report on paid plans.

arena · my-test · run
Side-by-side outputs · completed
Model 2

Reviews are split. Strong consensus on durability…

Model 1

The product has mixed reviews. Users praise…

Model 3

Positive sentiment on hardware quality…

report · ranking
Model ranking
# · Model · Provider · Accuracy · Latency · Cost · Score
🥇 · Model 2 · Provider 2 · 97% · 0.9s · $0.0011 · 92
🥈 · Model 1 · Provider 1 · 94% · 1.2s · $0.0014 · 87
🥉 · Model 3 · Provider 3 · 91% · 0.7s · $0.0003 · 81
Score = 50% accuracy · 25% speed · 25% cost
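
The weighting above can be sketched as a plain weighted sum. The page does not say how speed and cost are normalized, so this sketch assumes each sub-score has already been mapped to a 0 to 100 scale where higher is better. The function name and the sample sub-scores are hypothetical, and the result is not meant to reproduce the scores shown in the ranking.

```python
# Weights taken from the formula above: 50% accuracy, 25% speed, 25% cost.
WEIGHTS = {"accuracy": 0.50, "speed": 0.25, "cost": 0.25}

def composite_score(accuracy: float, speed: float, cost: float) -> float:
    """Weighted sum of three sub-scores, each assumed to be on a 0-100 scale."""
    sub_scores = {"accuracy": accuracy, "speed": speed, "cost": cost}
    for name, value in sub_scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} sub-score must be between 0 and 100")
    return sum(WEIGHTS[name] * value for name, value in sub_scores.items())

# Hypothetical sub-scores: 0.5 * 97 + 0.25 * 80 + 0.25 * 90
score = composite_score(accuracy=97, speed=80, cost=90)
print(score)
```

Under this weighting, a one-point gain in accuracy moves the score as much as a two-point gain in speed or cost.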

See who's winning.
In the real world.

Live rankings from anonymized arena runs across x-n.dev — LLM, STT and TTS. Not vendor benchmarks. Recomputed daily from real evaluations.

leaderboard · llm · all time
# · Provider · Model · Win rate · Runs
🥇 · Provider 1 · Model 1 · 94% · 1.2k
🥈 · Provider 2 · Model 2 · 91% · 980
🥉 · Provider 3 · Model 3 · 88% · 740
Min. 10 runs to qualify · recomputed daily at midnight UTC
  • Aggregated from real arena evaluations, not synthetic benchmarks
  • Separate rankings for LLM, STT and TTS providers
  • Updated daily — watch models move as more runs come in
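
A leaderboard with a 10-run qualification threshold could be aggregated roughly as follows. This is a sketch under assumptions: the page does not define a "win", so here a win means the model ranked first in an arena run, and the record format and model names are invented for illustration.

```python
from collections import defaultdict

MIN_RUNS = 10  # qualification threshold stated on the leaderboard

def leaderboard(records: list[tuple[str, bool]]) -> list[tuple[str, float, int]]:
    """Aggregate (model, won) records into (model, win_rate, runs) rows."""
    runs: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for model, won in records:
        runs[model] += 1
        wins[model] += int(won)
    # Drop models below the threshold, then sort by win rate, best first.
    rows = [(m, wins[m] / n, n) for m, n in runs.items() if n >= MIN_RUNS]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Illustrative data: model-c wins every run but has too few runs to qualify.
records = (
    [("model-a", True)] * 9 + [("model-a", False)] * 3
    + [("model-b", True)] * 5 + [("model-b", False)] * 5
    + [("model-c", True)] * 4
)
for model, win_rate, run_count in leaderboard(records):
    print(f"{model}: {win_rate:.0%} over {run_count} runs")
```

The threshold keeps a model with a handful of lucky runs from outranking models with a longer track record.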

Built for the moment
before you commit to a model.

From voice AI builders to procurement — whoever owns the model decision uses x-n.dev as their testing bench.

  • Voice AI builders
    TTS, STT and LLM are the core of your product. Compare every provider on your real calls — latency, accuracy and cost — and pick the best fit for each task.
  • Product engineers & developers
    You own the model decision. Benchmark cost, latency and quality across providers on your actual prompts — before it ships to production.
  • AI agencies & consultancies
    Run multi-client evaluations with each client’s own keys and scenarios. Deliver vendor-neutral recommendations backed by data and shareable reports.
  • Procurement & AI ops
    Enterprise, compliance-driven model selection. Full run history, exports and an objective score — accuracy, latency, cost and ROI — to justify every choice.
x-n.dev · differentiators
your keys
zero markup, direct to provider
reusable
shareable scenarios
parallel
all models run simultaneously, no waiting
traceable
full run history and exports
objective
accuracy · latency · cost · ranking

Most tools compare LLMs.
What about when your product speaks or listens?

x-n.dev is one of the few platforms that evaluates voice AI providers with the same structure and rigor as language models.

LLM
OpenAI · Anthropic · Google · Mistral · DeepSeek · Cohere · Meta · Groq · xAI · + MORE MODELS
STT
OpenAI Whisper · AssemblyAI · Deepgram · Gladia · Speechmatics · AWS Transcribe · + MORE MODELS
TTS
ElevenLabs · Cartesia · OpenAI TTS · Deepgram · LMNT · Hume AI · Rime AI · + MORE MODELS

Your model.
Your rules. Your data.

Connect any OpenAI-compatible endpoint — Ollama, vLLM, LM Studio, llama.cpp — and benchmark self-hosted models against cloud APIs in the same arena.

Ollama · vLLM · LM Studio · llama.cpp
Excluded from public leaderboard
Expose locally via ngrok or Cloudflare Tunnel
providers · settings
Custom (self-hosted)

Base URL

https://my-llama.ngrok.io/v1

Must be publicly reachable. Use ngrok or Cloudflare Tunnel for local models.

API Key (optional)

••••••••••••
Enter any model ID in arenas — llama3.2, mistral, qwen2.5-coder, etc.
Save

The right tool for
the right job.

Different tools solve different problems. Here's where x-n.dev fits.

Compared: x-n.dev · Public leaderboards · Eval frameworks · LLM observability
Criteria:
  • Your prompts & data (~ for one alternative)
  • STT + TTS + LLM
  • No code, no setup
  • Team workspace
  • Real cost (your API key) (~ for one alternative)
  • Pre-production evaluation

~ = partially supported or requires significant setup

Pick a plan.
Pay for what you run.

Bring your own API keys — we never touch your provider bill. Pick the plan that fits your team and top up runs when you need more.

Trial
$0/month
No credit card required
Explore without a credit card. You get 100 runs to see the value, and the trial ends only when they are used up.
100 runs, no expiry · $0.12 per extra run
  • Up to 3 active scenarios
  • Up to 3 active arenas
  • 1 simultaneous arena
  • 1 user
  • All providers — LLM, STT & TTS
  • Bring your own API keys
  • Audio retained for 7 days
  • Latency, cost & token metrics
  • Multi-org workspaces
Get started
Team
$79/month
For companies and product teams evaluating models before production decisions.
1,000 runs / month · $0.05 per extra run
  • Up to 50 active scenarios
  • Up to 30 active arenas
  • Up to 5 simultaneous arenas
  • Up to 10 users
  • Parallel execution — all models run simultaneously
  • Side-by-side comparison
  • Audio retained for 90 days
  • Results export
  • Unlimited history
  • All providers — LLM, STT & TTS
  • Bring your own API keys
  • Multi-org workspaces
Subscribe to Team
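
As a worked example of the Team plan arithmetic, the sketch below estimates one month's platform bill from the listed figures ($79 base, 1,000 included runs, $0.05 per extra run). It assumes extra runs are billed per run beyond the monthly allotment. The function name is hypothetical, and provider costs are excluded because they are paid through your own API keys.

```python
TEAM_BASE_FEE = 79.00      # USD per month, from the plan above
TEAM_INCLUDED_RUNS = 1000  # runs included per month
TEAM_EXTRA_RUN_FEE = 0.05  # USD per run beyond the included allotment

def team_monthly_cost(runs_used: int) -> float:
    """Estimated Team plan bill for one month, excluding provider costs."""
    if runs_used < 0:
        raise ValueError("runs_used must not be negative")
    extra_runs = max(0, runs_used - TEAM_INCLUDED_RUNS)
    return round(TEAM_BASE_FEE + extra_runs * TEAM_EXTRA_RUN_FEE, 2)

print(team_monthly_cost(800))   # within the allotment: base fee only
print(team_monthly_cost(1400))  # 400 extra runs: 79 + 400 * 0.05
```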

For Enterprise without BYOK — Contact us

AI agencies / consultancies — Contact us

Common questions.

Your API keys are stored encrypted and are only used to make requests on your behalf during an Arena run. They are never logged, shared, or used for anything else. You can revoke them at any time from your settings.
Model costs go directly to providers via your own API keys — OpenAI, Anthropic, Google, etc. x-n.dev never adds markup on model usage. You pay us for the platform via a monthly plan (Trial, Pro or Team), which includes a set number of runs. Need more? Top up and keep going.
25+ providers across LLM, STT and TTS — including OpenAI, Anthropic, Google, Mistral, AssemblyAI, Deepgram, ElevenLabs, Cartesia, and more. New providers are added continuously. If yours isn't listed, reach out and we'll prioritize it.
Scenarios are shared across your team: they belong to the workspace and are accessible to all members. Everyone runs tests against the same data, prompts and model settings, giving you consistent and comparable results across the team.
x-n.dev is not currently open source. The platform is a hosted product. We may open source specific components or SDKs in the future. If this matters to you, drop us a note. Community interest shapes our roadmap.
x-n.dev handles the parallelization, latency tracking, token counting, cost calculation, and result storage so you don't have to. You still have full control over prompts and configs, but without the boilerplate. It's structured and repeatable by default, and shareable without extra tooling.
Runs that end in a refundable failure or an empty output are automatically returned to your balance. Runs are not refunded when the provider rejects them because of account limits, quota, rate limits, or models that are unavailable or not enabled for your account.

Start comparing.
Stop guessing.

Bring your own API keys and run your first arena in minutes.

OpenAI
Anthropic
Google
Mistral
xAI
+ MORE MODELS