Can I compare ElevenLabs, Cartesia and Rime TTS providers side by side?

Yes. x-n.dev supports ElevenLabs, Cartesia, Rime, OpenAI TTS, Azure, AWS Polly and more. Create a Scenario with your text, connect your API keys for each provider, and run them all in parallel. Results show audio playback, latency, and cost per character side by side.

How do I benchmark Deepgram vs AssemblyAI vs OpenAI Whisper on my audio?

Upload your audio file as a Scenario, connect API keys for Deepgram, AssemblyAI, OpenAI and any other STT provider, then run an Arena. You get transcripts, accuracy comparison, latency and cost per minute for all providers in one run.

Does x-n.dev support bring your own API keys (BYOK)?

Yes — BYOK is the core model. You connect your own API keys from OpenAI, Anthropic, Google, ElevenLabs, Deepgram and any supported provider. x-n.dev uses them to make requests on your behalf. Model costs go directly to providers at their listed rates — we add zero markup.

Can AI agencies and consultancies use x-n.dev for client work?

Yes — x-n.dev is built for multi-client evaluation. Run benchmarks per client with their own scenarios and API keys, keep each engagement in a separate workspace, and export shareable reports. AI agencies and consultancies use x-n.dev to justify model choices with data, compare LLM, STT and TTS providers across projects, and deliver vendor-neutral recommendations clients can trust.

same input. N models. real data.

Test every AI model
against your real use case.

Define your test once — prompts, voices, audio files — and run against every model in parallel. Compare side by side with latency, accuracy, cost and an automatic ranking.

Start 7 days free See how it works

arena · my-test · LLM

Model 1Provider 1

The product has mixed reviews. Users praise battery life and build quality, but flag Bluetooth inconsistency as a recurring issue.

latency1.2s

tokens312

cost$0.0014

Model 2Provider 2

Reviews are split. Strong consensus on durability and long battery. Main friction point: Bluetooth drops when moving between rooms.

latency0.9s

tokens287

cost$0.0011

Model 3Provider 3

Positive sentiment on hardware quality and battery. Negative pattern: wireless connectivity issues mentioned in 23% of reviews.

latency0.7s

tokens298

cost$0.0003

Why not just use public benchmarks?

Benchmarks test generic data.
Your product isn't generic.

The model that tops MMLU or HumanEval may be the worst choice for classifying your support tickets, transcribing your audio, or reading your domain-specific documents.

Public benchmarks

Generic datasets, generic prompts
No cost data for your volume
No latency for your region
Can't test your system prompt
No audio or voice evaluation

x-n.dev arenas

Your prompts, your scenarios, your data
Real cost with your API keys, no markup
Latency measured on actual runs
Test any system prompt before shipping
LLM, STT and TTS — same platform

How it works

One setup.
Every model.

Three steps that keep your tests structured, repeatable, and shareable — for LLM, STT and TTS benchmarks.

Providers

Choose which providers are available for comparing LLM, STT and TTS.

provider layer

Scenarios

The full test setup. Define input data, prompts, expected output, language and voice settings in one place — or generate test items with AI from a short description. Create once and reuse across multiple arenas.

test setup

Arenas

The comparison layer. Pick a Scenario, connect N models from OpenAI, Anthropic, Google, AWS, Azure and more. Run in parallel. Results side by side with accuracy, latency, cost and an automatic ranking.

comparison layer

Scenarios

Your test setup.
Once. Reuse everywhere.

Define prompts, test data, expected outputs, audio files or voice settings in one structured Scenario — then run it across any arena. Or describe what you need and let AI generate test items for you.

scenario · my-test

LLMSummary

System prompt

You summarize product reviews into one concise paragraph. Focus on recurring themes.

User prompt

Summarize these reviews: {{reviews}}

Describe what you want

Customer support messages in English — mix frustrated, neutral and polite tones. Different issues: late delivery, wrong item, refund…

Battery lasts two days. Bluetooth drops in the kitchen.

expected: mixed · battery good · BT issues

Great build quality. App sync is unreliable on Android.

expected: positive hardware · app friction

Comfortable fit. Noise cancelling works well on flights.

expected: comfort + ANC praised

Arenas

Pick N models.
Run in parallel.

Connect providers from OpenAI, Anthropic, Google, Deepgram, ElevenLabs and more. Select multiple models per provider and run them all at once.

arena · select models

Provider 1

2/4

Model 1

Model 2

Model 3

Model 4

Provider 2

1/2

Model 1

Model 2

Results

Decide with data.
Not gut feeling.

Every run gets a composite score from accuracy, speed and cost. Favorite outputs, copy everything, or export a report on paid plans.

arena · my-test · run

Side-by-side outputscompleted

Model 2

Reviews are split. Strong consensus on durability…

Model 1

The product has mixed reviews. Users praise…

Model 3

Positive sentiment on hardware quality…

report · ranking

Model rankingscore

🥇

Model 2Provider 2

97%0.9s$0.0011

92♥

🥈

Model 1Provider 1

94%1.2s$0.0014

🥉

Model 3Provider 3

91%0.7s$0.0003

Score = 50% accuracy · 25% speed · 25% cost

Leaderboard

See who's winning.
In the real world.

Live rankings from anonymized arena runs across x-n.dev — LLM, STT and TTS. Not vendor benchmarks. Recomputed daily from real evaluations.

leaderboard · llm · all time

#ProviderModelWin rateRuns

🥇Provider 1Model 194%1.2k

🥈Provider 2Model 291%980

🥉Provider 3Model 388%740

Min. 10 runs to qualify · recomputed daily at midnight UTC

Aggregated from real arena evaluations, not synthetic benchmarks
Separate rankings for LLM, STT and TTS providers
Updated daily — watch models move as more runs come in

View leaderboard

Who it's for

Built for the moment
before you commit to a model.

From voice AI builders to procurement — whoever owns the model decision uses x-n.dev as their testing bench.

Voice AI builders
TTS, STT and LLM are the core of your product. Compare every provider on your real calls — latency, accuracy and cost — and pick the best fit for each task.
Product engineers & developers
You own the model decision. Benchmark cost, latency and quality across providers on your actual prompts — before it ships to production.
AI agencies & consultancies
Run multi-client evaluations with each client’s own keys and scenarios. Deliver vendor-neutral recommendations backed by data and shareable reports.
Procurement & AI ops
Enterprise, compliance-driven model selection. Full run history, exports and an objective score — accuracy, latency, cost and ROI — to justify every choice.

x-n.dev · differentiators

your keys

zero markup, direct to provider

reusable

shareable scenarios

parallel

all models run simultaneously, no waiting

traceable

full run history and exports

objective

accuracy · latency · cost · ranking

LLM, STT & TTS — all in one platform

Most tools compare LLMs.
What about when your product speaks or listens?

x-n.dev is one of the few platforms that evaluates voice AI providers with the same structure and rigor as language models.

LLM

OpenAIAnthropicGoogleMistralDeepSeekCohereMeta · GroqxAI+ MORE MODELS

STT

OpenAI WhisperAssemblyAIDeepgramGladiaSpeechmaticsAWS Transcribe+ MORE MODELS

TTS

ElevenLabsCartesiaOpenAI TTSDeepgramLMNTHume AIRime AI+ MORE MODELS

Self-hosted models

Your model.
Your rules. Your data.

Connect any OpenAI-compatible endpoint — Ollama, vLLM, LM Studio, llama.cpp — and benchmark self-hosted models against cloud APIs in the same arena.

Ollama · vLLM · LM Studio · llama.cppExcluded from public leaderboardExpose locally via ngrok or Cloudflare Tunnel

providers · settings

Custom (self-hosted)

Base URL

https://my-llama.ngrok.io/v1

Must be publicly reachable. Use ngrok or Cloudflare Tunnel for local models.

API Key (optional)

••••••••••••

Enter any model ID in arenas — llama3.2, mistral, qwen2.5-coder, etc.

Save

How it compares

The right tool for
the right job.

Different tools solve different problems. Here's where x-n.dev fits.

x-n.dev

Public leaderboards

Eval frameworks

LLM observability

Your prompts & data

✓

✗

✓

STT + TTS + LLM

✓

✗

No code, no setup

✓

✗

Team workspace

✓

✗

✓

Real cost (your API key)

✓

✗

✓

Pre-production evaluation

✓

✗

~ = partially supported or requires significant setup

Pricing

Pick a plan.
Start with 7 days free.

Bring your own API keys — we never touch your provider bill. Pick the plan that fits your team, start a 7-day trial, and keep provider usage billed directly to your accounts.

Initial

$19/month

For builders and developers who ship AI features to production.

Bring your own API keys
Up to 20 active scenarios
Up to 10 active arenas
Up to 3 simultaneous arenas
Up to 3 users
Audio retained for 30 days
Parallel execution — all models run simultaneously
Side-by-side comparison
Results export
Unlimited history
All providers — LLM, STT & TTS

Start 7 days free

Common questions.

Your API keys are stored encrypted and are only used to make requests on your behalf during an Arena run. They are never logged, shared, or used for anything else. You can revoke them at any time from your settings.

Model costs go directly to providers via your own API keys — OpenAI, Anthropic, Google, etc. x-n.dev never adds markup on model usage. You pay us for the platform subscription, and paid plans start with a 7-day Stripe-managed trial.

25+ providers across LLM, STT and TTS — including OpenAI, Anthropic, Google, Mistral, AssemblyAI, Deepgram, ElevenLabs, Cartesia, and more. New providers are added continuously. If yours isn't listed, reach out and we'll prioritize it.

Yes. Scenarios belong to the workspace and are accessible to all members. Everyone runs tests against the same data, prompts and model settings, giving you consistent and comparable results across the team.

Not currently. The platform is a hosted product. We may open source specific components or SDKs in the future. If this matters to you, drop us a note — community interest shapes our roadmap.

x-n.dev handles the parallelization, latency tracking, token counting, cost calculation, and result storage so you don't have to. You still have full control over prompts and configs, but without the boilerplate. It's structured and repeatable by default, and shareable without extra tooling.

Provider-side failures can still appear in your provider account depending on that provider's billing rules. x-n.dev does not add markup to provider usage or maintain a separate prepaid system.

7 days free

Start comparing.
Stop guessing.

Bring your own API keys and run your first arena in minutes.

Start 7 days free

OpenAI

Anthropic

Google

Mistral

xAI

+ MORE MODELS

Test every AI modelagainst your real use case.

Benchmarks test generic data.Your product isn't generic.

One setup.Every model.

Your test setup.Once. Reuse everywhere.

Pick N models.Run in parallel.

Decide with data.Not gut feeling.

See who's winning.In the real world.

Built for the momentbefore you commit to a model.

Most tools compare LLMs.What about when your product speaks or listens?

Your model.Your rules. Your data.

The right tool forthe right job.

Pick a plan.Start with 7 days free.

Common questions.

Start comparing.Stop guessing.

Test every AI model
against your real use case.

Benchmarks test generic data.
Your product isn't generic.

One setup.
Every model.

Your test setup.
Once. Reuse everywhere.

Pick N models.
Run in parallel.

Decide with data.
Not gut feeling.

See who's winning.
In the real world.

Built for the moment
before you commit to a model.

Most tools compare LLMs.
What about when your product speaks or listens?

Your model.
Your rules. Your data.

The right tool for
the right job.

Pick a plan.
Start with 7 days free.

Start comparing.
Stop guessing.