
How to transcribe speech to text

Transcribe speech to text using browser APIs, cloud services, or human transcribers. Compare accuracy, privacy, and cost trade‑offs.

By FreeToolArena Staff · Updated June 2026 · 6 min read

Speech-to-text stopped being a novelty around 2022, when OpenAI’s Whisper model hit roughly 5% word error rate on clean English audio — close to human transcriptionist accuracy. Before Whisper, automated transcription ranged from “usable with heavy cleanup” to “comedic.” Now it’s fast, free on commodity hardware, and good enough for production captions, podcast show notes, and meeting summaries. But it still fails in predictable ways — accents, overlapping speech, noise, punctuation, proper nouns — and choosing the right model and workflow for your use case matters. This guide covers the model tiers, how word error rate is measured, speaker diarization, punctuation insertion, accent handling, noise robustness, and the language-support landscape.

Whisper model tiers

OpenAI’s Whisper is the dominant open model. It ships in sizes:

Model      Params   VRAM req   Speed     Quality
tiny       39M      ~1 GB      10x       Basic, English-biased
base       74M      ~1 GB      7x        Decent for clean English
small      244M     ~2 GB      4x        Good for most content
medium     769M     ~5 GB      2x        Near-pro quality
large-v3   1550M    ~10 GB     1x        State-of-the-art accuracy
turbo      809M     ~6 GB      8x        Fast version of large-v3

The speed column is relative to the large model, not to real time: 10x means roughly ten times faster than large-v3 on the same hardware, and actual throughput depends on your GPU. Quality climbs steeply from tiny to medium, then plateaus; large is noticeably better only on difficult audio (accents, noise, music). For clean studio speech, small and medium often produce nearly identical output.
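As a rough illustration of how the table can drive model choice, the sketch below picks the highest-quality tier that fits a VRAM budget and a minimum value from the Speed column. The figures are copied from the table; the function name pick_model and the decision to rank turbo just below large-v3 are assumptions made for this example.

```python
# Tiers copied from the table above: (name, approx. VRAM in GB, Speed column).
# Ordered from lowest to highest expected quality. Placing turbo just below
# large-v3 is an assumption made for this example.
WHISPER_TIERS = [
    ("tiny", 1, 10),
    ("base", 1, 7),
    ("small", 2, 4),
    ("medium", 5, 2),
    ("turbo", 6, 8),
    ("large-v3", 10, 1),
]


def pick_model(vram_gb, min_speed=1):
    """Return the highest-quality tier that fits in vram_gb and whose
    Speed value is at least min_speed, or None if nothing qualifies."""
    candidates = [
        name
        for name, vram, speed in WHISPER_TIERS
        if vram <= vram_gb and speed >= min_speed
    ]
    return candidates[-1] if candidates else None


print(pick_model(8))                # turbo
print(pick_model(4))                # small
print(pick_model(12, min_speed=4))  # turbo
```

Treat the result as a starting point: on clean studio speech a smaller tier may be indistinguishable, so it is worth comparing output on a sample of your own audio.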

Word Error Rate (WER)

WER is the standard accuracy metric: (S + D + I) / N, where S is substitutions, D is deletions, I is insertions, and N is the total words in the reference. A 5% WER means 1 in 20 words is wrong.

Reference:     "the meeting is at three on Tuesday"
Transcription: "the meeting is at 3:00 on Tuesday"

Substitutions: "three" -> "3:00" (1)
N = 7 words
WER = 1/7 = 14.3%
(The score is technically correct, but the two sentences
 mean the same thing -- WER penalizes formatting differences.)
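The calculation above can be sketched in a few lines of Python using word-level edit distance. This is a minimal illustration, not a benchmark-grade scorer: it splits on whitespace and compares words exactly, with no case or punctuation normalization, and the function name wer is chosen for this example.

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words (starts as the empty-reference row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


score = wer("the meeting is at three on Tuesday",
            "the meeting is at 3:00 on Tuesday")
print(f"{score:.1%}")  # 14.3%
```

Because formatting differences count as errors, evaluations usually normalize both strings first (lowercasing, stripping punctuation, spelling out numbers consistently) before scoring.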

A state-of-the-art Whisper-large model achieves 5–10% WER on clean conversational English. Lower-quality source audio (phone calls, noisy rooms) pushes WER to 15–30%. For non-English and heavy accents, expect 10–40% depending on the target language’s training data volume.

Speaker diarization

Diarization is the “who spoke when” problem: splitting a transcript into labeled speaker turns. Whisper doesn’t do diarization natively; it outputs timestamped text segments and leaves speaker attribution to a separate step.

Common diarization pipelines: pyannote.audio (open source, accurate), AWS Transcribe (cloud, integrated), Deepgram (cloud, fast). They cluster voice embeddings to group similar-sounding segments together, then label them Speaker 1, Speaker 2, etc. Accuracy drops with more speakers and overlapping speech.

# With whisperx (which combines Whisper + diarization)
whisperx audio.mp3 --diarize --min_speakers 2 --max_speakers 4

# Output format:
# [00:00.00 -> 00:05.12] SPEAKER_00: Good morning everyone.
# [00:05.20 -> 00:08.45] SPEAKER_01: Thanks for joining the call.
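Once a diarizer has produced speaker turns, they still have to be merged with the transcript. One simple way to do that, sketched below, is to give each transcript segment the speaker whose turn overlaps it the most. The tuple layouts, the function name assign_speakers, and the sample turn boundaries are all hypothetical; real tools use their own data structures and often align at the word level instead.

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start, end, text) from the transcriber.
    turns:    list of (start, end, speaker) from the diarizer.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the time range shared by the segment and the turn.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled


segments = [(0.0, 5.12, "Good morning everyone."),
            (5.20, 8.45, "Thanks for joining the call.")]
# Hypothetical diarizer output for the same audio.
turns = [(0.0, 5.15, "SPEAKER_00"), (5.15, 9.0, "SPEAKER_01")]

for start, end, speaker, text in assign_speakers(segments, turns):
    print(f"[{start:05.2f} -> {end:05.2f}] {speaker}: {text}")
```

A segment that straddles two speakers gets a single label under this scheme, which is one reason overlapping speech tends to need manual review.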

Punctuation insertion

Raw speech recognition produces lowercase, punctuation-free text. A separate punctuation model adds periods, commas, capitalization, and sentence boundaries. Whisper bundles this in; older ASR engines don’t.

Punctuation is surprisingly hard because speech doesn’t have clear sentence boundaries. Speakers trail off, restart mid-thought, and use fillers (“um,” “like,” “you know”). A good punctuation model has to strike a balance: break sentences where pauses and intonation suggest a natural boundary, but avoid creating a fragment every time someone takes a breath.

Accent handling

Whisper was trained on 680,000 hours of multilingual audio including many accented English variants, which makes it relatively robust. But accuracy still drops for accents underrepresented in the training set. Typical WER penalty:

Accent                   WER increase vs. neutral American
American, British        baseline
Australian, Canadian     ~1% increase
Indian English           ~3-5% increase
Nigerian, Kenyan         ~5-8% increase
Scottish, Irish (thick)  ~8-15% increase
Heavy Chinese ESL        ~10-20% increase
Heavy Arabic ESL         ~8-15% increase

Mitigations: use the larger models (large-v3 handles accents notably better than base), give the model context via Whisper’s initial_prompt parameter (“This is a recording of Nigerian English speakers discussing medical research”), and clean the audio (denoise, normalize) before transcription.
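Here is a minimal sketch of passing context through initial_prompt with the openai-whisper Python package. The helper names and the example term list are hypothetical; load_model, transcribe, and the initial_prompt and language arguments come from that package. The whisper import is deferred so the prompt helper can be tried without the package or a model download.

```python
def build_initial_prompt(context, terms=()):
    """Combine a one-line description with names the model should spell correctly."""
    prompt = context.strip()
    if terms:
        prompt += " Terms that may appear: " + ", ".join(terms) + "."
    return prompt


def transcribe_with_context(path, context, terms=(), model_name="large-v3"):
    """Transcribe `path` with openai-whisper, passing context via initial_prompt."""
    import whisper  # imported lazily so the prompt helper works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(
        path,
        language="en",
        initial_prompt=build_initial_prompt(context, terms),
    )
    return result["text"]


# The term list is a made-up example.
prompt = build_initial_prompt(
    "This is a recording of Nigerian English speakers discussing medical research.",
    terms=["Lagos", "malaria", "randomized trial"],
)
print(prompt)
```

The prompt only nudges the decoder toward certain vocabulary and spellings. It is not an instruction the model follows, so keep it short and written in the style of the expected transcript.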

Noisy environments

Background noise is the #1 WER killer. A clean studio recording might transcribe at 4% WER; the same content with -20dB background chatter can jump to 15% WER. Mitigations:

Record clean. Close-mic, quiet room, dead-cat windscreen outdoors. Garbage in, garbage out.

Denoise pre-transcription. Tools like RNNoise (open source), Krisp, or Adobe’s Enhance Speech can clean up recorded audio. Apply conservatively — aggressive denoising can remove speech consonants and hurt transcription more than it helps.

Use voice activity detection. Split the audio into speech segments and transcribe each, skipping pure-noise regions. Reduces hallucination risk.
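To make the idea concrete, the sketch below finds speech regions with a plain energy threshold. This is a deliberately simple stand-in: production pipelines typically use a trained VAD model, and an energy gate will happily pass loud non-speech noise. The function name and every default value (frame length, threshold, merge gap) are assumptions to tune for your own audio.

```python
import math


def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.02, min_gap_ms=300):
    """Return (start_sec, end_sec) spans whose frame RMS reaches `threshold`.

    samples: mono audio as floats in [-1.0, 1.0].
    Spans separated by no more than min_gap_ms of quiet are merged.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    regions = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            continue  # quiet frame, skip it
        t0 = start / sample_rate
        t1 = (start + len(frame)) / sample_rate
        if regions and t0 - regions[-1][1] <= min_gap_ms / 1000:
            regions[-1] = (regions[-1][0], t1)  # extend the previous span
        else:
            regions.append((t0, t1))
    return regions


# Synthetic check: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence.
rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
audio = [0.0] * rate + tone + [0.0] * rate
print(speech_regions(audio, rate))
```

Each returned span can then be cut out and transcribed on its own, with its start time added back to the timestamps the transcriber returns.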

Hallucinations

Whisper sometimes invents text during silence or near-silence. A classic hallucination is “Thanks for watching!” at the end of nearly silent audio, because the training data had lots of YouTube endings. Mitigations: trim silence before transcription, use VAD (voice activity detection) to skip quiet regions, and tune no_speech_threshold so that segments the model scores as probably silent are dropped.
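A post-processing pass can catch some of what slips through. The sketch below drops segments that the model itself flagged as likely silence, plus a short list of stock phrases. It assumes segment dictionaries with the text and no_speech_prob keys that openai-whisper includes in its result; the phrase list, the 0.6 cut-off, and the sample data are illustrative choices, not recommended values.

```python
# Illustrative list only; extend it with phrases you see in your own output.
KNOWN_HALLUCINATIONS = {"thanks for watching", "thank you for watching"}


def drop_suspect_segments(segments, max_no_speech_prob=0.6):
    """Keep segments that look like real speech.

    segments: dicts with "text" and "no_speech_prob" keys, as found in the
    "segments" list of an openai-whisper transcription result.
    """
    kept = []
    for seg in segments:
        normalized = seg["text"].strip().strip(".!?").lower()
        if seg["no_speech_prob"] > max_no_speech_prob:
            continue  # the model itself thinks this is probably silence
        if normalized in KNOWN_HALLUCINATIONS:
            continue  # stock phrase that often appears over quiet audio
        kept.append(seg)
    return kept


# Made-up segments for demonstration.
sample = [
    {"text": " Good morning everyone.", "no_speech_prob": 0.02},
    {"text": " Thanks for watching!", "no_speech_prob": 0.35},
    {"text": " ...", "no_speech_prob": 0.91},
]
print([seg["text"].strip() for seg in drop_suspect_segments(sample)])
# ['Good morning everyone.']
```

A phrase filter will also delete a speaker who really does say “thanks for watching,” so it suits content where that is unlikely, or a workflow where dropped segments are logged for review.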

Language support

Whisper supports ~100 languages, with varying quality. Top-tier (near-English quality): Spanish, French, German, Mandarin Chinese, Japanese, Portuguese, Italian, Korean. Mid-tier (usable, 10–20% WER): Arabic, Hindi, Russian, Indonesian, Turkish, Polish, Dutch. Low-tier (noisy, often 30%+ WER): low-resource languages with limited training data — Swahili, Welsh, Tagalog, Yoruba.

For non-English, larger models make a bigger difference. Tiny may produce unusable output for French, while medium is excellent.

Cloud vs local

Cloud transcription (AWS, Google, Azure, Deepgram, AssemblyAI): easy API, no local GPU needed, pay per minute. Deepgram and AssemblyAI are typically fastest and most accurate for English. Privacy-sensitive content may not be appropriate for the cloud.

Local transcription: run Whisper on your own machine. Privacy-safe with no per-minute costs, but the large models need a GPU. CPU is workable for the small model on short files. For one-off transcription of personal content, local is the right default.

Timestamps and alignment

Whisper outputs per-segment timestamps by default (roughly 5–30 seconds per segment). For captions and subtitle generation, you need word-level timestamps. Tools like whisperx add forced alignment via wav2vec2, producing sub-second word-level timing needed for synced captions.

# Word-level timing via whisperx
whisperx input.mp3 --model large-v3 --output_format srt

# Produces .srt with lines like:
# 00:00:01,220 --> 00:00:03,840
# Hello and welcome to the podcast.
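If you already have timed segments and only need the caption file, the SRT format is simple enough to write directly. The sketch below is a minimal writer, assuming segments arrive as (start, end, text) tuples in seconds; the function names are chosen for this example.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(cues)  # cues are separated by a blank line


print(to_srt([(1.22, 3.84, "Hello and welcome to the podcast.")]))
```

Note the comma before the milliseconds: SRT uses a comma there, while WebVTT uses a period, so the two formats are not interchangeable without conversion.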

Common mistakes

Using the tiny model for production. WER is 15–30%+, which means real-world transcripts need heavy cleanup. Use medium or large unless speed is critical.

Trusting diarization without review. Speaker boundaries are often misplaced, especially with overlapping speech. Manually verify before shipping.

Forgetting to clean input audio. Noise reduction and normalization before transcription can halve WER. Worth the extra step.

Leaving in Whisper hallucinations. Long silences often trigger “thanks for watching” and similar spurious text. Trim silence or use VAD.

Expecting perfect proper nouns. Names, brands, and technical jargon are error-prone. Supply them via the initial_prompt parameter or do a targeted find-replace pass.

Running on CPU when a GPU is available. A 10-minute audio file might take about 2 minutes with the medium model on a GPU, versus about 20 minutes on CPU. For batch work, the GPU is worth the setup.

Ignoring language hints. Auto-detect works but adds uncertainty. Specify --language en when you know the language.

Run the numbers

Turn audio into text without installing Whisper locally using the speech-to-text tool. Pair with the voice note transcriber for short mobile recordings where speaker attribution matters less, and the audio silence remover to trim dead air before transcription to avoid hallucinations in quiet regions.
