How to convert text to speech
Convert text to speech online free, comparing browser and neural voices instantly. No sign-up to adjust SSML, rate, and pitch in your browser.
Text-to-speech went from robotic-sounding novelty to genuinely human-sounding tool between roughly 2016 and 2020, as neural TTS models (WaveNet, Tacotron, then Glow-TTS and, later, VALL-E) replaced the older concatenative and formant-synthesis approaches. The difference is striking: modern TTS is used for audiobook narration, podcast ads, IVR systems, and accessibility tools, often without listeners realizing it’s synthetic. Using TTS well, though, still takes more than copy-pasting text. This guide covers SSML markup for precise control, voice selection criteria, prosody (rate, pitch, volume), the split between the browser’s free Web Speech API and cloud TTS services, and the accessibility considerations that separate “technically reads the text” from “actually useful for screen-reader users.”
What neural TTS does differently
Old TTS pipelines concatenated recorded phonemes (tiny speech fragments) and smoothed the seams. Output sounded segmented and robotic. Neural TTS generates the waveform directly from text using deep learning — either in a two-stage pipeline (text to mel-spectrogram, then neural vocoder to waveform) or end-to-end (text straight to waveform). The result has natural prosody, breathing, and intonation.
Current state-of-the-art systems can clone a voice from 3–5 seconds of reference audio, match emotional tone, and even preserve a speaker’s accent across languages. The tradeoff is compute: the largest neural models typically need a GPU or other accelerator for real-time generation, whereas the old concatenative systems ran on phones in 2005. Lighter neural voices can run on a CPU or on-device, usually at some cost in quality.
SSML: the markup language for TTS
Speech Synthesis Markup Language (SSML) is a W3C standard that lets you control how text is rendered. It looks like HTML with TTS-specific tags.
<speak>
  <p>
    Welcome to the tutorial.
    <break time="500ms"/>
    Today we’ll cover three topics:
    SSML, <emphasis level="strong">voice selection</emphasis>,
    and <prosody rate="slow">carefully-paced narration</prosody>.
  </p>
  <p>
    The meeting starts at <say-as interpret-as="time">3:00 PM</say-as>,
    and the ID is <say-as interpret-as="characters">ABC123</say-as>.
  </p>
</speak>
Not all TTS engines support all SSML tags. AWS Polly and Google Cloud TTS support broad SSML; OpenAI’s TTS API currently supports only plain text. Check your engine’s docs before authoring SSML.
Key SSML tags
| Tag | Effect |
| --- | --- |
| `<break time="500ms"/>` | Pause for 500 milliseconds |
| `<prosody rate="slow">` | Slower speech |
| `<prosody rate="fast">` | Faster speech |
| `<prosody pitch="+3st">` | Raise pitch by 3 semitones |
| `<prosody volume="+6dB">` | Louder |
| `<emphasis level="strong">` | Emphasize words |
| `<say-as interpret-as="date">` | Read "2024-04-23" as a spoken date ("April 23") |
| `<say-as interpret-as="telephone">` | Read digits as a phone number |
| `<say-as interpret-as="characters">` | Spell out letter by letter |
| `<phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>` | Pronounce the word from an IPA transcription |
| `<sub alias="World Health Organization">WHO</sub>` | Speak the alias in place of the written text |
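Because SSML is XML, plain text needs its reserved characters escaped before it is wrapped in tags, or a stray `&` or `<` can make the engine reject the request. The following is a minimal sketch of that step. `escapeXml`, `toSsml`, and the `pauseMs` parameter are hypothetical names for illustration, not part of any TTS SDK.

```javascript
// Escape the five XML-reserved characters so user text cannot break the SSML document.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Wrap each paragraph in <p>, separated by a <break>, inside a single <speak> root.
// pauseMs is an illustrative parameter, not an engine setting.
function toSsml(paragraphs, pauseMs = 500) {
  const body = paragraphs
    .map(p => `<p>${escapeXml(p)}</p>`)
    .join(`<break time="${pauseMs}ms"/>`);
  return `<speak>${body}</speak>`;
}

console.log(toSsml(["Q&A starts at 3 PM.", "Use code <ABC123>."]));
```

Escaping has to happen before any tags are added; running `escapeXml` over finished SSML would escape the markup itself.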
Voice selection
The voice sets the personality of the output. Most cloud TTS services offer dozens of named voices per language: Amazon Polly has Matthew, Joanna, and Ivy; Google identifies its WaveNet voices by letter codes (for example, en-US-Wavenet-A); Azure has over 400 neural voices across 100+ languages.
Voice choice criteria: match the content’s formality (a news-style voice vs conversational), match the demographic you’re targeting (age, accent, gender), and test with your actual script — some voices handle long sentences better than others.
Prosody: rate, pitch, volume
Rate is set with a named level (“slow,” “medium,” “fast”) or a percentage (50%–200%, where 100% is the voice’s normal speed). Typical preferences:
| Content type | Recommended rate |
| --- | --- |
| Audiobook narration | 90–95% (slightly slow, let words land) |
| Podcast ad read | 100% (natural) |
| News / announcements | 105–110% |
| Tutorial voiceover | 90–100% |
| Accessibility (screen reader) | User preference; default 100% |
Pitch is set in semitones (typically -20st to +20st; exact limits vary by engine) or percentages. Small shifts (+/- 2 semitones) are useful to distinguish characters in a dialogue or to match a brand voice; big shifts (+/- 10st) sound cartoonish.
Volume is set in dB (typically -40dB to +6dB) or named levels. It is rarely needed; normalize in post-production instead.
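To get a feel for what these units mean, it can help to convert them to plain numbers. A semitone shift corresponds to a frequency ratio of 2^(n/12), and a percentage rate maps onto the multiplier form used by properties such as `SpeechSynthesisUtterance.rate`. The helper names below are hypothetical, and the sketch assumes "95%" means 95% of normal speed, as in the table above.

```javascript
// Convert a pitch shift in semitones to a frequency ratio (12 semitones = one octave = 2x).
function semitonesToRatio(semitones) {
  return Math.pow(2, semitones / 12);
}

// Convert a percentage rate such as "95%" to a plain multiplier such as 0.95.
function percentToRate(percent) {
  const value = parseFloat(percent); // parseFloat ignores the trailing "%"
  if (Number.isNaN(value) || value <= 0) {
    throw new RangeError(`Invalid rate: ${percent}`);
  }
  return value / 100;
}

console.log(semitonesToRatio(2).toFixed(3));  // a small shift
console.log(semitonesToRatio(10).toFixed(3)); // a large shift
console.log(percentToRate("95%"));
```

A 2-semitone shift changes the frequency by roughly 12%, while 10 semitones is most of an octave, which is why large shifts stop sounding like the same speaker.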
Web Speech API (browser-native)
Browsers include a free TTS engine via the SpeechSynthesis API. There is no API key and no per-character cost, and locally installed voices work offline (some browser-supplied voices, such as Chrome’s Google voices, are synthesized remotely and need a connection). Quality varies dramatically by OS: macOS and iOS ship Apple’s high-quality voices; Windows ships decent voices; Linux often has only basic eSpeak voices.
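const utter = new SpeechSynthesisUtterance("Hello, world.");
utter.rate = 1.0;   // 0.1 to 10; 1.0 is normal speed
utter.pitch = 1.0;  // 0 to 2; 1.0 is the default
utter.volume = 1.0; // 0 to 1
// getVoices() can return an empty list until the "voiceschanged" event fires,
// and "Samantha" ships only on Apple platforms, so fall back to the default voice.
const voice = speechSynthesis.getVoices()
  .find(v => v.name.includes("Samantha"));
if (voice) utter.voice = voice;
speechSynthesis.speak(utter);
The Web Speech API has no SSML support in current browsers. You can control rate, pitch, and volume per utterance, but not mid-sentence emphasis or pauses. For richer control, use cloud TTS.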
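One possible workaround for the lack of pauses is to split the text into sentences and speak them as separate utterances, with a timer between them. The sketch below assumes a browser for `speakWithPauses`; `splitSentences` is a deliberately naive, hypothetical helper that will also split on decimals and abbreviations such as "3.5" or "Dr.".

```javascript
// Split text into sentence-sized chunks at runs of ., !, or ?.
function splitSentences(text) {
  const matches = text.match(/[^.!?]+[.!?]*/g) || [];
  return matches.map(s => s.trim()).filter(s => s.length > 0);
}

// Browser-only: speak the chunks one at a time, waiting pauseMs between them.
// Relies on the standard "end" event of SpeechSynthesisUtterance.
function speakWithPauses(text, pauseMs = 400, rate = 1.0) {
  const chunks = splitSentences(text);
  const next = () => {
    if (chunks.length === 0) return;
    const utter = new SpeechSynthesisUtterance(chunks.shift());
    utter.rate = rate;
    utter.onend = () => setTimeout(next, pauseMs);
    speechSynthesis.speak(utter);
  };
  next();
}

console.log(splitSentences("Welcome to the tutorial. Ready? Let's begin!"));
```

In a page, `speakWithPauses("First point. Second point.", 600)` would leave about 600 ms between the two sentences. This controls pauses only; emphasis within a sentence still needs a cloud engine with SSML.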
Cloud TTS comparison
| Provider | Voices | Neural | SSML | Price (per 1M chars) |
| --- | --- | --- | --- | --- |
| AWS Polly | 60+ | Yes | Full | $4 standard, $16 neural |
| Google Cloud | 220+ | Yes | Full | $4–$16 depending on tier |
| Azure | 400+ | Yes | Full | $4–$16 |
| ElevenLabs | Dozens | Yes | Some | $5–$30 |
| OpenAI TTS | 6 | Yes | None | $15 |
For long-form content (audiobooks, podcast production) the cost matters; at $4–$16 per million characters, a 50,000-character chapter is ~$0.20–$0.80. For real-time applications (phone systems, games), latency matters more. ElevenLabs and Azure are the common choices for expressive narration; AWS and Google for high-volume IVR.
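The per-chapter figure is simple arithmetic: characters divided by one million, times the per-million price. A small sketch, using the $4 and $16 price points from the comparison table as inputs (`estimateCostUsd` is a hypothetical helper, and real bills depend on each provider's current pricing and free tiers):

```javascript
// Estimate synthesis cost in USD from a character count and a per-million-character price.
function estimateCostUsd(charCount, pricePerMillionUsd) {
  return (charCount / 1000000) * pricePerMillionUsd;
}

const chapterChars = 50000;
for (const price of [4, 16]) {
  const cost = estimateCostUsd(chapterChars, price).toFixed(2);
  console.log(`$${price} per 1M chars -> $${cost} per chapter`);
}
```

Multiplying by the number of chapters gives a rough budget for a whole book before committing to a voice tier.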
Pronunciation control
TTS engines mispronounce unusual words, brand names, and technical terms. Fixes:
Spell it phonetically in the text. For example, write “kubernetes” as “koo-bur-NET-eez” in the script. Works but looks odd to editors.
Use SSML phoneme tags. <phoneme alphabet="ipa" ph="ˌkuːbərˈnɛtiːz">kubernetes</phoneme>. Precise but requires IPA knowledge.
Define a custom lexicon. AWS Polly, for example, lets you upload a pronunciation lexicon file and reference it by name in each request; check your engine’s docs for its equivalent. This is best for brand names used across many scripts.
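Where an engine has no lexicon feature, a similar effect can be approximated on the client by rewriting known terms before sending the script. The sketch below wraps each term in the `<sub alias>` tag shown earlier. `applyLexicon` and the sample entries are hypothetical; matching is case-sensitive and whole-word, and the sketch assumes aliases contain no quotes or angle brackets.

```javascript
// Escape regex metacharacters so lexicon keys are matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every whole-word occurrence of a lexicon term in <sub alias="...">.
// A single pass over the text avoids re-matching inside aliases already inserted.
function applyLexicon(text, lexicon) {
  const terms = Object.keys(lexicon).map(escapeRegExp);
  if (terms.length === 0) return text;
  const pattern = new RegExp(`\\b(?:${terms.join("|")})\\b`, "g");
  return text.replace(pattern, term => `<sub alias="${lexicon[term]}">${term}</sub>`);
}

const lexicon = {
  Kubernetes: "koo-bur-NET-eez",
  WHO: "World Health Organization",
};
console.log(applyLexicon("Kubernetes is used at WHO.", lexicon));
```

Case-sensitive matching is a deliberate choice here: it keeps an acronym entry such as WHO from rewriting the ordinary word "who".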
Audio output format
Cloud TTS typically offers MP3 (good default, small file, universal support), WAV or PCM (lossless, large, good for further editing), OGG (smaller than MP3, less universal), or a streaming format for real-time playback.
For a final podcast or video deliverable, request WAV or 320kbps MP3, apply any post-processing (compression, EQ, loudness normalization to -16 LUFS), then export to the final format. Don’t use the raw TTS MP3 as-is; post-processing makes it sound more professional.
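The loudness step reduces to two small formulas: the gain in dB is the target minus the measured loudness, and the linear factor applied to each sample is 10^(dB/20). The sketch below assumes the integrated loudness has already been measured with a loudness meter. The -21.5 LUFS reading and the function names are made up for illustration.

```javascript
// Gain in dB needed to move a measured integrated loudness to the target.
function gainToTargetDb(measuredLufs, targetLufs = -16) {
  return targetLufs - measuredLufs;
}

// Convert a dB gain to the linear factor applied to each sample.
function dbToLinear(db) {
  return Math.pow(10, db / 20);
}

const measured = -21.5; // hypothetical reading from a loudness meter
const gainDb = gainToTargetDb(measured);
console.log(`Apply ${gainDb.toFixed(1)} dB (x${dbToLinear(gainDb).toFixed(3)})`);
```

A plain gain like this can push peaks into clipping, which is one reason editors usually pair normalization with a limiter.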
Accessibility considerations
Screen-reader users listen to TTS output for hours a day. A few rules for TTS-accessible content:
Respect the user’s chosen voice and rate; don’t hardcode a fast rate “to save time.” With practice, screen-reader users often listen at 300+ WPM, but that speed is theirs to choose.
Provide punctuation that TTS engines interpret correctly. A dash creates a pause; parentheses group phrases; sentence-ending punctuation followed by a space allows a natural breath. Avoid Unicode decorations and special characters that engines may verbalize literally (“star”, “black-small-square”).
For accessibility-focused apps, offer a voice-selection UI rather than hardcoding one voice.
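Decorative characters can be filtered out before text reaches the engine. The sketch below removes everything in the Unicode "Symbol, other" category, which covers characters such as ★ and ▪ and many emoji. This is a blunt, hypothetical filter: the same category also includes meaningful signs such as © and °, so a production version would likely need an allowlist.

```javascript
// Remove "Symbol, other" characters and collapse the whitespace they leave behind.
function stripDecorations(text) {
  return text
    .replace(/\p{So}/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
}

console.log(stripDecorations("★ New feature ▪ faster export ★"));
```

Applying the filter only to the string passed to the speech engine, not to the visible text, keeps the on-screen design unchanged.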
Common mistakes
Assuming all engines support SSML. OpenAI TTS ignores SSML entirely. Test your markup against your actual engine.
Using the default voice without auditioning alternatives. Voice choice dramatically changes perceived quality. Compare three or four on the same script before committing.
Speaking too fast for long-form content. Audiobook narration at 100% is typically too fast; 90–95% lets words land.
Ignoring mispronunciations. Brand names, product names, and technical jargon almost always need lexicon entries or phoneme tags.
Shipping raw TTS without post-processing. Loudness normalization, EQ, and subtle compression are the difference between “robot reading” and “professional narration.”
Forgetting silence at the head and tail. Cloud TTS often produces output that starts speaking immediately. Add 300–500ms of silence at each end for natural pacing.
Using neural voices without disclosure when required. Some jurisdictions require disclosure of AI-generated voice in ads and political content.
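The head-and-tail silence fix in the list above is easy to apply to raw PCM: the number of silent samples is the duration in seconds times the sample rate. The sketch assumes mono 16-bit PCM that has already been decoded into an `Int16Array`. The 24,000 Hz sample rate and the function names are illustrative assumptions, not values from any particular engine.

```javascript
// Number of samples that cover a given duration at a given sample rate.
function msToSamples(ms, sampleRate) {
  return Math.round((ms / 1000) * sampleRate);
}

// Return a new buffer with silence (zero samples) before and after mono 16-bit PCM audio.
function padWithSilence(pcm, sampleRate, padMs = 400) {
  const pad = msToSamples(padMs, sampleRate);
  const out = new Int16Array(pcm.length + 2 * pad); // typed arrays start zero-filled
  out.set(pcm, pad);
  return out;
}

const speech = new Int16Array([1200, -800, 400]); // stand-in for decoded TTS samples
const padded = padWithSilence(speech, 24000, 500);
console.log(padded.length);
```

For stereo or compressed formats, an audio editor or the trimming tool mentioned below is the simpler route.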
Run the numbers
Generate speech from text with voice controls using the text-to-speech tool. Pair with the audio trimmer to tidy the start and end of the generated file before shipping, and the speech-to-text tool when you want to generate a transcript for captions from the synthesized audio in a round-trip workflow.