How to convert text to speech
Web Speech API vs neural/commercial TTS, picking voices, SSML for pauses and emphasis, rate and pitch control, and when to export vs play inline.
Text-to-speech went from robotic-sounding novelty to genuinely human-sounding tool in the late 2010s, as neural TTS models (WaveNet, Tacotron, later Glow-TTS and VALL-E) replaced the older concatenative and formant-synthesis approaches. The difference is striking: modern TTS is used for audiobook narration, podcast ads, IVR systems, and accessibility tools without listeners realizing it's synthetic. Using TTS well, though, still takes more than pasting in text. This guide covers SSML markup for precise control, voice-selection criteria, prosody (rate, pitch, volume), the tradeoff between the browser's free Web Speech API and cloud TTS services, and the accessibility considerations that separate "technically reads the text" from "actually useful for screen-reader users."
What neural TTS does differently
Old TTS pipelines concatenated recorded phonemes (tiny speech fragments) and smoothed the seams. Output sounded segmented and robotic. Neural TTS generates the waveform directly from text using deep learning — either in a two-stage pipeline (text to mel-spectrogram, then neural vocoder to waveform) or end-to-end (text straight to waveform). The result has natural prosody, breathing, and intonation.
Current state-of-the-art systems can clone a voice from 3–5 seconds of reference audio, match emotional tone, and even preserve a speaker's accent across languages. The tradeoff is compute: neural TTS typically needs a GPU for real-time generation, unlike the old concatenative systems that ran on phones in 2005.
SSML: the markup language for TTS
Speech Synthesis Markup Language (SSML) is a W3C standard that lets you control how text is rendered. It looks like HTML with TTS-specific tags.
<speak>
<p>
Welcome to the tutorial.
<break time="500ms"/>
Today we’ll cover three topics:
SSML, <emphasis level="strong">voice selection</emphasis>,
and <prosody rate="slow">carefully-paced narration</prosody>.
</p>
<p>
The meeting starts at <say-as interpret-as="time">3:00 PM</say-as>,
and the ID is <say-as interpret-as="characters">ABC123</say-as>.
</p>
</speak>
Not all TTS engines support every SSML tag. AWS Polly and Google Cloud TTS support broad SSML; OpenAI's TTS API currently accepts only plain text. Check your engine's documentation before authoring SSML.
Key SSML tags
| Tag | Effect |
| --- | --- |
| <break time="500ms"/> | Pause for 500 milliseconds |
| <prosody rate="slow"> | Slower speech |
| <prosody rate="fast"> | Faster speech |
| <prosody pitch="+3st"> | Raise pitch by 3 semitones |
| <prosody volume="+6dB"> | Louder |
| <emphasis level="strong"> | Emphasize the enclosed words |
| <say-as interpret-as="date"> | Read "2024-04-23" as "April 23" |
| <say-as interpret-as="telephone"> | Read digits as a phone number |
| <say-as interpret-as="characters"> | Spell out letter by letter |
| <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme> | Pronounce using an IPA transcription |
| <sub alias="World Health Organization">WHO</sub> | Substitute spoken text for an abbreviation |
Voice selection
The voice sets the personality of the output. Most cloud TTS services offer dozens of voices per language: Amazon Polly has named voices such as Matthew, Joanna, and Ivy; Google Cloud uses letter-coded WaveNet voices like en-US-Wavenet-A; Azure ships over 400 neural voices across 100+ languages.
Voice choice criteria: match the content’s formality (a news-style voice vs conversational), match the demographic you’re targeting (age, accent, gender), and test with your actual script — some voices handle long sentences better than others.
Prosody: rate, pitch, volume
Rate is specified as a named value ("slow," "medium," "fast") or as a percentage (50%–200%). Typical settings:
| Content type | Recommended rate |
| --- | --- |
| Audiobook narration | 90–95% (slightly slow, let words land) |
| Podcast ad read | 100% (natural) |
| News / announcements | 105–110% |
| Tutorial voiceover | 90–100% |
| Accessibility (screen reader) | User preference; default 100% |
Pitch is specified in semitones (-20st to +20st) or as a percentage. Small shifts (±2 semitones) are useful to distinguish characters in a dialogue or to match a brand voice; large shifts (±10st) sound cartoonish.
Volume is specified in dB (-40dB to +6dB) or as named levels. It is rarely needed; normalize loudness in post-production instead.
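In practice, these prosody settings end up as attributes on an SSML <prosody> tag. A minimal sketch of a helper that builds one (the function name and defaults are illustrative, not from any SDK):

```javascript
// Wrap text in an SSML <prosody> tag with rate/pitch/volume attributes.
// Defaults mirror the engine defaults: medium rate, no pitch or volume shift.
function prosody(text, { rate = "medium", pitch = "+0st", volume = "+0dB" } = {}) {
  // Escape XML special characters so user text can't break the markup.
  const escaped = text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
  return `<prosody rate="${rate}" pitch="${pitch}" volume="${volume}">${escaped}</prosody>`;
}

// e.g. audiobook pacing: slightly slow, neutral pitch and volume
const ssml = `<speak>${prosody("Chapter one.", { rate: "92%" })}</speak>`;
```

Building the markup programmatically (rather than string-pasting tags by hand) keeps the XML well-formed even when the narration text contains characters like & or <.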
Web Speech API (browser-native)
Browsers include a free TTS engine via the SpeechSynthesis API. No API key, no per-character cost, works offline. The quality varies dramatically by OS — macOS and iOS use Apple’s high-quality neural voices; Windows uses decent neural voices; Linux often has only basic eSpeak voices.
const utter = new SpeechSynthesisUtterance("Hello, world.");
utter.rate = 1.0;   // 0.1–10; 1.0 is normal speed
utter.pitch = 1.0;  // 0–2
utter.volume = 1.0; // 0–1
// Note: getVoices() can return an empty list until the browser fires
// its "voiceschanged" event; if no match is found, the default voice is used.
utter.voice = speechSynthesis.getVoices()
  .find(v => v.name.includes("Samantha")) ?? null;
speechSynthesis.speak(utter);
The Web Speech API has no SSML support: you can set rate, pitch, and volume per utterance, but not mid-sentence emphasis or pauses. For richer control, use a cloud TTS service.
Cloud TTS comparison
| Provider | Voices | Neural | SSML | Price (per 1M chars) |
| --- | --- | --- | --- | --- |
| AWS Polly | 60+ | Yes | Full | $4 standard, $16 neural |
| Google Cloud | 220+ | Yes | Full | $4–$16 depending on tier |
| Azure | 400+ | Yes | Full | $4–$16 |
| ElevenLabs | Dozens | Yes | Some | $5–$30 |
| OpenAI TTS | 6 | Yes | None | $15 |
For long-form content (audiobooks, podcast production) the cost matters; a 50,000-character chapter is ~$0.20–$0.80. For real-time applications (phone systems, games), latency matters more. ElevenLabs and Azure are the common choices for expressive narration; AWS and Google for high-volume IVR.
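To sanity-check the per-chapter figure above, here is a tiny cost estimator (the function name is illustrative; the rates are the per-million-character prices from the table):

```javascript
// Estimate TTS cost in dollars for a script of a given length,
// at a given per-million-character price.
function ttsCost(charCount, pricePerMillion) {
  return (charCount / 1_000_000) * pricePerMillion;
}

// A 50,000-character chapter:
ttsCost(50_000, 4);  // 0.20 — $4/1M standard tier
ttsCost(50_000, 16); // 0.80 — $16/1M neural tier
```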
Pronunciation control
TTS engines mispronounce unusual words, brand names, and technical terms. Fixes:
Spell it phonetically in the text. “Write ‘kubernetes’ as ‘koo-bur-NET-eez’ in the script.” Works but looks odd to editors.
Use SSML phoneme tags. <phoneme alphabet="ipa" ph="ˌkuːbərˈnɛtiːz">kubernetes</phoneme>. Precise, but requires IPA knowledge.
Define a custom lexicon. AWS Polly and Google support uploading a lexicon file that applies to all requests — best for brand names used across many scripts.
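As a sketch, a lexicon file in the W3C Pronunciation Lexicon Specification (PLS) format, the format AWS Polly accepts, might look like this (the IPA transcription is illustrative; check your provider's docs for its supported PLS subset):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>kubernetes</grapheme>
    <phoneme>ˌkuːbərˈnɛtiːz</phoneme>
  </lexeme>
</lexicon>
```

Once uploaded, the pronunciation applies to every request that references the lexicon, so each script no longer needs inline phoneme tags.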
Audio output format
Cloud TTS typically offers MP3 (good default, small file, universal support), WAV or PCM (lossless, large, good for further editing), OGG (smaller than MP3, less universal), or a streaming format for real-time playback.
For a final podcast or video deliverable, request WAV or 320 kbps MP3, apply any post-processing (compression, EQ, loudness normalization to -16 LUFS), then export to the final format. Don't ship the raw TTS MP3 as-is; post-processing is what makes it sound professional.
Accessibility considerations
Screen-reader users consume TTS output hours per day. A few rules for TTS-accessible content:
Respect the user’s chosen voice and rate — don’t hardcode a fast rate “to save time.” Screen-reader users typically listen at 300+ WPM with practice.
Provide punctuation that TTS engines interpret correctly. A dash (—) creates a pause; parentheses group phrases; an em-space after a sentence allows natural breath. Avoid Unicode decorations and special characters that engines may verbalize literally (“star”, “black-small-square”).
For accessibility-focused apps, offer a voice-selection UI rather than hardcoding one voice.
Common mistakes
Assuming all engines support SSML. OpenAI TTS ignores SSML entirely. Test your markup against your actual engine.
Using the default voice without auditioning alternatives. Voice choice dramatically changes perceived quality. Compare three or four on the same script before committing.
Speaking too fast for long-form content. Audiobook narration at 100% is typically too fast; 90–95% lets words land.
Ignoring mispronunciations. Brand names, product names, and technical jargon almost always need lexicon entries or phoneme tags.
Shipping raw TTS without post-processing. Loudness normalization, EQ, and subtle compression are the difference between “robot reading” and “professional narration.”
Forgetting silence at the head and tail. Cloud TTS often produces output that starts speaking immediately. Add 300–500ms of silence at each end for natural pacing.
Using neural voices without disclosure when required. Some jurisdictions require disclosure of AI-generated voice in ads and political content.
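The head/tail silence fix is easy to script. A minimal sketch that pads raw PCM samples (as a Float32Array, the format the Web Audio API uses) with silence on each end; the function name and 400 ms default are assumptions:

```javascript
// Pad a buffer of raw PCM samples with leading and trailing silence.
function padWithSilence(samples, sampleRate, headMs = 400, tailMs = 400) {
  const head = Math.round(sampleRate * headMs / 1000);
  const tail = Math.round(sampleRate * tailMs / 1000);
  // A new Float32Array is zero-filled, i.e. silent.
  const out = new Float32Array(head + samples.length + tail);
  out.set(samples, head); // copy the original audio between the pads
  return out;
}

// One second of audio at 44.1 kHz gains 400 ms (17,640 samples) per end:
const padded = padWithSilence(new Float32Array(44100), 44100);
// padded.length === 79380
```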
Run the numbers
Generate speech from text with voice controls using the text-to-speech tool. Pair with the audio trimmer to tidy the start and end of the generated file before shipping, and the speech-to-text tool when you want to generate a transcript for captions from the synthesized audio in a round-trip workflow.