Type or paste text, pick a voice and language, adjust speed and pitch, then press play. Your browser handles all synthesis locally. No uploads, no accounts, no limits.
Last updated: March 19, 2026

Text to speech (TTS) converts written text into spoken audio. The process takes a string of characters, analyzes the language structure, applies pronunciation rules, and generates a waveform that sounds like a human voice. Modern TTS systems use concatenative synthesis (splicing recorded speech fragments), parametric synthesis (generating audio from statistical models), or neural synthesis (generating audio directly with a neural network).
Browser-based TTS relies on the Web Speech API, specifically the SpeechSynthesis interface. This API ships with every major browser and draws on the voices installed on the operating system (some browsers, notably Chrome, also bundle their own). On macOS, you get the system voices listed in System Settings. On Windows, you get the Microsoft voices. On Android and Chrome OS, you get Google's voices. No installation or plugin is required.
The result varies by platform. Some voices sound robotic and choppy. Others, particularly newer neural voices on Windows 11 and macOS Sonoma, sound close to natural speech. The quality depends entirely on which voices your OS provides.
The SpeechSynthesis API works through a queue system. You create a SpeechSynthesisUtterance object, set its text, voice, rate, pitch, and volume properties, and pass it to speechSynthesis.speak(). The browser adds it to a queue and starts processing. You can pause, resume, and cancel at any time.
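That flow can be sketched in a few lines of JavaScript. The `speak` helper name is illustrative, not the tool's actual source; the clamp ranges come from the Web Speech API specification (rate 0.1 to 10, pitch 0 to 2):

```javascript
// Clamp rate and pitch to the ranges the Web Speech API accepts
// (rate: 0.1-10, pitch: 0-2 per the spec; most voices only sound
// natural in a much narrower band).
function clampRate(rate) {
  return Math.min(10, Math.max(0.1, rate));
}
function clampPitch(pitch) {
  return Math.min(2, Math.max(0, pitch));
}

// Build and queue an utterance. Guarded so the snippet is a no-op
// outside a browser environment.
function speak(text, { voice = null, rate = 1, pitch = 1, volume = 1 } = {}) {
  if (typeof speechSynthesis === 'undefined') return null;
  const u = new SpeechSynthesisUtterance(text);
  if (voice) u.voice = voice;
  u.rate = clampRate(rate);
  u.pitch = clampPitch(pitch);
  u.volume = volume;
  speechSynthesis.speak(u); // appended to the queue; playback starts when idle
  return u;
}
```

Calling `speechSynthesis.pause()`, `speechSynthesis.resume()`, or `speechSynthesis.cancel()` then controls the whole queue rather than a single utterance.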
Voice loading is asynchronous. Browsers load the voice list after the page renders, so the voice dropdown in this tool listens for the voiceschanged event before populating. Some browsers fire this event once, others fire it multiple times. The tool handles both cases.
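A common pattern is to wrap voice loading in a promise that resolves exactly once, whether the list is already populated or arrives later via `voiceschanged`. This is a sketch, not the tool's actual code; the `loadVoices` name is an assumption, and the `synth` parameter defaults to the real `speechSynthesis` object in a browser:

```javascript
// Resolve with the voice list, handling both browsers that populate it
// immediately and those that fire 'voiceschanged' (possibly more than
// once). The listener removes itself so the promise resolves only once.
function loadVoices(synth = typeof speechSynthesis !== 'undefined' ? speechSynthesis : null) {
  return new Promise((resolve) => {
    const existing = synth.getVoices();
    if (existing.length > 0) return resolve(existing);
    const onChange = () => {
      synth.removeEventListener('voiceschanged', onChange);
      resolve(synth.getVoices());
    };
    synth.addEventListener('voiceschanged', onChange);
  });
}
```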
Word-level highlighting works through the boundary event on the utterance object. When the speech engine reaches a word boundary, it fires an event containing the character index and length of the current word. The tool uses this data to highlight the active word in the text display. Not all browsers support boundary events with the same precision. Chrome and Edge provide word-level boundaries consistently. Safari and Firefox may only provide sentence-level boundaries or skip them entirely.
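One way to turn a boundary event into a highlight span, assuming a `highlight(start, end)` callback that updates the text display (both helper names here are hypothetical):

```javascript
// Given the full text and a boundary event's charIndex, return the
// [start, end) span of the current word. Some engines omit charLength,
// so we fall back to scanning for the next whitespace character.
function wordSpanAt(text, charIndex, charLength) {
  let end;
  if (charLength && charLength > 0) {
    end = charIndex + charLength;
  } else {
    const offset = text.slice(charIndex).search(/\s/);
    end = offset === -1 ? text.length : charIndex + offset;
  }
  return [charIndex, end];
}

// Wiring it to an utterance (browser only). The highlight() callback is
// hypothetical; the real tool updates its text display at this point.
function attachHighlighting(utterance, text, highlight) {
  utterance.addEventListener('boundary', (e) => {
    if (e.name !== 'word') return; // some engines also fire 'sentence' boundaries
    const [start, end] = wordSpanAt(text, e.charIndex, e.charLength);
    highlight(start, end);
  });
}
```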
The number of available voices depends on your device and operating system. A typical setup provides between 20 and 200 voices across multiple languages. The language filter in this tool groups voices by their BCP 47 language tag (en-US, fr-FR, de-DE, etc.) so you can find voices for a specific locale quickly.
| Platform | Typical Voice Count | Notable Voices |
|---|---|---|
| macOS 14+ | 80-120 | Siri voices, enhanced Samantha |
| Windows 11 | 30-80 | Microsoft neural voices (Jenny, Aria) |
| Chrome (all OS) | 20-40 | Google US/UK English, multilingual |
| Android | 30-60 | Google TTS engine voices |
| iOS / iPadOS | 60-100 | Siri voices, downloadable enhanced |
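The language filter's grouping step reduces to a small pure function. The name `groupVoicesByLang` is illustrative; the function works on any array of objects with a `lang` property, including real `SpeechSynthesisVoice` objects from `speechSynthesis.getVoices()`:

```javascript
// Group a voice list by its BCP 47 language tag (en-US, fr-FR, de-DE, ...)
// so a dropdown can show voices for one locale at a time.
function groupVoicesByLang(voices) {
  const groups = {};
  for (const voice of voices) {
    const tag = voice.lang || 'unknown';
    (groups[tag] = groups[tag] || []).push(voice);
  }
  return groups;
}
```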
You can install additional voices through your operating system settings. On macOS, go to System Settings > Accessibility > Spoken Content > System Voice > Manage Voices. On Windows, go to Settings > Time & Language > Speech > Manage voices. New voices appear in the dropdown after installation and a page refresh.
TTS serves multiple accessibility needs. People with dyslexia often find it easier to process information when they can hear it read aloud while following along visually. The word highlighting in this tool supports that dual-channel approach by marking the current word as the voice speaks it.
People with visual impairments use screen readers for daily computer use, but TTS tools like this one offer a simpler interface for one-off tasks: reading an article, reviewing an email draft, or proofing a document. Hearing text read back often reveals errors that visual proofreading misses, such as repeated words, missing articles, or awkward phrasing.
Language learners use TTS to hear correct pronunciation. By selecting a native-language voice and adjusting the speed to a slower rate, learners can hear how words and phrases should sound. The pitch and speed controls let them customize the output to match their comprehension level.
Speed has the largest effect on naturalness. The default rate of 1.0 is calibrated to a comfortable listening pace. Going above 1.5x makes most voices sound mechanical. Going below 0.7x introduces unnatural pauses between syllables. For most use cases, stay between 0.8x and 1.3x.
Pitch changes the fundamental frequency of the voice. A pitch of 1.0 is the voice's natural pitch. Values below 1.0 produce a deeper voice; values above 1.0 produce a higher one. Small adjustments (0.8 to 1.2) sound more natural than extreme values.
Punctuation affects pacing. Periods create longer pauses than commas. Semicolons and colons create medium pauses. Question marks change the intonation on most voices. If your text sounds too rushed, add commas or break long sentences into shorter ones. If it sounds choppy, combine short sentences.
Some voices handle proper nouns, abbreviations, and numbers better than others. If a voice mispronounces a word, try spelling it phonetically. For example, changing "nginx" to "engine X" or "GIF" to "jif" can fix pronunciation issues without changing the meaning.
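A respelling pass like that can be automated with a substitution map applied before speaking. A sketch using the examples above; the function name and the whole-word matching strategy are assumptions:

```javascript
// Apply a user-maintained map of phonetic respellings to the input text.
function applyPronunciationFixes(text, fixes) {
  let out = text;
  for (const [word, spoken] of Object.entries(fixes)) {
    // Whole-word, case-sensitive replacement; escape regex metacharacters
    // so map keys are treated literally.
    const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    out = out.replace(new RegExp(`\\b${escaped}\\b`, 'g'), spoken);
  }
  return out;
}
```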
Yes, the tool works offline once the page has loaded. The Web Speech API uses voices installed on your operating system, so no internet connection is needed for speech synthesis. The page itself needs to load initially, but after that, all processing is local. Some browser-provided voices (marked as "network" in the voice list) require an internet connection, but most system voices work fully offline.
The Web Speech API exposes whatever voices the operating system and browser provide. macOS includes Siri and enhanced voices. Windows includes Microsoft voices. Chrome adds its own set of Google voices. Each combination of OS, browser, and installed language packs produces a different voice list. You can install additional voices through your OS accessibility or language settings.
This tool does not impose a character limit. However, some browsers cap individual utterances. Chrome, for example, stops speaking after roughly 15 minutes of continuous speech. For very long texts, the tool automatically splits the input into smaller chunks and queues them sequentially, so playback continues without interruption regardless of length.
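One way to implement that kind of chunking is to accumulate sentences until a size limit is reached, so pauses fall at natural boundaries. The 2000-character limit below is an illustrative safety margin, not a documented browser constant:

```javascript
// Split long input into chunks below maxLen, preferring sentence
// boundaries. A single sentence longer than maxLen stays intact.
function splitIntoChunks(text, maxLen = 2000) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = '';
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxLen) {
      chunks.push(current);
      current = '';
    }
    current += sentence;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then be wrapped in its own utterance and passed to `speechSynthesis.speak()`; the API's built-in queue plays them back to back.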
The Web Speech API does not produce a downloadable audio stream by default. This tool offers a Record Audio button that uses the MediaRecorder API combined with AudioContext to capture the audio output in browsers that support this approach. The result downloads as a .webm file. If your browser does not support recording, you can use screen recording software to capture the audio while it plays.
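Because MediaRecorder support varies by browser, a recorder typically probes for a usable container format first. The helper below takes the support check as a parameter so the logic runs anywhere; in a browser you would pass `MediaRecorder.isTypeSupported`. The candidate list is an assumption, ordered to prefer the .webm output mentioned above:

```javascript
// Return the first MIME type the current environment's recorder
// supports, or null if none of the candidates work.
function pickRecordingMimeType(isTypeSupported) {
  const candidates = ['audio/webm;codecs=opus', 'audio/webm', 'audio/mp4'];
  return candidates.find((type) => isTypeSupported(type)) || null;
}
```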
No. All processing runs entirely in your browser, and the text never leaves your device. There are no analytics tracking your input, no server-side processing, and no data stored anywhere. You can verify this by opening your browser's developer tools and monitoring the Network tab while using the tool.