IX · Specialized Domains · Advanced

Automatic Speech Recognition

The technology that converts spoken audio to text - enabling voice interfaces, transcription services, and accessibility tools by mapping the acoustic properties of speech to the words being spoken.

Added May 18, 2026 · 3 min read

Automatic Speech Recognition (ASR) converts spoken language into text. From Siri and Google Assistant understanding voice commands to Zoom's meeting transcription to real-time closed captions on YouTube, ASR is among the most widely deployed AI technologies, handling billions of minutes of speech daily.

Classical ASR systems separated acoustic modelling (what sounds are being made) from language modelling (what words make sense in sequence) and tied the two together with pronunciation dictionaries (how words sound). Hidden Markov Models (HMMs), typically with Gaussian mixture models scoring each state, handled acoustic modelling, and N-gram models handled language modelling; this combination dominated from the 1970s through the 2000s. Deep learning then strengthened each component: deep neural networks replaced the Gaussian mixtures inside hybrid DNN-HMM acoustic models, recurrent language models replaced N-grams, and end-to-end systems eventually replaced the full pipeline.
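The way the classical pipeline combines its components is usually written as a Bayes decision rule. As a sketch, with X standing for the sequence of acoustic feature vectors and W for a candidate word sequence (both symbols are introduced here for illustration and do not appear in the text above):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} \; \underbrace{p(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}
```

The pronunciation dictionary enters through the acoustic term: it expands each word in W into the sequence of sound units whose models score X. The denominator p(X) is dropped because it does not depend on W.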

Connectionist Temporal Classification (CTC) enabled end-to-end ASR training: rather than requiring frame-by-frame alignment between audio and text (which is expensive to annotate), CTC defines a loss that sums over all possible alignments, making it possible to train directly from audio-text pairs without alignment annotations. CTC-based models, along with the closely related RNN-Transducer, were widely used in industrial ASR in the late 2010s.
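To make "sums over all possible alignments" concrete, here is a minimal pure-Python sketch of the CTC forward algorithm on a toy input. The function names, the toy probabilities, and the choice of index 0 for the blank symbol are assumptions made for illustration; production systems work in log space and use a framework's built-in CTC loss instead. A brute-force enumeration of every frame-level path is included only to check that the dynamic program computes the same sum.

```python
import itertools
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC via the forward algorithm.

    probs  : list of T rows, each a list of per-symbol probabilities for one frame
    target : list of label indices (no blanks)
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(probs)

    # alpha[s] = total probability of all partial alignments ending at ext[s]
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skipping over a blank is allowed only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(likelihood)


def collapse(path, blank=0):
    """CTC collapse rule: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out


def brute_force_likelihood(probs, target, blank=0):
    """Sum the probability of every frame-level path that collapses to `target`."""
    n_symbols = len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_symbols), repeat=len(probs)):
        if collapse(path, blank) == list(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


# Toy example: 4 audio frames, 3 symbols (0 = blank, 1 = "a", 2 = "b")
probs = [
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.1, 0.4],
]
target = [1, 2]
nll = ctc_neg_log_likelihood(probs, target)
```

The forward pass costs time proportional to the number of frames times the target length, whereas the brute-force sum grows exponentially with the number of frames; that difference is what makes the loss practical to train with.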

Attention-based encoder-decoder models further simplified ASR: an audio encoder (convolutional layers followed by a Transformer or LSTM) extracts acoustic representations; a decoder generates text token by token using cross-attention to focus on relevant audio representations. These models capture longer-range context in both audio and language, improving performance especially for conversational speech.
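The cross-attention step can be sketched in a few lines of plain Python. This is a simplified, single-head illustration for one decoder step: it assumes the encoder states serve directly as both keys and values, with no learned projection matrices, and the function names and toy vectors are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, encoder_states):
    """Scaled dot-product attention for a single decoder step.

    query          : current decoder state, a list of d floats
    encoder_states : T audio-frame representations, each a list of d floats
    Returns (context vector, attention weights over the T frames).
    """
    d = len(query)
    # Similarity between the decoder state and every encoded audio frame
    scores = [
        sum(q * k for q, k in zip(query, frame)) / math.sqrt(d)
        for frame in encoder_states
    ]
    weights = softmax(scores)
    # Context = weighted average of the frames, dominated by the best matches
    context = [
        sum(w * frame[i] for w, frame in zip(weights, encoder_states))
        for i in range(d)
    ]
    return context, weights


# Toy example: three encoded audio frames of dimension 2
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [4.0, 0.0]  # decoder state that "looks for" the first feature
context, weights = cross_attention(query, encoder_states)
```

In this toy case the first and third frames receive equal, dominant weight because both carry the feature the query matches. In a full model the decoder repeats this step for every output token, so each token can draw on a different region of the audio.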

Whisper (OpenAI, 2022) is the landmark modern ASR system. It was trained on 680,000 hours of diverse, multilingual audio collected from the internet and paired with the transcripts found alongside it, a form of weak supervision rather than careful manual labelling. Instead of specialised ASR training on tightly curated data, Whisper used a simple encoder-decoder Transformer trained at scale on heterogeneous data. The resulting model achieves near-human accuracy on standard English benchmarks, handles 99 languages, performs translation (speech to English text), and is remarkably robust to accents, noise, and technical vocabulary. Whisper's approach of scale over specialisation followed the same playbook that worked for LLMs.
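Accuracy claims like these are normally reported as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of reference words. The sketch below is a minimal version that splits on whitespace only; the function name and example sentences are illustrative, and real evaluations usually normalise casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing from output
                curr[j - 1] + 1,         # insertion: extra word in output
                prev[j - 1] + (r != h),  # substitution, or a free match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion (the second "the")
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Here two errors against six reference words give a WER of 2/6, or about 33%. Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.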

Several challenges remain open: speaker diarisation (identifying who said what in multi-speaker conversations), low-resource languages (Whisper's quality drops substantially for languages with little internet audio), spontaneous conversational speech (filled pauses, incomplete sentences, overlapping speech), technical and domain-specific vocabulary (medical dictation, legal proceedings), and the real-time, low-latency requirements of voice assistants.

Analogy

Picture a court stenographer who listens to spoken testimony and types it verbatim in real time, capturing exactly what was said, by whom, with appropriate punctuation. The stenographer is performing ASR: mapping acoustic signals (spoken words) to text symbols, drawing on both an understanding of English phonetics (the acoustic model) and a familiarity with likely legal language (the language model) to accurately transcribe speech that might be incomplete, accented, or spoken quickly. ASR systems learn these same capabilities from data rather than from human training.

Real-world example

Zoom's live transcription uses ASR to generate real-time captions during meetings. The system is described as processing audio in chunks, using a streaming ASR model to generate text within ~200ms of each spoken word, and applying a post-correction model to improve accuracy after each sentence boundary. The same system generates searchable meeting transcripts and speaker-attributed notes. OpenAI's Whisper has also been used in clinical dictation tools: physicians speak their notes into a phone and a Whisper-based system transcribes them into clinical text, reducing the manual documentation burden.

Why it matters

ASR is the gateway technology for voice interfaces - every voice assistant, transcription service, and accessibility tool depends on it. Understanding ASR explains why voice recognition sometimes fails (accents, background noise, rare vocabulary), why certain languages are better supported than others (data availability), and what the state of the art actually is (near-human accuracy for high-resource languages in clean conditions, but significant degradation for challenging audio). It is also increasingly relevant as audio content becomes a primary medium for AI interaction.

Related concepts

Neural Vocoder

The AI component that converts the abstract numerical output of a speech synthesis model into actual playable audio waveforms - the piece responsible for making AI voices sound natural.

Speaker Diarization

The process of automatically identifying who is speaking at each moment in an audio recording - answering the question 'who spoke when' without knowing the speakers' identities in advance.
