Real-Time Accent Conversion System
The problem
Accent conversion sits at an awkward intersection: ASR systems are heavily biased toward American English, and most TTS systems either ignore accent or get it badly wrong. A speaker with a strong regional accent who wants their voice "translated" into a more neutral target accent has to either record and post-process, or ship audio off to a cloud service. Neither is private; neither is fast.
I wanted a pipeline that ran locally, in near real-time, and worked without internet.
Architecture
The pipeline is intentionally modular — each stage can be swapped or upgraded independently.
Mic → VAD → ASR → Lexical Normalizer → TTS (US voice) → Speaker
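For orientation, here's a minimal sketch of the glue around that pipeline. The stage functions (`record_utterance`, `transcribe`, `normalize`, `synthesize`) are hypothetical names for the components described below; only the `sounddevice` playback calls are real library API.

```python
import sounddevice as sd

SAMPLE_RATE = 16000  # 16 kHz mono, which both Whisper and SpeechT5 expect

def run_pipeline(record_utterance, transcribe, normalize, synthesize):
    """Wire the four stages together; each argument is a callable for one stage."""
    while True:
        audio = record_utterance()             # VAD-delimited chunk from the mic
        if audio is None:
            continue
        text = transcribe(audio)               # faster-whisper
        text = normalize(text)                 # small-LLM rewrite to American English
        wav = synthesize(text)                 # SpeechT5 + American x-vector
        sd.play(wav, samplerate=SAMPLE_RATE)   # play on the default output device
        sd.wait()                              # block until playback finishes
```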
1. Voice Activity Detection (WebRTC VAD)
The simplest piece. WebRTC VAD chops the input audio stream into utterance-sized chunks so the rest of the pipeline doesn't churn on silence.
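A minimal chunker along these lines, using the real `webrtcvad` API; the frame size, aggressiveness, and silence threshold are illustrative rather than the project's exact values.

```python
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30                                       # webrtcvad accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2    # 16-bit mono PCM

vad = webrtcvad.Vad(2)                              # aggressiveness 0 (loose) .. 3 (strict)

def chunk_utterances(pcm: bytes, max_silence_frames: int = 15):
    """Yield utterance-sized byte spans, splitting on runs of silent frames."""
    current, silence = [], 0
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[i:i + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            current.append(frame)
            silence = 0
        elif current:
            silence += 1
            if silence > max_silence_frames:        # ~450 ms of silence ends the utterance
                yield b"".join(current)
                current, silence = [], 0
    if current:
        yield b"".join(current)
```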
2. ASR (faster-whisper)
faster-whisper is a CTranslate2 reimplementation of OpenAI Whisper that's roughly 4× faster on CPU. I used the medium model with int8 quantization to keep latency around a second per utterance on a laptop CPU. The output is a transcribed English sentence with no accent metadata — by design.
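The faster-whisper setup is essentially a two-liner; the `transcribe` wrapper and its decoding options (`beam_size=1` to favor latency) are my sketch of it.

```python
import numpy as np
from faster_whisper import WhisperModel

# Medium model, int8-quantized, CPU-only, as described above
asr = WhisperModel("medium", device="cpu", compute_type="int8")

def transcribe(audio: np.ndarray) -> str:
    """audio: mono float32 samples at 16 kHz."""
    segments, _info = asr.transcribe(audio, language="en", beam_size=1)
    return " ".join(seg.text.strip() for seg in segments)
```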
3. Lexical normalization
This is the secret-sauce stage. Indian English and American English don't just sound different — they have lexical and idiomatic differences. "Prepone the meeting", "do the needful", certain spelling and pluralization patterns. A small LLM (~1B parameters, quantized) rewrites the transcribed sentence into idiomatic American English without changing the meaning. It also fixes Whisper's occasional mishearings on words it has weak priors for.
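The post doesn't name the model or the runtime, so here's a sketch of what this stage could look like with llama-cpp-python and a quantized ~1B instruct model; the GGUF path, prompt wording, and example output are assumptions.

```python
from llama_cpp import Llama

# Hypothetical quantized ~1B instruct model; the path is a placeholder
llm = Llama(model_path="models/normalizer-1b-q4.gguf", n_ctx=512, verbose=False)

SYSTEM = (
    "Rewrite the sentence in idiomatic American English. "
    "Keep the meaning exactly; change wording only where needed. "
    "Return only the rewritten sentence."
)

def normalize(text: str) -> str:
    out = llm.create_chat_completion(
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": text}],
        max_tokens=128,
        temperature=0.0,   # deterministic rewrites
    )
    return out["choices"][0]["message"]["content"].strip()

# normalize("Please prepone the meeting and do the needful.")
# -> e.g. "Please move the meeting up and take care of it."
```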
4. TTS (SpeechT5)
SpeechT5 is Microsoft's speech-language model that takes text plus a speaker x-vector and produces audio. I conditioned it on a high-quality American x-vector (extracted from a reference clip) so the output is consistent regardless of input. Output runs through a small post-processing step (normalization, fade-in / fade-out) before playback.
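A sketch of this stage using the Hugging Face SpeechT5 checkpoints, with the normalize/fade post-processing folded in. The x-vector file name is hypothetical; it stands in for the embedding extracted from the reference clip.

```python
import numpy as np
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
tts = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
xvector = torch.load("american_xvector.pt")      # shape (1, 512), precomputed from a reference clip

def synthesize(text: str, fade_ms: int = 20, sr: int = 16000) -> np.ndarray:
    inputs = processor(text=text, return_tensors="pt")
    speech = tts.generate_speech(inputs["input_ids"], xvector, vocoder=vocoder)
    wav = speech.cpu().numpy()
    wav = wav / (np.abs(wav).max() + 1e-8) * 0.9  # peak-normalize
    n = int(sr * fade_ms / 1000)
    wav[:n] *= np.linspace(0.0, 1.0, n)           # fade-in
    wav[-n:] *= np.linspace(1.0, 0.0, n)          # fade-out
    return wav
```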
Latency budget
The end-to-end target was under three seconds. Here's where the time goes:
| Stage | Typical latency |
|-------|-----------------|
| VAD chunking | ~50ms |
| Whisper ASR | 600–1200ms |
| LLM normalization | 200–600ms |
| SpeechT5 TTS | 400–800ms |
| Audio playback start | ~100ms |
| Total | 1.3–2.7s |
The bulk of the time goes to the two big neural-network stages, ASR and TTS. int8 quantization buys roughly a 2× speedup over fp32 with negligible quality loss for both models.
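To reproduce the per-stage numbers, a small timing wrapper is enough; the project's own instrumentation isn't shown here, so this is just one way to do it.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name: str, log: dict):
    """Record wall-clock time for one stage, in milliseconds."""
    start = time.perf_counter()
    yield
    log[name] = (time.perf_counter() - start) * 1000

# Inside the pipeline loop:
# stats = {}
# with timed("asr", stats):
#     text = transcribe(audio)
# with timed("normalize", stats):
#     text = normalize(text)
# with timed("tts", stats):
#     wav = synthesize(text)
# print(stats)   # per-utterance milliseconds for each stage
```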
What I learned
- Pipelines beat monoliths. Splitting the problem into ASR → normalize → TTS made every stage debuggable in isolation. End-to-end accent-conversion models exist, but they're black boxes when they fail.
- Quantization is mostly free. I expected noticeable degradation from int8. There wasn't — at least, not for sentences a human would ever say.
- The LLM stage is where the personality lives. Without normalization, the output is "American-sounding speech of an Indian-English sentence." With normalization, it's actually American English. Most of the perceived quality gain came from this stage.
What I'd improve
- Continuous streaming rather than utterance chunking — at the cost of much harder voice-cloning conditioning.
- A speaker-preserving variant that conditions TTS on the user's voice with the target accent, instead of a generic American x-vector. This is a real research problem (accent-decoupled voice cloning) and would let the user keep their own identity in the output.
- GPU paths for users who have one — easily another 2–4× faster.
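For the GPU point, the switch is mostly configuration; a sketch of what the two heavy stages would look like, using the standard device options both libraries expose (untested on GPU here).

```python
import torch
from faster_whisper import WhisperModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# faster-whisper: float16 on GPU, fall back to the int8 CPU path otherwise
asr = WhisperModel(
    "medium",
    device=device,
    compute_type="float16" if device == "cuda" else "int8",
)

# SpeechT5 (from the TTS sketch above): move model, vocoder, and inputs to the GPU
# tts = tts.to(device)
# vocoder = vocoder.to(device)
```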
Stack
Python 3.11, faster-whisper, SpeechT5, webrtcvad, numpy, sounddevice. CPU-only inference, int8-quantized.