Implement Voice Accent Conversion AI Model Like Telus for Call Center Applications

Understanding Telus's Voice Accent Conversion System

Telus, one of Canada's largest telecommunications companies, has reportedly deployed AI-driven voice accent conversion technology to modify call agent accents in real time during customer service interactions. This technology raises both technical and ethical questions for developers building similar systems. Understanding the underlying architecture helps you implement responsible voice modification for legitimate use cases such as accessibility improvements or regional preference accommodation.

The Telus system leverages speech-to-text, neural voice conversion, and text-to-speech pipelines—three distinct ML components that must work in harmony. Rather than simply swapping accents, modern implementations use phoneme-level transformations combined with prosody preservation to maintain naturalness.

Core Components of Voice Accent Conversion

1. Speech Recognition Pipeline

The first step requires accurate speech-to-text conversion with speaker identification:

import librosa
import numpy as np
from transformers import pipeline

# Load pre-trained ASR model (e.g., Whisper for multilingual support)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=0
)

# Load audio at 16 kHz (Whisper's expected sample rate)
audio_path = "agent_call.wav"
waveform, sr = librosa.load(audio_path, sr=16000)

# Transcribe the recorded call
result = asr_pipeline({"raw": waveform, "sampling_rate": sr})
transcription = result["text"]
print(f"Transcribed text: {transcription}")

Whisper is robust across accents and noise conditions, with word error rates below 5% on many clean English benchmarks. For real-time call processing, however, you'll want a lower-latency, streaming-friendly setup instead:

# Lower-latency transcription with faster-whisper (CTranslate2 backend);
# for true streaming, feed it successive audio chunks from your WebRTC stack
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Transcribe with voice activity detection to skip silence
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    language="en",
    vad_filter=True  # voice activity detection
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

2. Voice Conversion Neural Network

The critical middle layer transforms acoustic features from source accent to target accent:

import torch
import torch.nn as nn

class AccentConverter(nn.Module):
    """Neural voice conversion model using an encoder-decoder in mel space"""

    def __init__(self, mel_dim=80, hidden_dim=256, num_accents=5):
        super().__init__()

        # Encoder: extract accent-independent content features
        self.encoder = nn.Sequential(
            nn.Linear(mel_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

        # Decoder: generate target-accent acoustic features
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, mel_dim),
        )

        # Accent embeddings; embedding_dim must equal hidden_dim so the
        # accent shift can be added directly to the content features
        self.accent_embedding = nn.Embedding(num_accents, hidden_dim)

    def forward(self, mel_spectrogram, source_accent_id, target_accent_id):
        # Extract content features: (frames, hidden_dim)
        content = self.encoder(mel_spectrogram)

        # Look up accent embeddings
        source_accent = self.accent_embedding(torch.tensor(source_accent_id))
        target_accent = self.accent_embedding(torch.tensor(target_accent_id))

        # Shift the content representation from source toward target accent
        accent_shift = target_accent - source_accent
        converted_features = content + accent_shift.unsqueeze(0)

        # Generate the target mel-spectrogram
        return self.decoder(converted_features)

# Usage with a dummy (frames, mel_dim) input
model = AccentConverter(mel_dim=80, hidden_dim=256)
input_mel = torch.randn(100, 80)
converted_mel = model(input_mel, source_accent_id=0, target_accent_id=2)

This approach preserves speaker identity while shifting accent characteristics. For production systems, consider using pre-trained models from HuggingFace's speech models collection.
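If you train the conversion layer yourself, a minimal training loop looks like the sketch below. It assumes a hypothetical time-aligned parallel corpus (the same sentences spoken in source and target accents); random tensors stand in for real data, and the compact model mirrors the encoder/shift/decoder idea above:

```python
import torch
import torch.nn as nn

class TinyAccentConverter(nn.Module):
    """Compact stand-in for the AccentConverter above (same idea, fewer layers)."""
    def __init__(self, mel_dim=80, hidden_dim=128, num_accents=5):
        super().__init__()
        self.encoder = nn.Linear(mel_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, mel_dim)
        # embedding_dim matches hidden_dim so the shift can be added directly
        self.accent_embedding = nn.Embedding(num_accents, hidden_dim)

    def forward(self, mel, source_id, target_id):
        content = torch.relu(self.encoder(mel))
        shift = self.accent_embedding(target_id) - self.accent_embedding(source_id)
        return self.decoder(content + shift.unsqueeze(1))

# Hypothetical time-aligned parallel batch: (batch, frames, mel_dim) tensors
torch.manual_seed(0)
src_mel = torch.randn(4, 100, 80)
tgt_mel = torch.randn(4, 100, 80)
src_id = torch.zeros(4, dtype=torch.long)
tgt_id = torch.full((4,), 2, dtype=torch.long)

model = TinyAccentConverter()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

first_loss = None
for step in range(200):
    opt.zero_grad()
    pred = model(src_mel, src_id, tgt_id)
    loss = loss_fn(pred, tgt_mel)  # reconstruction loss against target accent
    loss.backward()
    opt.step()
    if first_loss is None:
        first_loss = loss.item()
print(f"loss: {first_loss:.3f} -> {loss.item():.3f}")
```

Real systems usually add adversarial or perceptual losses on top of plain reconstruction, since MSE alone tends to over-smooth spectrograms.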

3. Neural Vocoder for Speech Synthesis

Once you have converted mel-spectrograms, a vocoder reconstructs audio waveforms:

import soundfile as sf
from TTS.vocoder.models.hifigan import HiFiGAN  # module path varies across Coqui TTS versions

# Load a pre-trained HiFiGAN vocoder; vocoder_config is your loaded vocoder
# configuration (the exact loading API depends on your Coqui TTS version)
vocoder = HiFiGAN.init_from_raw_dict(vocoder_config)
vocoder.eval()

# Convert the mel-spectrogram to a waveform
with torch.no_grad():
    waveform = vocoder.inference(converted_mel_spectrogram)

# Save the output at the vocoder's native sample rate
output_audio = waveform.squeeze().cpu().numpy()
sf.write("output_converted.wav", output_audio, samplerate=22050)

End-to-End Implementation Pipeline

| Stage | Tool/Model | Latency | Accuracy |
|-------|-----------|---------|----------|
| Speech Recognition | Whisper Large V3 | 500ms | 95% |
| Voice Conversion | Custom Encoder-Decoder | 200ms | N/A (perceptual) |
| Vocoding | HiFiGAN | 300ms | 4.2/5 MOS |
| Total | Combined Pipeline | ~1000ms | Overall quality depends on training data |

For live call processing, you need to optimize latency significantly. Consider:

# Streaming inference sketch (self.model_onnx and self.vocoder_fast are
# placeholders for your exported converter and vocoder)
class StreamingAccentConverter:
    def __init__(self, chunk_size=512, overlap=256):
        self.chunk_size = chunk_size
        self.overlap = overlap
        self.tail = np.zeros(overlap)  # overlapping tail of the previous output

    def process_chunk(self, audio_chunk):
        # Window the chunk, convert it, then overlap-add with the previous
        # tail for smooth transitions at chunk boundaries
        windowed = np.hanning(len(audio_chunk)) * audio_chunk
        output = self.convert_features(windowed)
        output[:self.overlap] += self.tail
        self.tail = output[-self.overlap:].copy()
        return output[:-self.overlap]

    def convert_features(self, chunk):
        # Fast inference path
        mel = librosa.feature.melspectrogram(y=chunk, sr=16000)
        converted_mel = self.model_onnx(mel)   # ONNX Runtime for speed
        waveform = self.vocoder_fast(converted_mel)
        return waveform

Production Considerations and Ethical Guidelines

Transparency Requirements

Telus's implementation sparked regulatory discussions. When deploying accent conversion, you must:

  1. Disclose usage to callers: Legal compliance requires customers know their agent's voice may be modified
  2. Log all conversions: Maintain audit trails of when and which conversions occur
  3. Implement opt-out mechanisms: Allow customers to decline accent-converted agents
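The logging requirement above can be as simple as an append-only JSON Lines audit trail. A minimal sketch follows; the field names are illustrative, not a regulatory schema, so adapt them to your compliance team's requirements:

```python
import json
import os
import tempfile
import time
import uuid

def log_conversion_event(log_file, call_id, agent_id, source_accent,
                         target_accent, caller_informed, caller_opted_out):
    """Append one accent-conversion audit record as a JSON line."""
    record = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "call_id": call_id,
        "agent_id": agent_id,
        "source_accent": source_accent,
        "target_accent": target_accent,
        "caller_informed": caller_informed,   # disclosure requirement
        "caller_opted_out": caller_opted_out, # opt-out requirement
    }
    with open(log_file, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Illustrative usage with a temp-directory log path
log_path = os.path.join(tempfile.gettempdir(), "accent_audit.jsonl")
rec = log_conversion_event(
    log_path, "call-001", "agent-42",
    source_accent="indian_english", target_accent="american_english",
    caller_informed=True, caller_opted_out=False,
)
print(rec["call_id"])
```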

Training Data Selection

Your model quality depends heavily on diverse, balanced training data:

# Dataset structure recommendation
# data/
#   ├── british_english/
#   ├── american_english/
#   ├── canadian_english/
#   ├── indian_english/
#   └── australian_english/

# Each should contain:
# - Phonetically balanced sentences
# - Multiple speakers per accent
# - Various speaking rates and emotions
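Given that directory layout, a small helper can build a training manifest and flag imbalance between accents. This is a sketch; the accent names and the 2x imbalance threshold are illustrative choices:

```python
import os
import tempfile
from collections import Counter

def build_manifest(data_root, accents):
    """Collect (path, accent) pairs from an accent-per-directory layout."""
    manifest = []
    for accent in accents:
        accent_dir = os.path.join(data_root, accent)
        if not os.path.isdir(accent_dir):
            continue
        for name in sorted(os.listdir(accent_dir)):
            if name.endswith(".wav"):
                manifest.append(
                    {"path": os.path.join(accent_dir, name), "accent": accent}
                )
    return manifest

def check_balance(manifest, max_ratio=2.0):
    """Warn when one accent has more than max_ratio times the clips of another."""
    counts = Counter(item["accent"] for item in manifest)
    if counts and max(counts.values()) > max_ratio * min(counts.values()):
        print(f"WARNING: unbalanced accent distribution: {dict(counts)}")
    return counts

# Illustrative usage with empty placeholder files in a temp directory
root = tempfile.mkdtemp()
for accent in ["british_english", "indian_english"]:
    d = os.path.join(root, accent)
    os.makedirs(d)
    for i in range(3):
        open(os.path.join(d, f"clip_{i}.wav"), "w").close()

manifest = build_manifest(root, ["british_english", "indian_english"])
counts = check_balance(manifest)
print(counts)
```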

Avoiding Bias and Misuse

Accent conversion can perpetuate discrimination if misused. Implementation safeguards:

# Audit your accent conversion system
class AccentBiasAudit:
    def __init__(self, model):
        self.model = model
    
    def test_conversion_consistency(self, test_audio_paths):
        """Verify accent conversion doesn't change phonetic content"""
        for audio_path in test_audio_paths:
            original_transcript = self.asr(audio_path)
            converted_audio = self.convert(audio_path)
            converted_transcript = self.asr(converted_audio)
            
            # Calculate word error rate between original and converted
            wer = self.calculate_wer(original_transcript, converted_transcript)
            if wer > 0.05:  # 5% threshold
                print(f"WARNING: High WER for {audio_path}: {wer}")
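The calculate_wer hook can be implemented with a standard word-level Levenshtein distance, no external dependencies required:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via word-level Levenshtein (edit) distance.

    WER = (substitutions + insertions + deletions) / reference word count.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,          # deletion
                dp[i][j - 1] + 1,          # insertion
                dp[i - 1][j - 1] + cost,   # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the quick brown fox", "the quick brown fox"))  # 0.0
print(word_error_rate("the quick brown fox", "the quick brown dog"))  # 0.25
```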

Implementation Tools and Frameworks

  • TTS Frameworks: Coqui TTS, Glow-TTS (pre-built vocoder pipelines)
  • Speech Recognition: OpenAI Whisper, Pyannote for speaker diarization
  • Voice Conversion: YourTTS, SpeechSplit (open-source models)
  • Inference Optimization: ONNX Runtime, TensorRT for sub-100ms latency

For production deployment on cloud platforms like AWS or Render, containerize your pipeline:

FROM pytorch/pytorch:2.1.0-cuda11.8-runtime-ubuntu22.04

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

Conclusion

Voice accent conversion requires orchestrating multiple ML models with careful attention to latency, accuracy, and ethical deployment. While Telus's implementation demonstrates technical feasibility, responsible developers must implement transparent logging, user consent mechanisms, and bias auditing. Start with pre-trained models (Whisper for ASR, HiFiGAN for vocoding) and fine-tune the voice conversion layer on domain-specific data representing the accents you support.
