Whisper Code Review — Dissecting the Internal Structure of OpenAI’s STT Model

OpenAI’s Whisper is an open-source Speech-to-Text (STT) model. On the surface, it may seem like “just a model that converts speech to text,” but when you look at the code, it’s closer to a Transformer-based multimodal inference engine. In this post, we’ll dive deep into Whisper’s code and dissect its architecture.


🎯 1. Overall Architecture Overview

The core of the Whisper repository lies in just 3 files:

  • model.py — Model definition (Transformer encoder-decoder)

  • audio.py — Audio preprocessing (FFT, mel spectrogram)

  • transcribe.py — Actual inference pipeline

Thanks to this simple structure, Whisper is one of the most readable PyTorch codebases.
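
Everything else in the package is glue around these three files. As a quick orientation (a minimal usage sketch; the audio file name is a placeholder), the public API maps onto them almost one-to-one:

import whisper

model = whisper.load_model("base")        # builds the encoder-decoder defined in model.py
result = model.transcribe("audio.mp3")    # transcribe.py drives audio.py and the model
print(result["text"])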

🎧 2. Audio Processing Pipeline (audio.py)

Let’s first look at the audio processing part.

import subprocess
import numpy as np

def load_audio(file: str, sr: int = 16000):
    # decode any container/codec ffmpeg understands into 16 kHz mono PCM
    out = subprocess.run(
        ["ffmpeg", "-i", file, "-f", "s16le", "-ac", "1", "-acodec", "pcm_s16le", "-ar", str(sr), "-"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # int16 PCM -> float32 in [-1, 1]
    return np.frombuffer(out.stdout, np.int16).astype(np.float32) / 32768.0

Key Points:

  • Whisper normalizes all input audio to 16kHz mono float32.

  • It directly calls ffmpeg to support various formats.

  • The output is a float32 numpy array in the [-1, 1] range.
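
As a quick sanity check on the snippet above (the file name is a placeholder):

audio = load_audio("speech.mp3")      # hypothetical input file
print(audio.dtype, audio.shape)       # float32, (n_samples,)
print(audio.min(), audio.max())       # stays within [-1.0, 1.0]
print(len(audio) / 16000, "seconds")  # 16,000 samples per second of audio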

Next is the mel spectrogram conversion.

import librosa
import numpy as np

def log_mel_spectrogram(audio):
    # 80 mel bands, 25 ms windows (n_fft=400) with a 10 ms hop (hop_length=160) at 16 kHz
    spec = librosa.feature.melspectrogram(
        y=audio, sr=16000, n_fft=400, hop_length=160, n_mels=80
    )
    return np.log10(np.maximum(spec, 1e-10))

What matters here is that the output is an 80-channel Mel spectrogram. (The repository actually computes it with torch.stft and precomputed mel filter banks rather than librosa, but the resulting feature is the same.) This 80 × T matrix of frames is what the encoder consumes as its input sequence — essentially treating audio “like an image.”
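
To make the shapes concrete, here is a small back-of-the-envelope sketch, assuming the standard 30-second window that Whisper pads or trims audio to:

SAMPLE_RATE = 16000
HOP_LENGTH = 160                                      # one frame every 10 ms
CHUNK_SECONDS = 30                                    # Whisper works on 30-second windows

frames_per_second = SAMPLE_RATE // HOP_LENGTH         # 100 frames per second
frames_per_chunk = frames_per_second * CHUNK_SECONDS  # 3000 frames
print((80, frames_per_chunk))                         # encoder input: an 80 x 3000 "image"

After the encoder’s stride-2 convolution (shown in the next section), those 3000 frames become 1500 positions — exactly the encoder’s context length.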

🧠 3. Model Definition (model.py)

The core of Whisper is the Transformer. Looking at the code, it is essentially a GPT-style autoregressive decoder paired with a bidirectional (BERT-style) encoder.

import torch.nn as nn

class Whisper(nn.Module):
    def __init__(self, dims):
        super().__init__()
        # dims bundles every hyperparameter: layer counts, widths, vocab size, ...
        self.encoder = AudioEncoder(dims)
        self.decoder = TextDecoder(dims)
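
Those hyperparameters travel through a single ModelDimensions record. A sketch of what it carries — the field names match the repository, while the example values are roughly those of the base checkpoint and should be treated as approximate:

from dataclasses import dataclass

@dataclass
class ModelDimensions:
    n_mels: int = 80           # mel channels fed to the encoder
    n_audio_ctx: int = 1500    # encoder positions per 30-second window
    n_audio_state: int = 512   # encoder width
    n_audio_head: int = 8
    n_audio_layer: int = 6
    n_vocab: int = 51865       # multilingual BPE vocabulary
    n_text_ctx: int = 448      # maximum decoder tokens
    n_text_state: int = 512
    n_text_head: int = 8
    n_text_layer: int = 6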

🎙️ Encoder

AudioEncoder takes the Mel spectrogram as input and creates a latent representation.

import torch.nn as nn
from torch.nn import Conv1d

class AudioEncoder(nn.Module):
    def __init__(self, dims):
        super().__init__()
        self.conv1 = Conv1d(80, dims.n_audio_state, kernel_size=3, stride=1, padding=1)
        self.conv2 = Conv1d(dims.n_audio_state, dims.n_audio_state, kernel_size=3, stride=2, padding=1)  # halves the time axis
        self.blocks = nn.ModuleList([ResidualAttentionBlock(...) for _ in range(dims.n_audio_layer)])

Two convolutions first project and downsample the spectrogram along time (the stride-2 conv2 turns 3000 frames into 1500 positions), and a stack of self-attention blocks then extracts the temporal features — the “audio context vectors.”
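
The forward pass is correspondingly short. A hedged sketch of what it does — positional_embedding and ln_post are additional attributes of the real encoder that the constructor above omits:

import torch.nn.functional as F

def forward(self, x):                     # x: (batch, 80, 3000) log-mel frames
    x = F.gelu(self.conv1(x))
    x = F.gelu(self.conv2(x))             # time axis: 3000 -> 1500
    x = x.permute(0, 2, 1)                # (batch, 1500, n_audio_state)
    x = x + self.positional_embedding     # fixed sinusoidal positions
    for block in self.blocks:
        x = block(x)                      # plain self-attention, no causal mask
    return self.ln_post(x)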

✍️ Decoder

TextDecoder is almost identical to GPT. The difference is that it references audio encoding through Cross-Attention.

class TextDecoder(nn.Module):
    def forward(self, tokens, audio_features):
        # token + learned positional embeddings (simplified)
        x = self.token_embedding(tokens) + self.positional_embedding[: tokens.shape[-1]]
        for block in self.blocks:
            # causal self-attention, then cross-attention over the audio features
            x = block(x, audio_features, mask=self.mask)
        # the real code additionally projects through the token embedding to get logits
        return self.ln(x)

In other words, Whisper is “GPT that predicts sentences while looking at audio.”
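
That is exactly what the decoding loop exploits. A minimal greedy-decoding sketch, assuming the decoder returns vocabulary logits as the real TextDecoder does (the helper arguments are hypothetical; the repository implements this in whisper/decoding.py with beam search, timestamp rules, and a KV cache):

import torch

def greedy_decode(model, mel, sot_sequence, eot_token, max_tokens=224):
    audio_features = model.encoder(mel)                 # encode the 30-second window once
    # start from <|startoftranscript|> <|language|> <|task|>
    tokens = torch.tensor([list(sot_sequence)], device=mel.device)
    for _ in range(max_tokens):
        logits = model.decoder(tokens, audio_features)  # cross-attends to the audio
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eot_token:              # stop at <|endoftext|>
            break
    return tokens[0].tolist()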

⚙️ 4. Inference Pipeline (transcribe.py)

The entire pipeline is this simple:

import torch

def transcribe(model, audio):
    # 1) audio -> log-mel spectrogram (padded/trimmed to a 30-second window in the real code)
    mel = log_mel_spectrogram(audio)
    mel = torch.from_numpy(mel).to(model.device)
    # 2) Transformer encoding + autoregressive decoding
    result = model.decode(mel)
    # 3) the DecodingResult already carries the detokenized text
    return result.text

Just three steps:

  • Audio load + Mel conversion

  • Transformer encoding + decoding

  • Decode tokens to text

Layer beam search, temperature fallback, language detection, and the other options on top of this loop and you get the features exposed by the Whisper CLI, such as --task translate and --temperature 0.2.
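
The same options are available from Python through model.transcribe; a short usage sketch (the file name is a placeholder):

import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "audio.mp3",              # placeholder input file
    task="translate",         # translate into English instead of transcribing
    temperature=(0.0, 0.2),   # temperature fallback schedule
    beam_size=5,              # beam search is used at temperature 0
)
print(result["text"])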

🌍 5. Multilingual Support

One of Whisper STT’s greatest strengths is multilingual recognition. The multilingual model supports 99 languages and includes automatic language detection, so it can determine the input language without prior specification and transcribe it to text.

At the code level, this hinges on the language parameter: if it is left unset, transcribe.py runs model.detect_language() on the first 30-second window and feeds the detected language token to the decoder. Thanks to the multilingual data the model was trained on, it shows high accuracy in languages well beyond English, including Spanish, Chinese, and Korean.
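
A sketch of using the lower-level API for detection directly, closely following the example in the repository README:

import whisper

model = whisper.load_model("base")

# prepare a single 30-second log-mel window
audio = whisper.load_audio("audio.mp3")            # placeholder input file
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# probability distribution over the supported language tokens
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")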

⚡ 6. The whisper.cpp Port

While the reference implementation is PyTorch-based, there is a C++ port called whisper.cpp, built on the ggml tensor library. Its main advantages:

  • Fast inference even in CPU environments

  • Executable on mobile and embedded devices

  • Usable without PyTorch installation

In practice, whisper.cpp squeezes out speed and memory efficiency through techniques such as integer quantization and careful memory management. It is an excellent alternative when you want to run Whisper STT on local devices rather than servers.

💡 7. Whisper’s Design Philosophy

Reading through the code, you can feel the Whisper team’s philosophy:

  • Simple > Clever — Consistent design over complex tricks

  • Modularized pipeline — Audio → Mel → Encoder → Decoder → Text

  • End-to-end in one framework — even the mel-spectrogram step is implemented in PyTorch, so preprocessing and model share a single code path

Whisper is a best-practice example of how to implement a large model with a simple structure. The result is a codebase that is easy to read and maintain relative to its model size.

🏁 8. Conclusion

Whisper STT’s code reads like a standard textbook for STT: the entire process of converting speech to text with a Transformer is laid out concisely and intuitively. Thanks to multilingual support and the whisper.cpp port, its range of applications is wide as well.

📚 If you want to dig deeper:

  • How beam search is implemented in Whisper

  • How Whisper handles multilingual tokens and language detection

  • How the whisper.cpp and MLX ports compare in their optimizations
