
Highlights from Interspeech 2025

Last week I attended Interspeech for the first time. It's the Monday following the conference, so now that I've had the opportunity to recover from an overconsumption of stroopwafels and beer, I wanted to write a short note about the work that has stuck in my mind since returning from Rotterdam.

PAST: Phonetic-Acoustic Speech Tokenizer

Quite a few speech codecs use an auxiliary semantic distillation loss to make use of representations from large pre-trained SSL models, and this has generally been quite successful. Motivated by the fact that the relationship between the usual codec tokens and phonetic information is often hard to pin down, this work goes a bit further by adding explicit supervision to codec training: the codec also performs auxiliary phone classification and CTC character-matching tasks (think of it as ASR) to embed phonetic representations more directly into the code tokens. I'm not massively convinced by the results, and I'm not sure how I feel about adding supervised data to codec training, but I do really like the formulation of the auxiliary tasks.
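To make the idea concrete, here's a minimal PyTorch-style sketch of what bolting phone classification and CTC heads onto a codec's training objective could look like. The module names, return signatures and loss weights are my own assumptions for illustration, not the paper's code.

```python
# Hedged sketch: codec training step with auxiliary phonetic supervision.
import torch
import torch.nn.functional as F

def codec_training_step(encoder, quantizer, decoder, phone_head, ctc_head,
                        wav, phone_targets, char_targets, char_lengths):
    latents = encoder(wav)                              # (B, T, D) frame-level latents
    _, quantized, vq_loss = quantizer(latents)          # assumed quantizer interface
    recon = decoder(quantized)

    # Usual codec objective: reconstruction + quantization losses.
    recon_loss = F.l1_loss(recon, wav)

    # Auxiliary task 1: frame-level phone classification on the quantized latents.
    phone_logits = phone_head(quantized)                # (B, T, n_phones)
    phone_loss = F.cross_entropy(phone_logits.transpose(1, 2), phone_targets)

    # Auxiliary task 2: CTC over characters ("think of it as ASR" on the codes).
    ctc_logits = ctc_head(quantized).log_softmax(dim=-1)    # (B, T, n_chars)
    input_lengths = torch.full((wav.size(0),), ctc_logits.size(1), dtype=torch.long)
    ctc_loss = F.ctc_loss(ctc_logits.transpose(0, 1), char_targets,
                          input_lengths, char_lengths)

    # Loss weights are illustrative, not taken from the paper.
    return recon_loss + vq_loss + 0.5 * phone_loss + 0.5 * ctc_loss
```

The nice part of the formulation is that both auxiliary heads can be thrown away after training, so inference-time codec cost is unchanged.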

Towards Frame-level Quality Predictions of Synthetic Speech

I really do think an accurate frame-wise MOS predictor is somewhat of a holy-grail endeavour; evaluating speech models is incredibly difficult without putting a lot of work into metrics. The authors define the conditions needed to create a frame-level MOS model and test a range of frame-level predictor formulations on ground-truth data, where perturbations have been applied at known locations in an utterance. I'm excited to see some results on real data.
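Just to illustrate the kind of ground truth that localized perturbations give you, here's a small sketch of building frame-level quality targets from known perturbation spans. The frame rate, spans and score values are my own assumptions, not the paper's setup.

```python
# Hedged sketch: frame-level quality targets from known perturbation locations.
import numpy as np

def frame_quality_targets(n_frames, perturbed_spans, frame_rate_hz=50,
                          clean_score=5.0, degraded_score=1.0):
    """Label every frame as clean except those overlapping a perturbed region."""
    targets = np.full(n_frames, clean_score, dtype=np.float32)
    for start_s, end_s in perturbed_spans:              # spans given in seconds
        lo = int(start_s * frame_rate_hz)
        hi = int(np.ceil(end_s * frame_rate_hz))
        targets[lo:hi] = degraded_score
    return targets

# e.g. a 4-second utterance at 50 frames/s, perturbed between 1.2s and 1.8s
labels = frame_quality_targets(200, [(1.2, 1.8)])
```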

Analysis of Semantic and Acoustic Token Variability Across Speech, Music, and Audio Domains

This poster presented a load of results on codec token distributions: how they tend to follow Zipf's law, how they differ across domains, and why it's unsurprising that speech language models trained on codec tokens perform well, given the tokens' similarity to BPE-tokenized text. Nice and to the point!
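The Zipf check itself is easy to reproduce on any codec token stream: rank tokens by frequency and fit a line on the log-log rank/frequency curve. A quick sketch, with a dummy token stream standing in for real codec output:

```python
# Hedged sketch: does a token stream roughly follow Zipf's law?
import numpy as np

def zipf_slope(token_ids):
    """Log-log slope of the rank-frequency curve (close to -1 under Zipf's law)."""
    _, counts = np.unique(token_ids, return_counts=True)
    freqs = np.sort(counts)[::-1].astype(np.float64)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

tokens = np.random.zipf(a=2.0, size=100_000) % 1024   # stand-in for codec tokens
print(zipf_slope(tokens))
```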

NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference

A paper from NVIDIA, this presents a small codec aimed at speech language models, with a huge number of interesting ablations. The codec itself uses 4 FSQ codebooks (non-residual), which is interesting in itself. The authors seemed to swear by parallel code prediction models (rather than interleaving or single-quantizer approaches). Again, worth a read just for the frame rate × number of quantizers ablations.
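For anyone unfamiliar with FSQ: instead of a codebook lookup, each (bounded) latent dimension is rounded onto a small fixed grid, and the combination of per-dimension levels is the code, so there's nothing residual about it. A rough sketch below, with odd, illustrative level counts rather than NanoCodec's actual configuration.

```python
# Hedged sketch of finite scalar quantization (FSQ) with straight-through gradients.
import torch

def fsq_quantize(z, levels=(7, 7, 7, 7, 7)):
    """Bound each latent dim, round it to a fixed grid, and pack into one code index."""
    levels_t = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (levels_t - 1) / 2
    z_bounded = torch.tanh(z) * half                     # each dim now lies in (-half, half)
    # Straight-through rounding so gradients flow through the quantizer.
    z_q = z_bounded + (torch.round(z_bounded) - z_bounded).detach()
    digits = (torch.round(z_bounded) + half).long()      # per-dim integer in [0, L-1]
    bases = torch.cumprod(torch.tensor([1] + list(levels[:-1])), dim=0).to(z.device)
    codes = (digits * bases).sum(dim=-1)                 # one integer code per frame
    return z_q, codes
```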

Effective Context in Neural Speech Models

It's not super clear how much context speech models actually need. Thinking about it, all of Kyutai's models are fully causal and streamable, which is a nice property, but that constraint definitely makes them harder to train and increases the resources required to produce a decent model across tasks. This paper works through different tasks and models to gauge how much effective context is actually used internally, and shows that most of the tested models don't require that much context and can often be converted to streaming with only a limited drop in performance.
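One way to probe this yourself is to truncate the history a model sees and check how far back the output actually changes. A rough sketch in that spirit, assuming a simple frames-in, frame-embeddings-out model interface (not the paper's method):

```python
# Hedged sketch: estimate effective context by truncating the available history.
import torch

@torch.no_grad()
def effective_context_frames(model, frames, tolerance=1e-2):
    """Smallest history window (in frames) whose output matches the full-context output."""
    full_out = model(frames)[:, -1]               # output at the final frame, full context
    n = frames.size(1)
    for window in range(1, n + 1):
        truncated = frames[:, n - window:]        # keep only the last `window` frames
        out = model(truncated)[:, -1]
        if torch.norm(out - full_out) / torch.norm(full_out) < tolerance:
            return window                         # context beyond this barely matters
    return n
```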