Highlights from NeurIPS 2024
Dec 30, 24This December, I was lucky enough to have been able to attend NeurIPS in Vancouver with Neuphonic. I had a pretty great time attending: I got to meet some fantastic people, listened to some great talks and just about survived simultaneously being jet-lagged, hungover and partaking in brutal, detailed, technical discussion for an entire week.
As this was my first conference, I was pretty taken aback by how knackering the entire process was. The conference centre is enormous, and there’s dozens of sessions going on concurrently. At a conference of this size there’s a subtle but steep learning curve to filtering signal from the noise and discerning the relevant work from the irrelevant; I feel like this is only amplified as a speech researcher, as the speech and audio community at NeurIPS is relatively small and thus a lot of time is spent attempting to work out how ideas or approaches might map onto the problems we face in our domain.
I’ve written a small and biased list of the conference sessions/pieces of work that have stuck around in my mind afer the conference - I thought I might as well share it.
Flow Matching Tutorial
The workshop on flow matching by Yaron Lipman and co was genuinely fantastic and did a great job of building up intuition about flow matching with rigour. One paper mentioned by the authors, Bespoke Solvers for Flow Matching was probably my favourite piece of work I came across at the conference - it outlines some elegant theory which enables the training of bespoke solvers for pre-trained flow-based models that ‘shrink time’, facilitating convergence in a small number of function evaluations.
Latent Functional Maps
This was one of my favourite poster sessions. The work defines a different way to align representational spaces of different models with arbitrary dimensions, by approximating the samples of the representational spaces on an approximation of their underlying manifold. The method appears to be more robust than existing methods like CCA and CKA. The part which interested me the most is that the authors showed that using these maps, they were able to conduct neural stiching that essentially enables you to stitch two representational spaces together to create a composite model; I don’t think the results were great but it’s early doors and an interesting idea, especially in the age of huggingface where so many accesible pre-trained models exist.
The Platonic Representation Hypothesis
Definitely a highlight of the UniReps workshop, the talk (and paper) covers some pretty surprising findings about how conceptual representational spaces appear to converge between models trained in different modalities (i.e. text and images), and how this effect appears to only compound with scale. For example, as self-supervised models scale, the distance between the text apple
and orange
when presented to BERT is similar to an image of an apple
and an orange
when presented to CLIP (this isn’t a result they presented, but an example of the key takeaway). This is definitely an unexpected, and I’d argue quite profound, result. The authors go on to argue that maybe this is just a phenomenon inherent to modelling the same world, even via different modalities (with ackthnowledged caveats of such a vast statement).
SNAC: Multi-Scale Neural Audio Codec
This work is pretty simple, hinging on the idea that higher level codes in neural audio codecs are change at different frequencies to the lower-level increasingly fine-grained codes; they designed their codec around this observation and ended up with a pareto-improvement over existing codecs like DAC.
Indic-Voices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
This paper describes a restoration process for a speech dataset of all 22 official languages spoken in India. What’s interesting about this work is that all of the data was collected by the academic labs partner network around the country, primarily in small settlements using non-studio quality microphones. I think this is a great accomplishment to have got countless people in small communities together to create a resource that will hopefully help democratize speech technology in their low-resource languages. I’m intriguied to see what people make with it!
The Workshop on Machine Learning and Compression
There were some great ideas presented at the compression workshop, but the ones I enjoyed the most were centered on the relationship between compression and human perception. I most enjoyed the talk on overfitted image compression and similar work where models are optimised with perceptual quality in mind rather than just similarity to the data distribution. Definitely take a look at this paper and other works on Wasserstein Distortion (maybe this) if you’re interested.