
Learning how to do machine learning research

I started my first research role at the end of last summer. It’s been a whirlwind, and I think I’ve probably learned more in the past eight months than I have in the rest of my professional experience combined. Though it’s been fun, it has also served as yet another slightly brutal reminder that learning is rarely painless. I thought it might be nice to document some of the most important lessons I’ve learned (usually the hard way) thus far.

Don’t trust every(any)thing you see on arXiv

There’s a lot of hype in the ML space. But even if you manage to ignore LinkedInfluencer posts and Sam Altman’s snake oil, I’ve still been quite surprised to find how poor the level of rigour in a lot of academic research is, and how rife inflating numbers and cherry-picking results are. I knew academia wasn’t perfect, but the problem seems more pronounced in ML. I think this feels especially true in TTS, where authors can fairly easily massage a model into producing good samples, yet the models are usually quite underwhelming once you actually get to play around with them. It’s also usually very easy to find simple adversarial examples. Robustness is often ignored in favour of shiny results, and failure modes are rarely discussed in papers. I’ve found it’s essential to validate any interesting results yourself: your own evals should be your source of truth, and most papers serve more as inspiration than gospel.

Attention (to detail) is all you need

I don’t have any sort of formal academic research background. I’m completely self-taught as a programmer, and I’ve learned everything I know about ML either on the job or knee-deep in textbooks (thanks Kevin Murphy!). This means I’ve developed a somewhat scrappy mentality; I’m definitely good at learning new skills/information/frameworks by doing - failing first and failing fast. But, as I’ve always been more interested in implementing and getting my hands dirty, my instinct was to dig only as deep as necessary to accomplish the side-quest at hand. This is a great way to learn independently or to power through a side-project, but I found I needed to unlearn it quickly.

Pace and attention to detail are usually inversely proportional to one another (moderated by experience, which teaches you where that attention is best spent…), and I’ve painfully discovered that missing a minute detail can waste a week of effort in the wrong direction. In order to make good decisions about research direction, you need to know everything you can know, and know exactly what you don’t know too. Even though it might take more time, making no assumptions about your approach or the tools you’re using is essential; other people’s design decisions aren’t always obvious, and you shouldn’t assume someone has done what you think they would do.

To stop myself making any assumptions when I’m writing code, I’ve been forcing myself to write tedious and obvious assert statements, mainly in my training scripts, but also in anything else where correctness is paramount. Little things like “did I properly shuffle my dataset?” are easy to assume you got right, but say you make a mistake on any given line 1% of the time; if your training code is 600 lines long, then you might have 6 mistakes in there - and who knows how subtle they might be! Obvious mistakes are great - subtle mistakes? The ones that take weeks to discover and days to debug? Disastrous!
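
To make that concrete, here’s a rough sketch of the kind of paranoid checks I mean. The dataset keys and thresholds are made up for illustration; it assumes a PyTorch DataLoader whose batches are dicts with “audio” and “text” fields.

    import torch
    from torch.utils.data import DataLoader, RandomSampler

    def sanity_check_loader(loader: DataLoader) -> None:
        """Tedious, obvious checks that catch subtle mistakes early."""
        # Did I actually ask for shuffling? (easy to lose when copy-pasting configs)
        assert isinstance(loader.sampler, RandomSampler), "train loader is not shuffled"

        batch = next(iter(loader))
        audio, text = batch["audio"], batch["text"]

        # Shapes and dtypes are what the model expects
        assert audio.ndim == 2, f"expected (batch, samples), got {tuple(audio.shape)}"
        assert audio.dtype == torch.float32, f"unexpected dtype {audio.dtype}"

        # Audio is roughly normalised and isn't silence or clipped garbage
        assert audio.abs().max() <= 1.0 + 1e-4, "audio not normalised to [-1, 1]"
        assert audio.abs().mean() > 1e-4, "batch looks like silence"

        # Text and audio actually line up one-to-one
        assert len(text) == audio.shape[0], "text/audio batch size mismatch"

A check like this runs once at the start of training and costs a couple of minutes to write - cheap insurance against the quiet failures.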

Simple isn’t always simple at scale

Modern ML models are massive, and they’re trained on genuinely huge amounts of data. In my domain, TTS, most open-source models are now trained on at least ~50k hours (usually ~3 TB) of audio. Processing these datasets, even for simple operations like resampling, becomes incredibly time-consuming. Say you want to transcribe everything to make sure the text aligns with the audio, or apply other neural models like upsamplers? You probably need four days and a couple of GPUs. Planning ahead with data processing has become an essential part of my day job, and I’ve become pretty good at estimating how long these jobs will take. Experiments can’t run without their requisite datasets! Implementing gritty optimisations in your processing pipelines might take an extra hour up front, and this usually feels frustrating because the pipeline takes so long that you just want to get it running ASAP, but the speed gains you realise down the line can add up to dozens of hours of compute spared. It’s worth doing the optimisation up front!
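
The estimates themselves are nothing fancy, just back-of-the-envelope arithmetic along these lines (the realtime factor and GPU count below are placeholders you’d measure on a small shard first, not real benchmarks):

    def estimate_job_days(dataset_hours: float,
                          realtime_factor: float,
                          num_workers: int,
                          overhead: float = 1.2) -> float:
        """Rough wall-clock estimate for a dataset-wide processing job.

        realtime_factor: hours of audio processed per wall-clock hour on a
                         single worker/GPU (measure this on a small shard!).
        overhead:        fudge factor for I/O, failed shards and restarts.
        """
        wall_hours = dataset_hours / (realtime_factor * num_workers) * overhead
        return wall_hours / 24

    # e.g. transcribing 50k hours at ~300x realtime per GPU, on 2 GPUs:
    # estimate_job_days(50_000, 300, 2) -> ~4.2 days

Writing that number down before kicking the job off is what tells you whether a “quick” preprocessing pass is actually going to eat a week of GPU time.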