ML Monitoring: Keep It Simple, Keep It Compounding
You don't need drift detection to start. One dashboard, one runbook, and a process that gets a little better every time.
There are a million and one ML observability tools. Everybody is going to try to get you to sign up for their SaaS platform, their SDK, their open source library, pitching Jensen-Shannon divergence and Earth Mover's distance on every one of your features, promising it will take you to ML observability nirvana.
I don’t think you need most of that to start. Keep it simple, keep it lightweight, keep it self-improving.
What works well is starting with the basics. Cover your most important metrics first: your business metrics and your service health. Is the thing running? Are you causing failures? That's where you start.
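To make that concrete, here's a minimal sketch of "is it running, are we causing failures" instrumentation using the prometheus_client library. The metric names, port, and the wrapped predict() function are placeholders, not anything from a real system:

```python
# Minimal service-health metrics: request count, error count, latency.
import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Predictions served")
ERRORS = Counter("prediction_errors_total", "Predictions that raised")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

def predict(features):
    return 0.5  # stand-in for the real model call

def monitored_predict(features):
    start = time.monotonic()
    try:
        result = predict(features)
        PREDICTIONS.inc()
        return result
    except Exception:
        ERRORS.inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for a scraper
    while True:
        monitored_predict({"x": 1.0})  # stand-in traffic
        time.sleep(1)
```

That's the whole starting point: one counter for volume, one for failures, one histogram for latency. Business metrics plug into the same pattern.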
Then make it simple: one dashboard, one runbook. One place to look, one place for knowledge. If everything lives in one place, it's easier to see what's happening and to debug when something goes wrong. If your institutional knowledge lives in one place, it's easy to ask: hey, has this happened before? If it has, what's the fix? If it's not there, it's time to add it.
One useful Datadog tip: you can consolidate views from multiple locations into a single dashboard. Datadog is not the simplest tool for building ML performance dashboards, but it does support iframe and embed widgets, so you can pull everything into one place where it's easy to find.
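Here's a hedged sketch of what that consolidation could look like against Datadog's v1 dashboards API. The embed URLs, dashboard title, and environment variable names are all placeholders, and the response shape is an assumption, so verify against the current Datadog API docs:

```python
# Build one dashboard that embeds external views via iframe widgets.
import os
import requests

DD_API = "https://api.datadoghq.com/api/v1/dashboard"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    "Content-Type": "application/json",
}

# Each iframe widget pulls one external view (model eval report,
# notebook export, another tool's chart) into the single dashboard.
embeds = [
    "https://internal.example.com/model-eval/latest",
    "https://internal.example.com/feature-stats/latest",
]

body = {
    "title": "ML service: one place to look",
    "layout_type": "ordered",
    "widgets": [
        {"definition": {"type": "iframe", "url": url}} for url in embeds
    ],
}

resp = requests.post(DD_API, headers=HEADERS, json=body)
resp.raise_for_status()
# The create response typically includes a path to the new dashboard
# (an assumption; check the API reference for the exact shape).
print(resp.json().get("url"))
```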
A concrete example. Instead of complex feature drift detection, we started with the basics: feature values and descriptive stats. At one point our top-line metrics shifted a little and scores started to look different. We went from that to our feature descriptive stats, dumped all of it into an AI assistant, and asked it to find anything out of the ordinary. It keyed in on a feature that had shifted recently. From there it traced back through the codebase, from code to data creation to ingestion to model, and quickly bisected its way to a potential bug. We flagged it to the team that owned that code, and it turned out to be real. The whole process took about fifteen minutes. If basic stats plus an AI assistant can debug something that complicated in fifteen minutes, the simple approach works quite well.
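A minimal sketch of that descriptive-stats pass, not our actual tooling: compare per-feature summary stats for a recent window against a baseline and dump anything that moved. The file paths and the crude shift metric are made up; the printed table is the kind of thing you'd paste into an AI assistant:

```python
# Compare recent feature stats against a baseline window with pandas.
import pandas as pd

baseline = pd.read_parquet("features_baseline.parquet")
recent = pd.read_parquet("features_last_day.parquet")

report = []
for col in baseline.select_dtypes("number").columns:
    b, r = baseline[col], recent[col]
    # Shift in means, scaled by baseline spread (crude but cheap).
    shift = abs(r.mean() - b.mean()) / (b.std() + 1e-9)
    report.append({
        "feature": col,
        "baseline_mean": b.mean(),
        "recent_mean": r.mean(),
        "baseline_null_rate": b.isna().mean(),
        "recent_null_rate": r.isna().mean(),
        "shift_in_sds": shift,
    })

stats = pd.DataFrame(report).sort_values("shift_in_sds", ascending=False)
print(stats.to_string(index=False))  # paste this into your AI of choice
```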
So the key things: make observability easy to reach and institutional knowledge easy to find, with a process that shares the ops burden and is lightweight enough to keep improving itself. An alert goes off, you mitigate. If the fix isn't in the runbook, add it. Log the incident and close any gaps. Each week, brief the team on what happened. Lightweight, self-improving.
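To show how little machinery that loop needs, here's one way it could look in code. The file names, formats, and fields are hypothetical; the point is that a runbook miss is itself the signal to improve:

```python
# A lightweight alert -> runbook -> incident-log loop.
import json
from datetime import datetime, timezone

RUNBOOK = "runbook.json"       # {"symptom": "fix"} pairs, hand-maintained
INCIDENTS = "incidents.jsonl"  # append-only incident log

def handle_alert(symptom: str) -> str:
    """Look up a firing alert in the runbook and log the incident."""
    with open(RUNBOOK) as f:
        runbook = json.load(f)
    fix = runbook.get(symptom)
    entry = {
        "at": datetime.now(timezone.utc).isoformat(),
        "symptom": symptom,
        "known": fix is not None,
        "fix": fix,
    }
    with open(INCIDENTS, "a") as f:
        f.write(json.dumps(entry) + "\n")
    # A miss is the signal to improve: mitigate by hand, then write the
    # fix into the runbook so the next on-call finds it in one place.
    return fix or "NOT IN RUNBOOK: mitigate, then add an entry"

def weekly_summary() -> str:
    """One-paragraph update for the team: volume plus runbook gaps."""
    with open(INCIDENTS) as f:
        entries = [json.loads(line) for line in f]
    gaps = sorted({e["symptom"] for e in entries if not e["known"]})
    return f"{len(entries)} alerts this period; runbook gaps: {gaps}"
```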