New ask Hacker News story: Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?

4 by yansoki | 6 comments on Hacker News.
I'm a solo developer working on a new tool, and I need a reality check from the ops and infrastructure experts here. My background is in software development, not SRE. From my perspective, the monitoring alerts that bubble up from our infrastructure have always felt like a massive distraction. I'll get a page for "High CPU" on a service, spend an hour digging through logs and dashboards, only to find out it was just a temporary traffic spike and not a real issue. It feels like a huge waste of developer time.

My hypothesis is that the tools we use are too focused on static thresholds (e.g., "CPU > 80%") and lack the context to tell us what's actually an anomaly. I've been exploring a different approach based on peer-group comparisons (e.g., is api-server-5 behaving differently from its peers api-server-1 through 4?). But I'm coming at this from a dev perspective, and I'm very aware that I might be missing the bigger picture.

I'd love to learn from the people who live and breathe this stuff. How much developer time is lost at your company to investigating "false positive" infrastructure alerts? Do you think the current tools (Datadog, Prometheus, etc.) create a significant burden for dev teams? Is the idea of "peer-group context" a sensible direction, or are there better ways to solve this that I'm not seeing?

I haven't built much yet because I want to make sure I'm solving a real problem first. Any brutal feedback or insights would be incredibly valuable.
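For concreteness, here is a minimal sketch of what the "peer-group context" idea could look like; it is not something the author describes building. It flags a host whose metric diverges from the median of its peer group, using a robust z-score (median absolute deviation) instead of a fixed "CPU > 80%" cutoff. The function name, the sample values, and the 3.5 cutoff are illustrative assumptions.

```python
# Illustrative sketch of peer-group anomaly detection (assumptions, not the author's tool):
# flag a host when its metric deviates strongly from the median of its peer group,
# using a robust z-score based on the median absolute deviation (MAD).
from statistics import median

def peer_group_anomalies(samples: dict[str, float], cutoff: float = 3.5) -> dict[str, float]:
    """Return hosts whose robust z-score against the peer group exceeds `cutoff`.

    samples: e.g. {"api-server-1": 42.0, ..., "api-server-5": 97.3} (CPU %).
    The 3.5 cutoff is the commonly used modified-z-score threshold; tune as needed.
    """
    values = list(samples.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Peers are (nearly) identical; nothing stands out from the group.
        return {}
    scores = {host: 0.6745 * (v - med) / mad for host, v in samples.items()}
    return {host: s for host, s in scores.items() if abs(s) > cutoff}

# Example: only api-server-5 is flagged, because it diverges from its peers.
# A traffic spike that raised CPU on all five hosts equally would not trigger this
# check, whereas a static "CPU > 80%" rule would page on every host.
print(peer_group_anomalies({
    "api-server-1": 42.0,
    "api-server-2": 39.5,
    "api-server-3": 44.1,
    "api-server-4": 40.8,
    "api-server-5": 97.3,
}))
```

The median/MAD pair is used here (rather than mean/standard deviation) so that a single outlier host does not drag the baseline toward itself; whether that is the right statistic for real fleets is exactly the kind of question the post is asking about.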
