Nothing in the article is wrong, per se, but it all seems awfully disconnected from the realities I see in monitoring and alerting?
The end advice is right: you want to build the smoke detector, not the car alarm. But … getting that done, now that's the trick. If the org has car alarms, that's the same org whose PMs will not see the "impact" of the ticket to get the monitoring made shipshape, and that ticket will be backlog-icebox-graveyard'ed.
I've had to get a number of "technical" non-technical roles to try to see that, no, the monitoring software cannot automatically generate¹ metrics around your application. Yes, you have to actually instrument the code to add those!
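To make "instrument the code" concrete, here's a minimal sketch. The `Metrics` class is a hypothetical in-process stand-in for whatever client library your vendor ships, and `checkout` plus all the metric names are made up for illustration. The point is only that each application-level number exists because somebody wrote a line to record it.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Tiny in-process registry (hypothetical; stands in for a real metrics client)."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, value=1):
        # Add `value` to the named counter.
        self.counters[name] += value

    @contextmanager
    def timed(self, name):
        # Record the wall-clock duration of the wrapped block, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def checkout(cart):
    """Hypothetical application function, instrumented by hand."""
    with metrics.timed("checkout.duration_seconds"):
        if not cart:
            metrics.incr("checkout.rejected_empty_cart")
            return False
        metrics.incr("checkout.completed")
        metrics.incr("checkout.items_sold", len(cart))
        return True
```

No collector can infer `checkout.items_sold` from the outside; it only shows up because the code says so.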
Then that gets combined with systems that are just … not the best? at their job: Datadog has so many rough corners around statistics, where graphs will alias, graphs will change shape depending on zoom, units are a PITA or nowhere to be seen, etc. Sumo has a god-awful UI (literally tabs inside tabs, and I can't copy the URL?!) and barely understands structured logging. Splunk is marginally better. PagerDuty permits only the simplest handling of events. Don't limit me to a handful of tailored rules: rules are logic, logic is a function: give me WASM. And I want usable business-hours-only alerts².
Self-hosted systems are perpetually met with "that's not our core focus", but nobody ever seems to convert the cost of managed monitoring/alerting systems into "number of FTEs that could be hired to maintain a self-hosted system".
(Oddly, the example in the article is a car alarm. Load avg. is, IMO, a useless metric. Better to measure CPU consumption and IOPS consumption, separately, or probably better, more derivative stats around the things doing the IOPS/CPU.)
¹Yes, most systems come with some collectors to get system-level stuff like CPU usage, etc. I mean metrics specific to your application.
²PD claims to support them, but in practice, they don't work: alerts received off hours don't alert, true … but they never alert once business hours resume, either! If you're in an org trying to dig itself out of a mess, you need them to not die in the low-prio pile.
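The behavior I'm asking for in ² is roughly this: hold an off-hours alert and fire it at the next opening, instead of dropping it. A sketch under loud assumptions: business hours are Mon-Fri 09:00-17:00, timestamps are naive local datetimes, holidays are ignored, and `notify_at` is a name I made up. This is not how PD or anyone else implements it.

```python
from datetime import datetime, time, timedelta

# Assumed business hours: Mon-Fri, 09:00 (inclusive) to 17:00 (exclusive).
OPEN, CLOSE = time(9, 0), time(17, 0)


def is_business_time(ts):
    """True if `ts` falls on a weekday inside the assumed business hours."""
    return ts.weekday() < 5 and OPEN <= ts.time() < CLOSE


def notify_at(received):
    """Return when a low-priority alert received at `received` should fire."""
    if is_business_time(received):
        return received  # in hours: alert immediately
    day = received.date()
    # Past today's opening (after hours, or any time on a weekend day
    # from 09:00 on): the next candidate opening is tomorrow's.
    if received.time() >= OPEN:
        day += timedelta(days=1)
    # Skip Saturday and Sunday.
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, OPEN)
```

So an alert at Friday 18:00 comes back to life Monday at 09:00, rather than vanishing.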
(Ugh. Give me ACLs in these systems that don't suck: PD locks the routing rules behind like "Admin", and security doesn't want to grant the rank and file "Admin", and so 80% of my devs have no idea how the system works because they're not allowed to see how the system works! Give me the ability to do a WMA business-days-only line for diurnal patterns! The list just goes on and on…)
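For the business-days-only WMA, the computation itself is small. A sketch, assuming "WMA" means a linearly weighted moving average (newest point weighted heaviest), one sample per day, and weekends simply dropped before averaging; `business_day_wma` and the window size are my own choices, not any product's feature.

```python
from datetime import date


def business_day_wma(samples, window=5):
    """Linearly weighted moving average over weekday samples only.

    `samples` is a list of (date, value) pairs in chronological order.
    Returns (date, wma) pairs, one per weekday that has a full window behind it.
    """
    # Drop Saturdays and Sundays so weekend lulls don't drag the line down.
    weekdays = [(d, v) for d, v in samples if d.weekday() < 5]
    weights = range(1, window + 1)  # oldest sample gets 1, newest gets `window`
    total_weight = sum(weights)
    out = []
    for i in range(window - 1, len(weekdays)):
        vals = [v for _, v in weekdays[i - window + 1 : i + 1]]
        wma = sum(w * v for w, v in zip(weights, vals)) / total_weight
        out.append((weekdays[i][0], wma))
    return out
```

With a window of 5, each point is effectively "the last business week", so Monday gets compared against last week's weekdays instead of against a dead Sunday.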