The good tools are expensive. Both Sumologic and Wavefront are amazing, but past a certain scale it's hard to justify the cost.
On the open source side Grafana and its stack (tempo, loki and prometheus) all just suck, they kinda scale and check all the boxes, but in a really stupid way (data nodes should probably be query nodes, rather than moving everything from storage and caching, everything is supposed to be a dashboard, don't get me started on the query languages).
The Grafana UI was built by the enemy and explorations are incredibly annoying.
I'm more of an exploratory user, not a dashboard one (think level 3 on-call).
We just moved to Grafana from New Relic and it makes me angry just to try and look at monitoring; we have lost so much utility. And the company spent a year and a shit ton of work on the transition, so it’s not (just) that we had good workflows set up that haven’t come across. (I am also a heavy exploratory user.)
If you were heavy users of APM, then that's going to be a step backwards, I'm sure - but Grafana's just a visualisation tool, so I'm not sure what actual stack you're comparing there.
I find New Relic's UI to be more frustrating, but I've spent a lot more time with Grafana's (while very much not a fan of their relentless tinkering with UI layout).
Would love to learn what kind of exploration workflows you've missed in Grafana?
Context: Founder building an affordable alternative in the SaaS observability market (hyperdx.io) - I assume your company might have moved when NR introduced new seat-based pricing?
> On the open source side Grafana and its stack (tempo, loki and prometheus) all just suck, they kinda scale and check all the boxes, but in a really stupid way (data nodes should probably be query nodes, rather than moving everything from storage and caching, everything is supposed to be a dashboard, don't get me started on the query languages).
Hmm, you generally move Prometheus data into a higher-level engine like Thanos. It does split the layers up pretty well into separate components.
Honestly, it does have a learning curve, but it's generally free, and compared to New Relic or something like Datadog it isn't all that different learning-curve-wise.
Either way it beats the days of ganglia or nagios static screenshots of some half baked python metric. Or even kibana where you'd have to run a gigantic ELK stack to get any data out of it.
> Either way it beats the days of ganglia or nagios static screenshots of some half baked python metric. Or even kibana where you'd have to run a gigantic ELK stack to get any data out of it.
Holy hell, if you'd mentioned Munin also, I'd swear we worked for the same company. Spent ages making Munin play nice with Nagios.
Ganglia was okay for monitoring Spark jobs running in a YARN cluster, I guess. Back in the early days the GangliaSink shipped by default, and it was far better than trying to scrape every worker instance via JMX...
And then yeah, we moved to ELK for metrics. Spent more time fixing up indexing and ingestion issues.
Then I introduced Prometheus + Grafana, it was far simpler to run (at least until we got big enough to need Thanos...) and far simpler to create dashboards and write queries in.
We are currently running a Databricks cluster, skipping the static-ish Ganglia screenshots and injecting Prometheus exporters to get data into Grafana instead. Honestly wish the entire data space was more open in that regard.
Really glad things are evolving, it's not perfect but it's sure as hell better than it was!
Prometheus isn't Grafana's - so you can't hold Grafana responsible for PromQL either - but for scaling you want their Mimir product anyway.
What scaling issues are you suffering - limits, performance, costs?
Exploration is pretty good in Grafana - sounds like you're using dashboards to do discovery / exploration rather than their explore tool, which got substantially better in recent versions with their metrics encyclopaedia feature.
What do you use for exploring? KST-plot has been my go-to for real-time monitoring and after-the-fact analysis of time series data. It is super fast, I can pan and zoom with keyboard shortcuts, and I can perform functions on the data to create new time series. Airbus is/was a big user of KST for their air frame test data. I haven’t found anything that can replace it.
Sumologic is one of the worst tools I’ve ever had to use for monitoring. It's “ok” for looking at a pile of logs, not great but “ok”, but when companies try to use it for monitoring - gosh, it’s like going back to the dark ages.