Logging practices I follow (16elt.com)
97 points by bubblehack3r on Jan 9, 2023 | hide | past | favorite | 79 comments


One thing that's an absolute must: Put an ISO-8601 timestamp at the very beginning of every line in your log. No apache format, no "sun" or "jan" or other words. ISO only. Seriously.

If the timestamp is in a weird format (or, god help you, multiple formats since some libraries log shit in a special way), it'll be just about impossible to tell when things actually happened instead of just when the logging server saw them. In a perfect world these would be milliseconds apart, but lots of bad stuff can happen.

Your log-grepping-guy will thank you.
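A minimal sketch of the ISO-only rule using Python's stdlib `logging` (the formatter class name is illustrative):

```python
import logging
from datetime import datetime, timezone

class ISO8601Formatter(logging.Formatter):
    """Puts a UTC ISO-8601 timestamp (with milliseconds) first on every line."""
    def formatTime(self, record, datefmt=None):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return ts.isoformat(timespec="milliseconds")

handler = logging.StreamHandler()
handler.setFormatter(ISO8601Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("request handled")  # e.g. 2023-01-09T12:34:56.789+00:00 INFO request handled
```

Normalizing to UTC also sidesteps the related problem of mixed time zones across hosts.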


I think all of this sounds fine in theory, but the reality is that most logged information will not be needed - ever. Exactly what information is needed when can be difficult to predict. So, if a developer feels something might be important, they should probably log it. Within reason, I think it is better to have it and not need it than to need it and not have it.

It seems the author is putting a heavy emphasis on trying to create readable logs. Finding the signal in the noise. I am biased, but I think this is a failure of the tools used to read the logs rather than the logs themselves. This is why I wrote LogViewPlus (https://www.logviewplus.com/).


Same applies to metrics: if it moves, log it, especially with semi-recent TSDBs allowing you to store a metric ton of metrics in very little space (well, aside from Grafana's Mimir, which managed to fail that lesson...)

> It seems the author is putting a heavy emphasis on trying to create readable logs. Finding the signal in the noise. I am biased, but I think this is a failure of the tools used to read the logs rather than the logs themselves. This is why I wrote LogViewPlus (https://www.logviewplus.com/).

Well, it's a failure on many levels. "Informational" logs, like your traditional access.log, are mostly used for metrics/analytics, but they also serve as context for any warning or error the app returns while processing a request. At the same time, you kind of want them encoded in something more structured than "a piece of formatted text" (say, a JSON line), while that same approach reduces the glanceability of logs to near zero.

On the other hand, having hundreds of lines of code just to decode logs into something searchable is also pretty bad and, most importantly, very fragile to code changes.

"Just do everything in machine format, then send it to a collector" like Jaeger (with the bonus of distributed tracing) is a solution, but a very heavyweight one, and it needs every app to support distributed tracing.


> Within reason, I think it is better to have it and not need it than to need it and not have it.

Log retention starts to mess with what's considered reasonable. For example, despite the fact that it'd be actually legitimately useful to store 180 days of pcaps, that's just cost prohibitive.


One additional thing I like in structured logs is having some form of context-level information included with your logs, so that you already know things like tenant id, user id, request id, and the basic parameters of the request without having to re-log all that every time you get an exception.
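One way to sketch that in Python is with `contextvars`, so the request context rides along implicitly (the field names here are illustrative):

```python
import json
import logging
from contextvars import ContextVar

# Ambient per-request context; set once at the edge of the request handler.
request_ctx: ContextVar[dict] = ContextVar("request_ctx", default={})

class ContextJsonFormatter(logging.Formatter):
    """Merges the ambient request context into every structured log line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            **request_ctx.get(),
        }
        return json.dumps(payload)

# At the start of request handling, set the context once...
request_ctx.set({"tenant_id": "t-42", "user_id": "u-7", "request_id": "r-123"})
# ...and every log record emitted afterwards carries it automatically.
```

Any exception logged deep in the call stack then arrives already tagged with tenant, user, and request ids.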

Unrelated: I live in the Pacific Northwest and I clicked on this expecting to find a list like "don't log old growth for timber, don't log the entire area". Fun how your brain can associate a word with one concept and ignore the more context-relevant meaning.


Is there a structured form of tracing? Because I feel like this contextual information should easily be part of a trace.


Logging is kind of a mess in general for contextualization. Most loggers that support it use key/value tuples appended to the log line itself. OpenTelemetry is probably the best hope for a world with contextual logs, metrics, and traces, which IMHO is a good thing. OTel logging does some opinionated things with message construction, though, so caveat emptor.


From my experience, traces are not needed by everyone. AFAIK the OpenTelemetry "standard" (SDK) changes very often, which is especially painful because of the constant breaking changes in the library, plus the large size of the compiled executable binary on the output...


Otel logs aim to record the execution context in the logs.

In languages where the context is implicitly passed (e.g. via thread-local storage / MDC in Java), OTel automatically injects the trace id and span id into the logs emitted by your regular logging library (e.g. Log4j). Then in your log backend you can make queries like "show me all log records of all services in my distributed system that were part of this particular user request".

Disclosure: I am an Otel contributor, working on logs (work-in-progress, not for production use yet).


opentelemetry + Jaeger all-in-one binary is probably easiest way to start experimenting. Results are pretty useful but it is more work than simple logging


re: unrelated - I too thought this was a forest-related post.


Should add that log messages should answer questions like:

- What happened?

- When did it happen?

- Where did it happen?

- Why did it happen?

- What's the next step?

If your log doesn't answer at least the first three questions, it's useless. If it doesn't answer "why", you should think harder about whether it is useful at all.

If I had a cent for every time I've seen "Something went wrong", optionally followed by a stack trace almost entirely in third-party code, with zero information to correlate it with anything, I would have retired to a homestead ages ago.
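A log line hitting those questions might look like the following (a sketch; the gateway, order id, and reason strings are made up):

```python
import logging

log = logging.getLogger("billing")

# One line answering what / where / why / next step; "when" comes from the
# timestamp, and the ids let you correlate it with other records.
log.error(
    "payment capture failed: gateway=%s order_id=%s reason=%s; retrying in 60s",
    "gw-eu-1", "ord-9137", "card_declined",
)
```

Compare that with a bare "Something went wrong": the first is actionable without reading any source code; the second is not.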


Also, who did it?


Something I haven't seen discussed very widely: it feels like there's not only a balance needed in determining what to log vs. what not to log, but also in logging in a way that isn't a detriment to the readability of the code overall.

Over time I've actually found myself logging less, just because having to sort of mentally elide logging lines adds to the cognitive overhead of reading and understanding code.


Depending on the language and framework you're using, there may be options for logging that are non-intrusive. For example, Spring Aspects and Python decorators. Clean logging is a very common problem that probably already has a lot of solutions if you go looking.
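The decorator approach can be sketched like this: the logging lives in one place and the function body stays clean.

```python
import functools
import logging

logger = logging.getLogger(__name__)

def logged(fn):
    """Log entry/exit without cluttering the function body itself."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logger.debug("-> %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        logger.debug("<- %s returned %r", fn.__name__, result)
        return result
    return wrapper

@logged
def add(a, b):
    return a + b
```

Reading `add` now costs nothing extra; the logging concern is factored out entirely.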


I prefer this post which is more detailed: https://talktotheduck.dev/logging-best-practices-mdc-ingesti...


One thing I'd add is the ability to tag certain data or certain loglines as containing personal information so that they can be scrubbed before transmitting or storing the logs. You don't want things like credit card numbers, government id numbers, home addresses, and so on sitting out there in your logs, available to any developer reading a bug report (or available to everyone, when your company has a data breach). You'd log these things during development, skipping the scrubbing step, while the prod logs get scrubbed.
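A scrubbing step of that sort can be sketched as a logging filter; the regex here only illustrates the card-number case and is not production-grade PII detection:

```python
import logging
import re

# Matches 13-16 digit runs with optional space/dash separators (card-like).
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

class ScrubFilter(logging.Filter):
    """Masks card-number-looking values before the record is ever emitted."""
    def filter(self, record):
        record.msg = CARD_RE.sub("[REDACTED]", str(record.msg))
        return True
```

Attach the filter to your handlers in production, and skip it in development, matching the dev/prod split described above.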


The key is that the person writing the log message is in no position to judge whether it's right.

The measure of logs is whether you can put them in front of a smart but unfamiliar person and have them figure out what's happening. At a minimum, they should understand generally what's happening and specifically what each message is saying (though perhaps not its significance).

(i.e., same as when writing code)


Question...why no mention of "change" logs? I'm curious as to why I don't see change logs mentioned often as an important overlay to general system logging.

I liken errors and debug logs to heart rate and breathing rate: without information like "climbing stairs" or "changed medication", it can be hard to understand context or why new errors are being seen. The first question I would expect to ask when seeing logged issues is what has changed recently that could be related to the new errors. Curious to hear thoughts on this.

I actually built software/a startup around logging changes (architecture changes, software changes, server restarts) but just didn't get traction, and I'm curious why it's not more interesting to people.


Because these are probably better expressed as metrics. A good observability platform will try to weave all of these concepts together to form the picture you're describing. Some observability providers call these events or annotations.

There's a real cognitive difficulty in looking at a logged event and knowing whether it's unusual. Some signals are in fact positively correlated with a failure, and sometimes it's just noise. Good tooling hopefully makes distinguishing those two cases as easy as possible.


At $FORMER_EMPLOYER, we had a company-wide service that tracked changes of all kinds: source code changes, deployment changes, config changes, etc... It was useful because some changes are never reflected in YOUR logs, but they are reflected in someone else's logs. The systems that tried to do log change detection were all bad because the ML-driven clustering systems didn't produce interpretable information.


Personally, over the last few years I've felt more and more issues with logs:

- most devs have lost the concept of logging levels, considering it normal to spit out giant crappy backtraces and walls of meaningless text;

- most devs have lost the idea of "being quiet" or frugality, and have NEVER tried to read logs like an application user who doesn't have, nor wants to wade through, a gazillion lines of often crappily arranged sources.

In the classic *nix world, skimming logs for alerting patterns was easy; for modern crapplications it's a bit of a nightmare. Similarly, using logs for debugging or a mere health check is sometimes useless, since many messages should at most be debug-level, and others are meaningless; even looking at them alongside the sources does NOT clarify anything until you read much more.


I fight log infra all the time. I can't win the fight against structured logging anymore, so now I'm fighting type systems and allergies to global state to make log output available everywhere. If you're going to ram structured logging down devs' throats, the least you can do is make it easy to use. I don't want to have to pass a logging object everywhere. There are like two pieces of information you need in order to make a logger. Just write it to global state somewhere so I don't have to worry about it and can call it anywhere.

I absolutely loathe reinventing global state by passing "context" objects and the like everywhere. It's the dumbest thing in the world but no one ever questions it.


At least passing context objects everywhere is better than dynamic dependency injection. I'm in the "dump the logger in a global variable" boat too though.


I was somewhat in that boat too, until the first time I had to make several modules in the code log in a special way (one that required some custom code), determined at runtime.

Mostly I just wish more languages had Lisp-style dynamic binding / "special variables". Logging is one of the perfect use cases for dynamic scope - you'd have your normal logger object/configuration as the top-level value of a global, and then let-bind it whenever you need to alter its value for all code executed within that specific scope.

Alas, about the only widely-used form of dynamic binding today is environment variables.


I do this in Scala, via https://www.scala-lang.org/api/2.12.13/scala/util/DynamicVar...

It's not perfect, since it uses JVM's thread-local storage under the hood; this can break when e.g. evaluating Futures in a ThreadPool. For variables which are rarely-overridden, like loggers, I do so with a wrapper that also switches the ExecutionContext to a new ThreadPool (urgh, multithreading...)

PS: I do the same for env vars too ;) http://www.chriswarbo.net/blog/2021-04-08-env_vars.html


That's a neat post, thanks for sharing! I didn't realize Scala had dynamic variables.

As for the HN comment that prompted your blog article, I did a double-take reading it, because I could've sworn I wrote the exact same thing around the same time - turns out I did, though on a different thread :).


What is the reason structured logging is bad? I'm curious, as I felt like it made my life a lot easier.


Simply put, it makes logs unreadable for humans without tooling. Things primarily touched by humans should be as friendly as possible to them. Prior to the advent of structured logging, every log message had a unique visual 'fingerprint' that the eyes and brain could grok at a glance, and spot anomalies really quickly. With structured logs, everything looks the same and so you can't use the brain's inbuilt abilities to process them.

And if you did want tooling to help, the formats were typically regular, so you could use ordinary text-processing tools: sed, awk, grep. With structured logging putting everything into nested balanced expressions, you need parsing. Parsing works until you run across something the parser can't figure out. Say you have some Go code with a JSON logger object you're passing around. What if you want to log something, but your logging object isn't passed into that function? (You could pass it everywhere, but that increases the arity of every single method by one, and you're also reinventing global state, poorly.) You're SOL: now you're stuck with fmt.Println() and you just broke jq. No, jq does not handle this failure mode. No, Go does not let you just spit out arbitrary JSON. Thou must use the logging object.

The only thing it helps is ingestion into databases for heavy machine processing. Which is fine, but don't make it the only or even default way software tools spit out logs. In every other way, introducing parsing into your workflow just slows it all down. The only way I can see structured logging making anyone's job easier is if they never understood how it all worked before.
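The mixed-output failure mode described above (JSON lines interleaved with stray plain prints) is exactly what breaks a naive `jq` pipeline. A tolerant reader can be sketched in a few lines, at the cost of the extra parsing step the comment complains about:

```python
import json

def parse_line(line: str) -> dict:
    """JSON lines parse as-is; stray print-style output gets wrapped."""
    try:
        parsed = json.loads(line)
        return parsed if isinstance(parsed, dict) else {"msg": line.strip()}
    except json.JSONDecodeError:
        return {"msg": line.strip(), "unstructured": True}
```

The `unstructured` flag makes it easy to count (or alert on) how much of the stream is escaping the structured logger.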


I have the same question! I understand the parent's criticisms of context objects and logging boilerplate, but I'm not following the "fight against structured logging." What are the alternatives? No logging? Unstructured logging? Why would either of those be better than structured logs?


While I personally favor no logging, almost everyone who criticizes structured logging would prefer unstructured logging so they can make it someone else's job to restructure it (i.e. index and query it).


It's no work at all to go from regular language to context-free language, and a whole lot more work (parsing) to go the other way.


> Whatever service you are using for logging, it costs money, and a fast way to burn money is to log the entire json object that was relatively small on your dev env, but blew up on production.

You could also, you know, run your own infrastructure and log to your heart's content.


There are still going to be time and effort costs involved in scaling that infrastructure as your log volume increases


You have to output a lot of logs before you fill up even a single large consumer-grade hard drive, especially given logs are typically compressed when rotated.

It's usually only when you involve ELK or something like it that your logs start to get big. Which in turn is typically necessitated by over-complicated distributed software design.

If you're at the scale where this actually matters and you're serving millions of requests per second from a worldwide user base, then affording storage for the logs really shouldn't be a problem anymore (idk, with the possible exception of Twitter).
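The claim is easy to sanity-check with back-of-envelope arithmetic. The figures below are assumptions for illustration (a heavy 100 GB/day of raw text, the ~50x compression ratio mentioned elsewhere in this thread, one 16 TB drive):

```python
# Back-of-envelope retention on a single drive, with assumed figures.
raw_gb_per_day = 100        # raw plain-text log volume
compression_ratio = 50      # typical for rotated+gzipped log data
drive_gb = 16_000           # one 16 TB consumer drive

compressed_gb_per_day = raw_gb_per_day / compression_ratio  # 2 GB/day
days_of_retention = drive_gb / compressed_gb_per_day        # 8000 days, ~22 years
```

Even an order-of-magnitude error in the assumptions still leaves years of retention on commodity hardware.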


> You have to output a lot of logs before you fill up even a single large consumer-grade hard drive, especially given logs are typically compressed when rotated.

This is a good point - a RAID array of a few HDDs/SSDs scales surprisingly far and is cheaper than many of the cloud services out there, though whether you can or can't use either approach probably relies on compliance requirements and such.

I will definitely add that logs can compress really well - to the point where it's been close to a year since I added Logrotate to a project that didn't have it before, for a pretty basic setup, and I haven't had the need to even look at how many archives are currently retained, given that the disk usage has changed very slightly. And that's for multiple systems that filled up the available storage in months previously.

Of course, my personal gripe is that most of the logging solutions out there are rather complex - something like Graylog feels like one of the simpler self-hostable options while still being fully featured, but in my experience anything that runs ES is really resource hungry. Sometimes it feels like MariaDB/PostgreSQL would be good enough for most of the simpler low logging volume setups out there - if you don't want to manage logs as files, want to ship them somewhere, but don't want the receiving system to be too complex either.


Except, you know, when you actually want to do something valuable with all those logs. You _should_ be creating logs (signals) to be valuable in some way (diagnostics, alerting, canaries, statistics, etc.). If you're just dumping logs into opaque blobs that are never looked at, then sure, write them to blob storage to your heart's content, and have fun hunting and pecking for the reasons your users are already screaming at you. That strategy is fine, but the limitations are clear. It's reactive.


Depends entirely on what and why you are logging.

Is it audit logs for security or due to some regulatory requirement? Then huge blobs are fine. Desirable, even.

Transaction logs for machine-loading so you're able to replay an application's state at any given moment in time? Yeah probably gonna end up with huge blobs again.


And what are you going to do when you need a human to read sixteen trillion bytes of compressed logs streamed off a single SATA disk?

Once you face the fact that the performance of a single SATA disk means you can't search the logs quickly, and that nobody can possibly read that much log data, so nobody will use it, you start to see it as a hoarding disorder, not a useful tool.


It's not unheard of to need to retain years or even decades worth of logs due to regulatory compliance. Nobody is reading them; they just need to exist. In that scenario you'll probably keep the current year or so fresh on a mechanical drive and past years on tape.


You know that costs money, right?


Not as much as you'd think, and critically, the cost is largely disconnected from how you use the infra.


If your infra is not on-prem, yes, it will cost you more money as you generate more and bigger logs.


You actually have to log a damn lot to fill up even a single 16 TB drive with gzip-compressed logs, which typically get something like 50x compression for log data.

On top of that, mechanical hard drives are pretty cheap these days. Like it's a dozen dollars per terabyte, if not less.

I don't know: either you're producing absurd amounts of logs, on the order of a hundred gigabytes a day of plain text, at which point sure, you could probably log a bit less; or you're operating at a scale with many millions of users, where you should have the income to afford it.

... well, either that, or you're being fleeced.
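The compression claim is easy to demonstrate on synthetic, highly repetitive access-log-like lines. Real ratios vary a lot with content, but log data is typically very redundant:

```python
import gzip

# 100k near-identical log lines, differing only in timestamp and id.
lines = "".join(
    f"2023-01-09T12:{i // 60 % 60:02d}:{i % 60:02d}Z INFO GET /api/items id={i:08d} status=200\n"
    for i in range(100_000)
)
raw = lines.encode()
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)  # well into double digits for data like this
```

Less repetitive payloads (unique request bodies, stack traces) compress worse, but structured, line-oriented logs sit near the best case for gzip.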


> You actually have to log a damn lot to fill up even a single 16 TB drive with gzip-compressed logs, which typically get something like 50x compression for log data.

Now count the cost of a queryable data source: running a database of some sort (probably Elasticsearch, for logs) 24/7, at speeds fast enough to be ops-useful.

Metrics are significantly cheaper though, at least if you use a dedicated TSDB with a good storage engine like VictoriaMetrics or InfluxDB.


VictoriaMetrics author here. I'm working on VictoriaLogs right now, i.e. a logging system built on top of VictoriaMetrics' architecture ideas. Preliminary results are promising:

- It will need much lower amounts of disk space, disk IO, CPU and RAM compared to ElasticSearch during data ingestion.

- It will provide fast log querying and tailing via an easy-to-use query language (LogsQL), with the ability to calculate advanced stats over the selected logs.

- It will accept data in ElasticSearch format, so existing Filebeat and Logstash setups can be switched from ElasticSearch to VictoriaLogs in a few seconds.


Any thoughts about OpenTelemetry? That covers logs, tracing, and metrics.


> You actually have to log a damn lot to fill up even a single 16 TB drive with gzip-compressed logs, which typically get something like 50x compression for log data.

If you cannot search the logs quickly, at least the "hot" (i.e. most recent) ones, they don't make much sense. Well, they still make sense, but for other reasons, and you lose many interesting features of logs. At $DAYJOB we *surely* need to trim and shave a lot of the logs apps are sending to the centralized ELK - which is one of the points of TFA - but we cannot just gzip the text files and be done; we need to be able to search for patterns and data in the logs to understand what the app is doing in certain cases (besides having metrics).

P.S. We also store them as gzipped files in an S3 bucket using warm/cold tiers, and it is certainly cheaper than using even magnetic disks.


When people complain about the cost of excessive logging, they are almost certainly not thinking in terms of how much a drive costs.

Services like CloudWatch are an excellent way to burn through money, though it's usually the time series storage and ingestion costs that balloon out of control.


Well, also the kind of people who worry about this are not thinking in terms of "a terabyte", like GP. It's always easy to give advice when your experience has been at a toy level.


That's unnecessarily dismissive. Handling many (or dozens, hundreds) TB worth of logs is anything but "toy level", that's more than the vast majority of businesses will generate in a decade, maybe even their lifetime.


And marginalia_nu, the GP I was referring to, was unnecessarily strident, concluding that others must be naïve or incompetent if they had to handle logs with "ELK or something like that" and that therefore one must have an "over-complicated distributed software design."

Don't move the goalposts to hundreds of TB--this user is giving advice to everyone based on a perspective that you're doing something wrong if all of your logs don't fit on a single hard drive; that you should "log less" if you have the "absurd" quantity of "hundreds of gigabytes" a day of logs, and who seems to think individual hard drive costs are an important driver of the cost of managing logs. Their words, not mine.

There's nothing interesting to be gained from hot takes based on naïve conceptions and lack of experience. Pointing out that giving overly-general advice based on your inexperienced best guesses and the NewEgg price list is not very useful is not "unnecessarily dismissive."


And if you're in a position where you have to manage petabytes+ of logs using on-prem hardware, SSDs are probably a small part of your overall budget!


Also true, though it's not zero. The different cost drivers/cost model between using SaaS and on-prem infrastructure for logs are interesting and drive different decisions. I have done both in large and small environments and I kind of like the SaaS model because it is easier to put cost incentives on product owners and development teams, which is who should own the P&L. In other words, if you pay $1/GB or whatever, you can get back money by logging fewer GBs. It naturally discourages "log whatever into a giant undifferentiated bucket 'in case you need it'".

You can pay less for equivalent on-prem infrastructure but it drives costs quite differently. For example, it tends to be hard to refresh that infrastructure because it doesn't make you money, so it gets worse over time. The unit cost of storage is very low, often because you make availability/durability tradeoffs that aren't even available to you from the SaaS provider or cloud service. But you will find that the Opex associated with it can be quite high, and this is hard to reflect in terms of investment by P&L owners.

You can do either approach well or poorly. The way SaaS sucks when doing it poorly is mostly that you are paying a huge amount of money. The way on-prem sucks when doing it poorly is much more complicated and is reflected by toil and tech debt across an organization (which is money but harder to tie to what would fix it), poor visibility, lack of insights, and possibly spending too much in Opex or licenses, depending on the technology. The cost of having a bunch of people do on-prem logging "right" is hard to justify, even for these "large organizations" where I guess, people think, money is free. And even if you've correctly identified and wish to fund the cost of delivering the infrastructure (which as you point out has hardware as only part of its cost), it's not like you can necessarily find the five quality engineers to run the thing. And if you could--do you really want these FTEs working on logging infrastructure or do you want them delivering revenue features?


Log frugality and log uniqueness are great concepts. +1 for mentioning both the financial and cognitive costs of excessive logging.


Just as I suspected: logging a hundred equals signs (or other symbol) for every log entry turns out to be a bad idea.


On the levelled logging point, I stopped using levels after switching from Java -> Go and haven't looked back: https://thomshutt.github.io/opinionated-logging-in-go.html


Good point! Log levels are pretty useless most of the time.

I would add that there can be value in having 2 log levels: verbose and non-verbose. It is helpful if you can selectively switch on verbose logging by user or by API endpoint.

In one application which I maintain, when verbose logging is switched on for a particular user, TCP/UDP socket objects are automatically wrapped and packet captures are logged, only for packets sent/received while servicing that particular user's requests. This has been a lifesaver when debugging things like weird, transient authentication problems stemming from upstream providers.


> It is helpful if you can selectively switch on verbose logging by user or by API endpoint.

We currently use two log levels:

- When 'debug = true', debug logs are printed immediately (like a DEBUG log level)

- When 'debug = false', debug logs go into a buffer: if the request-handler succeeds, its debug buffer gets discarded. If it catches an exception, the buffer gets printed.

This avoids the main problem of log levels, which is having to guess up-front which level we might want (and inevitably get it wrong, and have to try re-creating a problem with more verbose logging!)
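The success-discard / failure-flush scheme above can be sketched as a small toy class (with an injectable sink so it's testable; names are illustrative):

```python
class BufferedDebugLog:
    """Defers debug lines; they only surface if the request handler fails."""
    def __init__(self, sink, debug=False):
        self.sink = sink        # callable that receives emitted lines
        self.debug = debug      # debug=True emits immediately (DEBUG-level mode)
        self._buf = []

    def log(self, msg):
        if self.debug:
            self.sink(msg)
        else:
            self._buf.append(msg)

    def run(self, handler):
        try:
            result = handler()
            self._buf.clear()            # success: the debug trail is discarded
            return result
        except Exception:
            for line in self._buf:       # failure: print the full context
                self.sink(line)
            raise
```

The nice property is exactly what the comment describes: every failure arrives with its verbose trail already attached, with no need to reproduce the problem at a higher log level.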


Nice! It sounds like you would do well to capture the debug output if a request handler takes an unusually long time to return (not just if an uncaught exception occurs).


This completely overlooks many things, such as logs used as metrics, or even just using them to reason about the state of your application. What if your application isn't throwing errors but something is still broken, or you're shipping bad data?

IMO, the entire point of logs is to be able to ask questions of and reason about the current state of your application. If you're only logging errors that you can't recover from, you may as well just throw an exception and restart.


> If you're only logging errors that you can't recover from, you may as well just throw an exception and restart.

Unironically, though, this is a good decision if you design for it from the beginning, crash-only-software style. Most of what I log is INFO and WARN level, because unexpected combinations of business-level state are where the real subtle nasty bugs are. A nil reference or whatever can just crash, who cares.


It absolutely matters. Reading messages from a message queue? Have users submitting some subtly bad edge case you've never organically seen? A crash-restart isn't saving you, and worse, because you have a discipline of "who cares", you probably lack adequate code to diagnose the scenario that triggered your flaw. If your business is dead simple, whatever, but there's too much code in the world to be pumped into such a limited way of thinking.


One question I always have about logging: how do I log valid and expected but prohibited actions? That is, the system is behaving as designed but the user is seeing an error message because they're using the system wrong, and I want to know how often this is happening?


Sounds just like INFO to me. There's a difference between logging to users, logging to system administrators, and logging to developers.


There are dedicated tools such as Sentry for cases like this (well in general for error collection and management, but also cover this scenario). They capture all relevant environment, and can help you detect if users are "using it wrong" only on Safari or only since version x.y, indicating a problem elsewhere.


Info, because it is not actionable.


We use SQLite for logging all the things. This sidesteps entire rabbit colonies worth of issues - especially with regard to downstream parsing & reporting.

I have found the extra structure and familiar semantics make it a lot easier to talk about what we log, how we log it and why.
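A minimal sketch of the idea with Python's stdlib `sqlite3` (the schema is illustrative, not the commenter's actual one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real setup would use a log.db file
conn.execute(
    "CREATE TABLE log (ts TEXT DEFAULT (datetime('now')),"
    " level TEXT, source TEXT, message TEXT)"
)

def write_log(level, source, message):
    conn.execute(
        "INSERT INTO log (level, source, message) VALUES (?, ?, ?)",
        (level, source, message),
    )

write_log("ERROR", "payments", "capture failed")
write_log("INFO", "payments", "capture retried")

# Downstream "parsing" is now just SQL.
rows = conn.execute(
    "SELECT level, count(*) FROM log GROUP BY level ORDER BY level"
).fetchall()
```

The appeal is that the reporting side needs no custom parser at all: any SQL-aware tool can query, aggregate, and join the logs.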


Can you say more about this? I've never heard of anything like this and can't figure out if it's genius or silly. Things I'm curious about:

* Are you working on a SAAS product or embedded/IoT project or hobby project?

* How do you aggregate the SQLite logs together from disparate machines? Seems like you probably can't use fluentbit/filebeat/etc.

* Where do you query these logs?

* How do you structure these logs? (timestamp, machine, message) or something with more columns?

* Are you able to capture stacktraces?

I _love_ the idea of leveraging SQLite for this kind of scenario and possibly skipping a lot of messy plumbing or pricy vendors, but I'm uncertain how this works.


We ship a B2B product that spools to a log.db when running on client machines. We built in-house tooling tailored to obtaining and analyzing copies of these databases.

Stack traces, user actions, third-party logs, et al. are meticulously tracked in a schema we thought most appropriate for our business.


Lots of logs can be replaced by metrics. People go crazy with logs.


Very much (e.g. timing data, counts like success and error counts and operations should be metrics). If not metrics, then traces. If not metrics or traces, then business events (e.g. like alerts, audit records). Almost everything that people put in logs actually belongs somewhere else, in my opinion. Which is evidenced by so much of log processing being about turning logs back into whatever it was they were supposed to be in the first place (metrics, distributed traces and business events).


Excuse an old cave man, but in what way do traces and metrics replace logs?


"Operation xyz completed in 15.445 seconds." This can be expressed as a metric (or a trace): "operation_xyz_completed" is the metric and 15.445 seconds is the data point. The result is an easy chart graphing the average, p99, whatever of the operation, to gauge whether it's normal or exceptional. It's often dead simple to alert on metrics too, so it helps unlock alerting. Log alerts are valid but usually more limiting without a bunch of parsing, or else quite naive.
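The log-line-to-data-point shift can be sketched with a toy in-process recorder (a real system would use a metrics library and a TSDB; the class and method names here are made up):

```python
class MetricStore:
    """Toy in-process metric recorder: one series of data points per name."""
    def __init__(self):
        self.series = {}

    def observe(self, name, value):
        self.series.setdefault(name, []).append(value)

    def avg(self, name):
        pts = self.series[name]
        return sum(pts) / len(pts)

    def p99(self, name):
        pts = sorted(self.series[name])
        return pts[int(0.99 * (len(pts) - 1))]

m = MetricStore()
# Instead of logging "Operation xyz completed in 15.445 seconds":
m.observe("operation_xyz_seconds", 15.445)
```

Averages and percentiles then fall out of the data directly, with no log parsing in between.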


Well, you usually want a metric and a trace of it, at the very least when it fails.


Because, by and large, a bunch of technologies have very spotty support for metrics, and metrics almost always involve third-party systems. Logs are dumb simple. I'm not saying metrics aren't valuable (quite the opposite), but getting started with metrics usually involves some level of institutional investment.


I was expecting something about sustainable forestry


Unrelated to the content: I really like the phrasing of the title. Not "…you should follow", not the tired "best practices", simply "Things I do".


I agree

I also can’t stand: “You’re doing logging wrong” / “you’ve been doing logging wrong”

But then again as an industry we seem to like confident bullshitters



