
What?

The only way to find many issues is by looking at logs of unexpected (and especially unhandled) exceptions.

If you have so many of those that they're impossible to dig through, FIX YOUR CODE.



You can keep the logs on the local box and take a peek at them once in a while if you want to clean up the code (which you should), but there is no reason to spend a ton of money and time aggregating the logs.


> You can keep the logs on the local box and take a peek at them once in a while...

That doesn't work very well if you're using stuff like AWS Lambda functions, short-lived VMs, Docker containers that come and go regularly, etc.

Plus, off-server logs are invaluable if the server is compromised.


> That doesn't work very well if you're using stuff like AWS Lambda functions, short-lived VMs, Docker containers that come and go regularly, etc.

Actually, that's the use case where it works best. If you have constantly changing infrastructure, how useful are those logs anyway? Unless it's something that is happening across the fleet, in which case it should get picked up by your application metrics. Also, this is the perfect case for stream processing of the logs, where you look for things in real time and then throw away the actual logs (no need to keep them around after you've processed them).
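A minimal sketch of that idea in Python: scan log lines as they stream past, keep only counters, and let the raw lines go. The pattern names and regexes here are hypothetical examples; tune them for your own stack.

```python
import re
from collections import Counter

# Hypothetical patterns worth alerting on in real time.
PATTERNS = {
    "unhandled_exception": re.compile(r"Traceback|Unhandled exception"),
    "http_5xx": re.compile(r'" 5\d\d '),
}

def process_stream(lines):
    """Scan log lines as they stream past; keep only counters, drop the lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
        # the raw line goes out of scope here -- nothing is retained
    return counts

sample = [
    '10.0.0.1 - - "GET /checkout HTTP/1.1" 500 212',
    '10.0.0.2 - - "GET / HTTP/1.1" 200 911',
    'Traceback (most recent call last):',
]
print(process_stream(sample))  # counts survive; the lines themselves are gone
```

In a real deployment you would feed this from a message bus or a pipe rather than a list, and ship the counters to your metrics system.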

> Plus, off-server logs are invaluable if the server is compromised.

Only if your systems are mutable. If they are immutable then you're much better off looking at the incoming data through an application firewall or looking for unexpected data changes in the data store. The application server should just be a conduit for the user to interact with the data store.


Proposing the ideal world as a solution to the problems of the real world is an engineering antipattern with a long, storied and unhappy history.


Centralized logging is critical to managing security. No real-world production systems should be logging locally.


The longer version of my rant stipulates that you have a good application firewall that is looking at the incoming traffic and you are using immutable infrastructure for the frontend, so an attacker can't do much damage if they do get in.

Keep in mind I consider the existence of exceptions to be an application metric that should be logged, so if there is a security issue causing exceptions that should show up in the monitoring, and then you can look at the exceptions that happen going forward.
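Treating "an exception happened" as a metric rather than a log line can be sketched with a custom logging handler. This is an illustrative stand-in, not any particular monitoring product; in practice you would emit the counter to statsd, CloudWatch, or whatever metrics pipeline you already run.

```python
import logging

class ExceptionCounterHandler(logging.Handler):
    """Counts log records that carry exception info -- a cheap app metric."""

    def __init__(self):
        super().__init__()
        self.exception_count = 0

    def emit(self, record):
        # Count anything logged with a traceback or at ERROR and above.
        if record.exc_info is not None or record.levelno >= logging.ERROR:
            self.exception_count += 1

logger = logging.getLogger("app")
logger.propagate = False
handler = ExceptionCounterHandler()
logger.addHandler(handler)

try:
    1 / 0
except ZeroDivisionError:
    logger.exception("request failed")  # increments the counter

print(handler.exception_count)  # -> 1
```

The point is that the monitoring system sees a number going up, which triggers alerting; the full stack traces only need to be inspected once the metric says something is wrong.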

If the box is compromised, your local IDS should catch that (assuming you are allowing writes on the box at all).


> you have a good application firewall that is looking at the incoming traffic

I will concede if you have a perfectly configured all-knowing security oracle in front of your application then you don't need proper logging. :)

> immutable infrastructure for the frontend so an attacker can't do much damage if they do get in.

I've seen attackers get access then just dump the site databases or copy all the site's code sitting on the compromised servers. Immutable infrastructure doesn't protect against reading things over the network.

The class of security problems is larger and less clearly defined than what can be pre-programmed into static monitoring or analysis tools. It's always good to store logs for after-the-fact forensics if something does go wrong. How often do we see attacks, but then the attacked company doesn't even know what data was compromised or exfiltrated due to lack of logging? It happens daily.

If you do happen to have a magic all-knowing security oracle, you should probably productize it and make a trillion dollars.


> I will concede if you have a perfectly configured all-knowing security oracle in front of your application then you don't need proper logging.

Heh, that's not quite what I was saying. :) I was just saying that any analysis you'll do on application logs you can also do with an application firewall.

> How often do we see attacks, but then the attacked company doesn't even know what data was compromised or exfiltrated due to lack of logging? It happens daily.

Again, I would contend that you won't find this information in application logs anyway. You'd find it in your IDS logs that are monitoring outbound network traffic.


> you won't find this information in application logs anyway.

Anecdote: I've previously caught an in-progress exploit by seeing MySQL errors logged from the application, because the exploits were doing dumb things like SELECT * against a table with tens or hundreds of millions of rows. Sometimes the little things let you know.

IDS is nice in theory, but it's not really de rigueur for non-specialized platforms these days. It's about as practical as requesting companies keep complete netflow logs: great in theory, but almost nobody does it.

If these security approaches were deployable in a complete drop-in fashion, we could push for industry wide adoption, but right now everything is custom tailored to individual architectures and environments. ain't nobody got time for that when there's hustlin' to be done and worse is better and perfect is the enemy of good and fail fast and be lean and flaunt your tail feathers towards all the VCs.


> Anecdote: I've previously caught an in-progress exploit by seeing mysql errors logged from the application because the exploits were doing dumb things like SELECT * against a table with tens or hundreds of millions of rows. Sometimes the little things let you know.

Wouldn't that require you to actively watch the logs going by? In a sufficiently large system, you can't really watch the logs scroll by and gain anything from it. In which case you would need some real time filtering, which basically means hand coding an IDS. :)


> Wouldn't that require you to actively watch the logs going by?

lol, in that case, yes. we had 500 servers reporting centralized syslog and many people would just tail -f the central log to make sure not too much crazy shit was happening.

(in a perfect software system (spherical cow) obviously all errors would be tagged and categorized with proper monitoring and alerting thresholds. but, in the real world you have 800,000 lines of PHP spread across 1,000 files written by 300 mostly junior people (who only stay for 6 to 14 months at a time) over the past 10 years. you make the best of what you've got.)


What does immutable infrastructure have to do with logging?

Here's an example of what you typically want to do: "Give me a list of all customers whose contact information was viewed by customer representative X and who's had a PayPal withdrawal made since that event".

How are you going to accomplish that without logs in sufficient detail?

You might call your audit trail something else if you wish, but you are required to keep one (immutable logging, inaccessible to the applications) if you have a nontrivial application with any sort of compliance requirements.


Those are application data logs, which if you have a compliance requirement will be stored directly in a database as the actions happen, by the application itself.

That's not the kind of data you would get with stuff being spit out to syslog.
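A minimal sketch of that pattern, with a hypothetical schema: the application writes audit rows directly to a database as the actions happen, and the compliance question above becomes an ordinary query rather than a log grep.

```python
import sqlite3

# Hypothetical audit schema -- the application itself writes these rows
# as actions happen, rather than spitting lines out to syslog.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE audit_log (
    ts TEXT, actor TEXT, action TEXT, customer_id INTEGER)""")

def audit(actor, action, customer_id, ts):
    conn.execute("INSERT INTO audit_log VALUES (?, ?, ?, ?)",
                 (ts, actor, action, customer_id))

# A rep views contact info; later that customer has a PayPal withdrawal.
audit("rep_x", "viewed_contact_info", 42, "2015-01-01T10:00")
audit("system", "paypal_withdrawal", 42, "2015-01-02T09:00")

# "Customers whose contact info was viewed by rep X and who had a
# withdrawal since that event" becomes a self-join on the audit table:
rows = conn.execute("""
    SELECT DISTINCT v.customer_id
    FROM audit_log v JOIN audit_log w ON v.customer_id = w.customer_id
    WHERE v.actor = 'rep_x' AND v.action = 'viewed_contact_info'
      AND w.action = 'paypal_withdrawal' AND w.ts > v.ts
""").fetchall()
print(rows)  # -> [(42,)]
```

In a real system the audit store would of course be append-only and write-restricted so the application can insert but never modify history.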


AFAIK there's nothing in SOC 2 or other regulatory regimes that requires you to use a particular storage or retrieval technology for your application or audit logs.

You can totally use syslog as a transport protocol for these if you like. You just want to ensure that you're using the reliable form of the protocol (i.e. TCP, preferably with sender-side disk buffering to guard against unavailability of the collector).

When set up correctly, the practical differences between using, say, syslog-ng PE and some other message bus (e.g. Kafka) for recording events become relatively small.

We need to take care not to conflate metrics, audit logs, transport mechanisms, encodings, indices, and storage formats. They're all very different pieces of the complete picture and deserve separate scrutiny.


Sorry what I meant was the kind of data you'd be collecting isn't normally done through syslog. Sure, you can use syslog, but it's not usually the default transport for that kind of data.


You have no idea what you're talking about. It is clear you have not had to perform any incident response or forensics.

Accept the fact that remote logging is necessary (and cheap) for both security and stability reasons.


> You have no idea what you're talking about. It is clear you have not had to perform any incident response or forensics.

This was a nice ad hominem attack but I'll respond anyway. I actually have multiple certifications in computer forensics, and have done forensics and incident response for eBay, PayPal, reddit and Netflix.

> Accept the fact that remote logging is necessary (and cheap) for both security and stability reasons.

I'm an open minded person and I'm willing to change my opinion in the face of new facts, but you haven't actually presented any new facts. Do you have any use cases that support your statement?

I have a few facts that counter them. Central logging is definitely not cheap. It costs a lot of money to store those logs at rest, and more money to store them in a way that is searchable, as those data structures expand pretty quickly. It also isn't necessary for stability, given that we made stability go up after we ditched central logging at Netflix (I will be the first to admit this is correlative and not causative, but still, it isn't necessary for stability).


When security at Netflix needs to investigate incidents, or to analyze data for anomalies, how do they go about doing it? If I recall correctly, Netflix is an Elasticsearch / Kibana shop, right? Are there multiple clusters that they gather info from? How is visibility done for the overall org?

I'm genuinely curious how the security team goes about its analysis procedures there.


I'm not sure how much detail I can get into, but yes, there is a large Elasticsearch cluster with a lot of application data as well as web application firewalls and IDS data.


In all my years of lurking on HN, telling jedberg that he has no idea about large scale networking and security procedures is probably the funniest thing I've ever read...

Hint: google his username.


I personally don't understand this. Say your team owns 2 or 3 interconnected services. A customer reports an issue. How do you go about tracing the root cause? You don't know which server serviced the request, so now you need to fetch logs from all your servers. So you go and retrieve local logs from all of your servers to your local dev box (which could be of considerable size, given the size of your infrastructure and request volume) and then try to grep them? What if you need low-level wire and request logs to debug the issue? After doing this enough times you are probably going to wish you had centralized and searchable log infrastructure.



