Hacker News | secondcoming's comments

> But as of now there is no such problem on any kind of significant scale.

This is not the same as saying there's no problem.

Only a fraction of humans will ever compete in the Olympics. People train their whole lives for it. It's not about 'scale', it's about safety and fairness. It's not reasonable to expect them to 'shut up' about it.

I don't want to watch a man beat up a woman in a boxing ring.


You're most likely part of the 2bn who showed no interest, or only a passing interest, in the Olympics.

I sincerely doubt more than half the population of the entire planet showed more than a passing interest in them, and I'm still curious how it'd be possible to measure that.

> Continuously capturing low-overhead performance profiles in production

It surprises me that anything designed by the OTel community could ever meet 'low-overhead' expectations.


The reference implementation of the profiler [1] was originally built by the Optimyze team that Elastic then acquired (and donated to OTEL). That team is very good at what they do. For example, they invented the .eh_frame walking technique to get stack traces from binaries without frame pointers enabled.

Some of the OGs from that team later founded Zymtrace [2] and they're doing the same for profiling what happens inside GPUs now!

[1] https://github.com/open-telemetry/opentelemetry-ebpf-profile...

[2] https://zymtrace.com/article/zero-friction-gpu-profiler/


> For example, they invented the .eh_frame walking technique to get stack traces from binaries without frame pointers enabled.

This is not an accurate summary of what they developed.

Using .eh_frame to unwind stacks without frame pointers is not novel - it is exactly what it is for, and perf has had an implementation doing it since ~2010. The problem is that kernel support for this was repeatedly rejected, so the kernel samples kilobytes of stack and userspace then does the unwind.

What they developed is an implementation of unwinding from an eBPF program running in the kernel using data from eh_frame.


True, I should have been more specific about the context:

Their invention is about pushing down the .eh_frame walking to kernel space, so you don't need to ship large chunks of stack memory to userspace for post-processing. And eBPF code is the executor of that "pushed down" .eh_frame walking.

The GitHub page mentions a patent on this too: https://patents.google.com/patent/US11604718B1/en


I believe this is a case of convergent invention – the idea of pushing DWARF/.eh_frame unwinding into eBPF seems to have occurred to several people around the same time. For example, there's a working implementation discussed as early as March 2021: https://github.com/iovisor/bcc/issues/1234#issuecomment-7875...

OTel Profiling SIG maintainer here: I understand your concern, but we’ve tried our best to make things efficient across the protocol and all involved components.

Please let us know if you find any issues with what we are shipping right now.


Anything to actually add?

Do you feel better now?

If you enforce that the buffer size is a power of 2, you can just use a mask to do the

    if (next_head == buffer.size())
        next_head = 0;
part

If it's a power of two, you don't need the branch at all. Let the unsigned index wrap.

You ultimately need a mask to access the correct slot in the ring. But it's true that you can leave unmasked values in your reader/writer indices.

Interesting, I've never heard of anybody using this. Maybe a bit unreadable? But yeah, it should work :)


Nice one!


Indeed, that's true. That extra constraint enables further optimization.

It's mentioned in the post, but worth reiterating!


Nice!

Should be able to push it more if

* we limit data shared to an atomic-writable size and have a sentinel - less mucking around with cached indexes - just spinning on (buffer_[rpos_] != sentinel) (atomic style with proper semantics, etc..).

* buffer size is compile-time - then mod becomes compile-time (and if a power of 2 - just a bitmask) - and so we can use a 64-bit uint that just counts increments, not position. No branch to wrap the index to 0.

Also, I think there's a chunk of false sharing if the reader is 2 or 3 ahead of the writer - so performance will be best if reader and writer are a cache line apart - but will slow down if they are sharing the same cache line (and buffer_[12] and buffer_[13] very well may if the payload is small). Several solutions to this - the disruptor pattern, or use a cycle from group theory - i.e. buffer[_wpos%9] for example (9 needs to be computed based on cache line size and size of payload).

I've seen these pushed to about clockspeed/3 for uint64 payload writes on modern AMD chips on the same CCD.


This was, in fact, mentioned in the article.

UK, California and Brazil, no?

California's law requires that the OS ask the user for their age, and accept the response as-is without doing any verification.

Terry Gilliam's Brazil, California, and geographic Brazil, yes.

If I get a beer with no head I'm assuming the glass was dirty

My first thought is that this is a sign of burn-out.

Has Amazon's advertising TAM product been affected by AI?


Boost is stronger than ever.


Then why are you using rust for these tasks?


I'm not.

