Hacker News | new | past | comments | ask | show | jobs | submit | jerrinot's comments

clock_gettime() goes through the vDSO shim, but whether it avoids a syscall depends on the clock ID and (in some cases) the clock source. For thread-specific CPU user time, the vDSO shim cannot resolve the request in user space and must transit into the kernel. In this specific case, there is absolutely a syscall.


Hi Jonas, thanks for the work on OpenJDK and the post! I swear I hadn't seen your blog :) I finished my draft around Christmas and it’s been in the queue since. Great minds think alike, I guess.

edit: I just read your blog in full and I have to say I like it more than mine. You put a lot more rigor into it. I’m just peeking into things.

edit2: I linked your article from my post.


Thanks for the kind words and the link :).


Courtesy of Brendan Gregg and his flamegraph.pl scripts: https://github.com/brendangregg/FlameGraph

Normally, I use the generator included in async-profiler. It produces interactive HTML. But for this post, I used Brendan’s tool specifically to have a single, interactive SVG.


Note that pprof produces much fancier interactive flame graphs. I'm not sure they're a single SVG though.

Also `samply` and the Firefox profiler are pretty fancy too.

There's really no reason to use the original flamegraph scripts.


That's a brilliant trick. The setup overhead and permission requirements for perf_event might be heavy for arbitrary threads, but for long-lived threads it looks pretty awesome! Thanks for sharing!


Yes you need some lazy setup in thread-local state to use this. And short-lived threads should be avoided anyway :)


I guess if you need the concurrency/throughput you should use a userspace green-thread implementation. I'm guessing most implementations of green threads multiplex onto long-running OS threads anyway.


In a system with green threads, you typically want the CPU time of the fiber or tasklet rather than the carrier thread. In that case, you have to ask the scheduler, not the kernel.


Exactly this.


Only for some clocks (CLOCK_MONOTONIC, etc) and some clock sources. For VIRT/SCHED, the vDSO shim still has to invoke the actual syscall. You can't avoid the kernel transition when you need per-thread accounting.


Oh, for some time after its introduction, CLOCK_MONOTONIC_RAW wasn't vDSO'd, and it took some syscall profiling ('huh, why do I see these as syscalls in perf record -e syscalls?' ...) to understand what was going on.


Thanks, I really should've looked deeper than that.


no problem at all, I was confused too when I saw the profile for the first time.


Author here. After my last post about kernel bugs, I spent some time looking at how the JVM reports its own thread activity. It turns out that "What is the CPU time of this thread?" is/was a much more expensive question than it should be.


I don't think it is possible to talk about fractions of nanoseconds without having an extremely good idea of the stability and accuracy of your clock. At best I think you could claim there is some kind of reduction but it is super hard to make such claims in the absolute without doing a massive amount of prep work to ensure that the measured times themselves are indeed accurate. You could be off by a large fraction and never know the difference. So unless there is a hidden atomic clock involved somewhere in these measurements I think they should be qualified somehow.


Stability and accuracy, when applied to clocks, are generally about dynamic range, i.e. how good is the scale with which you are measuring time. So if you're talking about nanoseconds across a long time period, seconds or longer, then yeah, you probably should care about your clock. But when you're measuring nanoseconds out of a millisecond or microsecond, it really doesn't matter that much and you're going to be OK with the average crystal oscillator in a PC. (and if you're measuring a 10% difference like in the article, you're going to be fine with a mechanical clock as your reference if you can do the operation a billion times in a row).


This setup is a user-space program on a machine that is not exclusively dedicated to the test, with all kinds of interrupts (and other tasks) firing left, right, and center through the software under test.


For something like this, you can just take several trials and look at the minimum observed time, which is when there will have been ~no interruptions.

https://github.com/facebook/folly/blob/main/folly/docs/Bench...


You don't actually know that for sure. You have only placed a new upper bound.


This seems like more of a philosophical argument than a practical one.


No, it is a very practical one and I'm actually surprised that you don't see it that way. Benchmarking is hard, and if you don't understand the basics then you can easily measure nonsense.


You raise a fair point about the percentiles. Those are reported as point estimates without confidence intervals, and the implied precision overstates what the system clock can deliver.

The mean does get proper statistical treatment (t-distribution confidence interval), but you're right that JMH doesn't compute confidence intervals for percentiles. Reporting p0.00 with three significant figures is ... optimistic.

That said, I think the core finding survives this critique. The improvement shows up consistently across ~11 million samples at every percentile from p0.50 through p0.999.


Yes, I would expect the 'order of magnitude' value to be relatively close but the absolute values to be very imprecise.


You can compute the confidence intervals all you want but if you can't be sure, in one or another way, that what you're observing (measuring) in your experiment is what you actually wanted to measure (signal), not even confidence interval would help you there to distinguish between the signal and noise.

That said, at your CPU base frequency, 80 ns is ~344 cycles and 70 ns is ~300 cycles, so ~40 cycles of difference. That's on the order of ~2 CPU pipeline flushes due to branch mispredictions. Another example is RDTSCP, which, at least on Intel CPUs, forces all prior instructions to retire before executing, and prevents speculative execution of the following instructions until its result is available. This can also impose a 10-30 cycle penalty. Both of these can interfere with measurements at this scale, so there is a possibility that you're measuring these effects instead of the optimization you thought you implemented.

I am not saying that this is the case, I am just saying it's possible. Since the test is simple enough I would eliminate other similar CPU level gotchas that can screw your hypothesis testing up. In more complex scenarios I would have to consider them as well.

The only reliable way I have found to be sure what is really happening is to read the codegen. And I do that _before_ each test run, or to be more precise after each recompile, because compilers do crazy transformations with our code, even when we just move a naive-looking function a few lines up or add some innocuous boolean flag. If I don't do that, I could end up measuring, observing, and finally concluding that I implemented a speedup without realizing that the compiler decided to eliminate half of the code because of that innocuous boolean flag. Just an example.

The radix tree lookup looks interesting, and it would be interesting to see exactly which instruction it idles on. I had a case where a function would sit idle, reproducibly, but when you looked into the function there was nothing obvious to optimize. It turned out that the CPU pipeline was so saturated that there were no more available CPU ports for the instruction this function was idling on. The fix was to rewrite code elsewhere, but in the vicinity of this function. This is something flame graphs can never show you, which is partly the reason I have never been a huge fan of them.


Did you look into the large spread on your distributions? Some of these span multiple orders of magnitude, which is interesting.


Fair point. These were run on a standard dev workstation under load, which may account for the noise. I haven't done a deep dive into the outliers yet, but the distribution definitely warrants a more isolated look.


Very thankful for the one-liner TL;DR.

edit: I had an afterthought about this because it ended up being a low-quality comment;

A TL;DR like this adds a lot of value, especially on HN: it lowers the barrier to engaging with the content and lets you focus on the parts you care about.

Reading the short form felt like that cool friend who gives you a heads-up.


I was unsure whether to post it or not so I am glad you found it useful!


I have that 10-30s window to fill while Claude might be loading some stuff; the one-liner is exactly what fits in that window. It makes me wonder about the original idea of Twitter, now that I think of it, but since it's not the same kind of content I don't bother with that. It really feels like "here is the stuff, here's more about it if you want to". I really appreciate that form and will definitely use the same format myself.


I have no practical experience with bpftrace, so it did not occur to me. I'll give it a try and perhaps there's gonna be a 2nd part of this investigation.


It's much tougher when it's so hard to reproduce. Perhaps the NMI watchdog could help? https://docs.kernel.org/admin-guide/lockup-watchdogs.html


Wow, someone is actually reading the article in detail, that's a good feeling! In C, the != operator has higher precedence than the || operator. That said, extra parentheses never hurt readability.


