Hacker News | dendibakh's comments

... And the only (?) memory profiler for Windows.


Thanks, and don't forget to measure. :)


You can also visit easyperf.net (my blog). If you like Linux perf as much as I do, you will like it. Here are some links:

- https://easyperf.net/blog/2018/08/26/Basics-of-profiling-wit...

- https://easyperf.net/blog/2019/04/03/Precise-timing-of-machi...

- https://easyperf.net/blog/2019/02/09/Top-Down-performance-an...

But in the end, almost every one of my articles contains a Linux perf use case.


Thanks for the great comment. Where can I read more about the impact of each item on variability?


Thanks for the comment. As I answered in the previous comment, yes, I agree. I likely did a poor job of explicitly saying that this advice is not generally applicable. It usually makes sense to have a dedicated machine (or pool of machines) that is properly configured to run perf tests, with nothing else executed on it.


Yep. And I think I stated that at the beginning of the article. :)

This one:

> It is important that you understand one thing before we start. If you use all the advice in this article, it is not how your application will run in practice. If you want to compare two different versions of the same program, you should use the suggestions described above. If you want absolute numbers to understand how your app will behave in the field, you should not apply any artificial tuning to the system, since a client might run with default settings.
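To make that concrete, the "artificial tuning" meant here is things like fixing the CPU frequency and pinning the benchmark to a core. A minimal sketch (assumes Linux with the intel_pstate driver and root privileges; the function name is made up for illustration):

```shell
# Sketch of typical benchmark-machine tuning. Requires root; the
# intel_pstate path is Intel-specific. These settings are for a
# dedicated perf-test box only, not for production machines.
quiet_machine_for_perf_tests() {
    # Disable turbo so the clock frequency does not vary run to run.
    echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
    # Pin the frequency governor to a fixed performance state.
    cpupower frequency-set -g performance
}
# Then run the benchmark pinned to one core to avoid migration noise:
# taskset -c 3 ./benchmark
```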


Yes, that might be possible. However, you will probably get multiple cycle counts for the same function depending on which path was taken. And it only works if the number of taken branches in the function is small (fewer than 32); otherwise it will not fit into the LBR stack. For example, a loop with more than 32 iterations will trash the LBR stack with backward jumps. But yeah, for small functions it might work pretty well.

I would rather analyze not the whole function (all of its basic blocks) but only the Hyperblocks (the typical hot path through the function). Here is an example of how to do it: https://lwn.net/Articles/680985/ chapter "Hot-path analysis".
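For reference, the approach from that LWN article can be driven from perf directly. A sketch (the wrapper name and `./myapp` are placeholders; it needs an Intel CPU with LBR support and a reasonably recent Linux perf):

```shell
# Sketch: collect taken-branch stacks via the LBR and reconstruct
# hot paths. `hot_path_report` and `./myapp` are made-up names.
hot_path_report() {
    # -b records the (32-entry) LBR branch stack with each sample.
    perf record -b -o lbr.data -- "$@" &&
    # --branch-history stitches the recorded branch stacks into paths.
    perf report -i lbr.data --branch-history --stdio
}
# Usage on a real machine: hot_path_report ./myapp
```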


Superblocks rather than Hyperblocks. Except for cmov, which is partial predication, x86 doesn't have predication. But SBs are probably what your optimizer wants anyway.


Thanks. I'm glad you like the article. :)


Hi, I'm glad you like the article. The process of how I went from a TMAM metric to the particular event used to calculate it is described in the TMAM metrics table: https://download.01.org/perfmon/TMA_Metrics.xlsx

In the same row as the DRAM_Bound metric there is a precise (PEBS) event specified that we can use for locating the issue. Sampling on the precise event will let us detect the exact places in the code with the most L3 misses.
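For example, a sketch of such a sampling run (the event name below is the Skylake spelling and varies by microarchitecture, so check `perf list` on your machine; `locate_l3_misses` and `./myapp` are made-up names):

```shell
# Sketch: sample a precise (PEBS) event to pinpoint L3 misses.
# mem_load_retired.l3_miss is the Skylake event name; check
# `perf list` for the right name on your CPU.
locate_l3_misses() {
    # The :ppp suffix asks for maximum PEBS precision, so samples
    # land on the exact instruction that missed.
    perf record -e mem_load_retired.l3_miss:ppp -o pebs.data -- "$@" &&
    perf report -i pebs.data --stdio
}
# Usage on a real machine: locate_l3_misses ./myapp
```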

Let me know if you have further questions!


Thanks for the reply and the pointer!


Thanks for this clear explanation!

Regarding your example with 1000 muls followed by 1000 loads: that's why, in my experiments, I interleaved loads and bswaps, because that's what (hopefully) every decent compiler will do.


> I interleaved loads and bswaps, because that's what (hopefully) every decent compiler will do

Your optimism about compiler behavior is charming. I think you may be disappointed if you expect compilers to interleave loads just because this approach works better on modern processors. My experience has been that GCC goes out of its way to hoist all of your carefully interleaved loads into a big block at the top.

I'm guessing it does this because it's a simple heuristic that was often helpful before out-of-order processors became common. Normally, the performance impact of this is very small, but when it does exist, it's usually negative. If I recall, ICC does a better job of interleaving, or at least leaving things interleaved.


Well, yeah. I don't know much about the current state of the art (since I don't touch the CodeGen on a daily basis), but I look to the future with hope that compilers will handle at least those "simple" cases better. :)


Compilers are rarely going to transform the 1000/1000 example into the 10/10 one, even if they were smarter.

Often such a transformation is simply impossible: the effect of interleaving the instructions may differ from what the source dictates.

Also the 1000/1000 example probably doesn't arise as a long stream of explicit instructions: it is probably just a short loop with 1000 iterations! That makes it even less likely that the compiler will simply start interleaving various instructions following the loop into the loop body somehow.

