Hacker News

I admit I haven't yet gone past the README, but I did my fair share of such programming in the past, and it's always the same use case that gets presented. That's fine for clarity, but ask yourself: when are you ever doing just one of each (transforms, normalizations, etc.)? The general use case is many (MANY) at once, and writing optimized code for the two situations is vastly different.


Concur on this. I'm usually reaching for an approximation algorithm (FMM, Barnes-Hut, etc.), and/or serializing and sending to a CUDA kernel, and generally using Rayon to parallelize when not on CUDA. I'm curious how to explore the space of CPU optimization (SIMD, SoA/AoS, etc.), but don't know anything about it.
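For context on the Rayon point: with Rayon, parallelizing a map is typically a one-line change (`par_iter` instead of `iter`). A std-only sketch of the same chunked parallelism, with a hypothetical per-element kernel standing in for the real force/transform work:

```rust
use std::thread;

// Hypothetical per-element work; stands in for a real force/transform kernel.
fn work(x: f32) -> f32 {
    x * x + 1.0
}

// Split the input across threads, roughly the way Rayon chunks work items.
fn parallel_map(data: &[f32], n_threads: usize) -> Vec<f32> {
    // Ceiling division, with a floor of 1 so chunks() never gets 0.
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&x| work(x)).collect::<Vec<f32>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

With Rayon the body collapses to `data.par_iter().map(|&x| work(x)).collect()`, and the library handles chunking and work stealing for you.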


I'm curious how to explore the space of CPU optimization (SIMD, SoA/AoS, etc.), but don't know anything about it.

As with anything in that regard: profile, profile, profile. Run valgrind, check cache misses, and profile. Calculate the theoretical throughput of the CPU you're working on, i.e. the actual bandwidth of reading/writing RAM, with and without caching; that's your high-water mark, and the goal is to get as close as possible to those limits. If you want to start with that, you can do just that: benchmark simple reads/writes, then introduce your functions and structures and try to reclaim as much speed as possible. Graphs over profiling results always help, and graphs per commit or PR are even better, so you can tell how you're progressing. But that's just, like, my opinion, man. There's no right/wrong way; profiling always tells the truth in the end.
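The "simple reads/writes first" baseline can be sketched in a few lines. This is a rough illustration, not a rigorous benchmark: a real measurement needs warm-up runs, many iterations, and `std::hint::black_box` to keep the optimizer from eliding the work.

```rust
use std::time::Instant;

// Sum a large buffer and report the effective read bandwidth in GB/s.
// Compare the result against your RAM's theoretical bandwidth to see
// how much headroom your real code is leaving on the table.
fn read_bandwidth_gb_s(buf: &[u64]) -> (u64, f64) {
    let start = Instant::now();
    let sum: u64 = buf.iter().copied().fold(0u64, |a, b| a.wrapping_add(b));
    let secs = start.elapsed().as_secs_f64();
    let bytes = (buf.len() * std::mem::size_of::<u64>()) as f64;
    (sum, bytes / secs / 1e9)
}
```

Returning the sum alongside the rate also gives the caller something to assert on, which keeps the loop from being optimized away.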

tl;dr: measure raw read/write throughput for your CPU; profile.


I just added a 256-bit SoA SIMD computation. Going to follow your advice and benchmark this against both plain Rust and CUDA (f32).
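Not the commenter's actual code, but for readers wondering what "SoA" buys here: with a structure-of-arrays layout, each field is contiguous in memory, so a loop over one field maps cleanly onto 256-bit lanes (8 × f32) and the compiler can often auto-vectorize it (or you can drop to `std::arch` intrinsics). A hypothetical sketch:

```rust
// Structure-of-arrays: each coordinate stored contiguously, unlike
// AoS (Vec<Point>), where x/y/z are interleaved and strided loads
// defeat straightforward vectorization.
struct PointsSoa {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}

// Scale all points. Each field's loop is a candidate for 256-bit
// auto-vectorization; inspect the assembly (cargo-asm, Compiler
// Explorer) to confirm what the compiler actually emits.
fn scale(p: &mut PointsSoa, k: f32) {
    for x in p.xs.iter_mut() { *x *= k; }
    for y in p.ys.iter_mut() { *y *= k; }
    for z in p.zs.iter_mut() { *z *= k; }
}
```

Compiling with `-C target-cpu=native` (or enabling AVX explicitly) gives the compiler license to use the wider registers.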


awesome, wish you all the best with the project going forward!




