Hacker News

I admit I haven't yet gone past the README, but I did my fair share of such programming in the past, and it's always the same use case that gets presented. That's fine for clarity, but ask yourself: when are you ever doing just one of each (transforms, normalizations, etc.)? The general use case is many (MANY) at once, and writing optimized code for the two situations is vastly different.


Concur on this. I'm usually reaching for an approximation algorithm (FMM, Barnes-Hut, etc.), and/or serializing and sending to a CUDA kernel, and generally using Rayon to parallelize when not on CUDA. I'm curious how to explore the space of CPU optimization (SIMD, SoA/AoS, etc.), but don't know anything about it.
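For context on the Rayon point: with Rayon, parallelizing a map is typically a one-line change (`par_iter` instead of `iter`). A std-only sketch of the same chunked parallelism, with a hypothetical per-element kernel standing in for the real force/transform work:

```rust
use std::thread;

// Hypothetical per-element work; stands in for a real force/transform kernel.
fn work(x: f32) -> f32 {
    x * x + 1.0
}

// Split the input across threads, roughly the way Rayon chunks work items.
fn parallel_map(data: &[f32], n_threads: usize) -> Vec<f32> {
    // Ceiling division, with a floor of 1 so chunks() never gets 0.
    let chunk = ((data.len() + n_threads - 1) / n_threads).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|&x| work(x)).collect::<Vec<f32>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```

With Rayon the body collapses to `data.par_iter().map(|&x| work(x)).collect()`, and the library handles chunking and work stealing for you.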


I'm curious how to explore the space of CPU optimization (SIMD, SoA/AoS, etc.), but don't know anything about it.

As with anything in that regard: profile, profile, profile. Run valgrind, check cache misses, and profile. Calculate the theoretical throughput of the CPU you're working on, i.e. the actual bandwidth of reading/writing RAM, with and without caching; that's your high-water mark, and the goal is to get as close as possible to those limits. If you want to start with that, you can do just that: benchmark simple reads/writes, then introduce your functions and structures and try to reclaim as much speed as possible. Graphs over profiling results always help, and graphs per commit or PR are even better, so you can tell how you're progressing. But that's just, like, my opinion, man. There's no right/wrong way; profiling always tells the truth in the end.
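The "simple reads/writes first" baseline can be sketched in a few lines. This is a rough illustration, not a rigorous benchmark: a real measurement needs warm-up runs, many iterations, and `std::hint::black_box` to keep the optimizer from eliding the work.

```rust
use std::time::Instant;

// Sum a large buffer and report the effective read bandwidth in GB/s.
// Compare the result against your RAM's theoretical bandwidth to see
// how much headroom your real code is leaving on the table.
fn read_bandwidth_gb_s(buf: &[u64]) -> (u64, f64) {
    let start = Instant::now();
    let sum: u64 = buf.iter().copied().fold(0u64, |a, b| a.wrapping_add(b));
    let secs = start.elapsed().as_secs_f64();
    let bytes = (buf.len() * std::mem::size_of::<u64>()) as f64;
    (sum, bytes / secs / 1e9)
}
```

Returning the sum alongside the rate also gives the caller something to assert on, which keeps the loop from being optimized away.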

tl;dr: measure raw read/write throughput for your CPU; profile.


I just added a 256-bit SoA SIMD computation. Going to follow your advice and benchmark this against both plain Rust and CUDA (f32).
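Not the commenter's actual code, but for readers wondering what "SoA" buys here: with a structure-of-arrays layout, each field is contiguous in memory, so a loop over one field maps cleanly onto 256-bit lanes (8 × f32) and the compiler can often auto-vectorize it (or you can drop to `std::arch` intrinsics). A hypothetical sketch:

```rust
// Structure-of-arrays: each coordinate stored contiguously, unlike
// AoS (Vec<Point>), where x/y/z are interleaved and strided loads
// defeat straightforward vectorization.
struct PointsSoa {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}

// Scale all points. Each field's loop is a candidate for 256-bit
// auto-vectorization; inspect the assembly (cargo-asm, Compiler
// Explorer) to confirm what the compiler actually emits.
fn scale(p: &mut PointsSoa, k: f32) {
    for x in p.xs.iter_mut() { *x *= k; }
    for y in p.ys.iter_mut() { *y *= k; }
    for z in p.zs.iter_mut() { *z *= k; }
}
```

Compiling with `-C target-cpu=native` (or enabling AVX explicitly) gives the compiler license to use the wider registers.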


awesome, wish you all the best with the project going forward!




