Aye, there’s the kicker. The correct configuration of hardware resources to run and multiplex large models is just as much of a trade secret as model weights themselves when it comes to non-hobbyist usage, and I wouldn’t be surprised if optimal setups are in many ways deliberately obfuscated or hidden to keep a competitive advantage
Edit: outside the HPC community specifically, I mean
The economic barrier to entry probably has a lot to do with it. I'd happily dig into this problem and share my findings but it's simply too expensive for a hobbyist that isn't specialized in it.
Edit: outside the HPC community specifically, I mean