Strix Halo has 256 GB/s of memory bandwidth for $2500.
The Flash model reads about 13 GB of active weights per token.
256 / 13 = 19.7 tokens per second
Except the model does not fit into the 128 GB maximum RAM that Strix Halo supports. So move on.
Another option is Threadripper with 8 memory channels. Using older DDR4-3200, you get roughly 200 GB/s (8 × 25.6 GB/s per channel) for $2000.
200 / 13 = 15.4 tokens per second
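To make the arithmetic explicit, here is a rough sketch of the bandwidth-bound estimate used above. It assumes decoding is limited purely by reading the active weights once per token and that a DDR channel is 64 bits (8 bytes) wide; the function names are made up for illustration, and real systems land below these theoretical peaks.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes each channel moves 8 bytes (64 bits) per transfer.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


def decode_tps_upper_bound(bandwidth_gbs, active_gb_per_token):
    """Upper bound on tokens/s if every token reads the active weights once."""
    return bandwidth_gbs / active_gb_per_token


if __name__ == "__main__":
    ddr4_8ch = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)
    print(f"8-channel DDR4-3200 peak: {ddr4_8ch:.1f} GB/s")
    print(f"Strix Halo bound:   {decode_tps_upper_bound(256, 13):.1f} tok/s")
    print(f"Threadripper bound: {decode_tps_upper_bound(200, 13):.1f} tok/s")
```

These are ceilings, not predictions: compute, cache behaviour and framework overhead all push the real number down.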
But a chunk of the per-token weights is always the same (the non-MoE part), so you would offload that to a GPU and get a decent speedup. Say 25 tokens per second total.
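A back-of-the-envelope way to see where a number like 25 could come from: split the 13 GB into a shared part read from GPU memory and a routed part read from system RAM. The 50% shared fraction and the 900 GB/s GPU bandwidth below are pure assumptions picked for illustration, not figures for any particular model or card, and the sketch assumes the CPU and GPU steps run one after the other with no overlap.

```python
def split_decode_tps(active_gb, shared_fraction, cpu_bw_gbs, gpu_bw_gbs):
    """Tokens/s when the shared weights live on the GPU and the rest in RAM.

    Assumes both reads happen serially for each token (no overlap).
    """
    shared_gb = active_gb * shared_fraction   # always-active, non-MoE part
    routed_gb = active_gb - shared_gb         # expert weights left in system RAM
    seconds_per_token = routed_gb / cpu_bw_gbs + shared_gb / gpu_bw_gbs
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Hypothetical inputs: half the active weights are shared, GPU at 900 GB/s.
    print(f"CPU only:     {split_decode_tps(13, 0.0, 200, 900):.1f} tok/s")
    print(f"With offload: {split_decode_tps(13, 0.5, 200, 900):.1f} tok/s")
```

The more of the per-token traffic that lands on the GPU, the closer you get to the GPU's own bandwidth bound.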
Then likely some expensive Mac. No idea.
Eventually you arrive at a mining rig chassis with a beefy board and multiple GPUs. That has the benefit of pipelining. You run part of the model on one GPU and pass the result on, so another batch can start on the first one. Throughput is low per stream (say 30-100 tps), but a lot more in aggregate. Best bought and shared with other people.
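The pipelining point can be sketched with a toy model. It assumes each GPU holds one slice of the layers, a token must pass through every slice before that stream can produce its next token, and transfer time between GPUs is ignored; the 5 ms per stage is a made-up figure.

```python
def pipeline_throughput(stage_seconds, streams_in_flight):
    """Return (per-stream tok/s, aggregate tok/s) for a simple layer pipeline."""
    latency = sum(stage_seconds)      # time for one token to cross all stages
    bottleneck = max(stage_seconds)   # slowest stage caps the total rate
    # Each stream is autoregressive, so it yields at most one token per latency.
    # Extra streams fill idle stages until the slowest stage saturates.
    aggregate = min(streams_in_flight / latency, 1.0 / bottleneck)
    return aggregate / streams_in_flight, aggregate


if __name__ == "__main__":
    stages = [0.005, 0.005, 0.005, 0.005]  # hypothetical: 4 GPUs, 5 ms each
    for streams in (1, 4, 8):
        per_stream, total = pipeline_throughput(stages, streams)
        print(f"{streams} stream(s): {per_stream:.0f} tok/s each, {total:.0f} tok/s total")
```

A single user leaves most of the GPUs idle at any moment, which is why the rig only pays off when several people (or several requests) keep it busy.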
I agree. The answer is regulation that outlines rules of engagement for "free" (you are the product) online services.
Australia is famous for having very strong consumer protection laws for purchased products (physical goods). It has been discussed many times here. How does this work in the digital universe?
Steam (i.e. Valve) used to pretty much not give refunds for games.
That changed after the Australian Competition and Consumer Commission (ACCC) dragged them through Federal Court for it, comprehensively winning against Steam.
Thus the "refunds if you've not played for more than ~2 hours" policy that Steam then implemented (globally).
Probably the relevant quote to answer your question about how things work in the digital universe:
> "This important precedent confirms the ACCC's view that overseas-based companies selling to Australian consumers must abide by our laws. If customers buy a product online that is faulty, they are entitled to the same right to a repair, replacement, or refund, as if they'd walked into a store," ACCC Commissioner Sarah Court said.
"They", being a trillion-dollar company, can effortlessly draw this out while you expend all manner of time and resources, and still not pay you or resolve the problem. Any regulation that would change this to be better for the average consumer (and therefore worse for the trillion-dollar company) will never pass because they have more say in the laws than you do.
This is a defeatist comment, I know, but I do feel defeated when it comes to tech companies.