> (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.
You have to divide the size of the active parameters (~16GB at 4-bit quantization) by the SSD read speed, rather than using the entire model size. If you are lucky, you might get around one token per second with speculative decoding, but I agree with the general point that it will be very slow.
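As a rough back-of-the-envelope sketch (the ~5 GB/s SSD speed, the parameter sizes, and the speculative-decoding acceptance rate below are illustrative assumptions, not measurements):

```python
# Back-of-the-envelope estimate of generation speed when streaming weights from SSD.
# All numbers are illustrative assumptions, not measurements.

total_model_bytes  = 1e12   # ~1 TB: full model on disk (the dense worst case)
active_param_bytes = 16e9   # ~16 GB: active parameters per token at 4-bit (MoE)
ssd_read_bytes_sec = 5e9    # ~5 GB/s: a fast NVMe SSD's sequential read speed

def seconds_per_token(bytes_read_per_token: float, read_speed: float) -> float:
    """Time to stream the weights needed for one token from SSD (ignoring compute)."""
    return bytes_read_per_token / read_speed

dense_time = seconds_per_token(total_model_bytes, ssd_read_bytes_sec)
moe_time   = seconds_per_token(active_param_bytes, ssd_read_bytes_sec)

print(f"Dense (whole model per token): {dense_time:.0f} s/token (~{dense_time/60:.1f} min)")
print(f"MoE (active experts only):     {moe_time:.1f} s/token")

# Speculative decoding can accept several draft tokens per pass over the weights;
# with an assumed ~3 accepted tokens per pass you approach ~1 token/s.
accepted_per_pass = 3
print(f"With speculative decoding:     ~{accepted_per_pass / moe_time:.1f} tokens/s")
```

This reproduces both estimates: streaming the full ~1TB per token lands in the "few minutes per word" range, while reading only the ~16GB of active parameters gets you to a few seconds per token, and speculative decoding can push that toward one token per second.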
Yeah, thanks for calling that out. I kind of panicked when I reached that part of the explanation and was stuck on whether or not to go into dense models vs. MoE. The question was about 'big stuff like that', which almost certainly means MoE, and I even chose an MoE as my example, but then there are giant dense models like Llama. That's not what was asked, although it wasn't not asked, because 'also big league stuff'… anyway, I basically thought "you're welcome" and "no problem", then said "you're problem".