>>: the on-chip cache is much quicker than conventional architectures as the TLB is not on the critical path
I would really like to know your reasoning that the TLB is a major bottleneck in conventional CPUs. CPUs execute the TLB lookup in parallel with the cache access, so there is usually no added latency except on a TLB miss.
Basic research on in-memory databases suggests that eliminating the TLB would improve performance by only about 10%, and even that is hardly a typical workload; most of the benefit can be obtained simply by using larger pages. So I don't really know where your claim of 25% fewer reads comes from in relation to simply getting rid of virtual memory.
Right, most modern caches use the virtual address to get the cache index and the physical address for the tag comparison[1]. Since on x86 the bits needed for the index fall within the page offset, they are the same in the virtual and physical address, so the entire L1 lookup can be done in parallel with the TLB, though on other architectures like ARM you need to finish the TLB step before the tag comparison[2].
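To make the "index bits are untranslated" point concrete, here is a toy sketch (not real hardware, and the geometry is an assumption chosen so the math works out: 4 KiB pages, a 32 KiB 8-way L1 with 64-byte lines):

```python
# Toy VIPT address split. Assumed geometry (illustrative, not any real chip):
# 4 KiB pages, 32 KiB 8-way L1, 64-byte lines -> 64 sets, 6 index bits.
PAGE_BITS = 12   # bits [11:0] are the page offset, untranslated
LINE_BITS = 6    # 64-byte cache lines
SETS = 64        # 32 KiB / (64 B * 8 ways) = 64 sets

def split(addr):
    offset = addr & ((1 << LINE_BITS) - 1)       # byte within line
    index = (addr >> LINE_BITS) % SETS           # set index: bits [11:6]
    tag = addr >> PAGE_BITS                      # tag needs the translated bits
    return offset, index, tag

# The index bits [11:6] lie inside the page offset, so they are identical
# in the virtual and the physical address: the cache can start reading the
# set while the TLB translates the upper bits for the tag comparison.
va = 0x7F3A_B123_4ABC
pa = 0x0001_2345_4ABC  # same low 12 bits after translation
assert split(va)[1] == split(pa)[1]  # same set index either way
```

If the L1 were bigger (per way) than a page, some index bits would come from the translated part of the address and the parallel lookup would no longer be free, which is one reason L1 sizes tend to stay at way-count × page-size.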
But while I think the Mill people are overselling the direct performance benefits here, the single address space lets them do a lot of other things, such as automatically spilling all sorts of state to the stack on a function call and handling any resulting page fault the same way it would be handled if it came from an ordinary store instruction. And I think their backless storage concept requires it too.
The reason the TLB is so fast is also that it is fairly small, which is why it misses fairly often. Moving the TLB so it sits in front of DRAM means you can afford a 3-4 cycle TLB with thousands of entries.