Except you still need a fast bus for the CPUs to talk to each other and to access shared memory. So for all but the most embarrassingly parallel workloads, you just move the bottleneck from the memory bus to the shared cache bus, do you not?
A memory bus has long delays to set up a transfer, is typically only 64 bits wide, and only achieves good bandwidth on large burst operations.
The Venray design allows single-cycle random access to full 4096 bit cache lines, at least as described in the earlier iterations. Contention is far less an issue in this model, for many cores on 1 large memory chip. Multi-chip sticks are then akin to multi-socket motherboards.