It looks like the main remaining delay cases are integer mul instructions, which get a bypass delay no matter what their source, plus a few cases like FMAs fed by non-shuffle integer ops (not that common) and non-shuffle integer ops fed by FMA or integer mul (also not that common).
The key part is that shuffles have zero delays in any configuration, as producer or consumer, except when a shuffle feeds an integer mul. That's good because shuffles are very common as inputs to both integer and FP ops.
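To make the rules above easier to check against a given producer/consumer pair, here is a minimal sketch that encodes them as a lookup. The `Op` categories and the `has_bypass_delay` name are made up for illustration. It only answers "delay or no delay", since no cycle counts are given here. Pairs the text does not mention (FMA feeding FMA, integer mul feeding FMA, and so on) are assumed to be delay-free, which is a guess rather than a measured result.

```python
from enum import Enum


class Op(Enum):
    # Hypothetical categories, named after the groups discussed in the text.
    SHUFFLE = "shuffle"
    INT = "non-shuffle integer op"  # integer mul is tracked separately
    INT_MUL = "integer mul"
    FMA = "FMA"


def has_bypass_delay(producer: Op, consumer: Op) -> bool:
    """Return True if the consumer pays a bypass delay for this producer."""
    # Integer mul gets a delay no matter what feeds it, shuffles included.
    if consumer is Op.INT_MUL:
        return True
    # Otherwise shuffles are free, whether producing or consuming.
    if Op.SHUFFLE in (producer, consumer):
        return False
    # FMA fed by a non-shuffle integer op.
    if consumer is Op.FMA and producer is Op.INT:
        return True
    # Non-shuffle integer op fed by FMA or integer mul.
    if consumer is Op.INT and producer in (Op.FMA, Op.INT_MUL):
        return True
    # Assumption: anything not listed in the text has no delay.
    return False


if __name__ == "__main__":
    for producer in Op:
        for consumer in Op:
            if has_bypass_delay(producer, consumer):
                print(f"{producer.value} -> {consumer.value}: delay")
```

Running it lists only the delayed pairs, which makes it easy to see that the shuffle row and column are empty apart from shuffle feeding integer mul.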