The constraint absolutely was not that. It was that they didn't want an instruction that can load more than two cache lines' worth of data, or that can write to more than two registers.
Is it? These are stack management instructions. You know in advance from the platform ABI which registers are "scratch" and which are "save", so if you allocate any of the "save" registers in your function, you emit the corresponding push/pop pairs at the start and end of the function. At worst you push/pop a register you don't need to, but on ARM64 the stack has to be 128-bit (16-byte) aligned, so you have to push/pop in pairs anyway.
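To make that concrete, here is a rough sketch of how a code generator might turn "which save registers did this function touch" into paired pushes and pops. The helper name `prologue_epilogue` is made up for illustration and is not from any real compiler. It assumes x19..x28 are the callee-saved registers (as in AAPCS64), ignores the frame pointer and link register, and spends one 16-byte slot per pair, whereas a real compiler would more likely allocate the whole frame once and store at offsets.

```python
# Hypothetical helper for illustration only, not taken from a real compiler.
# Assumption: x19..x28 are the callee-saved general-purpose registers (AAPCS64).
CALLEE_SAVED = [f"x{n}" for n in range(19, 29)]


def prologue_epilogue(used):
    """Return (prologue, epilogue) instruction lists for the callee-saved
    registers that appear in `used`."""
    saves = [r for r in CALLEE_SAVED if r in used]
    # Group the registers two at a time; the last group may hold only one.
    groups = [saves[i:i + 2] for i in range(0, len(saves), 2)]

    prologue, epilogue = [], []
    for group in groups:
        if len(group) == 2:
            # One 16-byte slot holds both registers and keeps sp aligned.
            prologue.append(f"stp {group[0]}, {group[1]}, [sp, #-16]!")
            epilogue.append(f"ldp {group[0]}, {group[1]}, [sp], #16")
        else:
            # A lone register still takes a full 16-byte slot.
            prologue.append(f"str {group[0]}, [sp, #-16]!")
            epilogue.append(f"ldr {group[0]}, [sp], #16")

    # Restores must run in the opposite order to the saves.
    epilogue.reverse()
    return prologue, epilogue


if __name__ == "__main__":
    pro, epi = prologue_epilogue({"x19", "x20", "x21"})
    print("\n".join(pro + ["..."] + epi))
```

Note that the lone x21 costs a whole 16-byte slot by itself, the same as the x19/x20 pair.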
Then your allocator would need to know that, once it has decided to use one register of a pair, the other half of the pair gets its save/restore for 'free' and is now a better choice than a different callee-saved register. I suspect that unless your allocator was designed from the start to deal with that kind of "my choice of register here affects costs, and thus my allocation decisions, for a completely different value over there", it's not going to do a great job under that kind of constraint.
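The 'free' second register can be put in numbers with a toy cost model. This is only a sketch: `save_slots` and `marginal_cost` are hypothetical names, and it assumes any two saved registers can share one store-pair/load-pair, so the cost depends only on how many callee-saved registers are in use, not on which ones.

```python
import math


def save_slots(num_callee_saved):
    """16-byte stack slots (one save and one restore instruction each)
    needed for this many callee-saved registers.
    Assumption: any two saved registers can share a slot."""
    return math.ceil(num_callee_saved / 2)


def marginal_cost(already_used):
    """Extra slots needed when one more callee-saved register is used."""
    return save_slots(already_used + 1) - save_slots(already_used)


if __name__ == "__main__":
    for n in range(4):
        print(f"{n} already used -> one more costs {marginal_cost(n)} slot(s)")
```

Under that assumption the marginal cost alternates between one slot and zero, which is the effect an allocator's cost function would have to account for.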
Callee-saved registers are saved on function entry, all at once. There is no interaction other than the register allocation step choosing how many values need to be preserved across function calls.