Believe it or not, it's as simple as averaging or adding the gradients from each training example before applying them to the model weights. The same thing happens when you train a model using batches of inputs.
It actually isn't. You need a synchronizer or a batch size of one, or else "strange things" can happen and you waste a lot of cycles. Alternatively, you can make non-trivial changes to your network structure to enable distributed training.
It really is that simple. Yes, there are many different approaches to this (which can become quite clever and complex, as is true of training in general), but in most cases it really does boil down to adding or averaging the gradients.
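To illustrate the claim, here's a minimal sketch (toy linear model and data of my own invention): for squared error, the average of per-example gradients equals the gradient of the mean loss over the batch, which is why summing/averaging worker gradients mirrors ordinary batched training.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)         # model weights
X = rng.normal(size=(8, 3))    # a "batch" of 8 inputs
y = rng.normal(size=8)         # targets

def grad_single(w, x, t):
    # gradient of 0.5 * (w.x - t)^2 with respect to w
    return (w @ x - t) * x

# "Distributed" view: each worker computes a gradient on one example,
# then the gradients are averaged before the weight update.
avg_of_grads = np.mean([grad_single(w, x, t) for x, t in zip(X, y)], axis=0)

# "Batch" view: one gradient of the mean squared-error loss over the batch.
batch_grad = X.T @ (X @ w - y) / len(y)

print(np.allclose(avg_of_grads, batch_grad))  # True
```

This is the synchronous data-parallel case; the caveats in the reply above (stale gradients, asynchrony) are about what happens when that averaging step isn't synchronized.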