They do two things: RLHF to align the model itself with human preferences, and a small external model called text-moderation-001 that checks text against a few problematic categories and triggers a warning message on the screen when one is hit.
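
For illustration, here is a minimal sketch of what that second, external check might look like in code. It uses the OpenAI moderation endpoint from the official Python SDK; the default model and the exact warning behavior are assumptions for the example, not a description of what actually runs behind ChatGPT.

```python
# Sketch only: screen a piece of text with a moderation model before displaying it.
from openai import OpenAI

client = OpenAI()  # API key is read from the OPENAI_API_KEY environment variable

def check_before_display(text: str) -> bool:
    """Return True if the moderation model flags the text as problematic."""
    response = client.moderations.create(input=text)  # uses the SDK's default moderation model
    result = response.results[0]
    if result.flagged:
        # In the chat UI, this is roughly where the on-screen warning would appear.
        print("Warning: this content may violate the content policy.")
    return result.flagged

# Example usage: check a model's output before showing it to the user.
check_before_display("Some model output to screen before display.")
```

The point is that the moderation pass is a separate, cheap classifier run on the text, independent of the RLHF training that shapes the main model's behavior.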