It is difficult to say in quantitative terms how much better the results became. I do have a tool in the project for measuring prompt quality, but it is oriented more toward detecting critical degradation of results than toward comparing them (it's a pet project, so there isn't much time for such tooling).
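The degradation-detection idea can be sketched roughly like this (all names here are hypothetical, not the actual tool; it just assumes you keep a known-good set of tags for a fixed sample of news items and flag items whose tags drop too far below that baseline, rather than scoring fine-grained differences):

```python
# Hypothetical sketch of a "critical degradation" check for a tagging prompt.
# For each news item in a fixed evaluation sample, compare the tags the
# current prompt produced against a known-good baseline and flag items
# whose overlap fell below a threshold.

def degradation_check(baseline: dict[str, set[str]],
                      current: dict[str, set[str]],
                      min_overlap: float = 0.5) -> list[str]:
    """Return ids of items whose tag overlap with the baseline is too low."""
    flagged = []
    for item_id, expected in baseline.items():
        got = current.get(item_id, set())
        # Fraction of the expected tags the current prompt recovered.
        overlap = len(expected & got) / len(expected) if expected else 1.0
        if overlap < min_overlap:
            flagged.append(item_id)
    return flagged

baseline = {"news-1": {"politics", "economy"}, "news-2": {"sports"}}
current = {"news-1": {"politics"}, "news-2": {"weather"}}
print(degradation_check(baseline, current))  # ['news-2']
```

A check like this won't tell you which of two decent prompts is better, but it's cheap to run and catches the "suddenly everything is hallucinated tags" failure mode.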
In qualitative terms, there are far fewer totally wrong / hallucinated tags. I estimate the number of resulting tags per news item dropped by 30-40% because of it.