AFAIK, using copyrighted data for training does not necessarily make the trained model "toxic". The "Authors Guild, Inc. v. Google, Inc." case [1] is viewed as a key precedent for this view.
The phrase is "toxic candy", not "toxic"; see the policy for what it means.
Most data is protected by copyright, but I assume you meant proprietary rather than copyrighted. Using proprietary data might not matter under copyright law, but it does matter under the Debian machine learning policy and the DFSG, because non-free data cannot be shipped in Debian main and thus cannot be used to train a model shipped in main.
[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....