Is there any information about the training process? Which data was used, which license was that data under, and which tools, drivers and hardware were used for the training?
Basically I'm wondering if these projects count as libre machine learning projects according to the Debian Deep Learning Team's Machine Learning Policy.
They do; the issue, IIRC, is with TensorFlow support and with the NVIDIA drivers.
So for the English model they use mostly free/open-source data, but some non-free data:
- train_files: Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora (from https://github.com/coqui-ai/STT/releases/tag/v0.9.3)
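To make the re-training angle concrete: those corpora are just rows in the CSV manifests handed to the trainer, so in principle you could rebuild a manifest from the free subset only. A minimal sketch in Python, assuming the DeepSpeech-style wav_filename,wav_filesize,transcript manifest format and hypothetical corpus layouts (a transcript .txt next to each .wav):

    import csv
    import os

    # Hypothetical local paths to the free corpora only
    # (Common Voice English + LibriSpeech), skipping the
    # non-free Fisher/Switchboard/WAMU data.
    FREE_CORPORA = ["commonvoice-en/clips", "librispeech/train-clean-100"]

    with open("train-free-only.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for corpus in FREE_CORPORA:
            for name in sorted(os.listdir(corpus)):
                if not name.endswith(".wav"):
                    continue
                wav = os.path.join(corpus, name)
                # assume a sibling .txt file holds the transcript
                with open(wav[:-4] + ".txt") as t:
                    transcript = t.read().strip()
                writer.writerow([wav, os.path.getsize(wav), transcript])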
As for the hardware for training... it's basically NVIDIA-only (you need TensorFlow / CUDA and all that guff). For inference it runs in real time on a CPU.
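To be clear about the CPU part, inference really is just a few lines. A minimal sketch with the Python bindings, assuming they keep the DeepSpeech-style API (pip package "stt"), a 16 kHz mono 16-bit WAV as input, and placeholder file names:

    import wave

    import numpy as np
    from stt import Model  # Coqui STT Python bindings

    model = Model("model.pbmm")              # placeholder acoustic model file
    model.enableExternalScorer("en.scorer")  # optional external scorer

    with wave.open("audio.wav", "rb") as w:  # assumes 16 kHz mono 16-bit PCM
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(model.stt(audio))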
There is plenty of free/open-source voice data out there; it's just a question of reaching for a sick bag and installing NVIDIA's stuff.
I don't know if they'd count as free/open-source according to Debian (I'm a Debian user myself), but the team has definitely talked about getting it into Debian and would be very open to discussions about it.
Check out the ML policy; it sounds like this would currently be classified as a "toxic candy" model due to the non-free data, though you could re-train to avoid that.
Using CUDA also means it wouldn't be considered fully free, although folks are working on getting AMD ROCm into Debian.
TensorFlow isn't yet in Debian, but there may be folks working on it.
Another problem is that Debian doesn't have the hardware to do training.
I'd encourage you to talk to the folks on the debian-ai mailing list and IRC channel to discuss these and other issues.
AFAIK, using copyrighted data for training does not necessarily make the trained model "toxic". The "Authors Guild, Inc. v. Google, Inc." case [1] is viewed as a key precedent for this view.
The phrase is "toxic candy", not "toxic"; see the policy for what it means.
Most data is protected by copyright, but I assume you meant proprietary rather than copyrighted. Using proprietary data might not matter under copyright law, but it does matter in terms of the Debian machine learning policy and DFSG, because the non-free data cannot be shipped in Debian main and thus cannot be used to train a model shipped in main.
Yeah, ROCm is a bit of a mess. I actually have an AMD GPU in a server, but the drivers in the mainline kernel don't work properly, so I've never been able to use it.
If I were into conspiracy theories, I'd say that AMD's failure to compete in the GPU/DL space has to do with the relationship between the AMD CEO and the NVIDIA one.
TensorFlow is just awful, as is anything that touches Bazel :)
I wasted a week trying to replace the scorer component with an NN-based language model. Every time I made a change, the whole codebase, including TensorFlow, recompiled, so the turnaround time was about an hour per change. It was awful. I mean, I get reproducible builds etc., and if you're running stuff at Google scale it probably has all kinds of useful features. But for development on a personal laptop it was torture. Eventually I gave up.
FWIW, that sounds like a bug or a misconfiguration; it's absolutely supposed to have better caching behavior than that (and it does in the few projects I've used it on, even on a personal laptop). If you're interested in pursuing it further (I'd understand if you aren't; that sounds frustrating), I bet the Bazel team would be interested in your report.
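One thing that might be worth checking (a sketch of the usual fix, assuming nothing in your setup was invalidating the cache key between runs): pointing Bazel at a persistent local cache, e.g. a "build --disk_cache=~/.cache/bazel" line in your .bazelrc, is supposed to let unchanged TensorFlow targets be reused across builds rather than recompiled from scratch.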
You don't need _no_ errors; you just need a low error rate.
Aside from Common Voice, there are also a lot of resources at OpenSLR. Also, the amount of data you need is often vastly overestimated, given advances in pretraining and transfer learning, and the fact that most languages don't have as terrible an orthography as English.
> Basically I'm wondering if these projects count as libre machine learning projects according to the Debian Deep Learning Team's Machine Learning Policy.
https://salsa.debian.org/deeplearning-team/ml-policy