Is there any information about the training process? Which data was used, which license was that data under, and which tools, drivers and hardware were used for the training?
Basically I'm wondering if these projects count as libre machine learning projects according to the Debian Deep Learning Team's Machine Learning Policy.
They do; the issue, IIRC, is with TensorFlow support and with the NVIDIA drivers.
So for the English model they use mostly free/open-source data, but some non-free data:
- train_files: Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora (from https://github.com/coqui-ai/STT/releases/tag/v0.9.3)
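To make the re-training angle concrete: those corpora are just rows in the CSV manifests handed to the trainer, so in principle you could rebuild a manifest from the free subset only. A minimal sketch in Python, assuming the DeepSpeech-style wav_filename,wav_filesize,transcript manifest format and hypothetical corpus layouts (a transcript .txt next to each .wav):

    import csv
    import os

    # Hypothetical local paths to the free corpora only
    # (Common Voice English + LibriSpeech), skipping the
    # non-free Fisher/Switchboard/WAMU data.
    FREE_CORPORA = ["commonvoice-en/clips", "librispeech/train-clean-100"]

    with open("train-free-only.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for corpus in FREE_CORPORA:
            for name in sorted(os.listdir(corpus)):
                if not name.endswith(".wav"):
                    continue
                wav = os.path.join(corpus, name)
                # assume a sibling .txt file holds the transcript
                with open(wav[:-4] + ".txt") as t:
                    transcript = t.read().strip()
                writer.writerow([wav, os.path.getsize(wav), transcript])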
As for the hardware for training... it's basically NVIDIA-only (you need TensorFlow / CUDA and all that guff). For inference it runs in real time on a CPU.
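To be clear about the CPU part, inference really is just a few lines. A minimal sketch with the Python bindings, assuming they keep the DeepSpeech-style API (pip package "stt"), a 16 kHz mono 16-bit WAV as input, and placeholder file names:

    import wave

    import numpy as np
    from stt import Model  # Coqui STT Python bindings

    model = Model("model.pbmm")              # placeholder acoustic model file
    model.enableExternalScorer("en.scorer")  # optional external scorer

    with wave.open("audio.wav", "rb") as w:  # assumes 16 kHz mono 16-bit PCM
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    print(model.stt(audio))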
There is plenty of free/open-source voice data out there; it's just a question of reaching for a sick bag and installing NVIDIA's stuff.
I don't know if they'd count as free/open-source according to Debian (I'm a Debian user myself), but the team has definitely talked about getting it into Debian and would be very open to discussions about it.
Check out the ML policy; it sounds like this would currently be classified as a "toxic candy" model due to the non-free data, though you could re-train to avoid that.
Using CUDA also means it wouldn't be considered fully free, although folks are working on getting AMD ROCm into Debian.
TensorFlow isn't yet in Debian, but there may be folks working on it.
Another problem is that Debian doesn't have the hardware to do training.
I'd encourage you to talk to the folks on the debian-ai mailing list and IRC channel to discuss these and other issues.
AFAIK, using copyrighted data for training does not necessarily make the trained model "toxic". The "Authors Guild, Inc. v. Google, Inc." case [1] is viewed as a key precedent for this view.
The phrase is "toxic candy", not "toxic"; see the policy for what it means.
Most data is protected by copyright, but I assume you meant proprietary rather than copyrighted. Using proprietary data might not matter under copyright law, but it does matter in terms of the Debian machine learning policy and DFSG, because the non-free data cannot be shipped in Debian main and thus cannot be used to train a model shipped in main.
Yeah, ROCm is a bit of a mess. I actually have an AMD GPU in a server, but the drivers in the mainline kernel don't work properly, so I've never been able to use it.
If I were into conspiracy theories, I'd say that AMD's failure to compete in the GPU/DL space has to do with the relationship between the AMD CEO and the NVIDIA one.
TensorFlow is just awful, as is anything that touches Bazel :)
I wasted a week trying to replace the scorer component with an NN-based language model. Every time I made a change, the whole codebase, including TensorFlow, recompiled, so the turnaround time was about an hour per change. It was awful. I mean, I get reproducible builds etc., and if you're running stuff at Google scale it probably has all kinds of useful features. But for development on a personal laptop it was torture. Eventually I gave up.
FWIW, that sounds like a bug or a misconfiguration; it's absolutely supposed to have better caching behavior than that (and it does in the few projects I've used it on, even on a personal laptop). If you're interested in pursuing it further (I'd understand if you aren't; that sounds frustrating), I bet the Bazel team would be interested in your report.
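One thing that might be worth checking (a sketch of the usual fix, assuming nothing in your setup was invalidating the cache key between runs): pointing Bazel at a persistent local cache, e.g. a "build --disk_cache=~/.cache/bazel" line in your .bazelrc, is supposed to let unchanged TensorFlow targets be reused across builds rather than recompiled from scratch.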
You don't need _no_ errors; you just need a low error rate.
Aside from Common Voice, there are also a lot of resources at OpenSLR. Also, the amount of data you need is often vastly overestimated, given advances in pretraining and transfer learning, and the fact that most languages don't have as terrible an orthography as English.
> Basically I'm wondering if these projects count as libre machine learning projects according to the Debian Deep Learning Team's Machine Learning Policy.
https://salsa.debian.org/deeplearning-team/ml-policy