One of the most noticeable additions in my opinion is Guarani, the first Indigenous language of the Americas to be added. Indigenous languages are extremely poorly supported and forgotten by all of the major platforms and companies, and it's great to see one getting the attention they deserve. (Disclaimer: I was involved)
Whoah, 6.5 million native speakers! That's several orders of magnitude more than I was expecting. It's also significantly larger than the native-speaking populations of languages like Catalan, Basque, or Romansh, which might be more familiar to North Americans or Europeans.
There are a number of Native American languages with many speakers, but until recently they have been marginalized, repressed and ignored (and some are to this day). Guarani has the most speakers, but there are also Quechua, Nahuatl, and the various Mayan languages (spoken by around half of Guatemalans, and another 2.5 million Mexicans).
The problem is not the Spanish language. The problem is a colonial peasant economy/society that turned into a post-colonial peasant society, with land-owning quasi-nobility ruling over disempowered (in this case, indigenous) laborers and freely exercising their power to steal, rape, kill, etc., without penalty; it is a situation more or less comparable to peasant societies around the world and throughout history, which are always very exploitative and often racist.
Working class Spanish speakers living in towns were in many ways also economically exploited, but considered “better than indigenous people” to be a core part of their identities, and also felt free to beat them, steal from them, etc. where they found the opportunity. It’s a situation broadly comparable to race relations in the US south, where poor whites considered “better than blacks” to be a defining part of their identity.
Perhaps counterintuitively, the history of exploitation of indigenous communities, and the way indigenous people were shut out of many social and economic activities, led to the preservation of native languages.
Without wishing to get political, is the difference that Iceland is a country but Guarani speakers don't have a nation-state of their own? Or something else?
Note that Icelandic is currently not well supported either ("In progress" with 384/5000 sentences and 86% Localized). Actually, Guaraní is better supported at the moment, and quite a number of other common smaller-ish languages aren't well supported yet either such as Hebrew, Danish, and even Korean (which is not small or even small-ish at all). Some other smaller languages are, such as Breton or Irish. Overall, it's a bit inconsistent. I suppose that this is because in the end, these things depend on the number of people contributing; there's a reason Esperanto is near the top, as it has a very active community of enthusiasts who love to promote the language.
It takes about a week to get the interface translated and to start collection, for any language with at least 5000 sentences in the public domain. I helped bootstrap Guarani and Breton and a few other languages spoken by friends of mine, but in the end, it just takes one or two people. I think in general there is a big difference in engagement if STT/ASR already exists for the language (e.g. Hebrew, Danish and Korean) and if it doesn't exist at all.
I think this is overly dismissive of other factors. Whether or not a language is supported by something on the Internet has a lot more to do with financial incentives than politics. If there were a huge consumer market clamoring to give their money to a site and the only barrier were language, it'd get exploited pretty quickly.
This is superficially correct, but also completely disingenuous.
The reason why there isn't a huge consumer market for indigenous languages is because they're overwhelmingly systematically unsupported by their respective governments in favor of the non-indigenous colonial languages.
To be clear, that's not Mozilla's fault, and not something they or other random organizations can fix, but as human beings we should all be happy and give credit to those organizations that do their small part.
I don't think that's the entire picture. I live near a part of East Germany that has a minority language community, the Sorbs. Unlike the language communities that you seem to be thinking of, the Sorbian language is actively supported by the government. Protection of Sorbian language and culture is enshrined in the state constitution. All the street signs are bilingual. Sorbian is being taught at school to everyone who wants to learn it.
Yet I have never seen an application that had a Sorbian translation, for one simple reason: Every Sorb also speaks German, so there is no financial incentive to invest in a Sorbian translation. The only things with Sorbian translations are those produced in the local area, e.g. the websites of local governments or local businesses.
No, it has a lot to do with politics as well. A sovereign nation may find it important to have its languages supported widely on the internet, so it might put some public funds toward translation efforts and voice recognition/speech synthesizer contributions.
I know the Icelandic government spends some money on this and it shows. This tiny language has way more support than other, far more widely spoken languages. If the Norwegian government wanted, I bet the Sámi languages could have just as good support as Icelandic. Or if the Greenlandic government had more funds available, I bet we would see Kalaallisut in more places online.
What you are saying is that a small, relatively rich country can invest in supporting its own language: that, to me, is not political but, as raised previously, financial. It's also a good incentive for other big players (Google, Microsoft, Apple) to invest in a language that has prospective customers willing to spend more.
Serbian government would certainly support Serbian language voice recognition and synthesis, but probably not with as much money as Iceland would.
> Politics (from Greek: Πολιτικά, politiká, 'affairs of the cities') is the set of activities that are associated with making decisions in groups, or other forms of power relations between individuals, such as the distribution of resources or status.
It certainly sounds like a political situation to me, almost to the point of tautology. The fact that these decisions were made on the basis of financial gain doesn't make them any less political.
The Norwegian government and the Sámi parliament put a lot of effort into language technology for the Sámi languages. A big problem is lack of openness in platform support; e.g. Google and Apple make it very difficult for external developers to do localisation.
I'm sure having a nation-state is a major factor, but I bet it also has to do with the average wealth, geographic location, historical alliances. However, I'd put my money on skin color as the biggest factor.
As an example in favor of your conclusion, I propose Greenlandic. Geographically really close to Iceland, the sole official language of an autonomous country, with significant cultural heritage (even a famous [possible] dwarf planet is named after one of their historic gods). However—unlike Iceland—Greenland is not a wealthy country, and Greenlanders tend to have darker skin than Icelanders.
You are being overly pedantic. "Country" is not a strictly defined term. Sometimes it is used as you imply here, for sovereign states, which Greenland is not, but often it is used for other political entities as well. E.g. you often hear people speak of Puerto Rico as a country, and you also hear people from the UK pride themselves on being a country of countries (the former singular "country" meaning a sovereign state, and the latter plural "countries" something else).
If the UK is a country of countries, then Greenland is most certainly an autonomous country. The Wikipedia article for Greenland mentions the word "country" 14 times, so I'm certainly not the only one using the term this way.
>It is one of the official languages of Paraguay (along with Spanish), where it is spoken by the majority of the population, and where half of the rural population is monolingual.
As an Icelander I am always really impressed with how well my language—a language spoken by a few hundred thousand people worldwide—is supported on various platforms and technologies. This is probably in no small part thanks to active participation by native speakers and even some government funding.
However, at the same time, I'm also deeply disappointed by the lack of support for Iceland's closest neighbour's language—Greenlandic—which is an indigenous language and the sole official language of an autonomous country.
I saw the same when I was younger for Norwegian. Bokmål is the most commonly written form of Norwegian, but New Norwegian is used by about ~15%. Most software included Bokmål support, but you could bet some hardcore user of New Norwegian had made a language pack available as well.
For Mozilla Common Voice, it looks like even Bokmål isn't listed as a dataset yet. Language packs have the advantage that a single dedicated user can come up with the entire thing, but for voice collections you need a large variety of different people and ideally tons of them. For any language with a small native speaker population, even a rich one like Norway's and especially a fractional subset like Nynorsk, getting enough speakers to participate in open source collection efforts will remain a challenge. Purportedly, even for commercial companies it's hard to find enough Norwegians willing to speak a few sentences for a nominal payment, unlike in most other countries.
Luckily, speech recognition research is making some good progress on dealing with low-resource languages so hopefully we'll see some acceptable models made from the little available open data that's out there.
> However, at the same time, I'm also deeply disappointed by the lack of support for Iceland's closest neighbour's language—Greenlandic—which is an indigenous language and the sole official language of an autonomous country.
I'm not sure "autonomous country" is an accurate description of what Greenland is. It is - for all intents and purposes - a devolved region of Denmark. It is still way too reliant on economic aid to be able to be independent and, honestly, probably couldn't exist as a developed nation without a patron (Denmark) or without selling its land/resources to a great power (USA, China). And the population is only 1/6 the size of Iceland's and is very dispersed on a massive arctic island, with most people living in tiny isolated villages by the coast.
With that in mind, you wouldn't expect great language support unless the Danish state steps in and spends some serious dough on it. I actually work on Danish language technology at the University of Copenhagen and let me tell you something: the Danish state hardly spends any money on Danish language resources either. We envy the kind of funding that researchers in countries like Iceland and Norway have access to.
> the Danish state hardly spends any money on Danish language resources either.
I’m actually a little disappointed that there is not more collaboration between the language departments in Iceland and Greenland. Iceland does spend some money on foreign languages and there is much interest in general for foreign languages in Iceland. The former president Vigdís Finnbogadóttir is a huge language buff and advocates for foreign languages a lot. So much so that the house of foreign languages at the University is named after her (https://vigdis.hi.is/).
It is generally believed in Iceland that setting up Icelandic cultural institutions in Reykjavík played a big part in our independence. Institutions such as the University, libraries and the National Theater. There is also great interest in Greenlandic independence in Iceland. Therefore, it would make sense for a rich country like Iceland to spend some money on advancing the status of Kalaallisut, both in Iceland (through shared cultural events), in Greenland (by helping fund cultural institutions) and internationally (by helping fund online language efforts).
I’m writing this as a separate comment since it is an aside (i.e. not about investments in progressing indigenous languages online).
I don’t think it is wrong to call Greenland a country. As mentioned elsewhere, the word "country" is not strictly defined. Sometimes it means strictly independent nations, but most of the time it doesn’t. E.g. here is the CIA calling Greenland a country (https://www.cia.gov/the-world-factbook/countries/greenland/).
Open-source speech recognition is doing pretty well, with projects such as VOSK, Athena, ESPnet and SpeechBrain.
These days models are the easy part of ML, and data is the hard one. So for Mozilla to focus on Common Voice over DeepSpeech seems reasonable.
You can't really do it because of licensing reasons. One cool thing Common Voice brings to the table, besides all the fantastic data, is the licensing.
Which, it must be said, isn't always as bullet-proof as it could be. There's a not insignificant number of transcription (or pronunciation) errors in those datasets, and Mozilla might want to find ways to increase the quality of already-released data over time.
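One mitigation that is possible today: the released Common Voice metadata includes per-clip validation vote counts, so downstream users can filter for clips with a clear validation margin. Here is a toy sketch with made-up rows (the real releases ship this as TSV files; the field names below mirror the dataset's columns, but the margin heuristic is my own):

```python
# Toy metadata rows; real Common Voice TSVs carry per-clip vote counts.
rows = [
    {"path": "a.mp3", "up_votes": 3, "down_votes": 0},
    {"path": "b.mp3", "up_votes": 2, "down_votes": 0},
    {"path": "c.mp3", "up_votes": 1, "down_votes": 2},
]

def keep(row, margin=2):
    """Keep only clips validated with a clear margin of up-votes."""
    return row["up_votes"] - row["down_votes"] >= margin

clean = [r["path"] for r in rows if keep(r)]
print(clean)  # ['a.mp3', 'b.mp3']
```

Raising the margin trades dataset size for transcript confidence; what threshold works best depends on the model and language.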
Fair use isn't a feature of copyright law in every jurisdiction, which could make this a less useful approach for creating a global corpus of speech data.
This is incorrect. Pretty much every state of the art model uses copyrighted data. This is considered fair use and it has never been a problem outside of concern trolling.
As a lot of that CC text is automatically generated, it seems like you'd just be creating a clone of other software, which might be an intellectual property issue.
Having an open corpus means that researchers building the next thing in voice research - which may or may not follow DeepSpeech - have something to work with. This is enormously important, and their change of direction lets a thousand flowers bloom. Meanwhile, their partnership with Nvidia provides fertile ground to prove the value of the open corpus in action. Nvidia gets access to Mozilla's (presumably superior) ability to build said corpus, while Mozilla lays the foundations for others to contribute work in the open. It is a great example of comparative advantage, and a win-win choice, IMO.
So in other words, we provide data for free to Mozilla, and Mozilla turns around and sells it for millions to Nvidia to fund... not open source, they killed that, so, umm, to fund the CEO's salary?
You seem to imply that Nvidia are paying for data that is freely available.
Anyone can use the Common Voice data within the terms of the license and NVIDIA contributing towards the continued gathering of data (that will continue to be made publicly available) won't change that.
It's a huge shame that Mozilla didn't continue the DeepSpeech project but Coqui is taking on the mantle there and there are plenty of others working on open source solutions too, all whilst the existence of CV will make a big difference to research, in the academic, commercial and open source spheres.
If that was true that would be a profoundly bad purchase for NVidia since the data is already freely licensed and available for anyone to use at no cost.
This is like saying that Epic "bought" Blender when they gave it a development grant, or that Google contributing patches to upstream Linux means they own it now. Mozilla didn't give NVidia any kind of special license, when NVidia contributes data to Common Voice they're doing so under Common Voice's license, not their own.
We want to encourage more companies to treat software and training data as a public commons that is collectively maintained, this is a good thing.
This is silly. Common Voice is not adding NVidia-specific features; what would that even look like for a database? There is no comparison to be made between donating resources to an openly licensed database and encouraging developers to optimize their games for proprietary APIs.
And the assumption the shutting down Deep Speech was specifically for NVidia's benefit seems like a fairly large leap to me, given that Deep Speech is already mature, still being developed under Coqui.ai, and surrounded by a wide diversity of other deep learning projects that also aren't controlled by NVidia.
Decreasing barriers of entry for those models and providing raw data is probably the right thing for Mozilla to be focusing on right now. Any team can build a language model, only companies like Mozilla can coordinate mass data collection for those models.
About their TTS system: "These models provide speech synthesis with ~0.12 real-time factor on a GPU and ~1.02 on a CPU." The quality of the samples is really impressive, but wow, isn't this computationally too expensive for many applications?
>If, for example, it takes 8 hours of computation time to process a recording of duration 2 hours, the real time factor is 4. When the real time factor is 1, the processing is done in real time. It is a hardware-dependent value.
I think real-time factors smaller than 1 are faster than real-time (not slower) and use less than 100% of a resource's computational power to keep up.
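For concreteness, the real-time factor from the quoted definition is just a ratio; here is a minimal sketch (the helper function name is my own):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration.
    RTF < 1 means faster than real time; RTF > 1 means slower."""
    return processing_seconds / audio_seconds

# The quoted example: 8 hours of computation for a 2-hour recording.
print(real_time_factor(8 * 3600, 2 * 3600))  # 4.0
```

So the article's ~0.12 on GPU means synthesis runs roughly 8x faster than playback, while ~1.02 on CPU means it barely fails to keep up with playback.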
Not sure what you're quoting because I didn't write that, but
> I think real-time factors smaller than 1 are faster than real-time (not slower) and use less than 100% of a resource's computational power to keep up.
Sure, but who has the necessary GPUs installed? And on CPUs it will apparently take longer to generate speech than the duration of that speech. Unusable for many UIs and it will also drain the batteries of any portable device.
You're not wrong, but with so many chips incorporating some sort of dedicated "AI" or "tensor" functionality, perhaps the issue will resolve itself for most portable devices in a few years. Plus there's always the option of optimizing a little more and/or abusing other available hardware such as DSP chips to get the real time factor down. Anything over 1 isn't great, but it's not a bad start.
The source code is under a FLOSS license, but it only works on Nvidia GPUs and uses proprietary Nvidia-specific technologies like CUDA.
It's significantly closer to "nonfree" on the free-nonfree spectrum than it should be, and is another example of the difference between the guiding philosophies behind "free software" and "open source".
Can't you run it on CPU? And looking at the code, it seems like they're using Numba to JIT their CUDA kernels, so I guess someone could come along and provide a compatibility shim to make the kernels run on a non-CUDA accelerator?
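As a sketch of what such a shim could look like, here is a toy elementwise operation (my own example, not from the project's codebase) that dispatches to a Numba CUDA kernel when a device is available and falls back to plain NumPy otherwise:

```python
import numpy as np

def _scale_cpu(x, factor):
    # NumPy fallback: works on any machine.
    return x * factor

try:
    from numba import cuda
    if not cuda.is_available():
        raise RuntimeError("no CUDA device available")

    @cuda.jit
    def _scale_kernel(x, factor, out):
        # One thread per array element.
        i = cuda.grid(1)
        if i < x.size:
            out[i] = x[i] * factor

    def scale(x, factor):
        out = np.empty_like(x)
        threads = 128
        blocks = (x.size + threads - 1) // threads
        _scale_kernel[blocks, threads](x, factor, out)
        return out
except Exception:
    # Numba missing or no GPU: use the CPU path transparently.
    scale = _scale_cpu

result = scale(np.arange(4, dtype=np.float32), 2.0)
print(result)  # [0. 2. 4. 6.]
```

Real TTS kernels are of course far more involved than this, so a full compatibility layer for non-CUDA accelerators would be a serious project rather than a drop-in shim.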
I'm sure they signed on to adopting "something"; otherwise it would be receiving a $1.5 million grant for closing an open source initiative. A $3-million-a-year lawyer would never be this blatant.
I don't really have anything of substance to add here, but I'm very happy to see Mozilla continuing to put effort into this, happy to see effort being put into broadening the support beyond just English and major languages, and I'm grateful for the work that people (inside and outside of Mozilla) have already put into getting the project this far.
Everyone has something to add. Go read stuff on their site in the languages you know. If people would actually do it for a few months all languages would have been done years ago.
Just tried rating some of the English voices and I am conflicted.
Most of them were definitely speaking English, but with an Indian intonation that I, coming from a country where English is the first language, was barely able to understand.
Some of them were reading words syllable by syllable, which is definitely English, but I would hate to have to listen to an ebook or webpage read aloud to me in that manner.
By clicking yes am I training the system to speak English with an Indian intonation?
Should I click no, not English?
Should/does English even have a "proper" intonation?
> Be cautious before rejecting a clip on the ground that the reader has mispronounced a word, has put the stress in the wrong place, or has apparently ignored a question mark. There are a wide variety of pronunciations in use around the world, some of which you may not have heard in your local community. Please provide a margin of appreciation for those who may speak differently from you.
> On the other hand, if you think that the reader has probably never come across the word before, and is simply making an incorrect guess at the pronunciation, please reject. If you are unsure, use the skip button.
I think this dataset is mainly for speech recognition and not text to speech. Speech recognition should be able to recognize as many different accents as possible.
I think the reality is that there are more speakers of bad English than native speakers. I speak 2 foreign languages (including English) daily and 2 others occasionally. I know I make mistakes in all of them. In English I don't think I make a lot of pronunciation mistakes (there are some mistakes in grammar for sure). In Finnish I make a lot of pronunciation mistakes, although I speak better than many other non-native speakers. How much that really hurts understanding I have no idea. The amount of misunderstandings between humans does not seem to vary greatly between those languages or even my mother tongue.
Text to speech should work correctly. But speech recognition should tolerate even clear mistakes. Of course not for the price of misunderstanding correct pronunciation.
Wow you're right. This is conflicting as many of the words are not pronounced properly at all. Maybe it doesn't matter to the accuracy of the speech-to-text system, but it feels like training it with bad data.
That's the point! When the postal service has to OCR mailing addresses, they need to handle the messy scribbles more than the professionally printed labels.
Different accents isn't bad data. Your vision of the world of "english is only spoken with an american accent" is what leads to horrendous speech recognition APIs, like Google's.
If your ML model can't handle multiple accents, it is worthless.
There's a difference between an accent and pronouncing words wrong. I would expect an English speech recognition system to handle the various accents there are in the world (the US has several accents of course), but it shouldn't handle incorrect pronunciation of syllables if it comes at the expense of recognizing clean data. If it doesn't come at its expense then I guess it's fine.
Unfortunately, there's always a trade-off. You want both quality data for your use case, but you also want lots of data so it generalizes well. Those are conflicting goals.
Fortunately, splitting models into separate accent-specialized variants and helping them out with language model training will often help in case the model doesn't cope well enough with the cognitive dissonance.
You have a point there. I've been disappointed that Korean has been stuck in the 'In Progress' state. The Korean tech giants already have APIs for common speech recognition tasks. I hope more Korean grassroots efforts focus on tools that are open and accessible, so the ecosystem can be built to scale and improve.
Thank you for pointing it out. I had no idea, but I'd be happy to contribute to this one. There is indeed a decent Korean natural language processing engine, but it's severely tied to its own ecosystem AFAIK.
> Esperanto is a hobby language for upper-middle class people in developed countries.
I wonder what gave you such an impression of Esperanto. My personal experience of Esperanto is quite different.
I started to casually self-learn Esperanto about one year ago as my second foreign language apart from English. After about half a year, I was confident enough to join online Esperanto communities and it gave me a surprisingly much more diverse experience than any community I had encountered on the Internet.
For example, in an online chat group, active users mainly come from the US, South America, and Russia. As a person from East Asia, there is little chance for me to get in touch with the latter two groups otherwise. And there are often new users from South America who speak only Spanish and Esperanto.
I myself do not identify as an upper-middle-class person, and I don't know enough to assess other Esperanto speakers' class status.
The impression of Esperanto speakers being upper-middle class may come from the fact that people learn Esperanto as a hobby. But people outside the upper-middle class have hobbies too, so why would Esperanto be different? It doesn't come with the many benefits that people may expect from learning a "practical" language, but it takes significantly less effort. I'd say it's about as hard as learning a new instrument. So it is not that exclusive to upper-middle-class people.
After one year of casual learning, I am now able to contribute to the Common Voice project in Esperanto (175 recordings and 123 validations) and I actually use it as a source of learning material.
You must be a fast learner. After one year of learning a new language, I personally would not feel comfortable speaking it well enough to use as examples for others.
Thanks to the design of the language, each letter of Esperanto has a fixed pronunciation, and the stress is always on the second-to-last syllable. So after you learn the alphabet and some diphthongs, you are able to pronounce every Esperanto text in the canonical way (even if you don't know a single thing about the meaning). No exception. This is also a great feature for self-learning.
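Since the stress rule is completely regular, finding the stressed syllable is mechanical. A toy sketch (my own code; it treats a/e/i/o/u as syllable nuclei and ignores diphthongs like "aŭ" for simplicity):

```python
VOWELS = set("aeiou")  # in Esperanto, ŭ and j are semivowels, not nuclei

def stressed_vowel_index(word):
    """Return the character index of the stressed vowel: always the
    second-to-last vowel, i.e. the penultimate syllable."""
    positions = [i for i, ch in enumerate(word.lower()) if ch in VOWELS]
    if len(positions) < 2:
        # Monosyllables (or vowel-less tokens) have no stress contrast.
        return positions[0] if positions else None
    return positions[-2]

word = "esperanto"
i = stressed_vowel_index(word)
print(word[:i] + word[i].upper() + word[i + 1:])  # esperAnto
```

A complete implementation would also treat the diphthongs (aŭ, eŭ, aj, oj, ...) as single nuclei, but the point stands: the rule is simple enough to be a few lines of code, with no exception dictionary needed.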
Of course, it takes time to fluently "read out" the words, and in practice, it's much easier if you just know the word and pull the pronunciation from your memory.
For the Common Voice project, there are usually two or three words in a batch of five sentences that I don't know. And there are unfamiliar places and names, since most of the text comes from Wikipedia. In such cases, I'll take my time to use the spelling to infer the correct pronunciation and practice it several times, until I can put it into the sentence. Then I'll record. And I know it must be correct.
If I am not sure about the meaning of the new word (you can usually guess from etymology or word formation), I look it up in the dictionary and learn a new word.
You are not wrong, but besides the upper-middle-class hobbyists, there is also a 130-year-old culture that exists in parallel. I've met a few native Esperanto speakers, and for them Esperanto is their identity. Traditional Esperanto clubs exist in countries like Iran, Japan, China, Burundi, Nigeria and many more. So Esperanto is both: a nerdy hobby and an old culture.
They weren't exclusively talking about Esperanto. I read it as a reference to Kinyarwanda and Catalan more than anything else. In the bigger scheme of things there are a lot of languages here that are definitely a product of being able to share your own language. There's multiple native languages that are being shared here, like the thread above about Guarani.
Esperanto was designed to be easy to learn. It isn't an elite pursuit in the way you suggest, because its community isn't gatekept. I personally have met people of all social classes who have been interested in it.
It was also never meant to be a first language, it is an auxiliary language. It is possible for an English speaker to have a conversation with a Mandarin speaker with no intermediary if both know the (comparatively easy to learn) Esperanto. Its original purpose wasn't trivial either: it was created to stop groups without a common language in the same city (Warsaw, I think?) fighting, created on the basis that they'd stop doing so if only they could speak a common language.
Auxiliary languages are kind of inherently doomed to fail to function as they're intended, because for them to function as such, a commitment needs to be made to adopt them multilaterally by governments with sufficient influence. If the United States and China bilaterally decided today to force Esperanto into their school curricula, it'd likely be adopted very quickly by everyone else, but that isn't the case and I doubt it ever would be under almost any circumstance, because learning English is just immediately more practical, even if it's a significantly more difficult language to pick up.
And that's how it's played out. Nearly every developed nation teaches English as a second language or is a native population of English speakers. The universal language is English. The JVM bytecode for people is English.
I don't have to, you can look at pretty much any of their language curriculum and find a huge presence of English in nearly all their education systems.
Certainly you will find people learning other languages for trade depending on the region, but even in East Asia, as you say, English is taught in China, Japan, Korea. In Singapore English is the language everyone learns (and is taught in). In Vietnam the primary foreign language taught is English. In the Philippines one of the official languages is English. Argentina teaches English in elementary school. In Brazil students from grade 6 have to learn a language, which is usually English. In Venezuela English is taught from age 5.
My takeaway is that nobody should speak English, but instead people should compose their sentences in a different language and then translate them to English at the point of speaking (with small pauses in the conversation for you to collect your thoughts on this garbage).
Ah yes, major world languages with 10s or 100s of millions of speakers (Bengali, Korean, Malayalam) are ignored or are perpetually stuck "in progress" while hobby languages like Esperanto are supported.
Hey, I work on the Esperanto version of CV. You are right, many languages should be bigger than Esperanto, and we never planned to become this big; it just happened. We are around ten active people and a Telegram group with a few hundred motivated donors. Plus, we write about the project in Esperanto magazines and talk about it at Esperanto congresses.
The point is: the only reason Bengali, Korean and Malayalam are stuck "in progress" is that no one is working on them. No language but English is actively supported by Mozilla; it all comes from the communities. And the success of Esperanto shows that every language can make it. I hope that people take our work as a motivation. Every language can become big if a few motivated people work on it for a year or two. Even the smallest language can make it. You just need a lot of public domain sentences, a few thousand donors and some technical knowledge, and then your language will grow as well :)
Sure, I was responding to the facetious comment above.
When I can use Google or Facebook in any of these languages, and have been able to for 10+ years, it's silly of this project to claim some high moral ground when it can't support some of the most widely spoken languages in the world and sticks to languages that hipsters in San Francisco think are cool.
It can support those languages, they just need some people who actually speak them to come along and make it happen. If you can help, I'm sure it will be appreciated.
Let's take the time to appreciate Mozilla's effort. They keep adding new languages, including ones from minority communities; we can't deny that they are continuously putting effort into the community.
The great open source community around Mozilla helps a lot.
When I did not see my own language in the list a year ago, and I had no clue how to get it there, I reached out to my university contacts that I know used to translate Firefox years ago.
With their help we quickly translated the whole common voice site (it was a prerequisite to start contributing a language) and provided first sets of text to start contributing.
In about a week we started contributing voice for a new language. The Common Voice project is awesome and very well made.
Common Voice is a great project that I’m glad Mozilla kept alive.
One problem is that data for speech recognition needs to be extremely accurate (i.e. the speech matches the transcript perfectly), but the human review process is not infallible, and quite a number of bad clips made it past review (to be fair, Mozilla provides no official guidance to reviewers or recorders).
Plus in the early days, they were recording the same small sentence pool over and over again, so the first 700 hours or so are duplicates.
I hope there will be efforts in the future to clean up the existing dataset to improve its quality.
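One cheap first pass at the cleanup described above would be deduplicating by prompt text. This is just an illustration, not part of any official Common Voice tooling; the field names ("sentence", "path") loosely mirror the dataset's TSV columns but are assumptions here.

```python
# Keep only the first clip recorded for each sentence, discarding
# later re-recordings of the same prompt.
def dedupe_by_sentence(clips):
    """clips: iterable of dicts with at least a 'sentence' key."""
    seen = set()
    unique = []
    for clip in clips:
        key = clip["sentence"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(clip)
    return unique

clips = [
    {"path": "a.mp3", "sentence": "The cat sat."},
    {"path": "b.mp3", "sentence": "the cat sat."},   # duplicate prompt
    {"path": "c.mp3", "sentence": "A new sentence."},
]
print([c["path"] for c in dedupe_by_sentence(clips)])  # → ['a.mp3', 'c.mp3']
```

In practice you'd probably want to keep a bounded number of clips per sentence (different speakers are still valuable) rather than exactly one, but the idea is the same.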
I'm an ASR researcher shipping high quality English models trained on limited resources, and while I've needed to include other datasets to make the model more robust to different kinds of text, Common Voice is a substantial part of my training process. I did not do any manual transcript accuracy cleanup. Most of my automated cleanup was done with very basic (low quality) models. My latest models trained this way are competitive with e.g. Google or Apple English speech recognition accuracy.
I'm going to disagree that there's a universal need for perfect training data in ASR. I'm sure it helps with some model types and training processes, but it simply hasn't been a factor in my use of Common Voice (English). I'll also note my best model can hit around 10% WER on Common Voice Test without any language model, which is better than any public numbers I've seen posted for it so far (I'm not even using a separate transformer decoder or RNN decoder layers for this number, just the raw output of CTC greedy decode).
None of the above even factors in techniques like wav2vec and IPL (iterative pseudo labeling) with noisy student, which suggest you can hit extremely competitive accuracy with very little correctly labeled data. These techniques are the underpinnings of the current state of the art models.
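For anyone unfamiliar with the "CTC greedy decode" mentioned above, it's about the simplest decoding scheme there is: take the argmax symbol at each frame, collapse consecutive repeats, then drop the blank symbol. A minimal sketch (symbols here are single characters for readability; real models emit label indices):

```python
BLANK = "_"  # the CTC blank symbol

def ctc_greedy_decode(frame_labels):
    """frame_labels: per-frame argmax symbols, e.g. ['_','c','c','a','_','t']."""
    out = []
    prev = None
    for sym in frame_labels:
        # Emit a symbol only when it differs from the previous frame
        # (collapsing repeats) and is not the blank.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_greedy_decode(["_", "c", "c", "a", "a", "_", "t", "t"]))  # → "cat"
```

Note that a blank between two identical symbols preserves both (e.g. `['a','_','a']` decodes to `"aa"`), which is how CTC represents genuine double letters.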
But you are right, the process has some flaws. Maybe we can review the dataset automatically for some common errors, once an STT system is ready for a language?
The only other option I can think about is a validation process that includes more people per sentence. Right now, only two people validate a sentence, and if they disagree a third person decides. We could at least double check sentences with one "no" vote one more time.
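The vote rule described above (two agreeing votes decide, a third breaks a tie) boils down to a few lines. A sketch of that logic, just to make the process concrete:

```python
# A clip is accepted or rejected once two votes agree; with one "yes"
# and one "no", a third vote decides. True = "valid" vote.
def clip_status(votes):
    yes = sum(votes)
    no = len(votes) - yes
    if yes >= 2:
        return "valid"
    if no >= 2:
        return "invalid"
    return "needs_more_votes"

print(clip_status([True, True]))          # → valid
print(clip_status([True, False]))         # → needs_more_votes
print(clip_status([True, False, False]))  # → invalid
```

Double-checking clips with one "no" vote, as suggested, would just mean routing the `needs_more_votes` and narrowly decided clips to extra reviewers.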
The community guidelines are good, but they're hidden away on the forum. I asked them for years to just make those the official guidelines and link them prominently on the CV site, but they never did.
However, Hillary, the new community manager, seems good and she’s making a lot of positive changes so hopefully this will be addressed soon.
Long-term the best approach may be some kind of user onboarding before they can record / validate.
Why does data for speech recognition need to be perfect? That's certainly not the case for other machine learning applications. Can you train on the less clean data and fine-tune on a clean subset?
I find the recording UI a bit annoying. They make it unnecessarily hard to re-record a clip. Re-recording the previous clip is likely to be a common thing to do. Instead of providing a shortcut for this, they have shortcuts for re-recording each of the five individual clips.
It's also impossible (?) to undo a clip. E.g., if I've already recorded 3 clips and mistakenly begin a clip I simply can't pronounce correctly, there's no way of removing that clip without discarding the whole set. (EDIT: it is possible by re-recording that clip and pressing skip.)
Yeah, I think minor mess-ups are actually good, as long as the words are correct, and so is a bit of background noise. The problem is if Mozilla builds up a dataset of pristine recordings and someone tries to use the resulting model in a noisy room that the ML was never prepared for.
This may be off-topic but: What's the relationship between Coqui (an OSS TTS startup) https://coqui.ai/about and Mozilla? I recall that the project at one point was called mozilla/TTS (https://github.com/mozilla/TTS/) and now I see that has a fork in the startup's own repo (https://github.com/coqui-ai/TTS). Presumably Common Voice is used to train mozilla/TTS and other OSS TTS solutions?
Tips & Tricks incoming... I find that if I can't sleep and want something that's kind of useful to do without getting too involved, contributing to common voice is a great way to spend half an hour and relax/forget whatever it is I was churning about. I would recommend it for that, plus it's a great project. Both listening and voicing...
I've had good results with https://github.com/flashlight/flashlight/blob/master/flashli.... Seems to work well with spoken English in a variety of accents. The biggest limitation is that the architecture they have pretrained models for doesn't really work well with clips longer than ~15 seconds, so you have to segment your input files.
I created edgedict [0] a year ago as part of my side projects. At the time it was the only open source STT with streaming capabilities. If anyone is interested, pretrained weights for English and Chinese are available.
Have used VOSK a bit recently. The out-of-the-box experience was great compared to earlier projects (looking at you, Kaldi and Sphinx...). Word-level audio segmentation was one use case: https://stackoverflow.com/a/65370463/1967571
Thank you. I deeply appreciate your mentioning our efforts. We spend quite some time and knowledge building accurate speech recognition. It's not that easy to get as many mentions as Mozilla, so we are thankful for every single one!
Why on Earth would anyone use an app for this when mobile browsers work perfectly well for adding audio to Common Voice?
We could possibly give the developer the benefit of the doubt that they're not doing anything inappropriate with the data, but frankly, why pass your data through a third party that's not part of the project?
And why install an app requiring access to your shared local storage? The GitHub repo claims the website and animations are slow, which sounds like BS to me. It works fine on the five-year-old phone I use for submitting.
Just contribute here if you're so inclined, much more sensible:
The app has a few nice features the website doesn't have, such as changing the playback speed during validation. It always surprises me as well, but many people hate to use web apps on mobile. I don't really know why; they simply ask for an app and refuse to use a browser.
You aren't distinguishing the projects correctly. The CV project isn't the same as the DeepSpeech project (even though they were related).
And your point makes little sense: if the site were not working, how could the app get voice data into the project? I've had some involvement with these projects over the years, so I'm not just firing off armchair comments on this. They wouldn't have been able to add this new voice data if the site were as underdeveloped as you imply.
Openly licensed speech data for smaller languages is great! I hope as many as possible contribute in order to get better representation across ages and pronunciation. In the end, this may be what is needed for the hyperscale companies to support speech assistants in more languages?
Is voice transcription accessible to mere mortals yet?
I have tried pretty much every API offered by big tech, and also various open source models. All of them seem to have incredibly high word error rates. This is mostly for conversations with various Indian accents.
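For reference, the word error rate everyone quotes is just word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal dynamic-programming sketch, not any particular toolkit's implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the lights", "turn off the light"))  # → 0.5
```

Note WER can exceed 1.0 when the hypothesis inserts many extra words, which is part of why numbers on heavily accented or conversational audio look so bad.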
Interesting! There's a market for this kind of audio data entry? What was the total cost for that many hours? The English data was entirely volunteer driven, correct? Maybe it's worth funding the English corpus for the additional hours needed to reach the sweet spot?
Data costs have plunged these days with self-supervised and semi-supervised learning. You don't need annotated, clean data anymore; there is an abundance of it. Projects like VoxPopuli or GigaSpeech, with 400 thousand hours of data (100 times more than Mozilla's), are easily available.
Many people are also speaking very mechanically when they use a voice assistant, though ;) I believe we need a good mix, but telling people to speak a little more naturally certainly would help.
Thanks so much for sharing your comment. Gender equality in participation in Common Voice is something we really want to improve and champion. As part of the Kiswahili language community engagement, our team is implementing a gender action plan that covers both participation and use cases for the dataset. We hope to consult, adapt, and replicate the gender-inclusion work already done by community members to improve the representation and involvement of all genders in open source projects such as Common Voice.
The ratio of male- to female-tagged voices in the English dataset is 45 percent male to 15 percent female (the remaining 40 percent is untagged), i.e. a 75/25 split among tagged voices. Odds are good that the overall ratio is closer to 75/25 than 50/50, at least by hours of recorded audio.