One of the most noticeable additions in my opinion is Guarani, the first Indigenous language of the Americas to be added. Indigenous languages are extremely poorly supported and forgotten by all of the major platforms and companies, and it's great to see one getting the attention they deserve. (Disclaimer: I was involved)
Whoah, 6.5 million native speakers! That's several orders of magnitude more than I was expecting. It's also significantly larger than the native-speaking populations of languages like Catalan, Basque, or Romansh, which might be more familiar to North Americans or Europeans.
There are a number of Native American languages with many speakers, but until recently they have been marginalized, repressed and ignored (and some are to this day). Guarani has the most speakers, but there are also Quechua, Nahuatl, and the various Mayan languages (spoken by around half of Guatemalans, and another 2.5 million Mexicans).
The problem is not the Spanish language. The problem is a colonial peasant economy/society that turned into a post-colonial peasant society, with land-owning quasi-nobility ruling over disempowered (in this case, indigenous) laborers and freely exercising their power to steal, rape, kill, etc., without penalty; it is a situation more or less comparable to peasant societies around the world and throughout history, which are always very exploitative and often racist.
Working class Spanish speakers living in towns were in many ways also economically exploited, but considered “better than indigenous people” to be a core part of their identities, and also felt free to beat them, steal from them, etc. where they found the opportunity. It’s a situation broadly comparable to race relations in the US south, where poor whites considered “better than blacks” to be a defining part of their identity.
Perhaps counterintuitively, the history of exploitation of indigenous communities, and the way indigenous people were shut out of many social and economic activities, led to the preservation of native languages.
Without wishing to get political, is the difference that Iceland is a country but Guarani speakers don't have a nation-state of their own? Or something else?
Note that Icelandic is currently not well supported either ("In progress" with 384/5000 sentences and 86% Localized). Actually, Guaraní is better supported at the moment, and quite a number of other common smaller-ish languages aren't well supported yet either such as Hebrew, Danish, and even Korean (which is not small or even small-ish at all). Some other smaller languages are, such as Breton or Irish. Overall, it's a bit inconsistent. I suppose that this is because in the end, these things depend on the number of people contributing; there's a reason Esperanto is near the top, as it has a very active community of enthusiasts who love to promote the language.
It takes about a week to get the interface translated and to start collection, for any language with at least 5000 sentences in the public domain. I helped bootstrap Guarani and Breton and a few other languages spoken by friends of mine, but in the end, it just takes one or two people. I think in general there is a big difference in engagement if STT/ASR already exists for the language (e.g. Hebrew, Danish and Korean) and if it doesn't exist at all.
I think this is overly dismissive of other factors. Whether or not a language is supported by something on the Internet has a lot more to do with financial incentives than politics. If there were a huge consumer market clamoring to give their money to a site and the only barrier were language, it'd get exploited pretty quickly.
This is superficially correct, but also completely disingenuous.
The reason why there isn't a huge consumer market for indigenous languages is because they're overwhelmingly systematically unsupported by their respective governments in favor of the non-indigenous colonial languages.
To be clear, that's not Mozilla's fault, and not something they or other random organizations can fix, but as human beings we should all be happy and give credit to those organizations that do their small part.
I don't think that's the entire picture. I live near a part of East Germany that has a minority language community, the Sorbs. Unlike the language communities that you seem to be thinking of, the Sorbian language is actively supported by the government. Protection of Sorbian language and culture is enshrined in the state constitution. All the street signs are bilingual. Sorbian is being taught at school to everyone who wants to learn it.
Yet I have never seen an application that had a Sorbian translation, for one simple reason: Every Sorb also speaks German, so there is no financial incentive to invest in a Sorbian translation. The only things with Sorbian translations are those produced in the local area, e.g. the websites of local governments or local businesses.
No, it has a lot to do with politics as well. A sovereign nation may find it important to have its languages supported widely on the internet, so it might put some public funds toward translation efforts and voice recognition/speech synthesizer contributions.
I know the Icelandic government spends some money on this and it shows. This tiny language has way more support than other, far more widely spoken languages. If the Norwegian government wanted, I bet the Sámi languages could have just as good support as Icelandic. Or if the Greenlandic government had more funds available, I bet we would see Kalaallisut in more places online.
What you are saying is that a small, relatively rich country can invest in supporting its own language: that, to me, is not political but, as raised previously, financial. It's also a good incentive for other big players (Google, Microsoft, Apple) to invest in a language that has prospective customers willing to spend more.
Serbian government would certainly support Serbian language voice recognition and synthesis, but probably not with as much money as Iceland would.
> Politics (from Greek: Πολιτικά, politiká, 'affairs of the cities') is the set of activities that are associated with making decisions in groups, or other forms of power relations between individuals, such as the distribution of resources or status.
It certainly sounds like a political situation to me, almost to the point of tautology. The fact that these decisions were made on the basis of financial gain doesn't make them any less political.
The Norwegian government and the Sámi parliament put a lot of effort into language technology for the Sámi languages. A big problem is lack of openness in platform support; e.g. Google and Apple make it very difficult for external developers to do localisation.
I'm sure having a nation-state is a major factor, but I bet it also has to do with the average wealth, geographic location, historical alliances. However, I'd put my money on skin color as the biggest factor.
As an example in favor of your conclusion, I propose Greenlandic. Geographically really close to Iceland, the sole official language of an autonomous country, with significant cultural heritage (even a famous [possible] dwarf planet is named after one of their historic gods). However—unlike Iceland—Greenland is not a wealthy country, and Greenlanders tend to have darker skin than Icelanders.
You are being overly pedantic. "Country" is not a strictly defined term. Sometimes it is used as you imply here, for sovereign states, which Greenland is not, but often it is used for other political entities as well. E.g. you often hear people speak of Puerto Rico as a country, and you also hear people from the UK pride themselves on being a country of countries (the former singular "country" meaning a sovereign state, and the latter plural "countries" something else).
If the UK is a country of countries, then Greenland is most certainly an autonomous country. The Wikipedia article for Greenland mentions the word "country" 14 times, so I'm certainly not the only one using the term this way.
>It is one of the official languages of Paraguay (along with Spanish), where it is spoken by the majority of the population, and where half of the rural population is monolingual.
As an Icelander I am always really impressed with how well my language—a language spoken by a few hundred thousand people worldwide—is supported on various platforms and technologies. This is probably in no small part thanks to active participation by native speakers and even some government funding.
However, at the same time, I'm also deeply disappointed by the lack of support for Iceland's closest neighbour's language—Greenlandic—which is an indigenous language and the sole official language of an autonomous country.
I saw the same when I was younger for Norwegian. Bokmål is the most commonly written form of Norwegian, but New Norwegian is used by about ~15%. Most software included Bokmål support, but you could bet some hardcore user of New Norwegian had made a language pack available as well.
For Mozilla Common Voice, it looks like even Bokmål isn't listed as a dataset yet. Language packs have the advantage that a single dedicated user can come up with the entire thing, but for voice collections you need a large variety of different people and ideally tons of them. For any language with a small native speaker population, even a rich one like Norway's and especially a fractional subset like Nynorsk, getting enough speakers to participate in open source collection efforts will remain a challenge. Purportedly, even for commercial companies it's hard to find enough Norwegians willing to speak a few sentences for a nominal payment, unlike in most other countries.
Luckily, speech recognition research is making some good progress on dealing with low-resource languages so hopefully we'll see some acceptable models made from the little available open data that's out there.
> However, at the same time, I'm also deeply disappointed by the lack of support for Iceland's closest neighbour's language—Greenlandic—which is an indigenous language and the sole official language of an autonomous country.
I'm not sure "autonomous country" is an accurate description of what Greenland is. It is - for all intents and purposes - a devolved region of Denmark. It is still way too reliant on economic aid to be able to be independent and, honestly, probably couldn't exist as a developed nation without a patron (Denmark) or without selling its land/resources to a great power (USA, China). And the population is only 1/6 the size of Iceland's and is very dispersed on a massive arctic island, with most people living in tiny isolated villages by the coast.
With that in mind, you wouldn't expect great language support unless the Danish state steps in and spends some serious dough on it. I actually work on Danish language technology at the University of Copenhagen and let me tell you something: the Danish state hardly spends any money on Danish language resources either. We envy the kind of funding that researchers in countries like Iceland and Norway have access to.
> the Danish state hardly spends any money on Danish language resources either.
I’m actually a little disappointed that there is not more collaboration between the language departments in Iceland and Greenland. Iceland does spend some money on foreign languages and there is much interest in general for foreign languages in Iceland. The former president Vigdís Finnbogadóttir is a huge language buff and advocates for foreign languages a lot. So much so that the house of foreign languages at the University is named after her (https://vigdis.hi.is/).
It is generally believed in Iceland that setting up Icelandic cultural institutions in Reykjavík played a big part in our independence. Institutions such as the University, libraries and the National Theater. There is also great interest in Greenlandic independence in Iceland. Therefore, it would make sense for a rich country like Iceland to spend some money on advancing the status of Kalaallisut, both in Iceland (through shared cultural events), in Greenland (by helping fund cultural institutions) and internationally (by helping fund online language efforts).
I’m writing this as a separate comment since it is an aside (i.e. not about investments in progressing indigenous languages online).
I don’t think it is wrong to call Greenland a country. As mentioned elsewhere, the word "country" is not strictly defined. Sometimes it means strictly independent nations, but most of the time it doesn’t. E.g. here is the CIA calling Greenland a country (https://www.cia.gov/the-world-factbook/countries/greenland/).
Open-source speech recognition is doing pretty well, with projects such as VOSK, Athena, ESPnet and SpeechBrain.
These days models are the easy part of ML, and data is the hard one. So for Mozilla to focus on Common Voice over DeepSpeech seems reasonable.
You can't really do it because of licensing reasons. One cool thing Common Voice brings to the table, besides all the fantastic data, is the licensing.
Which, it must be said, isn't always as bullet-proof as it could be. There's a not insignificant number of transcription (or pronunciation) errors in those datasets, and Mozilla might want to find ways to increase the quality of already-released data over time.
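One mitigation that is possible today: the released Common Voice metadata includes per-clip validation vote counts, so downstream users can filter for clips with a clear validation margin. Here is a toy sketch with made-up rows (the real releases ship this as TSV files; the field names below mirror the dataset's columns, but the margin heuristic is my own):

```python
# Toy metadata rows; real Common Voice TSVs carry per-clip vote counts.
rows = [
    {"path": "a.mp3", "up_votes": 3, "down_votes": 0},
    {"path": "b.mp3", "up_votes": 2, "down_votes": 0},
    {"path": "c.mp3", "up_votes": 1, "down_votes": 2},
]

def keep(row, margin=2):
    """Keep only clips validated with a clear margin of up-votes."""
    return row["up_votes"] - row["down_votes"] >= margin

clean = [r["path"] for r in rows if keep(r)]
print(clean)  # ['a.mp3', 'b.mp3']
```

Raising the margin trades dataset size for transcript confidence; what threshold works best depends on the model and language.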
Fair use isn't a feature of copyright law in every jurisdiction, which could make this a less useful approach for creating a global corpus of speech data.
This is incorrect. Pretty much every state of the art model uses copyrighted data. This is considered fair use and it has never been a problem outside of concern trolling.
As a lot of that CC text is automatically generated, it seems like you'd just be creating a clone of other software, which might be an intellectual property issue.
Having an open corpus means that researchers building the next thing in voice research - which may or may not follow DeepSpeech - have something to work with. This is enormously important, and their change of direction lets a thousand flowers bloom. Meanwhile, their partnership with Nvidia provides fertile ground to prove the value of the open corpus in action. Nvidia gets access to Mozilla's (presumably superior) ability to build said corpus, while Mozilla lays the foundations for others to contribute work in the open. It is a great example of comparative advantage, and a win-win choice, IMO.
So in other words, we provide data for free to Mozilla, and Mozilla turns around and sells it for millions to Nvidia to fund... not open source, they killed that, so, umm, to fund the CEO's salary?
You seem to imply that Nvidia are paying for data that is freely available.
Anyone can use the Common Voice data within the terms of the license and NVIDIA contributing towards the continued gathering of data (that will continue to be made publicly available) won't change that.
It's a huge shame that Mozilla didn't continue the DeepSpeech project but Coqui is taking on the mantle there and there are plenty of others working on open source solutions too, all whilst the existence of CV will make a big difference to research, in the academic, commercial and open source spheres.
If that was true that would be a profoundly bad purchase for NVidia since the data is already freely licensed and available for anyone to use at no cost.
This is like saying that Epic "bought" Blender when they gave it a development grant, or that Google contributing patches to upstream Linux means they own it now. Mozilla didn't give NVidia any kind of special license, when NVidia contributes data to Common Voice they're doing so under Common Voice's license, not their own.
We want to encourage more companies to treat software and training data as a public commons that is collectively maintained, this is a good thing.
This is silly. Common Voice is not adding NVidia-specific features; what would that even look like for a database? There is no comparison to be made between donating resources to an openly licensed database and encouraging developers to optimize their games for proprietary APIs.
And the assumption the shutting down Deep Speech was specifically for NVidia's benefit seems like a fairly large leap to me, given that Deep Speech is already mature, still being developed under Coqui.ai, and surrounded by a wide diversity of other deep learning projects that also aren't controlled by NVidia.
Decreasing barriers of entry for those models and providing raw data is probably the right thing for Mozilla to be focusing on right now. Any team can build a language model, only companies like Mozilla can coordinate mass data collection for those models.
About their TTS system: "These models provide speech synthesis with ~0.12 real-time factor on a GPU and ~1.02 on a CPU." The quality of the samples is really impressive, but wow, isn't this computationally too expensive for many applications?
>If, for example, it takes 8 hours of computation time to process a recording of duration 2 hours, the real time factor is 4. When the real time factor is 1, the processing is done in real time. It is a hardware-dependent value.
I think real-time factors smaller than 1 are faster than real-time (not slower) and use less than 100% of a resource's computational power to keep up.
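For concreteness, the real-time factor from the quoted definition is just a ratio; here is a minimal sketch (the helper function name is my own):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration.
    RTF < 1 means faster than real time; RTF > 1 means slower."""
    return processing_seconds / audio_seconds

# The quoted example: 8 hours of computation for a 2-hour recording.
print(real_time_factor(8 * 3600, 2 * 3600))  # 4.0
```

So the article's ~0.12 on GPU means synthesis runs roughly 8x faster than playback, while ~1.02 on CPU means it barely fails to keep up with playback.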
Not sure what you're quoting because I didn't write that, but
> I think real-time factors smaller than 1 are faster than real-time (not slower) and use less than 100% of a resource's computational power to keep up.
Sure, but who has the necessary GPUs installed? And on CPUs it will apparently take longer to generate speech than the duration of that speech. Unusable for many UIs and it will also drain the batteries of any portable device.
You're not wrong, but with so many chips incorporating some sort of dedicated "AI" or "tensor" functionality, perhaps the issue will resolve itself for most portable devices in a few years. Plus there's always the option of optimizing a little more and/or abusing other available hardware such as DSP chips to get the real time factor down. Anything over 1 isn't great, but it's not a bad start.
The source code is under a FLOSS license, but it only works on Nvidia GPUs and uses proprietary Nvidia-specific technologies like CUDA.
It's significantly closer to "nonfree" on the free-nonfree spectrum than it should be, and is another example of the difference between the guiding philosophies behind "free software" and "open source".
Can't you run it on CPU? And looking at the code, it seems like they're using Numba to JIT their CUDA kernels, so I guess someone could come along and provide a compatibility shim to make the kernels run on a non-CUDA accelerator?
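As a sketch of what such a shim could look like, here is a toy elementwise operation (my own example, not from the project's codebase) that dispatches to a Numba CUDA kernel when a device is available and falls back to plain NumPy otherwise:

```python
import numpy as np

def _scale_cpu(x, factor):
    # NumPy fallback: works on any machine.
    return x * factor

try:
    from numba import cuda
    if not cuda.is_available():
        raise RuntimeError("no CUDA device available")

    @cuda.jit
    def _scale_kernel(x, factor, out):
        # One thread per array element.
        i = cuda.grid(1)
        if i < x.size:
            out[i] = x[i] * factor

    def scale(x, factor):
        out = np.empty_like(x)
        threads = 128
        blocks = (x.size + threads - 1) // threads
        _scale_kernel[blocks, threads](x, factor, out)
        return out
except Exception:
    # Numba missing or no GPU: use the CPU path transparently.
    scale = _scale_cpu

result = scale(np.arange(4, dtype=np.float32), 2.0)
print(result)  # [0. 2. 4. 6.]
```

Real TTS kernels are of course far more involved than this, so a full compatibility layer for non-CUDA accelerators would be a serious project rather than a drop-in shim.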
I'm sure they signed on to adopting "something"; otherwise it would be receiving a $1.5 million grant for closing an open source initiative. A $3-million-a-year lawyer would never be this blatant.
I don't really have anything of substance to add here, but I'm very happy to see Mozilla continuing to put effort into this, happy to see effort being put into broadening the support beyond just English and major languages, and I'm grateful for the work that people (inside and outside of Mozilla) have already put into getting the project this far.
Everyone has something to add. Go read stuff on their site in the languages you know. If people would actually do it for a few months all languages would have been done years ago.
Just tried rating some of the English voices and I am conflicted.
Most of them were definitely speaking English, but with an Indian intonation that I, coming from a country where English is the first language, was barely able to understand.
Some of them were reading words syllable by syllable, which is definitely English, but I would hate to have to listen to an ebook or webpage read aloud to me in that manner.
By clicking yes am I training the system to speak English with an Indian intonation?
Should I click no, not English?
Should/does English even have a "proper" intonation?
> Be cautious before rejecting a clip on the ground that the reader has mispronounced a word, has put the stress in the wrong place, or has apparently ignored a question mark. There are a wide variety of pronunciations in use around the world, some of which you may not have heard in your local community. Please provide a margin of appreciation for those who may speak differently from you.
> On the other hand, if you think that the reader has probably never come across the word before, and is simply making an incorrect guess at the pronunciation, please reject. If you are unsure, use the skip button.
I think this dataset is mainly for speech recognition and not text to speech. Speech recognition should be able to recognize as many different accents as possible.
I think the reality is that there are more speakers of bad English than native speakers. I speak 2 foreign languages (including English) daily and 2 others occasionally. I know I make mistakes in all of them. In English I don't think I make a lot of pronunciation mistakes (there are some mistakes in grammar for sure). In Finnish I make a lot of pronunciation mistakes, although I speak better than many other non-native speakers. How much that really hurts understanding I have no idea. The amount of misunderstandings between humans does not seem to vary greatly between those languages or even my mother tongue.
Text to speech should work correctly. But speech recognition should tolerate even clear mistakes. Of course not for the price of misunderstanding correct pronunciation.
Wow you're right. This is conflicting as many of the words are not pronounced properly at all. Maybe it doesn't matter to the accuracy of the speech-to-text system, but it feels like training it with bad data.
That's the point! When the postal service has to OCR mailing addresses, they need to handle the messy scribbles more than the professionally printed labels.
Different accents isn't bad data. Your vision of the world of "english is only spoken with an american accent" is what leads to horrendous speech recognition APIs, like Google's.
If your ML model can't handle multiple accents, it is worthless.
There's a difference between an accent and pronouncing words wrong. I would expect an English speech recognition system to handle the various accents there are in the world (the US has several accents of course), but it shouldn't handle incorrect pronunciation of syllables if it comes at the expense of recognizing clean data. If it doesn't come at its expense then I guess it's fine.
Unfortunately, there's always a trade-off. You want both quality data for your use case, but you also want lots of data so it generalizes well. Those are conflicting goals.
Fortunately, splitting models into separate accent-specialized variants and helping them out with language model training will often help in case the model doesn't cope well enough with the cognitive dissonance.
You have a point there. I've been disappointed that Korean has been stuck in the 'In Progress' state. The Korean tech giants already have APIs for common speech recognition tasks. I hope more Korean grassroots efforts focus on tools that are open and accessible, so the ecosystem can be built to scale and improve.
Thank you for pointing it out. I had no idea, but I'd be happy to contribute to this one. There is indeed a decent Korean natural language processing engine, but it's severely tied to its own ecosystem AFAIK.
> Esperanto is a hobby language for upper-middle class people in developed countries.
I wonder what gave you such an impression of Esperanto. My personal experience of Esperanto is quite different.
I started to casually self-learn Esperanto about one year ago as my second foreign language apart from English. After about half a year, I was confident enough to join online Esperanto communities and it gave me a surprisingly much more diverse experience than any community I had encountered on the Internet.
For example, in an online chat group, active users mainly come from the US, South America, and Russia. As a person from East Asia, there is little chance for me to get in touch with the latter two groups otherwise. And there are often new users from South America who speak only Spanish and Esperanto.
I myself do not identify as an upper-middle-class person, and I don't know enough to assess other Esperanto speakers' class status.
The impression of Esperanto speakers being upper-middle class may come from the fact that people learn Esperanto as a hobby. But people outside the upper-middle class have hobbies too, so why would Esperanto be different? It doesn't come with the many benefits that people may expect from learning a "practical" language, but it takes significantly less effort. I'd say it's about as hard as learning a new instrument. So it is not that exclusive to upper-middle-class people.
After one year of casual learning, I am now able to contribute to the Common Voice project in Esperanto (175 recordings and 123 validations) and I actually use it as a source of learning material.
You must be a fast learner. After one year of learning a new language, I personally would not feel comfortable speaking it well enough to use as examples for others.
Thanks to the design of the language, each letter of Esperanto has a fixed pronunciation, and the stress is always on the second-to-last syllable. So after you learn the alphabet and some diphthongs, you are able to pronounce every Esperanto text in the canonical way (even if you don't know a single thing about the meaning). No exception. This is also a great feature for self-learning.
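Since the stress rule is completely regular, finding the stressed syllable is mechanical. A toy sketch (my own code; it treats a/e/i/o/u as syllable nuclei and ignores diphthongs like "aŭ" for simplicity):

```python
VOWELS = set("aeiou")  # in Esperanto, ŭ and j are semivowels, not nuclei

def stressed_vowel_index(word):
    """Return the character index of the stressed vowel: always the
    second-to-last vowel, i.e. the penultimate syllable."""
    positions = [i for i, ch in enumerate(word.lower()) if ch in VOWELS]
    if len(positions) < 2:
        # Monosyllables (or vowel-less tokens) have no stress contrast.
        return positions[0] if positions else None
    return positions[-2]

word = "esperanto"
i = stressed_vowel_index(word)
print(word[:i] + word[i].upper() + word[i + 1:])  # esperAnto
```

A complete implementation would also treat the diphthongs (aŭ, eŭ, aj, oj, ...) as single nuclei, but the point stands: the rule is simple enough to be a few lines of code, with no exception dictionary needed.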
Of course, it takes time to fluently "read out" the words, and in practice, it's much easier if you just know the word and pull the pronunciation from your memory.
For the Common Voice project, there are usually two or three words in a batch of five sentences that I don't know. And there are unfamiliar places and names, since most of the text comes from Wikipedia. In such cases, I'll take my time to use the spelling to infer the correct pronunciation and practice it several times, until I can put it into the sentence. Then I'll record. And I know it must be correct.
If I am not sure about the meaning of the new word (you can usually guess from etymology or word formation), I look it up in the dictionary and learn a new word.
You are not wrong, but besides the upper-middle-class hobbyists, there is also a 130-year-old culture that exists in parallel. I've met a few native Esperanto speakers, and for them Esperanto is their identity. Traditional Esperanto clubs exist in countries like Iran, Japan, China, Burundi, Nigeria and many more. So Esperanto is both: a nerdy hobby and an old culture.
They weren't exclusively talking about Esperanto. I read it as a reference to Kinyarwanda and Catalan more than anything else. In the bigger scheme of things there are a lot of languages here that are definitely a product of being able to share your own language. There's multiple native languages that are being shared here, like the thread above about Guarani.
Esperanto was designed to be easy to learn. It isn't an elite pursuit in the way you suggest, because its community isn't gatekept. I personally have met people of all social classes who have been interested in it.
It was also never meant to be a first language, it is an auxiliary language. It is possible for an English speaker to have a conversation with a Mandarin speaker with no intermediary if both know the (comparatively easy to learn) Esperanto. Its original purpose wasn't trivial either: it was created to stop groups without a common language in the same city (Warsaw, I think?) fighting, created on the basis that they'd stop doing so if only they could speak a common language.
Auxiliary languages are kind of inherently doomed to fail to function as they're intended, because for them to function as such, a commitment needs to be made to adopt them multilaterally by governments with sufficient influence. If the United States and China bilaterally decided today to force Esperanto into their school curricula, it'd likely be adopted very quickly by everyone else, but that isn't the case and I doubt it ever would be under almost any circumstance, because learning English is just immediately more practical, even if it's a significantly more difficult language to pick up.
And that's how it's played out. Nearly every developed nation teaches English as a second language or is a native population of English speakers. The universal language is English. The JVM bytecode for people is English.
I don't have to, you can look at pretty much any of their language curriculum and find a huge presence of English in nearly all their education systems.
Certainly you will find people learning other languages for trade depending on the region, but even in East Asia, as you say, English is taught in China, Japan, Korea. In Singapore English is the language everyone learns (and is taught in). In Vietnam the primary foreign language taught is English. In the Philippines one of the official languages is English. Argentina teaches English in elementary school. In Brazil students from grade 6 have to learn a language, which is usually English. In Venezuela English is taught from age 5.
My takeaway is that nobody should speak English, but instead people should compose their sentences in a different language and then translate them to English at the point of speaking (with small pauses in the conversation for you to collect your thoughts on this garbage).
Ah yes, major world languages with 10s or 100s of millions of speakers (Bengali, Korean, Malayalam) are ignored or are perpetually stuck "in progress" while hobby languages like Esperanto are supported.
Hey, I work on the Esperanto version of CV. You are right, many languages should be bigger than Esperanto, and we never planned to become this big; it just happened. We are around ten active people and a Telegram group with a few hundred motivated donors. Plus, we write about the project in Esperanto magazines and talk about it at Esperanto congresses.
The point is: the only reason Bengali, Korean and Malayalam are stuck "in progress" is that no one is working on them. No language but English is actively supported by Mozilla; it all comes from the communities. And the success of Esperanto shows that every language can make it. I hope that people take our work as a motivation. Every language can become big if a few motivated people work on it for a year or two. Even the smallest language can make it. You just need a lot of public domain sentences, a few thousand donors and some technical knowledge, and then your language will grow as well :)
Sure, I was responding to the facetious comment above.
When I can use Google or Facebook in any of these languages, and have been able to for 10+ years, it's silly of this project to claim some high moral ground when it can't support some of the most widely spoken languages in the world and sticks to languages that hipsters in San Francisco think are cool.
It can support those languages, they just need some people who actually speak them to come along and make it happen. If you can help, I'm sure it will be appreciated.
Let's take the time to appreciate Mozilla's effort. They keep adding new languages, including ones from minority communities; we can't deny that they are continuously putting effort into the community.
The great open source community around Mozilla helps a lot.
When I did not see my own language in the list a year ago, and I had no clue how to get it there, I reached out to my university contacts that I know used to translate Firefox years ago.
With their help we quickly translated the whole common voice site (it was a prerequisite to start contributing a language) and provided first sets of text to start contributing.
In about a week we started contributing voice for a new language. The Common Voice project is awesome and very well made.
Common Voice is a great project that I’m glad Mozilla kept alive.
One problem is that data for speech recognition needs to be extremely accurate (i.e. the speech matches the transcript perfectly), but the human review process is not infallible, and quite a number of bad clips made it past review (to be fair, Mozilla provides no official guidance to reviewers or recorders).
Plus in the early days, they were recording the same small sentence pool over and over again, so the first 700 hours or so are duplicates.
I hope there will be efforts in the future to clean up the existing dataset to improve its quality.
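One cheap first pass at the cleanup described above would be deduplicating by prompt text. This is just an illustration, not part of any official Common Voice tooling; the field names ("sentence", "path") loosely mirror the dataset's TSV columns but are assumptions here.

```python
# Keep only the first clip recorded for each sentence, discarding
# later re-recordings of the same prompt.
def dedupe_by_sentence(clips):
    """clips: iterable of dicts with at least a 'sentence' key."""
    seen = set()
    unique = []
    for clip in clips:
        key = clip["sentence"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(clip)
    return unique

clips = [
    {"path": "a.mp3", "sentence": "The cat sat."},
    {"path": "b.mp3", "sentence": "the cat sat."},   # duplicate prompt
    {"path": "c.mp3", "sentence": "A new sentence."},
]
print([c["path"] for c in dedupe_by_sentence(clips)])  # → ['a.mp3', 'c.mp3']
```

In practice you'd probably want to keep a bounded number of clips per sentence (different speakers are still valuable) rather than exactly one, but the idea is the same.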
I'm an ASR researcher shipping high quality English models trained on limited resources, and while I've needed to include other datasets to make the model more robust to different kinds of text, Common Voice is a substantial part of my training process. I did not do any manual transcript accuracy cleanup. Most of my automated cleanup was done with very basic (low quality) models. My latest models trained this way are competitive with e.g. Google or Apple English speech recognition accuracy.
I'm going to disagree that there's a universal need for perfect training data in ASR. I'm sure it helps with some model types and training processes, but it simply hasn't been a factor in my use of Common Voice (English). I'll also note my best model can hit around 10% WER on Common Voice Test without any language model, which is better than any public numbers I've seen posted for it so far (I'm not even using a separate transformer decoder or RNN decoder layers for this number, just the raw output of CTC greedy decode).
None of the above even factors in techniques like wav2vec and IPL (iterative pseudo labeling) with noisy student, which suggest you can hit extremely competitive accuracy with very little correctly labeled data. These techniques are the underpinnings of the current state of the art models.
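For anyone unfamiliar with the "CTC greedy decode" mentioned above, it's about the simplest decoding scheme there is: take the argmax symbol at each frame, collapse consecutive repeats, then drop the blank symbol. A minimal sketch (symbols here are single characters for readability; real models emit label indices):

```python
BLANK = "_"  # the CTC blank symbol

def ctc_greedy_decode(frame_labels):
    """frame_labels: per-frame argmax symbols, e.g. ['_','c','c','a','_','t']."""
    out = []
    prev = None
    for sym in frame_labels:
        # Emit a symbol only when it differs from the previous frame
        # (collapsing repeats) and is not the blank.
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_greedy_decode(["_", "c", "c", "a", "a", "_", "t", "t"]))  # → "cat"
```

Note that a blank between two identical symbols preserves both (e.g. `['a','_','a']` decodes to `"aa"`), which is how CTC represents genuine double letters.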
But you are right, the process has some flaws. Maybe we can review the dataset automatically for some common errors, once an STT system is ready for a language?
The only other option I can think about is a validation process that includes more people per sentence. Right now, only two people validate a sentence, and if they disagree a third person decides. We could at least double check sentences with one "no" vote one more time.
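The vote rule described above (two agreeing votes decide, a third breaks a tie) boils down to a few lines. A sketch of that logic, just to make the process concrete:

```python
# A clip is accepted or rejected once two votes agree; with one "yes"
# and one "no", a third vote decides. True = "valid" vote.
def clip_status(votes):
    yes = sum(votes)
    no = len(votes) - yes
    if yes >= 2:
        return "valid"
    if no >= 2:
        return "invalid"
    return "needs_more_votes"

print(clip_status([True, True]))          # → valid
print(clip_status([True, False]))         # → needs_more_votes
print(clip_status([True, False, False]))  # → invalid
```

Double-checking clips with one "no" vote, as suggested, would just mean routing the `needs_more_votes` and narrowly decided clips to extra reviewers.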
The community guidelines are good, but they're hidden away on the forum. I asked them for years to just make those the official guidelines and link them prominently on the CV site, but they never did.
However, Hillary, the new community manager, seems good and she’s making a lot of positive changes so hopefully this will be addressed soon.
Long-term the best approach may be some kind of user onboarding before they can record / validate.
Why does data for speech recognition need to be perfect? That's certainly not the case for other machine learning applications. Can you train on the less clean data and fine-tune on a clean subset?
I find the recording UI a bit annoying. They make it unnecessarily hard to re-record a clip. Re-recording the previous clip is likely to be a common thing to do. Instead of providing a shortcut for this, they have shortcuts for re-recording each of the five individual clips.
It's also impossible (?) to undo a clip. E.g., if I've already recorded 3 clips and mistakenly begin a clip I simply can't pronounce correctly, there's no way of removing that clip without discarding the whole set. (EDIT: it is possible by re-recording that clip and pressing skip.)
Yeah, I think minor mess-ups are actually good, as long as the words are correct, and so is a bit of background noise. The problem is if Mozilla builds up a dataset of pristine recordings and someone tries to use the resulting model in a noisy room that the ML was never prepared for.
This may be off-topic but: What's the relationship between Coqui (an OSS TTS startup) https://coqui.ai/about and Mozilla? I recall that the project at one point was called mozilla/TTS (https://github.com/mozilla/TTS/) and now I see that has a fork in the startup's own repo (https://github.com/coqui-ai/TTS). Presumably Common Voice is used to train mozilla/TTS and other OSS TTS solutions?
Tips & Tricks incoming... I find that if I can't sleep and want something that's kind of useful to do without getting too involved, contributing to common voice is a great way to spend half an hour and relax/forget whatever it is I was churning about. I would recommend it for that, plus it's a great project. Both listening and voicing...
I've had good results with https://github.com/flashlight/flashlight/blob/master/flashli.... Seems to work well with spoken English in a variety of accents. The biggest limitation is that the architecture they have pretrained models for doesn't really work well with clips longer than ~15 seconds, so you have to segment your input files.
I created edgedict [0] a year ago as part of my side projects. At the time it was the only open source STT with streaming capabilities. If anyone is interested, pretrained weights for English and Chinese are available.
Have used VOSK a bit recently. The out-of-the-box experience was great compared to earlier projects (looking at you, Kaldi and Sphinx...). Word-level audio segmentation was one use case: https://stackoverflow.com/a/65370463/1967571
Thank you. I deeply appreciate your mentioning our efforts. We spend quite some time and knowledge building accurate speech recognition. It's not that easy to get as many mentions as Mozilla, so we are thankful for every single one!
Why on Earth would anyone use an app for this when mobile browsers work perfectly well for adding audio to Common Voice?
We could possibly give the developer the benefit of the doubt that they're not doing anything inappropriate with the data, but frankly, why pass your data through a third party that's not part of the project?
And why install an app requiring access to your shared local storage? The GitHub repo claims the website and animations are slow, which sounds like BS to me. It works fine on the five-year-old phone I use for submitting.
Just contribute here if you're so inclined, much more sensible:
The app has a few nice features the website doesn't have, such as changing the playback speed during validation. It always surprises me as well, but many people hate to use web apps on mobile. I don't really know why; they simply ask for an app and refuse to use a browser.
You aren't distinguishing the projects correctly. The CV project isn't the same as the DeepSpeech project (even though they were related).
And your point makes little sense: if the site were not working, how could the app get voice data into the project? I've had some involvement with these projects over the years, so I'm not just firing off armchair comments on this. They wouldn't have been able to add this new voice data if the site were as underdeveloped as you imply.
Openly licensed speech data for smaller languages is great! I hope as many as possible contribute in order to get better representation across ages and pronunciation. In the end, this may be what is needed for the hyperscale companies to support speech assistants in more languages?
Is voice transcription accessible to mere mortals yet?
I have tried pretty much every API offered by big tech, and also various open source models. All of them seem to have incredibly high word error rates. This is mostly for conversations with various Indian accents.
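For reference, the word error rate everyone quotes is just word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal dynamic-programming sketch, not any particular toolkit's implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the lights", "turn off the light"))  # → 0.5
```

Note WER can exceed 1.0 when the hypothesis inserts many extra words, which is part of why numbers on heavily accented or conversational audio look so bad.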
Interesting! There's a market for this kind of audio data entry? What was the total cost for that many hours? The English data was entirely volunteer driven, correct? Maybe it's worth funding the English corpus for the additional hours needed to reach the sweet spot?
Data costs have plunged these days with self-supervised and semi-supervised learning. You don't need annotated, clean data anymore; there is an abundance of it. Projects like VoxPopuli or GigaSpeech, with 400 thousand hours of data (100 times more than Mozilla's), are easily available.
Many people are also speaking very mechanically when they use a voice assistant, though ;) I believe we need a good mix, but telling people to speak a little more naturally certainly would help.
Thanks so much for sharing your comment. Gender equality in participation in Common Voice is something we really want to improve and champion. As part of the Kiswahili language community engagement, our team is implementing a gender action plan that covers both participation and use cases for the dataset. We hope to consult, adapt, and replicate the gender-inclusion work already done by community members to improve the representation and involvement of all genders in open source projects such as Common Voice.
The ratio of male- to female-tagged voices in the English dataset is 45 percent male to 15 percent female (the remaining 40 percent is untagged), i.e. a 75/25 split among tagged voices. Odds are good that the overall ratio is closer to 75/25 than 50/50, at least by hours of recorded audio.