KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods

Antoine Nzeyimana
University of Massachusetts, Amherst, USA
anzeyimana@umass.edu

Abstract

Despite the recent availability of large transcribed Kinyarwanda speech data, achieving robust speech recognition for Kinyarwanda is still challenging. In this work, we show that using self-supervised pre-training, following a simple curriculum schedule during fine-tuning and using semi-supervised learning to leverage large unlabelled speech data significantly improve speech recognition performance for Kinyarwanda. Our approach focuses on using public domain data only. A new studio-quality speech dataset is collected from a public website, then used to train a clean baseline model. The clean baseline model is then used to rank examples from a more diverse and noisy public dataset, defining a simple curriculum training schedule. Finally, we apply semi-supervised learning to label and learn from large unlabelled data in five successive generations. Our final model achieves 3.2% word error rate (WER) on the new dataset and 15.6% WER on the Mozilla Common Voice benchmark, which is state-of-the-art to the best of our knowledge. Our experiments also indicate that using syllabic rather than character-based tokenization results in better speech recognition performance for Kinyarwanda.

1 Introduction

Expanding access to automatic speech recognition (ASR) technology to the large number of speakers of low resource languages can improve the quality of their interactions with computing and multi-media devices. Moreover, it makes it possible to perform analytics and retrieval on the large amount of speech data produced by these communities. This is more significant for speakers of low resource languages such as Kinyarwanda and other language communities that disproportionately use more oral than written forms of communication.

However, achieving robust speech recognition for a language such as Kinyarwanda is challenging. First, the lack of tone markings in the written form of a tonal language requires a lot more diverse data for speech recognition to work well in practice. Second, a morphologically rich language (MRL) like Kinyarwanda has a very large vocabulary, making it difficult to achieve low word error rates (WER) on public benchmarks. For example, while the data scarcity problem for Kinyarwanda has been largely alleviated, thanks to the Mozilla Common Voice (Ardila et al., 2020) (MCV) crowd-sourcing efforts, off-the-shelf models' accuracy on the MCV test set still has room for improvement. These challenges motivate us to explore language-specific treatments towards more robust speech recognition for Kinyarwanda.

Table 1: Results on Mozilla Common Voice benchmark

Methods                   CER %   WER %
Conformer + CTC            8.4    25.3
+ Self-PT                  6.8    21.2
+ Curriculum               6.4    20.1
+ Semi-SL Gen. 1           5.7    18.0
+ Semi-SL Gen. 5           4.7    15.6
NVIDIA NeMo: CTC           5.5    18.4
NVIDIA NeMo: Transducer    5.7    16.3

Recent advances in deep learning techniques for end-to-end speech recognition and the availability of open source frameworks and datasets allow us to empirically explore different ways to improve ASR performance for Kinyarwanda.
While recent experimental reports and studies (Ravanelli et al., 2021; Ritchie et al., 2022) have shown improvement in ASR for Kinyarwanda, mostly via self-supervised pre-training (Self-PT) representations such as wav2vec 2.0 (Baevski et al., 2020), there has been no exploration of using Kinyarwanda-only speech data for Self-PT pre-training, nor of how to improve performance beyond using Self-PT representations. In this work, we report empirical experiments showing how ASR performance for Kinyarwanda can be improved through Self-PT pre-training on Kinyarwanda-only speech data, following a simple curriculum learning schedule during fine-tuning and using semi-supervised learning (Semi-SL) to leverage large unlabelled data. We also compare the impact of two different tokenization techniques on ASR performance for this particular language.

Our approach focuses on leveraging available public domain data. First, we collect 22,000 hours of Kinyarwanda-only speech data from YouTube and use it for Self-PT pre-training. Then, we fine-tune a clean baseline model on studio-quality transcribed readings from the JW.ORG website (https://www.jw.org/rw/isomero/). The clean baseline model is then used to rank MCV training examples by character error rates (CER), so that we can formulate a curriculum learning schedule for a final model trained on the combined dataset. Since JW.ORG website readings are long and not segmented, we developed an easy-to-use mobile application that annotators used to align audio data to text, yielding about 89 hours of clean, segmented and transcribed speech data. A simple curriculum learning schedule was then devised to train an intermediate model in 6 stages, by successively introducing harder and harder examples to the training process. Additionally, we experiment with semi-supervised learning (Semi-SL) by transcribing and filtering audio segments from our YouTube speech data, then adding this new dataset to our original training dataset and resuming training on the new combined dataset. This process is repeated for five generations. Our final model achieves 15.6% WER on the MCV benchmark version 12 and 3.2% WER on the new dataset from the JW.ORG website.

Specifically, we make the following contributions:

• We empirically show how ASR performance for Kinyarwanda can be improved by leveraging available public domain speech data only, combining three different techniques, namely self-supervised pre-training, curriculum-based fine-tuning and semi-supervised learning.

• We devise a simple mobile design methodology for speech-text alignment to enable collection of labelled utterances from long unaligned speech and text data. This design allows crowd-workers to easily annotate data using mobile devices, making the whole process scalable.

• By exploring two alternative tokenization techniques, we empirically show that using syllabic instead of the more common character-based tokenization leads to better performance in end-to-end ASR for Kinyarwanda.

2 Related work

The primary transcribed speech dataset we used comes from the Mozilla Common Voice (MCV) speech corpus (Ardila et al., 2020). This dataset contains more than 2,000 hours of transcribed Kinyarwanda speech which was collected through a crowd-sourcing effort led by a Kigali-based company called Digital Umuganda (https://digitalumuganda.com).

The secondary public domain speech data we used was downloaded from the Jehovah's Witnesses website, JW.ORG.
Due to the high-quality multi-lingual speech and text data on JW.ORG, the data from this website has been the subject of studies in language technology research, most notably machine translation (Agić and Vulić, 2019). However, due to license restrictions, this data cannot be re-distributed outside of the JW.ORG website. To the best of our knowledge, we are the first to experiment with speech recognition using JW.ORG data.

As the MCV dataset evolved through multiple versions, there have been several studies and experimental results (Ritchie et al., 2022; Ravanelli et al., 2021; Kuchaiev et al., 2019) reported on the dataset. The best results are generally obtained by fine-tuning pre-trained models such as wav2vec 2.0 (Baevski et al., 2020), which are typically pre-trained on large English-only or multi-lingual speech data. Our approach focuses on using moderately sized Kinyarwanda-only speech data downloaded from YouTube.

Our model architecture is inspired by the work in (Zhang et al., 2020), which is based on recent end-to-end speech recognition architectures and techniques including Convolution-augmented Transformers (Conformer) (Gulati et al., 2020), SpecAugment (Park et al., 2020), self-supervised pre-training (Baevski et al., 2020; Hsu et al., 2021) and connectionist temporal classification (CTC) (Graves et al., 2006).

There have been a number of studies of syllable-based tokenization for ASR in different languages (Savithri and Ganesan, 2023; Zhou et al., 2018). However, its effectiveness varies from language to language. To the best of our knowledge, we are the first to investigate the suitability of syllable-based tokenization for Kinyarwanda ASR.

3 Methods

3.1 Collecting studio-quality transcribed utterances via speech-text alignment

Curriculum learning has been shown to improve robustness in automatic speech recognition (Braun et al., 2017). Given that the publicly available MCV dataset is diverse and noisy in many cases, defining a measure for ranking the examples is hard. While one can use the number of up-votes and down-votes provided by the MCV corpus (Ardila et al., 2020), these may not be reliable given that most validated examples have two up-votes and there can only be a maximum of three votes per example. Instead, we chose to use an extrinsic assessment by training a clean baseline model using a studio-quality dataset and then using the clean baseline model to rank MCV examples based on character error rates (CER).

To develop the clean baseline model, we relied on publicly available readings from the JW.ORG website. The readings are made of a web page with a media player playing the content of the page. However, the readings are typically long and the text is not aligned with the utterances in most cases. Therefore, it is not trivial to segment the data for ASR training and there is no automated tool that can reliably do the alignment.

In order to align speech and text data from the JW.ORG website, we developed an easy-to-use mobile application and asked volunteers to align the text to speech segments using the developed application. This required first downloading the HTML pages with the associated audio clips, extracting clean text from the HTML and marking long-enough silences in the audio clips as segmentation boundaries. The clean text, the link to the audio clip and the silence markings (time indices in the audio clip) were then stored as metadata for the mobile application.
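The paper does not specify how the long-enough silences were detected. A minimal sketch of one plausible energy-based approach is shown below, assuming librosa; the 0.7-second minimum pause and the 30 dB threshold are illustrative values, not figures from the paper.

# Sketch: derive candidate segmentation boundaries from long pauses in an audio clip.
# Assumes librosa; the pause length and dB threshold are illustrative, not from the paper.
import librosa

def silence_boundaries(audio_path, min_pause=0.7, top_db=30, sr=16000):
    """Return time indices (in seconds) of pauses longer than min_pause."""
    y, sr = librosa.load(audio_path, sr=sr)
    voiced = librosa.effects.split(y, top_db=top_db)  # non-silent (start, end) sample pairs
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(voiced[:-1], voiced[1:]):
        gap = (next_start - prev_end) / sr
        if gap >= min_pause:
            # Place the boundary in the middle of the pause.
            boundaries.append((prev_end + next_start) / (2 * sr))
    return boundaries

Each returned time index would correspond to one of the pause points at which the mobile application stops playback and asks the annotator to mark the last spoken word.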
During its use, the application would play each audio clip while showing the text and pausing at each silence point for the user to mark the last spoken word in the displayed text. Five Kinyarwanda native speakers then volunteered to annotate the metadata. These annotators were asked to listen to the audio clips and mark the silence points by touching the last spoken word on the touch-screen displaying the text. After the touch, the text segment up to the last touched word is highlighted and then cut out, and the process of listening and touching continues until the end of the audio clip or the end of the text. The index of the last spoken word in the text and the corresponding silence index in the audio clip were then stored on a back-end server and later post-processed to create the final dataset for ASR model training. A screen capture of the mobile interface is shown in Figure 1.

Figure 1: Speech-text alignment mobile interface. The annotator is asked to touch the last word played by the audio clip before the pause. The selected segment is highlighted and then cut out, and the process repeats until the end of the document.

3.2 ASR Model architecture

Automatic speech recognition (ASR) aims at mapping a sequence of speech representations X = {x_i ∈ R^d} to a sequence of text units Y = {y_j ∈ V} in a given language. The modeling problem is then to find the most probable sequence of text units Ŷ among all vocabulary combinations V* ("strings") given a speech input, i.e.:

    Ŷ = argmax_{Y ∈ V*} p(Y | X)    (1)

The current practice in large vocabulary speech recognition is to use end-to-end neural networks for joint optimization instead of designing independent discrete components. One approach treats ASR as a sequence prediction problem (i.e. sequence-to-sequence) and thus uses encoder-decoder architectures (Chan et al., 2016), like machine translation or text summarization. A different approach uses only an encoder to learn a monotonic mapping of encoder outputs to target text units. This is motivated by the fact that speech is mostly produced at a constant speed, where each text unit can be mapped to a specific temporal segment in the input speech. In this case, a loss function based on Connectionist Temporal Classification (CTC) (Graves et al., 2006) can be used to optimize the encoder outputs over all valid token alignments.

Our ASR model architecture is shown in Figure 2. The model is based on the conformer architecture (Gulati et al., 2020) and uses a contrastive objective for self-supervised pre-training in a way closely similar to (Zhang et al., 2020). The input to the model consists of the log mel-spectrogram of an utterance. The input is then processed by a convolutional sub-sampling module which creates a reduced-length input to the rest of the network.

Figure 2: ASR model architecture. A log mel-spectrogram with position embeddings is processed by CNN layers followed by conformer layers; masking and a linear projection feeding a contrastive loss are used only during pre-training, while a linear module with dropout feeding a CTC loss is used only during fine-tuning.

During the pre-training phase, the output of the convolutional sub-sampling module is masked and sent to the conformer layers on one hand, and also projected through a linear layer and then used for contrastive learning. The masking parameters come from Wav2Vec2.0 (Baevski et al., 2020). However, differently from Wav2Vec2.0, we do not use quantization before our contrastive loss calculation. Instead, we use the linear projection layer before contrasting masked positions. This method was proposed by (Zhang et al., 2020) and is sufficient for pre-training our network. During normal training, which we call fine-tuning, both the masking and the contrastive projection are removed and a new projection layer is added on top of the conformer layers for Connectionist Temporal Classification (CTC) (Graves et al., 2006) based training. Differently from (Zhang et al., 2020), our conformer model formulation uses an untied position embedding method called TUPE (Ke et al., 2020), which adds position embedding vectors as a bias to the multi-head self-attention sub-module. We also use a pre-LayerNorm configuration (Nguyen and Salazar, 2019) of the conformer modules.
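To make the fine-tuning objective concrete, the sketch below applies a linear projection and a CTC loss on top of hypothetical encoder outputs in PyTorch (the framework used in this work). The tensor shapes, batch size and vocabulary size are illustrative assumptions, not the authors' code.

# Sketch of the CTC fine-tuning head: a linear projection over the token vocabulary
# (blank + syllable or character units) applied to encoder frames, trained with CTC loss.
# Dimensions and tensor contents are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size = 100            # e.g. blank + syllabic units + punctuation (illustrative)
hidden_dim = 768            # conformer output dimension reported in Section 4.2

projection = nn.Linear(hidden_dim, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# encoder_out: (time, batch, hidden) frames produced by the CNN + conformer encoder.
encoder_out = torch.randn(200, 8, hidden_dim)
log_probs = projection(encoder_out).log_softmax(dim=-1)    # (time, batch, vocab)

targets = torch.randint(1, vocab_size, (8, 40))            # padded token ids (no blanks)
input_lengths = torch.full((8,), 200, dtype=torch.long)
target_lengths = torch.randint(10, 41, (8,), dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()

At inference time, the same per-frame log-probabilities are consumed by the CTC beam search decoder described in Section 4.3.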
3.3 Multi-staged curriculum schedule for training

Curriculum learning (Bengio et al., 2009) has been shown to improve robustness in speech recognition (Braun et al., 2017). We designed a multi-stage curriculum schedule whereby the model learns from cleaner examples earlier than from noisy examples. In our formulation, we follow a coarse-grained multi-stage schedule. At stage 0, we train the model on the JW.ORG-only training set, which contains about 80 hours of data. At each subsequent stage, we double the amount of training examples by adding harder and harder examples from the MCV training set, until the whole MCV training set is covered. We train each intermediate stage for 10 epochs, while the final stage containing all training data is trained for 49 epochs. At each new stage, we keep the model weights while resetting the optimizer state and learning rate schedule.

After completing the training on the JW.ORG and MCV datasets, we explore the use of semi-supervised learning (Semi-SL) by transcribing utterances from the YouTube dataset, adding those examples to the training set and resuming training. This method was also explored in (Zhang et al., 2020), where it is called noisy student training (NST). Our approach follows a curriculum schedule whereby easier examples are added first, based on ranking by an external language model. Specifically, we use our model trained on the JW.ORG and MCV datasets to transcribe a large portion of the YouTube utterances, then rank their transcripts and finally pick the top ranked examples, add them to the training set and resume training. For ranking transcribed YouTube utterances, we use both the CTC beam search (Zenkel et al., 2017) decode scores and log-probabilities produced by an external tokenization-aligned language model. We apply thresholds to both scores to pick the top ranked utterances and their transcriptions. Unlike the NST scheme in (Zhang et al., 2020), we don't restart training from the pre-trained model at each generation. Instead, we resume training from the model trained on JW.ORG and MCV data. In our experiments, we repeated this semi-supervised learning process for four generations to get our final model.
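A minimal sketch of this pseudo-label selection step follows. The decode_fn and lm_logprob_fn callables are hypothetical stand-ins for the CTC beam-search decoder and the external tokenization-aligned language model, and the threshold values are placeholders, not those used in the paper.

# Sketch of pseudo-label selection for one semi-supervised generation.
# decode_fn(utt) -> (transcript, ctc_beam_score) and lm_logprob_fn(text) -> float are
# hypothetical callables; the threshold values below are placeholders.
def select_pseudo_labels(utterances, decode_fn, lm_logprob_fn,
                         min_decode_score=-1.0, min_lm_logprob=-2.0):
    selected = []
    for utt in utterances:
        transcript, decode_score = decode_fn(utt)
        lm_score = lm_logprob_fn(transcript) / max(len(transcript), 1)  # length-normalised
        if decode_score >= min_decode_score and lm_score >= min_lm_logprob:
            selected.append((utt, transcript))
    # The selected (audio, transcript) pairs are added to the labelled training set and
    # training resumes from the JW.ORG + MCV checkpoint for the next generation.
    return selected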
3.4 Syllable-based tokenization

Many end-to-end deep learning models for speech recognition use character-based tokenization. However, like many other Bantu languages, Kinyarwanda only has open syllables (Walli-Sagey, 1986), whereby each syllable ends with a vowel. Also, when Kinyarwanda orthography is taught in elementary school (https://elearning.reb.rw/course/view.php?id=293), students are first introduced to vowels, then simple consonants and finally consonant clusters as basic orthographic units. Therefore, we wanted to explore this syllable-based tokenization, where output tokens are either vowels, simple consonants or consonant clusters, as taught in elementary school Kinyarwanda orthography.

In our experiments, we empirically compare the effectiveness of this syllable-based tokenization against the more common character-based tokenization. Table 2 shows the basic text units of Kinyarwanda orthography which we use in our experiments. In addition to these basic orthographic units, we add the foreign characters 'x' and 'q', which are typically used in foreign words and names. Therefore, the main difference between character- and syllable-based tokenization is that syllable-based tokenization allows consonant clusters in the vocabulary while character-based tokenization doesn't (i.e. it must learn to produce consonant clusters from single consonants during inference). We also add six basic punctuation marks to our vocabulary, namely full stops (.), commas (,), question marks (?), exclamation marks (!), colons (:), and apostrophes ('). This allows our models to produce basic punctuation from speech without the need for a dedicated punctuation restoration model (Tilk and Alumäe, 2015). During evaluation, we omit these punctuation marks to have a fair comparison with other open source models, which don't include punctuation marks.

Table 2: Syllable-based tokenization: Kinyarwanda vowels, consonants and consonant clusters.

Vowels:             i u o a e
Simple consonants:  b c d f g h j k m n p r l s t v y w z
Consonant clusters: bw by cw cy dw fw gw hw kw jw jy ny mw my nw pw py rw ry sw
                    sy tw ty vw vy zw pf ts sh shy mp mb mf mv nc nj nk ng nt nd
                    ns nz nny nyw byw ryw shw tsw pfy mbw mby mfw mpw mpy mvw
                    mvy myw ncw ncy nsh ndw ndy njw njy nkw ngw nsw nsy ntw nty
                    nzw shyw mbyw mvyw nshy nshw nshyw njyw
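To make the difference between the two tokenizations concrete, here is a minimal greedy longest-match tokenizer over these orthographic units. The unit list is abbreviated for illustration (the full inventory is the one in Table 2 plus 'x', 'q' and the punctuation marks), and the longest-match strategy is an assumption, not necessarily the authors' implementation.

# Sketch: greedy longest-match tokenization of a word into Kinyarwanda orthographic units.
# UNITS is abbreviated for illustration; the matching strategy is an assumption.
UNITS = sorted(
    ["a", "e", "i", "o", "u",                                   # vowels
     "b", "c", "d", "f", "g", "h", "j", "k", "m", "n",
     "p", "r", "l", "s", "t", "v", "y", "w", "z",               # simple consonants
     "bw", "by", "cy", "nk", "ng", "nt", "nd", "ny", "sh",
     "shy", "nsh", "nshy"],                                     # sample consonant clusters
    key=len, reverse=True)

def syllabic_tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        for unit in UNITS:
            if word.startswith(unit, i):
                tokens.append(unit)
                i += len(unit)
                break
        else:
            tokens.append(word[i])  # fall back to a single character (e.g. foreign letters)
            i += 1
    return tokens

print(syllabic_tokenize("inshuti"))  # ['i', 'nsh', 'u', 't', 'i']

Under character-based tokenization the same word would instead be emitted as seven single characters, leaving the model to learn consonant clusters implicitly.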
4 Experimental setup

4.1 JW.ORG speech data gathering and text-speech alignment

The text and audio clips from the JW.ORG website were identified and collected through web crawling. This process happened in August 2021, and we identified 792 documents with 139 hours of speech data. The text was extracted from the HTML documents using the jsoup Java library (https://jsoup.org/). The mobile application for speech-text alignment was developed for the Android operating system. Any user of the application was assigned to annotate all documents in a random order. While more than 50 anonymous users attempted to volunteer to participate in the annotation process, only data from five participants were used to compile the final dataset. These included one author of this paper, one trained and paid annotator, two volunteers who are known and related to the author, and another anonymous volunteer. Because of the tedious work involved, only the paid annotator annotated all documents, while the others annotated about 5% of all documents. The other annotators' data was used to evaluate the inter-annotator agreement.

The inter-annotator agreement ratio is calculated as the number of agreeing (silence marker, last spoken word) pairs divided by the total number of silence markers in the commonly annotated documents. This inter-annotator agreement ratio was 90.7% between the author and the paid annotator and 82% between the author and the other annotators. Most of the disagreements happened in handling text in parentheses and biblical references. In the post-processing stage, we removed data segments that were deemed too short (audio length < 2 seconds or text length < 5 characters) or too long (audio length > 30 seconds or text length > 400 characters). We also removed those segments whose number of syllables per second was more than 1.3 standard deviations from the average. Finally, we obtained a clean audio-text dataset totalling 86 hours of speech, which we randomly split into training, validation and test sets in the ratios of 90%, 5% and 5% respectively.
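A small sketch of this post-processing filter is given below. The segment record format and the pre-computed syllable counts are assumptions, but the length bounds and the 1.3-standard-deviation rule follow the text above.

# Sketch of the post-processing filter: drop segments that are too short, too long,
# or whose speaking rate (syllables per second) is an outlier (> 1.3 std dev from the mean).
# `segments` is assumed to be a list of dicts with 'duration' (s), 'text' and 'n_syllables'.
import statistics

def filter_segments(segments):
    kept = [s for s in segments
            if 2.0 <= s["duration"] <= 30.0 and 5 <= len(s["text"]) <= 400]
    rates = [s["n_syllables"] / s["duration"] for s in kept]
    mean, std = statistics.mean(rates), statistics.pstdev(rates)
    return [s for s, r in zip(kept, rates) if abs(r - mean) <= 1.3 * std]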
4.2 ASR model implementation

Our ASR model was implemented with the PyTorch (Paszke et al., 2019) framework, version 1.13.1. The CNN component is made of two layers of 3x3 convolutions with a stride of 2. The conformer component is made of 16 layers of conformer blocks, each with a 768 hidden dimension, 8 attention heads and a 3072 feed-forward dimension. The whole model has about 229 million parameters. The inputs to the model are log mel-spectrograms, which are computed with a 1024-point short-time Fourier transform (STFT) on 25-millisecond raw audio frames (sampled at 16 kHz), with a hop length of 10 milliseconds, and using 80 mel filters.

4.3 Training process

We collected about 22,000 hours of unlabelled YouTube data from 37 channels that publish speech content in Kinyarwanda on various topics, including news, political discussions and social conversations. The data was randomly segmented into 5 to 25 seconds-long segments. During contrastive pre-training, we used a mask probability of 0.5 and a static mask length of 10, in a way similar to Wav2Vec2.0 (Baevski et al., 2020). We used a global batch size of 1.6 hours of data, a peak learning rate of 5e-4, 544,000 training steps, 32,000 warm-up steps and a linear learning rate decay. Pre-training took 25 days on two NVIDIA RTX 4090 GPUs. During ASR model fine-tuning, we used a global batch size of 800 seconds, a peak learning rate of 2e-4, 5,000 warm-up steps and an inverse square root learning rate decay, and the training was done for about 50 epochs in each case. ASR model fine-tuning experiments were done on NVIDIA RTX 3090 GPUs, taking 1.25 seconds per step on one GPU. For decoding, we use the CTC beam search (Zenkel et al., 2017) algorithm, which we implemented in C++. Through experimentation on the development set, we set the beam width to 24.

4.4 Evaluation

We use both character error rates (CER %) and word error rates (WER %) as evaluation metrics. CER is particularly important in this case because Kinyarwanda is a morphologically-rich language and thus tends to have an open vocabulary. Furthermore, as shown in Section 5.3, there exist some writing ambiguities where a Kinyarwanda reader may consider multiple spellings of the same word as legitimate. Therefore, using WER metrics alone is not sufficient. We use the TorchMetrics package (Detlefsen et al., 2022), version 0.11, for our metrics computation.
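As an illustration of this metric computation (not the authors' evaluation script), a minimal TorchMetrics sketch follows; the hypothesis and reference strings reuse the spelling-ambiguity examples listed in Section 5.3.

# Sketch: corpus-level CER and WER with TorchMetrics, the package named above.
# The hypothesis/reference pairs reuse the spelling-ambiguity examples from Section 5.3.
from torchmetrics import CharErrorRate, WordErrorRate

cer_metric = CharErrorRate()
wer_metric = WordErrorRate()

hypotheses = ["mujye", "poritiki", "incuti"]
references = ["muge", "politiki", "inshuti"]

print("CER:", float(cer_metric(hypotheses, references)))
print("WER:", float(wer_metric(hypotheses, references)))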
5 Results and discussion

5.1 Effects of tokenization, self-supervised pre-training and curriculum learning

Our first ablation results are presented in Table 3. We compare character-based tokenization and syllable-based tokenization across three training and evaluation setups: with and without self-supervised pre-training, and finally adding the devised curriculum-based training schedule. These results were obtained after about 50 epochs of training on the JW.ORG and MCV datasets. Overall, syllable-based tokenization consistently achieves lower error rates than character-based tokenization. Using the curriculum learning schedule also produces significantly better results. Without self-supervised pre-training (Self-PT), the error rates of the baseline models increase significantly. We also remark that model performance on the JW.ORG dataset (JW) is much better than on MCV data; this is because JW.ORG data has very little environmental noise.

Table 3: Ablation results: character error rates (CER %) and word error rates (WER %) on the validation (Dev.) and test sets across different training configurations. JW: JW.ORG dataset. MCV: Mozilla Common Voice dataset.

                                      JW Dev.     JW Test     MCV Dev.    MCV Test
Tokenization  Self-PT  Curriculum   CER   WER   CER   WER   CER   WER   CER   WER
Character     -        -            3.8  14.4   3.8  14.2   6.0  20.6   8.5  26.0
Syllable      -        -            3.5  13.1   3.4  12.8   5.7  19.8   8.4  25.3
Character     Yes      -            1.5   4.9   1.4   4.7   5.1  17.8   6.9  22.2
Syllable      Yes      -            1.4   4.6   1.3   4.4   5.0  17.3   6.8  21.2
Character     Yes      Yes          1.3   4.2   1.1   3.9   4.8  16.8   6.7  20.8
Syllable      Yes      Yes          1.3   3.9   1.1   3.7   4.7  16.3   6.4  20.1

5.2 Semi-supervised learning results

Our best and final model resulted from using semi-supervised learning (Semi-SL) to leverage the unlabelled YouTube dataset. The results are presented in Table 4. We applied Semi-SL in five generations, first using the top 1,000 hours of the YouTube dataset, then 3,000 hours, 10,000 hours twice, and finally 20,000 hours of data. This best model achieved 1.0% CER / 3.2% WER on the JW.ORG test set and 4.7% CER / 15.6% WER on the MCV test set.

Table 4: Semi-supervised learning results: character error rates (CER %) and word error rates (WER %) on the validation (Dev.) and test sets through iterative generations. JW: JW.ORG dataset. MCV: Mozilla Common Voice dataset.

            Unlabelled data    JW Dev.     JW Test     MCV Dev.    MCV Test
Generation  size (hours)     CER   WER   CER   WER   CER   WER   CER   WER
1           1K                1.2   3.4   1.1   3.4   4.4  15.3   5.7  18.0
2           3K                1.1   3.2   1.0   3.2   4.2  14.8   5.3  16.9
3           10K               1.1   3.1   1.0   3.1   4.1  14.6   5.0  16.5
4           10K               1.2   3.3   1.0   3.2   4.0  14.3   4.8  15.9
5           20K               1.2   3.2   1.0   3.2   3.9  14.2   4.7  15.6
NVIDIA NeMo: CTC              2.3   9.7   2.2   9.4   4.3  15.3   5.5  18.4
NVIDIA NeMo: Transducer       2.2   8.7   2.2   8.8   4.5  14.1   5.7  16.3

Our final model's performance is compared to two open source ASR models for Kinyarwanda available on the Hugging Face website, which were trained on an MCV dataset of similar size to ours. There are two versions of models with NVIDIA NeMo (Kuchaiev et al., 2019): one based on the conformer architecture with CTC-based training (https://huggingface.co/nvidia/stt_rw_conformer_ctc_large), the other using an encoder-decoder architecture, i.e. a Transducer (https://huggingface.co/nvidia/stt_rw_conformer_transducer_large). While the transducer model achieves WERs comparable to our final model's performance, it achieves significantly worse CERs across both the JW.ORG and MCV benchmarks.

Since the MCV dataset includes speaker gender and age group labels, we also evaluated our best model across gender and age groups (see Table 5). Overall, examples from female speakers and those from speakers in their twenties resulted in significantly lower WER. Having listened to many noisy MCV utterances, we hypothesise that these performance differences might be more indicative of occupational differences across demographic groups. This is because the MCV Kinyarwanda dataset was mainly contributed to by volunteers throughout their daily activities.

Table 5: Model performance across demographic groups

          # Train    MCV Dev.    MCV Test
Group     (x1000)   CER   WER   CER   WER
male      517        5.1  16.5   3.8  14.0
female    379        3.4  12.7   3.2  13.2
teens     182        3.2  13.5   4.6  15.1
twenties  605        3.7  13.5   3.5  13.5
thirties  135        7.1  19.5   4.2  15.2

5.3 Error analysis

While the WERs achieved for Kinyarwanda ASR look higher than those typically reported for state-of-the-art ASR in high resource languages like English, there are multiple factors that explain this observation. First, Kinyarwanda is morphologically rich and thus has a very large (almost open) vocabulary. But we also qualitatively found three major types of ambiguities that mostly contribute to the observed high WER. These ambiguities are inherent to written Kinyarwanda and are less dependent on the acoustic model:

1. Spelling ambiguities for some consonants: multiple spellings can be considered legitimate and understood by readers. Examples:
   - mujye / muge ('you should')
   - poritiki / politiki ('politics')
   - incuti / inshuti ('friend')

2. Vowel assimilation: it is known (Mpiranya and Walker, 2005) that a vowel at the end of a word followed by another word starting with a vowel can be assimilated into the next vowel. Example:
   - avuga abantu / avuge abantu ('say people ...')

3. Loanword and foreign word spelling: there exists no standard for spelling foreign words and names in Kinyarwanda, and different adaptations can be found in writing. Examples:
   - Patirisiya / Patricia
   - Venezuwera / Venezuela

Therefore, since these ambiguities are common in writing (including benchmark data), they can affect the realized WER.

6 Conclusion and future work

This work demonstrates the effectiveness of combining self-supervised pre-training, curriculum learning and semi-supervised learning methods to achieve better speech recognition performance for the Kinyarwanda language. It is also shown that using syllabic tokenization improves upon the more common character-based tokenization for end-to-end ASR for Kinyarwanda. This work can be used by practitioners wanting to develop state-of-the-art ASR systems for other languages that are less prevalent in current language technology research. Future research will focus on incorporating the developed ASR models into language technology applications such as machine translation and information retrieval. We will also investigate techniques to make the models accessible on mobile platforms such as smartphones, where they can benefit the large community of Kinyarwanda speakers.
References

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210.

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4218–4222.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 41–48.

Stefan Braun, Daniel Neil, and Shih-Chii Liu. 2017. A curriculum learning method for improved noise robustness in automatic speech recognition. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 548–552. IEEE.

William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4960–4964. IEEE.

Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. 2022. TorchMetrics - Measuring reproducibility in PyTorch. Journal of Open Source Software, 7(70):4101.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang. 2020. Conformer: Convolution-augmented Transformer for speech recognition. In Proc. Interspeech 2020, pages 5036–5040.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Guolin Ke, Di He, and Tie-Yan Liu. 2020. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations.

Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, et al. 2019. NeMo: A toolkit for building AI applications using neural modules. arXiv preprint arXiv:1909.09577.

Fidèle Mpiranya and Rachel Walker. 2005. Sibilant harmony in Kinyarwanda and coronal opacity. Handout of paper presented at GLOW, 28.

Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th International Conference on Spoken Language Translation.

Daniel S. Park, Yu Zhang, Chung-Cheng Chiu, Youzheng Chen, Bo Li, William Chan, Quoc V. Le, and Yonghui Wu. 2020. SpecAugment on large scale datasets. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6879–6883. IEEE.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, et al. 2021. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624.

Sandy Ritchie, You-Chi Cheng, Mingqing Chen, Rajiv Mathews, Daan van Esch, Bo Li, and Khe Chai Sim. 2022. Large vocabulary speech recognition for languages of Africa: Multilingual modeling and self-supervised learning. arXiv preprint arXiv:2208.03067.

Anoop Chandran Savithri and Ramakrishnan Angarai Ganesan. 2023. Suitability of syllable-based modeling units for end-to-end speech recognition in Sanskrit and other Indian languages. Expert Systems with Applications, page 119722.

Ottokar Tilk and Tanel Alumäe. 2015. LSTM for punctuation restoration in speech transcripts. In Sixteenth Annual Conference of the International Speech Communication Association.

Elisabeth Walli-Sagey. 1986. On the representation of complex segments and their formation in Kinyarwanda. In Studies in Compensatory Lengthening, edited by Leo Wetzels and Engin Sezer, pages 251–95.
Thomas Zenkel, Ramon Sanabria, Florian Metze, Jan Niehues, Matthias Sperber, Sebastian Stüker, and Alex Waibel. 2017. Comparison of decoding strategies for CTC acoustic models. In Proc. Interspeech 2017, pages 513–517.

Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, and Yonghui Wu. 2020. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504.

Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. 2018. A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese. In Neural Information Processing: 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, December 13–16, 2018, Proceedings, Part V, pages 210–220. Springer.