Update Article: 102
Original Title
Low-resource neural machine translation with morphological modeling
Original Abstract
Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of the words. In this work, we propose a framework-solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the inputs. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder is found to improve machine translation performance. An attention augmentation scheme for the transformer model is proposed in a generic form to allow integration of pre-trained language models and also facilitate modeling of word order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda↔English translation using public-domain parallel text. Our final models achieve competitive performance in relation to large multi-lingual models. We hope that our results will motivate more use of explicit morphological information and the proposed model and data augmentations in low-resource NMT. (Comment: NAACL Findings 2024)
Original Full Text
Low-resource neural machine translation with morphological modeling
Antoine Nzeyimana
University of Massachusetts Amherst
anthonzeyi@gmail.com
arXiv:2404.02392v1 [cs.CL] 3 Apr 2024

Abstract
Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of the words. In this work, we propose a framework-solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the inputs. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder is found to improve machine translation performance. An attention augmentation scheme for the transformer model is proposed in a generic form to allow integration of pre-trained language models and also facilitate modeling of word order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda↔English translation using public-domain parallel text. Our final models achieve competitive performance in relation to large multi-lingual models. We hope that our results will motivate more use of explicit morphological information and the proposed model and data augmentations in low-resource NMT.

1 Introduction
Neural Machine Translation (NMT) has become a predominant approach in developing machine translation systems. Two important innovations in recent state-of-the-art NMT systems are the use of the Transformer architecture (Vaswani et al., 2017) and sub-word tokenization methods such as byte-pair encoding (BPE) (Sennrich et al., 2016). However, for morphologically-rich languages (MRLs), BPE-based tokenization is only limited to the surface forms of the words and less grounded on exact lexical units (i.e. morphemes), especially in the presence of morphographemic alternations (Bundy and Wallen, 1984) and non-concatenative morphology (Kastner et al., 2019). In this work, we tackle the challenge of modeling complex morphology in low-resource NMT and evaluate on Kinyarwanda, a low-resource and morphologically-rich language spoken by more than 15 million people in Eastern and Central Africa (https://en.wikipedia.org/wiki/Kinyarwanda).

To model the complex morphology of MRLs in machine translation, one has to consider both source-side modeling (i.e. morphological encoding) and target-side generation of inflected forms (i.e. morphological prediction). We explicitly use the morphological structure of the words and the associated morphemes, which form the basic lexical units. For source-side encoding, morphemes are first produced by a morphological analyzer before being passed to the source encoder through an embedding mechanism. On the target side, the morphological structure must be predicted along with morphemes, which are then consumed by an inflected-form synthesizer to produce surface forms. Therefore, this approach enables open-vocabulary machine translation since morphemes can be meaningfully combined to form new inflected forms not seen during training.
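To make the two-sided pipeline just described concrete, here is a minimal schematic sketch in Python. The data structure and function names (MorphAnalysis, analyze_word, synthesize_word) and the example segmentation are illustrative placeholders for the paper's morphological analyzer and synthesizer, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class MorphAnalysis:
    """Word-level units used by the model: stem, affixes, POS tag, affix-set index."""
    stem: str
    affixes: list
    pos_tag: str
    affix_set: int

def analyze_word(surface: str) -> MorphAnalysis:
    # Placeholder for the source-side morphological analyzer; a real analyzer
    # also resolves morphographemic alternations and disambiguates analyses.
    # Approximate, illustrative segmentation of "turagenda" (roughly "we are going").
    return MorphAnalysis(stem="gend", affixes=["tu", "ra", "a"], pos_tag="V", affix_set=17)

def synthesize_word(analysis: MorphAnalysis) -> str:
    # Placeholder for the target-side inflected-form synthesizer, which combines
    # a predicted stem with compatible affixes into a surface form.
    return "turagenda"

source_units = analyze_word("turagenda")   # fed to the encoder via embeddings
print(source_units)
print(synthesize_word(source_units))       # produced from decoder predictions
```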
Previous research has shown that certain adaptations to NMT models, such as the integration of pre-trained language models (Zhu et al., 2020; Sun et al., 2021), can improve machine translation performance. We explore this idea to improve low-resource machine translation between Kinyarwanda and English. Our model augmentation focuses on biasing the attention computation in the transformer model. Besides augmentation from pre-trained language model integration, we also devise an augmentation based solely on the word order relationship between source and target languages. These model augmentations bring substantial improvement in translation performance when parallel text is scarce.

One of the main challenges facing machine translation for low-resource languages obviously is parallel data scarcity. When the training data has limited lexical coverage, the NMT model may tend to hallucinate (Raunak et al., 2021; Xu et al., 2023). Additionally, for a morphology-aware translation model, there is a problem of misaligned vocabularies between source and target languages. This makes it harder for the model to learn to copy unknown words and other tokens that need to be copied without translation, such as proper names. To address these challenges, we take a data-centric approach by developing tools to extract more parallel data from public-domain documents and websites. We also use various data augmentation techniques to increase lexical coverage and improve token-copying ability where necessary. By combining these data-centric approaches with our morphology-aware NMT model, we achieve competitive translation performance in relation to larger multi-lingual NMT models. To have a comprehensive evaluation, we evaluate our models on three different benchmarks covering different domains, namely Wikipedia, News and Covid-19.

In short, our contribution in this work can be summarized as follows:
• We propose and evaluate methods for source-side and target-side morphological modeling in neural machine translation of morphologically-rich languages.
• We propose a generic method for attention augmentation in the transformer architecture, including a new cross-positional encoding technique to fit word order relationships between source and target language.
• We evaluate on Kinyarwanda↔English translation across three benchmarks and achieve competitive performance in relation to existing large multi-lingual NMT models.
• We release tools for parallel corpus construction from public-domain sources and make our source code publicly available to allow reproducibility (https://github.com/anzeyimana/KinMT_NAACL2024).

2 Methods
Machine translation (MT) can be considered as the task of accurately mapping a sequence of tokens (e.g. phrase, sentence, paragraph) in the source language S = (s_1, s_2, ..., s_n) to a sequence of tokens in the target language T = (t_1, t_2, ..., t_m) with the same meaning. The learning problem is then to estimate a conditional probability model that produces the optimal translation T*, that is:

T^* = \arg\max_{T} P(T \mid S, T_{<}; \Theta),

where T_< accounts for the previous output context and Θ are the parameters of the model (a neural network in the case of NMT).

In this section, we describe our model architecture as an extension of the basic Transformer architecture (Vaswani et al., 2017) to enable morphological modeling and attention augmentation. We also describe our data-centric approaches to dataset development and augmentation in the context of low-resource Kinyarwanda↔English machine translation.

2.1 Model architecture
The transformer architecture (Vaswani et al., 2017) for machine translation uses a multi-layer bidirectional encoder to process the source language input, which then feeds an auto-regressive decoder that produces the target language output.
Our adaptation of the transformer encoder is depicted in Figure 1, while the decoder is shown in Figure 2. They both use the pre-LayerNorm configuration (Nguyen and Salazar, 2019) of the transformer.

The attention module of the transformer architecture is designed as querying a dictionary made of key-value pairs using a softmax function and then projecting a weighted sum of value vectors to an output vector, that is:

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V,   (1)

where K, Q, V are projections of the hidden representations of inputs at a given layer. Given a hidden representation of a token v_i attending to a sequence of tokens with hidden representations (w_1, w_2, ..., w_n), the output of the attention module corresponding to v_i can be formulated as:

v'_i = \sum_{j=1}^{n} \frac{\exp(\alpha_{ij})}{\sum_{j'=1}^{n} \exp(\alpha_{ij'})} (w_j W_V), \quad \text{where the logits } \alpha_{ij} = \frac{1}{\sqrt{d}} (v_i W_Q)(w_j W_K)^T,   (2)

with W_Q ∈ R^{d×d_K}, W_K ∈ R^{d×d_K} and W_V ∈ R^{d×d_V} being learnable projection matrices. d, d_K and d_V are the dimensions of the input, key and value respectively.

[Figure 1: Encoder architecture. The diagram shows the two-tier encoder: a Morpho-Encoder over tags and morphemes feeding L pre-LayerNorm encoder layers with POS encodings, Enc-Self and Enc-BERT attention, Add & Norm and FFN blocks.]

[Figure 2: Decoder architecture. The diagram shows the two-tier decoder: a Morpho-Encoder over tags and morphemes feeding L pre-LayerNorm decoder layers with POS and X-POS encodings, Dec-Self, Dec-BERT and Dec-Enc attention, Add & Norm and FFN blocks, and an MTML prediction head over tags and morphemes.]

2.1.1 Attention augmentations
Ke et al. (2020) proposed to add bias terms to the logits α_ij in Equation 2 as untied positional encoding, disentangling a mixing of token and position embeddings. We generalize this structure by allowing more biases to be added to the logits α_ij in Equation 2.

Specifically, we explore augmenting two attention components in the transformer architecture by making the following extensions:

1. For source-to-source self-attention at each encoder layer: We integrate embeddings from a pre-trained BERT (Devlin et al., 2019) model. This adds rich contextual information, as BERT models are pre-trained on large monolingual data and perform well on language understanding tasks. We also add positional encodings at this level, similar to Ke et al. (2020). Therefore, the logits α_ij at encoder layer l become:

\alpha_{ij} = \frac{1}{\sqrt{3d}} \left[ (x_i^{(l)} W_Q^{(l)})(x_j^{(l)} W_K^{(l)})^T + (p_i U_Q)(p_j U_K)^T + r_{j-i} + (x_i^{(l)} V_Q^{(l)})(b_j V_K^{(l)})^T \right],   (3)

where x_i^{(l)} and x_j^{(l)} are hidden representations of source tokens at positions i and j respectively of the encoder layer l; p_i and p_j are absolute position embeddings; r_{j-i} is a relative position embedding; b_j is a pre-trained BERT embedding of the token at position j; and W_Q^{(l)}, W_K^{(l)}, U_Q, U_K, V_Q^{(l)} and V_K^{(l)} are learnable projection matrices. We note that this formulation requires the source encoder to match the same token vocabulary as the BERT embedding model.

2. For target-to-source cross-attention at each decoder layer l, we also augment the attention logits with pre-trained BERT embeddings of the source sequence. Additionally, we propose a new type of embedding: cross-positional embeddings. These are embeddings that align target sequence positions to input sequence positions. Their role can be thought of as learning word order relationships between source and target languages. Their formulation is closely similar to the untied positional encoding proposed by Ke et al. (2020), but they cross from target to source positions; thus, we name them cross-positional (XPOS) encodings. The attention logits α'_ij at this level thus become:

\alpha'_{ij} = \frac{1}{\sqrt{3d}} \left[ (y_i^{(l)} W'^{(l)}_Q)(x_j^{(L)} W'^{(l)}_K)^T + (p'_i U'_Q)(p'_j U'_K)^T + r'_{j-i} + (y_i^{(l)} V'^{(l)}_Q)(b_j V'^{(l)}_K)^T \right],   (4)

where y_i^{(l)} is the hidden representation of the target token at position i and x_j^{(L)} is the hidden representation of the source token at position j of the final encoder layer L. p'_i and p'_j are absolute target and source XPOS embeddings, r'_{j-i} is a target-to-source relative XPOS embedding, b_j is a pre-trained BERT embedding of the source token at position j, and W'^{(l)}_Q, W'^{(l)}_K, U'_Q, U'_K, V'^{(l)}_Q and V'^{(l)}_K are the learnable projection matrices.
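The following is a minimal single-head PyTorch sketch of the augmented self-attention logits in Equation 3. The tensor names and shapes, the lack of multi-head projections, and the assumption of precomputed BERT embeddings are simplifications for illustration; this is not the authors' released implementation.

```python
import math
import torch

def augmented_self_attention(x, b, p, rel_bias, W_q, W_k, W_v, U_q, U_k, V_q, V_k):
    """Single-head self-attention with the two extra bias terms of Eq. (3):
    an untied positional bias and a pre-trained-BERT bias added to the logits."""
    d = x.size(-1)
    content = (x @ W_q) @ (x @ W_k).transpose(-1, -2)    # token-to-token term
    position = (p @ U_q) @ (p @ U_k).transpose(-1, -2)   # untied positional term
    bert = (x @ V_q) @ (b @ V_k).transpose(-1, -2)       # BERT-embedding term
    logits = (content + position + bert + rel_bias) / math.sqrt(3 * d)
    return torch.softmax(logits, dim=-1) @ (x @ W_v)

# toy usage: a sequence of 5 tokens, model width 16
n, d = 5, 16
x = torch.randn(n, d)            # encoder hidden states at layer l
b = torch.randn(n, d)            # pre-trained BERT embeddings of the same tokens
p = torch.randn(n, d)            # absolute position embeddings
rel_bias = torch.randn(n, n)     # relative position bias r_{j-i}
params = [torch.randn(d, d) for _ in range(7)]
out = augmented_self_attention(x, b, p, rel_bias, *params)
print(out.shape)                 # torch.Size([5, 16])
```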
2.1.2 Morphological encoding
For most transformer-based encoder-decoder models, the first-layer inputs are usually formed by mapping each sub-word token, such as those produced by BPE, to a learnable embedding vector. However, BPE-produced tokens do not always carry explicit lexical meaning. In fact, they cannot model non-concatenative morphology and other morphographemic processes, as these BPE tokens are solely based on the surface forms. Inspired by the work of Nzeyimana and Rubungo (2022), we explore using a small transformer encoder to form a word-compositional model based on the morphological structure and the associated morphemes.

Depicted at the input layers in Figure 1 and Figure 2, the morphological encoder or Morpho-Encoder is a small transformer encoder that processes a set of four embedding units at the word composition level: (1) the stem, (2) a variable number of affixes, (3) a coarse-grained part-of-speech (POS) tag and (4) a fine-grained affix set index. An affix set represents one of many frequent affix combinations observed empirically. Thus, the affix set index is equivalent to a fine-grained morphological tag.

The Morpho-Encoder processes all word-compositional units as a set, without any ordering information. This is because none of these units can be repeated in the same word. In cases of stem reduplication phenomena (Inkelas and Zoll, 2000), only one stem is used, while the reduplication structure is captured by the affix set. At the output of the Morpho-Encoder, hidden representations corresponding to units other than the affixes are pulled and concatenated together to form a word hidden vector to feed to the main sequence model. In addition to this, a new stem embedding vector at the sequence level is also concatenated with the pulled vectors from the Morpho-Encoder to form the final hidden vector representing the word.

In our experiments, we use the 24,000 most frequent affix combinations as affix sets. Any infrequent combination of affixes can always be reduced to a frequent one by removing one or more affixes. However, all affixes still contribute to the word composition via the Morpho-Encoder.

We note that the Morpho-Encoder applies to both the encoder's and the auto-regressive decoder's input layers for all types of tokens. While the word-compositional model above relates mostly to inflected forms, we are able to generalize this to other types of tokens such as proper names, numbers and punctuation marks. For these other tokens, we process them using BPE and consider the resulting sub-word tokens as special stems without affixes.
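Below is a minimal PyTorch sketch of a Morpho-Encoder along the lines described above: a small transformer over the set {stem, POS tag, affix set, affixes}, whose non-affix outputs are concatenated with a sequence-level stem embedding. The vocabulary sizes, the fixed number of affixes and the absence of padding masks are illustrative simplifications, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MorphoEncoder(nn.Module):
    """Minimal word-compositional encoder: a small transformer over the set
    {stem, POS tag, affix-set index, affixes} with no positional encoding,
    whose non-affix outputs are concatenated into one word vector."""
    def __init__(self, n_stems, n_pos, n_affix_sets, n_affixes, dim=128):
        super().__init__()
        self.stem_emb = nn.Embedding(n_stems, dim)
        self.pos_emb = nn.Embedding(n_pos, dim)
        self.aset_emb = nn.Embedding(n_affix_sets, dim)
        self.affix_emb = nn.Embedding(n_affixes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        # separate sequence-level stem embedding concatenated at the output
        self.seq_stem_emb = nn.Embedding(n_stems, dim)

    def forward(self, stem, pos, affix_set, affixes):
        units = torch.stack([self.stem_emb(stem), self.pos_emb(pos),
                             self.aset_emb(affix_set)], dim=1)       # (B, 3, dim)
        units = torch.cat([units, self.affix_emb(affixes)], dim=1)   # append affixes
        h = self.encoder(units)                                      # set encoding
        word_vec = torch.cat([h[:, 0], h[:, 1], h[:, 2],
                              self.seq_stem_emb(stem)], dim=-1)      # (B, 4*dim)
        return word_vec

# toy usage: a batch of 2 words, each with 3 affixes
enc = MorphoEncoder(n_stems=1000, n_pos=16, n_affix_sets=24000, n_affixes=300)
stem = torch.tensor([5, 42]); pos = torch.tensor([1, 3])
aset = torch.tensor([17, 230]); affixes = torch.randint(0, 300, (2, 3))
print(enc(stem, pos, aset, affixes).shape)   # torch.Size([2, 512])
```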
2.1.3 Target-side morphology learning
Considering the morphological model employed at the input layer, the decoder outputs for a morphologically-rich target language must be used to predict the same types of morphological units used at the input layer, namely the stem, the affixes, the POS tag and the affix set. This becomes a multi-task and multi-label (MTML) classification problem which requires optimizing multiple objectives, corresponding to four types of loss functions:

\ell_S = \ell_{CE}(f_S(h_L), y_S)
\ell_A^{(i)} = \ell_{BCE}(f_A(h_L), y_A^{(i)}), \; \forall i \in A
\ell_P = \ell_{CE}(f_P(h_L), y_P)
\ell_{AS} = \ell_{CE}(f_{AS}(h_L), y_{AS}),   (5)

where h_L is the decoder output, A is the set of affix indices, and f_S, f_A, f_P and f_AS are prediction heads transforming the decoder output to probabilities over the sets of stems, affixes, POS tags and affix sets respectively. y_S, y_A, y_P and y_AS respectively correspond to the stem, affixes, POS tag and affix set of a target word y. ℓ_CE is a cross-entropy loss function and ℓ_BCE is a binary cross-entropy loss function.

A naive approach to the MTML problem consists of summing up all the losses and optimizing the sum. However, this can lead to a biased outcome since individual losses take on different ranges and have varying levels of optimization difficulty. Complicating the problem further is the fact that individual objectives can contribute conflicting gradients, making it harder to train the multi-task model with standard gradient descent algorithms. A potential solution to this problem comes from the multi-lingual NMT literature with a scheme called Gradient Vaccine (Wang et al., 2020). This method attempts to mediate conflicting gradient updates from individual losses by encouraging more geometrically aligned parameter updates. We evaluate both the naive summation and the Gradient Vaccine methods in our experiments.
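A minimal sketch of the four prediction heads and the naive loss summation of Equation 5 is given below, assuming the affix target is encoded as a multi-hot vector. Head sizes are illustrative, and the Gradient Vaccine variant is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTMLHeads(nn.Module):
    """Sketch of the target-side MTML objective (Eq. 5): stem, POS tag and
    affix set use cross-entropy; affixes are a multi-label binary target."""
    def __init__(self, dim, n_stems, n_pos, n_affix_sets, n_affixes):
        super().__init__()
        self.f_s = nn.Linear(dim, n_stems)
        self.f_p = nn.Linear(dim, n_pos)
        self.f_as = nn.Linear(dim, n_affix_sets)
        self.f_a = nn.Linear(dim, n_affixes)

    def forward(self, h, y_stem, y_pos, y_aset, y_affix_multi_hot):
        loss_s = F.cross_entropy(self.f_s(h), y_stem)
        loss_p = F.cross_entropy(self.f_p(h), y_pos)
        loss_as = F.cross_entropy(self.f_as(h), y_aset)
        loss_a = F.binary_cross_entropy_with_logits(self.f_a(h), y_affix_multi_hot)
        return loss_s + loss_p + loss_as + loss_a   # naive summation baseline

# toy usage: batch of 4 decoder output vectors of width 768
heads = MTMLHeads(dim=768, n_stems=1000, n_pos=16, n_affix_sets=24000, n_affixes=300)
h = torch.randn(4, 768)
loss = heads(h, torch.randint(0, 1000, (4,)), torch.randint(0, 16, (4,)),
             torch.randint(0, 24000, (4,)), torch.randint(0, 2, (4, 300)).float())
loss.backward()
```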
2.1.4 Morphological inference
The decoder architecture presented in Subsection 2.1.3 only predicts separate probabilities for the stem, POS tag, affix set and affixes. But the translation task must produce surface forms to generate the output text. The challenge is that greedily picking the items with maximum probability may not produce the best output and may even produce incompatible stems and affixes; that is, we must produce a stem and affixes of the same inflection group (e.g. verb, noun, pronoun, etc.). It is also known that the beam search algorithm generally produces better sequence outputs than greedy decoding. Therefore, we design an adaptation of the beam search algorithm where, at each step, we produce a list of scored candidate surface forms together with their morphological information to feed back to the decoder's input. The design criterion is to make sure the top predicted items can form compatible pairs of stems and affix sets. The main requirement for the algorithm is the availability of a morphological synthesizer that can produce surface forms given an inflection group, the stem and compatible affixes. The morphological synthesizer must also respect all existing morphographemic rules for the language. We provide detailed pseudocode for the decoding algorithm in Appendix C. The algorithm has four basic steps:
1. voting on the inflection group,
2. filtering out less probable stems and affixes,
3. selecting target affixes, and finally
4. morphological synthesis for each final stem and affixes combination.

2.2 Dataset
Dataset development and data-centric approaches to neural machine translation (NMT) are of paramount importance for low-resource languages. This is because the most limiting factor is the scarcity of parallel data. While describing the data collection process and pre-processing steps is important, it is equally important to fully disclose the data provenance, as there are typically a limited number of sources of parallel data per language. We conduct our experiments using public-domain parallel text. In this section, we describe our parallel data gathering process as well as the reliable sources we used to source Kinyarwanda-English bitext. We also describe simple data augmentation techniques we used to boost the performance of our experimental models. Due to copyright and licensing restrictions, we cannot redistribute our experimental dataset. Instead, we release the tools used for its construction from the original sources. The sizes of the parallel datasets we gathered are provided in Appendix B.

2.3 Official Gazette
Official gazettes are periodical government journals, typically with policy and regulation content. When a country has multiple official languages, content may be available as parallel text, with each paragraph of the journal available in each official language. We took this opportunity and collected an experimental parallel text from the Official Gazette of the Republic of Rwanda (https://www.minijust.gov.rw/official-gazette), where Kinyarwanda, French, English and Swahili are all official languages. This is an important source of parallel text given that it covers multiple sectors and is usually written to high standards by professionals who are part of a dedicated government agency.

The main content of Rwanda's official gazette is provided in a multi-column portable document format (PDF), mostly 3 columns for Kinyarwanda, English and French. In our experiments, we process page content streams by making low-level modifications to the Apache PDFBox Java library (https://pdfbox.apache.org/), where the inputs are unordered sets of raw characters with their X-Y page coordinates and font information. We track columns by detecting margins (by sorting X-coordinates of glyphs) and reconstruct text across consecutive pages. A key opportunity for parallel alignment comes from the fact that most official gazette paragraphs are grouped by consecutive article numbers such as "Ingingo ya 1/Article 1", "Ingingo ya 2/Article 2", and so on. We use these article enumerations as anchors for finding parallel paragraphs across the three languages. A language identification component is also required to know which column corresponds to which language, as the column ordering has changed over time.
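As an illustration of the article-number anchoring idea, the sketch below aligns paragraphs of two already-extracted gazette columns. The regular expressions and helper names are hypothetical and far simpler than the released PDFBox-based tool; the toy strings are invented.

```python
import re

# Hypothetical helper: split an already-extracted gazette column into
# (article_number, paragraph_text) pairs, using the article enumerations
# ("Ingingo ya 1", "Article 1", ...) as alignment anchors.
ANCHORS = {
    "rw": re.compile(r"Ingingo ya (\d+)"),
    "en": re.compile(r"Article (\d+)"),
}

def split_by_articles(column_text, lang):
    pieces = ANCHORS[lang].split(column_text)
    # re.split with a capture group yields [preamble, num1, text1, num2, text2, ...]
    return {int(num): text.strip() for num, text in zip(pieces[1::2], pieces[2::2])}

def align_articles(rw_text, en_text):
    rw, en = split_by_articles(rw_text, "rw"), split_by_articles(en_text, "en")
    return [(rw[n], en[n]) for n in sorted(rw) if n in en]

# toy usage with invented snippets
rw = "Ingingo ya 1: Icyo iri tegeko rigamije ... Ingingo ya 2: Ibisobanuro ..."
en = "Article 1: Purpose of this law ... Article 2: Definitions ..."
for pair in align_articles(rw, en):
    print(pair)
```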
2.4 Jw.org website
The Jw.org website publishes religious and biblical teachings by Jehovah's Witnesses, with cross-references into multiple languages. While this website's data has been used for low-resource machine translation before (Agić and Vulić, 2019), the isolated corpus is no longer available due to license restrictions. However, the content of the original website is still available to web browsers and crawlers. We take advantage of this and gather data from the site to experiment with English↔Kinyarwanda machine translation.

2.5 Bilingual dictionaries
Bilingual dictionaries are also useful for low-resource machine translation. While most of their parallel data consists of single words, they can still contribute to the translation task, albeit without any sentence-level context. The 2006 version of the Iriza dictionary (Habumuremyi and Uwamahoro, 2006) is generally publicly available in PDF format. Similar to the Official Gazette case, we use low-level modifications to the Apache PDFBox library to extract dictionary entries grouped by a source word and a target synset. Another bilingual dictionary we used is the kinyarwanda.net website (https://kinyarwanda.digital/), which was developed by volunteers to help people learning Kinyarwanda or English. In addition to these bilingual dictionaries, we manually translated about 8,000 Kinyarwanda words whose stems could not be found in any of the existing parallel data sources. Some of these terms include recently incorporated but frequently used Kinyarwanda words such as loanwords and also alternate common spellings. Examples include words such as: 'abazunguzayi' (hawkers), 'akabyiniro' (night club), 'canke' (from Kirundi: or), 'mitiweli' (from French: "mutuelle santé"). Together with the data from bilingual dictionaries, the above dataset forms a special training subset we call 'lexical data', because it augments the lexical coverage of our main dataset. We evaluate its effectiveness in our experiments.

2.6 Monolingual data
Back-translation (Edunov et al., 2018) is a proven technique for leveraging monolingual data in machine translation. We developed a corpus of Kinyarwanda text by crawling more than 200 websites and extracting text from several books to form a monolingual dataset to use for back-translation. The final corpus contains about 400 million words and tokens, or 16 million sentences. We also formed an English text corpus of similar size by crawling eight major Rwandan and East African English newspapers (3.3 million sentences), in addition to the Wikipedia English corpus (7.3 million sentences; https://www.kaggle.com/datasets/mikeortman/wikipedia-sentences) and global English news data (5.4 million sentences; https://data.statmt.org/news-crawl/en/news.2020.en.shuffled.deduped.gz).

2.7 Data augmentations
Source-to-target copying in NMT is a desirable ability when faced with untranslatable terms such as proper names. However, when the source and target vocabularies are not shared, it is harder for the model to learn this ability. In order to enforce this copying ability in our NMT model, we take a data-centric approach by including untranslatable terms in our dataset with the same source and target text. This augmentation includes the following datasets: (1) all numeric tokens and proper names from our Kinyarwanda monolingual corpus, (2) names of locations from the World Cities dataset (https://github.com/datasets/world-cities/blob/master/data/world-cities.csv), and (3) names of people from the CMU Names corpus (Kantrowitz and Ross, 2018) and the Names Dataset (Remy, 2021).

We also add synthetic data for number spellings by using rule-based synthesizers to spell 200,000 random integers between zero and 999 billion. For the Kinyarwanda side, we developed our own synthesizer, while we used the inflect python package (https://pypi.org/project/inflect/) for English.

Code-switching is one characteristic of some low-resource languages such as Kinyarwanda. To cope with this issue, we add foreign language terms and their English translations to the training dataset for Kinyarwanda → English models, as if the foreign terms were valid Kinyarwanda inputs. For this, we include all English phrases from the kinyarwanda.net online dictionary, 100 popular French terms and 20 popular Swahili terms.
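Here is a small sketch of the number-spelling and copy-data augmentations for the English side, using the inflect package named above. The sampling range, output formatting and helper names are illustrative; the Kinyarwanda spellings would come from the authors' own synthesizer.

```python
import random
import inflect

eng = inflect.engine()

def english_number_pairs(n_samples=5, max_value=999_000_000_000):
    """Synthetic (digits, spelled-out) pairs for the English side (Section 2.7)."""
    pairs = []
    for _ in range(n_samples):
        value = random.randint(0, max_value)
        spelled = eng.number_to_words(value, andword="")   # e.g. "forty-two"
        pairs.append((str(value), spelled))
    return pairs

def copy_pairs(terms):
    """Copy-data augmentation: untranslatable tokens (proper names, numbers)
    appear with identical source and target sides."""
    return [(t, t) for t in terms]

print(english_number_pairs(3))
print(copy_pairs(["Kigali", "Nzeyimana", "2024"]))
```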
3 Experiments
3.1 Experimental setup
The model presented in Section 2.1 was implemented from scratch using the PyTorch framework (Paszke et al., 2019), version 1.13.1. Our model hyper-parameters, along with training and inference hyper-parameters, are provided in Appendix A. Training was done on a hardware platform with 8 Nvidia RTX 4090 GPUs and 256 gigabytes of system memory, running a Linux operating system. We used mixed precision training with the lower precision in BFLOAT-16 format. For the Kinyarwanda→English model, one gradient update step takes 0.5 seconds and convergence is achieved after 40 epochs. For the English→Kinyarwanda model with the Gradient Vaccine scheme, one gradient update step takes 1.1 seconds, while convergence is achieved after 8 epochs.

In all our experiments on Kinyarwanda↔English translation, only the Kinyarwanda side (as source or target) is morphologically modelled, while the English side always uses sub-word tokenization. For the Kinyarwanda source side with attention augmentation, we use a pre-trained BERT model similar to KinyaBERT (Nzeyimana and Rubungo, 2022) whose input units/token ids are the same as for the NMT encoder. In fact, this pre-trained KinyaBERT model has the same two-tier architecture as the NMT encoder. Therefore, they are aligned to the same words/tokens. We use a Kinyarwanda morphological analyzer (https://github.com/anzeyimana/DeepKIN) to perform tokenization, morphological analysis and disambiguation (Nzeyimana, 2020).

On the English side (either source or target), we do not perform morphological analysis and only use a standard single-tier transformer architecture. On the source side, we use a BPE-based tokenization and a corresponding pre-trained RoBERTa model provided by the fairseq package (Ott et al., 2019). Similarly, on the English target side, we use a BPE-based tokenization from a Transformer-based auto-regressive English language model (Ng et al., 2019) from the same fairseq package.

3.2 Evaluation
We evaluate our models on three different benchmarks that include Kinyarwanda, namely FLORES-200 (Costa-jussà et al., 2022), MAFAND-MT (Adelani et al., 2022) and TICO-19 (Anastasopoulos et al., 2020). This allows us to get a picture of how the models perform on different domains, respectively Wikipedia, News and Covid-19. Our main evaluation metric is ChrF++ (Popović, 2017), which includes both character-level and word-level n-gram evaluation, does not rely on any sub-word tokenization and has been shown to correlate better with human judgements than the more traditional BLEU score. We use the TorchMetrics (Detlefsen et al., 2022) package's default implementation of ChrF++. For Kinyarwanda→English translation, we also evaluate with BLEURT scores (Sellam et al., 2020), an embedding-based metric with higher correlation with human judgement. We use a pre-trained PyTorch implementation of BLEURT (https://github.com/lucadiliello/bleurt-pytorch). We did not use BLEURT scores for English→Kinyarwanda because there was no available pre-trained BLEURT model for Kinyarwanda and the pre-training cost is very high.

The baseline BPE-based models in Table 4 and Table 5 use a SentencePiece (Kudo and Richardson, 2018) tokenizer, with 32K-token vocabularies for either source or target. The SentencePiece tokenizers are trained/optimized on 16M sentences of text for each language. Source and target vocabularies are not shared. The NMT models in this case use the same Transformer backbone as the morphological models, but without morphological modeling or BERT attention augmentation.
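A minimal scoring sketch with the TorchMetrics ChrF++ implementation referenced above is shown below. The example sentences are invented, and the constructor defaults and score scaling may differ across TorchMetrics versions.

```python
from torchmetrics.text import CHRFScore

# ChrF++ corresponds to chrF with word n-grams enabled (n_word_order=2).
chrf_pp = CHRFScore(n_word_order=2)

hypotheses = ["The law enters into force today .",
              "Vaccination campaigns continue in all districts ."]
references = [["This law comes into force today ."],
              ["Vaccination campaigns are continuing in all districts ."]]

score = chrf_pp(hypotheses, references)
# TorchMetrics returns the score as a tensor; its scaling convention may
# differ from SacreBLEU's 0-100 range, so check the installed version's docs.
print(float(score))
```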
Use copy data?   FLORES-200        MAFAND-MT         TICO-19           Average
                 BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++
No               55.5    39.2      52.3    37.1      52.2    33.4      53.3    36.6
Yes              56.8    40.3      54.8    39.6      52.9    34.0      54.9    38.0

Table 1: Impact of proper name copying ability: Kinyarwanda→English. Maximum scores are shown in bold in the original.

Setup            #Params (×1M)   FLORES-200        MAFAND-MT         TICO-19           Average
                                 BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++
Morpho           188             57.1    40.9      54.7    39.8      53.1    34.7      55.0    38.5
+ XPOS           190             57.7    41.1      55.6    39.9      54.1    35.3      55.8    38.8
+ BERT           190             59.4    42.5      57.1    40.4      56.1    36.3      57.5    39.7
+ BERT + XPOS    192             59.9    43.1      58.0    41.3      56.7    37.0      58.2    40.5

Table 2: Impact of attention augmentation: Kinyarwanda→English.

Setup                                           FLORES-200        MAFAND-MT         TICO-19           Average
                                                BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++
Morpho + BERT + XPOS without Lexical Data       58.5    41.6      56.3    39.9      55.5    36.0      56.8    39.2
Morpho + BERT + XPOS + Lexical Data             59.9    43.1      58.0    41.3      56.7    37.0      58.2    40.5

Table 3: Impact of lexical data (bilingual dictionaries): Kinyarwanda→English.

Setup                #Params (×1M)   FLORES-200        MAFAND-MT         TICO-19           Average
                                     BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++    BLEURT  ChrF++
BPE Seq2Seq + XPOS   187             50.1    35.5      48.5    34.2      47.4    30.7      48.7    33.5
Morpho + XPOS        190             57.7    41.1      55.6    39.9      54.1    35.3      55.8    38.8

Table 4: Impact of source side morphological modeling: Kinyarwanda→English.

Setup                             #Params (×1M)   FLORES-200     MAFAND-MT      TICO-19        Average
                                                  Dev    Test    Dev    Test    Dev    Test    Dev    Test
BPE Seq2Seq + XPOS                187             35.0   35.2    37.0   37.8    30.1   31.1    34.0   34.7
Morpho + XPOS (Loss summation)    196             36.9   37.2    39.2   40.9    32.0   33.0    36.0   37.0
Morpho + XPOS + GradVacc          196             37.6   38.2    41.0   42.4    32.8   33.5    37.1   38.0

Table 5: Impact of target side morphological modeling (ChrF++): English→Kinyarwanda.

Setup                                                 #Params (×1M)   FLORES-200     MAFAND-MT      TICO-19
                                                                      Dev    Test    Dev    Test    Dev     Test
Morpho + XPOS + BERT + Backtransl. (Ours)             403             53.2   53.1    58.2   61.9    48.7*   50.2*
Helsinki-opus-mt (Tiedemann and Thottingal, 2020)     76              35.5   36.7    34.3   37.3    27.5    27.2
NLLB-200 600M (distilled) (Costa-jussà et al., 2022)  600             45.8   45.5    50.4   52.7    44.8    46.3
mBART (Liu et al., 2020) fine-tuned on our dataset    610             48.7   48.5    52.4   54.1    43.7    45.2
NLLB-200 3.3B (Costa-jussà et al., 2022)              3,300           50.6   50.9    57.3   58.6    50.0    52.4
Google Translate                                      N/A             59.1   60.0    76.6   87.5    46.5    49.6

Table 6: English→Kinyarwanda: Comparison of our large model performance after back-translation in relation to open-source models and Google Translate. chrF2 scores are computed using SacreBLEU (Post, 2018) with 10,000 bootstraps for significance testing. Highest scores among open-source models (p-value < 0.002) are shown in bold in the original; overall best scores are underlined. *On TICO-19, our model outperforms Google Translate (p-value < 0.002).

Setup                                                 #Params (×1M)   FLORES-200     MAFAND-MT      TICO-19
                                                                      Dev    Test    Dev    Test    Dev    Test
Morpho + XPOS + BERT + Backtransl. (Ours)             396             54.6   54.8    54.7   59.2    49.4   50.1
Helsinki-opus-mt (Tiedemann and Thottingal, 2020)     76              35.4   35.1    33.7   35.1    29.6   29.7
NLLB-200 600M (distilled) (Costa-jussà et al., 2022)  600             53.1   52.3    51.1   54.9    47.7   48.6
mBART (Liu et al., 2020) fine-tuned on our dataset    610             43.7   43.1    44.5   46.0    38.8   38.8
NLLB-200 3.3B (Costa-jussà et al., 2022)              3,300           56.8   56.0    55.4   59.6    53.4   54.1
Google Translate                                      N/A             60.0   59.1    57.3   64.0    51.8   52.4

Table 7: Kinyarwanda→English: Comparison of our large model performance after back-translation in relation to open-source models and Google Translate. chrF2 scores are computed using SacreBLEU (Post, 2018) with 10,000 bootstraps for significance testing. Highest scores among open-source models (p-value < 0.002) are shown in bold in the original; overall best scores are underlined.
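For the chrF2 numbers in Tables 6 and 7, a minimal SacreBLEU-based scoring sketch might look as follows. The example sentences are invented, and the paired bootstrap significance testing (10,000 samples) reported in the captions is not reproduced here.

```python
import sacrebleu

hypotheses = ["The law enters into force today .",
              "Vaccination campaigns continue in all districts ."]
# SacreBLEU expects a list of reference streams, each aligned with the hypotheses.
references = [["This law comes into force today .",
               "Vaccination campaigns are continuing in all districts ."]]

# chrF2 (character n-grams only, beta = 2), as reported in Tables 6 and 7.
chrf2 = sacrebleu.corpus_chrf(hypotheses, references)
print(chrf2.score)   # 0-100 scale
```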
3.3 Results
Results in Table 1 through Table 5 show our ablation study results, evaluating the various contributions. In Table 1, we show the improvement across all three benchmarks from adding proper name data to induce token-copying ability. In Table 2, we evaluate the impact of our attention augmentation scheme. The results show substantial improvement from adding the BERT and XPOS attention augmentations. Table 3 confirms the effectiveness of adding bilingual dictionary data to the training set. In Table 4 and Table 5, we find a large performance gap between the standard transformer with BPE-based tokenization (BPE Seq2Seq) and our morphology-based models (Morpho), which confirms the effectiveness of our morphological modeling. Finally, in Table 6 and Table 7, we use back-translation and train 400M-parameter models that perform better than strong baselines including NLLB-200 (3.3B parameters for English→Kinyarwanda, 600M parameters for both directions) and fine-tuned mBART (610M parameters). For English→Kinyarwanda, we achieve performance exceeding that of Google Translate on the out-of-domain TICO-19 benchmark.

4 Related Work
Morphological modeling in NMT is an actively researched subject, often leading to improvements in translation. However, most of this research has focused on European languages. Ataman and Federico (2018) show that an RNN-based word-compositional model improves NMT on several languages. Weller-Di Marco and Fraser (2020) evaluate both source-side and target-side morphology modeling between English and German using a lemma+tag representation. Passban et al. (2018) propose using multi-task learning of target-side morphology with a weighted average loss function. However, Macháček et al. (2018) do not find improvement when using unsupervised morphological analysers. Our study differs in that it uses a different morphological representation, namely the two-tier architecture, and we also evaluate on a relatively lower-resourced language.

The idea of model augmentation with pre-trained language models (PLMs) has been previously explored by Sun et al. (2021) and Zhu et al. (2020), who use a drop-net scheme to integrate BERT embeddings. Also, there have been attempts to model word order relationships between source and target languages (Li et al., 2017; Murthy et al., 2019). Our model architecture provides a more generic approach through the attention augmentation.

5 Conclusion
This work combines three techniques of morphological modeling, attention augmentation and data augmentation to improve machine translation performance for a low-resource morphologically-rich language. Our ablation results indicate improvement from each individual contribution. Baseline improvements from morphological modeling are more pronounced at the target side than at the source side. This work expands the landscape of modeling complex morphology in NMT and provides a potential framework-solution for machine translation of low-resource morphologically-rich languages.
6 Limitations
Our morphological modeling proposal requires an effective morphological analyzer and was only evaluated on one morphologically-rich language, namely Kinyarwanda. Morphological analyzers are not available for all languages, and this will limit the applicability of our technique. The proposed data augmentation technique for enabling proper name copying ability works in most cases, but we also observed a few cases where inexact copies are produced. Similarly, even with lexical data added to our training, we still observe some cases of hallucinated output words, mostly when the model encounters unseen words. Finally, the proposed morphological decoding algorithm is slower than standard beam search because of the filtering steps and morphological synthesis performed before producing a candidate output token.

Given the above limitations, our model does not grant complete reliability, and the produced translations still require post-editing to be used in high-stakes applications. However, there are no major risks in using the model in normal use cases as a translation aid tool.

References
David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, et al. 2022. A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070.

Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210.

Antonios Anastasopoulos, Alessandro Cattelan, Zi-Yi Dou, Marcello Federico, Christian Federmann, Dmitriy Genzel, Franscisco Guzmán, Junjie Hu, Macduff Hughes, Philipp Koehn, et al. 2020. TICO-19: The translation initiative for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020.

Duygu Ataman and Marcello Federico. 2018. Compositional representation of morphologically-rich input for neural machine translation. arXiv preprint arXiv:1805.02036.

Alan Bundy and Lincoln Wallen. 1984. Morphographemics: Alias: spelling rules. Catalogue of Artificial Intelligence Tools, pages 76–77.

Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation.

Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock, Ananya Harsh Jha, Teddy Koker, Luca Di Liello, Daniel Stancl, Changsheng Quan, Maxim Grechkin, and William Falcon. 2022. TorchMetrics - Measuring reproducibility in PyTorch. Journal of Open Source Software, 7(70):4101.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.

Emmanuel Habumuremyi and Claudine Uwamahoro. 2006. IRIZA-STARTER 2006: A Bilingual Kinyarwanda-English and English-Kinyarwanda Dictionary. Rwanda Community Net.

Sharon Inkelas and Cheryl Zoll. 2000. Reduplication as morphological doubling. Manuscript, University of California, Berkeley and Massachusetts Institute of Technology.

Mark Kantrowitz and Bill Ross. 2018. Names corpus, version 1.3.
Itamar Kastner, Matthew A. Tucker, Artemis Alexiadou, Ruth Kramer, Alec Marantz, and Isabel Oltra Massuet. 2019. Non-concatenative morphology. Ms., Humboldt-Universität zu Berlin and Oakland University.

Guolin Ke, Di He, and Tie-Yan Liu. 2020. Rethinking positional encoding in language pre-training. arXiv preprint arXiv:2006.15595.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Junhui Li, Deyi Xiong, Zhaopeng Tu, Muhua Zhu, Min Zhang, and Guodong Zhou. 2017. Modeling source syntax for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 688–697.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Dominik Macháček, Jonáš Vidra, and Ondřej Bojar. 2018. Morphological and language-agnostic word segmentation for NMT. In International Conference on Text, Speech, and Dialogue, pages 277–284. Springer.

Rudra Murthy, Anoop Kunchukuttan, and Pushpak Bhattacharyya. 2019. Addressing word-order divergence in multilingual neural machine translation for extremely low resource languages. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3868–3873.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. arXiv preprint arXiv:1907.06616.

Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In Proceedings of the 16th International Conference on Spoken Language Translation.

Antoine Nzeyimana. 2020. Morphological disambiguation from stemming data. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4649–4660.

Antoine Nzeyimana and Andre Niyongabo Rubungo. 2022. KinyaBERT: A morphology-aware Kinyarwanda language model. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5347–5363.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53.

Peyman Passban, Qun Liu, and Andy Way. 2018. Improving character-based decoding using target-side morphological information for neural machine translation. arXiv preprint arXiv:1804.06506.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.

Maja Popović. 2017. chrF++: Words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Vikas Raunak, Arul Menezes, and Marcin Junczys-Dowmunt. 2021. The curious case of hallucinations in neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1172–1183.

Philippe Remy. 2021. Name dataset. https://github.com/philipperemy/name-dataset.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725.

Zewei Sun, Mingxuan Wang, and Lei Li. 2021. Multilingual translation via grafting pre-trained language models. arXiv preprint arXiv:2109.05256.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT: Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Zirui Wang, Yulia Tsvetkov, Orhan Firat, and Yuan Cao. 2020. Gradient Vaccine: Investigating and improving multi-task optimization in massively multilingual models. In International Conference on Learning Representations.

Marion Weller-Di Marco and Alexander Fraser. 2020. Modeling word formation in English–German neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4227–4232.

Weijia Xu, Sweta Agrawal, Eleftheria Briakou, Marianna J. Martindale, and Marine Carpuat. 2023. Understanding and detecting hallucinations in neural machine translation via model introspection. Transactions of the Association for Computational Linguistics, 11:546–564.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823.
A Model, training and inference hyper-parameters
Our model hyper-parameters, along with training and inference hyper-parameters, are provided in Table 8.

Hyper-parameter                              Value
Model
  Transformer hidden dimension               768
  Transformer feed-forward dimension         3072
  Transformer attention heads                12
  Transformer encoder layers with BERT       5
  Transformer encoder layers without BERT    8
  Transformer decoder layers with GPT        7
  Transformer decoder layers without GPT     8
  Morpho-Encoder hidden dimension            128
  Morpho-Encoder feed-forward dimension      512
  Morpho-Encoder attention heads             4
  Morpho-Encoder layers                      3
  Dropout                                    0.1
  Maximum sequence length                    512
Training
  Batch size                                 32K tokens
  Peak learning rate                         0.0005
  Learning rate schedule                     inverse sqrt
  Warm-up steps                              8000
  Optimizer                                  Adam
  Adam's β1, β2                              0.9, 0.98
  Maximum training epochs                    40
Morphological inference
  Beam width                                 4
  Top scores M                               8
  Cut-off gap δ                              0.3
  Minimum affix probability γ                0.3
  Global correlation weight α                0.08
  Surface form score stop gap log(β)         2.0

Table 8: Hyper-parameter settings.

B Dataset summary

Subset                              Size
Parallel sentences
  Jw.org website                    562,417 sentences
  Rwanda's official gazette         113,127 sentences
Lexical data
  Iriza dictionary 2006             108,870 words
  Kinyarwanda.net                   10,653 words
  Manually translated               8,000 words
Augmented data
  Spelled numbers                   200,000 phrases
  Copy data (e.g. proper names)     157,668 words
  Code-switching foreign terms      10,276 terms

Table 9: Summary of our experimental parallel dataset.
C Morphological decoding algorithm

Algorithm 1: Inflection generation helper subroutines

Subroutine inflectionGroupProbabilities(Ts, Tp, Ta, Ps, Pp, Pa, M, G):
    W ← M × G array
    W[i, g] ← −100, ∀i = 1..M, ∀g = 1..G
    for i ← 1 to M do
        gs ← getInflectionGroup(Ts[i]);  W[i, gs] ← max(W[i, gs], log(Ps[Ts[i]]))
        gp ← getInflectionGroup(Tp[i]);  W[i, gp] ← max(W[i, gp], log(Pp[Tp[i]]))
        ga ← getInflectionGroup(Ta[i]);  W[i, ga] ← max(W[i, ga], log(Pa[Ta[i]]))
    end for
    V ← G array;  V[g] ← Σ_{i=1..M} W[i, g], ∀g = 1..G
    Pg ← softmax(V)
    return Pg

Subroutine filterAndCutOff(T, P, g, δ, γ):
    T' ← [ ];  pp ← 0
    foreach i ∈ T do
        if (getInflectionGroup(i) = g) and (P[i] ≥ γ) then
            T'.append(i)
            if (pp − P[i]) > δ then break
            pp ← P[i]
    end foreach
    return T'

Subroutine computeScore(Ts, Tp, Ta, Ps, Pp, Pa, ρ[·,·], α):
    C ← |Ts| × |Tp| × |Ta| array;  tot ← 0
    foreach s ∈ Ts do
        foreach p ∈ Tp do
            foreach a ∈ Ta do
                c ← exp(α·log(ρ[s, a]) + log Ps[s] + log Pp[p] + log Pa[a])
                C[s, p, a] ← c;  tot ← tot + c
            end foreach
        end foreach
    end foreach
    C[s, p, a] ← C[s, p, a] / tot, ∀s ∈ Ts, ∀p ∈ Tp, ∀a ∈ Ta   // normalize
    return C

Algorithm 2: Inflection generation

Input: probability distributions returned by the neural network (softmax heads) for stems, POS tags and affix sets (Ps, Pp, Pa); probability values returned for each affix (multi-label heads) Pf; M: number of top items (stems, POS tags, affix sets) to consider; N: number of top affixes to consider; number of inflection groups G; ρ[·,·]: corpus-computed correlation of stems and affix sets; α: stem-affix set correlation weight; β: maximum probability gap for inflection groups; γ: minimum affix probability; δ: maximum probability gap for stems, POS tags and affix sets.
Output: candidate inflected forms and their scores.

Subroutine generateInflections(Ps, Pp, Pa, Pf, ρ[·,·], α, β, γ, δ):
    // All argSort(·) calls are in decreasing order
    Ts ← argSort(Ps)[:M]   // up to M items
    Tp ← argSort(Pp)[:M]
    Ta ← argSort(Pa)[:M]
    Tf ← argSort(Pf)[:N]
    Pg ← inflectionGroupProbabilities(Ts, Tp, Ta, Ps, Pp, Pa, M, G)
    Gs ← argSort(Pg)
    pp ← 0
    R ← [ ]   // list of inflections to return
    foreach g ∈ Gs do
        T's ← filterAndCutOff(Ts, Ps, g, δ, 0)
        T'p ← filterAndCutOff(Tp, Pp, g, δ, 0)
        T'a ← filterAndCutOff(Ta, Pa, g, δ, 0)
        T'f ← filterAndCutOff(Tf, Pf, g, δ, γ)
        Cg ← computeScore(T's, T'p, T'a, Ps, Pp, Pa, ρ[·,·], α)
        Lg ← argSort(Cg)
        foreach (s, p, a) ∈ Lg do
            // formulate affixes by merging the affix set's own affixes
            // and extra predicted affixes
            f ← affixMerge(a, T'f)
            // call the morphological synthesizer
            surface ← morphoSynthesis(s, f)
            if surface ≠ null then R.append((surface, s, p, a, f, Cg[s, p, a]))
        end foreach
        if (pp − Pg[g]) > β then break
        pp ← Pg[g]
    end foreach
    return R
    // The returned inflections are added to the beam for beam search-based decoding.
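For readers who prefer executable code, here is an illustrative Python rendering of the first two steps of Algorithm 2 (inflection-group voting and probability-gap filtering). The get_group callback is a placeholder for the real lexicon lookup, and the toy example at the end is invented; this is not the released decoder.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def inflection_group_probabilities(Ts, Tp, Ta, Ps, Pp, Pa, get_group, n_groups):
    """Vote on the inflection group from the top-M stems, POS tags and affix sets."""
    M = len(Ts)
    W = [[-100.0] * n_groups for _ in range(M)]
    for i in range(M):
        for item, p in ((Ts[i], Ps), (Tp[i], Pp), (Ta[i], Pa)):
            g = get_group(item)
            W[i][g] = max(W[i][g], math.log(p[item]))
    votes = [sum(W[i][g] for i in range(M)) for g in range(n_groups)]
    return softmax(votes)

def filter_and_cut_off(items, p, group, get_group, delta, gamma=0.0):
    """Keep items of the chosen group whose probability does not drop by more
    than delta from the previously kept item and stays above gamma."""
    kept, prev = [], None
    for item in items:                       # items are sorted by decreasing p[item]
        if get_group(item) == group and p[item] >= gamma:
            kept.append(item)
            if prev is not None and (prev - p[item]) > delta:
                break
            prev = p[item]
    return kept

# toy usage: two inflection groups (0 and 1), top-2 items per head
get_group = lambda item: 0 if item.startswith("n_") else 1
Ps = {"n_stem": 0.6, "v_stem": 0.4}
Pp = {"n_pos": 0.7, "v_pos": 0.3}
Pa = {"n_aset": 0.55, "v_aset": 0.45}
Pg = inflection_group_probabilities(["n_stem", "v_stem"], ["n_pos", "v_pos"],
                                    ["n_aset", "v_aset"], Ps, Pp, Pa, get_group, 2)
print(Pg)
print(filter_and_cut_off(["n_stem", "v_stem"], Ps, 0, get_group, delta=0.3))
```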
Search Term Used
Jehovah's AND yearPublished>=2024