ID: 200
Original Title: Human languages trade off complexity against efficiency
Sanitized Title: humanlanguagestradeoffcomplexityagainstefficiency
Clean Title: Human Languages Trade Off Complexity Against Efficiency
Source ID: 2
Article Id01: 618446524
Article Id02: oai:ids-pub.bsz-bw.de:12786
Corpus ID: (not set)
Dup: (not set)
Dup ID: (not set)
Url: https://core.ac.uk/outputs/618446524
Publication Url: (not set)
Download Url: https://core.ac.uk/download/618446524.pdf
Original Abstract: From a cross-linguistic perspective, language models are interesting because they can be used as idealised language learners that learn to produce and process language by being trained on a corpus of linguistic input. In this paper, we train different language models, from simple statistical models to advanced neural networks, on a database of 41 multilingual text collections comprising a wide variety of text types, which together include nearly 3 billion words across more than 6,500 documents in over 2,000 languages. We use the trained models to estimate entropy rates, a complexity measure derived from information theory. To compare entropy rates across both models and languages, we develop a quantitative approach that combines machine learning with semiparametric spatial filtering methods to account for both language- and document-specific characteristics, as well as phylogenetic and geographical language relationships. We first establish that entropy rate distributions are highly consistent across different language models, suggesting that the choice of model may have minimal impact on cross-linguistic investigations. On the basis of a much broader range of language models than in previous studies, we confirm results showing systematic differences in entropy rates, i.e. text complexity, across languages. These results challenge the long-held notion that all languages are equally complex. We then show that higher entropy rate tends to co-occur with shorter text length, and argue that this inverse relationship between complexity and length implies a compensatory mechanism whereby increased complexity is offset by increased efficiency. Finally, we introduce a multi-model multilevel inference approach to show that this complexity-efficiency trade-off is partly influenced by the social environment in which languages are used: languages spoken by larger communities tend to have higher entropy rates while using fewer symbols to encode messages.
Clean Abstract: (not set)
Tags: (not set)
Original Full Text:

Title page

Human languages trade off complexity against efficiency

Authors: Alexander Koplenig1*, Sascha Wolfer1, Jan Oliver Rüdiger1, Peter Meyer1
1 Department of Lexical Studies, Leibniz Institute for the German Language (IDS), Mannheim, Germany.
* Corresponding author. E-mail: koplenig@ids-mannheim.de
Originally published in: OSF Preprints, Center for Open Science (2024), 57 pp. DOI: https://doi.org/10.31219/osf.io/8xgqz
Publikationsserver des Leibniz-Instituts für Deutsche Sprache. URN: https://nbn-resolving.org/urn:nbn:de:bsz:mh39-127860
Creative Commons - Attribution 4.0 International

Abstract

From a cross-linguistic perspective, language models are interesting because they can be used as idealised language learners that learn to produce and process language by being trained on a corpus of linguistic input. In this paper, we train different language models, from simple statistical models to advanced neural networks, on a database of 41 multilingual text collections comprising a wide variety of text types, which together include nearly 3 billion words across more than 6,500 documents in over 2,000 languages. We use the trained models to estimate entropy rates, a complexity measure derived from information theory. To compare entropy rates across both models and languages, we develop a quantitative approach that combines machine learning with semiparametric spatial filtering methods to account for both language- and document-specific characteristics, as well as phylogenetic and geographical language relationships. We first establish that entropy rate distributions are highly consistent across different language models, suggesting that the choice of model may have minimal impact on cross-linguistic investigations. On the basis of a much broader range of language models than in previous studies, we confirm results showing systematic differences in entropy rates, i.e. text complexity, across languages. These results challenge the long-held notion that all languages are equally complex. We then show that higher entropy rate tends to co-occur with shorter text length, and argue that this inverse relationship between complexity and length implies a compensatory mechanism whereby increased complexity is offset by increased efficiency. Finally, we introduce a multi-model multilevel inference approach to show that this complexity-efficiency trade-off is partly influenced by the social environment in which languages are used: languages spoken by larger communities tend to have higher entropy rates while using fewer symbols to encode messages.

Keywords: Quantitative linguistics, computational linguistics, linguistic typology, language model, language complexity, information theory, entropy, machine learning, spatial filtering

Main text

1 Introduction

A model that assigns probabilities to sequences of linguistic symbols is called a language model (LM) [1]. While originally only trained on (vast amounts of) textual data to predict upcoming linguistic material [2], modern LMs demonstrate impressive and at times surprising capabilities in a wide range of scientific applications beyond linguistic tasks, such as predicting protein structures [3], forecasting time series [4], accelerating drug and material discovery [5, 6], analysing genomic and epi-genomic data [7], and enhancing climate modelling [8].
At the same time, LMs also excel in traditional linguistic tasks, as evidenced by their ability to perform zero-shot learning, where they effectively generalise to new tasks without specific training, as shown by [9]. This highlights their potential to acquire human-like grammatical language through statistical learning, without relying on a built-in grammar [10]. On this basis, a vibrant research field has emerged in which LMs are being used as computational working models [11] or models of languages [12] to study different aspects of language processing and comprehension [2]. From a cross-linguistic perspective, LMs are also interesting because they can be used as idealised language learners that learn to produce and process language by being trained on a corpus of linguistic input [10, 13]: a central goal of linguistics is to understand the diverse ways in which human language can be organised. By training an LM on linguistic material in different languages, researchers can investigate how these models learn and generalise linguistic structures and rules across different language families. The use of parallel corpora in this context is particularly valuable as it allows for systematic comparisons of language processing: parallel corpora, which contain translations of the same texts in multiple languages, enable the study of how LMs manage identical content within varying grammars and lexicons. Such datasets can also be used to systematically test and understand linguistic laws, i.e., statistical patterns shared across human languages [14], or to test whether languages adapt to the geographical or sociodemographic environment in which they are being learned and used [15–19]. Yet another idea, dating back to the work of Greenberg [20] and revived with the development of large parallel corpora [21], is to use parallel texts to classify and compare languages [22]. Examples of such cross-linguistic studies are [23–27]. However, the majority of the aforementioned studies are based on very peculiar text types, especially translations of the Bible, and the use of the Bible as a parallel text source poses several important challenges [17, 28, 29]. To tackle this problem, we leveraged available corpora and multilingual text collections [30–33] and compiled a database of parallel texts comprising a large variety of different text types, e.g. religious texts, legalese texts, subtitles for various movies and talks, and machine translations. In addition, we added comparable corpora, i.e., texts that are not parallel but come from comparable sources and are therefore similar in content, again comprising very different text types/genres, e.g. newspaper texts, web crawls, Wikipedia articles, translation tables for system messages in the Ubuntu operating system, or translated example sentences from a free collaborative online database. Furthermore, we added information from the Crúbadán project [34], which aims at creating text corpora for a large number of (especially under-resourced) languages. In total, the compiled database contains 41 different multilingual corpora, comprising nearly 3 billion words or nearly 9 billion Unicode characters across more than 6,500 documents and covering over 2,000 languages. These languages are spoken as a native language by more than 90% of the world's population and represent almost half of all languages with a standardised written representation. In a recent paper [35], we presented the first results based on this database.
In that study, our primary focus was a cross-linguistic examination of language complexity, a topic that has garnered significant attention in linguistics and related fields over the past two decades (for an overview, see [36]). We quantitatively evaluated the so-called equi-complexity hypothesis, which suggests that all human languages, despite their diverse and varied nature, have the same level of overall complexity [37]. To overcome the difficulty of measuring overall language complexity [38], we leveraged information theory, an area of mathematics that links probability and communication [39] and provides notions of complexity that are both objective and theory-neutral [40]. To this end, we trained a simple LM on each of our documents and statistically analysed the training process to infer the average per-symbol information content or entropy rate of each document, which can be interpreted as a measure of complexity [41, 42]: the harder it is, on average, to predict upcoming text, the higher the entropy rate and the greater the complexity of the text as a whole [10, 43–45]. We argued that the entropy rate can thus also be used to compare the complexity of different languages. We then statistically compared complexity rankings across different corpora by calculating correlation coefficients between the entropy rates across all possible pairs of multilingual corpora. For example, we correlated the entropy rate rankings derived from a corpus of movie subtitles in various languages with those from a similarly diverse corpus of Wikipedia sentences. This approach, applied comprehensively to all pairs among our 41 different multilingual corpora, makes it possible to assess the consistency of complexity rankings across various types of linguistic data. From an information-theoretic point of view, we showed that our results constitute evidence against the equi-complexity hypothesis: a language with a high/low entropy rate in one corpus also tends to be more/less complex in another corpus. As higher complexity in language results in more demanding processing efforts, encompassing both language production and comprehension, our study naturally leads to the question: why is there a trend towards increased complexity in certain languages? In this paper, we offer a potential answer to that question by demonstrating that high-entropy languages tend to need fewer symbols to encode messages. We argue that, from an information-theoretic point of view, this finding implies that higher complexity is compensated by higher efficiency. The main purpose of this study is to present, discuss and evaluate evidence for such a complexity-efficiency trade-off, while also striving to enhance the following empirical and methodological aspects of our prior work. First, in our previous paper we trained a rather simple statistical LM. However, as Baroni [46] pointed out, different LMs have uniquely structured internal architectures, and thus cannot be viewed as "blank slates." Instead, they should be regarded as algorithmic linguistic theories "encoding non-trivial structural priors facilitating language acquisition and processing" [46]. Therefore, we train various types of LMs on our data, ranging from simple statistical n-gram models to state-of-the-art neural network and transformer models.
Secondly, we improve our prior work methodologically by developing a machine learning method that fully accounts for the relatedness of languages: when comparing languages, statistical challenges arise from the fact that closely related languages often exhibit more similarities among themselves, in various respects, than they do with more distantly related languages. Additionally, languages originating from the same regions often tend to be influenced by common factors, further complicating the analysis [47–49]. While we included language family, macro-area and country as factors to account for the genealogical and geographic relatedness of languages in our prior paper, this approach ignores variation within language families and geographical units, as pointed out in several recent studies [47–51]. To address this issue, we develop two quantitative approaches: (i) a semiparametric machine learning estimation method capable of simultaneously controlling for document- and language-specific characteristics while directly modelling potential effects due to phylogenetic relatedness and geographic proximity; (ii) a multi-model multilevel inference approach designed to test whether cross-linguistic outcomes are statistically associated with sociodemographic factors, while accounting for phylogenetic and spatial autocorrelation via the inclusion of random effects and slopes.

The structure of this paper is as follows: the next section introduces the multilingual database and details the procedures for compiling the text data (Sect. 2.1). This is followed by a description of the sociodemographic and linguistic variables considered in this study (Sect. 2.2). We then introduce the investigated LMs (Sect. 2.3) and describe how the textual data was pre-processed (Sect. 2.4). The methodology for estimating entropy is presented in Sect. 2.5. Sect. 2.6 is devoted to statistical methods. We first establish a novel method for evaluating the similarity of entropy rate and length distributions across different corpora (Sect. 2.6.1). This is followed by a description of the multi-model inference approach used to analyse whether the entropy-length trade-off is influenced by the number of language users (Sect. 2.6.2). In Sect. 3, our findings are presented. The paper concludes with the 'Discussion' section, in which we discuss and evaluate the relevance of the trade-off between complexity and efficiency (Sect. 4). All data and code (Stata v18.0 and Python v3.6.8) needed to replicate our analyses are available at https://osf.io/xdwjc/. In addition, interactive results and visualisations are available online at https://www.owid.de/plus/tradeoffvis/.

2 Methods and materials

Some material in this section is recycled from our prior publications [35, 52], in accordance with the guidelines provided by the Text Recycling Research Project [53].

2.1 Database

In what follows, we give an overview of the database. Additional in-depth details regarding all corpora can be found in [32]. In total, we analysed 41 different multilingual text collections. 40 text collections consist of actual full-text data, while the remaining collection consists of word frequency information from the Crúbadán project [34]. Of the 40 full-text collections, 33 are fully parallel and 7 corpora contain comparable documents.
The full-text corpora can be loosely categorised into the following text types: 5 religious text collections, 4 news/Wikipedia/web crawl text collections, 5 text collections containing legalese texts, 22 multilingual subtitle corpora and 4 collections of other text types. In what follows, N_D denotes the number of documents in a collection, N_L the number of different languages, and L_w and L_c the median document length in words and Unicode characters, respectively.

2.1.1 Text types

2.1.1.1 Religious texts

The two parallel text collections BibleNT and BibleOT are both part of the Parallel Bible Corpus (PBC) made available by Mayer and Cysouw [31], which contains a total of 1,568 unique translations of the Bible. The BibleNT text collection consists of all 27 books that belong to the New Testament of the biblical canon. In total, N_D = 1,459 different documents, i.e., translations of the New Testament into another language, are available for N_L = 1,093 individual languages. The median length of individual documents is L_w = 227,391 words and L_c = 1,190,294 characters. Correspondingly, the BibleOT text collection consists of all 39 books that belong to the Old Testament (N_D = 254; N_L = 147; L_w = 642,772; L_c = 3,259,354). The two parallel text collections WatchtowerV1 and WatchtowerV2 are also part of the Parallel Bible Corpus, both containing translations of different introductory texts of the Jehovah's Witnesses' official web site [22] (WatchtowerV1: N_D = 142; N_L = 140; L_w = 129,008; L_c = 659,563; WatchtowerV2: N_D = 265; N_L = 260; L_w = 7,194; L_c = 35,608). The Quran collection consists of parallel translations of the central text of Islam downloaded from http://tanzil.net/trans/ (accessed 4/30/20; N_D = 43; N_L = 43; L_w = 182,950; L_c = 860,590).

2.1.1.2 Web crawls

The GlobalVoices comparable collection consists of contributions to the citizen media platform Global Voices. Raw text files of all articles were downloaded from http://casmacat.eu/corpus/global-voices.html (version: 2018Q4; accessed 4/30/20; N_D = 40; N_L = 39; L_w = 20,021; L_c = 112,541). The other three collections were compiled based on plain text files from the Leipzig Corpora Collection (LCC) [33], which presents corpora in a uniform format. Here we focus on three collections that we name as follows: (i) LCCnews, i.e., text material of crawled newspapers available online (N_D = 112; N_L = 85; L_w = 196,899; L_c = 1,119,551), (ii) LCCweb, i.e., text material crawled from randomly chosen web pages (N_D = 87; N_L = 85; L_w = 195,953; L_c = 1,127,076), and (iii) LCCwiki, i.e., text material from Wikipedia dumps (N_D = 171; N_L = 171; L_w = 185,770; L_c = 1,038,774). Each document in each corpus consists of 10,000 randomly shuffled sentences in the corresponding language.

2.1.1.3 Legalese texts

To compile the UDHR parallel collection, we downloaded parallel translations of the Universal Declaration of Human Rights from https://unicode.org/udhr/ (accessed 4/30/20; N_D = 452; N_L = 399; L_w = 1,978; L_c = 10,822). The other four legalese parallel text collections are all obtained from the OPUS project [30]. The EUconst collection consists of different translations of the European Constitution (N_D = 21; N_L = 21; L_w = 92,607; L_c = 620,502). Europarl is a corpus of documents extracted from the European Parliament web site (N_D = 21; N_L = 21; L_w = 5,362,935; L_c = 31,959,314). The collection EUmed is compiled from PDF documents from the European Medicines Agency (N_D = 22; N_L = 22; L_w = 3,241,844; L_c = 18,714,208). UNPC consists of manually translated documents in the six official languages of the United Nations (N_D = 6; N_L = 6; L_w = 341,723,872; L_c = 879,903,168).
2.1.1.4 Subtitles

The parallel subtitle collections consist of two types: subtitles of movies and subtitles of TED talks. The 13 movie subtitle collections are based on the ParTy corpus [32]. The Technology, Entertainment, Design (TED) talk subtitles were downloaded from https://amara.org/en/teams/ted/videos/ (accessed 4/30/20). Information regarding movie/talk titles, number of translations/languages per corpus and median lengths is provided in Table 1.

Table 1: Overview of the subtitle text collections. Columns: collection (Movies/TED Talks), collection ID, movie/talk title, number of documents (N_D), number of different languages (N_L, ISO-639-3 codes), and median text length in words (L_w) and characters (L_c).

Movies
MSub01 | Amelie | 29 | 29 | 7,349 | 34,207
MSub02 | Avatar | 26 | 26 | 9,695 | 43,988
MSub03 | Black Swan | 37 | 37 | 4,661 | 20,444
MSub04 | Bridge of Spies | 16 | 16 | 12,690 | 60,718
MSub05 | Das Leben der Anderen | 15 | 15 | 8,908 | 42,103
MSub06 | Frozen | 27 | 27 | 7,832 | 34,262
MSub07 | Gone Girl | 12 | 12 | 17,317 | 79,075
MSub08 | Grand Budapest Hotel | 9 | 9 | 9,598 | 46,148
MSub09 | Imitation game | 15 | 15 | 10,303 | 49,983
MSub10 | Inception | 28 | 28 | 10,937 | 53,059
MSub11 | Ironlady | 14 | 14 | 9,679 | 45,133
MSub12 | Noah | 30 | 30 | 5,516 | 25,804
MSub13 | Spectre | 14 | 14 | 7,533 | 34,984
TED Talks
TEDt01 | Bring on the learning revolution | 50 | 50 | 3,075 | 14,455
TEDt02 | Do schools kill creativity | 60 | 59 | 3,555 | 17,360
TEDt03 | Doing the impossible cutting through fear | 61 | 60 | 3,763 | 18,468
TEDt04 | Modern Warrior | 31 | 31 | 2,099 | 10,409
TEDt05 | My philosophy for a happy life | 31 | 31 | 1,853 | 8,658
TEDt06 | Secondary sugar kills | 28 | 28 | 1,369 | 6,318
TEDt07 | Speak to the heart | 74 | 71 | 1,376 | 7,017
TEDt08 | Success is a continuous journey | 49 | 48 | 770 | 3,669
TEDt09 | Why is x the unknown | 52 | 51 | 562 | 2,746

2.1.1.5 Other

The comparable collection Ubuntu consists of Ubuntu localization files. Texts are available from OPUS [30] (N_D = 86; N_L = 86; L_w = 5,780; L_c = 35,888). To compile the parallel Google Translate collection, we used Google Translate (https://translate.google.com; accessed 09/04/2019) to machine translate a short passage excerpt from the book "Aldono al la Dua Libro de l' Lingvo Internacia" in the constructed language Esperanto, written by its inventor L. L. Zamenhof (downloaded from http://esperanto.davidgsimpson.com/librejo/index.html on 09/04/2019), into 102 languages (see [35] for the translated passage; N_D = 102; N_L = 102; L_w = 716; L_c = 3,875). The two comparable corpora Tatoeba V1 and Tatoeba V2 are compiled based on Tatoeba, a collaborative online platform that makes available sentences translated into different languages. Raw data were downloaded from https://tatoeba.org/deu/downloads (accessed 4/30/20; V1: N_D = 183; N_L = 183; L_w = 656; L_c = 3,034; V2: N_D = 123; N_L = 123; L_w = 3,438; L_c = 15,710).

2.1.1.6 Word list

To generate the word list data, we downloaded all available lists from the Crúbadán project [34] from http://crubadan.org/files/ (accessed 4/30/20). In total, we arrived at 2,216 word frequency lists for a total of 1,943 different languages (L_w = 101,079).

2.1.2 Overview of the database

Figure 1 displays a map highlighting the geographical distribution of languages for the compiled multilingual database. The figure reveals an imbalance at the language level within the database: over 100 languages have at least 10 documents, but approximately 75% of languages have fewer than four documents. This scarcity reflects the limited electronic availability of documents in languages spoken by smaller populations [34].
This is exemplified by the contrast in median speaker numbers: while the median for all non-extinct languages documented by the Ethnologue stands at 8,000 [54], the median for languages represented with at least one document in our database is significantly higher, at 30,000. In addition, the majority of our text collections contain a comparatively small number of individual documents, with a median of 40 documents per corpus. This limited size can be attributed to specific reasons in certain cases; for example, the EUconst collection is naturally restricted to translations of the European Constitution into the official languages of the European Union. In contrast, for other collections such as the subtitle corpora, translations into further languages were not available when we compiled the database. On the other side of the spectrum, we have 11 multilingual text collections that consist of more than 100 different documents. As described above, documents are rather short, e.g. 25% of the documents are below 14,575 characters or 3,181 words. However, 200 documents are longer than 1 million characters, 49 documents are longer than 10 million characters, and the longest documents are several hundred million words and more than a billion characters long.

Figure 1: Global distribution of collected documents per language. Approximately 76% of languages have fewer than five documents. On the other side of this spectrum, over 160 languages have more than 10 documents. This imbalance reflects the limited electronic availability of documents in languages spoken by smaller populations [34].

In what follows, we statistically compare the structure found in smaller corpora (i.e., those consisting of shorter documents and/or a limited number of available documents) with the structure found in larger corpora (i.e., those consisting of longer documents and/or data points for many languages). The idea is that if the results from both smaller and larger corpora align, this strengthens the claim that these results are not merely artefacts resulting from database bias. Additionally, we include control covariates, such as the number of corpora per language, to account for the unbalanced nature of our database, as described in the next section.

2.2 Sociodemographic and linguistic variables

Information on speaker population size, corpus, language family, language (identified by its ISO-639-3 code), macro-area, writing script, longitude and latitude is taken from [35]. Expanded Graded Intergenerational Disruption Scale (EGIDS) level information was initially sourced from [55], which is reported in Glottolog [56] (v4.2.1). Country is defined by Ethnologue [57, 58] as the primary country/country of origin of the language in question [54]. To ensure completeness, we manually supplemented missing data from [55] by cross-referencing with Glottolog and Ethnologue. The EGIDS level serves as a measure of a language's endangerment status [59]. We use the EGIDS level as a covariate to control for potential translation effects [60, 61], as languages with lower EGIDS levels are presumably more likely to be used as source languages, while languages with higher EGIDS levels are presumably more likely to be used as target languages. For example, an EGIDS level of 0 (labelled "International") pertains to the six official United Nations languages: Arabic, Chinese, English, French, Russian, and Spanish.
On the other hand, values of five and above pertain to languages that are not used in formal education, mass media or by the government, and that may consequently be more susceptible to (more) pronounced "translationese" influences [61]. With a similar logic in mind, and to account for the unbalancedness of our database (see Sect. 2.1.2), we also consider the number of corpora with at least one available document per language as an additional control variable in what follows. Further information regarding the classification of languages into macro-family and sub-family is taken from [62]. We manually added information for languages where it was missing by using publicly available genealogical classifications (see the script 'prepare_language_info.do' available at https://osf.io/tkgph/ for details). Classifications in [62] are given as comma-separated values. We define the first value as the macro-family and the second one as the sub-family; e.g. for the language "Ghotuo" the classification is "Niger-Congo, Atlantic-Congo, Volta-Congo, Benue-Congo, Edoid, North-Central, Ghotuo-Uneme-Yekhee", so the macro-family is "Niger-Congo" and the sub-family is "Atlantic-Congo". Additionally, we use a phylogenetic similarity matrix, also provided by [62], that is based on word lists from the Automated Similarity Judgment Program (ASJP) [63]. Information on the number of countries in which each language is spoken was sourced from Glottolog (v4.2.1). We manually supplemented missing data by cross-referencing with Ethnologue [57, 58]. The rationale behind considering this variable as a potential covariate is to account for varying degrees of pluricentrism [64]. For instance, languages such as Chinese or Spanish are spoken in several countries and may therefore have different codified standard forms. For further information and a discussion of potential caveats and problems regarding the assignment of environmental variables to individual languages in order to reflect local grouping structure, see [65, 66].

2.3 Language models

We use general-purpose data compression algorithms, taking advantage of the fact that language modelling and lossless compression are essentially equivalent [67–69]. All data compression algorithms consist of a model and a coder [13]. Our focus is on the class of (lossless) compressors where the algorithm uses training data to estimate a model, i.e., a conditional probability distribution, that can be used to generate predictions about upcoming symbols. To perform compression, the predicted probabilities are then used to encode symbols using a technique called arithmetic encoding [70].

Table 2: Language models. The table lists each investigated LM along with its implementation techniques, source, and the time required to train it on a document of median length. Training times are provided in seconds and categorised based on the hardware used: Central Processing Unit (CPU) and High-Performance Computing (HPC) cluster with Graphics Processing Unit (GPU) support. The first five LMs were run exclusively on a CPU, while the remaining two LMs were run on a GPU. For comparison, we also include the computation time required if these two models are run on a CPU. Further implementation details are given in Appendix A.1.
LM | Technique/Algorithm | Model specification | Source | Time
PPM2 | N-gram modelling [71], prediction by partial matching [72], memory: 2,000 megabytes | order 2 | [73, 74] | 0.1 (CPU)
PPM6 | (as above) | order 6 | (as above) | 0.1 (CPU)
PAQ | Context mixing [13, 75], gated linear network [76] | weights ~1.7 million, parameters ~3,800 | [77, 78] | 54.5 (CPU)
LSTM | Long short-term memory [79] | parameters ~3.93 million | [80] | 98.5 (CPU)
TRFsmall | Transformer [81] | parameters ~2.24 million | [80] | 150.5 (CPU)
TRFmed | (as above) | parameters ~19.1 million | (as above) | 1,252.5 (CPU) / 21.2 (GPU)
TRFbig | (as above) | parameters ~279 million | (as above) | 37,759.8 (CPU) / 58.9 (GPU)

The seven LMs that we investigate are summarised in Table 2. In what follows, further details are given for each language model. PPM is a dynamic and adaptive variable-order n-gram LM. The algorithm assumes the Markov property: to predict the next symbol, the algorithm uses the last o immediately preceding symbols. For PPM2, we set o to 2, i.e., the last 2 symbols are used as context to generate predictions. For PPM6, we set o to 6. In both cases, the level of compression is set to maximum and the size of the used memory is set to 2,000 megabytes. PAQ can be described as a weighted combination of predictions from a large number of models, where the individual models are combined using a gated linear network [13, 75, 76, 78]. The network has a single layer with 552 input nodes and 3,080 input weights. The model has a total of ~1.7 million weights, but due to a sparse updating scheme, which leads to faster compression and decompression, the effective number of parameters used in training is significantly lower: only 552 · 7 = 3,864 weights are updated for each bit of data. We use version PAQ8o and set the compression level to maximum, requiring 1,712 megabytes of memory. NNCP [80] is a lossless data compressor that is based on the Transformer XL model defined in [82]. Modifications to the original Transformer XL model and algorithmic details are provided in [83, 84]. As for LSTM, the Adam optimiser is used and we use the "encode only" mode. For TRFsmall (version 3.1), we use the default options with four layers and a model dimension of 256, resulting in a total of ~2.24 million parameters. For TRFmed (version 3.2), we set the number of layers to 12 and the model dimension to 512, resulting in a total of ~19.1 million parameters. For TRFbig (version 3.2), we use the available "enwik9" profile, which sets the number of layers to 20 and the model dimension to 1,024, resulting in a total of ~279 million parameters. In addition, NNCP offers compression based on a Long Short-Term Memory deep neural network (LSTM) [79]. We use four layers of LSTM cells. The network is trained using truncated-like backpropagation [83, 85] and Adam optimisation is used to update the network weights [86]. We do not use a text pre-processor or tokeniser, and we use the faster "encode only" mode (the output cannot be decompressed, but the compression itself is still lossless). The total number of parameters is ~3.93 million. In addition, when discussing the relevance of the complexity-efficiency trade-off (Sect. 4), we use OpenAI's GPT-2 model [87] with ~1.5 billion parameters, as implemented in the Hugging Face library [88].

2.4 Text pre-processing and information encoding units

Each document is tokenised and Unicode normalised where necessary. All uppercase characters are lowered based on the closest language-specific International Organization for Standardization (ISO) code.
Unless otherwise specified in [35], the word-break algorithm of the International Components for Unicode library [89; Annex #29] was used to detect word boundaries (in texts without spaces or marks between words, a dictionary lookup method is used by the algorithm [90]). More details regarding each individual text collection can be found in [35]. Following [25], we represent a text κ as a random variable that is created by drawing (with replacement) from a set of symbol types A = {x_1, x_2, x_3, ..., x_V}, where V is the number of symbol types, i.e., V = |A|. Correspondingly, a symbol token is any reoccurrence of a symbol type [25]. In what follows, we estimate the relevant information-theoretic quantities for the following information encoding units/symbol types: (i) (Unicode) characters, (ii) words and (iii) sub-word units. For (iii), we apply byte pair encoding (BPE) [91, 92] to split words into one or several units and then train the different LMs over the resulting sequences of sub-word units. We follow [91] and set the number of BPE merges to 0.4·V. After tokenisation into (ii) words/(iii) sub-word units, each word/BPE type is replaced by one unique Unicode symbol. The different compression algorithms are then used to compress the resulting symbol sequence. On the BPE level, we also compress the mapping of sub-word units to 4-byte Unicode symbols.

2.5 Entropy estimation

In order to quantify the amount of information contained in κ, we can represent κ as a distribution of symbol frequencies by counting how often each symbol type x_j appears in κ and calling the resulting frequency f_j. The Gibbs-Shannon unigram entropy H of this distribution can be computed as [39]:

H(κ) = −∑_{j=1}^{V} p(x_j) · log p(x_j)   (1)

where p(x_j) = f_j / ∑_{j=1}^{V} f_j is the maximum likelihood estimator of the probability of x_j in κ, which consists of ∑_{j=1}^{V} f_j tokens. In what follows, all logs are to the base two, so the quantities are expressed in bits. H(κ) can be interpreted as the average number of (yes/no) guesses that are needed to correctly predict the type of a symbol token that is randomly sampled from κ. The entropy rate or per-symbol entropy of a stochastic process can be formally defined as [25, 39]:

h(κ) = lim_{N→∞} (1/N) H_N(κ) = lim_{N→∞} (1/N) H(S_1^N)   (2)

where S_1^N = S_1, S_2, ..., S_N represents a block of consecutive tokens of length N and H_N(κ) denotes the so-called block entropy of block size N [25, 93]. Following [41], we define u_N as the prediction complexity of S_N given S_1, S_2, ..., S_{N−1} as follows:

u_N ≡ H(S_N | S_1^{N−1})   (3)

u_N quantifies the average uncertainty of the Nth symbol, given all preceding tokens S_1^{N−1}. Assuming a stationary ergodic stochastic process [25, 39, 42], u_N reaches the entropy rate h as N tends to infinity [39, 41]:

h(κ) = lim_{N→∞} u_N   (4)

In analogy to H(κ), the entropy rate h(κ) can be informally understood as the average number of guesses that are needed to guess the next symbol of a sequence, thus incorporating the notion that prediction and understanding are intimately related [13, 69, 93, 94]. Information can then be defined as any kind of knowledge that, when in your possession, allows you to make predictions with greater accuracy than mere chance [95, 96]. Thus, h encompasses complexity from various linguistic sub-domains, since any form of linguistic (e.g. grammatical, phonological, lexical, pragmatic) or non-linguistic (e.g. world) knowledge will help a reader or listener to predict more accurately and will therefore reduce h [10].
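To make these quantities concrete, the following minimal Python sketch (not the authors' implementation) computes the unigram entropy of eq. (1) for a toy text and, as an illustrative stand-in for the LM-based per-symbol estimate introduced in the next paragraph, the number of bits per character needed by a general-purpose compressor; the toy string and the use of lzma are assumptions made here purely for illustration.

```python
import lzma
import math
from collections import Counter

def unigram_entropy(tokens):
    """Gibbs-Shannon unigram entropy H of eq. (1): average number of bits
    needed to guess the type of a randomly sampled token, using maximum
    likelihood probability estimates p(x_j) = f_j / sum of all f."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum((f / n) * math.log2(f / n) for f in counts.values())

def bits_per_symbol(text):
    """Illustrative per-symbol estimate: compressed size in bits divided by
    the number of characters. The paper instead trains dedicated LMs
    (PPM, PAQ, LSTM, Transformers) and uses eq. (5); lzma is only a stand-in,
    and for very short strings its container overhead dominates."""
    return 8 * len(lzma.compress(text.encode("utf-8"))) / len(text)

toy = "the cat sat on the mat and the dog sat on the mat"
print(f"H over words:      {unigram_entropy(toy.split()):.3f} bits/word")
print(f"H over characters: {unigram_entropy(list(toy)):.3f} bits/char")
print(f"lzma bits/char:    {bits_per_symbol(toy):.3f}")
```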
Since the probability distribution for any natural language is unknown [13, 93], we use the data compression algorithms described above to estimate h(κ). Per LM, the entropy rate estimate is computed (roughly speaking) as the number of bits per symbol in the compressed text:

h_LM(κ) = K_LM(κ) / L(κ)   (5)

where K_LM(κ) = C_LM(κ) − C_LM(κ_train). Here, C_LM(κ) denotes the number of bits that are needed by the LM to compress κ, κ_train represents the first half of κ, and L(κ) represents the length of the second half of κ, measured in words on the level of words or, on both the character and the BPE level, in Unicode characters. Note that on the BPE level we also compress the mapping of unique symbols to 4-byte Unicode symbols mentioned above and add the resulting compressed lengths to C_LM(κ) and C_LM(κ_train). Further note that h_LM is directly related to the quantity perplexity that is often used in natural language processing to measure the quality of a language model, where perplexity is defined as 2^{h_LM} [40]. We use this relationship to also choose the LM that achieves the lowest perplexity on the test data:

h_best(κ) = min_{LM ∈ ℒ} h_LM(κ)   (6)

where ℒ denotes the set of different LMs, i.e., ℒ = {PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig}. In a similar vein, we choose K_best(κ).

2.6 Statistical analysis

2.6.1 Comparing entropy/length distributions across LMs and corpora

We now take κ to be a corpus that consists of individual texts κ_i, where i denotes the 1, ..., I different languages. For brevity, we omit the superscript and the hat in what follows. However, we compute the correlation coefficients described below both for the best LM (h_best and K_best) and for each individual LM (h_LM and K_LM). The entropy estimate for κ_i is denoted as h_ς(κ_i), where ς denotes one of three information encoding units (words, characters, BPE); likewise for K_ς(κ_i) and L_ς(κ_i). h_ς(κ_i) and K_ς(κ_i) are computed on all three levels, while L_ς(κ_i) is computed on either the word or the character level. Note that for languages with more than one available translation/document in a corpus, all quantities are averaged. To evaluate the (dis-)similarity of entropy/length distributions across corpora and to test for a potential trade-off between entropy and length, we first compute the pairwise Pearson correlation ρ[y_ς′(κ), y_ς″(λ)] for all corpus pairs (κ, λ), where y_ς′ denotes either h, L or K on encoding level ς′, and likewise for y_ς″. In addition, we also consider H computed on the basis of the Crúbadán word frequency information (eq. 1), denoted as H_Cr in what follows. Both y_ς′(κ) and y_ς″(λ) are logged. To control for potential sources of influence, we fit linear models of the form:

y = Xβ + ε   (7)

where y is the n × 1 vector of observed values, either y_ς′(κ) or y_ς″(λ), n denotes the number of languages that are available in both κ and λ, X is the n × p design matrix of p covariates including an n × 1 vector of ones for the intercept, β is the corresponding p × 1 vector of coefficients and ε is the n × 1 vector of residuals. We assume that the ε_i are identically distributed with E(ε_i) = 0 and variance σ². Importantly, we want to rule out potential autocorrelation among the residuals, i.e., we wish to test the following null hypothesis in what follows:

H_0: E[εε^T] = σ²I   (8)

where I is the n × n identity matrix and T denotes the matrix transpose. As potential control variables, we consider the EGIDS level, the (logged) speaker population size, the (logged) number of corpora and the (logged) number of countries in which the language is spoken.
We also include a set of indicator variables for the levels (categories) of writing script. To avoid overfitting, scripts that were unique to a single language were grouped into a common category. For H_Cr, we additionally control for the number of words and the number of documents (both logged) on which the word frequency list is based, and for a binary variable indicating whether the word frequency list is truncated, to account for differences in the way different Crúbadán word lists were generated. Further information can be found in [35]. To select the relevant control variables from the candidate set, we use the lasso machine learning technique [97]. To choose the optimal value of the penalty parameter for each lasso, we use a heteroskedastic plugin estimator [98]. Languages with missing information on any of the control variables are excluded in each case. Let X̃ denote an n × s matrix of the s covariates selected by the lasso. We then regress y on X̃ and compute residuals denoted as ε̂. To test H_0 from above (eq. 8), we compute the modified version of Moran's I [99] suggested by Kelejian and Prucha [100], written as:

I = n(ε̂^T W ε̂) [(ε̂^T ε̂) √tr{(W^T + W)W}]^{−1}   (9)

where W denotes an n × n weighting matrix and tr represents the trace operator. For W, we consider two inverse distance matrices: (i) to test for spatial autocorrelation, we construct an inverse distance matrix W_G based on longitude and latitude information; (ii) to test for phylogenetic autocorrelation, we construct an inverse distance matrix W_P based on a phylogenetic similarity matrix provided by [62]. In both cases, matrix elements are equal to the reciprocal of the distance and are then normalised using spectral normalisation. We test for autocorrelation with (i) W_G as input, (ii) W_P as input and (iii), as explained below, both W_G and W_P as input. For brevity, we drop the subscript in what follows and describe our algorithmic approach for an input matrix W. Since I² ~ χ²(1) [100], we test H_0 via a standard χ²-test with one degree of freedom. If p < 0.05, we extend our linear regression model (eq. 7) by a semiparametric filter [101–103] as:

y = Fδ + Xβ + ε   (10)

where, in addition to the above, F is an n × q matrix of q eigenvectors and δ is a q × 1 vector of parameters. F is computed based on a transformed version of W defined as [101]:

M ≡ (I − 11^T/n) W (I − 11^T/n)   (11)

The eigensystem decomposition of M generates n eigenvalues and n corresponding eigenvectors. The eigenvalues are then sorted in descending order, denoted as Λ = (λ_1, λ_2, λ_3, ..., λ_n), so that the largest eigenvalue receives the subscript 1, the second largest eigenvalue receives the subscript 2, and so on. The corresponding set of eigenvectors can then be denoted as E = (e_1, e_2, e_3, ..., e_n). We include e_1 in F and let the lasso select a subset of control variables from X. We then compute ε̂ based on a regression of y on X̃ and F. After that, we perform the χ²-test again. If the test is still significant, we also include e_2 in F and re-perform the estimation. This iterative procedure is repeated until p ≥ .05. [Note that in scenario (iii), where we simultaneously control for W_G and W_P, we compute two sets of eigenvectors, denoted as E_G = (e_{G,1}, e_{G,2}, e_{G,3}, ..., e_{G,n}) and E_P = (e_{P,1}, e_{P,2}, e_{P,3}, ..., e_{P,n}), and alternate the inclusion of eigenvectors in F, i.e., we first include e_{G,1}, then e_{P,1}, and so on. Correspondingly, I² ~ χ²(2).] Let ε̂_ς′(κ) denote the resulting residuals for y_ς′(κ), and likewise ε̂_ς″(λ) for y_ς″(λ).
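A minimal Python sketch of this iterative eigenvector-filtering loop (eqs. 9–11) is given below. It assumes a symmetric, already spectrally normalised weight matrix W and a design matrix X that already contains the intercept; the lasso-based covariate selection and the alternating two-matrix variant of scenario (iii) are omitted, so this is an illustration of the idea rather than the authors' implementation.

```python
import numpy as np
from scipy import stats

def moran_i_kp(resid, W):
    """Modified Moran's I of Kelejian and Prucha (eq. 9)."""
    n = len(resid)
    num = n * (resid @ W @ resid)
    den = (resid @ resid) * np.sqrt(np.trace((W.T + W) @ W))
    return num / den

def spatial_filter(y, X, W, alpha=0.05):
    """Iteratively add eigenvectors of M (eq. 11), sorted by descending
    eigenvalue, to the design matrix (eq. 10) until the chi-squared test of
    I^2 is no longer significant. Returns the filtered residuals and the
    number of eigenvectors that were needed."""
    n = len(y)
    C = np.eye(n) - np.ones((n, n)) / n          # centring matrix I - 11'/n
    eigval, eigvec = np.linalg.eigh(C @ W @ C)   # eigensystem of the symmetric M
    E = eigvec[:, np.argsort(eigval)[::-1]]      # eigenvectors, largest eigenvalue first
    F = np.empty((n, 0))
    for q in range(n + 1):
        design = np.hstack([F, X])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ beta
        p = stats.chi2.sf(moran_i_kp(resid, W) ** 2, df=1)
        if p >= alpha or q == n:                 # no detectable autocorrelation left
            return resid, q
        F = np.hstack([F, E[:, [q]]])            # add the next eigenvector and refit
```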
The filtering procedure ensures that there is no spatial/phylogenetic correlation among the resulting residuals, i.e., ε̂_ς′(κ) and ε̂_ς″(λ). We then compute the Pearson correlation between those residuals and proceed as described above. The correlation coefficients per condition (none, geographical, phylogenetic and both) are denoted as ρ_none, ρ_geo, ρ_phylo and ρ_both.

2.6.2 Multi-model multilevel inference

To evaluate whether the trade-off is moderated by the social environment in which languages are being used, we run separate multilevel effects models (MLEMs) with (i) h or K as the outcome on all three levels (words/characters/BPE) for all eight LMs (best, PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig) and (ii) L as the outcome for words/characters. For N = 3,705 individual documents, we fit MLEMs of the form [104]:

y = Xβ + Zu + ε   (12)

where, in addition to the above, Z is a matrix of random predictors and u is a vector of random effects that are assumed to follow a normal distribution with mean 0 and variance-covariance matrix G. The residual errors ε are assumed to follow a normal distribution with mean 0 and variance matrix σ²I; u ⊥ ε. To enhance convergence, the outcome is standardised per corpus, i.e., the corpus-specific mean is subtracted from each observed value and the result is divided by the corpus-specific standard deviation, but we also provide results for log-transformed outcomes (see https://osf.io/93csg/) that can be visualised in our interactive online application (https://www.owid.de/plus/tradeoffvis/). We consider a fixed effect for the estimated speaker population size (logged) as a proxy for population structure [105]. The following control variables are included: (i) fixed effects: corpus type (parallel/comparable), binary indicators for the first four EGIDS levels and the (logged) number of countries; (ii) random intercepts for the following groups: writing script, corpus, macro-area, macro-family, sub-family and language. We cross corpus, macro-area, macro-family and writing script and explicitly nest language within sub-family within macro-family; (iii) random slopes for population size, i.e., we allow the effect of population size to vary across the different groups. We adopt a multi-model inference approach [106] by sub-setting each full model, i.e., we generate a set of models with all possible control variable subsets, which are then fitted to the data. We fit sub-models per outcome, type and LM. All models were fitted with gradient-based maximisation (a maximum of 20 iterations) via maximum likelihood (ML). Per outcome and per type, we then compute a frequentist model averaging (FMA) estimator over all R candidate models [106–108]:

β̂_x = ∑_{j=1}^{R} w_j β̂_{x,j}   (13)

where β̂_{x,j} denotes the estimated fixed effect of variable x in model j and w_j is a weight computed as:

w_j = e^{−Δ_j/2} / A   (14)

where A = ∑_{r=1}^{R} e^{−Δ_r/2} represents the sum of the weights of all R models. To compute Δ_j, we use Akaike's information criterion (AIC) [109], where lower values indicate a better model: Δ_j = AIC_j − AIC_min, where AIC_j denotes the AIC value computed for model j and AIC_min represents the minimum AIC value over all R models. Note that in models where x does not appear, β̂_{x,j} ≡ 0. On this basis, we compute an FMA estimator of the standard error (SE) as [106]:

SE(β̂_x) = ∑_{j=1}^{R} w_j √(SE(β̂_{x,j})² + (β̂_{x,j} − β̂_x)²)   (15)

where SE(β̂_{x,j}) denotes the estimated standard error of β̂_{x,j} for model j. In models where x does not appear, we set SE(β̂_{x,j}) ≡ 0.
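The FMA machinery of eqs. (13)–(15) is straightforward to compute once each candidate model's effect estimate, standard error and AIC are available. The following Python sketch is an illustration under assumed toy inputs, not the authors' Stata implementation; the final line anticipates the two-tailed p-value defined in the text that follows.

```python
import numpy as np
from scipy.stats import norm

def fma(betas, ses, aics, in_model):
    """Frequentist model averaging over R candidate models.
    betas, ses: effect of x and its SE per model (0 where x is absent);
    aics: AIC per model; in_model: 0/1 indicator of whether x is in the model.
    Returns the averaged effect (eq. 13), its SE (eq. 15) and the relative
    importance of x (eq. 16)."""
    betas, ses, aics, in_model = map(np.asarray, (betas, ses, aics, in_model))
    w = np.exp(-0.5 * (aics - aics.min()))      # Akaike weights of eq. (14) ...
    w /= w.sum()                                # ... normalised to sum to one
    beta = np.sum(w * betas)
    se = np.sum(w * np.sqrt(ses ** 2 + (betas - beta) ** 2))
    importance = np.sum(w * in_model)
    return beta, se, importance

# toy example with three candidate models (hypothetical numbers)
beta, se, imp = fma(betas=[0.20, 0.35, 0.0], ses=[0.10, 0.12, 0.0],
                    aics=[102.3, 101.1, 110.4], in_model=[1, 1, 0])
p_value = 2 * (1 - norm.cdf(abs(beta / se)))    # two-tailed p-value
```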
To assess statistical significance, we compute a corresponding two-tailed p-value as p = 2 · (1 − Φ(|β̂_x / SE(β̂_x)|)), where Φ(·) denotes the cumulative standard normal distribution function. In a similar vein, we compute a 95% confidence interval (95%-CI) as β̂_x ± Φ^{−1}(0.975) · SE(β̂_x), where Φ^{−1}(·) denotes the inverse cumulative standard normal distribution function. Note that the Akaike weights w_j can be "interpreted as approximate probabilities of each model being the actual best model, given the data" [106]. Thus, we can use the w_j to estimate the relative importance of variable x, computed as [106]:

Î_x = ∑_{j=1}^{R} w_j δ_{x,j}   (16)

where δ_{x,j} is a binary indicator that is equal to 1 if x is explicitly included in model j and 0 otherwise [106]. The larger Î_x, the more important x. To put the value of Î_x into perspective, we show in Appendix A.2 that its theoretical minimum is ~0.27.

3 Results

3.1 Comparing entropy/length distributions across LMs and corpora

3.1.1 Comparing language models

Figure 2: Comparing language models. (a–c) LM-specific entropy rates as a function of text length for different symbolic levels (words, characters, BPE). Each solid line represents a locally weighted scatterplot smoother (bandwidth = 0.3) for the entropy estimates of Nκ = 4,297 different documents that belong to the compiled multilingual database (see Sect. 2.1 for details; rug plots at the bottom of each graph illustrate the length distribution). Note that on the BPE level, h_LM is plotted against L in characters. (d–f) Median unadjusted Pearson correlation, ρ_none, across LMs. These values are calculated by first cross-correlating average entropy rates per language among LMs for each of the 40 full-text corpora, followed by computing the median value for each LM pair. Interactive visualisations are available at https://www.owid.de/plus/tradeoffvis/.

Figure 2a-c summarises the distribution of h as a function of length L for each level and each investigated LM across the 40 full-text multilingual text collections/corpora, totalling Nκ = 4,297 different documents (see Sect. 2.1 for details). The plots indicate that for most documents PAQ (mint line) turns out to be the best LM, i.e., h_best(κ) = h_PAQ(κ) (see Sect. 2.5 for details). For longer documents, the larger LMs (LSTM and the three Transformer LMs) achieve similar or lower entropy rates. Correspondingly, Table 3 shows that PAQ has the lowest h in more than 90% of the documents across all three symbolic levels.

Table 3: Best LM per level. For each of the Nκ = 4,297 different documents and for each symbolic level (words, characters, BPE), the table lists the number (and percentage) of documents for which the LM in the given row is the best, i.e., h_best = h_LM. On all three levels, PAQ is the best LM in more than 90% of documents.

LM | Words | Characters | BPE
PPM2 | 0 (0.00%) | 0 (0.00%) | 0 (0.00%)
PPM6 | 18 (0.42%) | 261 (6.07%) | 85 (1.98%)
PAQ | 4,241 (98.70%) | 3,960 (92.16%) | 4,182 (97.32%)
LSTM | 1 (0.02%) | 3 (0.07%) | 0 (0.00%)
TRFsmall | 0 (0.00%) | 0 (0.00%) | 0 (0.00%)
TRFmed | 3 (0.07%) | 0 (0.00%) | 1 (0.02%)
TRFbig | 34 (0.79%) | 73 (1.70%) | 29 (0.67%)

To determine whether entropy rate distributions are systematically affected by the choice of LM, we used the entropy estimates for the 40 full-text corpora at each symbolic level to compute pairwise correlations ρ_none for each pair of LMs. Figure 2d-f presents the resulting pairwise relationships as correlation matrices. Each cell represents the median value of ρ_none for a pair of LMs.
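The logic behind these median correlation matrices can be reproduced in a few lines of pandas; the DataFrame layout assumed below (a MultiIndex of corpus and language with one column per LM) is a hypothetical choice for illustration, not the authors' data format.

```python
import pandas as pd

def median_lm_correlations(entropy):
    """entropy: DataFrame indexed by (corpus, language) with one column per LM
    holding the language-averaged entropy rate. For each corpus, compute the
    pairwise Pearson correlations between the LM columns, then take the
    element-wise median of these correlation matrices across corpora."""
    per_corpus = [grp.corr(method="pearson")
                  for _, grp in entropy.groupby(level="corpus")]
    return pd.concat(per_corpus).groupby(level=0).median()
```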
At each symbolic level, the statistical association between different LMs is remarkably strong. Even the lowest median value, observed at the character level between TRFbig and PPM6, is high, with a value of ρ_none = 0.84. To rule out the possibility that these associations mainly result from either language- and document-specific characteristics or the genealogical and geographic relatedness of languages (see Sect. 2.6.1 for details), Figure 3 visualises all four types of estimated correlation coefficients, i.e., ρ_none and the corresponding adjusted partial correlations ρ_geo, ρ_phylo and ρ_both, for all LM pairs across corpora and symbolic levels. The results indicate that the adjusted partial correlations point in the same direction as the unadjusted correlations. The results presented in this section demonstrate that although there are differences in the performance of various LMs depending on document length (Figure 2a-c), the resulting entropy rate distributions are remarkably consistent across LMs (Figure 3). This suggests that, from a cross-linguistic perspective, the choice of LM for investigating different languages may have a minimal impact.

Figure 3: Distribution of pairwise correlations across LMs for each symbolic level. For each of the 40 full-text corpora and per symbolic level, we compute the median value of the pairwise correlation between the estimated entropy rate distributions for each LM pair. We compute both unadjusted correlations, i.e., ρ_none, and adjusted partial correlations, i.e., ρ_geo, ρ_phylo and ρ_both (see Sect. 2.6.1 for details), across LM pairs. Per type of correlation coefficient and symbolic level, we compute N_ρ = 840 individual correlation coefficients, where each data point represents one LM pair.

3.1.2 Comparing languages

We proceed by comparing entropy rate distributions across corpora to determine whether a language that tends to be more complex in one corpus also tends to be more complex in another corpus. Given that the results from the previous section clearly indicate stability across different LMs, we will focus, for each corpus κ, on the estimates of its best model, i.e., h_best(κ) (see Sect. 2.5 for details). Interactive visualisations for each LM, i.e., h_LM(κ), are available at https://www.owid.de/plus/tradeoffvis/. As outlined above, we evaluate the similarity of h_best-distributions by computing ρ_none, ρ_geo, ρ_phylo and ρ_both across corpora and across symbolic levels. Per correlation type, we compute N_ρ = 7,254 individual correlations. Figure 4a demonstrates that entropy rate distributions are very similar across corpora and symbolic levels, as indicated by a strong positive correlation between corpus pairs. The results remain stable when we control for language- and document-specific characteristics, as well as the genealogical and geographic relatedness of languages (see Sect. 2.6.1 for details). Interactive LM-specific visualisations, available at https://www.owid.de/plus/tradeoffvis/, show that highly comparable patterns are obtained when estimating entropy rates for individual LMs.

Figure 4: Distribution of pairwise correlations across corpora and symbolic levels. For each corpus pair, we compute both unadjusted correlations, i.e., ρ_none, and adjusted partial correlations, i.e., ρ_geo, ρ_phylo and ρ_both (see Sect. 2.6.1 for details).
Correlations are computed both per symbolic level, where estimates are derived on the same symbolic level, and across symbolic levels, where, e.g., the entropy rate distribution calculated for words as information encoding units in one corpus is correlated with the distribution calculated at either the character or the BPE level in another corpus. (a) Similarity of h_best-distributions across corpora, including correlations between h_best for the 40 full-text corpora and H_Cr based on the Crúbadán word lists (N_ρ = 7,254). (b) Similarity of L-distributions across corpora (N_ρ = 3,160). (c) Similarity of K_best-distributions across corpora (N_ρ = 7,140). Interactive LM-specific visualisations are available at https://www.owid.de/plus/tradeoffvis/.

Entropy rates are estimated as the ratio of the number of bits needed to compress the test data, K, to the length of the test data, L (cf. eq. 5). To further understand the above results, we repeated the analyses for both variables that are part of this ratio. Figure 4b reveals that while the results are more pronounced for h, the distributions for L are also very comparable across corpora and symbolic levels (N_ρ = 3,160). However, Figure 4c shows that the results are much weaker for the distributions of K_best (N_ρ = 7,140). Since K is the product of h and L, this suggests a potential trade-off between h and L. To investigate this possibility, we compute ρ_none, ρ_geo, ρ_phylo and ρ_both between h_best and L across corpora on all three symbolic levels. Table 4 demonstrates that on all three symbolic levels and for all four correlation types, there is a pronounced negative statistical association between the entropy rate and length distributions. Again, our interactive visualisation tool shows highly comparable patterns for individual LMs. These results indicate the existence of a trade-off between both variables.

Table 4: Association between h_best and L across corpora. The first row lists the number of individual correlation coefficients, N_ρ; the remaining rows list the correlation per type (ρ_none, ρ_phylo, ρ_geo, ρ_both). Columns correspond to the symbolic level on which h_best is estimated (words, characters, BPE); listed quantities for ρ are median values per parameter combination. Median values for h_best(words) include H_Cr based on the Crúbadán word lists. Note that for h_best on the BPE level, distributions are correlated with L in characters.

 | h_best(words) | h_best(chars) | h_best(BPE)
N_ρ | 1,638 | 1,600 | 1,600
ρ_none | -0.69 | -0.69 | -0.69
ρ_phylo | -0.69 | -0.62 | -0.62
ρ_geo | -0.69 | -0.62 | -0.62
ρ_both | -0.69 | -0.60 | -0.61

To establish whether the trade-off between entropy and length holds across symbolic levels, we compute ρ_none, ρ_geo, ρ_phylo and ρ_both between h_best and L for all corpus pairs across all three symbolic levels. For each correlation type, N_ρ = 20,100 individual correlations were calculated to generate a correlation matrix, which was then subjected to principal component analysis. Figure 5 presents scatterplots of the first two factors that explain most of the variance in the matrix. For both unadjusted and adjusted partial correlations, more than a third of the variance is attributed to the trade-off between entropy and length in each case. Interactive LM-specific visualisations available at https://www.owid.de/plus/tradeoffvis/ illustrate that highly comparable patterns are observed when estimating entropy rates for different LMs.

Figure 5: Comparing entropy and length across corpora.
Figure 5: Comparing entropy and length across corpora. For each corpus pair and across symbolic levels, we compute both unadjusted correlations (ρnone, a) and adjusted partial correlations (ρphylo, b; ρgeo, c; ρboth, d); see Sect. 2.6.1 for details. For each correlation type, Nρ = 20,100 individual correlations are computed to generate a corresponding correlation matrix. Principal-component factoring reveals that for both unadjusted and adjusted correlations, more than 50% of the variance in the matrix can be attributed to two factors: one main factor representing the strong negative correlation between length and entropy measures (accounting for 34.44% to 40.96% of the variance), and one factor distinguishing symbol types (accounting for 16.47% to 18.88% of the variance). Each marker label represents a numeric ID for one of the 41 investigated corpora (see Appendix A.3). Interactive LM-specific visualisations are available at https://www.owid.de/plus/tradeoffvis/.
Given that, from an information-theoretic point of view, message length quantifies efficiency (the shorter the message, the higher the efficiency [110]), we arrive at our main empirical result: human languages trade off complexity against efficiency. More explicitly, a higher average amount of choice/uncertainty per produced/received symbol is compensated by a shorter average message length.
3.1.3 Summary
In the last two sub-sections, we demonstrated that (i) entropy rate distributions are highly consistent across different LMs, suggesting that the choice of LM might have minimal impact on cross-linguistic investigations of the kind presented here, and that (ii) there is a pronounced trade-off between entropy rate and document length across different corpora, which implies that languages balance efficiency and complexity. This finding highlights a potentially fundamental principle in linguistic structure, where higher uncertainty per symbol is offset by shorter message lengths. To bring these results together, we now compute fully adjusted partial correlations, ρboth, for all possible pairwise combinations of the 41 corpora, the three variables (hLM, HCr, L), the three symbolic levels (words, characters, BPE), and the seven investigated LMs (PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig). In total, Nρ = 423,660 individual correlation coefficients were computed. Figure 6 shows that the findings point in the same direction as previously observed, confirming the consistency of entropy rate distributions and their trade-off with document length. To emphasise this point, we extracted the first 80 factors from the factor analysis, which together account for ~90% of the variance in the correlation matrix. For each factor, we conducted separate linear regressions with the factor as the outcome and a binary indicator for the type of variable (1 = L vs. 0 = hLM or HCr) as the predictor. For each factor, we then extracted the amount of explained variance (R2) as a measure of model fit. Among all factors, R2 is highest for the first factor, with R2 = 76.14%. We then repeated the analyses using indicator variables for the investigated LMs as predictors. Again, R2 is highest for the first factor, but with a much smaller value of R2 = 6.44%. Figure 6 demonstrates that the first factor distinguishes between length and entropic variables but not between LMs. We further visually inspected all remaining factors, none of which separated the LMs, reinforcing the robustness of our findings across different models.
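The factoring-plus-regression step just described can be sketched as follows. The input here is purely synthetic and merely stands in for the 423,660-entry correlation matrix; the actual analysis uses the estimated ρboth values and the principal-component factoring implemented in the released Stata/Python code.

```python
# Sketch: principal-component factoring of a correlation matrix and
# per-factor R^2 for a binary "is a length variable" indicator.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 60 variables (half entropy-like, half length-like)
# observed over 200 "documents", built so that h- and L-type variables
# are negatively related, as in the paper.
latent = rng.normal(size=(200, 1))
h_vars = latent + 0.5 * rng.normal(size=(200, 30))
l_vars = -latent + 0.5 * rng.normal(size=(200, 30))
X = np.hstack([h_vars, l_vars])
corr = np.corrcoef(X, rowvar=False)                    # correlation matrix

# Principal-component factoring = eigendecomposition of the correlation matrix.
eigval, eigvec = np.linalg.eigh(corr)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]
loadings = eigvec * np.sqrt(np.clip(eigval, 0, None))  # factor loadings
explained = eigval / eigval.sum()

# Binary indicator: 1 = length-type variable, 0 = entropy-type variable.
is_length = np.array([0] * 30 + [1] * 30)

def r_squared(y, x):
    # R^2 of a simple linear regression of y on x (with intercept).
    X1 = np.column_stack([np.ones_like(x, dtype=float), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

for k in range(3):
    r2 = r_squared(loadings[:, k], is_length)
    print(f"factor {k + 1}: {explained[k]:.1%} of variance, "
          f"indicator R^2 = {r2:.1%}")
```

With data built this way, the first factor separates length-type from entropy-type variables, mirroring the pattern reported for Figure 6.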
These results underscore that the negative statistical association between entropy and length in human languages is consistent across various LMs and corpora, suggesting that the trade-off between complexity and efficiency may reflect a fundamental property of human language.
Figure 6: Evidence for a trade-off between complexity and efficiency across corpora and language models. We compute adjusted partial correlations, ρboth (see Sect. 2.6.1 for details), for each combination of the 41 corpora, the three variables (hLM, HCr, L), the three symbolic levels (words, characters, BPE), and the seven investigated LMs (PPM2, PPM6, PAQ, LSTM, TRFsmall, TRFmed, TRFbig), totalling Nρ = 423,660 individual correlations. The resulting correlation matrix is then analysed with principal-component factoring. The scatterplot demonstrates that (i) different LMs are very similar and (ii) the most important factor, accounting for roughly a third of the variance in the matrix, represents a trade-off between complexity and efficiency: languages that tend to have a higher entropic value tend to need fewer symbols to encode messages. Each marker label represents a numeric ID for one of the 41 investigated corpora (see Appendix A.3).
3.2 Multi-model multilevel inference
3.2.1 Random effects models
To evaluate whether the trade-off between complexity and efficiency is influenced by the social environment in which languages are used, we adopt the multi-model inference approach outlined in Sect. 2.6.2.
Figure 7: Estimated variable importance and FMA estimates by outcome and symbolic level ('word' = words, 'char' = characters, 'BPE' = byte pair encoding). For each parameter combination (outcome/level), R = 8,192 candidate MLEMs that include fixed and random effects were run. (a) Estimated variable importance (v̂x) per variable. Higher values indicate greater importance (cf. Sect. 2.6.2 for details); v̂x-values range from 0 (white) to 1 (blue). (b) FMA-estimated effect of speaker population size on each outcome per symbolic level. Vertical lines represent the FMA estimate, β̂x, of population size, here denoted as βlog_pop, on hbest, L or Kbest. Horizontal lines show corresponding 95%-CIs (cf. Sect. 2.6.2 for details). Lines are coloured in black if the 95%-CI crosses zero (vertical dashed grey line), whereas blue and pink indicate significant negative and positive effects. Interactive LM-specific visualisations are available at https://www.owid.de/plus/tradeoffvis/.
By including several fixed and random effects, we control for both (i) language- and document-specific characteristics and (ii) spatial and phylogenetic autocorrelation. For each outcome and symbolic level, we fit R = 8,192 candidate models. Figure 7a visualises the estimated relative importance, v̂x, for each variable. With the exception of macro-family for Kbest, all considered random effects have high relative importance for all three outcomes and across all three levels. Conversely, there is only weak evidence for the importance of most fixed effects, with the clear exception of speaker population size, which is maximally important, relative to the other variables, for both hbest and L, but not for Kbest. To investigate whether there is also evidence for a trade-off between entropy and length similar to the results presented in the previous section, Figure 7b plots the FMA-estimated effect, β̂x, of speaker population size on each outcome per symbolic level.
There is a positive and significant effect of population size on hbest across all three symbolic levels and a significant negative effect on L for both characters and BPE. For Kbest, there is no significant evidence of an effect of population size on any of the three symbolic levels. Table 5 lists the corresponding estimates for all investigated fixed effects, showing that the only consistent evidence of a noteworthy effect is for speaker population size on either entropy or length. These results substantiate the evidence of a trade-off between entropy and length and indicate that languages with more speakers tend to have higher entropic values, i.e., are more complex, but also tend to produce shorter messages, i.e., are more efficient.
Table 5: FMA-estimated fixed effects on each outcome per symbolic level. Each cell lists the FMA estimate β̂x, here denoted as β, with the corresponding p-value in parentheses (cf. Sect. 2.6.2 for details). 1st column: fixed effect. 2nd-4th column: results for hbest (word, char, BPE). 5th-6th column: results for L (word, char). 7th-9th column: results for Kbest (word, char, BPE). Cell content is highlighted in bold if p < .05.
Fixed effect | hbest word | hbest char | hbest BPE | L word | L char | Kbest word | Kbest char | Kbest BPE
Population size | 0.101 (p < .001) | 0.062 (p < .001) | 0.074 (p < .001) | -0.106 (p < .001) | -0.061 (p < .001) | -0.006 (p = 0.524) | -0.003 (p = 0.638) | 0.016 (p = 0.305)
EGIDS = 0 | -0.239 (p = 0.394) | -0.085 (p = 0.436) | -0.251 (p = 0.086) | 0.630 (p < .05) | 0.356 (p = 0.130) | 0.225 (p = 0.343) | 0.267 (p = 0.295) | 0.147 (p = 0.499)
EGIDS = 1 | -0.015 (p = 0.706) | 0.008 (p = 0.709) | 0.009 (p = 0.702) | 0.169 (p = 0.134) | 0.167 (p = 0.051) | 0.048 (p = 0.508) | 0.034 (p = 0.575) | 0.038 (p = 0.567)
EGIDS = 2 | -0.029 (p = 0.618) | -0.010 (p = 0.680) | -0.021 (p = 0.571) | 0.025 (p = 0.656) | 0.004 (p = 0.897) | -0.053 (p = 0.508) | -0.050 (p = 0.521) | -0.067 (p = 0.476)
EGIDS = 3 | -0.023 (p = 0.632) | -0.001 (p = 0.954) | -0.007 (p = 0.740) | 0.012 (p = 0.746) | -0.001 (p = 0.948) | -0.028 (p = 0.605) | -0.070 (p = 0.439) | -0.091 (p = 0.380)
N countries | -0.024 (p = 0.539) | -0.037 (p = 0.249) | -0.038 (p = 0.280) | -0.006 (p = 0.784) | 0.002 (p = 0.912) | -0.065 (p = 0.272) | -0.063 (p = 0.287) | -0.091 (p = 0.156)
parallel | 0.017 (p = 0.657) | -0.008 (p = 0.843) | -0.001 (p = 0.987) | -0.090 (p = 0.253) | -0.026 (p = 0.618) | -0.022 (p = 0.544) | -0.009 (p = 0.714) | -0.001 (p = 0.932)
3.2.2 Random effects and slope models
As argued in the introduction (Sect. 1), an approach that focuses exclusively on random effects ignores variation within language families and geographical units [47–49, 51, 111]. We thus proceed by including random slopes, i.e., we allow the effect of population size to vary across different groups, representing deviations from the overall mean linear effect of speaker population size. The methodological question in this context is which random effects should include random slopes. We do not include a random slope for language, since population size does not vary within languages. Due to geographic proximity and phylogenetic non-independence, it makes sense to include random slopes for macro-area, macro-family, and sub-family, as geographic and phylogenetic structures can be critical for understanding language diversity and evolution, which justifies including these random slopes to account for shared inheritance and environmental factors [47].
There are, however, no clear a priori grounds for deciding whether random slopes for writing script and corpus also make sense, since neither is necessarily tied to population size or to linguistic features in a way that would suggest substantial variability in the effect of population size across different scripts or corpora. Including random slopes for both without clear justification could lead to overfitting, adding unnecessary complexity to the model and, especially, reducing the power to detect true effects: as noted by [6], a too-complex MLEM that includes excessive random slopes may inflate Type I error rates and reduce the ability to identify significant predictors due to overfitting. This underscores the importance of balancing model complexity with the need to capture meaningful variability. We thus opted for a two-stage estimation process. We first include random slopes for macro-area, macro-family, and sub-family only in our multi-model multilevel approach (R = 17,920). As a second step, we then additionally include random slopes for writing script and corpus (R = 35,200). Figure 8 visualises the results. With respect to the relative importance of the fixed and the random effects, both Figure 8a and Figure 8b largely point in the same direction as the random-effects-only approach (see Figure 7a). A noteworthy exception in both cases is that speaker population size is not only maximally important in predicting both hbest and L, but also highly to maximally important in predicting Kbest. Regarding the relative importance of the variables for which we include random slopes in both scenarios, there is considerable agreement for the two slopes included for phylogenetic non-independence (macro-family and sub-family): neither seems to play an important role in predicting either L or Kbest.
Figure 8: Estimated variable importance and FMA estimates by outcome and symbolic level. (a and c) For each parameter combination, R = 17,920 candidate MLEMs that include fixed effects, random effects and random slopes for macro-area, macro-family and sub-family were run. (b and d) For each parameter combination, R = 35,200 candidate MLEMs that include fixed effects, random effects and random slopes for writing script, corpus, macro-area, macro-family and sub-family were run. (a and b) Estimated variable importance (v̂x) per variable. Higher values indicate greater importance (cf. Sect. 2.6.2 for details); v̂x-values range from 0 (white) to 1 (blue). (c and d) FMA-estimated effect of speaker population size on each outcome per symbolic level. Vertical lines represent the FMA estimate, β̂x, of population size, here denoted as βlog_pop, on hbest, L or Kbest. Horizontal lines show corresponding 95%-CIs (cf. Sect. 2.6.2 for details). Lines are coloured in black if the 95%-CI crosses zero (vertical dashed grey line), whereas blue and pink indicate significant negative and positive effects. Analogous LM-specific interactive visualisations and numeric results are available at https://www.owid.de/plus/tradeoffvis/.
For hbest, random interactions between population size and either macro-family or sub-family are very important. Interestingly, the variable included to account for geographic proximity as a random slope (macro-area) only plays an important role in predicting either L or Kbest in the first scenario.
It seems that, to a large extent, this influence might be absorbed by the inclusion of random slopes for writing script and corpus in the second scenario, especially for L on the character level and for Kbest on both the word and the character level. The inclusion of writing script as a random slope does not seem to be very important. However, including corpus seems to make a difference for all seven parameter combinations, except for hbest on the word level. Regarding the FMA-estimated effect (β̂x) of speaker population size, Figure 8c shows that in the first scenario, the results for each outcome and symbolic level are qualitatively identical to the random-effects-only approach (see Figure 7b): (i) a significant positive effect of population size on hbest, (ii) a significant negative effect on L, and (iii) no significant evidence for any effect on Kbest. This again suggests a trade-off between entropy and length. In the second scenario, while there remains stable evidence for an entropy-length trade-off for words as symbols, at the character level neither the positive effect on hbest nor the negative effect on L reaches significance at the 5% level (p = 0.064 for hbest and p = 0.086 for L). Unexpectedly, for Kbest, there is a significant positive effect of population size across all three levels. Future work could determine whether this indicates a true effect or whether this result arises from an overly complex model structure that is unable to detect true effects accurately. To foster such endeavours, we provide a dataset on our OSF repository (https://osf.io/93csg/) that contains all information needed to replicate our findings and conduct follow-up investigations. This dataset includes estimates for both (i) hbest and Kbest and (ii) LM-specific estimates, hLM and KLM, for the seven investigated LMs across all three symbolic levels and two different outcome transformations (standardised vs. logged, see Sect. 2.6.2 for details), totalling information for R = 3,520,000 individual models. Further note that, as mentioned above (see Sect. 2.6.2), we chose the AIC as the criterion to weight models. The AIC balances goodness-of-fit with model complexity and is computed as -2ℒ̂j + 2(kf,j + kr,j), where ℒ̂j denotes the maximised log-likelihood of model j and kf,j and kr,j are the numbers of estimated fixed-effects and random-effects (including random slopes) parameters, respectively. However, the AIC is not the only possible criterion. For example, we can choose a variant of the Bayesian Information Criterion (BIC) [112–114] that imposes a more substantial penalty on the inclusion of additional parameters than the AIC, computed as -2ℒ̂j + log(N)·(kf,j + kr,j). If we use this criterion to compute results for the second scenario, all obtained FMA-estimated effects of speaker population size, i.e., positive effects for h and negative effects for L, reach significance at p < 0.05 on all three symbolic levels. We invite interested readers to further explore this using our interactive visualisation tool available at https://www.owid.de/plus/tradeoffvis/, where we offer results for a total of four different information criteria.
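As a concrete illustration of the weighting machinery described in Sect. 2.6.2, the sketch below fits a handful of candidate multilevel models with statsmodels and then derives Akaike weights, the importance of population size, and an FMA-style averaged estimate. The data file, the column names (h, log_pop, sub_family) and the single grouping factor are hypothetical simplifications; the paper's candidate space additionally varies further fixed effects, crossed random effects, random slopes and information criteria.

```python
# Sketch of multi-model inference with Akaike weights (hypothetical data/columns).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("entropy_by_document.csv")   # hypothetical file: one row per document

# A small candidate set; the full analysis runs up to R = 35,200 candidates.
candidates = {
    "intercept_only": dict(formula="h ~ 1", re_formula="1"),
    "pop_fixed":      dict(formula="h ~ log_pop", re_formula="1"),
    "pop_slope":      dict(formula="h ~ log_pop", re_formula="~log_pop"),
}

results = {}
for name, spec in candidates.items():
    model = smf.mixedlm(spec["formula"], df,
                        groups=df["sub_family"],       # one grouping factor only
                        re_formula=spec["re_formula"])
    fit = model.fit(reml=False)                        # ML so AICs are comparable
    k = fit.params.shape[0]                            # rough parameter count
    aic = -2 * fit.llf + 2 * k
    beta = fit.params.get("log_pop", 0.0)              # 0 if the model excludes it
    results[name] = (aic, beta, "log_pop" in fit.params.index)

aics = np.array([v[0] for v in results.values()])
delta = aics - aics.min()
weights = np.exp(-delta / 2)
weights /= weights.sum()                               # Akaike weights

# Variable importance: summed weight of models containing log_pop;
# FMA estimate: weight-averaged coefficient (one common convention).
importance = sum(w for w, v in zip(weights, results.values()) if v[2])
beta_fma = sum(w * v[1] for w, v in zip(weights, results.values()))
print(f"importance(log_pop) = {importance:.3f}, FMA estimate = {beta_fma:.3f}")
```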
4 Discussion
As written above, Baroni argues that LMs should be seen as distinct algorithmic linguistic theories rather than "blank slates", as they inherently encode structural biases that shape their linguistic capabilities. Each LM thus represents a "general theory defining a space of possible grammars" [46]. Put differently, an LM can be seen as a model of an idealised language learner [10, 115, 116]. Hence, we can think of an LM that is trained on language-specific data "as a grammar, that is, a computation system that, given an input, can predict whether the sequence is acceptable to an idealized speaker of the language" [46]. In this paper, we investigated very different types of LMs, belonging to different classes: statistical, machine learning, and deep learning models. Each class exemplifies unique learning capabilities and limitations. PPM can be seen as an idealised learner focused on identifying and refining local word sequence patterns. By memorising partial matches with the last few symbols (e.g., 2 for PPM2, 6 for PPM6), PPM constructs a probabilistic grammar of language that adjusts predictions based on immediate context. This model is particularly effective at learning the structure of short-term dependencies and frequent n-grams within sentences based on very little input, making it adept at capturing local grammatical rules but limited in handling long-term dependencies. PAQ functions as an idealised learner that integrates insights from multiple models to form a coherent representation of grammar. Each model within PAQ contributes specialised 'knowledge', and a gated linear network moderates these contributions to balance and refine various perspectives. This process allows PAQ to learn from multiple strings simultaneously, adapting to a wide range of linguistic patterns and refining predictions through consensus. However, the complexity of integrating multiple models may limit its ability to focus on highly specific patterns or rare linguistic structures. LSTM networks represent an idealised learner adept at preserving and utilising temporal sequences. By using memory cells and gates, LSTMs maintain continuity and context over longer sequences, enabling them to learn long-term dependencies and sequential information. This capability allows LSTMs to represent the progression of language over time, capturing narrative coherence and integrating both long-term and short-term adjustments. However, LSTMs may struggle with complex hierarchical structures due to their sequential processing nature and consequently require extensive training time. Lastly, the Transformer model can be seen as an idealised learner that is proficient at mapping global context through its self-attention mechanism. This mechanism enables each symbol to dynamically relate to others, constructing a comprehensive web of interactions between symbols. Transformers are highly capable of learning complex dependencies and contextual relationships within entire sentences or documents. Their ability to process all tokens simultaneously allows them to represent language in a broad and interconnected manner, making them particularly adept at generating coherent and contextually relevant text. However, this ability also comes at a cost, as Transformer models need to be trained on huge amounts of data to achieve this level of performance (see, e.g., Figure 2a-c). Our first main result (Sect. 3.1.1) thus indicates that the choice of the LM has very little impact on the obtained results. Given the far-reaching architectural differences between the investigated LMs, we think this is a surprising result. For instance, a PPM2 model, by design, lacks the memory to store long-term dependencies, yet Figure 2d-f shows that the results are highly comparable across LMs. This trend holds even for our largest corpus, UNPC (Nw = 341,723,872 words, see Sect. 2.1.3), which contains information for six languages.
For example, the median value of ρboth between hPPM2 and hTRFmed, the former estimate being based on an LM with a context window of exactly two symbols and the latter based on a transformer model with ~19.1 million parameters (see Table 2), across symbolic levels (Nρ = 9) is ρboth = 0.90. Similarly, take HCr, which is based on an even simpler LM, i.e., a 1-gram LM that does not consider any relationships between words but only their frequency of occurrence. Yet the median of ρboth between HCr and hLM for the seven considered LMs across levels and corpora (Nρ = 840) is ρboth = 0.42. This consistency across LMs is an important observation for several reasons. Firstly, larger LMs are notably more expensive to train, requiring substantially more computational resources than smaller models such as PAQ, as outlined in Table 2. Additionally, larger models need a lot of training data to achieve optimal performance (see Figure 2). This makes smaller LMs particularly cost-effective and therefore attractive for cross-linguistic studies, especially when computational resources are scarce. Furthermore, as written above, the available electronic data for many languages, particularly those spoken by smaller populations, is very limited [34]. Our results indicate that training smaller LMs in such endeavours is a viable option. It is important to point out that our demonstration of consistency across LMs occurs under a specific scenario: we examined whether the information-theoretic complexity of languages relative to each other remains consistent regardless of the LM used. In other words, if language A is deemed more complex than language B when analysed with one particular LM, this relationship persists even when a completely different type of LM is employed. Future cross-linguistic research should explore whether this agreement across LMs extends to other types of analyses and questions, particularly those that are not strictly based on information-theoretic measures. As our second main result (Sect. 3.1.2), we re-evaluated the hypothesis that all languages are equally complex. To address this, we developed a statistical approach that integrates machine learning with spatial filtering methods. This methodology, detailed in Sect. 2.6.1, was designed to control for language- and document-specific characteristics, as well as the phylogenetic and geographic relationships between languages. We used this approach to compare entropy estimates across corpora and showed that for different LMs, for different types of symbols as information encoding units, and under control of potential sources of influence, a language with a high/low entropy rate in one corpus also tends to be more/less complex in another corpus. Extending the findings of our prior publication [35], these results provide information-theoretic evidence against the equi-complexity hypothesis [37]. This result inevitably leads to the question: given that higher complexity in language results in more demanding processing efforts, why should there be a trend towards increased complexity in certain languages? We provide a potential answer to this question as our third main result (Sect. 3.1.2): we showed that there is a trade-off between the distributions of estimated entropy rates and length across corpora and across LMs. Given that, from an information-theoretic perspective, message length quantifies efficiency [110, 117], we argue that this result suggests that higher complexity is compensated by higher efficiency.
In discussions about the obtained trade-off between L and the average amount of information per symbol, i.e., h, a recurring objection was that this result seems trivial. As some colleagues pointed out, if the same message is encoded in two languages, each symbol in the language with the shorter message length must transmit more information, almost by definition. In a similar vein, a recent publication presented a large-scale quantitative information-theoretic analysis of parallel corpus data in almost 1,000 languages to show that there are apparently strong associations between the way languages encode information into words and patterns of communication, e.g. the configuration of semantic information [118]. This publication was criticised by [119], who demonstrated that the results presented by [118] are systematically biased by varying text lengths, a very well-known issue in quantitative linguistics, as most, if not all, quantities in the context of word frequency distributions vary systematically with text length [120–122]. The authors of [118] responded that what they call "information density" and text length are "two sides of the same coin" and that the Gibbs-Shannon entropy and text length, conditional on the same content, measure the same underlying construct [123]: "because the information content of the parallel translations is the same across comparison languages, we can infer that the more words present [sic] within a document covering the same material, the less information is encoded in each." Another recent response to [118] made a similar argument: "if it takes a language more words to convey the 'same' message, then each word conveys less information." [124]. Both arguments are incorrect. First, let us note that a trade-off between h and L does not only occur for parallel corpora; it is also observed (i) for comparable corpora (the median adjusted correlation, ρboth, between hbest and L for the seven comparable corpora in our database (see Sect. 2.1 and Appendix A.3) amounts to ρboth = -0.47 for words, ρboth = -0.59 for characters and ρboth = -0.59 for BPE, Nρ = 49), and (ii) across corpora; in other words, if we know the entropy distributions in one multilingual text collection (parallel or not), we can predict the length in another corpus. For example, the adjusted pairwise correlation (N = 73) between hbest on the BPE level for the UDHR (parallel) corpus and L on the character level for the LCC news (comparable) corpus is ρboth = -0.79. Secondly, eq. 1 clearly demonstrates that the Gibbs-Shannon unigram entropy is not simply equivalent to text length, but rather a diversity index [43] that measures the amount of "freedom in the combination of symbols" [125], as it is a function of the number of different symbols and of how evenly those symbols are distributed. Thirdly, the main problem with such arguments is that the information-theoretic concept of information is fully agnostic about the content of messages [110]. As Shannon, the founding father of information theory, puts it [117]: "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages."
In fact, the use of "information" as a label (e.g., "amount of information" or "information content") has been a "continuous source of misunderstanding" since the very inception of information theory, as pointed out by Bar-Hillel [126], because "it is psychologically almost impossible not to make the shift from the one sense of information, […], i.e., information = signal sequence, to the other sense, information = what is expressed by the signal sequence". In information theory, messages are treated only as signal sequences, not as "content-bearing entities" [126]. Bar-Hillel shows that only under very special circumstances, which do not typically apply to human language, can we infer the amount of information conveyed merely from the length of a message. There is no logical connection between the concept of semantic information (i.e., what is "expressed" by a transmitted signal sequence) and the rarity or improbability of the signal, which is measured by h. Put differently, information in the information-theoretic sense has nothing to do with meaning; it only measures the amount of uncertainty that is removed when receiving a message, or the average amount learned about upcoming symbols when observing a symbol of the message. To illustrate this, consider the sentence "Rain occurs most frequently in spring." Using LCC word frequency information (cf. Sect. 2.1.1.2), the predictive uncertainty or self-information, I, of the sentence, calculated based on a unigram LM (see eq. 1), is Iword(1-gram) = 12.531 bits per symbol (bps). Now compare this with the sentence "Liquid precipitation occurs most frequently in spring." If [123, 124] were correct, we should expect this sentence to have a lower per-symbol information content, as it contains the same message but uses 7 instead of 6 words. However, the self-information for this sentence is Iword(1-gram) = 13.646 bps and therefore higher. This is due to the fact that "rain" is much more frequent in our example data (fLCC = 1,594) than "liquid" (fLCC = 324) or "precipitation" (fLCC = 96). Since a unigram LM does not consider any contextual information, it makes sense to additionally use a large LM to compute self-information. Using a GPT-2 model with ~1.5 billion parameters, the self-information of the first sentence amounts to Iword(GPT-2) = 0.952 bps. Again, the self-information of the second sentence is higher, with Iword(GPT-2) = 0.965 bps. As a further illustration, take the following two sentences: "I watched TV." and "I watched television." From the point of view of propositional content, both sentences contain the same message. Following [123, 124], we should thus expect the self-information to be the same in both cases. However, both Iword(1-gram) and Iword(GPT-2) are lower for the first sentence (Iword(1-gram) = 11.915, Iword(GPT-2) = 3.068) compared to the second one (Iword(1-gram) = 12.397, Iword(GPT-2) = 3.210), again because "television" is much less frequent (fLCC = 1,568) than "TV" (fLCC = 4,269). Finally, consider the sentence pair "They will attack at dawn." versus "They will attack at 5am." From the point of view of propositional content, the "total amount of information" of the second sentence should be higher because it specifies a precise time. Based on a unigram LM, the total amount of information is Itotal(1-gram) = 53.389 bits for the first sentence and Itotal(1-gram) = 55.884 bits for the second one. However, based on the arguably much better GPT-2 LM, we obtain the opposite result, with Itotal(GPT-2) = 7.000 bits for the first sentence and Itotal(GPT-2) = 6.620 bits for the second one.
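Per-sentence self-information of this kind can be computed along the following lines. The sketch uses the small publicly released GPT-2 model rather than the ~1.5-billion-parameter variant referred to above, and an invented toy frequency table instead of the LCC counts, so the exact numbers will differ from those reported here.

```python
# Sketch: unigram and neural self-information of a sentence, in bits.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Toy unigram model: invented counts standing in for the LCC frequency lists.
counts = {"rain": 1594, "liquid": 324, "precipitation": 96,
          "occurs": 800, "most": 9000, "frequently": 700,
          "in": 50000, "spring": 1200}
total = sum(counts.values())

def unigram_bits(sentence):
    # Total self-information -sum(log2 p(w)); words missing from the toy
    # table are skipped here (eq. 1 is based on full frequency lists).
    words = [w.strip(".").lower() for w in sentence.split()]
    return sum(-math.log2(counts[w] / total) for w in words if w in counts)

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm.eval()

def gpt2_bits(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = lm(ids, labels=ids).loss          # mean NLL in nats per predicted token
    n_pred = ids.size(1) - 1                    # the first token has no prediction
    return nll.item() * n_pred / math.log(2)    # total bits for the sentence

for s in ["Rain occurs most frequently in spring.",
          "Liquid precipitation occurs most frequently in spring."]:
    print(s, f"unigram: {unigram_bits(s):.2f} bits,",
          f"GPT-2: {gpt2_bits(s):.2f} bits")
```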
These illustrations demonstrate that both H and h reflect the statistical structure of "linguistic sequences independently of the specific information that [is] being encoded" [127]. As such, it is important to point out that information theory cannot be used to measure something like the total amount of semantic information or propositional content of a sentence. Instead, information theory provides a framework for quantifying the efficiency of symbol sequences in transmitting data, focusing on the probabilistic structure and redundancy of the language rather than its semantic content. To elaborate further, it should be noted that translation from one language into another is fundamentally different from what is meant by encoding in the information-theoretic sense. In early works on machine translation, such as Warren Weaver's famous memorandum [128], translation was sometimes understood to be similar to, e.g., cryptographic decoding. But encoding schemes are basically just biunique mappings of an information source to symbol sequences that permit exact recovery of the original symbols, whereas there is no information-theoretically well-defined sense in which one (of possibly very many) translations of a text into another human language conveys the "same content". The difference is similar to that between sign language and fingerspelling. A written English text can be translated into a sign language, such as American Sign Language (ASL). Since ASL is a full-blown natural language of its own (with no relation whatsoever to spoken English), there are usually many possible translations, none of which allows unambiguous reconstruction of the source text. On the other hand, fingerspelling (which is a part of ASL) enables users to mechanically render e.g. English words through sequences of signs representing letters and can thus, in principle, be back-translated losslessly to the oral language original. The biuniqueness of encoding schemes is the reason why text length and entropy rate are indeed trivially inversely related to each other when comparing different encodings of a source text, since different encodings of the same source can always be compressed to the exact same outcome, and the length of that outcome is then used to estimate the entropy rate. For translations between human languages, no such argument is available. Indeed, as we demonstrate in what follows, shorter translations may in principle come with a lower entropy rate instead of a higher one. Before further interpreting the complexity-efficiency trade-off, we discuss the results, presented in the Appendix, of several additional analyses we conducted to test the reliability and validity of this trade-off. First, practising what we preached above, we rule out the possibility that the association between h and L is simply the result of a well-known systematic text-length bias (Appendix A.4.1). Secondly, we demonstrate that there is clear evidence for an entropy-length trade-off between languages, but not within languages (Appendix A.4.2). Thirdly, we show in Appendix A.4.3, both theoretically and empirically, that the trade-off is indeed not trivial, as one can define processes that increase both entropy and length at the same time.
With these results in mind, we conclude this paper by discussing potential reasons for a trade-off between complexity and message length. This discussion is based on our fourth main result: using the multi-model multilevel approach detailed in Sect. 2.6.2, we presented findings in Sect. 3.2 indicating that the trade-off is influenced by the social environment in which languages are learned and used. Specifically, languages with more speakers tend to be more complex. At first glance, this result contrasts with previous research suggesting that languages spoken in larger communities tend to be less complex [129–133], as larger communities are assumed to favour simple and predictable language structures. At the same time, our results indicate that languages with more speakers tend to produce shorter messages, i.e., are more efficient. Let us speculate that in large societies, institutionalised education potentially makes greater linguistic complexity possible by providing systematic and formalised language learning, which, in turn, can support the acquisition and use of more complex linguistic structures. In line with this, a recent large-scale study found a positive statistical correlation between grammatical complexity and speaker population size [51]. At the same time, the importance of written communication in larger societies might create a natural pressure towards shorter messages, as it saves costs for producing, storing, and transmitting written texts (e.g., book paper, storage space, bandwidth). This dual influence, with educational systems enabling complexity and written communication favouring efficiency, could help explain why languages in larger communities might evolve to balance these pressures, resulting in shorter but more complex messages. Testing this hypothesis is an important avenue for future research.
Appendix
A.1 Language modelling details
The first five LMs (PPM2, PPM6, PAQ, LSTM, TRFsmall, see Table 2) were trained on a Linux server (CentOS 7.9.2009). The remaining two LMs (TRFmed and TRFbig) were trained on the 'bwForCluster Helix' (for further details, see https://wiki.bwhpc.de/e/Helix, accessed 07/03/24). This cluster features various node configurations; for our analysis, we utilised the 'gpu 4' node type, which comprises two AMD Milan EPYC 7513 processors (2.6 GHz / 64 CPUs) and 256 GB of RAM. Specifically, 29 nodes are equipped with four Nvidia A40 GPUs (48 GB memory each), while 26 nodes have four Nvidia A100 GPUs (40 GB memory each). A custom script was developed to automate the process, generating a SLURM job for each file and assigning each file to a single GPU (a simplified sketch is given below). Preliminary experiments indicated that results varied slightly depending on the GPU used, suggesting potential benefits of restricting training to a specific GPU type. Further experiments demonstrated that, compared to the A40 GPUs, the A100 GPUs only reduced computation time for the larger models (as shown in Table 2: 21.2 seconds for TRFmed and 58.9 seconds for TRFbig on A100, versus 19.81 seconds for TRFmed and 96.31 seconds for TRFbig on A40). Since the slight hardware-dependent variation in compression lengths only introduces random noise, we chose not to restrict GPU types and allowed SLURM to choose the node and GPU to optimise overall computation time. Jobs on the 'bwForCluster Helix' have a maximum runtime of five days, after which they are terminated by the SLURM scheduler.
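A minimal sketch of such an automation script is shown below. The partition name, resource limits, file layout and the training command are hypothetical placeholders rather than the settings actually used on the cluster.

```python
# Sketch: generate and submit one SLURM job per input file (hypothetical paths/settings).
import subprocess
from pathlib import Path

TEMPLATE = """#!/bin/bash
#SBATCH --job-name=lm_{stem}
#SBATCH --partition=gpu            # hypothetical partition name
#SBATCH --gres=gpu:1               # one GPU per document
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=5-00:00:00          # cluster-wide five-day limit
python train_lm.py --input {path}  # hypothetical training script
"""

def submit_all(corpus_dir="corpus", job_dir="jobs"):
    Path(job_dir).mkdir(exist_ok=True)
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        script = Path(job_dir) / f"{path.stem}.slurm"
        script.write_text(TEMPLATE.format(stem=path.stem, path=path))
        # sbatch queues the job; SLURM then picks the node and GPU.
        subprocess.run(["sbatch", str(script)], check=True)

if __name__ == "__main__":
    submit_all()
```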
Errors or timeouts can cause job aborts; our script handles such cases by resuming from the last checkpoint. Despite this, we were unable to obtain unbiased estimates of KTRFbig(D), i.e., the number of bits needed by TRFbig to compress entire documents, for three documents from the UNPC corpus (French, Spanish, and Russian) at both the word and the character level. As a workaround for the entropy rate and length distribution comparisons across LMs and corpora (see Sects. 2.6.1 and 3.1), we replaced hLM(D) for these three files at both symbolic levels with [K(D)/L(D) + K(Dtrain)/Ltrain(D)]/2, where Dtrain and Ltrain(D) denote the training data and its length, i.e., the first half of D. We did not use these documents in any further analysis, including the computation of hbest(D) (see eq. 6) and the LM-specific multi-model multilevel inferences (see Sects. 2.6.2 and 3.2).
A.2 Scaling v̂x
To interpret the relative importance of a variable x, it is beneficial to scale v̂x to an interval reflecting its relative significance compared to a baseline theoretical minimum. The theoretical minimum v̂x0, computed for a variable x0 that does not improve the model fit at all, represents the least possible importance a variable can have while still being included in the models. Note that including an additional variable x0 effectively doubles the model space to 2R, as each model can either include or exclude x0. By definition, x0 occurs in half, or R, of all models, and we assume that AICmin does not include x0. If the inclusion of x0 does not improve the model fit, it only affects the AIC through the penalty for the additional parameter, increasing kj by 1 and thus increasing AICj by 2 units for each model that includes x0. Given that Σ_{j=1}^{R} wj = 1, where wj = exp(-Δj/2)/W are the Akaike weights computed from the original R models and W = Σ_{j=1}^{R} exp(-Δj/2), the model space expanded to R' = 2R has altered weights due to the inclusion of x0. The new sum of (unnormalised) weights can be written as:
W' = Σ_{j=1}^{R'} exp(-Δj/2) = Σ_{j=1}^{R} exp(-(Δj+2)/2) + Σ_{j=1}^{R} exp(-Δj/2) = e^(-1)·Σ_{j=1}^{R} exp(-Δj/2) + W = e^(-1)·W + W = (e^(-1) + 1)·W. (17)
Since this also demonstrates that the models that include x0 contribute e^(-1)·W to the sum, we can compute the theoretical minimum v̂x0 as:
v̂x0 = e^(-1)·W / ((e^(-1) + 1)·W) = e^(-1)/(e^(-1) + 1) = 1/(1 + e). (18)
Thus v̂x0 ≈ 0.2689. Note that this minimum only applies to the fixed and the random effects, but not to the random slopes, since random slopes only occur in models that also contain the corresponding random effect/intercept and a fixed effect for population size. v̂x0 for random slopes will thus be lower than 1/(1 + e).
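This theoretical minimum can be checked numerically. The following sketch builds an arbitrary set of AIC values, duplicates each model with a +2 penalty for an uninformative extra variable x0, and recovers the summed Akaike weight 1/(1 + e) ≈ 0.2689 from eqs. 17-18.

```python
# Numeric check of the theoretical minimum importance 1/(1+e) from eqs. 17-18.
import math
import numpy as np

rng = np.random.default_rng(1)
R = 8192
aic_without = rng.normal(loc=1000, scale=10, size=R)  # arbitrary AICs, models without x0
aic_with = aic_without + 2                            # x0 adds one parameter, no fit gain

aic_all = np.concatenate([aic_without, aic_with])
delta = aic_all - aic_all.min()
weights = np.exp(-delta / 2)
weights /= weights.sum()                              # normalised Akaike weights

importance_x0 = weights[R:].sum()                     # summed weight of models containing x0
print(importance_x0, 1 / (1 + math.e))                # both ~0.2689
```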
A.3 Numeric IDs
These IDs can be used to map the corpora (1st column) to the corresponding IDs (2nd column) in Figure 5, Figure 6 and Appendix Figure 1. The third column indicates the type of the respective corpus.
Corpus         ID   Corpus type
BibleNT         1   parallel
BibleOT         2   parallel
WatchtowerV1    3   parallel
WatchtowerV2    4   parallel
Quran           5   parallel
GlobalVoices    6   comparable
LCCnews         7   comparable
LCCweb          8   comparable
LCCwiki         9   comparable
UDHR           10   parallel
EUconst        11   parallel
Europarl       12   parallel
EUmed          13   parallel
UNPC           14   parallel
MSub01         15   parallel
MSub02         16   parallel
MSub03         17   parallel
MSub04         18   parallel
MSub05         19   parallel
MSub06         20   parallel
MSub07         21   parallel
MSub08         22   parallel
MSub09         23   parallel
MSub10         24   parallel
MSub11         25   parallel
MSub12         26   parallel
MSub13         27   parallel
TEDt01         28   parallel
TEDt02         29   parallel
TEDt03         30   parallel
TEDt04         31   parallel
TEDt05         32   parallel
TEDt06         33   parallel
TEDt07         34   parallel
TEDt08         35   parallel
TEDt09         36   parallel
Ubuntu         37   comparable
GoogleT        38   parallel
TatoebaV1      39   comparable
TatoebaV2      40   comparable
Crúbadán       41   word list
A.4 Evaluating the validity of the entropy-length trade-off
A.4.1 Testing for a potential systematic length bias
In Sect. 3.1.2, we showed that, across corpora and LMs, languages that tend to have a higher entropy rate h tend to require fewer symbols to encode messages, i.e., have a lower L. Given the almost ubiquitous influence of text length in both corpus and quantitative linguistics, we now test whether the obtained pattern between h and L holds when we explicitly adjust our estimates for a possible text length bias. To this end, we used additional data that we made available as part of our prior publication [35], in which we prepared a truncated version by first computing ℓcorpus, the minimum document length in symbols per corpus and per type (characters or words). We then extrapolated entropy rates [35] based on the first ℓcorpus symbols of each available text belonging to the corresponding corpus. This procedure ensures that, per corpus and per type, h is computed for text samples of identical size. We then compute cross-correlations between h and L across corpora and types as described in Sect. 2.6.1. Appendix Figure 1 shows that the results for the truncated version are qualitatively identical to the results presented in Sect. 3.1.2.
Appendix Figure 1: Comparing entropy and length across corpora. To rule out that the above results are mainly driven by a potential text length bias, entropy rates taken from [35] are computed for text samples of identical size per corpus and per type (words, characters). For each correlation type, Nρ = 12,880 individual correlations are computed. Refer to the caption of Figure 5 for further details.
A.4.2 Associations between and within languages
To evaluate whether the entropy-length trade-off occurs between languages but not within languages, we systematically adjust our estimates for the influence of text length, as this bias is especially relevant within languages [120–122]. Since there is a direct correspondence between entropy rates and text length at the word and character levels, but not for BPE, we focus on words and characters as information encoding units. Additionally, we want to ensure that the observed patterns did not arise merely from the way we statistically inferred entropy rates using LMs. Therefore, we here use the non-parametric entropy rate estimator by [134], which has already been used in several studies [16, 17, 25, 26, 135] and is computed as
hNP(D) = [ (1/L) Σ_{l=2}^{L} M_l / log2(l) ]^(-1) (20)
The key quantity is the match-length M_l. It measures the length (in symbols) of the shortest substring starting at position l that is not also a substring of the part of the corresponding document D before this position, and it can be used to estimate h since it was shown that M_l grows like log2(l)/h [134, 136, 137].
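A naive, quadratic-time implementation of the estimator in eq. 20 can look as follows; the open-source Java tool referenced in the next paragraph is far more efficient, and this sketch is only meant to make the definition of the match-length concrete.

```python
# Sketch: non-parametric entropy rate estimate via match-lengths (eq. 20).
import math

def match_length(seq, pos):
    # Length of the shortest substring starting at (1-indexed) position `pos`
    # that does NOT occur in seq[:pos-1]; boundary handling is kept simple.
    prev = seq[:pos - 1]
    k = 1
    while pos - 1 + k <= len(seq) and seq[pos - 1:pos - 1 + k] in prev:
        k += 1
    return k

def h_np(seq):
    # seq: a string of symbols (e.g. characters); returns bits per symbol.
    L = len(seq)
    total = sum(match_length(seq, l) / math.log2(l) for l in range(2, L + 1))
    return 1.0 / (total / L)

print(h_np("abababababababcabababababababc" * 10))               # low h: highly repetitive
print(h_np("the quick brown fox jumps over the lazy dog. " * 10))  # higher h
```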
More details and an open-source Java program to efficiently obtain match-lengths in texts can be found in [26]. As a first analysis, we use the PBC (cf. Sect. 2.1.1.1) and split each translation into the 66 different books of the Biblical canon. We only kept translations with available information for all 66 books. We then kept all 29 books with a median length of at least 10,000 words. For languages with more than one available Bible translation, we randomly sampled one translation. In total, we have available translations for 144 different languages. We then compute two sets of correlations: (i) Pearson correlations, ρwithin, between entropy rates, hNP, and L within languages: for each of the 144 languages, we have 29 Bible books, and per language we compute the Pearson correlation between hNP and L; (ii) Pearson correlations adjusted for geographical proximity, ρgeo, between hNP and L between languages, for 29 Bible books, each with parallel translations into 144 languages. To account for a potential text length bias, we first compute the minimum text length in symbols (words/characters) per language across the 29 Bible books and call that minimum ℓlang. Similarly, we compute the minimum text length in symbols per Bible book across the 144 languages and call that minimum ℓbook. We then truncated each book at the respective minima and used the truncated books to calculate the corresponding entropy rates, hNP(ℓlang) and hNP(ℓbook). We repeated (i) and (ii) with these truncated entropy rates as input. Appendix Figure 2a shows that the trade-off only holds between, but not within, languages: within languages (144 languages, each with 29 Bible books from the same Bible translation), there is only weak evidence for a trade-off if we do not control for text length. As soon as we add that control, any evidence for a trade-off disappears. Between languages (29 Bible books, each with parallel translations into 144 languages), there is clear evidence for a trade-off, regardless of whether we control for text length.
Appendix Figure 2: Evaluating the entropy-length trade-off between and within languages. Per symbolic level (characters, words), we compute (i) ρwithin between hNP and L within languages and (ii) ρgeo between hNP and L between languages. To account for a potential text length bias, we compute entropy rates for text samples of identical size (hNP(ℓbook), hNP(ℓlang)) and use those as input to calculate ρwithin and ρgeo. Blue colour: distribution of ρs with hNP as input. Red colour: distribution of ρs with hNP(ℓbook) or hNP(ℓlang) as input. (a) Distribution of ρs for 29 books of the Biblical canon in 144 different languages, between and within languages. (b and c) Distribution of ρs for 43 different text samples in 43 languages, between and within languages. (b) Parallel text samples. (c) Non-parallel text samples.
We continue by demonstrating that an entropy-length trade-off between languages also occurs when the content of the message is not fully controlled/parallel. For this, we used the Quran corpus consisting of parallel translations of 6,233 sentences/verses into 43 different languages (see Sect. 2.1.1.1).
We randomly distributed the sentences into i = 1, 2, …, 43 different samples s1, s2, …, s43 of approximately equal size, where, across languages, each sample consists of the same sentences arranged in the same order. Thus, each si in each language, ceteris paribus, contains a different message with the statistical characteristics of the source text, i.e., the corresponding Quran translation. We then proceed as above to compute ρwithin. Appendix Figure 2b shows that within languages (43 languages, each with 43 different text samples), there is no noteworthy negative correlation between entropy and length on either symbolic level, irrespective of whether correlations are unadjusted or adjusted for text length. Hence, within languages, i.e., across comparable samples in a given language, there is no evidence that higher entropy is compensated by shorter length. We then randomly re-arranged the data into a between-languages format by assigning one randomly chosen si from each language to one of 43 different text collections, so that each text collection contains 43 different samples, one from each language. Appendix Figure 2b demonstrates that here the results again point towards a trade-off between h and L between languages (parallel documents in different languages) on both levels and for both types of text length adjustments. We round off this section by eliminating the parallelism across languages. Instead of preparing 43 different language-specific samples that consist of the same set of sentences in the same order across languages, we randomly distribute the sentences across languages without maintaining any parallelism. This approach tests whether the entropy-length trade-off persists in a completely randomised context. Appendix Figure 2c shows that, after this random distribution, the results are fully comparable once again: there is no evidence for a trade-off between entropy and length within languages. However, a trade-off is observed between languages on both symbolic levels and for both types of text length adjustments. These findings suggest that the entropy-length trade-off is a robust phenomenon primarily observed between languages rather than within languages.
A.4.3 The entropy-length trade-off is not trivial
Appendix Figure 3: For each value of q between 0.5 and 1.0 (with steps of 0.01), we generated encodings into the target language Y of a source message with 1M symbols and p = 0.75 emitted by the source X. The blue circles depict the entropy rates hbest(Y) for each value of q. The orange horizontal line shows the entropy rate estimated for the original source message, i.e., hbest(X). As theoretically expected, as long as q < p = 0.75 (mint vertical line), hbest(Y) > hbest(X).
To demonstrate that the translation/encoding of a source message in one "language" into a different "language" can lead to both higher per-symbol entropy h and longer message length L, we consider a discrete source X that emits a sequence of N symbols with source alphabet S = {♣, ♦}. For simplicity, we assume that the source is memoryless, i.e., symbols are statistically independent of each other, but it is not difficult to extend this to sources with memory (e.g. Markov sources). We assume that ♣ is emitted with probability p, ♦ is emitted with probability (1 - p), and ♣ is more frequent than ♦, i.e., p > 0.5. h can then be written as h(X) = -[p·log2(p) + (1 - p)·log2(1 - p)].
To encode messages of X into a target language Y that is both longer than X and has a higher h, after each nth symbol of the source message, a symbol from the source alphabet is added to the original sequence; with probability q the symbol will be ♣ and with probability (1 - q) it will be ♦. It is easy to see that the original source message can be perfectly reconstructed from its translation just by knowledge of n. The message encoded into Y will be longer than the source message by ⌊N/n⌋ symbols, and ♣ is emitted with probability p' = (p·N + q·⌊N/n⌋)/(N + ⌊N/n⌋). Therefore, encoded messages in Y will have the same h as X iff q = p, and any encoding into Y will have a higher h when p' < p. As an illustration, we generated a source message with N = 1M symbols and p = 0.75. The source message is encoded into Y for n = 10, i.e., after every 10th symbol, either ♣ or ♦ is added, and the encoded message will be 100,000 symbols longer than the source message. The theoretical per-symbol entropy of the source is h = 0.8113 and p' = (7.5 + q)/11. Thus, in the range q ∈ [0, 0.75), the entropy of the encoded message will be higher. This also holds empirically: the observed extrapolated entropy rate for the original sequence is h ≈ 0.84. We generated 51 encodings into Y with q ranging from 0.5 to 1.0 (with steps of 0.01) and estimated hbest for the resulting sequences. Appendix Figure 3 shows that as long as q < 0.75, hbest(Y) > hbest(X). This line of reasoning can also be extended to actual natural language data. To this end, we used the EUconst corpus consisting of different translations of the European Constitution (cf. Sect. 2.1.1.3) and introduced an artificial 'rule': whenever the language-specific equivalent of 'article' (e.g. 'cikk' in the Hungarian translation or 'άρθρο' in the Greek one, see the script 'prepare_artificial_EUconst.do' available at https://osf.io/7mpxr/ for details) occurs in the text, one of two 'qualifiers' needs to be added after one of the next three following words (randomly chosen): the 'qualifier' has to be '♦' if the cumulative length in Unicode characters of the last three words is lower than some language-specific threshold (e.g. 8 in the Hungarian translation or 14 in the Greek one, see the script 'prepare_artificial_EUconst.do' for details). If not, the qualifier has to be '♣'. It is worth pointing out that, as above, the original text can be fully reconstructed from the 'encoded' version since neither '♦' nor '♣' occurs in the original text. We then used the usual workflow to estimate hbest for the 'synthetic' text versions on all three levels. For the 'original' version, there is a strong negative correlation between entropy and length on all three levels, with ρgeo = -0.86 for words, ρgeo = -0.78 for characters and ρgeo = -0.76 for BPE. The same is true for the 'synthetic' version, with ρgeo = -0.87 for words, ρgeo = -0.79 for characters and ρgeo = -0.77 for BPE. By definition, the artificial rules increase the text length L. If the trade-off between entropy and length were indeed trivial, i.e., if a longer text always had lower per-symbol entropy rates, we should expect hbest for the 'synthetic' versions to be lower than their counterparts estimated for the 'original' texts, i.e., hbest(original) - hbest(synthetic) > 0. Appendix Figure 4 shows that this is not true: in all but one case, hbest is higher in the 'synthetic' version than in the 'original' one.
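The two-symbol construction described at the beginning of this subsection can be reproduced with a few lines of code. Because the source is memoryless, the sketch below uses the plug-in unigram entropy of the generated sequences instead of the full LM-based estimation pipeline used for Appendix Figure 3, so the exact values differ slightly from the extrapolated estimates reported above.

```python
# Sketch: an 'encoding' that makes a message both longer and higher in entropy.
import math
import random

random.seed(0)

def binary_entropy(p):
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def plug_in_entropy(seq):
    p = seq.count("♣") / len(seq)
    return binary_entropy(p)

N, p, n = 1_000_000, 0.75, 10
source = ["♣" if random.random() < p else "♦" for _ in range(N)]

def encode(source, q, n):
    # After every nth source symbol, insert ♣ with probability q, else ♦.
    out = []
    for i, sym in enumerate(source, start=1):
        out.append(sym)
        if i % n == 0:
            out.append("♣" if random.random() < q else "♦")
    return out

h_source = plug_in_entropy(source)
for q in (0.5, 0.6, 0.74, 0.75, 0.9):
    enc = encode(source, q, n)
    print(f"q = {q:.2f}: len {len(enc):,}, h = {plug_in_entropy(enc):.4f} "
          f"(source: len {N:,}, h = {h_source:.4f})")
# For q < 0.75 the encoded message is longer AND has a higher per-symbol entropy.
```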
Appendix Figure 4: Absolute difference in entropy. Differences are computed between the 'original' and the 'synthetic' version for each translation of the European Constitution (represented by their ISO 639-3 codes on the x-axis) for each encoding unit (words, a; characters, b; BPE, c).
To further investigate this observation, we focus on the English version of the European Constitution and introduce the following six rules (a code sketch of Rules 2 and 6 is given at the end of this appendix):
- Rule 1: "Do nothing", i.e., leave the text as is
- Rule 2: "Contract 'of the'", i.e., all occurrences of "of the" are replaced by the new word type "ofthe"
- Rule 3: "Qualify 'the' by either '♦' or '♣'", i.e., whenever the definite article "the" occurs in the text, one of two 'qualifiers' needs to be added after one of the next three following words (randomly chosen): the 'qualifier' has to be '♦' if the word directly preceding "the" is one of the following words: to, in, and, by, on, for, at, or, as, if, all, is, be, has. In all other cases, the 'qualifier' will be '♣'
- Rule 4: "Replace 'shall be'", i.e., whenever "shall be" occurs after a word with only one character, "shall be" is replaced by "beshall"
- Rule 5: "Double 'this'", i.e., if the first word after "this" does not end with a vowel, "this" is repeated after that word
- Rule 6: "Split 'ing'", i.e., in every second occurrence of a word ending in 'ing', a space is inserted before this 'ing' (e.g. 'fishing' → 'fish ing')
Appendix Figure 5: Entropy rates against length for the six synthetic versions of the European Constitution in English (words, a; characters, b; BPE, c). Note that on the BPE level, hbest is plotted against L in characters.
We then generated six versions of the source text: for the first version v1, the first rule is applied; for the second version v2, the first two rules are applied in the order specified above; and so on. We then compute hbest for each version. Note that if the rules are known, the original text can again be losslessly recovered from each of the synthetic versions. Appendix Figure 5 shows that there is no evidence for a negative relationship between entropy rates and text length across the symbolic levels. It is interesting to see that (i) Rule 2 is efficient but adds complexity: compared to v1, v2 has a shorter text length but higher complexity on all symbolic levels; (ii) compared to v4, v5 has a longer text length but lower entropy rates, implying that Rule 5 is not efficient but highly predictable, i.e., less complex; and (iii) Rule 3 and Rule 6 are both inefficient and increase complexity. Taken together, the analyses presented in this appendix show that the entropy-length trade-off we present in the main body of our paper is indeed not trivial, as one can define arbitrary language-based rules that increase both entropy and length.
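For concreteness, a minimal sketch of how two of the six rules could be implemented as string transformations is given below. The actual synthetic versions were generated with the Stata script released on OSF; this Python rendering is only illustrative.

```python
# Sketch: two of the artificial rules as simple string transformations.
import re

def rule2_contract_of_the(text):
    # Rule 2: replace every occurrence of "of the" with the new word type "ofthe".
    return re.sub(r"\bof the\b", "ofthe", text)

def rule6_split_ing(text):
    # Rule 6: in every second occurrence of a word ending in 'ing',
    # insert a space before the final 'ing' (e.g. 'fishing' -> 'fish ing').
    counter = 0
    def repl(match):
        nonlocal counter
        counter += 1
        word = match.group(0)
        return word[:-3] + " ing" if counter % 2 == 0 else word
    return re.sub(r"\b\w+ing\b", repl, text)

sample = "The fishing of the boats and the singing of the crowd were ongoing."
print(rule2_contract_of_the(sample))
print(rule6_split_ing(sample))
```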
List of abbreviations
AIC – Akaike's information criterion
ASL – American Sign Language
BIC – Bayesian information criterion
BPE – byte pair encoding
bps – bits per symbol
CI – confidence interval
CPU – central processing unit
EGIDS – Expanded Graded Intergenerational Disruption Scale
FMA – frequentist model averaging
GPU – graphics processing unit
HPC – high-performance computing
ISO – International Organization for Standardization
LCC – Leipzig Corpora Collection
LM – language model
LSTM – long short-term memory
ML – maximum likelihood
MLEM – multilevel effects model
PBC – Parallel Bible Corpus
SE – standard error
TED – Technology, Entertainment, Design

Acknowledgements
Not applicable

Availability of data and materials
All data and code (Stata v18.0 and Python v3.6.8) needed to replicate our analyses are available at https://osf.io/xdwjc/. In addition, interactive results and visualisations are available online at https://www.owid.de/plus/tradeoffvis/.

Competing interests
The authors declare that they have no competing interests.

Funding
To train the TRFmed and TRFbig language models, we acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.

Authors' contributions
AK: conceptualisation (lead), data curation, resources (lead), formal analysis, investigation, methodology, software (lead), validation (lead), visualisation (lead), writing – original draft, writing – review & editing (equal). SW: software (supporting), visualisation (supporting), validation (supporting), writing – review & editing (equal). JOR: resources (supporting). PM: conceptualisation (supporting), writing – review & editing (equal).
Clean Full Text(not set)
Language(not set)
Doi10.31219/osf.io/8xgqz
Arxiv(not set)
Mag(not set)
Acl(not set)
Pmid(not set)
Pmcid(not set)
Pub Date2024-08-27 01:00:00
Pub Year2024
Journal Name(not set)
Journal Volume(not set)
Journal Page(not set)
Publication Types(not set)
Tldr(not set)
Tldr Version(not set)
Generated Tldr(not set)
Search Term UsedJehovah's AND yearPublished>=2024
Reference Count(not set)
Citation Count(not set)
Influential Citation Count(not set)
Last Update2024-11-14 00:00:00
Status0
Aws Job(not set)
Last Checked(not set)
Modified2025-01-13 22:06:16
Created2025-01-13 22:06:16