Chromatic transitions in the emergence of syntax networks
The emergence of syntax during childhood is a remarkable example of how complex correlations unfold in nonlinear ways through development. In particular, rapid transitions seem to occur as children reach the age of two, which seems to separate a two-word, tree-like network of syntactic relations among words from the scale-free graphs associated with the adult, complex grammar. Here, we explore the evolution of syntax networks through language acquisition using the chromatic number, which captures the transition and provides a natural link to standard theories on syntactic structures. The data analysis is compared to a null model of network growth dynamics which is shown to display non-trivial and sensible differences. At a more general level, we observe that the chromatic classes define independent regions of the graph, and thus, can be interpreted as the footprints of incompatibility relations, somewhat as opposed to modularity considerations.
1. Introduction
The origins of human language have been a matter of intense debate. Language is a milestone in our evolution as a dominant species and is likely to pervade the emergence of cooperation and symbolic reasoning [1–4]. Maybe the most defining and defeating trait is its virtually infinite generative potential: words and sentences can be constructed in recursive ways to generate nested structures of arbitrary length [3,5]. Such structures are the product of a set of rules defining syntax, which are extracted by human brains through language acquisition during childhood after a small sample of the whole combinatorial universe of sentences has been learned. And yet, in spite of its complexity, syntax is accurately acquired by children, who master their mother tongue in a few years of learning. Indeed, around the age of two, linguistic structures produced by children display a qualitative shift on their complexity, indicating a deep change on the rules underlying them [6–8]. This sudden increase of grammar complexity is known as the syntactic spurt, and defines the edge between the two words stage, where only isolated words or combinations of two words occur, to a stage where the grammar rules governing this syntax are close to the one we can find in adult speech—although the cognitive maturation of kids makes the semantic content or the pronunciation different from the adult one. How can we explain or interpret such nonlinear pattern?
Statistical physicists have approached the problem of language evolution showing, for example, that non-trivial patterns are shared between language inventories—collections of words—and some genetic and ecological neutral models [9] (see [10] and references therein). However, most of these models do not make any assumption about the role played by actual interactions among words, or, more generally, linguistic units, which largely define the nature of linguistic structures. In this context, a promising approach to its structure and evolution involves considering language in terms of networks of interconnected units instead of unstructured collections of elements—e.g. words or syllables [11–15]. In this context, syntactic networks, in which nodes are words and links the projection of actual syntactic relations, have been shown to be an interesting abstraction to grasp general patterns of language production [7,8,16]. Specially valuable has been the quantitative data obtained from syntax networks obtained along the process of syntax acquisition, for they provided solid and quantitative evidence of sudden qualitative shifts in the cognitive machinery involved in the process, present also in other linguistic domains [7,8,16–19].
At the fundamental level, syntax can be understood as a set of symbols associated under a universe of potential combinations somewhat similar to chemistry. Atoms and words would then be linked through compatibility relations defining what can be combined and what is forbidden. The power of this picture is supported by the use of linguistic methods in the systematic characterization of chemical structures [20]. Chemical structure diagrams can thus be seen as some sort of language, with chemical species and bonds as key ingredients. In a more abstract fashion, we can say that general rules of combining elements within a given set of interacting pieces with well-defined functional meaning is at work in both language and chemistry. Following the chemical analogy, where abstract classes of ‘nodes’ can be defined, we will take advantage of graph colourability theory as a general framework to detect transitions based on qualitative changes of compatibilities. Specifically, we suggest that a new combinatorial approach grounded on graph colouring may enable a better understanding of the evolution of networks having internal relations of compatibility—e.g. some kind of syntactic rules. In this context, we propose the chromatic number—and complementary measures—of the graph [21–23] as an indicator of network complexity. The chromatic number is defined as the minimal number of colours needed to paint all nodes of the graph in a way that no adjacent nodes have the same colour [22]. In other words, classes of nodes would be defined precisely by the fact that there are no connections among them, a measure conceptually opposite to graph modularity. The q-colouring problem, i.e. to know whether a graph can be coloured with q different colours, is one of the most important NP-complete problems. From the statistical physics point of view, an analogous problem is defined within the context of the Pottsmodel [24]. Transitions in the evolution of the chromatic number, which is the main objective of this work, have been widely studied in abstract models of random graphs [23,25–27].
The chromatic number may convey structural information among the classes of relations the graph is showing within the system it aims to abstract. This is exactly the point by which graph colouring is relevant for syntactic phenomena. We will work with a graph of aggregated syntactic relations using the paradigm of dependence grammar [28], which provides a natural framework to extract graphs from syntactic relations [7,8]. The aim of using the chromatic number comes from the intuition that syntactic relations do not glue elements for freebut display consistent rules of compatibility/incompatibility among lexical elements. This may seem obvious at the level of the sentence analysis, but to extrapolate how these combinatorial rules among different classes of elements work at the global system level is a hard task, and even harder, if we want to do it quantitatively. For example, to grasp the relevance of chromatic number, one must perform parallel measurements on the network using indicators taking into account potential deviations—which will be the footprints of the non-trivial compatibility relations. The interplay between the evolution of the chromatic number and its deviations from a null model made of free associations will be the target of our paper. As we shall see, non-trivial transitions between different, increasing chromatic numbers, along with interesting deviations from a null model of syntax-free sentence generation are identified. This is, to the best of our knowledge, the first time that such transitions have been reported in a real system.
2. Graphs and colouring: basics
We will work over undirected graphs. An undirected graph
We can map the chromatic problem into the antiferromagnetic q-dimensional Potts model at T = 0 [24]. This model is a generalization of the classical Ising model for lattices: at every node of this lattice we place a particle having a spin which energetically constrains the state of its neighbours. Traditionally, spins can have only two states, namely | ↑ 〉 and | ↓ 〉. In the Potts model, compatibility relations take into account an arbitrary number q > 2 of different states. Let us consider a partition of nodes V containing q different classes, namely, Gq(V) = {g1, …, gq} of V, i.e.
Despite the high complexity of this problem—computing the chromatic number in an arbitrary graph is a NP-hard problem—several bounds can be defined. A lower bound can be defined from the so-called clique number. A clique is a subgraph in which every node is connected to all other nodes in the subgraph. The clique number
3. The evolution of χ along syntax acquisition
Here, we study the evolution of the chromatic number through language development as captured by syntax graphs. We compare the chromatic number with the lower and upper bounds provided by the clique number and the maximal K-core, respectively. We assess the relevance of computed chromatic numbers with the corresponding minimal energy. The combination of these two measurements enable us to interpret the nature of the chromatic number. Specifically, we can check whether changes in this number reflect a global pattern or instead some anomalous behaviour of a small, localized subgraph. Finally, we provide further validation of our analysis by comparing chromatic numbers in empirical and synthetic networks obtained through a random sentence generator.
3.1. Building the networks of early syntax
Through the process, networks built upon the aggregation of syntactic structures from child’s productions grow and change in a smooth fashion until a rapid transition occurs [7,8,30,31] (see also [13]). We reconstruct syntactic networks by projecting the raw constituent structure, i.e. phrase structure of children’s utterances, into linear relations among lexical items, in what is known as dependency grammar analysis [28,32]. Then, we aggregate all these productions in a single graph where nodes are lexical items and links represent syntactic relations between them [7,30,31]. We emphasize that these networks have been built by hand, in the sense that no automatic procedure has been at work. The reason stems from the fact that early child language is far from normative, but, still, structures can be identified. Therefore, each link is discussed after checking its suitability according to specific linguistic criteria developed for this analysis (see [7,30,31] and references therein). These networks provide a unique window into the patterns of change occurring in the language acquisition process.
The two cases studied here are obtained from the CHILDES database [33,34] which includes conversations between children and parents. Specifically, we choose Peter and Carl’s corpora, whose structure has been accurately extracted and curated. For both Peter and Carl’s corpora, we choose 11 different recorded conversations distributed in approximately uniform time intervals ranging from the age of approximately 20 months to the age of approximately 28 months. The chosen interval corresponds to the period in which the syntactic spurt takes place. From every recorded conversation, we extract the syntactic network of child’s utterances obtaining a sequence of 11 syntactic graphs corresponding to the sequence of Peter's conversations
3.2. Chromatic transition from bipartite to multicoloured networks
From our graph collection (see §3.1), we obtain two sequences of chromatic numbers sP(χ) and sC(χ) corresponding to the evolution of the chromatic number in Peter and Carl datasets, respectively:
The grammar at this stage mainly generates pairs of complementary words, like:
Still, the behaviour of the chromatic number of a graph
Table 1.
Relative energy values of q-colourings in the Peter (top) and Carl (bottom) datasets. Relative energies reveal the fraction of frustrated links in the optimal colouring using q different colours.
3.3. Real syntax versus null model
Here, we compare the evolution of the chromatic number in real and simulated networks. A data-driven, syntax-free model that generates random child’s utterances having the same statistics of word production as Peter and Carl datasets is used as a null model [7]. Underlying the null model outlined below, there is the aim to reproduce a syntax-free speech flow, with the same statistical indicators as the real data. That means that we prioritized the simulation of a speaker whose statistics over words usage and sentence length mimic the ones given by the data. Different realizations of the model may lead, for example, to slightly different number of used words—for it is a stochastic phenomena with fluctuations at the sizes we are working in. This is due to the fact that our aim has been, not to randomize the network itself—which would have been the standard approach—but the process that creates the network. Since the network is a surrogate of an underlying phenomenon, it is more realistic to create a random version of such underlying phenomenon and, then, build the network, than randomize the network itself. This model definition enables us to assess if the high combinatorics displayed by post-transition networks emerge directly from an increasingly rich vocabulary. We build our model by extracting the following statistical parameters from the 11 recorded conversations in Peter and Carl corpora:
(1) |
The number of sentences |SP(i)|, SC(i) in the Peter and Carl datasets. |
||||
(2) |
The probability distribution of structure lengths or the probability P(s) that any syntactic structure has s words. We obtain two different distributions, one for each dataset. |
||||
(3) |
We assume that the probability of the ith most frequent word is a scaling law
|
We run the above model in the two datasets by generating |SP,C(i)| random sentences, each experiment is repeated 20 times. From the collection of randomly generated syntactic structures we construct a comparable sequence of syntax networks following the same method as in the real datasets (see §3.1). Figure 3 shows that our model generates random syntax networks with size and connectivity comparable to the ones measured in real networks. These statistical indicators display a huge increase during the studied period, this increase being sharper around the age of two, i.e. during the syntactic spurt [7]. As discussed in §2, both the mean connectivity and network size play an important role when determining the values of ω, χ and
Now, we compute the sequence of averaged chromatic numbers,
A very interesting feature is found at the first stages of the simulated Peter sequence: the random networks are no longer bipartite—see §3.2. In particular, the third random corpus has an average chromatic number of 4, which is significantly higher than the observed chromatic number. In this case, the two-stage grammar imposes severe constraints on what is actually plausible in any pre-transition syntactic structure. This trend is also observed at later stages of language acquisition. In general, simulated networks have higher chromatic numbers than empirical networks, although both two types of networks have similar connectivities—by definition. In some cases, the average chromatic number of the graphs belonging to the random ensemble is twice the real one (figure 2). To better understand the nature of these deviations, we have compared the behaviour of chromatic numbers against mean connectivity and the size of the largest connected component. Figure 4 shows a well-defined, non-trivial deviation between real networks and random networks. In particular, when comparing the relation between the chromatic number and the average degree of Peter’s corpus (figure 4a) and Carl’s corpus (figure 4b) with the simulated ones, we observe a clear trend of the real networks towards smaller chromatic numbers. The comparison of the size of the giant connected component, GCC, clearly depending on the size in the case of scale-free networks [29], shows the same trend both in Peter’s (figure 4c) and Carl’s (figure 4d) corpora. In general, the expected chromatic number is larger than the one observed in the real networks. These plots suggest that the chromatic number is capturing essential combinatorial properties of the underlying system, which cannot be reproduced with a simple, syntax-free random generation model. We support this argument by providing a linear regression fit to the data. In the case of the relation of the GCC and the chromatic number, the mean squared error (MSE) is substantially higher in the case of the real instances, Peter and Carl (figure 4).
4. Discussion
Syntax is a characteristic, complex and defining feature of language organization. It pervades its capacity for unbounded generative power of the linguistic system [5], allows sentences to be organized in highly structured ways and is acquired in almost full power by children after being exposed to a limited repertoire of examples. Syntax is also one aspect of the whole: semantic and phonological aspects need to be taken into account, and they are all embedded in (and run by) a cognitive, brain-embodied framework [35]. Because of the dominant role played by how words actually interact with each other, computational and theoretical approaches dealing with word inventories or other statistical trends ignoring interactions are likely to be limited. As an example of the high degree of intricacy involved in linguistic acquisition, we mention two recent works: First, recent studies on acquisition in French toddlers provided strong evidences for non-trivial interactions between phonological, semantic and syntactic modules, showing the presence of inhibition/activation patterns in the acquisition dynamics involving cross-dependencies among those modules [36]. The second example comes from the framework of multiplex networks—i.e. networks involving different layers of interaction (see [37,38] and references therein). Specifically, it has been shown that, taking into account different layers of interactions, a critical phase can be identified from a simple, restricted grammar towards a flexible, more complex grammar [18,19], consistent with the results provided in our paper.
Following previous work that takes advantage of complex networks approaches to language organization [13] we have made a step further in studying the structure of syntax graphs using graph colouring. The motivation of this approximation is twofold. On the one hand, graph colourability allows to properly detect correlations that are not captured by topological approaches. On the other hand, it seems a natural way to substantiate previous claims connecting syntax with compatibility relations common with other types of systems, such as chemical structures. In this context, standard network measurements like average degree, clustering or degree distribution are much more limited. Since graph colouring naturally defines compatibility through the presence or absence of a common label to every pair of nodes, it seems the right framework to study the process of network growth in child language. The behaviour of the chromatic number accurately marks the syntactic spurt in language acquisition, i.e. it is a footprint of the generative power of the underlying grammar.
There are limitations associated with the network definition. Syntactic relations are structure-dependent, not sequence dependent. Because the network is an aggregation of text sequences, it cannot fully grasp the hierarchical nature associated with syntactic constructs. Still, the chromatic number is a global measurement that can detect grammar constraints by analysing the pattern of network interaction at different scales. That is, the network representation is an indicator of global linguistic proficiency and includes some combinatorial signal which can be properly detected with the chromatic number. Besides the suitability of the measure, it is in force to highlight that more longitudinal studies are needed. In this case, we studied two single individuals whose data is of excellent quality. Moreover, we assembled the syntactic networks by discussing the linguistic validity of each syntactic relation in detail. We therefore have chosen this high level of accuracy in our analysis, in spite of performing a massive one with less delicate assembling methodology. Further studies should perform much more longitudinal explorations, involving eventually other languages or bilingual/multilingual children, with the same degree of detail in the analysis, when possible.
There are other, broader implications of our work. The chromatic number can be viewed as a reciprocal measure of standard community detection. Here, the chromatic number defines a partition of the network in classes of unlinked nodes. This definition is particularly relevant in networks where some kind of compatibility relation is at work in the wiring process. In this case, the standard community structure can be misleading, because elements of the same class cannot be connected. The case for syntactic graphs is paradigmatic but the partition induced by the chromatic number could shed light into the behaviour of many other systems. Additionally, we have proposed to assess the statistical significance of these partitions with the sequence of minimal violations—see equation (2.10). Future work should explore how the chromatic number—and related measures—can be exploited to detect forbidden links in the network. Deviations of the chromatic number, as the ones observed in this paper, suggest the presence of combinatorial constraints that must be taken into account, for example, when defining proper null-models.