
The power of associative learning and the ontogeny of optimal behaviour

Behaving efficiently (optimally or near-optimally) is central to animals' adaptation to their environment. Much evolutionary biology assumes, implicitly or explicitly, that optimal behavioural strategies are genetically inherited, yet the behaviour of many animals depends crucially on learning. The question of how learning contributes to optimal behaviour is largely open. Here we propose an associative learning model that can learn optimal behaviour in a wide variety of ecologically relevant circumstances. The model learns through chaining, a term introduced by Skinner to indicate learning of behaviour sequences by linking together shorter sequences or single behaviours. Our model formalizes the concept of conditioned reinforcement (the learning process that underlies chaining) and is closely related to optimization algorithms from machine learning. Our analysis dispels the common belief that associative learning is too limited to produce ‘intelligent’ behaviour such as tool use, social learning, self-control or expectations of the future. Furthermore, the model readily accounts for both instinctual and learned aspects of behaviour, clarifying how genetic evolution and individual learning complement each other, and bridging a long-standing divide between ethology and psychology. We conclude that associative learning, supported by genetic predispositions and including the oft-neglected phenomenon of conditioned reinforcement, may suffice to explain the ontogeny of optimal behaviour in most, if not all, non-human animals. Our results establish associative learning as a more powerful optimizing mechanism than acknowledged by current opinion.

1. Introduction

We often marvel at animals efficiently performing long sequences of behaviour, and theoretical and empirical studies confirm that animals behave optimally or near-optimally under many circumstances [1–3]. Typically, optimal behaviour has been assumed to result from natural selection of genetically determined behavioural strategies [4], yet in many species behaviour is crucially shaped by individual experiences and learning [5–7]. Existing work has considered how learning can optimize single responses [8–13] or specific sequences of two or three behaviours [14,15]. However, the question of how, and how much, learning contributes to optimal behaviour is still largely open. Here we analyse in general the conditions under which associative learning can optimize sequences of behaviour of arbitrary complexity.

Associative learning is acknowledged to contribute to adaptation by enabling animals to anticipate meaningful events (Pavlovian, or ‘classical’ conditioning) and to respond appropriately to specific stimuli (operant, or instrumental conditioning) [16,17]. Associative learning, however, is also considered mindless, outdated and too limited to learn complex behaviour such as tool use, foraging strategies or any behaviour that requires coordinating actions over a span of time (e.g. [18–21]). Such behaviour, when it is not considered genetically determined, is attributed to other learning mechanisms, usually termed ‘cognitive’ (e.g. [22–24]). However, associative learning has not been evaluated rigorously as a potential route to optimal behaviour [25,26]. Rather, claims about its limitations have rested on intuition rather than on formal analysis and proof. In this paper, we develop an associative learning model that can be proved to closely approximate optimal behaviour in many ecologically relevant circumstances. The model has two key features: it augments standard associative learning theory with a mathematical model of conditioned reinforcement, and it integrates instinctual and learned aspects of behaviour in one theoretical framework. The latter aspect is discussed later; in this introduction, we focus on conditioned reinforcement.

Conditioned reinforcement (also referred to as secondary reinforcement) is a learning process whereby initially neutral stimuli that predict primary reinforcers can themselves become reinforcers [27–30]. For example, a dog that repeatedly hears a click before receiving food will eventually consider the click rewarding in itself, after which it will learn to perform behaviour whose sole outcome is to hear the click [31]. Conditioned reinforcement was a prominent topic in behaviourist psychology [27,29,32–34], but interest in it waned with behaviourism [35]. As a result, conditioned reinforcement was left out of the mathematical models of the 1970s and 1980s that still form the core of animal learning theory [36–40]. There are two fields, however, that have carried on the legacy of conditioned reinforcement research. The first is animal training, in which methods that rely on conditioned reinforcement are the primary tool to train behaviour sequences (see below and [31]). The second is the field of reinforcement learning, a branch of artificial intelligence that blends ideas from optimization theory and experimental psychology [41,42], and which has also become influential in computational neuroscience (e.g. [43,44]). The key element of reinforcement learning algorithms, referred to as learning based on temporal differences, is closely related to conditioned reinforcement [45–49]. A remarkable result of reinforcement learning research is that conditioned reinforcement implements a form of dynamic programming. The latter is an optimization technique used extensively by biologists to find optimal behavioural strategies, and therefore, to assess whether animals behave optimally [1,2]. It is not, however, a realistic model of how animals can learn to behave optimally, as it requires perfect knowledge of the environment and extensive computation. Conditioned reinforcement, on the other hand, is computationally simple as well as taxonomically widespread, suggesting that optimal behaviour may be learned rather than inherited [47].

The conceptual connections that we just summarized have been noted previously (e.g. [41,47]), but have not translated into a coherent research program. Conditioned reinforcement has not been systematically integrated with animal learning theory, nor with knowledge about instinctual behaviour from ethology, nor with the study of optimal behaviour in behavioural ecology. Our goal is to sketch a first such synthesis. We call our learning model ‘chaining’ after Skinner [30,50,51], who described how conditioned reinforcement can link together single behaviours to form sequences (chains) that ultimately lead to primary reinforcement.

2. Chaining: dynamic programming in vivo

To highlight connections to associative learning theory, behavioural ecology and reinforcement learning, we present our model in successive steps. We first consider a standard model of associative learning without conditioned reinforcement. This model can optimize single behaviours but not behaviour sequences. We then add conditioned reinforcement, obtaining our chaining model. Lastly, using ideas from reinforcement learning, we show that chaining can optimize sequences of behaviour in a similar way to dynamic programming.

Our general framework is as follows. We consider an animal that can find itself in a finite (albeit arbitrarily large) number of environmental states, among which transitions are possible. For example, states may represent spatial locations, and state transitions movement from one location to another. We assume that the animal can perceive without ambiguity which environmental state it is in (see §(c) and appendix A.3 for discussion, and for a model that does not require this assumption). By choosing its behaviour, the animal can influence transitions from one state to the next. Transitions can be deterministic (in each state, each behaviour always leads to the same next state) or stochastic (in each state, a behaviour may lead to different states, with fixed probabilities). Each state S has a primary reinforcement value, uS, which is genetically determined and serves to guide learning towards behaviour that promotes survival and reproduction. For example, a state corresponding to the ingestion of food would typically have positive value, while a state representing harm to the body would have a negative value. States that describe neutral conditions, e.g. waiting, are assumed to have a small negative value, corresponding to the time and energy expended while in the state. The animal's goal is to choose its behaviour to maximize the total value collected. To begin with, we do not assume any innate knowledge of the environment beyond the ability to recognize a number of biologically relevant situations such as pain and the ingestion of food, which are assumed to have suitable uS values. Hence, the appropriate behaviour must be learned.
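To make this framework concrete, the following is a minimal sketch of how a finite state space with primary values uS and behaviour-dependent, possibly stochastic transitions could be represented and sampled. It is our own illustration, not the authors' code: all state and behaviour names, numbers and the dictionary-based representation are assumptions made for the example.

```python
import random

# Primary reinforcement values u_S: positive for food, negative for harm,
# small negative for neutral states (time/energy cost). Numbers are illustrative.
u = {"searching": -0.1, "food_found": -0.1, "eating": 1.0, "injured": -5.0}

# transitions[state][behaviour] is a list of (next_state, probability) pairs.
transitions = {
    "searching":  {"explore": [("food_found", 0.3), ("searching", 0.6), ("injured", 0.1)],
                   "rest":    [("searching", 1.0)]},
    "food_found": {"peck":    [("eating", 1.0)],
                   "explore": [("searching", 1.0)]},
    "eating":     {"swallow": [("searching", 1.0)]},
    "injured":    {"rest":    [("searching", 1.0)]},
}

def step(state, behaviour):
    """Sample the next state and return it with its primary value u_S'."""
    next_states, probs = zip(*transitions[state][behaviour])
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, u[s_next]

print(step("food_found", "peck"))   # -> ('eating', 1.0)
```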

2.1. Learning a single behaviour

Consider first the optimization of a single behavioural choice. For example, we may consider a bird that finds a fruit and can choose from a repertoire of m behaviours (peck, fly, sit, preen, etc.). One behaviour (peck) leads to a food reward (tasting the fruit's sweet juice); all others have no meaningful consequences. We can imagine the animal as attempting to estimate the value of each behaviour, in order to then choose the one with the highest value (this notion will be made precise below). Suppose the animal is in state S, chooses behaviour B and finds itself in state S′. Note that, in general, a state S may be followed by a number of states S′, either because the environment is not deterministic or because the animal does not always use the same behaviour B when it finds itself in state S. Hence S′ does not represent a fixed state, but rather whichever state follows S on a particular occasion. Let vSB be the value estimated by the animal for choosing behaviour B in state S, and uS′ the primary reinforcement value of S′. A simple way of learning useful estimates is to update vSB as follows after each experience:

ΔvSB = αv(uS′ − vSB)     (2.1)

where ΔvSB is the change in vSB, and αv is a positive learning rate. The meaning of equation (2.1) is easily understood in a deterministic environment. In this case, the state S′ is always the same, hence uS′ is fixed. Over repeated experiences, equation (2.1) causes vSB to approach the value uS′. Thus, the value of choosing B in state S is equated with the primary reinforcement value that can be obtained by such a choice. If the environment is not deterministic, vSB approaches the average reward value of all states S′ that follow, each weighted by its probability of occurrence, provided αv is not too large. Equation (2.1) is identical to the classic Rescorla–Wagner learning rule [36], but we consider it in an instrumental rather than a Pavlovian setting [37].
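As an illustration, the update in equation (2.1) amounts to a few lines of code. This is a minimal sketch in Python (our own function and variable names, not part of the paper), assuming value estimates start at zero:

```python
def update_v(v, state, behaviour, u_next, alpha_v=0.1):
    """Equation (2.1): Delta v_SB = alpha_v * (u_S' - v_SB).
    v maps (state, behaviour) pairs to estimated values; unvisited pairs count as 0."""
    key = (state, behaviour)
    old = v.get(key, 0.0)
    v[key] = old + alpha_v * (u_next - old)

# With a fixed outcome (u_S' = 1), the estimate approaches 1 over repeated
# experiences, as described in the text.
v = {}
for _ in range(50):
    update_v(v, "fruit", "peck", u_next=1.0)
print(round(v[("fruit", "peck")], 3))   # ~0.995 after 50 experiences
```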

To complete our model, we need to specify how behaviours are chosen. The basic requirement for a viable decision rule is that it should preferentially choose behaviours that have a higher estimated value (so that rewards can be collected), while at the same time leaving some room for exploring alternative behaviours (so that accurate value estimates can be learned). A simple way to address both concerns is the so-called ‘softmax’ rule, which specifies the probability of behaviour B in state S as:

Pr(B|S) = exp(βvSB) / ΣB′ exp(βvSB′)     (2.2)

where the sum runs over all possible behaviours. The parameter β regulates exploration: if β=0 all behaviours are equally likely irrespective of estimated value, whereas if β is very large only the behaviour with the highest estimated value occurs with appreciable probability. Equation (2.2) is broadly compatible with known aspects of animal choice behaviour. For example, if two behaviours B1 and B2 have different estimated values in state S, equation (2.2) does not choose exclusively the more profitable one. Rather, the relative probability of choice depends on the difference in estimated values:

Pr(B1|S) / Pr(B2|S) = exp(β(vSB1 − vSB2))     (2.3)

This relative preference is compatible with the ‘matching law’ of experimental psychology, according to which the probability of choosing a behaviour is an increasing function of the amount of reinforcement obtained from the behaviour [52,53].
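A sketch of the softmax rule of equation (2.2), again with our own function names and illustrative numbers, shows how the relative preference of equation (2.3) emerges from sampled choices:

```python
import math, random

def softmax_choice(v, state, behaviours, beta=2.0):
    """Equation (2.2): choose behaviour B with probability proportional to
    exp(beta * v_SB). beta = 0 gives uniform choice; large beta is nearly greedy."""
    weights = [math.exp(beta * v.get((state, b), 0.0)) for b in behaviours]
    return random.choices(behaviours, weights=weights)[0]

# Equation (2.3): the relative preference depends only on the difference in
# estimated values, here exp(2 * (1.0 - 0.5)) ~ 2.7 to 1 in favour of B1.
v = {("S", "B1"): 1.0, ("S", "B2"): 0.5}
counts = {"B1": 0, "B2": 0}
for _ in range(10_000):
    counts[softmax_choice(v, "S", ["B1", "B2"])] += 1
print(counts["B1"] / counts["B2"])   # close to exp(1) ~ 2.72
```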

2.2. Learning behaviour sequences

While equation (2.1) can optimize a single behavioural choice, it cannot optimize sequences of behaviours. Figure 1 shows a simple environment in which the animal has to correctly perform a sequence of l actions in order to reach a reward. Equation (2.1) can learn the correct behaviour in state l−1 because it results in a reward. In the other states, however, correct behaviours are not rewarded, and equation (2.1) will learn to assign them a value of −c, i.e. the same value as the incorrect behaviours. The problem can be overcome if states can acquire conditioned reinforcement value. For example, if the animal repeatedly chooses the correct action in state l−1 and thereby experiences the rewarding state l, then state l−1 will acquire conditioned reinforcement value. Conditioned reinforcement functions in the same way as primary reinforcement. That is, if the animal now takes the correct action in state l−2 and thereby transitions to state l−1, then state l−1 will be deemed reinforcing. Thus taking the correct action will be reinforced; additionally, state l−2 will in turn acquire conditioned reinforcement value. In this way, value can propagate backwards all the way to the beginning of the chain and eventually reinforce correct actions even in state 0. We now formalize these intuitions.

Figure 1.

Figure 1. A simple environment in which a sequence of l actions is required in order to reach a reward. The animal can be in any of l+1 states, numbered 0 to l and represented as circles. Numbers inside the circles represent primary reward values (uS). The last state has positive value; other states have negative value (b,c>0). In each state, the animal can choose a behaviour from a repertoire of m behaviours. In each state there is a ‘correct’ behaviour that brings the animal to the next state (shown by arrows). All other behaviours bring the animal back to state 0 (not shown to avoid clutter), at which point the animal can attempt again to reach the rewarding state l. When state l is reached, the animal goes back to state 0 and can try again to reach the reward.

Let wS be the conditioned reinforcement value of state S. It is natural to modify equation (2.1) as follows:

ΔvSB = αv(uS′ + wS′ − vSB)     (2.4)

In other words, the value of behaviour B in state S is taken to be the sum of the primary and conditioned reinforcement values of state S′. In this way, reaching a state with conditioned value can be reinforcing, even if the state has no primary value. But how do states acquire conditioned value? We assume that the conditioned value of S is updated according to:

ΔwS = αw(uS′ + wS′ − wS)     (2.5)

where ΔwS is the change in wS and αw a positive parameter akin to αv in equation (2.4). According to equation (2.5), the conditioned reinforcement value wS is updated to approach the value uS′+wS′, i.e. the total value of the following state. We continue to assume that decision making operates according to equation (2.2). Equations (2.4) and (2.5) constitute our chaining model. They appear in [54] in the context of machine learning, but have been used only sporadically in this field. They also appear in [15] without justification. In the next section, we provide two examples of how chaining can relate to animal learning, and in the following section we discuss how chaining can learn optimal behavioural strategies.
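The sketch below puts equations (2.2), (2.4) and (2.5) together on the task of figure 1. It is our own illustration, not the authors' simulation code: the parameter values, the random assignment of the ‘correct’ behaviour and the treatment of the rewarded state as the end of an attempt are all assumptions made for the example.

```python
import math, random

def run_chaining(l=4, m=5, b=2.0, c=0.1,
                 alpha_v=0.2, alpha_w=0.2, beta=4.0, n_steps=50_000):
    """Chaining (equations (2.2), (2.4) and (2.5)) on the chain task of figure 1:
    states 0..l, one 'correct' behaviour per state advances the chain, any other
    behaviour sends the animal back to state 0. Reaching state l yields the
    reward and, in this sketch, simply ends the attempt (so w[l] stays at 0)."""
    behaviours = list(range(m))
    correct = {s: random.randrange(m) for s in range(l)}   # unknown to the learner
    u = {s: -c for s in range(l)}
    u[l] = b                                               # rewarded final state
    v = {}                                                 # v[(S, B)], default 0
    w = {s: 0.0 for s in range(l + 1)}                     # conditioned values w_S
    state, rewards = 0, 0
    for _ in range(n_steps):
        vals = [v.get((state, a), 0.0) for a in behaviours]
        B = random.choices(behaviours,                     # softmax, equation (2.2)
                           weights=[math.exp(beta * x) for x in vals])[0]
        s_next = state + 1 if B == correct[state] else 0
        old = v.get((state, B), 0.0)
        v[(state, B)] = old + alpha_v * (u[s_next] + w[s_next] - old)   # eq. (2.4)
        w[state] += alpha_w * (u[s_next] + w[s_next] - w[state])        # eq. (2.5)
        if s_next == l:
            rewards += 1
            s_next = 0                                     # start a new attempt
        state = s_next
    return rewards, w

random.seed(0)
rewards, w = run_chaining()
print("rewards collected:", rewards)
print("conditioned values along the chain:", {s: round(x, 2) for s, x in w.items()})
```

In runs like this, the learned wS increase from state 0 towards the rewarded end of the chain, which is the backward propagation of value described verbally above.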

2.3. Two examples: self-control and expectations

To illustrate our model, and substantiate the claim that associative learning is currently underestimated, we consider two ‘cognitive’ phenomena, self-control and expectations of future events, that are commonly thought to lie beyond the scope of associative learning [55,56].

Many species, both in the laboratory and in nature, demonstrate a degree of self-control, ‘the ability to inhibit a pre-potent but ultimately counter-productive behaviour’ [56]. At first sight, associative learning would seem to always prefer an immediately rewarded response to a non-rewarded one, which would result in poor self-control [55]. However, ‘wait’ is also a behaviour that can be reinforced [31]. Figure 2, indeed, shows that self-control can be learned through chaining in a task similar to those used in the literature on self-control [56]. In a representative simulation, we see that waiting initially struggles to increase in frequency, but eventually it is learned as the most profitable option. Functionally, this is simply a consequence of the optimizing power of chaining: it can learn to forfeit an immediate reward for a future one when it is optimal to do so. Mechanistically, waiting can be reinforced if it results in stimuli that are conditioned reinforcers. This is what happens in our simulation: the correct sequence (wait, then take) is initially performed by chance, which leads to the intermediate state acquiring conditioned reinforcement value. At this point, waiting can be reinforced and taking the small reward is progressively abandoned.

Figure 2.

Figure 2. Self-control through chaining. (a) A task in which the animal can either take a small reward immediately, or wait and take a larger reward later. Each circle is a state with its value inscribed. We set c=0.2, b1=1 and b2=5. In all states, a third action (not represented) causes the animal to leave the task and go back to the initial state (Small reward). (b) Sample simulation of the chaining model on this task. An ‘attempt’ (horizontal axis) is defined as a sequence of actions comprising two successive visits to the initial state. Model parameter values were αv=0.1, αw=0.1 and β=2.

Chaining, however, can fail to learn self-control if the animal cannot distinguish between states in which it pays to wait and states in which it pays to act, or if the benefit of waiting is not experienced. These considerations may help explain why many animals find it hard to postpone rewards (e.g. [57]). Such ‘impulsivity’ is not necessarily maladaptive: it often pays a cheetah to wait upon sighting prey, but it seldom pays a toad. Differences in self-control can result from genetic evolution tuning learning mechanisms to each species' environment (see below and [58,59]). For example, cheetahs may have a genetic predisposition for waiting rather than attacking immediately, which would facilitate the discovery that waiting is profitable. One way to build such a predisposition into our model is to let the β value for waiting be higher than that for attacking, in states that correspond to prey having been spotted. Such a difference in β would lead to waiting being chosen more often in these states. A high initial value of vSB in these states would also make waiting more likely.

The view that self-control is learned based on a genetic predisposition for waiting is consistent with the observation that self-control correlates with absolute, but not relative, brain size [56]. The latter is often considered a proxy for cognitive sophistication, while the former correlates with body size and lifespan. Hence a possible reading of the data is that longer-lived animals have more self-control, which is expected as they have more time to invest in learning longer behaviour sequences [60,61]. Thus taxonomic variation in self-control may result from tuning chaining to the needs of different species, rather than from different species having a more or less developed ‘cognitive ability’ for self-control.

Similar arguments apply to expectations. Animals often show ‘expectations’ about forthcoming events. For example, they may react to the omission of an expected food reward by searching for the missing food, or with aggressive behaviour [62,63]. Search is appropriate when food is likely to have disappeared out of sight, whereas aggression is suitable when food is likely to have been stolen. At first sight, such behaviour seems hard to reconcile with associative learning, including chaining, because these mechanisms do not formulate explicit predictions about forthcoming events (e.g. [16,64]). Associative learning models, however, do compute the extent to which the reinforcement value of stimuli is ‘surprising’ or ‘unexpected’ [36,65]. Our model, for example, calculates differences between estimated and realized values, such as the difference in equation (2.4). For brevity, let us write the error term in equation (2.4) as

dv = uS′ + wS′ − vSB     (2.6)

A negative dv means a smaller reward than expected, whereas a positive dv indicates a larger reward than expected. The usual role of differences such as dv in animal learning theory is to drive learning (see equation (2.4)), but they have also been suggested to influence choice of behaviour [28,66,67]. Animals may have genetic predispositions that favour certain behaviours when dv signals a violated expectation. Formally, we can let dv influence the value of β in equation (2.2). For example, setting β=β0−dv (where β0 is a baseline value) for aggressive behaviour will make aggression more likely when dv<0, i.e. when an expected reward is omitted. (Aggression would also be less likely when dv>0, e.g. when a reward is larger than expected.) This assumption is consistent with the observation that larger violations of expectations trigger more pronounced responses [62].
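A small sketch of how such a predisposition could be implemented is given below. The behaviour and state names ('aggress', 'search', 'empty_feeder') and all numbers are hypothetical; the form β=β0−dv for the aggressive behaviour follows the suggestion in the text, and the modulation only shifts choice when the estimated value of aggression is non-zero, as assumed here.

```python
import math, random

def prediction_error(u_next, w_next, v_sb):
    """Equation (2.6): d_v = u_S' + w_S' - v_SB. Negative when an expected
    reward is omitted; positive when the outcome is better than expected."""
    return u_next + w_next - v_sb

def choose_with_expectation(v, state, behaviours, d_v, beta0=2.0):
    """Softmax choice (equation (2.2)) in which the exploration parameter for the
    hypothetical behaviour 'aggress' is modulated by the current prediction
    error, beta = beta0 - d_v, as discussed in the text."""
    weights = []
    for b in behaviours:
        beta = beta0 - d_v if b == "aggress" else beta0
        weights.append(math.exp(beta * v.get((state, b), 0.0)))
    return random.choices(behaviours, weights=weights)[0]

# Illustration: with some learned value for aggression, an omitted reward
# (d_v < 0) raises the probability that aggression is chosen.
v = {("empty_feeder", "aggress"): 0.5, ("empty_feeder", "search"): 0.5}
for d_v in (0.0, -1.0):
    n = sum(choose_with_expectation(v, "empty_feeder", ["aggress", "search"], d_v) == "aggress"
            for _ in range(10_000))
    print(f"d_v = {d_v}: aggression chosen in {n / 10_000:.2f} of choices")
```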

2.4. Learning optimal behaviour

We derived equations (2.4) and (2.5) by adding conditioned reinforcement to a standard model of associative learning. The same equations can be derived from considerations of optimality based on work from reinforcement learning [41]. This derivation shows how chaining is connected to dynamic programming and optimization. Consider a task that is guaranteed to last a finite time, and let VS be the expected reward that can be gained from all states that come after state S, when following a given behavioural strategy (in our case, equation (2.2) with a given set of vSB values). Formally:

VS = ES(uS′ + uS′′ + ⋯ + uSend)     (2.7)

where the sum runs over successive states, S′ being the state that follows S, and Send the state that ends the task. ES(⋅) is the expectation with respect to all possible successions of states, from S until the end of the task. We have an expectation rather than a fixed number because both the task and the behavioural strategy may not be deterministic, so that many different sequences of states and actions are possible starting from state S. In equation (2.7), we can separate out the first term in the sum, uS′:

VS = ES(uS′ + uS′′ + ⋯ + uSend)     (2.8)
   = ES(uS′) + ES(uS′′ + ⋯ + uSend)     (2.9)

If PSS′ is the probability of going from S to S′, the first expectation is simply

ES(uS′) = ΣS′ PSS′ uS′     (2.10)

because uS′ depends only on S′ and not on later steps. In the second expectation, we can also make this first step explicit:

ES(uS′′ + ⋯ + uSend) = ΣS′ PSS′ ES′(uS′′ + ⋯ + uSend)     (2.11)

and note that, by the definition in equation (2.7), the remaining expectation is VS′, where S′′ is the state that follows S′. We can thus rewrite equation (2.7) as:

VS = ΣS′ PSS′ (uS′ + VS′)     (2.12)

This is a necessary consistency condition that the VS values must satisfy in order to represent the reward expected after state S [41]. Equation (2.12) expresses the fact that the reward expected after state S is the reward expected from the next state (the uS′ term) plus the reward expected from all states that follow it, which by definition equals VS′. In this way, VS values take into account long-term outcomes in addition to immediate reward. Equations such as (2.12) are referred to as Bellman equations and are the foundation of dynamic programming [41,68]. As we recalled in the introduction, dynamic programming is a useful computational tool to find optimal behavioural strategies, but it does not explain how animals may learn such strategies. Crucially, however, we can see that chaining performs approximate dynamic programming during an animal's lifetime. Indeed, from equation (2.5) we can calculate the expected change in wS over one step of the dynamics:

E(ΔwS) = αw [ΣS′ PSS′ (uS′ + wS′) − wS]     (2.13)

Inside the brackets, we can now recognize an approximation to VS, obtained by replacing the true value of the next state (VS′) with its conditioned reinforcement value (wS′), which has been learned through experience. Thus, over successive passes through S, equation (2.5) works to reduce the difference between the current value of wS and an estimate of its true expected value. A similar argument can be made for equation (2.4). In appendix A, we show that this process is expected to eventually bring wS and vSB close to their true values (the latter being the long-term reward expected after choosing B in S). Thus, chaining can behave optimally within the limits of exploration given by equation (2.2). In practice, a requirement for convergence is that αw and αv be small enough that accumulating changes in wS and vSB over successive experiences approximates the averaging operation in equation (2.13) and the analogous equation for vSB.
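To illustrate the connection, the sketch below iterates the Bellman equation (2.12) for the chain task of figure 1 under a fixed behavioural strategy that chooses the correct behaviour with probability q. The parametrization and the treatment of the rewarded state as the end of the task are our own assumptions; the resulting VS are the values that the conditioned reinforcement values wS come to approximate under that strategy.

```python
def policy_values(l=4, b=2.0, c=0.1, q=0.9, n_iter=1000):
    """Iterate the Bellman equation (2.12) for the chain task of figure 1 under
    a fixed strategy that picks the correct behaviour with probability q (any
    mistake leads back to state 0). Reaching state l is treated as the end of
    the task, so V[l] = 0."""
    u = {s: -c for s in range(l)}
    u[l] = b
    V = {s: 0.0 for s in range(l + 1)}
    for _ in range(n_iter):
        for s in range(l):
            # Equation (2.12): V_S = sum over S' of P_SS' (u_S' + V_S')
            V[s] = q * (u[s + 1] + V[s + 1]) + (1 - q) * (u[0] + V[0])
    return V

# V increases towards the rewarded end of the chain.
print({s: round(x, 2) for s, x in policy_values().items()})
```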

3. Efficiency of chaining

We established above that chaining is expected to learn optimal behaviour sequences, but we did not discuss how long that may take. In this section, we delimit the conditions under which chaining can learn within a realistic time frame. We start with the task in figure 1, followed by variants that apply to different learning scenarios.

3.1. The basic learning time

In the task in figure 1, an animal has to learn a sequence of l behaviours, each of which can be chosen from a repertoire of m behaviours. At each step, only one action is correct, resulting in progression to the next step; all other actions cause the animal to start from scratch. Although chaining can learn the task, the expected learning time is at least m^l attempts (see appendix), implying that even moderately long sequences may not be learned in realistic time. A sequence of four behaviours drawn from a repertoire of just 10 behaviours is expected to take 10^4 = 10 000 attempts to learn. The fundamental reason for such long learning times is that chaining (or any other learning mechanism) is not magic: it still needs to find the reward, which initially can happen only by chance. Another factor complicates the problem further in natural environments. Imagine an inexperienced predator that can choose to hunt either a small prey that can be caught with a single behaviour, or a large prey that can be caught only through a sequence of behaviours. The predator may be initially unbiased in its choice, yet hunting the small prey will be rewarded much more often than hunting the large one. Thus, the predator may end up hunting the small prey more and more often, making it less and less likely that it will practise hunting the large prey. In general, the possibility of choosing between easy and hard problems weighs against learning the hard problems, because rewards obtained on the easy problems lead the learner to spend less time exploring the hard ones.
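The exponential growth of this chance-discovery bound is easy to appreciate with a few illustrative numbers (the m and l values below are arbitrary examples):

```python
# Chance discovery of an unguided sequence: of the order of m^l attempts are
# needed before the rewarded sequence is first completed.
for m, l in [(10, 2), (10, 4), (10, 6), (5, 8)]:
    print(f"repertoire m = {m}, sequence length l = {l}: ~{m**l:,} attempts")
```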

In summary, it is important to recognize that, although chaining can learn arbitrarily long sequences of behaviours, in practice, it cannot do so within a reasonable time unless favourable circumstances intervene. Understanding what these circumstances may be is crucial in evaluating whether chaining can plausibly account for the ontogeny of optimal behaviour. One favourable circumstance is having some knowledge of the task, so that not everything has to be learned. We will discuss below how this can be accomplished through genetic predispositions that favour appropriate behaviour. Another circumstance leading to shorter learning times does not depend on prior knowledge, but rather on the environment facilitating task solution, as we discuss next.

3.2. Entry and exit patterns

The goal of this section is to show that learning times can be dramatically shorter than what we just calculated, provided a task is sometimes entered from states that are close to the reward. For example, a squirrel that is learning to handle hazelnuts may occasionally find one with a broken or weakened shell, which would reinforce the final prying behaviour without the need to master biting and cracking the shell. In §(a), we will consider other examples.

We call the rules that govern how often each state is entered the ‘entry pattern’ of a task. For the sake of illustration, we consider the model task in figure 1 with the following family of entry patterns:

  • — With probability p, the animal enters the task at a random state (excluding the last, rewarded state).

  • — With probability 1−p, the animal enters at the first state.

We continue to assume that all mistakes bring the animal back to the first state. Setting p=0 means always entering at the first state. In this case, the learning time is exponential in the length of the sequence, as discussed above. As p increases, entering at states that are closer to the reward becomes more likely. Figure 3a shows the dramatic effect that even a moderate facilitation, p=0.1, has on learning times. Thus a favourable entry pattern can bring within reach a task that otherwise could not be learned in a realistic time frame. If p=1 (entry at a random state), the learning time is only quadratic in the length of the sequence (table 1 and appendix).
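The sketch below simulates chaining on the task of figure 1 under the entry pattern just described, using the number of attempts needed to collect a fixed number of rewards as a crude proxy for learning time. The parameter values, the success criterion and the designation of behaviour 0 as ‘correct’ are our own illustrative assumptions, and the printed numbers vary between runs.

```python
import math, random

def attempts_until_rewards(p, n_rewards=50, l=6, m=3, b=5.0, c=0.2,
                           alpha=0.2, beta=6.0, max_attempts=100_000):
    """Chaining (equations (2.4) and (2.5)) on the task of figure 1 under the
    entry pattern described above: with probability p an attempt starts in a
    random non-rewarded state, otherwise in state 0. Behaviour 0 is arbitrarily
    the 'correct' one; a mistake ends the attempt. Returns the number of
    attempts until n_rewards rewards have been collected."""
    u = {s: -c for s in range(l)}; u[l] = b
    v = {}                                     # v[(state, behaviour)]
    w = {s: 0.0 for s in range(l + 1)}         # conditioned reinforcement values
    rewards = 0
    for attempt in range(1, max_attempts + 1):
        state = random.randrange(l) if random.random() < p else 0
        while True:
            vals = [v.get((state, a), 0.0) for a in range(m)]
            B = random.choices(range(m), weights=[math.exp(beta * x) for x in vals])[0]
            s_next = state + 1 if B == 0 else 0
            old = v.get((state, B), 0.0)
            v[(state, B)] = old + alpha * (u[s_next] + w[s_next] - old)   # eq. (2.4)
            w[state] += alpha * (u[s_next] + w[s_next] - w[state])        # eq. (2.5)
            if B != 0:
                break                          # mistake: the attempt is over
            if s_next == l:
                rewards += 1
                break                          # reward reached
            state = s_next
        if rewards >= n_rewards:
            return attempt
    return max_attempts

random.seed(1)
for p in (0.0, 0.1, 1.0):
    print(f"p = {p}: {attempts_until_rewards(p)} attempts to collect 50 rewards")
```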

Figure 3.

Figure 3. (a) Graph of some of the equations in table 1, with a=1 and m=2. Note the logarithmic vertical axis. (b) Number of attempts required to learn sequences of different lengths trained by forward chaining, supporting the linear relationship expected under the hypothesis that chaining is the mechanism underlying such learning. The learning criterion was 80% correct in the macaque studies [69], 70% in pigeon 1 (only 5 of 7 pigeons learned the four-step sequence [70]) and 75% for pigeon 2 [71]. The behaviour sequence consisted of pressing response keys in a specific order.

Table 1.

Learning times of the chaining model as a function of ‘entry patterns’, i.e. ways in which an animal may enter a task to be learned. Symbols: l, sequence length; m, number of behaviours available at each step; a, number of trials required to learn a rewarded action. Note that forward chaining is not purely an entry pattern, as it also includes rewarding intermediate states (see text).

Animal trainers exploit favourable entry patterns to teach remarkably long sequences of behaviour. In backward chaining (where ‘chaining’ refers to the training technique rather than to a learning process), sequences are taught starting from the last step. That is, the animal is placed repeatedly at state l−1 of an l-step sequence, until the correct behaviour is learned. This also endows state l−1 with conditioned reinforcement value. At this point, the animal is placed in state l−2, where correct behaviour can be reinforced by the conditioned reinforcement value that state l−1 has acquired during the previous training. Once the second-last step is learned, the animal is placed in the third-last and so on. In this way, the animal is always learning just one behaviour, and the learning time becomes linear in the length of the sequence. In forward chaining, training starts from the first step, but the reward structure of the task is altered so that initially even performing the first behaviour is rewarded. Once the first behaviour is learned, the animal is required to perform two behaviours to obtain the reward, and so on until the whole sequence is learned. This also results in linear learning time, which agrees with observations (figure 3b).
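As an illustration of the training procedure, the sketch below applies backward chaining to the task of figure 1: the simulated learner is repeatedly placed at the last state until it reliably reaches the reward, then at the second-last state, and so on. Parameter values and the learning criterion (five consecutive errorless trials per stage) are our own assumptions; in runs like this the total number of training trials grows roughly linearly with sequence length, in line with the argument above.

```python
import math, random

def backward_chaining(l, m=3, b=5.0, c=0.2, alpha=0.2, beta=6.0, criterion=5):
    """Backward-chaining training on the task of figure 1: the learner is placed
    at state l-1 until it reliably reaches the reward, then at l-2, and so on
    down to state 0. A mistake ends the trial and the trainer re-places the
    learner. Returns the total number of training trials used."""
    u = {s: -c for s in range(l)}; u[l] = b
    v = {}
    w = {s: 0.0 for s in range(l + 1)}
    trials = 0
    for start in range(l - 1, -1, -1):               # stages: l-1, l-2, ..., 0
        streak = 0
        while streak < criterion:
            trials += 1
            state, rewarded = start, False
            while True:
                vals = [v.get((state, a), 0.0) for a in range(m)]
                B = random.choices(range(m),
                                   weights=[math.exp(beta * x) for x in vals])[0]
                s_next = state + 1 if B == 0 else 0  # behaviour 0 is 'correct'
                old = v.get((state, B), 0.0)
                v[(state, B)] = old + alpha * (u[s_next] + w[s_next] - old)  # eq. (2.4)
                w[state] += alpha * (u[s_next] + w[s_next] - w[state])       # eq. (2.5)
                if B != 0:
                    break
                if s_next == l:
                    rewarded = True
                    break
                state = s_next
            streak = streak + 1 if rewarded else 0
    return trials

random.seed(0)
for l in (2, 4, 8, 16):
    print(f"sequence length l = {l}: {backward_chaining(l)} training trials")
```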

Related to entry patterns are ‘exit patterns’, by which we refer to what happens when an incorrect behaviour is chosen. So far, we have assumed that incorrect behaviours would result in starting back from scratch. For example, a predator that makes a mistake when approaching prey will most probably lose the opportunity to catch the prey. In other cases, however, mistakes have milder consequences. A squirrel attempting to open a nut, for example, can try again if the first bite does not succeed. In the most favourable case, an animal may have virtually unlimited opportunities to try to progress in the sequence, without losing what has been accomplished so far. Such a sequence could be learned rapidly (it takes ml(l+1)/2 time steps to learn the complete sequence, see appendix). We will study a concrete example of a favourable exit pattern when discussing primate stone tool use below.