Calculating mutual information for spike trains and other data with distances but no coordinates
Many important data types, such as the spike trains recorded from neurons in typical electrophysiological experiments, have a natural notion of distance or similarity between data points, even though there is no obvious coordinate system. Here, a simple Kozachenko–Leonenko estimator is derived for calculating the mutual information between datasets of this type.
2. Introduction
This article describes a simple formula for calculating the mutual information between random variables where one or both of the variables take values in a metric space. This is relevant to neuroscience because electrophysiological data, whether spike trains from single neurons or collections of spike trains from a population of neurons, can be naturally considered to take values in a metric space [1,2]. It is, in turn, useful to be able to calculate information theory quantities for these data as part of an investigation into effective coding theories of neurodynamics or as a tool for quantifying the relationship between the activity of different neurons or different neuronal populations. However, the relevance is not limited to spike trains; it extends to other electrophysiological data types, such as calcium or electroencephalogram traces. Indeed, non-coordinate metrics or similarity measures are also used, for example, in genetics and biochemistry [3,4], in image analysis [5] and in information retrieval [6].
The aim here is to address two difficulties associated with estimating mutual information on metric spaces. The first difficulty is that information theory is most typically applied to problems where the data are either discrete or take values in an integrable manifold. The second difficulty is that many approaches to estimating information theory quantities can demand an unrealistically large amount of data. The second of these difficulties has been addressed in the past by the Kozachenko–Leonenko estimator [7–10], but, in line with the first difficulty, this estimator is derived specifically for an integrable manifold or using a local effective dimension. The aim here is to provide a simple approach to a Kozachenko–Leonenko estimator which applies to metric spaces. With this extension, mutual information can be calculated for the broad class of important data where there is a similarity measure, but no coordinates.
The particular application which motivates this article is the problem of calculating information theory quantities for spike trains. Three different types of neuroscience experiment could be considered typical. In the first, the activity of a neuron or group of neurons is recorded from the brain of an anesthetized or restrained animal while that animal is being presented with a series of stimuli. The challenge is to estimate the mutual information between the stimulus and the neuronal activity during the presentation. In the second, neuronal activity is recorded while an animal is moving freely in an arena and the mutual information is to be estimated between the position of the animal at a given time and the neuronal activity in a temporal window centred on that time. In the third example, the mutual information is to be estimated between temporal slices of the spike trains produced by different neurons so that this can be used to measure the relationship between those neurons, for example, at different times during development.
Estimating mutual information is not straightforward in any of these examples because there is no obvious coordinate system for describing neuronal activity. One approach to solving this difficulty is to discretize the spike trains, turning individual fragments of spike train into sequences of ones and zeros with each bit accounting for the presence or absence of a spike in a corresponding time slot [11]. However, the number of possible words is huge and so this approach is bedevilled by the large amount of data it requires. One response to this problem is to exploit what is sometimes called the birthday problem and to look at coincidences [12,13]. However, the proximity structure of the space of spike trains is poorly approximated by examining the coincidence whereby two spike trains are discretized to the same sequence of ones and zeros. Here, it is proposed that, instead, one of the many metrics or similarity measures on the space of spike trains be used to define proximity [14–20].
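For concreteness, one well-known example of such a metric is the Victor–Purpura edit distance, in which one spike train is transformed into another by inserting, deleting or shifting spikes. The sketch below is a minimal illustration of this kind of distance; the function name and the default cost parameter q are illustrative choices, not taken from the references above.

```python
import numpy as np

def victor_purpura_distance(t1, t2, q=1.0):
    """Victor-Purpura edit distance between two spike trains.

    t1, t2 : sorted arrays of spike times.
    q      : cost per unit time of shifting a spike; inserting or
             deleting a spike costs 1.
    """
    n, m = len(t1), len(t2)
    # G[i, j] holds the distance between the first i spikes of t1
    # and the first j spikes of t2.
    G = np.zeros((n + 1, m + 1))
    G[:, 0] = np.arange(n + 1)  # G[i, 0]: cost of deleting the first i spikes of t1
    G[0, :] = np.arange(m + 1)  # G[0, j]: cost of inserting the first j spikes of t2
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i, j] = min(
                G[i - 1, j] + 1,                                   # delete a spike
                G[i, j - 1] + 1,                                   # insert a spike
                G[i - 1, j - 1] + q * abs(t1[i - 1] - t2[j - 1]),  # shift a spike
            )
    return G[n, m]
```

A kernel-based distance such as the van Rossum metric would serve equally well; as discussed in the conclusion, all that the estimator requires of the metric is that it orders the points around each spike train.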
3. Method
Two simple formulae for mutual information are presented in this article, one for the mutual information between a discrete space and a metric space and one for the mutual information between two metric spaces. These two formulae are intended to cover, respectively, the first and the second and third neuroscientific examples described above. There are two main steps to deriving these formulae. First, probabilities are estimated using a simplified version of the Kozachenko–Leonenko approach [7–10]; this estimate of probability involves terms that depend on the volumes of regions surrounding each data point. Second, these volumes are estimated using the probability distribution itself as a measure.
3.1 A formula for the entropy
Consider estimating entropy for a random variable X which takes values in a space $\mathcal{X}$, given $N$ sample points $\{x_1, x_2, \ldots, x_N\}$ drawn from the distribution of $X$. Around each sample point $x_i$, let $B(x_i, V)$ denote a region of volume $V$ centred on $x_i$.
Now, consider the probability $P_k(x_i)$ that the region $B(x_i, V)$ contains precisely $k$ of the other $N-1$ sample points; if the probability density is approximately constant across the region, so that the probability mass the region contains is approximately $p(x_i)V$, it is
\[
P_k(x_i) = \binom{N-1}{k}\,\bigl[p(x_i)V\bigr]^{k}\,\bigl[1 - p(x_i)V\bigr]^{N-1-k}.
\]
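For clarity, the remaining steps are sketched here; the notation $\bar{k}_i$ for the observed count and the factor of $N-1$ in the normalization are choices made in this sketch and may differ in detail from the article's equation (3.10). Taking the expectation of the binomial count gives
\[
\langle k \rangle = \sum_{k=0}^{N-1} k\,P_k(x_i) = (N-1)\,p(x_i)\,V
\quad\Longrightarrow\quad
p(x_i) \approx \frac{\bar{k}_i}{(N-1)\,V},
\]
where $\bar{k}_i$, the number of sample points actually observed in $B(x_i, V)$, is used as an estimate of $\langle k \rangle$. Substituting this into the sample estimate of the entropy gives
\[
H(X) \approx -\frac{1}{N}\sum_{i=1}^{N} \log p(x_i)
\approx -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\bar{k}_i}{(N-1)\,V}.
\]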
The formula for the entropy, equation (3.10), is similar in spirit to the one given in [7,9]. However, it is not identical, and is, in fact, simpler, because here the probability is estimated using the expected number of points in a ball rather than the size of the ball that contains a given number of points; the latter requires the trinomial, as opposed to the binomial, expansion. Also, of course, the approach here is chosen because it makes it possible to avoid quantities that are only defined on integrable manifolds.
3.2 Estimating the volume
The problem with this is that there may not be an obvious measure with which to calculate volumes. Certainly in the case of spike trains there are no good coordinates and so there is no way to calculate the volume of a region based on the usual sort of coordinate-based measure. However, a probability distribution always defines a measure: the volume of a region can be defined as being equal to the probability mass it contains, so that for a region $A$, $V(A) = P_X(A)$.
This gives a trivial estimate of the entropy: the differential entropy is zero if the probability distribution is used as the measure, and the approximation used here is exact in this case: if $V = h/N$ for some integer $h \le N$, the region $B(x_i, V)$ contains, by this definition of volume, $h$ of the $N$ sample points, so the counting estimate of the probability mass coincides exactly with the volume and the estimated entropy is zero.
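Spelled out in measure-theoretic terms, this is the observation that the density of a distribution with respect to itself is identically one; writing the measure defined above as $V(A) = P_X(A)$,
\[
\frac{\mathrm{d}P_X}{\mathrm{d}V} \equiv 1
\quad\Longrightarrow\quad
H(X) = -\,\mathbb{E}\!\left[\log \frac{\mathrm{d}P_X}{\mathrm{d}V}\right] = 0 .
\]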
Here, two cases will be considered: in the first case one random variable is discrete and the other takes its values in a metric space; in the second case both random variables take values in metric spaces. In addition, an estimate is derived for the Kullback–Leibler (KL) divergence between two distributions over the same metric space.
3.3 The mutual information where one random variable is discrete
In many electrophysiological experiments, stimuli from a discrete corpus are presented repeatedly while spike trains are recorded. In this situation, the stimuli are represented by a discrete random variable and the response by a random variable taking values in a metric space. This is the situation considered here; it is more general than the neuroscientific application, but, for convenience, the terminology is based on that application.
Let $S$ denote the discrete random variable representing the stimulus and $R$ the random variable representing the response, which takes values in a metric space; the data consist of $N$ stimulus–response pairs $(s_i, r_i)$, with $n_s$ of the trials corresponding to stimulus $s$.
The metric on the space of responses is used to define a region $B_R(r_i, V)$ around each response, consisting of the responses closest to $r_i$, and the volume $V$ of such a region is again measured using the marginal distribution of the responses as the measure.
With this measure $H(R) = 0$. The same measure can be used for the conditional probabilities; that is, for calculating $H(R|S=s)$ using the conditional probability $p_{R|S=s}(r)$. Hence, since $I(R;S) = H(R) - H(R|S)$ and $H(R) = 0$, the mutual information is given by $-H(R|S)$, estimated by applying the same ball-counting procedure to the responses that share each stimulus.
This formula is the same as the one proposed in [21]. However, it is derived there in a convoluted way which leans heavily on intuition, whereas here the derivation is straightforward and can be easily extended to the case where both random variables take values in metric spaces.
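As a practical illustration, the following is a minimal sketch of one way the counting scheme described above could be implemented, assuming a precomputed matrix of pairwise distances between responses and a vector of stimulus labels. The ball around each response is taken to contain h of all N responses, the number of same-stimulus responses in the ball is counted, and the two fractions are compared; the inclusion of the point itself in the counts, the base-two logarithm and the absence of any bias correction are choices made here for illustration and may differ from the formula in the article and in [21].

```python
import numpy as np

def mutual_information_discrete(dist, stimuli, h):
    """Sketch of a Kozachenko-Leonenko-style estimate of I(R;S) in bits.

    dist    : (N, N) array of pairwise distances between responses.
    stimuli : length-N array of discrete stimulus labels, one per response.
    h       : number of responses defining the ball around each point.
    """
    dist = np.asarray(dist, dtype=float)
    stimuli = np.asarray(stimuli)
    N = len(stimuli)
    total = 0.0
    for i in range(N):
        # Smallest ball around r_i containing h of the N responses
        # (the point itself counts as one of the h).
        radius = np.sort(dist[i])[h - 1]
        in_ball = dist[i] <= radius
        m_i = np.count_nonzero(in_ball)                 # responses in the ball
        c_i = np.count_nonzero(in_ball & (stimuli == stimuli[i]))
        n_s = np.count_nonzero(stimuli == stimuli[i])   # trials with stimulus s_i
        # Ratio of conditional to marginal probability mass in the ball.
        total += np.log2((c_i / n_s) / (m_i / N))
    return total / N
```

Excluding the point itself from the counts, or replacing m_i / N by the nominal volume h / N, are equally reasonable variants of this sketch.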
3.4 The mutual information where both random variables take values in metric spaces
If both random variables take values in metric spaces, the same construction can be applied to each variable: a region $B_R(r_i, h_1/N)$ is defined around each $r_i$ and a region $B_S(s_i, h_2/N)$ around each $s_i$, with volumes measured using the corresponding marginal distributions, and the joint probability is estimated by counting the data points that fall in both regions.
It is instructive to consider what happens when the two variables are independent. By definition, $B_R(r_i, h_1/N)$ contains $h_1$ points out of $N$; as $R$ and $S$ are independent, this means the average number of points in $B_S(s_i, h_2/N)$ which are also in $B_R(r_i, h_1/N)$ is $h_1 h_2/N$, so the counted number matches the product of the marginal masses and the estimated mutual information is, as it should be, approximately zero.
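A minimal sketch of the corresponding computation is given below, assuming distance matrices for the two variables over the same N paired samples and the same ball-and-count conventions as in the previous sketch; ties, the inclusion of the point itself and the base of the logarithm are again handled by illustrative choices rather than prescriptions from the article.

```python
import numpy as np

def mutual_information_metric(dist_r, dist_s, h1, h2):
    """Sketch of an estimate of I(R;S) in bits when both variables are metric.

    dist_r, dist_s : (N, N) arrays of pairwise distances between the r_i
                     and between the s_i of the N paired samples (r_i, s_i).
    h1, h2         : numbers of points defining the balls B_R and B_S.
    """
    dist_r = np.asarray(dist_r, dtype=float)
    dist_s = np.asarray(dist_s, dtype=float)
    N = dist_r.shape[0]
    total = 0.0
    for i in range(N):
        # Balls of probability volume roughly h1/N and h2/N around r_i and s_i.
        in_r = dist_r[i] <= np.sort(dist_r[i])[h1 - 1]
        in_s = dist_s[i] <= np.sort(dist_s[i])[h2 - 1]
        # Points in both balls estimate the joint probability mass; for
        # independent variables this count is close to h1 * h2 / N.
        c_i = np.count_nonzero(in_r & in_s)
        total += np.log2(N * c_i /
                         (np.count_nonzero(in_r) * np.count_nonzero(in_s)))
    return total / N
```

In the independent case described above, c_i is close to h1 * h2 / N for each point and the estimate is close to zero, matching the argument in the text.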
The formula includes two integer parameters, $h_1$ and $h_2$, which need to be chosen; large values of these parameters reduce the accuracy of the approximation in equation (3.7), where the probability is taken as constant throughout the region, whereas small values reduce the accuracy of the approximation in equations (3.6) and (3.12), where the mean or volume is estimated by counting.
3.5 The Kullback–Leibler divergence on a metric space
The approach described in this article also gives an estimate for the KL divergence. Consider two random variables $R$ and $S$ taking values in the same metric space, with samples drawn from each; the KL divergence $D(R\|S)$ measures how different the two distributions are and can be estimated by the same combination of density estimation and counting used above.
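The following is a minimal sketch of how such an estimate could be computed within this counting framework, assuming N_R samples of R and N_S samples of S in the same metric space, supplied as two distance matrices; which distribution supplies the measure, the clamping of empty counts and the base of the logarithm are assumptions made here for illustration rather than the article's exact prescription.

```python
import numpy as np

def kl_divergence_metric(dist_rr, dist_rs, h):
    """Sketch of a counting estimate of the KL divergence D(R || S) in bits.

    dist_rr : (N_R, N_R) distances between the samples of R.
    dist_rs : (N_R, N_S) distances from each sample of R to each sample of S.
    h       : number of R samples defining the ball around each r_i.
    """
    dist_rr = np.asarray(dist_rr, dtype=float)
    dist_rs = np.asarray(dist_rs, dtype=float)
    n_r = dist_rr.shape[0]
    n_s = dist_rs.shape[1]
    total = 0.0
    for i in range(n_r):
        # Ball around r_i containing h of the R samples (including r_i itself).
        radius = np.sort(dist_rr[i])[h - 1]
        a_i = np.count_nonzero(dist_rr[i] <= radius)   # R samples in the ball
        # S samples in the same ball; clamp to one to avoid log(0) in this sketch.
        b_i = max(np.count_nonzero(dist_rs[i] <= radius), 1)
        # The ratio of the two probability masses estimates p_R / p_S at r_i.
        total += np.log2((a_i / n_r) / (b_i / n_s))
    return total / n_r
```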
4. Conclusion
The formula derived here is very simple and makes no reference to coordinates on the sample space; the intention is to demonstrate that the Kozachenko–Leonenko approach applies to metric spaces. The Kozachenko–Leonenko formula presented in [9] relies on a manifold structure in its derivation, but in its final form it is also applicable to metric spaces. It seems unlikely that the performance of the estimates here will differ from the performance of that formula.
In the estimates here, the distance or similarity measure is only required to order the points. Although electrophysiological data are used as a paradigmatic example, the formula can be applied to any pair of random variables taking values in metric spaces or indeed any space with a similarity or distance function suitable for defining regions surrounding each data point. As such it could be used in a straightforward way to calculate, for example, the mutual information between local field potentials and spike trains, or between the position of an animal in a maze and a neuronal population response, or, indeed, between two collections of images, or between images and text.
The estimate provided here relies on an approximation in which probabilities are replaced by counting how many data points fall within a region in space. This makes sound intuitive sense, but it has not been proved here that the estimate is a good one. Indeed, no principle has been described that would allow the volume of the regions to be chosen to sensibly balance the two competing requirements: small regions to reduce the error in assuming the probability is constant throughout the region and large regions so that counting-based estimates are robust. There are two ways in which these difficulties will need to be addressed: theoretically, in demonstrating that the estimate converges under sensible conditions, and practically, in demonstrating that the estimate is accurate and not overly sensitive to the choice of volumes for the sorts of data that are likely to be of interest. Hopefully, the simplicity of the approach described here will aid further development.