A global optimization paradigm based on change of measures
A global optimization framework, COMBEO (Change Of Measure Based Evolutionary Optimization), is proposed. An important aspect in the development is a set of derivative-free additive directional terms, obtainable through a change of measures en route to the imposition of any stipulated conditions aimed at driving the realized design variables (particles) to the global optimum. The generalized setting offered by the new approach also enables several basic ideas, used with other global search methods such as the particle swarm or the differential evolution, to be rationally incorporated in the proposed set-up via a change of measures. The global search may be further aided by imparting to the directional update terms additional layers of random perturbations such as ‘scrambling’ and ‘selection’. Depending on the precise choice of the optimality conditions and the extent of random perturbation, the search can be readily rendered either greedy or more exploratory. As numerically demonstrated, the new proposal appears to provide for a more rational, more accurate and, in some cases, a faster alternative to many available evolutionary optimization schemes.
1. Introduction
Potential applications of global optimization are ubiquitous across a large spectrum of problems in science and engineering, from very-large-scale integration circuit design to the optimization of transportation routes and the determination of protein structures. An essential aspect of the solution to any such problem is the extremization of an objective functional, possibly subject to a prescribed set of constraints. The solution is deemed to have been achieved if the global extremum of the objective functional is reached. For many practical applications, the objective functional could be non-convex, non-separable and even non-smooth. This precludes the use of gradient-based local optimization approaches for such problems, an aspect that has driven extensive research in devising global optimization strategies, many of which employ a stochastic (i.e. random evolutionary) search of heuristic or meta-heuristic origin [1]. Indeed, finding the global optimum within a search space of the parameters or design variables is much more challenging than arriving at a local extremum. In the absence of generic directional information (equivalent to the Gâteaux or directional derivative in the search for a local extremum of a sufficiently smooth cost functional), a direct search seems to be the only option left. Some of the notable schemes in this genre include variants of the genetic algorithm (GA) [2], differential evolution (DE) [3], particle swarm optimization (PSO) [4], ant colony optimization [5] and the covariance matrix adaptation evolution strategy (CMA-ES) [6]. In a search for the global extremum, stochastic schemes typically score over their gradient-based deterministic counterparts [7,8], even for cases involving sufficiently smooth cost functionals with well-defined directional derivatives.
Nevertheless, thanks to the proper directional information guiding the update, gradient-based methods possess the benefit of faster convergence to the nearest extremum, as long as the objective functional is sufficiently smooth. To the authors' knowledge, the existing evolutionary global optimization schemes do not offer the benefit of well-founded directional information. Most evolutionary schemes depend on a random (diffusion-type) scatter applied to the available candidate solutions or particles, together with some criteria for selecting the new particles. Unfortunately, a ‘greedy’ approach to the global optimum (e.g. selecting particles of higher ‘fitness’ with higher probability), though tempting from the perspective of faster convergence, faces the hazard of getting trapped in local extrema. To guard against such premature and possibly false convergence, many evolutionary global search schemes adopt safeguards. Despite the wide acceptability of a few evolutionary methods of heuristic/meta-heuristic origin [9], the underlying justification is often based on sociological or biological metaphors [1–6,10] that are hardly founded on a sound probabilistic basis, even though a random search forms a key ingredient of the algorithm. In this context, some interesting comments on the facile usage of natural metaphors in the development of many global optimization algorithms, without a serious engagement with scientific ratiocination, may be found in [10]. Be that as it may, the popular adoption of these schemes is owing not only to their algorithmic simplicity, but mainly to their effectiveness in treating many non-deterministic polynomial-time hard optimization problems, which may be contrasted with the relatively poorer performance of some of the well-grounded stochastic schemes, e.g. simulated annealing [11], stochastic tunnelling [12], etc.
Whether the slow convergence to the global optimum, not infrequently encountered with some of the popular evolutionary schemes, could be fixed by reorienting them within an alternative stochastic framework, however, remains a moot point, which is addressed in this work to an extent. Another related question concerns the number of parameters in an algorithm that the end-user must tune for an appropriate exploration–exploitation trade-off [13,14], a phrase used to indicate the relative balance of the random scatter of the particles vis-à-vis their selection based on ‘fitness’. Admittedly, the notion of random evolution, built upon Monte Carlo (MC) sampling of the particles, efficiently explores the search space, though at the cost of possibly slower convergence [15] in contrast to a gradient-based method. For better exploration, many such schemes, e.g. the GA, adopt steps like ‘crossover’, ‘mutation’, ‘selection’, etc. While ‘crossover’ and ‘mutation’ impart variations to the particles, the ‘selection’ step assigns each particle a ‘weight’ or fitness value (a measure of importance), determined as the ratio of the realized value of the objective functional to the available maximum of the same. In the selection step, the fitness values, used to update the particles via selection of the best-fit individuals for subsequent crossover and mutation, may be considered functionally analogous to the derivatives used in gradient-based approaches. An analogy may also be drawn between the fitness values and the likelihood ratios commonly employed in nonlinear stochastic filtering, e.g. the family of sequential importance sampling filters [16]. Even though such weight-based approaches reduce a measure of misfit between the available best and the rest within a finite ensemble of particles, they bring with them the curse of degeneracy, wherein all particles but one tend to receive zero weights as the iterations progress [17].
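The degeneracy noted above is easy to reproduce numerically. In the following sketch (the per-iteration log-fitness increments are hypothetical and not tied to any particular scheme), repeated multiplicative weighting concentrates almost all of the normalized weight on one particle, which the effective sample size makes visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                             # ensemble size (illustrative)
log_w = np.zeros(n)                # log-weights, uniform at the start

for _ in range(30):                # 30 iterations of multiplicative weighting
    log_w += rng.normal(size=n)    # hypothetical log-fitness increments

w = np.exp(log_w - log_w.max())    # normalize in log-space for stability
w /= w.sum()
ess = 1.0 / np.sum(w ** 2)         # effective sample size, always in [1, n]

print(f"max weight = {w.max():.3f}, effective sample size = {ess:.1f} of {n}")
```

The effective sample size, 1/Σw², is a standard diagnostic of weight degeneracy in sequential importance sampling; values far below n signal imminent particle collapse.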
This problem, also referred to as ‘particle collapse’ in the stochastic filtering parlance, can be arrested, in principle, by exponentially increasing the ensemble size (number of particles), a requirement that can hardly be met in practice [18]. Evolutionary schemes that replace the weight-based multiplicative approach with additive updates for the particles often succeed in eliminating particle degeneracy. Schemes like DE and PSO, which are known to perform better than the GA, use such additive particle update strategies without adversely affecting the ‘craziness’ or randomness in the evolution. On reaching the optimum, the additive correction shrinks to zero. Unfortunately, none of these methods obtains the additive term, which may be looked upon as a stochastic equivalent of derivative-based directional information, in a rigorous or optimal manner, a fact that is perhaps responsible for a prohibitively large number of functional evaluations in some cases.
In framing a rigorous stochastic basis leading to directional updates of the additive type, several possible schemes are suggested in this work. First, it is shown that the problem of optimization may be generically posed as a martingale problem [19,20], which must however be randomly perturbed to facilitate the global search. For instance, one may perform a greedy search for an extremum of an objective functional by solving a martingale problem, which takes the form of an integro-differential equation herein referred to as the extremal equation. In line with the Freidlin–Wentzell theory [21], the martingale problem may further be perturbed randomly so as to enable the greedy scheme to converge to the global extremum as the strength of the perturbation vanishes asymptotically. The solution through this approach, viewed as a stochastic process parametrized by the iterations, depends crucially on an error or ‘innovation’ function and is deemed to have been realized when the innovation, interpreted as a stochastic process, is driven to a zero-mean noise or martingale [19]. The martingale structure of the innovation essentially implies that the mean of the computed cost functional in future iterations remains constant and invariant to zero-mean random perturbations of the argument vector (the design variables). The argument vector should then correspond to an extremum. In the evolutionary random set-up, both the cost functional and its argument vector are treated as stochastic diffusion processes, even though the original extremization problem is deterministically posed. Realizing a zero-mean martingale structure for the innovation requires a change of measures, which in turn leads to additive gain-type updates on the particles. The gain coefficient matrix, which is a replacement for and a generalization over the Fréchet derivative of a smooth cost functional, provides an efficacious directional search without requiring the functional to be differentiable.
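The flavour of such a gain-type additive update can be conveyed by a minimal sketch. Purely as an illustrative stand-in for the measure-theoretic construction (the function names and the specific gain below are assumptions, not the paper's scheme), the gain is built from ensemble statistics, namely the cross-covariance of the particles with the realized cost scaled by the cost variance, and a noise-perturbed innovation drives each particle's cost towards the current ensemble best; the update shrinks to zero as the ensemble settles at an extremum:

```python
import numpy as np

def gain_update(particles, cost, rng, noise_std=1e-2):
    """One derivative-free, additive update step (illustrative sketch only)."""
    f = cost(particles)                          # (n,) realized costs
    dx = particles - particles.mean(axis=0)
    df = f - f.mean()
    cov_xf = (dx * df[:, None]).mean(axis=0)     # particle-cost cross-covariance, (d,)
    var_f = (df ** 2).mean() + 1e-12             # cost variance (regularized)
    gain = cov_xf / var_f                        # ensemble 'gain' vector
    # innovation: mismatch with the current best cost, plus zero-mean noise
    innov = (f.min() - f)[:, None] + rng.normal(scale=noise_std, size=(len(f), 1))
    return particles + innov * gain[None, :]     # additive, gain-weighted update

rng = np.random.default_rng(1)
sphere = lambda p: (p ** 2).sum(axis=1)          # test cost, minimum at the origin
x0 = rng.normal(size=(40, 2)) + 3.0              # ensemble started off-optimum
x = x0.copy()
for _ in range(200):
    x = gain_update(x, sphere, rng)
print(sphere(x).min())                           # best realized cost after updates
```

Note how the sketch mirrors the text: the gain stands in for the Fréchet derivative, no differentiability of the cost is required, and the innovation vanishes (up to the injected noise) once all particles realize the same cost.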
Additionally, in order to facilitate the global search, a general random perturbation strategy (using steps such as ‘scrambling’, ‘relaxation’ and ‘selection’) is adopted to guard against possible trapping of the particles in local extrema.
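A minimal sketch of such perturbation layers follows; the specific implementations of ‘scrambling’ (random exchange of coordinates between particles) and ‘selection’ (Metropolis-like acceptance of the perturbed particles) are illustrative assumptions rather than the paper's exact operators:

```python
import numpy as np

def scramble(particles, rng, p=0.3):
    """Randomly exchange a fraction p of coordinates between particle pairs."""
    n, d = particles.shape
    partners = rng.permutation(n)            # a random partner for each particle
    mask = rng.random((n, d)) < p            # which coordinates to swap
    out = particles.copy()
    out[mask] = particles[partners][mask]
    return out

def select(old, new, cost, rng, temp=1.0):
    """Accept each perturbed particle with a probability that decays with any
    increase in cost; improvements are always accepted."""
    df = cost(new) - cost(old)
    accept = rng.random(len(df)) < np.exp(-np.maximum(df, 0.0) / temp)
    return np.where(accept[:, None], new, old)

rng = np.random.default_rng(2)
sphere = lambda p: (p ** 2).sum(axis=1)
x = rng.normal(size=(30, 4))
y = select(x, scramble(x, rng), sphere, rng)  # one scramble-then-select sweep
```

Raising temp (or p) renders the search more exploratory, while lowering them makes it greedier, the exploration–exploitation trade-off discussed earlier.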
An important feature of the present approach is the flexibility with which the innovation process may be designed in order to meet a set of conflicting demands en route to the detection of the global extremum. The efficiency of the global search basically relies upon the ability of the algorithm to explore the search space while preserving some directionality that helps to quickly resolve the nearest extremum. The development of the proposed set-up, referred to as COMBEO (Change Of Measure Based Evolutionary Optimization), recognizes the near impossibility of a specific optimization scheme performing uniformly well across a large class of problems. In this context, one may refer to variants of the famous ‘no free lunch’ theorem, where it is proved that the average performance of any two algorithms across all possible problems is identical [22]. Accommodation of this fact had earlier led to the proposal of an evolutionary scheme that simultaneously ran different optimization methods for a given problem, with some communication built in among the updates by the different methods [23]. A major related contribution of this work is to bring the basic ideas for global search used in a few well-known optimization schemes under a single unified framework supported by a sound probabilistic basis. In a way explained later, the ideas (or their possible generalizations) behind some of the existing optimization methods may sometimes be readily incorporated in the present setting by just tweaking the innovation process.
The rest of the paper is organized as follows. In §2, the problem of finding an extremum of an objective functional is posed as a martingale problem represented through an integro-differential equation, whose solution only realizes a local optimum. The integro-differential equation is discretized and weakly solved within an MC setting so as to circumvent its inherent circularity. Section 3 discusses several random perturbation schemes to arrive at the global optimum efficiently without getting stuck in local traps. By combining these tools in different ways, a few pseudo-codes are presented in §4. In §5, the performance of COMBEO is compared with DE, PSO and CMA-ES in the context of several benchmark problems. In the same section, the new optimization approach is also applied for the quantitative recovery of boundary layer density variation in a high-speed flow given the light travel time data from an experiment. Finally, the conclusions of this study are drawn in §6.
2. Search for an extremum through a martingale problem
Here, it is demonstrated that functional extremization, subject to a generic set of equality constraints, can be posed as a martingale problem upon a proper characterization of the design variables within a probabilistic setting. However, before elaborating on this, a few fundamental features of the new evolutionary approach are listed below:
(i) at a given iterate, the solution is a random variable (taking values in the search space);

(ii) hence the solution process along the iteration axis may be considered a stochastic process, and the iterations must be so designed that the optimal solution is asymptotically approached by the mean (i.e. the first moment) of the solution process; and

(iii) since upon convergence the mean should be iteration-invariant, random fluctuations about the converged mean must assume the structure of a zero-mean stochastic process (along the iteration axis), thereby allowing us to identify the fluctuations as a noise process, e.g. a Brownian motion or, more generally, a zero-mean martingale.
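Property (iii) can be checked empirically on any converged run: treating the iterates as a stochastic process along the iteration axis, the increments about the converged mean should average to (nearly) zero. A toy sketch, with a hypothetical converged mean and fluctuation level chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x_star = np.array([1.0, -2.0])    # hypothetical converged mean (illustrative)
iters = 5000
# post-convergence iterates: the converged mean plus zero-mean fluctuations
xs = x_star + 0.05 * rng.normal(size=(iters, 2))
incr = np.diff(xs, axis=0)        # increments along the iteration axis
bias = np.abs(incr.mean(axis=0)).max()
print(bias)                       # near zero: increments behave as zero-mean noise
```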
In posing the global optimization problem within a stochastic framework, a complete probability space (Ω, F, P) is considered.
Note that a ready extension of this approach is possible with multi-objective functions, which would merely require building up the innovation as a vector-valued process. Similarly, the approach may also be adapted for treating equality constraints, wherein zero-valued constraint functions, perturbed by zero-mean noise processes, may be used to construct the innovation processes. The easy adaptability of the current set-up for multiple objective functions or constraints may be viewed as an advantage over most existing evolutionary optimization schemes, where a single cost functional needs to be constructed based on all the constraints, a feature that may possibly engender instability for a class of large dimensional problems.
Driving xτ to an extremum
Within a τ-discrete framework with τ∈(τi−1,τi], xτ is evolved following equation (2.1). An innovation constraint that may be imposed for an accelerated (greedy) search in a neighbourhood of an extremum is given by
Note that the replacement of Δτ by Δτi merely modifies the intensity of the noise process Δητ in equation (2.3). The form of the SDE (2.6) has the desirable feature that the fictitious diffusion coefficient Δτi is an order ‘smaller’ relative to the drift coefficient f(xτ). However, since ητ is not a standard Brownian motion, it is more convenient to rewrite equation (2.6) as
As shown in appendix A (a theorem and its corollary) using Itô's expansions of xτΛτ and (Λτ)−1, the incrementally additive updates required to arrive at an extremum must follow from the differential equation:
Note that the directionality of the search process provided by the second (integral) term on the right-hand side (r.h.s.) of the equation above may also be gauged from the fact that the integrand in this term can be interpreted as a Malliavin-type derivative via the Clark–Ocone theorem [25]. However, the appearance of the unknown term πτ(f) on the r.h.s. prevents equations (2.11) and (2.12) from qualifying as an SDE in πτ(x). Indeed, the necessarily nonlinear dependence of f on xτ and the consequent non-Gaussianity of πτ(f) prevent writing the latter in terms of πτ(x). This results in the so-called closure problem in solving for πτ(x).