Theory of estimation-of-distribution algorithms

Carsten Witt is an associate professor at the Technical University of Denmark. He received his M.Sc. in 2000 and his Ph.D. in 2004, both in Computer Science from the Technical University of Dortmund, Germany. His main expertise is in the algorithmic analysis of metaheuristics, including evolutionary algorithms, ant colony optimization and estimation-of-distribution algorithms. Carsten has over 80 peer-reviewed publications and has given tutorials on bio-inspired computation in combinatorial optimization at previous GECCO and PPSN conferences. He is a member of the editorial boards of the Evolutionary Computation journal and Theoretical Computer Science.


Some Benchmark Functions
Theoretical results often consider simple problems, which we have to understand first.
OneMax
OneMax(x) = Σ_{i=1}^n x_i, the number of one-bits in x; maximized by the all-ones string.

Compact GA (cGA)
The cGA (Harik et al., 1999) simulates the behavior of a GA with population size K in a compact way: instead of an explicit population, it maintains a frequency p_i for every bit.
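To make the definitions concrete, here is a minimal Python sketch of the cGA on OneMax (an illustrative reconstruction; the parameter values, the borders and the termination check are our own choices):

import random

def onemax(x):
    # OneMax: the number of one-bits; maximized by the all-ones string.
    return sum(x)

def cga(n, K, f, max_iters=10**6):
    # Compact GA: keeps one frequency p[i] per bit instead of a population.
    # Each iteration samples two offspring, compares them, and shifts every
    # frequency where the offspring differ by 1/K towards the winner's bit.
    p = [0.5] * n
    for t in range(max_iters):
        x = [1 if random.random() < p[i] else 0 for i in range(n)]
        y = [1 if random.random() < p[i] else 0 for i in range(n)]
        if f(y) > f(x):
            x, y = y, x                              # x is now the winner
        for i in range(n):
            if x[i] != y[i]:
                p[i] += 1 / K if x[i] == 1 else -1 / K
                p[i] = min(max(p[i], 1 / n), 1 - 1 / n)   # borders
        if all(bit == 1 for bit in x):
            return t                                 # optimum sampled
    return max_iters

print(cga(n=100, K=200, f=onemax))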

Genetic Drift
If the fitness function is constant/flat (does not give a signal), frequencies move randomly to a border (DEMO).
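The DEMO can be reproduced with the following sketch (our own illustration): on a flat function, the winner of each comparison carries no information, so every frequency performs an unbiased ±1/K random walk until it gets stuck near a border.

import random

def drift_demo(n=20, K=50, iters=20000):
    # cGA frequencies on a flat function: no fitness signal, so whenever
    # the two offspring differ in bit i (probability 2 p_i (1 - p_i)),
    # the frequency moves up or down by 1/K with equal probability.
    p = [0.5] * n
    for _ in range(iters):
        for i in range(n):
            if random.random() < 2 * p[i] * (1 - p[i]):
                p[i] += 1 / K if random.random() < 0.5 else -1 / K
                p[i] = min(max(p[i], 1 / n), 1 - 1 / n)
    return p

print(drift_demo())   # most entries end up near 1/n or 1 - 1/n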

Early Results (< 2000)
Models of the cGA, UMDA and others allow estimations of the dynamical behavior (e.g., Thierens et al., 1998). Two effects:
1. the overall progress of the probabilistic model (roughly: Σ_i p_i) and the time to convergence to a good distribution (Σ_i p_i ≈ n),
2. the time for single frequencies to drift to the wrong border by genetic drift, which should be bigger than the convergence time.
Avoiding genetic drift requires a precise enough model → lower bound on the runtime. The models of EDAs estimated the progress of frequencies; these estimates were made rigorous only recently.

Lower bound on preciseness of model
For different EDAs on OneMax, the same threshold has been identified multiple times (Thierens et al., 1998; Lobo et al., 2000; Shapiro, 2005): at least K = Ω(√n) different frequency values are needed to prevent genetic drift. Intuitively, a single frequency needs about K² steps of genetic drift to reach a border, which must exceed the convergence time of order K√n; this gives K = Ω(√n).

Upper bound on time to convergence
Time for the EDA to converge to the optimal solution ≈ K√n.
⇒ Best possible time complexity Θ(n)? (With the minimal choice K = Θ(√n), the convergence time K√n becomes Θ(n).)

First runtime analysis of EDAs
Droste (2006) considered the cGA on OneMax (and other functions) and studied the runtime = number of iterations until the optimum is found.

Demo and Landscape
[Figure: expected runtime of the cGA depending on n and K]

Proof Idea in the "Large K" Regime
Show limits on genetic drift: with high probability, all frequencies stay above 1/3 in a phase of Θ(K√n) steps.
To analyze the speed: consider Φ_t = Σ_{i=1}^n p_{t,i} and analyze its drift. Important: how does a single frequency evolve?
Consider the two offspring x and y and look into bit i. A biased step occurs with probability at least Ω(1/√n). If the offspring differ in the bit (probability 2p_{t,i}(1 − p_{t,i})), the frequency is raised by 1/K; otherwise it is expected to stay put. Altogether, in expectation:

p_{t+1,i} ≈ p_{t,i} + Ω(1/√n) · 2p_{t,i}(1 − p_{t,i}) · (1/K).

Note the similarity to the frequency dynamics estimated in the early models.

What Happens for Small K
Now look into K < √n · log n. The lower bound was Ω(n) until 2015 and was improved to Ω(n log n) (Sudholt and W., 2016). Heavy genetic drift occurs.

DEMO.
Whether the optimum can be found at all depends very much on the borders on the frequencies.
If there are no borders, then with high probability some frequencies get locked at 0 ⇒ infinite runtime. If the borders {1/n, 1 − 1/n} are used, the optimum can still be found in polynomial time! No proof for the cGA, but for UMDA (Lehre and Nguyen, 2017; W., 2017).
⇒ Borders may make the algorithm efficient despite genetic drift.
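The difference can be checked empirically with a variant of the cGA sketch from above in which the borders are parameters (an illustrative experiment, not a proof):

import random

def cga_borders(n, K, lo, hi, max_iters=10**6):
    # cGA on OneMax with configurable frequency borders [lo, hi].
    p = [0.5] * n
    for t in range(max_iters):
        x = [1 if random.random() < p[i] else 0 for i in range(n)]
        y = [1 if random.random() < p[i] else 0 for i in range(n)]
        if sum(y) > sum(x):
            x, y = y, x
        for i in range(n):
            if x[i] != y[i]:
                p[i] += 1 / K if x[i] == 1 else -1 / K
                p[i] = min(max(p[i], lo), hi)
        if sum(x) == n:
            return t
    return None   # optimum not sampled within the budget

n = 50
print(cga_borders(n, K=10, lo=1/n, hi=1-1/n))   # heavy drift, but borders rescue the run
print(cga_borders(n, K=10, lo=0.0, hi=1.0))     # frequencies can lock at 0: often None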

Idea for the Lower Bound

Coupon collector
You have to collect n different coupons. In each round, you are given one coupon chosen uniformly at random with replacement. In expectation, it takes n·H_n = Θ(n log n) rounds to collect all of them.
Here the coupons are the frequencies at the lower border. Each of them has probability 1/n of being raised.
If many frequencies move to the lower border before optimum is found, we cannot be faster than n log n.
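For reference, the standard coupon-collector computation behind this bound: after i distinct coupons have been collected, a new one arrives with probability (n − i)/n, so that phase takes n/(n − i) rounds in expectation, and summing over all phases gives

\[
  \mathrm{E}[T] \;=\; \sum_{i=0}^{n-1} \frac{n}{n-i}
  \;=\; n \sum_{k=1}^{n} \frac{1}{k}
  \;=\; n H_n \;=\; (1+o(1))\, n \ln n .
\]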

Medium and Large K: Overview
Phase transition between smooth behavior and strong genetic drift at K ∼ √n · log n.
If there are no borders, the algorithm fails for K = o(√n · log n).
With borders, efficient behavior as long as K = ω(log n); the upper bound O(Kn) is only conjectured here.

Small K (Neumann et al., 2010), similar to cGA: landslides of frequencies occur.

DEMO.
Frequencies that have attained their maximum 1 − 1/n are nevertheless likely to drop down to the minimum. Very unstable behavior, exponential optimization time.
Runtime of cGA on OneMax: Complete Picture

Analysis of UMDA
How does UMDA perform on OneMax? Surprisingly, in terms of runtime (number of iterations · λ), not very differently from the cGA.
Obstacles in the analysis: frequencies can change drastically, even from minimum to maximum in one generation, and there are two parameters, µ and λ.
The first runtime analysis of UMDA on OneMax stems from 2015!
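For comparison with the cGA sketch, here is a minimal UMDA in Python (again an illustrative reconstruction; µ, λ and the borders are our own choices). Note how a frequency can jump anywhere within the borders in a single generation:

import random

def umda(n, mu, lam, max_iters=10**5):
    # UMDA on OneMax: sample lam offspring, select the mu best, and set each
    # frequency to the relative number of ones among the selected individuals,
    # capped at the borders 1/n and 1 - 1/n.
    p = [0.5] * n
    for t in range(max_iters):
        pop = [[1 if random.random() < p[i] else 0 for i in range(n)]
               for _ in range(lam)]
        pop.sort(key=sum, reverse=True)        # OneMax fitness = sum(x)
        if sum(pop[0]) == n:
            return t                           # optimum sampled
        best = pop[:mu]
        for i in range(n):
            freq = sum(x[i] for x in best) / mu
            p[i] = min(max(freq, 1 / n), 1 - 1 / n)
    return max_iters

print(umda(n=100, mu=25, lam=50))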

Results for UMDA: Upper bounds
First runtime result (Dang and Lehre, 2015): the expected runtime of UMDA on OneMax is O(nλ log λ) for λ > 13eµ and λ = Ω(log n). The bound becomes O(n log n log log n) for the best possible parameter setting λ = Θ(log n).
The bound was independently improved to O(n log n) by Lehre and Nguyen (2017) and W. (2017).

Results for UMDA: Lower Bounds
We also obtain bounds similar to those for the cGA (Theorem, Krejca and W., 2017). Proof idea, again at a very high level: estimate
1. the progress (drift) of the sum of frequencies per iteration,
2. the time for genetic drift to move frequencies to the wrong border,
both for upper and lower bounds. Especially 2. is challenging.
Overview of Runtime Bounds for UMDA on OneMax
Exponential time for population sizes below log n is strongly conjectured; the remaining bounds are proved. If there are no borders on the frequencies, the runtime is infinite for population sizes below √n · log n.

Theorem (Lengler et al., 2018)
The expected runtime of the cGA on OneMax is Ω(K^{1/3} · n + n log n) for K = O(√n / log² n).
⇒ Setting K = Θ(log n) gives us runtime Θ(n log n), and so does K = Θ(√n · log n), but values in between make the runtime worse.

EDAs on LeadingOnes
LeadingOnes(x) = Σ_{i=1}^n Π_{j=1}^i x_j has not been considered much in the theory of EDAs (except for UMDA).
Possible reason: behavior is more obvious than on OneMax and not very different from classical EAs.
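For concreteness, LeadingOnes counts the length of the prefix of one-bits (a direct transcription of the formula above):

def leadingones(x):
    # LeadingOnes(x) = sum over i of the product x_1 * ... * x_i,
    # i.e., the number of consecutive one-bits at the start of x.
    count = 0
    for bit in x:
        if bit == 0:
            break
        count += 1
    return count

assert leadingones([1, 1, 0, 1]) == 2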

Approach in the analysis, again:
1. Determine the total time for the frequency vector to converge to optimality.
2. Determine a lower bound on the preciseness of the model to prevent genetic drift before convergence.
Large λ used to prevent genetic drift also for the last optimized bit (possibly too large).

Recent results for LeadingOnes
Use the borders.
Theorem (Dang and Lehre, 2015): If λ ≥ c·ln n, then the expected runtime of UMDA with borders on LeadingOnes is O(nλ log λ + n²).
This holds also for PBIL (a generalization of UMDA) if the learning rate is not too small (Lehre and Nguyen, 2018).

When EAs and EDAs Differ on LeadingOnes
The overall analysis was similar to OneMax. No new insights through the study of LeadingOnes?
Consider the following example from Chen et al. (2009a).

Proof idea for (1+1) EA
Typically, (1+1) EA starts out by gaining more and more leading ones.


Example search point: 1100101010011111111111111111

When more than n/4 leading ones are attained, the first bit(s) no longer contribute to fitness. These bits slowly become random again. The optimum is missed; what remains resembles a needle in a haystack (NIAH).

Proof idea for UMDA
UMDA also starts out by gaining more and more leading ones. This is reflected in the frequencies.

[Figure: frequencies between 1/n and 1 − 1/n, with the sampled string evolving from a mostly random one such as 10101110110111011010 towards the all-ones string 11111111111111111111]

Even after the first bits no longer contribute to fitness, their frequencies are expected to remain the same. The all-ones string can be sampled.
Need a large enough λ to prevent genetic drift.


LeadingOnes: Summary
Typical behavior of EDAs: On OneMax they optimize all bits roughly at the same time.
On LeadingOnes they optimize bits from left to right.
However, there are other EDAs that do not behave like this (see later).
BinVal(x) = Σ_{i=1}^n 2^{n−i} · x_i lies somewhere between LeadingOnes and OneMax: a bit outweighs all less significant bits together, but every bit contributes to fitness.
Often, a runtime analysis for LeadingOnes also gives a runtime bound for BinVal (e.g., for UMDA/PBIL, Lehre and Nguyen, 2018).

Optimization under Uncertainty
Assume that an evaluation of the objective function is subject to random noise.
Example: OneMax with additive Gaussian noise.
Typical measures to handle noise in evolutionary computation: large populations, resampling, …
EDAs have a built-in noise-handling mechanism.
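The noise model can be made concrete with a small wrapper (illustrative; σ denotes the noise strength):

import random

def noisy_onemax(x, sigma):
    # OneMax with additive Gaussian noise: every evaluation returns the true
    # number of one-bits plus an independent N(0, sigma^2) sample.
    return sum(x) + random.gauss(0, sigma)

Plugged into the cGA sketch from above via f = lambda x: noisy_onemax(x, 1.0), the comparison of the two offspring becomes noisy, while the frequency update mechanism itself is unchanged.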

EDAs Beat Population-based EAs in Noisy Settings
Definition (Friedrich et al., 2017): an algorithm scales gracefully if its expected runtime depends polynomially on the noise strength, for example if the runtime is proportional to σ².

Proof Ideas
The (µ+1) EA uses mutation, which drifts away from the optimum when many bits are already correct and the noise disturbs the fitness signal.

Theorem
The cGA does not use mutation and is balanced: in expectation, the frequency vector and thus the fitness does not decrease over time.

Further Noisy Settings
Also simple ACO algorithms can be considered EDAs and are superior to EAs in some noisy settings (also from combinatorial optimization):
Sudholt and Thyssen (2012): ACO for noisy shortest paths with a ground truth.
Doerr et al. (2012): shortest paths where the noise is intrinsic.
Feldmann and Kötzing (2013): ACO with fitness-proportional updates, which leads to convergence to the expected best solution.
Many of the results can probably be transferred to more classical EDAs such as the cGA and UMDA. See the talk at this GECCO.

Balancedness, Stability and Genetic Drift
We have seen: without a fitness signal, the stochastic model of EDAs stays the same in expectation. Term: balanced (Friedrich et al., 2016a).
However, this concerns only the expected value; genetic drift plays a major role in classical EDAs. Term: an EDA is stable if a frequency, in the absence of a fitness signal, stays close to its initial value.
cGA, UMDA, … are not stable: frequencies quickly converge to either the maximum or the minimum (each with probability 1/2) due to genetic drift.
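The distinction can be observed numerically (illustrative experiment): across many independent runs on a flat function, the average final frequency stays near the initial value 1/2 (balanced), while in each single run the frequency ends up at a border (not stable).

import random

def final_frequency(K=20, steps=5000):
    # One cGA frequency on a flat function: an unbiased +-1/K random walk
    # that moves only when the two offspring differ in the bit.
    p = 0.5
    for _ in range(steps):
        if random.random() < 2 * p * (1 - p):
            p += 1 / K if random.random() < 0.5 else -1 / K
            p = min(max(p, 0.0), 1.0)   # no margins, for simplicity
    return p

runs = [final_frequency() for _ in range(1000)]
print(sum(runs) / len(runs))                    # close to 0.5: balanced
print(sum(1 for p in runs if p > 0.9) / 1000)   # close to 0.5: each run hits a border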