The backend library used for hyper-parameter search within BPt is nevergrad, a library developed by Facebook. It implements a large number of optimization methods, but ships with only limited documentation explaining them. This page tries to break down the different available options.
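Every named strategy described on this page corresponds to a key in nevergrad's optimizer registry, so any of them can also be tried directly. A minimal sketch (the two-dimensional quadratic objective is purely for illustration):

import nevergrad as ng

# Look up any of the named strategies on this page by its registry key,
# then minimize a toy 2-dimensional quadratic with a budget of 100 evaluations.
optimizer_cls = ng.optimizers.registry['ScrHammersleySearch']
optimizer = optimizer_cls(parametrization=2, budget=100)
recommendation = optimizer.minimize(lambda x: sum((x - 0.5) ** 2))
print(recommendation.value)  # the recommended point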

One Shot Optimization

The implemented one-shot optimization methods are ‘hopefully better than random search by ensuring more uniformity’. The algorithms vary on the following parameters,

sampler

Type of random sampling. Either,

  • ‘Halton’ : A low quality sampling method when the dimension is high

  • ‘Hammersley’ : Hammersley sampling

  • ‘LHS’ : Latin Hypercube Sampling

(default = ‘Halton’)

scrambled

Adds scrambling to the search

  • True : scrambling is added

  • False : scrambling is not added

(default = False)

middle_point

Optional enforcement of the first suggested point as zero. Either,

  • True : Enforce zero as the first suggested point

  • False : Not enforced

(default = False)

cauchy

Use the Cauchy inverse distribution instead of the Gaussian when fitting points to real space. Either,

  • True : Use the Cauchy distribution

  • False : Use a Gaussian distribution

(default = False)

rescaled

Rescale the sampling pattern to reach the boundaries. Either,

  • True : rescale

  • False : don’t rescale

(default = False)

autorescale

Perform auto-rescaling

  • True : Auto rescale

  • False : don’t auto rescale

(default = False)

recommendation_rule

Method for selecting best point. Either,

  • ‘average_of_best’ : take the average over all points better than the median

  • ‘pessimistic’ : select the pessimistic best

(default = ‘pessimistic’)

opposition_mode

Symmetrizes exploration with respect to the center (see e.g. https://ieeexplore.ieee.org/document/4424748). Either,

  • ‘opposite’ : full symmetry

  • ‘quasi’ : random * symmetric

  • None : no symmetrization

(default = None)
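The named one-shot optimizers below are preconfigured combinations of the parameters above. As a rough sketch, assuming nevergrad exposes the underlying SamplingSearch family via nevergrad.families, a custom combination could be built the same way:

import nevergrad as ng

# A hypothetical custom one-shot sampler: scrambled Hammersley sampling
# mapped through the Cauchy inverse distribution
# (roughly the 'CauchyScrHammersleySearch' preset below).
CustomSearch = ng.families.SamplingSearch(sampler='Hammersley', scrambled=True, cauchy=True)
optimizer = CustomSearch(parametrization=4, budget=80)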

‘HaltonSearch’

Defaults Only

‘HaltonSearchPlusMiddlePoint’

middle_point: True

‘ScrHaltonSearch’

scrambled: True

‘ScrHaltonSearchPlusMiddlePoint’

middle_point: True
scrambled: True

‘HammersleySearch’

sampler: 'Hammersley'

‘HammersleySearchPlusMiddlePoint’

sampler: 'Hammersley'
middle_point: True

‘ScrHammersleySearchPlusMiddlePoint’

scrambled: True
sampler: 'Hammersley'
middle_point: True

‘ScrHammersleySearch’

sampler: 'Hammersley'
scrambled: True

‘OScrHammersleySearch’

sampler: 'Hammersley'
scrambled: True
opposition_mode: 'opposite'

‘QOScrHammersleySearch’

sampler: 'Hammersley'
scrambled: True
opposition_mode: 'quasi'

‘CauchyScrHammersleySearch’

cauchy: True
sampler: 'Hammersley'
scrambled: True

‘LHSSearch’

sampler: 'LHS'

‘CauchyLHSSearch’

sampler: 'LHS'
cauchy: True

‘MetaRecentering’

cauchy: False
autorescale: True
sampler: 'Hammersley'

‘MetaTuneRecentering’

cauchy: False
autorescale: "autotune"
sampler: 'Hammersley'
scrambled: True

‘HAvgMetaRecentering’

cauchy: False
autorescale: True
sampler: 'Hammersley'
scrambled: True
recommendation_rule: 'average_of_hull_best'

‘AvgMetaRecenteringNoHull’

cauchy: False
autorescale: True
sampler: 'Hammersley'
scrambled: True
recommendation_rule: 'average_of_exp_best'

One Plus One

This is a family of evolutionary algorithms that use a technique called 1+1, or One Plus One, a ‘simple but sometimes powerful class of optimization algorithm’. From the nevergrad docs: ‘We use asynchronous updates, so that the 1+1 can actually be parallel and even performs quite well in such a context - this is naturally close to 1+lambda.’

The algorithms vary on the following parameters,

noise_handling

How re-evaluations are performed.

  • ‘random’ : a random point is reevaluated regularly

  • ‘optimistic’ : the best optimistic point is reevaluated regularly

  • a coefficient can be passed to tune the regularity of these reevaluations

(default = (None, .05))

mutation

The strategy for producing changes / mutations.

  • ‘gaussian’ : standard mutation by adding a Gaussian random variable (with progressive widening) to the best pessimistic point

  • ‘cauchy’ : same as Gaussian but with a Cauchy distribution.

  • ‘discrete’ : discrete distribution

  • ‘fastga’ : FastGA mutations from the current best

  • ‘doublefastga’ : double-FastGA mutations from the current best (Doerr et al, Fast Genetic Algorithms, 2017)

  • ‘portfolio’ : random number of mutated bits (called uniform mixing in Dang & Lehre, ‘Self-adaptation of Mutation Rates in Non-elitist Population’, 2016)

(default = ‘gaussian’)

crossover

Optional addition of a genetic crossover step. Either,

  • True : Add a genetic crossover step every other step.

  • False : No crossover.

(default = False)
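Each preset below is one fixed combination of these parameters. As a sketch, assuming nevergrad's ParametrizedOnePlusOne family is available via nevergrad.families, the same combinations can be assembled by hand:

import nevergrad as ng

# Roughly the 'RecombiningPortfolioOptimisticNoisyDiscreteOnePlusOne' preset below:
# optimistic noise handling, portfolio mutations, and a genetic crossover step.
CustomOnePlusOne = ng.families.ParametrizedOnePlusOne(
    noise_handling='optimistic', mutation='portfolio', crossover=True)
optimizer = CustomOnePlusOne(parametrization=10, budget=500)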

‘OnePlusOne’

Defaults Only

‘NoisyOnePlusOne’

noise_handling: 'random'

‘OptimisticNoisyOnePlusOne’

noise_handling: 'optimistic'

‘DiscreteOnePlusOne’

mutation: 'discrete'

‘DiscreteLenglerOnePlusOne’

mutation: 'lengler'

‘AdaptiveDiscreteOnePlusOne’

mutation: "adaptive"

‘AnisotropicAdaptiveDiscreteOnePlusOne’

mutation: "coordinatewise_adaptive"

‘DiscreteBSOOnePlusOne’

mutation: "discreteBSO"

‘DiscreteDoerrOnePlusOne’

mutation: "doerr"

‘CauchyOnePlusOne’

mutation: "cauchy"

‘OptimisticDiscreteOnePlusOne’

noise_handling: 'optimistic'
mutation: 'discrete'

‘NoisyDiscreteOnePlusOne’

noise_handling: ('random', 1.0)
mutation: 'discrete'

‘DoubleFastGADiscreteOnePlusOne’

mutation: 'doublefastga'

‘FastGADiscreteOnePlusOne’

mutation: 'fastga'

‘RecombiningPortfolioOptimisticNoisyDiscreteOnePlusOne’

crossover: True
mutation: 'portfolio'
noise_handling: 'optimistic'

CMA

This refers to the covariance matrix adaptation evolution strategy (CMA-ES). Background: https://en.wikipedia.org/wiki/CMA-ES

The algorithms vary on the following parameters,

diagonal

Whether to use the diagonal version of CMA (advised in large dimensions). Either,

  • True : Use diagonal

  • False : Don’t use diagonal

fcmaes

Whether to use the fast implementation. It does not support diagonal=True, but produces equivalent results, and is preferable in high dimensions or when objective function evaluations are fast.
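As a sketch, assuming nevergrad's ParametrizedCMA family is available via nevergrad.families, these flags map onto a configurable class:

import nevergrad as ng

# Roughly the 'DiagonalCMA' preset below: CMA-ES restricted to a diagonal
# covariance matrix, which scales better to high-dimensional problems.
DiagCMA = ng.families.ParametrizedCMA(diagonal=True)
optimizer = DiagCMA(parametrization=50, budget=2000)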

‘CMA’

diagonal: False
fcmaes: False

‘DiagonalCMA’

diagonal: True
fcmaes: False

‘FCMA’

diagonal: False
fcmaes: True

Further variants of CMA include CMA with test-based population-size adaptation, which sets the population size to lambda = 4 x dimension. These variants introduce the following parameters:

popsize_adaption

To use CMA with popsize adaptation

  • True : Use popsize adaptation

  • False : Don’t use popsize adaptation

covariance_memory

Whether to use covariance memory.

  • True : Use covariance memory

  • False : Don’t use covariance memory

‘EDA’

popsize_adaption: False
covariance_memory: False

‘PCEDA’

popsize_adaption: True
covariance_memory: False

‘MPCEDA’

popsize_adaption: True
covariance_memory: True

‘MEDA’

popsize_adaption: False
covariance_memory: True

Evolution Strategies

Experimental evolution-strategy-like algorithms, using mutation and crossover. The algorithms vary on the following parameters,

recombination_ratio

If 1, the whole population is recombined; if 0, no recombination is used, only mutations.

(default = 0)

popsize

The number of individuals in the population

(default = 40)

offsprings

The number of offspring from every generation

(default = None)

only_offsprings

If True, then only the offspring are kept, and none of the original population.

(default = False)

ranker

The ranking strategy used for selection. Either ‘simple’ or ‘nsga2’ (NSGA-II ranking).
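As a sketch, assuming nevergrad's EvolutionStrategy family is available via nevergrad.families, the presets below can be reproduced by hand:

import nevergrad as ng

# Roughly the 'RecES' preset below: full recombination, a population of 40,
# 60 offspring per generation, and only the offspring are kept.
CustomES = ng.families.EvolutionStrategy(
    recombination_ratio=1, popsize=40, offsprings=60, only_offsprings=True)
optimizer = CustomES(parametrization=8, budget=1000)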

‘ES’

recombination_ratio: 0
popsize: 40
offsprings: 60
only_offsprings: True
ranker: 'simple'

‘RecES’

recombination_ratio: 1
popsize: 40
offsprings: 60
only_offsprings: True
ranker: 'simple'

‘RecMixES’

recombination_ratio: 1
popsize: 40
offsprings: 20
only_offsprings: False
ranker: 'simple'

‘RecMutDE’

recombination_ratio: 1
popsize: 40
offsprings: None
only_offsprings: False
ranker: 'simple'

‘MixES’

recombination_ratio: 0
popsize: 40
offsprings: 20
only_offsprings: False
ranker: 'simple'

‘MutDE’

recombination_ratio: 0
popsize: 40
offsprings: None
only_offsprings: False
ranker: 'simple'

‘NSGAIIES’

recombination_ratio: 0
popsize: 40
offsprings: 60
only_offsprings: True
ranker: "nsga2"

Differential Evolution

Background: https://en.wikipedia.org/wiki/Differential_evolution

The different DE variants below vary on the following parameters.

initialization

The algorithm/distribution used for the initialization phase. Either,

  • ‘LHS’ : Latin Hypercube Sampling

  • ‘QR’ : Quasi-Random

  • ‘gaussian’ : Normal Distribution

(default = ‘gaussian’)

scale

The scale of the random component of the updates.

Either,

  • ‘mini’ : 1 / sqrt(dimension)

  • 1 : no change

(default = 1)

crossover

The crossover rate value / strategy used during DE. Either,

  • ‘dimension’ : crossover rate of 1 / dimension

  • ‘random’ : different random (uniform) crossover rate at each iteration

  • ‘onepoint’ : one point crossover

  • ‘twopoints’ : two points crossover

(default = .5)

popsize

The size of the population to use. Either,

  • ‘standard’ : max(num_workers, 30)

  • ‘dimension’ : max(num_workers, 30, dimension +1)

  • ‘large’ : max(num_workers, 30, 7 * dimension)

Note: dimension refers to the dimension of the hyper-parameter space being searched over.

(default = ‘standard’)

recommendation

Choice of the criterion for the best point to recommend. Either,

  • ‘optimistic’ : best

  • ‘noisy’ : add noise to choice of best

(default = ‘optimistic’)
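The named DE variants below set these parameters on nevergrad's DifferentialEvolution family; as a sketch, assuming the family is available via nevergrad.families, a custom combination looks like:

import nevergrad as ng

# Roughly the 'MiniLhsDE' preset below: Latin Hypercube initialization
# with update steps scaled by 1 / sqrt(dimension).
CustomDE = ng.families.DifferentialEvolution(initialization='LHS', scale='mini')
optimizer = CustomDE(parametrization=6, budget=400)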

‘DE’

Defaults Only

‘OnePointDE’

crossover: 'onepoint'

‘TwoPointsDE’

crossover: 'twopoints'

‘LhsDE’

initialization: 'LHS'

‘QrDE’

initialization: 'QR'

‘MiniDE’

scale: 'mini'

‘MiniLhsDE’

initialization: 'LHS'
scale: 'mini'

‘MiniQrDE’

initialization: 'QR'
scale: 'mini'

‘NoisyDE’

recommendation: 'noisy'

‘AlmostRotationInvariantDE’

crossover: .9

‘AlmostRotationInvariantDEAndBigPop’

crossover: .9
popsize: 'dimension'

‘RotationInvariantDE’

crossover: 1
popsize: 'dimension'

‘BPRotationInvariantDE’

crossover: 1
popsize: 'large'

Algorithm Selection

Algorithm selection works by first splitting the search budget between trying different search algorithms; once the ‘budget_before_choosing’ portion of the budget is used up, the rest of the search budget is spent on the strategy that performed best.

In the case that budget_before_choosing is 1, the algorithm is a passive portfolio of the different options, splitting the full budget between all of them.

The options parameter lists the algorithms each variant tries before choosing.
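As a usage sketch (the toy objective is purely for illustration), ‘Portfolio’ below runs as a passive portfolio, splitting the full budget between its three sub-optimizers:

import nevergrad as ng

# A passive portfolio: the budget is split between CMA, TwoPointsDE and
# ScrHammersleySearch, and the best point found by any of them is returned.
optimizer = ng.optimizers.registry['Portfolio'](parametrization=5, budget=300, num_workers=3)
recommendation = optimizer.minimize(lambda x: sum(x ** 2))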

‘ASCMA2PDEthird’

options: ['CMA', 'TwoPointsDE']
budget_before_choosing: 1/3

‘ASCMADEQRthird’

options: ['CMA', 'LhsDE', 'ScrHaltonSearch']
budget_before_choosing: 1/3

‘ASCMADEthird’

options: ['CMA', 'LhsDE']
budget_before_choosing: 1/3

‘TripleCMA’

options: ['CMA', 'CMA', 'CMA']
budget_before_choosing: 1/3

‘MultiCMA’

options: ['CMA', 'CMA', 'CMA']
budget_before_choosing: 1/10

‘MultiScaleCMA’

options: ['CMA', 'ParametrizedCMA(scale=1e-3)', 'ParametrizedCMA(scale=1e-6)']
budget_before_choosing: 1/3

‘Portfolio’

options: ['CMA', 'TwoPointsDE', 'ScrHammersleySearch']
budget_before_choosing: 1

‘ParaPortfolio’

options: ['CMA', 'TwoPointsDE', 'PSO', 'SQP', 'ScrHammersleySearch']
budget_before_choosing: 1

‘SQPCMA’

options: ['CMA'] + (n_jobs - n_jobs // 2) * ['SQP']
budget_before_choosing: 1

Competence Maps

Competence maps automatically select an algorithm based on the passed parameters, the number of workers, the budget, etc.
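For example, ‘NGOpt’ can be handed a problem description and will dispatch to whichever base optimizer its competence map suggests (a sketch, assuming a recent nevergrad version where NGOpt is exposed):

import nevergrad as ng

# NGOpt inspects the problem set-up (dimension, budget, number of workers)
# and internally selects a base optimizer via its competence map.
optimizer = ng.optimizers.NGOpt(parametrization=20, budget=1000, num_workers=4)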

‘NGO’

Nevergrad optimizer by competence map. Based on the One-Shot options.

‘NGOpt’

Nevergrad optimizer by competence map.

‘CM’

Competence map, simplest version.

‘CMandAS’

Competence map, with algorithm selection in one of the cases

‘CMandAS2’

Competence map, with algorithm selection in one of the cases (3 CMAs).

‘CMandAS3’

Competence map, with algorithm selection in one of the cases (3 CMAs).

‘Shiva’

The “Shiva” version of the Nevergrad optimizer by competence map.

Misc.

These optimizers did not seem to naturally fall into a category. Brief descriptions are listed below.

‘NaiveIsoEMNA’

Estimation of Multivariate Normal Algorithm (EMNA). This algorithm is quite efficient in a parallel context, i.e. when the population size is large.

‘TBPSA’

Test-based population-size adaptation, for noisy problems where the best points will be an average of the final population.

‘NaiveTBPSA’

Test-based population-size adaptation, where the recommended point is simply the best point found, with no averaging across the final population.

‘NoisyBandit’

Simple noisy-bandit-based optimization.

‘PBIL’

Population Based Incremental Learning: an ‘implementation of the discrete algorithm PBIL’. See: https://www.ri.cmu.edu/pub_files/pub1/baluja_shumeet_1994_2/baluja_shumeet_1994_2.pdf

‘PSO’

Standard Particle Swarm Optimisation, but no randomization of the population order.

‘SQP’

Based on scipy.optimize.minimize. See: https://docs.scipy.org/doc/scipy-1.1.0/reference/generated/scipy.optimize.minimize.html Note: does not support multiple jobs at once.

‘SPSA’

The first-order SPSA algorithm. See: https://en.wikipedia.org/wiki/Simultaneous_perturbation_stochastic_approximation Note: does not support multiple jobs at once.

‘SplitOptimizer’

Combines multiple optimizers, each working on its own subset of variables. By default uses CMA and RandomSearch.

‘cGA’

Implementation of the discrete Compact Genetic Algorithm (cGA). See: https://pdfs.semanticscholar.org/4b0b/5733894ffc0b2968ddaab15d61751b87847a.pdf

‘chainCMAPowell’

Chaining consists of running algorithm 1 for budget T1, then algorithm 2 for T2, then algorithm 3 for T3, etc., where each algorithm is fed with what happened before it. ‘chainCMAPowell’ chains first ‘CMA’, then the ‘Powell’ optimizer. Note: does not support multiple jobs at once.
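As a sketch, assuming nevergrad exposes the Chaining family via nevergrad.families, a custom chain along the same lines could be written as:

import nevergrad as ng

# Run CMA for the first half of the budget, then hand its state to Powell
# (a custom chain, analogous to the 'chainCMAPowell' preset above).
CustomChain = ng.families.Chaining(
    [ng.optimizers.registry['CMA'], ng.optimizers.registry['Powell']], ['half'])
optimizer = CustomChain(parametrization=5, budget=200)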

Experimental Variants

Nevergrad also comes with a number of experimental variants. To see all of the different options, run:

import nevergrad as ng
# importing this module registers the experimental optimizers in the registry
import nevergrad.optimization.experimentalvariants
print(sorted(ng.optimizers.registry.keys()))