Hyper-Parameter Optimization

M1 student @ Tokyo Tech

Bing BAI

June 19, 2021

Hyper-parameter Optimization

HP categories to be optimized

  1. Table of principal HPs grouped by type, with examples

HP type       Example
Continuous    learning rate
Discrete      pooling size, kernel size, stride, number of hidden units, batch size, number of epochs
Categorical   choice of optimizer, activation function
Binary        whether to apply early stopping
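
For illustration, a search space covering these four HP types could be written as a plain Python dictionary; the parameter names and ranges below are hypothetical examples, not tied to any particular framework.

```python
# A minimal, hypothetical search space covering the four HP types above.
search_space = {
    # continuous: sampled from a range (often on a log scale)
    "learning_rate":  {"type": "continuous", "low": 1e-5, "high": 1e-1, "log": True},
    # discrete: integer-valued choices
    "batch_size":     {"type": "discrete", "values": [16, 32, 64, 128]},
    "num_epochs":     {"type": "discrete", "values": [10, 20, 50]},
    # categorical: unordered options
    "optimizer":      {"type": "categorical", "values": ["sgd", "adam", "rmsprop"]},
    "activation":     {"type": "categorical", "values": ["relu", "tanh"]},
    # binary: on/off flag
    "early_stopping": {"type": "binary", "values": [True, False]},
}
```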

Existing Methods

Summary

Table 1: Comparison of common HPO algorithms (n is the number of hyper-parameter values and k is the number of hyper-parameters)

HPO method | Strengths | Limitations | Time complexity
Grid Search | Simple | Time-consuming; only efficient with categorical HPs | O(n^k)
Random Search | More efficient than Grid Search | Does not consider previous results; not efficient with conditional HPs | O(n)
Gradient-based models | Fast convergence speed for continuous HPs | Only support continuous HPs; may only detect local optima | O(n^k)
BO-GP | Fast convergence speed for continuous HPs | Poor capacity for parallelization; not efficient with conditional HPs | O(n^3)
SMAC | Efficient with all types of HPs | Poor capacity for parallelization | O(n log n)
BO-TPE | Efficient with all types of HPs; keeps conditional dependencies | Poor capacity for parallelization | O(n log n)
Hyperband | Enables parallelization | Not efficient with conditional HPs; requires subsets with small budgets to be representative | O(n log n)
BOHB | Efficient with all types of HPs | Requires subsets with small budgets to be representative | O(n log n)
GA (genetic algorithm) | Efficient with all types of HPs; does not require good initialization | Poor capacity for parallelization | O(n^2)
PSO (particle swarm optimization) | Efficient with all types of HPs; enables parallelization | Requires proper initialization | O(n log n)

Yang+, On hyperparameter optimization of machine learning algorithms: Theory and practice, Neurocomputing 2020

Grid Search performs an exhaustive search through a manually specified subset of the hyper-parameter space defined in the search space file. https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
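
As a rough sketch, independent of any specific HPO library, grid search over a small discrete search space can be written with itertools.product; the grid and the evaluate function below are hypothetical stand-ins for a real search space and for training/validating a model.

```python
import itertools

# Hypothetical discrete search space; every combination is evaluated (n^k trials).
grid = {
    "learning_rate": [1e-3, 1e-2, 1e-1],
    "batch_size": [32, 64],
    "optimizer": ["sgd", "adam"],
}

def evaluate(config):
    """Stand-in for training a model with `config` and returning a validation score."""
    return -abs(config["learning_rate"] - 1e-2)  # toy objective

best_config, best_score = None, float("-inf")
for values in itertools.product(*grid.values()):   # 3 * 2 * 2 = 12 trials here
    config = dict(zip(grid.keys(), values))
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```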

Random Search can be surprisingly effective despite its simplicity; we suggest using it as a baseline when no knowledge about the prior distribution of the hyper-parameters is available. (Bergstra+, Random Search for Hyper-Parameter Optimization, Journal of Machine Learning Research, Vol. 13, pp. 281-305, 2012)
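
A minimal random-search baseline under the same assumptions (hypothetical search space and stand-in objective) could look like this; each trial is drawn independently from the assumed distributions, with a fixed trial budget.

```python
import random

random.seed(0)

def sample_config():
    """Draw one configuration from the assumed distributions."""
    return {
        "learning_rate": 10 ** random.uniform(-5, -1),           # log-uniform
        "batch_size": random.choice([16, 32, 64, 128]),
        "optimizer": random.choice(["sgd", "adam", "rmsprop"]),
    }

def evaluate(config):
    """Stand-in for training a model and returning a validation score."""
    return -abs(config["learning_rate"] - 1e-2)

best_config, best_score = None, float("-inf")
for _ in range(50):                  # fixed budget of independent trials
    config = sample_config()
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)
```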

Gradient-based optimization

Gradient-based optimization first randomly selects a starting point in the hyper-parameter space, then moves in the direction opposite to the gradient (i.e., the direction of steepest descent) to locate the next point. A local optimum can therefore be reached after convergence; for convex objective functions, this local optimum is also the global optimum.
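
A minimal sketch of this idea for a single continuous hyper-parameter is shown below; the validation-loss surrogate is a hypothetical convex toy function, so the local optimum reached here is also the global one.

```python
import random

random.seed(0)

def validation_loss(lr):
    """Hypothetical convex validation loss as a function of a continuous HP."""
    return (lr - 0.01) ** 2

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = random.uniform(0.0, 0.1)   # randomly selected starting point
step = 0.1                     # step size
for _ in range(200):
    g = numerical_gradient(validation_loss, x)
    x -= step * g              # move opposite to the gradient
print(x)                       # close to the optimum 0.01 for this convex toy loss
```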

Bayesian optimization (BOHB)

A multidimensional KDE is used to guide the selection of configurations for the next iteration. [Figure: pseudocode of the KDE-guided sampling procedure.]

To fit useful KDEs, a minimum number of data points Nmin is required; this is set to d + 1, where d is the number of hyper-parameters. To build a model as early as possible, we do not wait until Nb = |Db|, the number of observations for budget b, is large enough to satisfy q · Nb ≥ Nmin. Instead, after initializing with Nmin + 2 random configurations, the best and worst configurations, respectively, are chosen to model the two densities.

Workflow [Figure: BOHB workflow]. In this example, max_budget = 9, min_budget = 1, eta = 3, and all other settings are left at their defaults. Then s_max = 2, so the brackets {s=2, s=1, s=0, s=2, s=1, s=0, …} are run in a continuous cycle. In each SuccessiveHalving stage (the orange box in the workflow figure), the top 1/eta configurations are picked and run again with a larger budget, and the SuccessiveHalving stages are repeated until the end of the iteration. At the same time, the configurations, budgets and final metrics of each trial are collected and used to build a multidimensional KDE model with the key "budget".
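
The schedule described above can be sketched as follows (max_budget = 9, min_budget = 1, eta = 3 as in the example). Plain random sampling stands in for BOHB's KDE-guided sampling, and evaluate is a hypothetical stand-in for running a trial with a given budget; this is a simplification of the Hyperband bracket logic, not the exact BOHB implementation.

```python
import math
import random

random.seed(0)

max_budget, min_budget, eta = 9, 1, 3
s_max = int(math.floor(math.log(max_budget / min_budget, eta) + 1e-9))  # = 2 here

def sample_config():
    # BOHB would sample from the multidimensional KDE once enough observations
    # exist; plain random sampling is used here as a placeholder.
    return {"learning_rate": 10 ** random.uniform(-5, -1)}

def evaluate(config, budget):
    """Stand-in for training `config` with the given budget (e.g. epochs)."""
    return -abs(config["learning_rate"] - 1e-2) + random.gauss(0, 1e-3)

for s in reversed(range(s_max + 1)):                      # brackets s = 2, 1, 0
    n = math.ceil((s_max + 1) * eta ** s / (s + 1))       # initial number of configs
    configs = [sample_config() for _ in range(n)]
    for i in range(s + 1):                                # SuccessiveHalving rounds
        budget = max_budget * eta ** (i - s)              # 1, 3, 9 for s = 2
        scores = [evaluate(c, budget) for c in configs]
        ranked = sorted(zip(scores, configs), key=lambda t: t[0], reverse=True)
        configs = [c for _, c in ranked][: max(1, len(configs) // eta)]  # keep top 1/eta
    print(f"bracket s={s}: best surviving config = {configs[0]}")
```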

Open-source code

Metaheuristic algorithms