This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Minghao Yan, University of Wisconsin-Madison;
(2) Hongyi Wang, Carnegie Mellon University;
(3) Shivaram Venkataraman, myan@cs.wisc.edu.
Our objective is to automatically find the optimal hardware configurations that minimize energy consumption while satisfying latency SLOs. Formally, we are solving the optimization problem:
1. The search space is large, and performing a grid search will take hours depending on the model size. On TX2 and Orin, there are 5005 and 1820 points in the grid if we allow 5 different batch sizes. An exhaustive search would take 14 and 5 hours, respectively.
2. To satisfy latency constraints, it is hard to decouple each dimension and optimize them separately as they jointly affect inference latency in a non-trivial way.
3. The optimization landscape may vary across models and devices.
We observe that CPU frequency can be decoupled from GPU frequency, memory frequency, and batch size, as it mainly affects the preprocessing latency and energy consumption. We pipeline the requests so different requests can use CPU and GPU resources at the same time to increase inference throughput.
Based on this observation, we propose a two-phase hardware tuning framework, where CPU tuning is done separately from tuning other hardware components. The challenge that remains is to efficiently optimize for an unknown function with noise. As shown in Figure 3 and 4, the performance of neural network inference with changes in memory and GPU frequency is difficult to predict, therefore, a good solution must be able to handle the variance while converging to a near-optimal configuration in a sample-efficient fashion. This requires the method to adaptively balance the tradeoff between exploration and exploitation. To solve this, we formulate the optimization problem as a Bayesian Optimization problem and leverage recent advances in the field to incorporate the SLO constraints unique to our setting.
Bayesian Optimization is a prevalent method for hyperparameter tuning (Kandasamy et al., 2020; Klein et al., 2015), as it can optimize various black-box functions. This method is especially advantageous when evaluating the objective function is expensive and requires a substantial amount of time and resources.
However, some applications may involve constraints that must be satisfied in addition to optimizing the objective function. Constrained Bayesian Optimization (CBO) (Gardner et al., 2014) is an extension of Bayesian Optimization that tackles this challenge by incorporating constraints into the optimization process.
In CBO, the objective function and constraints are treated as distinct functions. The optimization algorithm seeks to identify the set of input parameters that maximize the objective function while adhering to the constraints. These constraints are usually expressed as inequality constraints that must be satisfied during the optimization process. The expected constrained improvement acquisition function in CBO is defined as follows: EIC (ˆx) = P F(ˆx) × EI(ˆx).
Here EI(ˆx) represents the expected improvement (EI) (Brochu et al., 2010) within an unconstrained Bayesian Optimization scenario, while P F(ˆx) is a univariate Gaussian cumulative distribution function, delineating the anticipated probability of whether xˆ can fulfill the constraints. Intuitively, EI chooses the next configuration by optimizing the expected improvement relative to the best recently explored configuration. In PolyThrottle, we choose EI since our empirical findings and corroborations from additional studies (Alipourfard et al., 2017) show that EI performs better than
other widely-used acquisition functions (Snoek et al., 2012).
CBO (Gardner et al., 2014) also employs a joint prior distribution over the objective and constraint functions that captures their correlation structure. This joint prior is constructed by assuming that the objective and constraint functions are drawn from a multivariate Gaussian distribution with a parameterized mean vector and covariance matrix. These hyperparameters are learned from data using maximum likelihood estimation.
During the optimization process, the algorithm uses this joint prior to compute an acquisition function that balances exploration (sampling points with high uncertainty) and exploitation (sampling points where the objective function is expected to be low and subject to feasibility constraints). The algorithm then selects the next point to evaluate based on this acquisition function. During each iteration, the algorithm will test whether the selected configuration violates any of the given constraints and take the result into account for the next iteration. Encoding more system-specific hints as constraints can be of independent research interests, however, we show in Section 7 that the current formulation performs well under a variety of scenarios.