From Bayesian Optimization to Bandits and RL
Weitong Zhang, prepared for Chong's Lab @ UCLA, 08-29-2024
Starting from Bayesian Optimization
- Objective: $\min_x f(x)$
- We do not know $f(\cdot)$ at the beginning
- For each round, select an $x \in \mathbb R^d$ (Through the dragonfly package…)
- Exploration: collect the data (acquisition)
- Receive the noisy observation $y \approx f(x)$ (automated system)
- Try to update the function approximation $\tilde f(\cdot)$ (Through the dragonfly package)
- Exploitation: Use the data
- For the next round, select a new $x$ and repeat (through the dragonfly package…)
- Bayesian Optimization:
- The form of $f(\cdot)$ is fixed (Gaussian kernel)
- The exploration strategy (selecting $x$) is fixed (kernel Thompson sampling)
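A minimal sketch of this loop, using scikit-learn's Gaussian process regressor as a stand-in for the dragonfly package; the toy objective, candidate grid, LCB acquisition rule, and noise level are all assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                                       # unknown objective; only noisy queries allowed
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
grid = np.linspace(-2, 2, 201).reshape(-1, 1)   # candidate x's (1-d for illustration)
X, y = [], []

# Warm start with two random points so the GP can be fit.
for x0 in rng.uniform(-2, 2, size=2):
    X.append([x0]); y.append(f(x0) + 0.1 * rng.normal())

for k in range(20):
    # Exploitation: update the function approximation f~ from the collected data.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=0.1 ** 2)
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(grid, return_std=True)
    # Exploration: acquisition via a lower confidence bound (we are minimizing).
    x_next = grid[np.argmin(mu - 2.0 * sigma)]
    X.append(x_next); y.append(f(x_next[0]) + 0.1 * rng.normal())

print("best x found:", np.array(X)[np.argmin(y)])
```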
A more general formulation: Bandits
- Objective: $\max_x r(x)$ (find the optimal arm)
- Exploration: collect the data $(x, r)$
- Exploitation: Estimate the reward $r(\cdot)$
Notations / Metrics:
- Optimal arm: $x^* = \arg\max_x r(x)$
- Policy $\pi$: the rule for selecting $x$
- Regret: $\text{Reg}(K) = \sum_{k=1}^K \left( r(x^*) - r(x_k) \right)$
- Sample Complexity: $r(x^*) - r(x_k) \le \epsilon \ \ \forall k \ge K(\epsilon)$
Differences between regret and sample complexity?
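As a toy numeric illustration of the two metrics (the arm rewards and the pull sequence below are made up for this example): regret adds up all the gaps, while sample complexity only asks after which round every pull is $\epsilon$-optimal.

```python
import numpy as np

# Hypothetical true rewards of three arms and the pulls made by some policy.
r = {"x1": 1.1, "x2": 1.2, "x3": 0.5}
pulls = ["x3", "x1", "x1", "x2", "x2", "x2"]       # arms chosen at rounds k = 1..6
r_star = max(r.values())                           # r(x*) = 1.2

# Regret: Reg(K) = sum_k (r(x*) - r(x_k)) = 0.7 + 0.1 + 0.1 + 0 + 0 + 0 = 0.9
regret = sum(r_star - r[x] for x in pulls)
print("Reg(6) =", round(regret, 2))

# Sample complexity: the first round K(eps) after which every pull is eps-optimal.
eps = 0.05
gaps = [r_star - r[x] for x in pulls]
last_bad = max((k for k, g in enumerate(gaps, start=1) if g > eps), default=0)
print("K(eps) =", last_bad + 1)                    # = 4: rounds 4, 5, 6 are eps-optimal
```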
Goals for Bandits
- Regret minimization: $\min \text{Reg}(K) = \min \sum_{k=1}^K \left( r(x^*) - r(x_k) \right)$
- $x_k$ selection: Exploration + Exploitation
- Pure exploration / Explore-then-commit
- Explore the environment for $K(\epsilon)$ steps, then commit to an $\epsilon$-optimal policy
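A minimal explore-then-commit sketch on a three-armed bandit; the Gaussian noise model, arm rewards, and exploration budget are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_r = np.array([1.1, 1.2, 0.8])          # unknown arm rewards

def pull(arm):                              # noisy feedback r = r(x) + eps
    return true_r[arm] + 0.3 * rng.normal()

# Pure exploration phase: pull each arm the same number of times.
n_explore = 50
estimates = np.array([np.mean([pull(a) for _ in range(n_explore)])
                      for a in range(len(true_r))])

# Commit phase: play the empirically best arm from now on.
best = int(np.argmax(estimates))
print("estimates:", np.round(estimates, 3), "-> commit to arm", best)
```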
Setting of functions
- Multi-armed Bandits: $x \in \{x_1, x_2, \cdots, x_n\}$, $r(x) \in \{r_1, r_2, \cdots, r_n\}$
- Contextual Bandits: $x \in \mathbb R^d$ or $x \in \mathcal D \subset \mathbb R^d$ (feature of observation)
- Linear Bandits: $r(x) = \langle \theta, x\rangle$
- Kernel Bandits: $r(x) = \langle \theta, \phi(x)\rangle$ (similar to Bayesian Optimization)
- Neural Networks / General function approximation $r_\theta(x)$.
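A small sketch contrasting these function classes on the same context $x \in \mathbb R^d$; the parameter values and the explicit feature map $\phi$ below are toy assumptions.

```python
import numpy as np

d = 3
x = np.array([0.2, -0.5, 1.0])              # context / arm feature in R^d

# Linear bandit: r(x) = <theta, x>.
theta = np.array([1.0, 0.3, -0.2])
r_linear = theta @ x

# Kernel bandit: r(x) = <theta', phi(x)> with an explicit (toy) feature map phi.
def phi(x):
    return np.concatenate([x, x ** 2])      # quadratic features as a stand-in for a kernel

theta_k = 0.1 * np.ones(2 * d)
r_kernel = theta_k @ phi(x)

print("linear:", round(r_linear, 3), "| kernel:", round(r_kernel, 3))
```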
Challenges and Philosophies
Tradeoff between exploration and exploitation
- Collected Data: $\mathcal D = \{(x_1, 1), (x_1, 1.3), (x_1, 1), (x_2, 1)\}$
- $\tilde r(x_1) = 1.1, \tilde r(x_2) = 1$ (True reward: $r(x_1) = 1.1, r(x_2) = 1.2$)
- Regret minimization: $\min \sum_k \left( r(x^*) - r(x_k) \right)$ (Should we always select $x_1$?)
- Uncertainty quantification: $\sigma_1 = 1/\sqrt{3}, \sigma_2 = 1$
- Leveraging the uncertainty:
- Posterior Sampling: sample $r(x) \sim \mathcal N(\tilde r(x), \sigma^2(x))$
- Bayesian Optimization, Thompson Sampling
- Encourage exploration: act on the optimistic estimate $\tilde r(x) + \beta \sigma(x)$
- Upper Confidence Bound (UCB), Optimistic
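Both rules applied to the toy dataset above, assuming unit-variance Gaussian noise so that $\sigma_i = 1/\sqrt{n_i}$; the choice $\beta = 1$ and the Gaussian posterior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Collected data from the example: three pulls of x1 and one pull of x2.
data = {"x1": [1.0, 1.3, 1.0], "x2": [1.0]}
mean  = {a: np.mean(v) for a, v in data.items()}            # r~(x1)=1.1, r~(x2)=1.0
sigma = {a: 1 / np.sqrt(len(v)) for a, v in data.items()}   # 1/sqrt(3), 1

# UCB / optimism: pick the arm with the largest mean + beta * sigma.
beta = 1.0
ucb_choice = max(data, key=lambda a: mean[a] + beta * sigma[a])

# Thompson sampling: sample a plausible reward from each posterior, pick the best sample.
ts_choice = max(data, key=lambda a: rng.normal(mean[a], sigma[a]))

print("UCB picks", ucb_choice, "| TS picks", ts_choice)     # UCB picks x2 here
```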
Extended settings (I) - Safety and Robustness
- Safe bandits: $\max_x r(x) \ \text{s.t.}\ c(x) \le C$
- Preliminary solution: the Lagrangian relaxation $\max_x r(x) - \lambda c(x)$
- Robust to model error: when the true $r(x)$ differs from the function approximation $\tilde r(x)$
- E.g. $\tilde r(x) = \langle w, x\rangle$, $r(x) = \langle w, x\rangle + \eta(x)$
- Preliminary solution: learn with enlarged (inflated) uncertainty estimates
- Robust to adversarial attacks: after selecting $x$, receive $r = r(x) + \eta(x) + \epsilon$
- Difference from model error: $\eta(x)$ is chosen by the adversary after the arm is selected
- Preliminary solution: weighted by uncertainties / Mirror descent
- Non-stationary bandits: $r_k(x)$ is changing over time $k$
- E.g. $r_k(x) = \langle w + \theta k, x\rangle$: a slowly drifting linear function
- Preliminary solution: sliding window, only learn from the $[k-T, k]$ time window
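A sketch of the sliding-window idea for the slowly drifting linear reward; the drift rate, window length $T$, and ridge regularizer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, T = 2, 200, 30                          # dimension, horizon, window length
w, drift = np.array([1.0, -0.5]), np.array([0.01, 0.02])

X, y = [], []
for k in range(K):
    x = rng.normal(size=d)                    # context observed at round k
    theta_k = w + drift * k                   # slowly drifting true parameter
    X.append(x); y.append(theta_k @ x + 0.1 * rng.normal())

    # Estimate only from the last T rounds (ridge regression for stability).
    Xw, yw = np.array(X[-T:]), np.array(y[-T:])
    theta_hat = np.linalg.solve(Xw.T @ Xw + 0.1 * np.eye(d), Xw.T @ yw)

print("true theta at k=K-1:", w + drift * (K - 1))
print("sliding-window estimate:", np.round(theta_hat, 2))
```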
Extended settings (II) - Different feedbacks
- Regular feedback: Receive $r = r(x) + \epsilon$, $\epsilon$: zero mean (sub)-Gaussian
- Extension: $\epsilon$ follows some more general distribution (heavy tailed, etc.)
- Binary feedback: Receive $r \sim \mathrm{Bernoulli}(r(x))$
- E.g. recommendation systems: whether a user clicks the recommendation
- Preliminary solution: change the loss function to cross-entropy loss
- Dueling bandits: select two arms (A) and (B), and only observe which one wins (is preferred)
- E.g. Reinforcement Learning with Human Feedback
- Preliminary solution: assume a preference model, e.g. Bradley-Terry: $P(A \succ B) = \frac{\exp(r(A))}{\exp(r(A)) + \exp(r(B))}$
- Extension: Ranking over several actions (A > B > C > D) (rank aggregation)
- Partial Monitoring: only receive feedback when the agent requests it
- E.g. request a spam / not-spam label from a human; otherwise no feedback
- Preliminary solution: query when uncertainty is large
- No feedback (reward free, pure exploration): Collect data for future tasks / rewards
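A sketch of two of these feedback models: Bernoulli (click) feedback fit with a cross-entropy loss, and a Bradley-Terry-style win probability for duels; the logistic link and the toy parameters are my assumptions, not a fixed prescription.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary feedback: observe click ~ Bernoulli(p(x)); fit p with a cross-entropy loss.
def cross_entropy(theta, X, clicks):
    p = sigmoid(X @ theta)
    return -np.mean(clicks * np.log(p) + (1 - clicks) * np.log(1 - p))

# Dueling feedback: P(A beats B) under a Bradley-Terry-style model,
# sigmoid(r(A) - r(B)) = exp(r(A)) / (exp(r(A)) + exp(r(B))).
def prob_A_wins(r_A, r_B):
    return sigmoid(r_A - r_B)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
theta_true = np.array([0.8, -0.4])
clicks = rng.binomial(1, sigmoid(X @ theta_true))
print("cross-entropy at the true parameter:", round(cross_entropy(theta_true, X, clicks), 3))
print("P(A beats B) for r(A)=1.2, r(B)=1.0:", round(prob_A_wins(1.2, 1.0), 3))
```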
Extended settings (III) - Multi-step decision
- Delayed feedback: the reward for the action at time $t$ is only received at time $t + T$
- E.g., experiments take a long time to finish
- Batched update: perform $B$ experiments at the same time and receive $r_t, \cdots, r_{t + B - 1}$ together
- Low switching: explore the environment without switching the policy too often
- E.g. only change the policy when the current exploration policy is no longer good enough
- Reinforcement Learning:
- Sequential decision-making steps: $a_1, \cdots, a_H$
- Consider the cumulative reward function $V = \sum_{h=1}^H r_h(s_h, a_h)$
- Markov Decision Processes: $s_{h+1} \sim P(\cdot \mid s_h, a_h)$
- Bandits: $s_{h+1} \sim \mu(\cdot)$ or fixed (the state is ignored)
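A tiny rollout sketch of the difference: in an MDP the next state depends on $(s_h, a_h)$, whereas a bandit ignores the state; the two-state chain, rewards, and policy below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
H = 5
P = {  # MDP transition: the next-state distribution depends on (state, action)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
r = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 1.0}    # r_h(s, a)

def rollout(policy):
    s, V = 0, 0.0
    for h in range(H):
        a = policy(s)
        V += r[(s, a)]                             # accumulate V = sum_h r_h(s_h, a_h)
        s = int(rng.choice(2, p=P[(s, a)]))        # Markov transition s_{h+1} ~ P(. | s_h, a_h)
    return V

always_1 = lambda s: 1                             # a fixed policy, for illustration
print("cumulative reward of 'always action 1':", rollout(always_1))
```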
Extended settings (IV) - Structured data
- Combinatorial Bandits:
- Action $[a, b, c]$ where $a \in \{A_1, A_2, \cdots\}, b \in \{B_1, B_2, \cdots\}, c \in \{C_1, C_2, \cdots\}$
- E.g., Different combination of substances
- Semi-bandit feedback: a reward for each substance
- Bandit feedback: a single reward for the whole combination
- Graph structured data: Contexts are represented as a graph $\mathcal G$
- E.g. social graph for recommendations
- Collaborative filtering: recommend different items to different users
- Heuristic: an item is good if it is generally good or if the specific user likes it
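A sketch of the combinatorial action space and the two feedback types; the component names and per-component rewards are made up for illustration.

```python
import itertools

# Component choices, e.g. different substances in a mixture.
A = ["A1", "A2"]; B = ["B1", "B2", "B3"]; C = ["C1", "C2"]
per_component = dict(zip(A + B + C, [0.1, 0.3, 0.2, 0.5, 0.0, 0.4, 0.1]))

actions = list(itertools.product(A, B, C))       # all combinations [a, b, c]
combo = actions[0]

# Semi-bandit feedback: a reward is observed for each chosen component.
semi_feedback = [per_component[c] for c in combo]
# Bandit feedback: only one reward is observed for the whole combination.
bandit_feedback = sum(semi_feedback)

print(combo, "semi:", semi_feedback, "bandit:", round(bandit_feedback, 2))
```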
Rules of thumb for applying bandit algorithms (not only BO)
- Problem setting: what’s the action space (features) and the outcome (labels)?
- Is obtaining the outcome automated or costly? How noisy is it? What kind of outcome is it?
- What’s the prior knowledge on the actions?
- E.g. which actions are likely better? Is there a pre-collected offline dataset?
- Knowledge about the function approximation: linear? Bayesian Optimization?
- The simpler, the better! Why do we need to go beyond BO?
- How much exploration do you need?
- Not all automated tasks need aggressive exploration!
- Goal of your algorithm: pure exploration, or converging to something good?