From Bayesian Optimization to Bandits and RL
Weitong Zhang, prepared for Chong's Lab @ UCLA, 08-29-2024
Starting from Bayesian Optimization
- Objective: $\min_x f(x)$
- We do not know $f(\cdot)$ at the beginning
- For each round, select an $x \in \mathbb R^d$ (Through the dragonfly package…)
- Exploration: collect the data (acquisition)
- Receive the noisy observation $y \approx f(x)$ (automated system)
- Try to update the function approximation $\tilde f(\cdot)$ (Through the dragonfly package)
- Exploitation: Use the data
- For the next round, select a new $x$ and repeat (through the dragonfly package…)
- Bayesian Optimization:
- The form of $f(\cdot)$ is fixed (Gaussian kernel)
- The exploration strategy (selecting $x$) is fixed (kernel Thompson sampling)
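A minimal sketch of this loop, using scikit-learn's Gaussian process regressor as a stand-in for the dragonfly package; the toy objective, candidate grid, LCB acquisition rule, and noise level are all assumptions made for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                                       # unknown objective; only noisy queries allowed
    return np.sin(3 * x) + 0.1 * x ** 2

rng = np.random.default_rng(0)
grid = np.linspace(-2, 2, 201).reshape(-1, 1)   # candidate x's (1-d for illustration)
X, y = [], []

# Warm start with two random points so the GP can be fit.
for x0 in rng.uniform(-2, 2, size=2):
    X.append([x0]); y.append(f(x0) + 0.1 * rng.normal())

for k in range(20):
    # Exploitation: update the function approximation f~ from the collected data.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=0.1 ** 2)
    gp.fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(grid, return_std=True)
    # Exploration: acquisition via a lower confidence bound (we are minimizing).
    x_next = grid[np.argmin(mu - 2.0 * sigma)]
    X.append(x_next); y.append(f(x_next[0]) + 0.1 * rng.normal())

print("best x found:", np.array(X)[np.argmin(y)])
```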
A more general formulation: Bandits
- Objective: $\max_x r(x)$ (find the optimal arm)
- Exploration: collect the data $(x, r)$
- Exploitation: Estimate the reward $r(\cdot)$
Notations / Metrics:
- Optimal arm: $x^* = \arg\max_x r(x)$
- Policy $\pi$: the rule for selecting $x$
- Regret: $\text{Reg}(K) = \sum_{k=1}^K \left( r(x^*) - r(x_k) \right)$
- Sample Complexity: $r(x^*) - r(x_k) \le \epsilon \ \ \forall k \ge K(\epsilon)$
Differences between regret and sample complexity?
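As a toy numeric illustration of the two metrics (the arm rewards and the pull sequence below are made up for this example): regret adds up all the gaps, while sample complexity only asks after which round every pull is $\epsilon$-optimal.

```python
import numpy as np

# Hypothetical true rewards of three arms and the pulls made by some policy.
r = {"x1": 1.1, "x2": 1.2, "x3": 0.5}
pulls = ["x3", "x1", "x1", "x2", "x2", "x2"]       # arms chosen at rounds k = 1..6
r_star = max(r.values())                           # r(x*) = 1.2

# Regret: Reg(K) = sum_k (r(x*) - r(x_k)) = 0.7 + 0.1 + 0.1 + 0 + 0 + 0 = 0.9
regret = sum(r_star - r[x] for x in pulls)
print("Reg(6) =", round(regret, 2))

# Sample complexity: the first round K(eps) after which every pull is eps-optimal.
eps = 0.05
gaps = [r_star - r[x] for x in pulls]
last_bad = max((k for k, g in enumerate(gaps, start=1) if g > eps), default=0)
print("K(eps) =", last_bad + 1)                    # = 4: rounds 4, 5, 6 are eps-optimal
```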
Goals for Bandits
- Regret minimization: $\min \text{Reg}(K) = \min \sum_{k=1}^K \left( r(x^*) - r(x_k) \right)$
- $x_k$ selection: Exploration + Exploitation
- Pure exploration / Explore-then-commit
- Explore the environment for $K(\epsilon)$ steps, then commit to an $\epsilon$-optimal policy
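A minimal explore-then-commit sketch on a three-armed bandit; the Gaussian noise model, arm rewards, and exploration budget are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
true_r = np.array([1.1, 1.2, 0.8])          # unknown arm rewards

def pull(arm):                              # noisy feedback r = r(x) + eps
    return true_r[arm] + 0.3 * rng.normal()

# Pure exploration phase: pull each arm the same number of times.
n_explore = 50
estimates = np.array([np.mean([pull(a) for _ in range(n_explore)])
                      for a in range(len(true_r))])

# Commit phase: play the empirically best arm from now on.
best = int(np.argmax(estimates))
print("estimates:", np.round(estimates, 3), "-> commit to arm", best)
```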
Setting of functions
- Multi-armed Bandits: $x \in \{x_1, x_2, \cdots, x_n\}$, $r(x) \in \{r_1, r_2, \cdots, r_n\}$
- Contextual Bandits: $x \in \mathbb R^d$ or $x \in \mathcal D \subset \mathbb R^d$ (feature of observation)
- Linear Bandits: $r(x) = \langle \theta, x\rangle$
- Kernel Bandits: $r(x) = \langle \theta, \phi(x)\rangle$ (similar to Bayesian Optimization)
- Neural Networks / General function approximation $r_\theta(x)$.
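A small sketch contrasting these function classes on the same context $x \in \mathbb R^d$; the parameter values and the explicit feature map $\phi$ below are toy assumptions.

```python
import numpy as np

d = 3
x = np.array([0.2, -0.5, 1.0])              # context / arm feature in R^d

# Linear bandit: r(x) = <theta, x>.
theta = np.array([1.0, 0.3, -0.2])
r_linear = theta @ x

# Kernel bandit: r(x) = <theta', phi(x)> with an explicit (toy) feature map phi.
def phi(x):
    return np.concatenate([x, x ** 2])      # quadratic features as a stand-in for a kernel

theta_k = 0.1 * np.ones(2 * d)
r_kernel = theta_k @ phi(x)

print("linear:", round(r_linear, 3), "| kernel:", round(r_kernel, 3))
```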
Challenges and Philosophies
Tradeoff between exploration and exploitation
- Collected Data: $\mathcal D = \{(x_1, 1), (x_1, 1.3), (x_1, 1), (x_2, 1)\}$
- $\tilde r(x_1) = 1.1, \tilde r(x_2) = 1$ (True reward: $r(x_1) = 1.1, r(x_2) = 1.2$)
- Regret minimization: $\min \sum_k \left( r(x^*) - r(x_k) \right)$ (Should we always select $x_1$?)
- Uncertainty quantification: $\sigma_1 = 1/\sqrt{3}, \sigma_2 = 1$
- Leveraging the uncertainty:
- Posterior Sampling: sample $r(x) \sim \mathcal N(\tilde r(x), \sigma^2(x))$
- Bayesian Optimization, Thompson Sampling
- Encourage exploration: act on the optimistic estimate $\tilde r(x) + \beta \sigma(x)$
- Upper Confidence Bound (UCB), Optimistic
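Both rules applied to the toy dataset above, assuming unit-variance Gaussian noise so that $\sigma_i = 1/\sqrt{n_i}$; the choice $\beta = 1$ and the Gaussian posterior are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Collected data from the example: three pulls of x1 and one pull of x2.
data = {"x1": [1.0, 1.3, 1.0], "x2": [1.0]}
mean  = {a: np.mean(v) for a, v in data.items()}            # r~(x1)=1.1, r~(x2)=1.0
sigma = {a: 1 / np.sqrt(len(v)) for a, v in data.items()}   # 1/sqrt(3), 1

# UCB / optimism: pick the arm with the largest mean + beta * sigma.
beta = 1.0
ucb_choice = max(data, key=lambda a: mean[a] + beta * sigma[a])

# Thompson sampling: sample a plausible reward from each posterior, pick the best sample.
ts_choice = max(data, key=lambda a: rng.normal(mean[a], sigma[a]))

print("UCB picks", ucb_choice, "| TS picks", ts_choice)     # UCB picks x2 here
```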
Extended settings (I) - Safety and Robustness
- Safe bandits: $\max_x r(x) \ \text{s.t.}\ c(x) \le C$
- Preliminary solution: the Lagrangian relaxation $\max_x r(x) - \lambda c(x)$
- Robust to model error: when the true $r(x)$ differs from the function approximation $\tilde r(x)$
- E.g. $\tilde r(x) = \langle w, x\rangle$, $r(x) = \langle w, x\rangle + \eta(x)$
- Preliminary solution: learn with enlarged (inflated) uncertainty estimates
- Robust to adversarial attacks: after selecting $x$, receive $r = r(x) + \eta(x) + \epsilon$
- Difference from model error: $\eta(x)$ is chosen by the adversary after the arm is selected
- Preliminary solution: weighted by uncertainties / Mirror descent
- Non-stationary bandits: $r_k(x)$ is changing over time $k$
- E.g. $r_k(x) = \langle w + \theta k, x\rangle$: a slowly drifting linear function
- Preliminary solution: sliding window, only learn from the $[k-T, k]$ time window
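A sketch of the sliding-window idea for the slowly drifting linear reward; the drift rate, window length $T$, and ridge regularizer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, T = 2, 200, 30                          # dimension, horizon, window length
w, drift = np.array([1.0, -0.5]), np.array([0.01, 0.02])

X, y = [], []
for k in range(K):
    x = rng.normal(size=d)                    # context observed at round k
    theta_k = w + drift * k                   # slowly drifting true parameter
    X.append(x); y.append(theta_k @ x + 0.1 * rng.normal())

    # Estimate only from the last T rounds (ridge regression for stability).
    Xw, yw = np.array(X[-T:]), np.array(y[-T:])
    theta_hat = np.linalg.solve(Xw.T @ Xw + 0.1 * np.eye(d), Xw.T @ yw)

print("true theta at k=K-1:", w + drift * (K - 1))
print("sliding-window estimate:", np.round(theta_hat, 2))
```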
Extended settings (II) - Different feedbacks
- Regular feedback: Receive $r = r(x) + \epsilon$, $\epsilon$: zero mean (sub)-Gaussian
- Extension: $\epsilon$ follows some more general distribution (heavy tailed, etc.)
- Binary feedback: Receive $r \sim \mathrm{Bernoulli}(r(x))$
- E.g. recommendation systems: whether a user clicks the recommendation
- Preliminary solution: change the loss function to cross-entropy loss
- Dueling bandits: select two arms (A) and (B), and only observe which one wins (is preferred)
- E.g. Reinforcement Learning with Human Feedback
- Preliminary solution: assume a preference model, e.g. Bradley-Terry: $P(A \succ B) = \frac{\exp(r(A))}{\exp(r(A)) + \exp(r(B))}$
- Extension: Ranking over several actions (A > B > C > D) (rank aggregation)
- Partial Monitoring: only receive feedback when the agent requests it
- E.g. request a spam / not-spam label from a human; otherwise no feedback
- Preliminary solution: query when uncertainty is large
- No feedback (reward free, pure exploration): Collect data for future tasks / rewards
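A sketch of two of these feedback models: Bernoulli (click) feedback fit with a cross-entropy loss, and a Bradley-Terry-style win probability for duels; the logistic link and the toy parameters are my assumptions, not a fixed prescription.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Binary feedback: observe click ~ Bernoulli(p(x)); fit p with a cross-entropy loss.
def cross_entropy(theta, X, clicks):
    p = sigmoid(X @ theta)
    return -np.mean(clicks * np.log(p) + (1 - clicks) * np.log(1 - p))

# Dueling feedback: P(A beats B) under a Bradley-Terry-style model,
# sigmoid(r(A) - r(B)) = exp(r(A)) / (exp(r(A)) + exp(r(B))).
def prob_A_wins(r_A, r_B):
    return sigmoid(r_A - r_B)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
theta_true = np.array([0.8, -0.4])
clicks = rng.binomial(1, sigmoid(X @ theta_true))
print("cross-entropy at the true parameter:", round(cross_entropy(theta_true, X, clicks), 3))
print("P(A beats B) for r(A)=1.2, r(B)=1.0:", round(prob_A_wins(1.2, 1.0), 3))
```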
Extended settings (III) - Multi-step decision
- Delayed feedback: the reward for the action at time $t$ is only received at time $t + T$
- E.g., experiments take a long time to finish
- Batched update: perform $B$ experiments at the same time and receive $r_t, \cdots, r_{t + B - 1}$ together
- Low switching: explore the environment without switching the policy too often
- E.g. only change the policy when the current exploration policy is no longer good enough
- Reinforcement Learning:
- Sequential decision-making steps: $a_1, \cdots, a_H$
- Consider the cumulative reward function $V = \sum_{h=1}^H r_h(s_h, a_h)$
- Markov Decision Processes: $s_{h+1} \sim P(\cdot \mid s_h, a_h)$
- Bandits: $s_{h+1} \sim \mu(\cdot)$ or fixed (the state is ignored)
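A tiny rollout sketch of the difference: in an MDP the next state depends on $(s_h, a_h)$, whereas a bandit ignores the state; the two-state chain, rewards, and policy below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
H = 5
P = {  # MDP transition: the next-state distribution depends on (state, action)
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
r = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 1.0}    # r_h(s, a)

def rollout(policy):
    s, V = 0, 0.0
    for h in range(H):
        a = policy(s)
        V += r[(s, a)]                             # accumulate V = sum_h r_h(s_h, a_h)
        s = int(rng.choice(2, p=P[(s, a)]))        # Markov transition s_{h+1} ~ P(. | s_h, a_h)
    return V

always_1 = lambda s: 1                             # a fixed policy, for illustration
print("cumulative reward of 'always action 1':", rollout(always_1))
```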
Extended settings (IV) - Structured data
- Combinatorial Bandits:
- Action $[a, b, c]$ where $a \in \{A_1, A_2, \cdots\}, b \in \{B_1, B_2, \cdots\}, c \in \{C_1, C_2, \cdots\}$
- E.g., Different combination of substances
- Semi-bandit feedback: a reward for each substance
- Bandit feedback: a single reward for the whole combination
- Graph structured data: Contexts are represented as a graph $\mathcal G$
- E.g. social graph for recommendations
- Collaborative filtering: recommend different items to different users
- Heuristic: an item is good if it is generally good or if the specific user likes it
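A sketch of the combinatorial action space and the two feedback types; the component names and per-component rewards are made up for illustration.

```python
import itertools

# Component choices, e.g. different substances in a mixture.
A = ["A1", "A2"]; B = ["B1", "B2", "B3"]; C = ["C1", "C2"]
per_component = dict(zip(A + B + C, [0.1, 0.3, 0.2, 0.5, 0.0, 0.4, 0.1]))

actions = list(itertools.product(A, B, C))       # all combinations [a, b, c]
combo = actions[0]

# Semi-bandit feedback: a reward is observed for each chosen component.
semi_feedback = [per_component[c] for c in combo]
# Bandit feedback: only one reward is observed for the whole combination.
bandit_feedback = sum(semi_feedback)

print(combo, "semi:", semi_feedback, "bandit:", round(bandit_feedback, 2))
```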
Rules of thumb for applying bandit algorithms (not only BO)
- Problem setting: what’s the action space (features) and the outcome (labels)?
- Is obtaining the outcome automated or costly? How noisy is it? What kind of outcome is it?
- What’s the prior knowledge on the actions?
- E.g. which actions are likely better? Is there a pre-collected offline dataset?
- Knowledge about the function approximation: linear? Bayesian Optimization?
- The simpler, the better! Why do we need to go beyond BO?
- How much exploration do you need?
- Not all automated tasks need aggressive exploration!
- Goal of your algorithm: pure exploration, or converging to something good?