From Bayesian Optimization to Bandits and RL


Weitong Zhang, prepared for Chong’s Lab @ UCLA, 08-29-2024

Starting from Bayesian Optimization

  • Objective: $\min_x f(x)$
  • We do not know $f(\cdot)$ at the beginning
    • In each round, select an $x \in \mathbb R^d$ (through the dragonfly package…)
      • Exploration: Collect the data, acquisition
    • Receive a noisy observation $y \approx f(x)$ (automated system)
    • Update the function approximation $\tilde f(\cdot)$ (through the dragonfly package)
      • Exploitation: Use the data
  • Bayesian Optimization (a minimal loop sketch follows below):
    • The formulation of $f(\cdot)$ is fixed (Gaussian kernel)
    • The exploration strategy (selecting $x$) is fixed (Kernel TS)
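A minimal sketch of this loop, assuming a discrete candidate set and a scikit-learn Gaussian process surrogate with posterior (Thompson-style) sampling standing in for what dragonfly does internally; `f_true` and all constants are illustrative:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
candidates = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)   # candidate x's (discretized for simplicity)
f_true = lambda x: np.sin(3 * x) + 0.5 * x ** 2            # the unknown f (illustrative stand-in)

X_obs, y_obs = [], []
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-2)  # fixed Gaussian kernel

for k in range(30):
    if X_obs:
        gp.fit(np.array(X_obs), np.array(y_obs))          # exploitation: update the surrogate
        sample = gp.sample_y(candidates, random_state=k)   # kernel TS: draw one posterior function
        x = candidates[np.argmin(sample)]                   # exploration: query its minimizer
    else:
        x = candidates[rng.integers(len(candidates))]       # no data yet: start at random
    y = f_true(x).item() + 0.1 * rng.standard_normal()      # noisy observation y ≈ f(x)
    X_obs.append(x)
    y_obs.append(y)

print("best observed x:", X_obs[int(np.argmin(y_obs))])
```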


A more general formulation: Bandits

  • Objective: $\max_x r(x)$ (find the optimal arm)
  • Exploration: Collect the data $(x, r)$
  • Exploitation: Estimate the reward $r(\cdot)$

Notations / Metrics:

  • Optimal arm: $x^* = \arg\max_x r(x)$
  • Policy $\pi$: the rule for selecting $x$
  • Regret: $\text{Reg}(K) = \sum_{k=1}^K \big(r(x^*) - r(x_k)\big)$
  • Sample Complexity: $r(x^*) - r(x_k) \le \epsilon$ for all $k \ge K(\epsilon)$

Differences between regret and sample complexity?
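One way to see the difference on a toy run (the arm values, pull sequence, and $\epsilon = 0.05$ below are made up): regret adds up every suboptimality gap, while sample complexity asks after how many rounds the policy stays $\epsilon$-optimal.

```python
r = {"x1": 1.1, "x2": 1.2}                  # true mean rewards (toy values)
r_star = max(r.values())                    # r(x*), here x* = x2
pulls = ["x1", "x1", "x2", "x2", "x2"]      # arms selected by some policy over K = 5 rounds

gaps = [r_star - r[x] for x in pulls]       # per-round gap r(x*) - r(x_k)
regret = sum(gaps)                          # Reg(K): keeps accumulating with every pull
eps = 0.05
K_eps = next(k for k in range(len(gaps)) if all(g <= eps for g in gaps[k:]))
print(regret, K_eps)                        # e.g. Reg(5) ≈ 0.2, K(0.05) = 2
```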



Goals for Bandits

  • Regret minimization: $\min \text{Reg}(K) = \sum_{k=1}^K \big(r(x^*) - r(x_k)\big)$
    • $x_k$ selection: Exploration + Exploitation
  • Pure exploration / Explore-then-commit (sketch below)
    • Explore the environment for $K(\epsilon)$ steps, then commit to an $\epsilon$-optimal policy
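A minimal explore-then-commit sketch for a small multi-armed bandit, assuming Gaussian noise; the arm means, noise level, and exploration budget are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([0.3, 0.5, 0.7])            # true arm means, unknown to the learner
n_arms, n_rounds_per_arm = len(means), 50    # exploration budget per arm

# Pure exploration phase: pull every arm uniformly, only to estimate it.
estimates = np.zeros(n_arms)
for a in range(n_arms):
    pulls = means[a] + 0.1 * rng.standard_normal(n_rounds_per_arm)
    estimates[a] = pulls.mean()

# Commit phase: play the empirically best arm for the rest of the horizon.
best = int(np.argmax(estimates))
print("committed arm:", best, "estimates:", estimates)
```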

Setting of functions

  • Multi-armed Bandits: $x \in \{x_1, x_2, \cdots, x_n\}$, $r(x) \in \{r_1, r_2, \cdots, r_n\}$
  • Contextual Bandits: $x \in \mathbb R^d$ or $x \in \mathcal D \subset \mathbb R^d$ (feature of observation)
    • Linear Bandits: $r(x) = \langle \theta, x\rangle$ (estimation sketch below)
    • Kernel Bandits: $r(x) = \langle \theta, \phi(x)\rangle$ (similar to Bayesian Optimization)
    • Neural Networks / General function approximation $r_\theta(x)$.
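For the linear case, the exploitation step typically boils down to a ridge regression estimate of $\theta$ from the collected pairs $(x, r)$; kernel and neural variants replace $x$ with features $\phi(x)$. A sketch with synthetic data ($\theta$, the noise level, and $\lambda$ are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([0.8, -0.3])                      # true parameter, unknown to the learner
X = rng.standard_normal((50, 2))                   # contexts x in R^d collected so far
r = X @ theta + 0.1 * rng.standard_normal(50)      # observed rewards r(x) = <theta, x> + noise

lam = 1.0                                          # ridge regularization
A = X.T @ X + lam * np.eye(2)                      # regularized d x d design matrix
theta_hat = np.linalg.solve(A, X.T @ r)            # exploitation: estimate of theta
print(theta_hat)                                   # should be close to [0.8, -0.3]
```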

Challenges and Philosophies

Tradeoff between exploration and exploitation

  • Collected Data: $\mathcal D = \{(x_1, 1), (x_1, 1.3), (x_1, 1), (x_2, 1)\}$
  • $\tilde r(x_1) = 1.1, \tilde r(x_2) = 1$ (True reward: $r(x_1) = 1.1, r(x_2) = 1.2$)
  • Regret minimization: $\min \sum_k r(x^*) - r(x_k)$ (Should we always select $x_1$?)
  • Uncertainty quantification: $\sigma_1 = 1/\sqrt{3}, \sigma_2 = 1$
  • Leveraging the uncertainty (both rules sketched below):
    • Posterior Sampling: $r(x) \sim \mathcal N(\tilde r(x), \sigma^2(x))$:
      • Bayesian Optimization, Thompson Sampling
    • Encourage exploration: $r(x) = \tilde r(x) + \beta \sigma(x)$
      • Upper Confidence Bound (UCB), optimism in the face of uncertainty
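The two rules applied to the toy data above, with $\beta = 1$ and the stated $\sigma$ values taken as given; a sketch of a single selection step, not a full algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
r_hat = {"x1": 1.1, "x2": 1.0}                     # empirical means from the collected data
sigma = {"x1": 1 / np.sqrt(3), "x2": 1.0}          # uncertainties: fewer pulls -> larger sigma
beta = 1.0

# UCB / optimism: score each arm by mean + beta * uncertainty and pick the max.
ucb_choice = max(r_hat, key=lambda x: r_hat[x] + beta * sigma[x])

# Posterior (Thompson) sampling: draw a plausible reward per arm and pick the max.
ts_choice = max(r_hat, key=lambda x: rng.normal(r_hat[x], sigma[x]))

print(ucb_choice, ts_choice)                       # x2 can still be selected despite its lower mean
```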

Extended settings (I) - Safety and Robustness

  • Safety of bandit: $\max r(x) \text{ s.t. } c(x) \le C$
    • Preliminary solution: $\max r(x) - \lambda c(x)$
  • Robust to model error: when the true $r(x)$ differs from the function approximation $\tilde r(x)$
    • E.g. $\tilde r(x) = \langle w, x\rangle$, $r(x) = \langle w, x\rangle + \eta(x)$
    • Preliminary solution: learn with enlarged uncertainty estimates
  • Robust to adversarial attack: after selecting $x$, receive $r = r(x) + \eta(x) + \epsilon$
    • Difference from model error: the corruption $\eta(x)$ is chosen after the action is selected
    • Preliminary solution: weighted by uncertainties / Mirror descent
  • Non-stationary bandits: $r_k(x)$ is changing over time $k$
    • E.g. $r_k(x) = \langle w + \theta k, x\rangle$: a slowly drifting linear function
    • Preliminary solution: sliding window, only learn from the $[k-T, k]$ time window (sketch below)
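A sketch of the sliding-window idea for the non-stationary case: only the most recent $T$ observations of an arm enter its estimate, so the estimate tracks the drift instead of averaging it away (all numbers are made up):

```python
import numpy as np

T = 20                                         # window length
rewards = np.linspace(1.0, 2.0, 100)           # one arm's slowly drifting rewards over 100 rounds
k = 80                                         # current round

full_mean = rewards[:k].mean()                 # ordinary estimate: lags far behind the drift
window_mean = rewards[max(0, k - T):k].mean()  # sliding-window estimate over [k-T, k]
print(full_mean, window_mean)                  # the window mean is much closer to the current reward
```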

Extended settings (II) - Different feedbacks

  • Regular feedback: Receive $r = r(x) + \epsilon$, $\epsilon$: zero mean (sub)-Gaussian
    • Extension: $\epsilon$ follows some more general distribution (heavy tailed, etc.)
  • Binary feedback: Receive $r \sim \text{Ber.}(r(x))$
    • E.g. Recommendation systems: whether a user clicks the recommendation
    • Preliminary solution: change the loss function to cross-entropy loss
  • Dueling bandit: Select two arms (A) and (B), and observe which one wins (is better)
    • E.g. Reinforcement Learning from Human Feedback
    • Preliminary solution: assume a preference model, e.g. Bradley-Terry: $P(A \succ B) = \sigma(r(A) - r(B))$ with $\sigma$ the logistic function (sketch after this list)
    • Extension: Ranking over several actions (A > B > C > D) (rank aggregation)
  • Partial Monitoring: Only receive feedback when the agent requests it
    • E.g. Request a label for a spam email from a human; otherwise no feedback
    • Preliminary solution: query when uncertainty is large
  • No feedback (reward free, pure exploration): Collect data for future tasks / rewards
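For dueling feedback, one common assumption (e.g. in RLHF-style settings) is a Bradley-Terry / logistic preference model, which is what the $P(A \succ B)$ expression above intends; a sketch with made-up latent rewards:

```python
import numpy as np

def p_win(r_a: float, r_b: float) -> float:
    """Bradley-Terry / logistic preference: probability that arm A beats arm B."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

rng = np.random.default_rng(4)
r = {"A": 1.2, "B": 0.9}                            # latent rewards, never observed directly
winner = "A" if rng.random() < p_win(r["A"], r["B"]) else "B"
print(p_win(r["A"], r["B"]), winner)                # the learner only sees the winner
```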

Extended settings (III) - Multi-step decision

  • Delayed feedback: the reward for round $t$ is only received at round $t + T$
    • E.g., Experiments taking longer time to finish
  • Batched update: perform $B$ experiments at the same time and receive $r_t, \cdots, r_{t + B - 1}$ together
  • Low switching: Explore the environment without switching the policy too often
    • E.g. Change the policy only when the current exploration policy is no longer good enough
  • Reinforcement Learning:
    • Sequential decision making steps: $a_1, \cdots a_H$
    • Consider the cumulative reward function $V = \sum_{h=1}^H r_h(s_h, a_h)$ (toy rollout below)
    • Markov Decision Processes: $s_{h+1} \sim P(\cdot \mid s_h, a_h)$
      • Bandits: $s_{h+1} \sim \mu(\cdot)$ or fixed (ignored)
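A toy episodic rollout showing the cumulative reward $V = \sum_{h=1}^H r_h(s_h, a_h)$ and the Markov transition; in the bandit special case the next state would be drawn independently of $(s_h, a_h)$ or ignored. All sizes and values are made up:

```python
import numpy as np

rng = np.random.default_rng(5)
H, n_states, n_actions = 5, 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel P(s' | s, a)
R = rng.random((n_states, n_actions))                             # reward r(s, a), time-homogeneous here

s, V = 0, 0.0
for h in range(H):
    a = int(rng.integers(n_actions))       # a policy would choose a from s; here it is random
    V += R[s, a]                           # accumulate r_h(s_h, a_h)
    s = rng.choice(n_states, p=P[s, a])    # Markov transition: s_{h+1} ~ P(. | s_h, a_h)
print("episode return V =", V)
```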

Extended settings (IV) - Structured data

  • Combinatorial Bandits:
    • Action $[a, b, c]$ where $a \in \{A_1, A_2, \cdots\}, b \in \{B_1, B_2, \cdots\}, c \in \{C_1, C_2, \cdots\}$
    • E.g., Different combinations of substances
    • Semi-bandit feedback: a separate reward for each substance (sketch below)
    • Bandit feedback: a single reward for the whole combination
  • Graph structured data: Contexts are represented as a graph $\mathcal G$
    • E.g. social graph for recommendations
  • Collaborative filtering: Recommend different users with different items
    • Heuristic: an item is likely good for a user if similar users liked it or the user liked similar items
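A sketch of the bandit vs semi-bandit feedback distinction for one combination of substances; the per-substance effects, the noise, and the additive aggregation are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
effect = {"A2": 0.5, "B1": 0.1, "C1": 0.3}        # per-substance effects (made up)
action = ["A2", "B1", "C1"]                        # one combinatorial action [a, b, c]

# Semi-bandit feedback: a noisy reward for each substance in the combination.
semi_bandit = {s: effect[s] + 0.05 * rng.standard_normal() for s in action}

# Bandit feedback: only one aggregate reward for the whole combination.
bandit = sum(semi_bandit.values())
print(semi_bandit, bandit)
```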

Rules of thumb for applying bandit algorithms (not only BO)

  1. Problem setting: what’s the action space (features) and the outcome (labels)?
  2. How is the outcome obtained: automated or costly? How noisy is it? What kind of outcome is it?
  3. What’s the prior knowledge on the actions?
    • E.g. Which action is likely better? Is there a pre-prepared offline dataset?
  4. The knowledge on function approximations: Linear? Bayesian Optimization (kernel)?
    • The simpler the better! Why do we need to go beyond BO?
  5. How much exploration do you need?
    • Not all automated tasks need aggressive exploration!
  6. Goal of your algorithm: pure exploration, or converging to something good?

Thank you