TY - JOUR
T1 - Bandit Task Assignment with Unknown Processing Time
AU - Ito, Shinji
AU - Hatano, Daisuke
AU - Sumita, Hanna
AU - Takemura, Kei
AU - Fukunaga, Takuro
AU - Kakimura, Naonori
AU - Kawarabayashi, Ken-ichi
N1 - Publisher Copyright:
© 2023 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2023
Y1 - 2023
AB - This study considers a novel problem setting, referred to as bandit task assignment, that incorporates the processing time of each task into the bandit setting. In this setting, a player sequentially chooses a set of tasks to start so that the set of processing tasks satisfies a given combinatorial constraint. The reward and processing time of each task follow unknown distributions, whose values are revealed only after the task is completed. The problem generalizes the stochastic combinatorial semi-bandit problem and the budget-constrained bandit problem. For this problem setting, we propose an algorithm based on upper confidence bounds (UCB) combined with a phased-update approach. The proposed algorithm admits a gap-dependent regret upper bound of O(MN(1/∆)log T) and a gap-free regret upper bound of Õ(√MNT), where N is the number of tasks, M is the maximum number of tasks that can run at the same time, T is the time horizon, and ∆ is the gap between the expected per-round rewards of the optimal and the best suboptimal sets of tasks. These regret bounds nearly match lower bounds.
UR - http://www.scopus.com/inward/record.url?scp=85191150606&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191150606&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85191150606
SN - 1049-5258
VL - 36
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Y2 - 10 December 2023 through 16 December 2023
ER -