TY - JOUR
T1 - Delay and cooperation in nonstochastic linear bandits
AU - Ito, Shinji
AU - Hatano, Daisuke
AU - Sumita, Hanna
AU - Takemura, Kei
AU - Fukunaga, Takuro
AU - Kakimura, Naonori
AU - Kawarabayashi, Ken-ichi
N1 - Funding Information:
This work was supported by JST, ERATO, Grant Number JPMJER1201, Japan. SI was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan. TF was supported by JST, PRESTO, Grant Number JPMJPR1759, Japan. NK and KK were supported by JSPS, KAKENHI, Grant Number JP18H05291, Japan.
Publisher Copyright:
© 2020 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2020
Y1 - 2020
N2 - This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action, and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed d rounds after the choice of an action, and propose an algorithm for which the expected regret is Õ(√(m(m + d)T)), ignoring logarithmic factors in m and T, where m and T denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer the regret of Ω(√(m(m + d)T)) in the worst case. To develop the algorithm, we introduce a technique we refer to as distribution truncation, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [18] and Bar-On and Mansour [12], and extend their results to the linear bandits setting.
AB - This paper offers a nearly optimal algorithm for online linear optimization with delayed bandit feedback. Online linear optimization with bandit feedback, or nonstochastic linear bandits, provides a generic framework for sequential decision-making problems with limited information. This framework, however, assumes that feedback can be observed just after choosing the action, and, hence, does not apply directly to many practical applications, in which the feedback can often only be obtained after a while. To cope with such situations, we consider problem settings in which the feedback can be observed d rounds after the choice of an action, and propose an algorithm for which the expected regret is Õ(√(m(m + d)T)), ignoring logarithmic factors in m and T, where m and T denote the dimensionality of the action set and the number of rounds, respectively. This algorithm achieves nearly optimal performance, as we are able to show that arbitrary algorithms suffer the regret of Ω(√(m(m + d)T)) in the worst case. To develop the algorithm, we introduce a technique we refer to as distribution truncation, which plays an essential role in bounding the regret. We also apply our approach to cooperative bandits, as studied by Cesa-Bianchi et al. [18] and Bar-On and Mansour [12], and extend their results to the linear bandits setting.
UR - http://www.scopus.com/inward/record.url?scp=85108402412&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85108402412&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85108402412
SN - 1049-5258
VL - 2020-December
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 34th Conference on Neural Information Processing Systems, NeurIPS 2020
Y2 - 6 December 2020 through 12 December 2020
ER -