|
NeuroCOLT
Technical Report NC-TR-02-130
2002-130
Using Confidence Bounds for Exploitation - Exploration Trade offs
Peter Auer
ABSTRACT
We show how a standard tool from statistics - namely confidence
bounds can be used to elegantly deal with situations which exhibit
an exploitation-exploration trade-off. Our technique for designing
and analyzing algorithms for such situations is general and can be
applied
when an algorithm has to make exploitation-versus-exploration
decisions based on uncertain information provided by a random process.
We apply our technique
to two models with such an
exploitation-exploration trade-off. For the adversarial bandit
problem with shifting our new algorithm suffers only (ST)^(1/2) regret
with high probability over T trials with S shifts. Such a regret bound
was previously known only in expectation. The second model we consider
is associative reinforcement learning with linear value functions.
For this model our technique improves the regret from T^(3/4) to T^(1/2).
Download
Postscript
|