Seminars Events Calendar

Filter by
Jump to

The Confidence Bound Method For The Multi-Armed Bandit Problem With Large Arm Size

Mr Hu ShouriNational University of Singapore

Date:15 October 2020, Thursday


Time:03:00pm - 04:00pm, Singapore time

PhD Oral Presentation:

The confidence bound method is well-established for the multi-armed bandit problem and has been shown to be optimal when the number of rewards goes to infinity with arm size fixed. However in practice when the arm size is large the asymptotics may not be accurate. We consider here two modifications of the confidence bound method that achieve optimality in the case of large arm sizes. In the first situation we assume that the number of arms is infinitely large. We establish a regret lower bound in this situation and construct confidence bounds that achieve the regret lower bound asymptotically. A novel feature of these confidence bounds is a target value for deciding whether to stick to arms that have been played, or to play a new arm. In the second situation we consider asymptotics when the number of arms grows at a polynomial rate with respect to the number of rewards. We show that under this situation the regret lower bound has an expression similar to that of Robbins and Lai (1985), but with a smaller asymptotic constant. We show how the confidence bounds proposed by Lai (1987) and Agarwal (1995) can be corrected for arm size so that the new regret lower bound is achieved. The reason for these differences is that when the number of arms is large, the inevitable experimentation costs is much larger compared to the fixed arm size situation. This allows for regret savings by less conservative confidence bounds that provides better balances of the experimentation and exploitation costs. Numerical studies here show that the new confidence bounds performs better than classical confidence bounds that do not correct for arm size, and also better than other multi-armed bandit algorithms.

Key words and phrases: confidence bound method, large arm sizes, multi-armed bandit, optimality, regret lower bound.