### Horia Mania

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2020-150

August 13, 2020

### http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-150.pdf

With potential applications as diverse as self-driving cars, medical robots, and network protocols, recent years witnessed a staggering interest in building autonomous agents that learn to interact with the world. Despite impressive successes in games and robotic demonstrations, known reinforcement learning (RL) algorithms remain data hungry, unreliable, and complex. To address these issues we need to better understand the limits of training agents that interact with the world.

We theoretically analyze the data requirements of RL in several simple settings. To understand the sample complexity of system identification, a fundamental building block of model-based RL and feedback control, we focus on the estimation of linear dynamical systems and, more generally, on the estimation of dynamical systems whose state transitions depend linearly on a known feature embedding of state-action pairs. For linear dynamical systems we present a specialized analysis that captures the correct signal-to-noise behavior of the problem, showing that more unstable linear systems are easier to estimate. While linear systems can be identified from data generated by i.i.d. random inputs, to estimate nonlinear dynamical systems we must use a judicious choice of inputs. We propose an active learning method that addresses this challenge.

Then, we study the Linear Quadratic Regulator (LQR), a classical problem in control theory, from a RL perspective by assuming the underlying dynamics are unknown. We consider two solutions that use estimates of the dynamics to synthesize controllers: certainty equivalence and robust LQR. Certainty equivalence is the most straightforward approach to controller synthesis for LQR with unknown dynamics. It generates the optimal controller for the estimated dynamics, disregarding the effects of the estimation error. We show that when the estimation error is sufficiently small the difference between the cost achieved by the certainty equivalent controller and the optimal cost scales like the square of the estimation error.

We also consider a robust LQR approach that can operate with larger estimation errors. Our robust LQR method relies on System Level Synthesis to formulate the robust control problem as a quasi-convex optimization problem. We show that the performance gap of robust LQR scales linearly with the estimation error. Therefore, certainty equivalence can outperform robust LQR when the estimation error is small, but the latter approach can operate with larger estimation error.

Finally, in many settings RL agents have to operate in the presence of other decision makers. To study RL in such a scenario we take inspiration from the study of two-sided markets and stable matchings. Agents acting in markets often have to learn about their preferences through exploration. With the advent of massive online markets powered by data-driven matching platforms, it has become necessary to better understand the interplay between learning and market objectives. We propose a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. Our model extends the standard multi-armed bandits framework to multiple players, with the added feature that arms have preferences over players. We show surprising exploration-exploitation trade-offs compared to the single player multi-armed bandits setting.

**Advisor:** Michael Jordan and Benjamin Recht

BibTeX citation:

@phdthesis{Mania:EECS-2020-150, Author = {Mania, Horia}, Title = {The sample complexity of simple reinforcement learning}, School = {EECS Department, University of California, Berkeley}, Year = {2020}, Month = {Aug}, URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-150.html}, Number = {UCB/EECS-2020-150}, Abstract = { With potential applications as diverse as self-driving cars, medical robots, and network protocols, recent years witnessed a staggering interest in building autonomous agents that learn to interact with the world. Despite impressive successes in games and robotic demonstrations, known reinforcement learning (RL) algorithms remain data hungry, unreliable, and complex. To address these issues we need to better understand the limits of training agents that interact with the world. We theoretically analyze the data requirements of RL in several simple settings. To understand the sample complexity of system identification, a fundamental building block of model-based RL and feedback control, we focus on the estimation of linear dynamical systems and, more generally, on the estimation of dynamical systems whose state transitions depend linearly on a known feature embedding of state-action pairs. For linear dynamical systems we present a specialized analysis that captures the correct signal-to-noise behavior of the problem, showing that more unstable linear systems are easier to estimate. While linear systems can be identified from data generated by i.i.d. random inputs, to estimate nonlinear dynamical systems we must use a judicious choice of inputs. We propose an active learning method that addresses this challenge. Then, we study the Linear Quadratic Regulator (LQR), a classical problem in control theory, from a RL perspective by assuming the underlying dynamics are unknown. We consider two solutions that use estimates of the dynamics to synthesize controllers: certainty equivalence and robust LQR. Certainty equivalence is the most straightforward approach to controller synthesis for LQR with unknown dynamics. It generates the optimal controller for the estimated dynamics, disregarding the effects of the estimation error. We show that when the estimation error is sufficiently small the difference between the cost achieved by the certainty equivalent controller and the optimal cost scales like the square of the estimation error. We also consider a robust LQR approach that can operate with larger estimation errors. Our robust LQR method relies on System Level Synthesis to formulate the robust control problem as a quasi-convex optimization problem. We show that the performance gap of robust LQR scales linearly with the estimation error. Therefore, certainty equivalence can outperform robust LQR when the estimation error is small, but the latter approach can operate with larger estimation error. Finally, in many settings RL agents have to operate in the presence of other decision makers. To study RL in such a scenario we take inspiration from the study of two-sided markets and stable matchings. Agents acting in markets often have to learn about their preferences through exploration. With the advent of massive online markets powered by data-driven matching platforms, it has become necessary to better understand the interplay between learning and market objectives. We propose a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. Our model extends the standard multi-armed bandits framework to multiple players, with the added feature that arms have preferences over players. We show surprising exploration-exploitation trade-offs compared to the single player multi-armed bandits setting.} }

EndNote citation:

%0 Thesis %A Mania, Horia %T The sample complexity of simple reinforcement learning %I EECS Department, University of California, Berkeley %D 2020 %8 August 13 %@ UCB/EECS-2020-150 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-150.html %F Mania:EECS-2020-150