Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism

Li, Zihao; Yang, Zhuoran; Wang, Mengdi

Computer Science > Machine Learning

arXiv:2305.18438 (cs)

[Submitted on 29 May 2023 (v1), last revised 3 Jul 2023 (this version, v3)]

Abstract: We study offline Reinforcement Learning with Human Feedback (RLHF), where the aim is to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices. RLHF is challenging for several reasons: a large state space with limited human feedback, the bounded rationality of human decisions, and off-policy distribution shift. We focus on the Dynamic Discrete Choice (DDC) model for modeling and understanding human choices. DDC, rooted in econometrics and decision theory, is widely used to model a human decision-making process that is forward-looking and boundedly rational. We propose a Dynamic-Choice-Pessimistic-Policy-Optimization (DCPPO) method. The method has three stages: the first estimates the human behavior policy and the state-action value function via maximum likelihood estimation (MLE); the second recovers the human reward function by minimizing the Bellman mean squared error using the learned value functions; the third plugs in the learned reward and invokes pessimistic value iteration to find a near-optimal policy. Assuming only single-policy coverage of the dataset (i.e., coverage of the optimal policy), we prove that the suboptimality of DCPPO almost matches that of the classical pessimistic offline RL algorithm in its dependence on distribution shift and dimension. To the best of our knowledge, this paper presents the first theoretical guarantees for off-policy offline RLHF with a dynamic discrete choice model.
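The three stages named in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation, and it rests on assumptions the abstract does not state: finite state and action spaces, the standard logit form of DDC choice probabilities (pi_h(a|s) proportional to exp(Q_h(s,a))), the normalization Q_h(s, 0) = 0 for a reference action 0, add-one smoothing of the counts, and a count-based penalty beta / sqrt(n + 1). The function name `dcppo_tabular_sketch` and all parameter names are hypothetical. The abstract's reference to "dimension" suggests the paper works with function approximation, which this sketch does not cover.

```python
import numpy as np

def dcppo_tabular_sketch(trajs, S, A, H, gamma=1.0, beta=1.0):
    """Tabular illustration of the three stages (hypothetical helper).

    trajs: list of trajectories, each a list of H (state, action) pairs.
    """
    # Visit counts n[h, s, a] and transition counts m[h, s, a, s'].
    n = np.zeros((H, S, A))
    m = np.zeros((H, S, A, S))
    for traj in trajs:
        for h, (s, a) in enumerate(traj):
            n[h, s, a] += 1
            if h + 1 < H:
                m[h, s, a, traj[h + 1][0]] += 1

    # Stage 1: estimate the behavior policy from choice frequencies.
    # Add-one smoothing (an assumption) avoids log(0).
    pi_hat = (n + 1.0) / (n + 1.0).sum(axis=2, keepdims=True)
    # Under the assumed logit model with Q(s, 0) = 0:
    # Q(s, a) = log pi(a|s) - log pi(0|s), V(s) = log sum_a exp Q(s, a).
    q_hat = np.log(pi_hat) - np.log(pi_hat[:, :, :1])
    v_hat = np.log(np.exp(q_hat).sum(axis=2))

    # Empirical transition frequencies (all-zero rows for unvisited pairs).
    p_hat = m / np.maximum(n, 1.0)[..., None]

    # Stage 2: recover the reward from the Bellman equation
    # r = Q - gamma * E[V_next]. In the tabular case the least-squares
    # fit of V_next on (s, a) is the empirical mean used here.
    r_hat = q_hat.copy()
    for h in range(H - 1):
        r_hat[h] -= gamma * (p_hat[h] @ v_hat[h + 1])

    # Stage 3: pessimistic value iteration with the learned reward,
    # subtracting a penalty that shrinks as visit counts grow.
    bonus = beta / np.sqrt(n + 1.0)
    v = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        q = r_hat[h] + gamma * (p_hat[h] @ v) - bonus[h]
        policy[h] = q.argmax(axis=1)
        v = q.max(axis=1)
    return pi_hat, q_hat, r_hat, policy
```

Calling it with a handful of trajectories, for example `dcppo_tabular_sketch([[(0, 1), (1, 0)], [(0, 1), (1, 1)]], S=2, A=2, H=2)`, returns the estimated choice probabilities, the normalized Q estimates, the recovered rewards, and a greedy policy under the penalized values. Using the same `gamma` in stage 3 as in stage 2 is a simplification made here, not a claim about the paper.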

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Cite as:	arXiv:2305.18438 [cs.LG]
	(or arXiv:2305.18438v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2305.18438

Submission history

From: Zihao Li
[v1] Mon, 29 May 2023 01:18:39 UTC (35 KB)
[v2] Wed, 31 May 2023 15:47:45 UTC (36 KB)
[v3] Mon, 3 Jul 2023 13:08:46 UTC (37 KB)
