Title: When Demonstrations meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning
Speaker: Dr. Siliang Zeng (University of Minnesota)
Offline inverse reinforcement learning (offline IRL) aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving. However, the structure of an expert's preferences implicit in observed actions is closely linked to the expert's model of the environment dynamics (i.e., the "world" model). Thus, inaccurate models of the world obtained from finite data with limited coverage can compound inaccuracies in the estimated rewards. To address this issue, we propose a bi-level optimization formulation of the estimation task, wherein the upper level is likelihood maximization based upon a conservative model of the expert's policy (the lower level). The policy model is conservative in that it maximizes reward subject to a penalty that increases with the uncertainty of the estimated world model. We propose a new algorithmic framework to solve this bi-level formulation and provide statistical and computational guarantees of performance for the associated optimal reward estimator. Finally, we demonstrate that the proposed algorithm outperforms state-of-the-art benchmarks by a large margin on robotics control tasks and in large language model training.
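The bi-level structure described in the abstract can be sketched schematically as follows. This is a simplified rendering in generic notation, not the paper's exact formulation: $\mathcal{D}$ denotes the demonstration set, $\widehat{P}$ the world model estimated from data, $r_{\theta}$ the parameterized reward, $U$ an uncertainty penalty on the model estimate, and $\lambda$ its weight.

```latex
\max_{\theta}\; \sum_{(s,a)\in\mathcal{D}} \log \pi_{\theta}(a \mid s)
\quad \text{s.t.} \quad
\pi_{\theta} \in \arg\max_{\pi}\;
\mathbb{E}_{\pi,\,\widehat{P}}\!\left[ \sum_{t} \gamma^{t}
\big( r_{\theta}(s_t, a_t) - \lambda\, U(s_t, a_t) \big) \right]
```

The upper level fits the reward parameters $\theta$ by maximizing the likelihood of the demonstrations under the lower-level policy; the lower level computes a conservative policy that trades off estimated reward against model uncertainty under the learned dynamics $\widehat{P}$.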
Siliang Zeng is a fourth-year Ph.D. student in the Department of Electrical and Computer Engineering at the University of Minnesota, Twin Cities. He earned his bachelor's degree in Statistics from the Chinese University of Hong Kong (Shenzhen) in 2020. His research interests focus on reinforcement learning and foundation models, with publications and papers under revision at venues such as NeurIPS, CoRL, SIAM Journal on Optimization, and Operations Research.