Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning
- Goal space: the set of all goals $g$
- Reward function $R(s, a, g)$: sparse and deterministic
Different tasks are generated by sampling goals from different distributions, which is how the goal-conditioned generalization problem is studied.
Causal Reasoning with Graphical Models
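- $X = (X_i)_{i \in V}$: random variables with index set $V$
- A graph $G = (V, E)$ consists of nodes $V$ and edges $E \subseteq V \times V$
- A node $i$ is called a parent of $j$ if $(i, j) \in E$ and $(j, i) \notin E$. The set of parents of $j$ is denoted by $\mathrm{PA}_j^{G}$.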
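The parent definition maps directly onto an adjacency matrix. The sketch below makes two assumptions: `adj[i, j] = 1` encodes $(i, j) \in E$, and the helper name `parents` is hypothetical.

```python
import numpy as np

def parents(adj, j):
    """PA_j: every node i with (i, j) in E and (j, i) not in E.

    adj is an n x n 0/1 matrix where adj[i, j] == 1 means (i, j) is in E.
    A pair of opposite edges (i <-> j) does not make i a parent of j.
    """
    return [i for i in range(adj.shape[0]) if adj[i, j] and not adj[j, i]]

# Example graph on V = {0, 1, 2} with E = {(0, 2), (1, 2)}.
E = np.zeros((3, 3), dtype=int)
E[0, 2] = E[1, 2] = 1
print(parents(E, 2))  # [0, 1]
```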
GCRL as Latent Variable Models
From the probabilistic-inference perspective, the goal is to solve a likelihood maximization problem. Treating the graph as a latent variable and decomposing the likelihood yields an evidence lower bound (ELBO).
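Two of the terms are constant (uniform distributions), so maximizing the ELBO reduces to an objective over the graph posterior, the transition model, and the policy.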
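Intuition: to solve this optimization problem, alternately update the posterior over graphs (causal discovery) and the transition model and policy (model and policy learning).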
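For reference, here is the generic form of such a bound. It rests on several assumptions: $\tau = (s_0, a_0, s_1, \dots, s_T)$ is a trajectory for a fixed goal $g$, $q(G \mid \tau)$ is a variational posterior over graphs, $p(G)$ is a prior, and only the transition term depends on the graph. The paper's exact factorization may differ.

```latex
\log p(\tau) \;\ge\; \mathbb{E}_{q(G \mid \tau)}\big[\log p(\tau \mid G)\big] - D_{\mathrm{KL}}\big(q(G \mid \tau)\,\|\,p(G)\big),
\qquad
p(\tau \mid G) = p(s_0)\prod_{t=0}^{T-1} \pi(a_t \mid s_t, g)\, p(s_{t+1} \mid s_t, a_t; G)
```

Under this form, the expectation term splits into an initial-state term, a policy term, and a model term that depends on $G$. Updating $q(G \mid \tau)$ is the causal-discovery step. Updating $p(s_{t+1} \mid s_t, a_t; G)$ and $\pi$ is the model- and policy-learning step.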
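We propose to model the transition corresponding to $G$ with a collection of neural networks $\{f_i\}$, one per node $i$, to obtain $s^{i}_{t+1} = f_i(\mathrm{PA}_i^{t}, \epsilon_i)$, where:
- $\mathrm{PA}_i^{t}$ represents the values of all parents of node $i$ at time step $t$
- $\epsilon_i$ is Gaussian noise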
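The following is a minimal sketch of such a factored model, built on these assumptions:
- It uses numpy.
- An adjacency mask `mask[k, i]` runs from inputs $[s_t, a_t]$ to outputs $s_{t+1}$.
- Each output node gets its own randomly initialized (untrained) one-hidden-layer network.
- The noise is additive Gaussian.

The class name and all parameters are hypothetical, and training is omitted.

```python
import numpy as np

class CausalTransitionModel:
    """Sketch: one small network f_i per next-state dimension i, fed only PA_i.

    Inputs are x_t = [s_t, a_t] (n_in dims); outputs are the n_out dims of
    s_{t+1}. mask[k, i] = 1 keeps the edge from input k to output i; every
    other input is zeroed before f_i sees it. Weights are random (untrained).
    """

    def __init__(self, mask, hidden=16, noise_std=0.1, seed=0):
        self.mask = np.asarray(mask, dtype=float)        # shape (n_in, n_out)
        n_in, n_out = self.mask.shape
        self.rng = np.random.default_rng(seed)
        # Independent one-hidden-layer MLP parameters for each output node i.
        self.W1 = self.rng.normal(0.0, 0.5, size=(n_out, n_in, hidden))
        self.b1 = np.zeros((n_out, hidden))
        self.W2 = self.rng.normal(0.0, 0.5, size=(n_out, hidden))
        self.noise_std = noise_std

    def predict(self, s, a, sample_noise=True):
        x = np.concatenate([s, a])                       # x_t = [s_t, a_t]
        out = np.empty(self.mask.shape[1])
        for i in range(self.mask.shape[1]):
            pa = x * self.mask[:, i]                     # keep only PA_i
            h = np.tanh(pa @ self.W1[i] + self.b1[i])    # hidden layer of f_i
            out[i] = h @ self.W2[i]
        if sample_noise:                                 # additive eps_i ~ N(0, noise_std^2)
            out = out + self.rng.normal(0.0, self.noise_std, size=out.shape)
        return out

# 2 state dims + 1 action dim -> 2 next-state dims.
# Edges: s^0 -> s^0', a -> s^0', s^1 -> s^1'.
mask = [[1, 0],
        [0, 1],
        [1, 0]]
model = CausalTransitionModel(mask)
print(model.predict(np.array([0.5, -0.2]), np.array([1.0])))
```

Each $f_i$ only sees its masked inputs. As a result, changing an input that is not a parent of node $i$ leaves that output unchanged, which is the property the causal graph is meant to enforce.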
Policy learning with planning
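- MPC with random shooting: sample many random action sequences, roll each one out through the learned causal model, execute the first action of the sequence with the highest predicted return, and re-plan at every step.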
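Random-shooting MPC can be sketched as follows. This version rests on several assumptions:
- `model(s, a)` and `reward(s, a, goal)` are caller-supplied stand-ins for the learned model and the reward. Their names are hypothetical.
- Actions are sampled uniformly in a box.
- The toy usage swaps the sparse reward for a dense distance reward. With a sparse reward, most sampled plans would tie at zero return.

```python
import numpy as np

def random_shooting_mpc(model, reward, s0, goal, act_dim, horizon=10,
                        n_samples=500, low=-1.0, high=1.0, gamma=1.0, seed=0):
    """Return the first action of the best of n_samples random action plans.

    Each plan is rolled out through `model` (imagined transitions only) and
    scored by its discounted sum of rewards.
    """
    rng = np.random.default_rng(seed)
    plans = rng.uniform(low, high, size=(n_samples, horizon, act_dim))
    returns = np.zeros(n_samples)
    for n in range(n_samples):
        s = np.asarray(s0, dtype=float)
        for t in range(horizon):
            a = plans[n, t]
            returns[n] += gamma ** t * reward(s, a, goal)
            s = model(s, a)                 # predicted next state, no real env step
    return plans[np.argmax(returns), 0]     # execute this, then re-plan

# Toy usage: 1-D point mass s' = s + a, dense reward -|s - goal|.
model = lambda s, a: s + a
reward = lambda s, a, g: -float(np.abs(s - g).sum())
a0 = random_shooting_mpc(model, reward, s0=np.zeros(1), goal=np.array([10.0]),
                         act_dim=1, horizon=5, n_samples=1000)
print(a0)  # first action of the best sampled plan
```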
Data-Efficient Causal Discovery
- Restrict the posterior over graphs to a point-mass distribution and use a threshold to control sparsity.
- Perform discovery from a classification perspective: a binary classifier decides, for each candidate edge, whether that edge exists.
- The threshold is applied to the p-value of the hypothesis test. A larger threshold imposes a stronger sparsity constraint and yields a sparser graph, since two nodes are more likely to be considered independent.
According to Definition 3, classification is only needed for edges connecting nodes at time step $t$ to nodes at time step $t+1$. If two such nodes are dependent, we add one edge directed from the node at $t$ to the node at $t+1$.
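The procedure can be sketched as follows, under explicit assumptions. The per-edge binary classifier is replaced by a hypothetical absolute Pearson correlation check, not the hypothesis test referenced above. `threshold` plays the role of the sparsity knob, so a larger value declares more pairs independent and gives a sparser graph. Only edges from $\{s_t, a_t\}$ to $s_{t+1}$ are tested, and each detected dependence becomes one edge pointing forward in time.

```python
import numpy as np

def discover_edges(s_t, a_t, s_next, threshold=0.3):
    """Point-mass estimate of G, restricted to edges {s_t, a_t} -> s_{t+1}.

    Stand-in classifier (not the paper's test): edge k -> i exists iff the
    absolute Pearson correlation between input k and output i > threshold.
    Returns mask of shape (n_in, n_out) with mask[k, i] = 1 for edge k -> i.
    """
    x = np.concatenate([s_t, a_t], axis=1)       # (N, n_in) nodes at time t
    n_in, n_out = x.shape[1], s_next.shape[1]
    mask = np.zeros((n_in, n_out), dtype=int)
    for k in range(n_in):
        for i in range(n_out):
            r = np.corrcoef(x[:, k], s_next[:, i])[0, 1]
            mask[k, i] = int(abs(r) > threshold)  # dependent -> forward edge
    return mask

# Synthetic transitions: s0' = s0 + a, s1' = s1, plus small noise.
rng = np.random.default_rng(0)
N = 2000
s_t = rng.normal(size=(N, 2))
a_t = rng.normal(size=(N, 1))
s_next = np.stack([s_t[:, 0] + a_t[:, 0], s_t[:, 1]], axis=1)
s_next += 0.1 * rng.normal(size=(N, 2))
mask = discover_edges(s_t, a_t, s_next)
print(mask)  # rows: s^0, s^1, a; columns: s^0', s^1'
```

Raising `threshold` removes the weaker dependencies first, trading recall of true edges for a sparser graph.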
Analysis of Performance Guarantee
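- The better the causal graph, the better the learned model.
- The better the learned model, the closer the value function is to optimal.
- Controlling the bound also requires a better policy, which is why model learning and policy learning are alternated.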
Summary & Thoughts
- Improves generalization by learning a causal transition model.
- Requires an explicit estimate of the causal graph, which makes training difficult.
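Problem: when the relationship does not exist, how can it be learned?
- If the causal graph is encoded implicitly, the approach is not much different from other methods.
- Is additional information needed to obtain it without that dependence? E.g., add the assumption that it is correlated with the current information.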