AI Paper: Entropic Policy Composition with Generalized Policy Improvement and Divergence Correction

Posted by mstester2011 on 2018-12-7 11:34:37
Deep reinforcement learning (RL) algorithms have made great strides in recent years. An important remaining challenge is the ability to quickly transfer existing skills to novel tasks, and to combine existing skills with newly acquired ones. In domains where tasks are solved by composing skills, this capacity holds the promise of dramatically reducing the data requirements of deep RL algorithms, and hence increasing their applicability. Recent work has studied ways of composing behaviors represented in the form of action-value functions. We analyze these methods to highlight their strengths and weaknesses, and point out situations where each of them is susceptible to poor performance. To perform this analysis we extend generalized policy improvement to the max-entropy framework and introduce a method for the practical implementation of successor features in continuous action spaces. We then propose a novel approach which, in principle, recovers the optimal policy during transfer. This method works by explicitly learning the (discounted, future) divergence between policies. We study this approach in the tabular case and propose a scalable variant that is applicable in multi-dimensional continuous action spaces. We compare our approach with existing ones on a range of non-trivial continuous control problems with compositional structure, and demonstrate qualitatively better performance despite not requiring simultaneous observation of all task rewards.
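As a rough illustration of the generalized policy improvement (GPI) step the abstract builds on, here is a minimal tabular sketch in Python. It assumes a set of action-value tables, one per previously learned task, and acts greedily with respect to their pointwise maximum; the names and array shapes are illustrative assumptions, not code from the paper, which extends this hard-max form to the max-entropy framework.

```python
import numpy as np

def gpi_action(q_tables, state):
    """Generalized policy improvement over tabular Q-functions.

    q_tables: list of arrays, each (n_states, n_actions), one per source
              task (illustrative assumption, not the paper's code).
    Returns the greedy action w.r.t. max over tasks of Q_i(state, a).
    """
    q_stack = np.stack([q[state] for q in q_tables])  # (n_tasks, n_actions)
    # Maximize over tasks first, then pick the best action.
    return int(np.argmax(q_stack.max(axis=0)))
```

The divergence-correction idea of "explicitly learning the (discounted, future) divergence between policies" can likewise be sketched in the tabular case. The recursion below computes a discounted future KL divergence by fixed-point iteration under known dynamics; the exact divergence and update rule used in the paper may differ, so treat this as a generic sketch of the quantity's shape rather than the authors' algorithm.

```python
def discounted_future_kl(pi1, pi2, P, gamma=0.99, iters=500):
    """Discounted future KL divergence between two tabular policies.

    pi1, pi2: (n_states, n_actions) action probabilities.
    P:        (n_states, n_actions, n_states) transition probabilities
              (assumed known here purely for illustration).
    Iterates D(s) = KL(pi1(.|s) || pi2(.|s))
                  + gamma * E_{a~pi1, s'~P}[D(s')].
    """
    # Per-state instantaneous KL between the two action distributions.
    kl = np.sum(pi1 * (np.log(pi1 + 1e-12) - np.log(pi2 + 1e-12)), axis=1)
    d = np.zeros_like(kl)
    for _ in range(iters):
        # Expected next-state divergence when actions are drawn from pi1.
        d = kl + gamma * np.einsum('sa,sat,t->s', pi1, P, d)
    return d
```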
Abstract: https://arxiv.org/abs/1812.02216 | PDF: https://arxiv.org/pdf/1812.02216