Deep Learning Paper: Risk-Averse Trust Region Optimization for Reward-Volatility Reduction

Posted by einter on 2019-12-9 13:16:07

Abstract: In real-world decision-making problems, for instance in the fields of finance, robotics, or autonomous driving, keeping uncertainty under control is as important as maximizing expected returns. Risk aversion has been addressed in the reinforcement learning literature through risk measures related to the variance of returns. However, in many cases, risk is measured not only from a long-term perspective, but also on the step-wise rewards (e.g., in trading, to ensure the stability of an investment bank, it is essential to monitor the risk of portfolio positions on a daily basis). In this paper, we define a novel measure of risk, which we call reward volatility, consisting of the variance of the rewards under the state-occupancy measure. We show that the reward volatility bounds the return variance, so that reducing the former also constrains the latter. We derive a policy gradient theorem with a new objective function that exploits the mean-volatility relationship, and develop an actor-only algorithm. Furthermore, thanks to the linearity of the Bellman equations defined under the new objective function, it is possible to adapt well-known policy gradient algorithms with monotonic improvement guarantees, such as TRPO, in a risk-averse manner. Finally, we test the proposed approach in two simulated financial environments.
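
In symbols, the key quantities read roughly as follows (a hedged reconstruction from the abstract alone; the paper's exact notation and discounting conventions may differ). Writing d_{\mu,\pi} for the state-action occupancy measure induced by policy \pi:

% LaTeX sketch of the abstract's definitions; all notation is assumed, not quoted from the paper
J_\pi     = \mathbb{E}_{(s,a) \sim d_{\mu,\pi}}\left[ r(s,a) \right]                          % expected per-step reward
\nu_\pi^2 = \mathbb{E}_{(s,a) \sim d_{\mu,\pi}}\left[ \left( r(s,a) - J_\pi \right)^2 \right] % reward volatility
\eta_\pi  = J_\pi - \lambda \, \nu_\pi^2                                                      % mean-volatility objective, \lambda \ge 0

Under these definitions, the claim that reward volatility bounds the return variance means that driving \nu_\pi^2 down also caps the variance of the discounted return; a scaling factor of the form 1/(1-\gamma)^2 is the natural candidate for that bound, though the exact constant should be taken from the paper itself.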
Paper: https://arxiv.org/abs/1912.03193
PDF: https://arxiv.org/pdf/1912.03193
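
To make the mean-volatility objective concrete, here is a minimal actor-only sketch in the spirit of the abstract's policy gradient approach; it is not the paper's actual algorithm. Everything in it is assumed for illustration: a toy one-state MDP with three actions, Gaussian rewards with hypothetical MEANS and STDS, a softmax policy, and a risk-aversion weight lambda_. It relies on the fact that the transformed reward r - lambda * (r - J)^2 has the mean-volatility objective eta = J - lambda * nu^2 as its expectation, so plain REINFORCE on the transformed reward ascends eta.

# Hedged sketch of an actor-only mean-volatility policy gradient
# (REINFORCE-style; the paper's algorithm may differ substantially).
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 3
MEANS = np.array([0.5, 0.8, 1.0])    # hypothetical per-action reward means
STDS = np.array([0.05, 0.10, 0.60])  # action 2: highest mean, but also riskiest

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train(lambda_=0.0, iters=2000, batch=256, lr=0.5):
    theta = np.zeros(N_ACTIONS)               # softmax policy logits
    for _ in range(iters):
        pi = softmax(theta)
        actions = rng.choice(N_ACTIONS, size=batch, p=pi)
        rewards = rng.normal(MEANS[actions], STDS[actions])
        # In-batch estimate of J_pi; reusing it below adds a small bias
        # that a more careful estimator would remove.
        J_hat = rewards.mean()
        # Transformed reward: its expectation is eta = J - lambda * nu^2.
        r_tilde = rewards - lambda_ * (rewards - J_hat) ** 2
        adv = r_tilde - r_tilde.mean()        # baseline for variance reduction
        # REINFORCE for a softmax policy: grad log pi(a) = one_hot(a) - pi
        grad = np.zeros(N_ACTIONS)
        for a, g in zip(actions, adv):
            grad[a] += g
            grad -= g * pi
        theta += lr * grad / batch
    return softmax(theta)

print("risk-neutral (lambda=0):", np.round(train(0.0), 3))
print("risk-averse  (lambda=2):", np.round(train(2.0), 3))

With lambda_ = 0 the policy should concentrate on the highest-mean but noisiest action; with lambda_ = 2 it should shift toward the lower-variance action instead, which is the mean-vs-volatility trade-off the paper targets in its simulated financial environments.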