(Repost) Some RL literature (and notes)
Published: 2021-08-25 00:12:08
Views: 1
Category: Technical Articles
This article is about 6,648 characters; estimated reading time 22 minutes.
Some RL literature (and notes)
copy from: https://zhuanlan.zhihu.com/p/25770890
Introductions
Introduction to reinforcement learning
ICML Tutorials:
NIPS Tutorials:
Deep Q-Learning
DQN: (and its Nature version)
Double DQN
Bootstrapped DQN
Prioritized Experience Replay
Dueling DQN
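As a concrete illustration of the Double DQN idea (decoupling action selection from action evaluation so the max operator does not overestimate), here is a minimal tabular sketch; the Q-tables and numbers are hypothetical, made up purely for illustration:

```python
def double_dqn_target(q_online, q_target, s_next, reward, gamma=0.99, done=False):
    """Double DQN target: the online network picks the argmax action,
    the target network evaluates it."""
    if done:
        return reward
    # action selection with the online network
    a_star = max(range(len(q_online[s_next])), key=lambda a: q_online[s_next][a])
    # action evaluation with the target network
    return reward + gamma * q_target[s_next][a_star]

# Tiny tabular example with made-up values:
q_online = {0: [1.0, 2.0], 1: [0.5, 3.0]}
q_target = {0: [1.1, 1.9], 1: [0.7, 2.5]}
y = double_dqn_target(q_online, q_target, s_next=1, reward=1.0, gamma=0.9)
# online net selects action 1 in state 1; target net evaluates it as 2.5
```

In vanilla DQN both selection and evaluation would use `q_target`, which systematically overestimates under noise; the decoupling above is the paper's entire fix.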
Classic Literature
Sutton's book
David Silver's thesis

Policy Gradient Methods for Reinforcement Learning with Function Approximation (policy gradient theorem)
1. A policy-based approach can be better than a value-based one: the policy function is smooth, while picking a policy from a value function is not continuous.
2. Policy gradient method. The objective function is averaged over the stationary distribution (starting from s0). For average reward, the distribution needs to be truly stationary. For the state-action (discounted) formulation, if all experience starts from s0, the objective is averaged over a discounted distribution (not necessarily fully stationary); if we start from an arbitrary state, the objective is averaged over the (discounted) stationary distribution. Policy gradient theorem: the gradient operator can "pass through" the state distribution, even though that distribution depends on the parameters (and at first glance should be differentiated too).
3. You can replace Q^\pi(s, a) with an approximation f(s, a), which is accurate only when f satisfies the compatibility condition \nabla_w f(s, a) = \nabla_\theta \pi(a|s) / \pi(a|s) = \nabla_\theta \log \pi(a|s). If \pi(a|s) is log-linear in some features, then f has to be linear in those features with \sum_a \pi(a|s) f(s, a) = 0 (so f is an advantage function).
4. First paper to show that an RL algorithm converges to a local optimum with a relatively free-form function estimator.

DAgger

Actor-Critic Models
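As a minimal illustration of the policy gradient theorem above, here is a hedged REINFORCE sketch on a hypothetical two-armed bandit: the "actor" alone, with the raw reward in place of a critic. For a softmax policy, d log pi(a) / d theta_k = 1[k = a] - pi(k), which is the gradient used below:

```python
import math
import random

def softmax(theta):
    """Softmax policy over logits theta."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

def reinforce_step(theta, rewards, lr=0.1, rng=random):
    """One REINFORCE update: ascend reward * grad log pi(a).
    For softmax, d log pi(a) / d theta_k = 1[k == a] - pi(k)."""
    pi = softmax(theta)
    a = rng.choices(range(len(theta)), weights=pi)[0]
    r = rewards[a]
    return [t + lr * r * ((1.0 if k == a else 0.0) - pi[k])
            for k, t in enumerate(theta)]

rng = random.Random(0)
theta = [0.0, 0.0]
rewards = [0.0, 1.0]            # arm 1 is better (made-up rewards)
for _ in range(2000):
    theta = reinforce_step(theta, rewards, rng=rng)
pi = softmax(theta)             # probability mass shifts to arm 1
```

The bandit has no state distribution, so the theorem's "pass-through" subtlety does not arise here; the sketch only shows the score-function update itself.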
Asynchronous Advantage Actor-Critic Model
Tensorpack's BatchA3C () and GA3C ()
Instead of using a separate model for each actor (in separate CPU threads), they process all the data generated by the actors with a single model, which is updated regularly via optimization.

On actor-critic algorithms
Only read the first part of the paper. It proves that actor-critic converges to a local optimum when the feature space used to linearly represent Q(s, a) also covers the space spanned by \nabla \log \pi(a|s) (the compatibility condition), and the actor learns more slowly than the critic.

Natural Actor-Critic
Natural gradient applied to the actor-critic method. When the compatibility condition from the policy gradient paper is satisfied (i.e., Q(s, a) is linear in \nabla \log \pi(a|s), so that the gradient estimated from this approximate Q equals the true gradient computed from the unknown exact Q of the current policy), the natural gradient of the policy's parameters is simply the linear coefficient of Q.

A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients
Covers the above two papers.

Continuous State/Action
Reinforcement Learning with Deep Energy-Based Policies
Uses the soft-Q formulation proposed by (in the math section) and naturally incorporates the entropy term into the Q-learning paradigm. For continuous action spaces, both training (updating the Bellman equation) and sampling from the resulting policy (expressed in terms of Q) are intractable. For the former, they propose a surrogate action distribution and compute the gradient with importance sampling. For the latter, they use a Stein variational method that matches a deterministic function a = f(e, s) to the learned Q-distribution. In terms of performance they are comparable with DDPG, but since the learned Q can be diverse (multimodal) under the maximum entropy principle, it can serve as a common initialization for many specific tasks (example: pretrain = learn to run in arbitrary directions; task = run in a maze).

Deterministic Policy Gradient Algorithms
Silver's paper. Learn an actor that predicts a deterministic action (rather than a conditional probability distribution \pi(a|s)) in Q-learning. When training with Q-learning, backpropagate through Q into \pi. Similar to the policy gradient theorem (the gradient operator can "pass through" the parameter-dependent state distribution), there is a deterministic version of the theorem. Also an interesting comparison with a stochastic off-policy actor-critic model (stochastic = \pi(a|s)).

Continuous control with deep reinforcement learning (DDPG)
Deep version of DPG (with the DQN tricks). Neural network + minibatch → not stable, so they also add a target network and a replay buffer.

Reward Shaping
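The target-network trick that stabilizes DDPG above is a soft (Polyak-averaged) update of the target weights toward the online weights, so the bootstrap target drifts slowly instead of jumping. A minimal sketch, with weights represented as plain lists purely for illustration (a real implementation applies this to every parameter tensor):

```python
def soft_update(target, online, tau=0.005):
    """Polyak averaging used by DDPG: target <- tau * online + (1 - tau) * target.
    Small tau makes the bootstrap target change smoothly."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

online = [1.0, -2.0, 0.5]        # made-up "online network" weights
target = [0.0, 0.0, 0.0]
for _ in range(1000):
    target = soft_update(target, online, tau=0.01)
# after many updates the target weights converge toward the online weights
```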
Policy invariance under reward transformations: theory and application to reward shaping.
Andrew Ng's reward shaping paper. It proves that the optimal policy is invariant under reward shaping if and only if the shaping term added to the reward is a difference of a potential function, F(s, s') = \gamma \Phi(s') - \Phi(s).

Theoretical considerations of potential-based reward shaping for multi-agent systems
Potential-based reward shaping can help a single agent reach the optimal solution without changing the value (or the Nash equilibrium). This paper extends the result to the multi-agent case.

Reinforcement Learning with Unsupervised Auxiliary Tasks
ICLR17 oral. Adds auxiliary tasks to improve performance on Atari games and navigation. The auxiliary tasks include maximizing pixel changes and maximizing the activation of individual neurons.

Navigation
Learning to Navigate in Complex Environments
Raia's group from DeepMind. ICLR17 poster: adds depth prediction as an auxiliary task to improve navigation performance (also uses SLAM results as network input) (see reward shaping above).

Deep Reinforcement Learning with Successor Features for Navigation across Similar Environments
Goal: navigation without SLAM. Learns successor features (Q and V before the last layer; these features satisfy a similar Bellman equation) for transfer learning: learn the k top-layer weights simultaneously while sharing the successor features, using DQN acting on the features. In addition to the successor features, they also try to reconstruct the frame. Experiments are in simulation.
State: 96x96 x the four most recent frames.
Action: four discrete actions (stay still, turn left, turn right, go straight 1 m).
Baseline: train a CNN to directly predict the actions of A*.

Deep Recurrent Q-Learning for Partially Observable MDPs
There is not much performance difference between stacked-frame DQN and DRQN. DRQN may be more robust when the game state flickers (some frames are zeroed out).

Counterfactual Regret Minimization
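The strategy-update rule at the heart of CFR is regret matching: at each information set, play every action with probability proportional to its positive cumulative counterfactual regret, and play uniformly when no action has positive regret. A minimal sketch (the regret values below are made up):

```python
def regret_matching(cum_regret):
    """Regret matching, CFR's per-information-set strategy update:
    probabilities proportional to positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    if total <= 0.0:
        return [1.0 / n] * n        # no positive regret: play uniformly
    return [p / total for p in positive]

sigma = regret_matching([3.0, -1.0, 1.0])   # -> favors the first action
```

Full CFR wraps this update in a tree walk that accumulates counterfactual regrets over iterations; the averaged strategy converges to a Nash equilibrium in two-player zero-sum games.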
Dynamic Thresholding
With proofs. Studies game-state abstraction and its effect on Leduc Poker.

Decomposition:
Solving Imperfect Information Games Using Decomposition
Safe and Nested Endgame Solving for Imperfect-Information Games

Game-specific RL
Atari Game
Go: AlphaGo, DarkForest
Super Smash Bros
Doom: Arnold, Intel, F1
Poker: Limit Texas hold 'em; No-limit Texas hold 'em (DeepStack)
Reposted from: https://blog.csdn.net/weixin_33836223/article/details/90219136