Preferential proximal policy optimization in reinforcement learning
Proximal Policy Optimization (PPO), a policy gradient method, excels in reinforcement learning with its "surrogate" objective function and stochastic gradient ascent. However, PPO does not fully account for the significance of frequently encountered states in policy/value updates. To address this, this thesis introduces Preferential Proximal Policy Optimization (P3O), which integrates the importance of these states into parameter updates. We determine state importance by multiplying the variance of the action probabilities by the value function, then normalizing and smoothing the result with an Exponentially Weighted Moving Average (EWMA). This calculated importance is incorporated into the surrogate objective function, redefining value and advantage estimation in PPO. Our method selects state importance automatically and can be applied to any on-policy reinforcement learning algorithm that uses a value function. Empirical evaluations across six Atari environments demonstrate that our approach outperforms the baseline (vanilla PPO) in each tested environment, highlighting the value of the proposed method for learning in complex environments.
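The state-importance signal described above can be sketched as follows. This is an illustrative reading of the abstract only: the function name, the min-max normalization, and the smoothing constant `alpha` are assumptions, not the thesis's definitive formulation.

```python
import numpy as np

def state_importance(action_probs, values, ewma=None, alpha=0.1, eps=1e-8):
    """Sketch of the P3O state-importance signal.

    action_probs: (batch, n_actions) policy probabilities per state
    values:       (batch,) value-function estimates per state
    ewma:         previous smoothed importance, or None on the first call
    Returns the smoothed importance, which also serves as the next `ewma`.
    """
    # Variance of the action distribution at each state
    var = action_probs.var(axis=1)
    # Weight the variance by the value function, as the abstract describes
    raw = var * values
    # Normalize to [0, 1] over the batch (one plausible normalization)
    norm = (raw - raw.min()) / (raw.max() - raw.min() + eps)
    # Smooth with an exponentially weighted moving average
    return norm if ewma is None else alpha * norm + (1 - alpha) * ewma
```

In a PPO-style update, the resulting per-state weights would then scale each state's contribution to the surrogate objective, so that frequently encountered, high-value, high-uncertainty states influence the gradient more.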