WandB: https://wandb.ai/arth-shukla/PPO%20Gym%20Cart%20Pole
Proximal Policy Optimization Algorithms: https://arxiv.org/pdf/1707.06347.pdf
Algorithms/Concepts: PPO, Experience Replay
AI Development: PyTorch (Torch, CUDA), OpenAI Gym, WandB
More episode videos available on WandB: https://wandb.ai/arth-shukla/PPO%20Gym%20Cart%20Pole
The PPO model currently only supports discrete action spaces (categorical distribution). In OpenAI Gym CartPole, by episode 136, the agent is able to effectively "beat" CartPole.
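For a rough picture of what a discrete-action PPO policy looks like, here is a minimal sketch of a categorical actor head and the clipped surrogate objective from the paper linked above. This is not the repo's code; the names (`DiscreteActor`, `ppo_clip_loss`) and the small MLP architecture are illustrative assumptions.

```python
# Minimal sketch (not this repo's code): a discrete-action PPO policy head.
# The actor outputs logits, a Categorical distribution samples actions, and
# stored log-probs feed the clipped surrogate objective from the PPO paper.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class DiscreteActor(nn.Module):  # illustrative name
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # raw logits, one per discrete action
        )

    def dist(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

def ppo_clip_loss(actor, obs, actions, old_log_probs, advantages, clip_eps=0.2):
    # Ratio of new to old action probabilities, clipped as in the PPO paper.
    new_log_probs = actor.dist(obs).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```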

First, I want to implement algorithms that came before PPO (DQNs, or earlier actor-critic algorithms like DDPG) to get a stronger understanding of the math. I'll also get a chance to make agents for popular environments like Mario.
I also want to tackle more challenging environments, like the DM Control Suite. To do this, I'll explore PPO for continuous action spaces (through normal distributions), other similarly effective algorithms like SAC, and models like RecurrentPPO which offer some implementation challenges.
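Moving to continuous action spaces mostly means swapping the distribution the actor outputs. A minimal sketch, assuming a state-independent log standard deviation (the `GaussianActor` name and architecture are illustrative, not from this repo):

```python
# Hedged sketch of a continuous-action policy head using a Normal distribution;
# names and architecture are illustrative assumptions, not code from this repo.
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),  # per-dimension action mean
        )
        # State-independent log std, a common choice for PPO on continuous control.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> Normal:
        return Normal(self.mean_net(obs), self.log_std.exp())

# Sampling and log-probs work as in the categorical case (summing log-probs over
# action dimensions), so the PPO clipped objective itself is unchanged.
```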
Finally, there are some other experience replay variants I'd like to implement, like Prioritized Experience Replay.
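For reference, here is a rough sketch of the proportional variant of Prioritized Experience Replay (Schaul et al., 2015), simplified to a flat array instead of a sum-tree. The class name, defaults, and structure are illustrative assumptions, not code from this repo.

```python
# Hedged sketch of proportional prioritized experience replay, simplified with a
# flat priority array instead of a sum-tree; not code from this repo.
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity: int, alpha: float = 0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.pos = 0

    def add(self, transition, priority: float = 1.0):
        # New transitions are typically given the current max priority.
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = priority ** self.alpha
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size: int, beta: float = 0.4):
        probs = self.priorities[: len(self.data)].astype(np.float64)
        probs /= probs.sum()
        idxs = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights correct for the non-uniform sampling.
        weights = (len(self.data) * probs[idxs]) ** (-beta)
        weights /= weights.max()
        return [self.data[i] for i in idxs], idxs, weights

    def update_priorities(self, idxs, new_priorities):
        self.priorities[idxs] = np.asarray(new_priorities) ** self.alpha
```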