Backprop-free Pong: PC + distributional Hebbian plasticity vs. PPO: 57% vs. 59%, ~1500 lines from scratch [P]

Our take

In this exploration, we look at how a fully bio-plausible agent built on distributional Hebbian plasticity fares against a PPO baseline in a custom Pong environment. The agent reaches 57% with no backpropagation, 2 percentage points behind PPO's 59%, and the author traces that gap to a specific failure mode: catastrophic forgetting during self-play under non-stationary opponent dynamics. Distributional value encoding improved stability but was not enough to match PPO. The study is a concrete illustration of the plasticity-stability dilemma in biologically inspired reinforcement learning.

Biologically plausible reinforcement learning (RL) agents are one route to connecting artificial intelligence with what is known about learning in the brain, and a classic game like Pong makes a compact test bed. The post below compares a Hebbian agent against a Proximal Policy Optimization (PPO) baseline. The performance gap is small (57% for the bio-agent versus 59% for PPO), but the more interesting result is where the gap comes from, which bears on how learning rules without backpropagation behave when the task keeps changing.

The study uses a custom Pong environment built on pygame without gym, and a PPO baseline written from scratch. In the Hebbian agents, the PPO policy is replaced with Hebbian value estimation, so that learning relies on local rules of the kind observed in biological systems. The results indicate that the bio-agent comes close to PPO but is held back by catastrophic forgetting during self-play. This points to a central challenge for biologically inspired models: balancing adaptability against stability. As the author puts it, “Hebbian rules that adapt fast forget fast,” a trade-off that becomes a serious obstacle in non-stationary environments.

The implications extend beyond academic curiosity. Agents that learn and adapt online without backpropagation could inform more robust AI systems, including ones used for data analytics and decision-making, though this experiment shows that such agents still need a remedy for forgetting before that promise holds in changing environments. This fits ongoing discussion in the AI community about building systems that are efficient and that also reflect the learning processes of the brain.

Looking ahead, these findings raise questions about the trajectory of AI development. Will the community invest in architectures that embody principles of biological learning, or stay with conventional backprop-based methods that currently score somewhat higher? The challenge is to keep innovating while addressing the forgetting problem observed in this study. How that question is answered will shape the tools built on these methods.

In conclusion, the pursuit of integrating biologically plausible principles into reinforcement learning not only enhances our understanding of AI but also prompts us to rethink the frameworks we use to build intelligent systems. With continued research and exploration, we may find pathways that not only narrow the performance gap but also introduce more resilient, adaptable, and ultimately human-centered AI solutions. The journey ahead is filled with potential, and it will be fascinating to observe how these insights translate into practical applications that empower users in their data-driven endeavors.

Wanted to see how close a fully bio-plausible agent could get to PPO on Pong.

Setup

- Custom Pong environment (pygame, no gym)
- PPO baseline: paper-faithful, from scratch
- Hebbian agent: PPO policy replaced with Hebbian value estimation
  - engineered features → 61%
- BioAgent: Predictive Coding for feature learning + distributional Hebbian plasticity for value (Dabney et al. 2020) → 57%

Zero backprop anywhere in the pipeline.
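For readers unfamiliar with predictive coding, here is a minimal single-layer sketch of how features can be learned from purely local updates. It is not taken from the repository, whose implementation may differ. The 4-d toy observations, the latent size, and all names (`predict`, `infer`, `learn_step`, `train`) and step sizes are assumptions chosen for illustration.

```python
import random

def predict(W, z):
    """Top-down prediction of the input from latent features: x_hat = W z."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]

def infer(W, x, n_steps=20, lr_z=0.1):
    """Settle the latent features z by reducing the local prediction error."""
    z = [0.0] * len(W[0])
    for _ in range(n_steps):
        err = [xi - pi for xi, pi in zip(x, predict(W, z))]
        for j in range(len(z)):
            # Each latent unit only sees errors weighted by its own synapses.
            z[j] += lr_z * sum(W[i][j] * err[i] for i in range(len(x)))
    return z

def learn_step(W, x, lr_w=0.05):
    """One local weight update: dW[i][j] = lr_w * error_i * z_j (no backprop)."""
    z = infer(W, x)
    err = [xi - pi for xi, pi in zip(x, predict(W, z))]
    for i in range(len(x)):
        for j in range(len(z)):
            W[i][j] += lr_w * err[i] * z[j]
    return sum(e * e for e in err)  # squared error measured before the update

def train(n_samples=1000, seed=0):
    rng = random.Random(seed)
    # Hypothetical 4-d observations lying on a 2-d subspace (toy stand-in for game state).
    u, v = [1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0]
    W = [[rng.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(4)]
    errors = []
    for _ in range(n_samples):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        x = [a * ui + b * vi for ui, vi in zip(u, v)]
        errors.append(learn_step(W, x))
    return W, errors

W, errors = train()
early = sum(errors[:20]) / 20
late = sum(errors[-100:]) / 100
print(f"mean squared error, first 20 samples: {early:.4f}")
print(f"mean squared error, last 100 samples: {late:.4f}")
```

The weight change for each synapse depends only on the error at its output unit and the activity at its input unit, which is what makes the rule local. The reconstruction error should fall as `W` comes to span the subspace the observations live on.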

Key observations

- The 2-point gap (57% vs. 59%) is real but small. The bottleneck wasn't the lack of backprop; it was catastrophic forgetting under non-stationary opponent dynamics during self-play.
- Distributional value encoding (à la Dabney) helped stability vs. a scalar Hebbian baseline, but not enough to match PPO under self-play.
- Self-play exposed the plasticity–stability dilemma hard: Hebbian rules that adapt fast forget fast. This is the real wall for bio-plausible RL in non-stationary settings.
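To make the last two points concrete, here is a toy sketch of distributional value channels with asymmetric learning rates, in the spirit of Dabney et al. 2020, run through an opponent switch. It is a stateless, reward-only simplification and not the repository's estimator. The `taus`, the `base_lr` values, the fixed win/loss schedules, and the function names are all assumptions made for illustration.

```python
def asymmetric_update(values, reward, taus, base_lr):
    """Update every value channel in place from its own prediction error."""
    for i, tau in enumerate(taus):
        delta = reward - values[i]  # channel-local prediction error
        # Optimistic channels (tau near 1) weight positive errors more,
        # pessimistic channels (tau near 0) weight negative errors more.
        lr = base_lr * (tau if delta > 0 else 1.0 - tau)
        values[i] += lr * delta

def run_opponent_switch(base_lr, taus=(0.1, 0.5, 0.9)):
    """Train against one opponent, then briefly against a stronger one."""
    values = [0.0] * len(taus)
    old_opponent = [1, 1, 1, 1, -1]     # agent wins 4 of every 5 points
    new_opponent = [1, -1, -1, -1, -1]  # agent wins 1 of every 5 points
    for t in range(3000):
        asymmetric_update(values, old_opponent[t % 5], taus, base_lr)
    before_switch = list(values)
    for t in range(50):
        asymmetric_update(values, new_opponent[t % 5], taus, base_lr)
    return before_switch, values

for base_lr in (0.02, 0.5):
    before, after = run_opponent_switch(base_lr)
    print(f"base_lr={base_lr}: before={[round(v, 2) for v in before]} "
          f"after={[round(v, 2) for v in after]}")
```

Before the switch, the channels spread out from pessimistic to optimistic, so together they encode more than a single mean. After 50 points against the new opponent, the slow learner's middle channel is still positive: it remembers the old opponent but has barely adapted. The fast learner's middle channel has flipped clearly negative: it has adapted, and the old estimate is gone. No single learning rate in this toy gets both.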

Not claiming novelty in the architecture; this is a from-scratch exploration of whether bio-plausible rules can handle a real RL task. Short answer: yes, mostly, with one clear failure mode.

Code: github.com/nilsleut/Biologically-Plausible-RL-Plays-Pong

Happy to answer questions about the PC implementation, the Hebbian value estimator, or the self-play setup.

submitted by /u/ConfusionSpiritual19