I Trained an AI to Beat Final Fight… Here’s What Happened [p]
Our take
![I Trained an AI to Beat Final Fight… Here’s What Happened [p]](https://external-preview.redd.it/dWZSKa_lMUvycB0q8xwsIkTgDpHLe-W2-Q_S7RwWucQ.jpeg?width=320&crop=smart&auto=webp&s=6a0cfa02507091d2949c8f6b2fcd59a254a23929)
The recent Reddit post by /u/AgeOfEmpires4AOE4 detailing the training of an AI agent to play *Final Fight* via Behavior Cloning offers a refreshingly honest look into the practical hurdles of imitation learning. The author’s straightforward account—from action space remapping and trajectory alignment bugs to the curious discrepancy between LSTM policy performance during evaluation and manual rollout—strips away the typical hype surrounding AI achievements. This project, and others like it such as the work on *Resident Evil* games documented in "Training an AI to play Resident Evil Requiem using Behavior Cloning + HG-DAgge [P]" and "[P] I trained an AI to play Resident Evil 4 Remake using Behavioral Cloning + LSTM," underscores a critical truth: applying even well-established algorithms to complex, partially observable environments remains a meticulous engineering challenge. The value here is not in a flawless victory over a 1990s arcade game, but in the transparent documentation of the friction between algorithm and reality.
The specific challenges raised are emblematic of the gap between benchmark results and a working system. Remapping a multi-binary action space to an emulator’s input is a mundane but essential translation layer, often glossed over in academic papers. The trajectory alignment issue (an observation/action offset bug) highlights how fragile dataset curation can be: if every action is paired with the frame before or after the one the player actually reacted to, the policy is trained on systematically wrong labels. The LSTM policy behaving differently under evaluation than in a manual rollout raises a question of evaluation methodology. The post does not give the cause; with recurrent policies, one plausible source of such a gap is that the two code paths carry or reset the hidden state differently. These issues go beyond routine bugs to questions about data quality, model robustness, and what “learning” from demonstrations means in practice. The author’s plan to extend pure Behavior Cloning with GAIL + PPO is a logical next step, aimed at the distribution shift that arises when the agent reaches states not present in the expert data.
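The observation/action offset problem is easy to show on a toy recording. The sketch below is not the author's code. It assumes a recorder that stores the initial observation plus one observation after every action, and the helper names `pair_transitions` and `off_by_one_pairs` are hypothetical.

```python
def pair_transitions(observations, actions):
    """Pair each action with the observation that was on screen when it was chosen.

    Assumption: the recorder stored the initial observation plus one
    observation after every action, so there is exactly one more
    observation than there are actions.
    """
    if len(observations) != len(actions) + 1:
        raise ValueError(
            f"expected {len(actions) + 1} observations, got {len(observations)}"
        )
    # observations[t] is what the player saw before pressing actions[t];
    # the final observation has no action and is dropped.
    return list(zip(observations[:-1], actions))


def off_by_one_pairs(observations, actions):
    """The buggy variant: each action is paired with the frame it produced."""
    return list(zip(observations[1:], actions))
```

With observations `["s0", "s1", "s2", "s3"]` and actions `["a0", "a1", "a2"]`, the first function pairs `a0` with `s0`, while the buggy variant pairs `a0` with `s1`. Both return lists of the same length, which is why this kind of bug can go unnoticed without an explicit check.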
Why does this meticulous, game-based experimentation matter to a broader audience? Because the core problems—learning from limited, imperfect demonstrations, managing partial observability, and stabilizing policy improvement—are identical to those in industrial robotics, autonomous vehicles, and personalized software automation. A robot learning a complex assembly task from human teleoperation will face the same state-alignment and action-repetition challenges as an AI learning to time a jump-kick in *Final Fight*. By using classic arcade games as a testbed, researchers create a transparent, reproducible, and low-stakes environment to debug these fundamental issues. This work pushes the field toward more sample-efficient and reliable imitation learning, which is crucial for applications where collecting vast amounts of expert data is prohibitively expensive or dangerous. The focus is on building systems that can generalize from a few expert trajectories, a necessity for bringing advanced AI into practical, real-world workflows.
The forward path is clear: improving the robustness of imitation learning requires innovations in dataset aggregation, uncertainty modeling, and hybrid objective functions that blend imitation with reinforcement signals. The author’s inquiry into best practices for transitioning from BC to PPO touches on a key strategic question—how do we best augment imperfect demonstrations with exploratory learning? The answer will determine how quickly these techniques move from nostalgic game-playing to transformative tools in productivity software and beyond. The next frontier is not just beating the game, but building agents that understand *why* a sequence of actions works, enabling them to adapt when the rules inevitably change.
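One simple way to blend imitation with a reinforcement signal is to add a behavior cloning term to whatever loss the RL algorithm already minimizes, and to shrink its weight over training. The sketch below illustrates that pattern only. It is not the GAIL + PPO setup the author plans, and the function names, the linear schedule, and the scalar `rl_loss` input are all assumptions made for illustration.

```python
import math


def bc_loss(action_probs, expert_action):
    """Negative log-likelihood of the expert's action under the policy."""
    return -math.log(action_probs[expert_action])


def bc_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly anneal the imitation weight from `start` to `end`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def hybrid_loss(rl_loss, action_probs, expert_action, step, total_steps):
    """RL loss plus an annealed behavior cloning penalty (illustrative only)."""
    weight = bc_weight(step, total_steps)
    return rl_loss + weight * bc_loss(action_probs, expert_action)
```

Early in training the imitation term dominates and keeps the policy close to the demonstrations; by the final step its weight reaches `end` and only the RL loss remains. The schedule shape and its endpoints would need tuning in practice.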
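Hey everyone, I’ve been experimenting with Behavior Cloning on a classic arcade game (Final Fight), and I wanted to share the results and get some feedback from the community. The setup is fairly simple: I trained an agent purely from demonstrations (no reward shaping initially), then evaluated how far it could go in the first stage. I also plan to extend this with GAIL + PPO to see how much performance improves beyond imitation. A couple of interesting challenges came up:

- Action space remapping (MultiBinary → emulator input)
- Trajectory alignment issues (obs/action offset bugs 😅)
- LSTM policy behaving differently under evaluation vs manual rollout
- Managing rollouts efficiently without loading everything into memory

The agent can already make some progress, but still struggles with consistency and survival. I’d love to hear thoughts on:

- Improving BC performance with limited trajectories
- Best practices for transitioning BC → PPO
- Handling partial observability in these environments

Here’s the code if you want to see the full process and results: notebooks-rl/final_fight at main · paulo101977/notebooks-rl

Any feedback is very welcome!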
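The action space remapping step (MultiBinary → emulator input) can be pictured as a small translation function. This is a minimal sketch under stated assumptions, not the author's implementation: the button order in `BUTTONS` and the rule that cancels opposite directions are both hypothetical, and a real emulator wrapper defines its own layout.

```python
# Hypothetical button order; the real layout depends on the emulator wrapper.
BUTTONS = ("UP", "DOWN", "LEFT", "RIGHT", "ATTACK", "JUMP")

# Directions that cannot be held together on a physical stick.
OPPOSITES = (("UP", "DOWN"), ("LEFT", "RIGHT"))


def multibinary_to_buttons(action):
    """Translate a 0/1 vector (one bit per button) into pressed button names."""
    if len(action) != len(BUTTONS):
        raise ValueError(f"expected {len(BUTTONS)} bits, got {len(action)}")
    pressed = {name for name, bit in zip(BUTTONS, action) if bit}
    # Assumed policy: if both directions of a pair are set, release both.
    for first, second in OPPOSITES:
        if first in pressed and second in pressed:
            pressed -= {first, second}
    return sorted(pressed)
```

For example, `multibinary_to_buttons([0, 0, 0, 1, 0, 1])` yields `["JUMP", "RIGHT"]`. Applying the same function when recording demonstrations and when running the policy keeps both sides on one action encoding.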
Related Articles
- I Trained an AI to Beat Final Fight… Here’s What Happened [P]
  Hey everyone, I’ve been experimenting with Behavior Cloning on a classic arcade game (Final Fight), and I wanted to share the results and get some feedback from the community. The setup is fairly simple: I trained an agent purely from demonstrations (no reward shaping initially), then evaluated how far it could go in the first stage. I also plan to extend this with GAIL + PPO to see how much performance improves beyond imitation. A couple of interesting challenges came up:
  - Action space remapping (MultiBinary → emulator input)
  - Trajectory alignment issues (obs/action offset bugs 😅)
  - LSTM policy behaving differently under evaluation vs manual rollout
  - Managing rollouts efficiently without loading everything into memory

  The agent can already make some progress, but still struggles with consistency and survival. I’d love to hear thoughts on:
  - Improving BC performance with limited trajectories
  - Best practices for transitioning BC → PPO
  - Handling partial observability in these environments

  Here’s the code if you want to see the full process and results: notebooks-rl/final_fight at main · paulo101977/notebooks-rl. Any feedback is very welcome! Submitted by /u/AgeOfEmpires4AOE4
- Training an AI to play Resident Evil Requiem using Behavior Cloning + HG-DAgge [P]
  Code of Project: https://github.com/paulo101977/notebooks-rl/tree/main/re_requiem
  I’ve been working on training an agent to play a segment of Resident Evil Requiem, focusing on a fast-paced, semi-linear escape sequence with enemies and time pressure. Instead of going fully reinforcement learning from scratch, I used a hybrid approach:
  - Behavior Cloning (BC) for initial policy learning from human demonstrations
  - HG-DAgger to iteratively improve performance and reduce compounding errors

  The environment is based on gameplay capture, where I map controller inputs into a discretized action space. Observations are extracted directly from frames (with some preprocessing), and the agent learns to mimic and then refine behavior over time. One of the main challenges was the instability early on, especially when the agent deviates slightly from the demonstrated trajectories (classic BC issue). HG-DAgger helped a lot by correcting those off-distribution states. Another tricky part was synchronizing actions with what’s actually happening on screen, since even small timing mismatches can completely break learning in this kind of game. After training, the agent is able to:
  - Navigate the sequence consistently
  - React to enemies in real time
  - Recover from small deviations (to some extent)

  I’m still experimenting with improving robustness and generalization (right now it’s quite specialized to this segment). Happy to share more details (training setup, preprocessing, action space, etc.) if anyone’s interested. Submitted by /u/AgeOfEmpires4AOE4
- [P] I trained an AI to play Resident Evil 4 Remake using Behavioral Cloning + LSTM
  I recorded gameplay trajectories in RE4's village (running, shooting, reloading, dodging) and used Behavioral Cloning to train a model to imitate my decisions. Added LSTM so the AI could carry memory across time steps, not just react to the current frame. The most interesting result: the AI handled single enemies reasonably well, but struggled with the fight-or-flee decision when multiple enemies were on screen simultaneously. That nuance was hard to imitate without more data. Full video breakdown on YouTube. Source code and notebooks here: https://github.com/paulo101977/notebooks-rl/tree/main/re4 Happy to answer questions about the approach. Submitted by /u/AgeOfEmpires4AOE4
- [P] I built an autonomous ML agent that runs experiments on tabular data indefinitely - inspired by Karpathy's AutoResearch
  Inspired by Andrej Karpathy's AutoResearch, I built a system where Claude Code acts as an autonomous ML researcher on tabular binary classification tasks (churn, conversion, etc.). You give it a dataset. It loops forever: analyze data, form hypothesis, edit code, run experiment, evaluate with expanding time windows (train on past, predict future - no leakage), keep or revert via git. It edits only 3 files - feature engineering, model hyperparams, and analysis code. Everything else is locked down. Edit: To clarify based on some comments, I am using this to solve the problem of finding new signals to add to the model, not trying to overfit a limited dataset. -end Edit-

  Key design decisions:
  - Introducing an analysis loop in addition to the experiment loop; this allows for better reflection and experimentation.
  - Optimizing for experiment throughput with a bunch of decisions: using LightGBM as the default model, limiting feature count and tree count, and locking a training run until it finishes.
  - Constrained editing surface: only 3 files + logs. No infrastructure changes, no package installs. Without this, the agent will eventually try to modify the evaluation code to "improve" its score.
  - Docker sandbox: the agent runs with full shell access (--dangerously-skip-permissions). The container keeps it contained.
  - Expanding time windows over k-fold: mean score across multiple temporal train/test splits.
  - Forced logging: every experiment gets a LOG.md entry (hypothesis, result, takeaway). Significant insights go to LEARNING.md. You can read the agent's reasoning after the fact.
  - Analysis primitives built in: univariate AUC, correlation pairs, null rates, feature importance, error analysis. The agent writes analysis code using these to save time; they also serve as initial suggestions for the first few analyses.

  What I learned building this:
  - Air-tight evaluation is essential for real improvement. This lesson hit me twice:
    - An earlier version didn't constrain which files the agent could edit, and it eventually changed the evaluation code to make "improvement" easier for itself.
    - K-fold validation was originally employed, and the agent found improvements that were actually data leakage and didn't hold out-of-time. After a painful manual inspection, I switched over to expanding time windows.
  - Do everything to protect experiment throughput. This lesson also hit twice:
    - Initially, I let the model run wild and was not very impressed when it barely ran 20 experiments overnight. It turned out the agent had engineered thousands of new features that slowed down training and crashed some runs due to the RAM limit. I added the feature count limit and tree count limit to make sure training time is reasonable.
    - Despite that, the agent still managed to crash or slow down training runs by putting many of them into background processes at the same time. A locking mechanism was implemented to prevent 2 experiments from being run at the same time. After this, the rate of progress increased to hundreds of runs per day.
  - Persistent memory is important: without forced logging, the agent would repeat experiments it already tried. The LOG.md and LEARNING.md system gives it memory across iterations.

  The code is open source (sanitized version): https://github.com/trantrikien239/autoresearch-tabular
  Of course it is done with Claude Code, but it has improved so much after rounds of iterations, including manual edits, so I think it's worth sharing. Submitted by /u/Pancake502
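
The expanding time windows described in the last related post (train on the past, predict the future) can be sketched as index splits over ordered time periods. This is an illustration under assumptions, not code from that project: the function name, its parameters, and the choice to grow the training window by one test block per split are all hypothetical.

```python
def expanding_window_splits(n_periods, min_train, test_size=1):
    """Return (train, test) index lists where training always precedes testing.

    Assumption: periods are already sorted by time and indexed 0..n_periods-1.
    Each split trains on every period before the test block, then the
    training window grows by one test block for the next split.
    """
    splits = []
    train_end = min_train
    while train_end + test_size <= n_periods:
        train = list(range(0, train_end))
        test = list(range(train_end, train_end + test_size))
        splits.append((train, test))
        train_end += test_size
    return splits
```

With five periods and a minimum of two training periods, this produces three splits, the last of which trains on periods 0 to 3 and tests on period 4. Because no test index ever precedes a training index, a feature that only works by peeking at the future cannot raise the averaged score, which a shuffled k-fold split does not guarantee.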