Multi-Agent Reinforcement Learning (MultiCarRacing-v0)
Teaching Cars to Think: My Reinforcement Learning Racing Journey
GitHub link: Multi-Agent Reinforcement Learning (MultiCarRacing-v0)
It started with frustration, not inspiration.
I was deep into my machine learning coursework, scrolling through Kaggle competitions and research papers, when I noticed a common theme: most reinforcement learning (RL) tutorials stop at Atari games. Pong, Breakout, maybe CartPole if you’re lucky.
Cool examples, sure—but they didn’t feel real.
I wanted something messy, dynamic, and closer to the real world. Something where the agent couldn’t just memorize screen pixels but actually had to react, adapt, and survive.
That’s when I found the CarRacing environment.
Why Racing Cars?
Car racing is chaotic and unforgiving. Every turn is a test:
Brake too late, and you’re off the track.
Accelerate too much, and you spin out.
Turn too little, and you never finish the lap.
It perfectly embodies what makes RL so powerful—learning to make a sequence of split-second decisions under uncertainty.
This was the challenge I wanted to tackle:
👉 Could I train AI agents, using RL, to learn racing strategies without any rules handed to them?
Setting Up the Track
I worked with two environments:
Single-agent mode (CarRacing-v2, Gymnasium): One car, one brain, grayscale vision (setup sketched after this list).
Multi-agent mode (custom repo): Two cars racing on the same track, each acting independently.
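Here's a minimal sketch of the single-agent setup, assuming Gymnasium 0.29-style wrappers (newer releases rename them to GrayscaleObservation and FrameStackObservation). Frame stacking comes up again in the lessons below:

```python
import numpy as np
import gymnasium as gym
from gymnasium.wrappers import GrayScaleObservation, FrameStack  # renamed in Gymnasium >= 1.0

# Discrete actions (no-op, steer left, steer right, gas, brake) so the same env works for DQN.
env = gym.make("CarRacing-v2", continuous=False)

# 96x96 RGB frames -> grayscale, then stack 4 frames so the agent can sense momentum.
env = GrayScaleObservation(env)
env = FrameStack(env, num_stack=4)

obs, info = env.reset(seed=0)
print(np.asarray(obs).shape)  # (4, 96, 96)
```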
If single-agent racing was about teaching one car to survive, multi-agent racing felt more like refereeing a duel.
When I first put two cars on the same track, something strange happened: either they both crawled along cautiously (to avoid penalties) or they crashed headlong into chaos. Clearly, they needed better incentives.
So I reshaped the reward system to act like a race official:
⏱️ Time Penalty: Every frame cost them -0.1 points. No stalling at the start line.
🏆 Progress Reward:
The leading car earned +1000/N per tile.
The trailing car earned +500/N per tile.
(where N is the total number of track tiles.)
This way, both cars stayed motivated, but the leader always had an edge—just like in real racing.
🚫 Off-Track Penalty: Going off track meant -100 points. A harsh but necessary rule to keep driving clean.
The result? The cars stopped loafing around and started racing. One would pull ahead, the other would chase, and both learned that the only way forward was—literally—forward.
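To make that scheme concrete, here's a minimal sketch of the per-step reward for one car. The function and argument names are hypothetical illustrations, not taken from the actual repo:

```python
def shaped_reward(rank, tiles_gained, went_off_track, n_tiles):
    """Per-step reward for one car under the scheme above.

    rank: 0 for the leading car, 1 for the trailing car (hypothetical convention).
    tiles_gained: number of new track tiles this car covered on this step.
    went_off_track: True if the car left the track this step.
    n_tiles: total number of track tiles (N).
    """
    reward = -0.1  # time penalty, charged every frame
    per_tile = 1000.0 / n_tiles if rank == 0 else 500.0 / n_tiles
    reward += per_tile * tiles_gained  # progress reward, bigger for the leader
    if went_off_track:
        reward -= 100.0  # off-track penalty
    return reward
```

In practice this logic would sit inside the environment's step(), evaluated once per car per frame.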
Models in the Pit Stop
I tested two core RL approaches:
1️⃣ Deep Q-Networks (DQN)
Good for discrete actions.
Pixel inputs → CNN → action-value estimation (architecture sketched below).
I even experimented with ResNet transfer learning and LSTM-ResNet hybrids.
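For reference, the plain CNN variant can be sketched as a Nature-DQN-style network in PyTorch (an illustrative sketch, not the ResNet or LSTM-ResNet hybrids):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Stacked grayscale frames in, one Q-value per discrete action out."""

    def __init__(self, n_frames=4, n_actions=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size from a dummy 96x96 frame stack
            n_flat = self.features(torch.zeros(1, n_frames, 96, 96)).shape[1]
        self.head = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU(), nn.Linear(512, n_actions))

    def forward(self, x):
        return self.head(self.features(x / 255.0))  # scale pixels from [0, 255] to [0, 1]
```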
2️⃣ Proximal Policy Optimization (PPO)
A policy-gradient method.
More stable learning curves.
Tried it in single-agent setups for comparison (a minimal training loop is sketched below).
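A single-agent PPO baseline is quick to set up with an off-the-shelf library; the snippet below assumes Stable-Baselines3 and a purely illustrative timestep budget:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# CnnPolicy consumes the 96x96x3 pixel observations directly.
env = gym.make("CarRacing-v2", continuous=False)

model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)  # illustrative budget
model.save("ppo_carracing")
```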
Race Results 🏁
Single-agent DQN: After ~2.5M steps (15 hours of CPU training), the agent reached an average reward of ~800, right in line with published results.
Single-agent PPO: Smoother early learning (~500 reward), but plateaued.
Multi-agent DQN: After ~52 hours, both cars learned reasonable policies (~400 reward each), but sometimes “fought” over track tiles instead of racing efficiently.
Lessons From the Track
Representation is everything: Frame stacking gave agents memory of momentum.
Algorithms trade off differently: DQN was more sample-efficient but less stable; PPO learned more steadily but plateaued at a lower reward.
Multi-agent is messy: Independent learners don’t naturally cooperate—you have to design incentives.
The Roadblocks
Q-value instability → fixed with a replay buffer plus a target network (sketched after this list).
Long CPU training → I had to optimize every preprocessing step.
Reward tuning → a balancing act between punishment and encouragement.
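For the first of those fixes, the core idea is that Q-values come from the online network while TD targets come from a frozen copy, with transitions drawn from a replay buffer. A minimal PyTorch sketch of one update step (the function and batch layout are illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on a batch sampled from the replay buffer."""
    states, actions, rewards, next_states, dones = batch  # actions as int64, dones as float
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)  # Q(s, a) from the online net
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values  # bootstrap from the frozen target net
        targets = rewards + gamma * (1.0 - dones) * next_q
    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Elsewhere in the training loop, sync the target network every few thousand steps:
# target_net.load_state_dict(q_net.state_dict())
```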
Why This Matters Beyond Racing
What I learned here isn’t just about cars in a simulator.
This applies to any system where decisions must be made in real time under uncertainty:
Autonomous vehicles avoiding crashes.
Robotics navigating dynamic warehouses.
Logistics systems optimizing deliveries and routes.
Reinforcement learning gives machines the ability to adapt, not just follow pre-coded instructions.
And racing, for me, was the perfect playground to test those limits.

