Reinforcement Learning for Autonomous Drone Delivery
The setup
Imagine a 6×6 grid. Somewhere on it: a drone, two packages, two delivery cells, a tornado (no-fly zone), wind-affected cells surrounding the tornado, and a couple of birds the drone shouldn't run into.
The drone gets to choose one action per time step — move up/down/left/right, pick up a package, drop one off, or hover. It earns reward for picking up packages (+25 each) and delivering them (+100 each). It loses reward for everything that goes wrong: entering the tornado (–100), colliding with a bird (–50), drifting into a wind cell (–10), each step it takes (–2), and any invalid action like moving off the grid or dropping in the wrong cell.
The agent's job is to figure out, from scratch — with no map, no instructions, no human demonstrations — how to maximize total reward. Which means: pick up both packages, deliver them to the right cells, take as few steps as possible, and don't crash.
🚁 ⬜ ⬜ ⬜ ⬜ 🛖
⬜ ⬜ ⬜ ⬜ ⬜ 🦅
⬜ ⬜ 🎁 ⬇ ⬜ ⬜
⬜ ⬜ ⬇ 🌪 ⬆ ⬜
⬜ ⬜ ⬜ ➡ ⬜ ⬜
🛖 🦅 ⬜ ⬜ ⬜ 🏠
🚁 drone · 🎁 package · 🌪 tornado · ⬇⬆➡ wind cells · 🦅 birds · 🏠 destination
Why this is interesting
Reinforcement learning is the part of ML that doesn't get labeled training data. It just gets to try things and see what works. The agent starts knowing nothing about the world, takes a bunch of random actions, occasionally stumbles into a reward, and gradually figures out — across thousands of episodes — what behaviors lead to good outcomes.
This problem is small enough to run thousands of episodes in seconds on a laptop, but rich enough to teach genuine RL ideas: state-action value tables, the explore-vs-exploit trade-off, the difference between deterministic and stochastic environments, and why some algorithms work better in noisy worlds than others.
State, actions, reward
A state captures everything the agent needs to act:
state = (row, col, carrying_P1, carrying_P2, delivered_P1, delivered_P2)
So roughly 6 × 6 × 2 × 2 × 2 × 2 = 576 possible states, though many are unreachable (you can't be carrying a package you've already delivered).
Seven actions per state: Up, Down, Left, Right, PickUp, DropOff, Stay. Invalid actions all incur penalties — the agent's free to try anything, but doing the wrong thing costs reward.
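In code, the representation can be as simple as a tuple for the state and a lazily filled dictionary for the Q-table. This is a rough sketch, not the project's actual implementation — the names are illustrative:

```python
from collections import defaultdict

ACTIONS = ["Up", "Down", "Left", "Right", "PickUp", "DropOff", "Stay"]

def make_state(row, col, carrying_p1, carrying_p2, delivered_p1, delivered_p2):
    """Pack the six state components into a hashable tuple, usable as a dict key."""
    return (row, col, carrying_p1, carrying_p2, delivered_p1, delivered_p2)

# One list of action-value estimates per state, created lazily and initialized
# to zero — unreachable states never take up space.
Q = defaultdict(lambda: [0.0] * len(ACTIONS))

start = make_state(0, 0, False, False, False, False)
print(Q[start])  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```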
The reward function is the actual specification of "what good behavior looks like":
| Event | Reward |
|---|---|
| Successful delivery | +100 |
| Valid pickup | +25 |
| Each step | –2 |
| Wind cell entry | –10 |
| Invalid action | –25 |
| Wrong drop-off | –50 |
| Bird collision | –50 |
| Tornado entry | –100 |
Tuning these numbers is half the work. Too small a step penalty and the agent wanders forever; too big and it refuses to explore. Too cheap a tornado and it accepts the hit for a shortcut; too expensive and it gets paranoid and won't even approach the tornado area.
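One thing that makes that tuning loop bearable is keeping every knob in a single config-style mapping that can be tweaked between runs. A sketch — the key names are mine, the values are from the table above:

```python
# All the reward-shaping knobs in one place, so a tuning run is a one-line change.
REWARDS = {
    "delivery": +100,
    "pickup": +25,
    "step": -2,
    "wind_cell": -10,
    "invalid_action": -25,
    "wrong_dropoff": -50,
    "bird_collision": -50,
    "tornado": -100,
}
```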
Two flavors of environment
Deterministic. The tornado, the birds, the packages, the wind cells — all in fixed positions. Same layout every episode. The agent's actions have perfectly predictable consequences. This is the warm-up.
Stochastic. The entire grid is randomized at the start of each episode. The tornado moves to a random neighboring cell every time step. The birds move randomly too. Wind cells push the drone in their direction with 70% probability and let it stay in place 30% of the time. This is the version that actually feels like a real-world delivery problem.
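The wind dynamics are the easiest part of the stochastic version to picture in code. A hypothetical sketch, assuming the 70/30 push behavior described above (the helper name and constants are mine):

```python
import random

GRID_SIZE = 6
WIND_PUSH_PROB = 0.7  # wind pushes 70% of the time, per the description above

def apply_wind(row, col, wind_dir):
    """wind_dir is a (d_row, d_col) offset, e.g. (1, 0) for a downward-pointing wind cell."""
    if random.random() < WIND_PUSH_PROB:
        row = min(max(row + wind_dir[0], 0), GRID_SIZE - 1)  # push, clamped to the grid
        col = min(max(col + wind_dir[1], 0), GRID_SIZE - 1)
    return row, col  # the other 30% of the time the drone holds its position
```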
The algorithms
Q-Learning is the tabular RL classic. The agent maintains a giant lookup table — Q[state][action] — that estimates how much total future reward it expects from taking each action in each state. The update rule is:
Q(s, a) ← Q(s, a) + α · [ r + γ · max_a' Q(s', a') − Q(s, a) ]
In plain English: "after seeing what happens, nudge my estimate of how good this action was, toward the reward I just got plus how good I think the next state is."
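That update is a couple of lines of code. A minimal sketch, assuming the dict-of-lists Q-table from earlier and illustrative (not tuned) hyperparameters:

```python
from collections import defaultdict

ALPHA = 0.1    # learning rate (illustrative value)
GAMMA = 0.99   # discount factor (illustrative value)
N_ACTIONS = 7

Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def q_update(state, action, reward, next_state):
    """One Q-Learning step: nudge Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[next_state])            # value of the greedy next action
    td_target = reward + GAMMA * best_next    # what the return "should" look like
    Q[state][action] += ALPHA * (td_target - Q[state][action])
```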
The agent uses ε-greedy exploration: with probability ε it picks a random action (to discover new strategies), and with probability 1 − ε it picks whatever action its table currently rates highest. ε starts at 1.0 (pure exploration) and decays slowly to 0.01 (almost pure exploitation) as the agent gets better.
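The exploration side fits in a few lines too. A sketch with an illustrative decay schedule (not the one I actually tuned), assuming the same Q-table shape as above:

```python
import random

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.01, 0.9995  # illustrative schedule
N_ACTIONS = 7

def choose_action(Q, state, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else the current best."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)   # explore
    values = Q[state]
    return values.index(max(values))         # exploit

def decay_epsilon(epsilon):
    """Multiplicative decay applied once per episode, floored at EPS_END."""
    return max(EPS_END, epsilon * EPS_DECAY)
```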
I also implemented Double Q-Learning, which keeps two Q-tables and uses one to pick the action and the other to evaluate it. The motivation is that standard Q-Learning has a known overestimation bias — it tends to think actions are better than they really are, because it uses the same noisy table to both choose and judge. Double Q-Learning is supposed to fix that.
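A sketch of the Double Q-Learning update under the same assumptions: a coin flip picks which table to update, the updated table selects the best next action, and the other table evaluates it.

```python
import random
from collections import defaultdict

N_ACTIONS, ALPHA, GAMMA = 7, 0.1, 0.99  # illustrative hyperparameters

Q_A = defaultdict(lambda: [0.0] * N_ACTIONS)
Q_B = defaultdict(lambda: [0.0] * N_ACTIONS)

def double_q_update(state, action, reward, next_state):
    """Update one table, using the *other* table to evaluate the chosen next action."""
    if random.random() < 0.5:
        Q_upd, Q_eval = Q_A, Q_B
    else:
        Q_upd, Q_eval = Q_B, Q_A
    best_next = Q_upd[next_state].index(max(Q_upd[next_state]))  # chosen by the updated table
    td_target = reward + GAMMA * Q_eval[next_state][best_next]   # ...valued by the other table
    Q_upd[state][action] += ALPHA * (td_target - Q_upd[state][action])
```

Note that each call touches only one of the two tables — the halved-data effect that shows up in the stochastic results below.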
It mostly does. Except when it doesn't.
What actually happened
Deterministic environment
Both algorithms converged to the optimal policy within ~3,000 episodes. Final reward per episode: +185 (= 2 × 100 for deliveries, minus around 15 for step costs). Average path length: ~25 steps. Success rate: 100%.
Stochastic environment
This is where the comparison gets interesting:
| Algorithm | Episodes | Test success rate | Avg reward |
|---|---|---|---|
| Q-Learning | 50,000 | 18 / 20 (~90%) | +135 |
| Double Q-Learning | 50,000 | 9 / 20 (~45%) | +130 |
In the noisy environment, standard Q-Learning outperformed Double Q-Learning by a wide margin, despite Double Q-Learning's theoretical advantage. The reason: Double Q-Learning only updates one of its two tables per step, so each table sees half as many updates. In a high-variance environment, that data-starvation effect outweighed the bias-reduction benefit.
This was a real lesson. The paper-perfect theoretical reason to use an algorithm doesn't always survive contact with a hard environment.
What I learned
- Reward shaping is most of the work. Without the step penalty, the agent wandered. Without a "don't repeat the same move twice" penalty, it would oscillate between two adjacent safe cells forever. The algorithm is the easy part; the reward function is where the actual design happens.
- Exploration is harder than it looks. Pure ε-greedy works, but only with a long enough decay schedule. Drop ε too fast and the agent locks in on the first half-decent strategy it finds.
- Tabular methods hit a ceiling fast. 576 states is fine for a lookup table. 576,000 wouldn't be. The natural next step is a Deep Q-Network (DQN), which replaces the table with a neural network that generalizes across states — but that's a different (much bigger) project.
- Theory and practice diverge in noisy environments. Double Q-Learning is theoretically nicer than Q-Learning. In practice, for this problem with this training budget, it was just slower to learn. RL is full of these "it depends" results, and you only find them by running the experiment.
What I'd build next
- DQN. Use a neural network as the Q-function approximator and stop needing one table entry per state.
- Hierarchical RL. Break the task into reusable sub-policies — "go to package", "avoid tornado", "deliver" — and learn a meta-policy that composes them.
- Continuous dynamics. Replace the grid with a continuous 2D plane and real drone physics (inertia, momentum, wind turbulence). That moves the problem from tabular RL into policy-gradient territory (PPO, A2C).
Stack
Python 3.11, NumPy, custom Gym-style environment, Q-Learning and Double Q-Learning implementations, matplotlib for training curves.