Multitask Preplay · Under review at PNAS · 2025

Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines

During offline replay, humans and AI agents simulate goals they did not pursue — caching solutions into a shared predictive representation that explains a counterintuitive human bias and improves RL generalization to thousands of unseen worlds.

Wilka Carvalho¹ · Sam Hall-McMaster² · Honglak Lee³,⁴ · Samuel J. Gershman¹,²
¹Kempner Institute, Harvard · ²Psychology & CBS, Harvard · ³CSE, Michigan · ⁴LG AI Research
Demo. While pursuing one task in our 2D Minecraft (Craftax) environment, an agent encounters accessible goals it doesn't pursue. During offline replay, it preplays — counterfactually simulates — pursuing those goals, caching predictions for them. When a new task arrives, behavior on it is fast and reactive, as if the route had been pre-rehearsed.
TLDR

People are faster at new tasks when they reuse an old path — even when the reused path is longer than the optimal shortcut. We argue this happens because, during offline replay, humans and AI agents simulate goals they didn't pursue — caching solutions into a shared predictive representation. The same algorithm explains a behavioral bias in people and scales an RL agent to 10,000 unseen Craftax worlds.

  1. Algorithm. Multitask Preplay: replay one task; preplay the goals you observed but didn't pursue; cache the resulting trajectories into a shared predictive representation.
  2. Behavioral evidence. Across grid-world and Craftax experiments (n ≈ 100 each, 5 pre-registered predictions), people behave as if they have preemptively rehearsed goals before encountering them — faster first responses at familiar junctures en route to an unannounced goal, and partial path reuse even when the new shortcut is shorter.
  3. AI scaling. Beats Dyna, Universal Value Functions, Universal Successor Features, and Hindsight Experience Replay on transfer to 10,000 held-out Craftax environments.
  4. Bridging. One computational idea — offline counterfactual simulation — connects human behavioral biases to RL generalization. AI and cognitive science can keep informing each other on this template.
The Idea

One mechanism. Two puzzles.

Do RL agents really need to replan from scratch for every new goal? Why do rodents "preplay" goals they haven't pursued?

While searching for coffee in a new neighborhood, you may come across gyms, grocery stores, and parks — enabling you to quickly find these locations later when you need them. You weren't told to look for the grocery store. But when the goal becomes "go to the grocery store," your behavior is fast and reactive, as if you'd planned the route already.

We hypothesize this happens because, during offline replay, you imagine pursuing goals you encountered along the way but didn't actually pursue. Those counterfactual trajectories are cached into a single value function that covers many goals. We call it Multitask Preplay — a nod to hippocampal preplay, where place cells fire for places the animal hasn't been yet. The underlying RL idea goes back to Kaelbling in the 90s: use one experience to learn about many goals. We wanted to see what it looks like at scale.

Multitask Preplay generalizes Dyna, which caches predictions only for the goal you pursued. It sits between Dyna (model-based, single-task) and successor features (predictive representations across goals), combining what each does well.
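To make the loop concrete, here is a minimal Python-style sketch of one replay-plus-preplay update. The names (replay_buffer, world_model, rollout, q_fn) are illustrative placeholders rather than the paper's API; only the structure — replay, counterfactual rollouts to unpursued goals, one shared goal-conditioned value function — follows the description above.

def multitask_preplay_update(q_fn, world_model, replay_buffer, rollout, num_preplays=4):
    # 1. Replay a stored trajectory and collect goals that were visible
    #    along the way but never pursued.
    trajectory = replay_buffer.sample()
    unpursued_goals = trajectory.visible_but_unpursued_goals()

    # 2. Preplay: from states on the replayed trajectory, counterfactually
    #    roll out the world model toward each unpursued goal, acting greedily
    #    for that goal.
    simulated = []
    for goal in unpursued_goals[:num_preplays]:
        start_state = trajectory.sample_state()
        simulated.append(rollout(world_model, q_fn, start_state, goal))

    # 3. Cache: train a single goal-conditioned value function Q(s, a, g) on
    #    the real trajectory (labelled with its own goal) and on every
    #    preplayed trajectory (labelled with the counterfactual goal).
    for traj in [trajectory] + simulated:
        q_fn.update(traj, goal=traj.goal)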

Schematic of Multitask Preplay. While pursuing goal g, the agent observes accessible goal g' along the trajectory; offline, it preplays a counterfactual rollout to g' and caches predictions for both into a shared value function.
Figure 1. Multitask Preplay overview. (1) The agent pursues a task and observes other accessible goals along the way. (2) During offline replay, the agent counterfactually simulates pursuit of those other goals. (3) Predictions for all goals are cached into a shared representation. (4) When an unannounced goal becomes the actual task, behavior is fast and reactive — no new experience or planning required.
Two-axis categorization of RL methods. Top: a 2x3 grid mapping number-of-predictions-per-state (single reward vs. multiple) against compute strategy (model-free, model-based background learning, decision-time planning). Multitask Preplay (Ours) sits in the multi-prediction model-based-background cell. Bottom: gridworld behavior for Q-learning, Successor Features / HER, Dyna, and Multitask Preplay, showing what each caches when the agent has experienced the path to a coffee shop while observing food and grocery store along the way.
Figure 2. Where Multitask Preplay fits among RL methods. Top: Two axes — number of predictions per state (1 reward vs. multiple) × compute strategy (model-free, model-based background learning, decision-time planning). Yellow methods cache the results of computation; purple methods leverage model-based planning. Bottom: Behavior in a gridworld where the agent walked to a coffee shop while passing food and a grocery store. (a) Q-learning caches a single reward prediction along the experienced path. (b) Successor Features / HER cache predictions for multiple goals — but only those whose features were experienced. (c) Dyna simulates extra coffee-shop trajectories to refine the same single prediction. (d) Multitask Preplay simulates trajectories to accessible-but-unpursued goals (food, groceries), caching a predictive map that covers all three.
Behavioral Evidence

People behave as if they've rehearsed goals before they encounter them.

We tested this in a setting where training and test involved distinct tasks. Despite only ever practicing the training tasks, people behave as if they've rehearsed goals they were never told about. Spawn the same person at a familiar juncture and point them at a novel goal — they're faster than from an unfamiliar start the same distance away. Only Multitask Preplay predicts this.

Across grid-world (JaxMaze) and 2D Minecraft (Craftax) studies, we test 5 pre-registered predictions. The headline finding: the same person is faster at the same novel task when starting from a familiar location than from an unfamiliar one — even when both starts are equally far from the goal. Multitask Preplay explains why: when a person passed that location during training, they preplayed routes from there to nearby goals, so the route is already cached when the test arrives. This is the opposite of what model-free, decision-time planning, or hindsight-based methods predict.

Demo. A subject performing a trial in our juncture-spawn experiment. Spawned at a familiar juncture from training, the subject is asked to reach a novel goal. Their first response from the familiar juncture is faster than when the same subject is spawned at an unfamiliar location the same distance from the goal.
Four panels (I, J, K, L). I-K: JaxMaze layouts showing the same eval target reached from a familiar juncture (Near, Known), the same juncture with the eval goal hidden (Near, Unknown), and a novel distance-matched location (Far, Known). L: bar chart of delta log first-RT for each spawn condition; novel-spawn is significantly slower than juncture-spawn for both known and unknown evaluation goals.
Figure 3. Juncture-spawn experiment (JaxMaze). Subjects are spawned at one of three locations during evaluation: (I) a familiar juncture from training, with the test goal announced; (J) the same familiar juncture, but the test goal is unannounced and must be inferred from the available cues; (K) an unfamiliar location, distance-matched to the goal. (L) Δ log first-RT is significantly elevated when spawning from unfamiliar locations vs. familiar junctures (***), even when the test goal is unknown. Subjects behave as if they have rehearsed the path to those goals before they encounter them as the announced task.

The juncture-spawn advantage is one signature; partial path reuse is another. Give a person a new goal near an old route, and they don't take the optimal shortcut — they partially reuse the trained path. We observe this even when the reused path is longer than the available shortcut: in our path-reuse experiment, the reused-path length is 61 steps versus a shortcut of 55, and subjects still partially reuse the trained path 62.8% of the time, with lower step-by-step response times on the reused segment. To test whether this is a generic "habit" effect or something specific to multi-goal preplay, we benchmarked humans against a panel of RL methods in Craftax. Humans and Multitask Preplay cluster at high success and high path reuse; Dyna, UVF, USFA, and HER all fall away.

Three panels (A, B, C). A: a procedurally generated 2D Minecraft (Craftax) overworld with the agent spawn, training stones marked by stars, and an evaluation stone visible nearby. B: the agent's partial view (foreground panel) and inventory bar. C: scatter plot of path-reuse rate (X) vs. success rate (Y); humans (blue: known eval goal; orange: unknown eval goal) cluster at high success and high path-reuse, with Multitask Preplay alongside; Dyna, Universal Value Function, Universal Landmark Successor Features, and Hindsight Experience Replay all sit far below — they generalize but don't reuse paths.
Figure 4. 2D Minecraft (Craftax) behavioral experiment. (A) Subjects learn to obtain training stones across procedurally generated maps; an evaluation stone is visibly nearby in some maps. (B) During evaluation, subjects only see a partial view of the world plus their inventory. (C) Generalization success vs. partial path reuse, by model. Humans cluster at high success and high partial path-reuse — both when the eval goal is known (blue) and unknown (orange). Of all RL methods tested, only Multitask Preplay lands in the same region. Dyna, UVF, USFA, and HER all fall well below: they generalize but don't partially reuse paths the way humans do.
Results

Only humans and Multitask Preplay clear the cliff.

The comparison we found most telling: in Craftax, every baseline — Dyna, Universal Value Functions, Universal Successor Features, Hindsight Experience Replay — trains to near-perfect success. At test, only humans and Multitask Preplay transfer. Every other method collapses.

Train and test panels side-by-side. All methods reach near-perfect performance during training. At test, only Human and Multitask Preplay bars stand; all other RL baselines fall to near zero.
Figure 5. Train vs. test on Craftax. Train (left): all methods, including humans, reach near-ceiling performance. Test (right): only humans and Multitask Preplay transfer. Every other RL baseline collapses. Humans and AI are evaluated on the same test set.

One agent, 10,000 new worlds.

The same algorithm scales further. Trained on a finite set of environments, Multitask Preplay generalizes better to 10,000 unseen Craftax worlds, and keeps improving as more training environments are added; Dyna and model-free methods plateau long before that.

Generalization performance to 10,000 unique held-out environments as a function of training environment count. Multitask Preplay continues to scale; Dyna and model-free baselines plateau.
Figure 6. AI generalization to 10,000 held-out Craftax environments. Multitask Preplay continues to improve as the training-environment count increases. Dyna (1M training steps) and model-free methods (10M training steps) plateau long before. Each point: mean ± SE across 5 model initializations.

Two signatures in different systems, one computational idea underneath.

The same trick that explains a behavioral bias in people also scales an RL agent to unseen worlds. We're hopeful this is a template for how AI and cognitive science can keep informing each other: a computational-level claim that constrains both the algorithm space (in RL) and the hypothesis space (in cognitive science).

Limitations

What this work doesn't do.

You need a reasonably accurate world model to preplay with. Multitask Preplay assumes access to a model that can roll out counterfactual goals during offline replay. We use ground-truth simulators in this paper. Whether the result holds when the world model itself is learned online — and how preplay interacts with a noisy or biased model — is not established here.

Tested on 2D environments with discrete tasks. Our results span gridworlds and 2D Minecraft. Whether the same mechanism scales to high-dimensional continuous control or to compositionally complex task structures (Habitat-class transfer) is the natural next step, but it's not what we showed.

Behavioral evidence is consistent with preplay, not unique to it. Our behavioral predictions are derived from Multitask Preplay; the data are consistent with the algorithm. Other algorithms that share its core properties (e.g., off-policy multi-goal value caching with TD learning) may make similar predictions. The strongest part of the bridge claim is that, among the baselines we compared, Multitask Preplay alone matches both human path-reuse rates and the train/test cliff.

What's Next

Where this goes.

Next we want to see what happens when the world model is learned online — for example, with MuZero — and to scale to domains like Habitat, where agents need to perform inter-related tasks in unseen homes.

Scaling there will lean on two corrections we developed along the way: off-task Q(λ) and conservative all-goals learning. They're what let us match the human data — and what we think will unlock Habitat-class transfer.

Methods (for the technically curious)

Off-task Q(λ). Propagates multi-step returns across goals the agent didn't pursue — a correction on Peng's Q(λ), applied on the goal axis instead of the policy axis. The trace is cut when the greedy actions for the off-task goal diverge from the actions actually taken, preventing erroneous backup of value from on-task rewards that don't apply to the off-task goal.
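
As an illustration, here is a minimal NumPy sketch of that backward recursion with the goal-axis trace cut. The array shapes and names are our own simplification of the idea, not the paper's implementation.

import numpy as np

def off_task_lambda_targets(q, actions, rewards_g, discounts, lam):
    """Backward recursion for off-task Q(lambda) targets along one trajectory.

    q         : [T+1, A]  Q(s_t, ., g') evaluated for the off-task goal g'
    actions   : [T]       actions actually taken while pursuing the on-task goal
    rewards_g : [T]       rewards relabelled for the off-task goal g'
    discounts : [T]       per-step discount (0 at episode termination)
    lam       : float     trace-decay parameter
    Returns   : [T]       bootstrapped targets for Q(s_t, a_t, g')
    """
    T = len(actions)
    targets = np.zeros(T)
    next_return = q[T].max()          # greedy bootstrap at the final state
    for t in reversed(range(T)):
        if t + 1 < T:
            next_value = q[t + 1].max()
            # Cut the trace when the action actually taken at t+1 diverges
            # from the greedy action for the off-task goal: the rest of the
            # trajectory was generated for a different goal, so its sampled
            # return should not be backed up for g'.
            if actions[t + 1] != q[t + 1].argmax():
                next_return = next_value
            blended = (1.0 - lam) * next_value + lam * next_return
        else:
            blended = next_return     # last step: bootstrap only
        targets[t] = rewards_g[t] + discounts[t] * blended
        next_return = targets[t]
    return targets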

Conservative all-goals learning. Keeps value estimates from drifting when extrapolating to unseen goals. Penalizes Q-values for actions absent from the data, suppressing inflated estimates while still propagating value from the true off-task reward.
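
A common way to realize such a penalty — and the assumed form we sketch below; the paper's exact loss may differ — is a CQL-style regularizer that pushes down a soft maximum over all actions at a state while pushing up the action actually observed, so only data-supported actions keep high estimates.

import numpy as np

def conservative_penalty(q_row, action_taken, alpha=1.0):
    """CQL-style regularizer on one state's Q-values (an assumed form, for
    illustration). Added per state to the TD loss, it suppresses inflated
    estimates for unseen actions while the TD target still propagates value
    from the true off-task reward.

    q_row        : [A]  Q(s, ., g') for one state and the off-task goal
    action_taken : int  index of the action observed in the data
    alpha        : float penalty weight
    """
    # Numerically stable log-sum-exp over all actions (soft maximum).
    soft_max = np.log(np.exp(q_row - q_row.max()).sum()) + q_row.max()
    return alpha * (soft_max - q_row[action_taken])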

A 3×3 grid comparing three losses (L_Qλ, L_OQλ, L_COQλ) across three trajectory cases. Top row: when greedy off-task actions match online actions, all three losses correctly back up value (green check). Middle row: when greedy off-task actions diverge from online actions, plain Q(λ) erroneously propagates on-task reward (red X) while off-task Q(λ) and the conservative variant cut the trace correctly. Bottom row: when greedy off-task actions are suboptimal for on-task data (overestimated Q-values), plain Q(λ) backs up the inflated value (red X), off-task Q(λ) backs up only from the inflated estimate (red X), and only the conservative variant suppresses the overestimate while still propagating from the true off-task reward (green check).
Figure M1. Why both corrections are needed. Each row depicts a trajectory τ on the left, with backup paths under three losses on the right: Q(λ), off-task Q(λ), and conservative off-task Q(λ). Top: when the off-task greedy policy matches the online actions, all three losses correctly propagate value from the rewarding state. Middle: when those greedy actions diverge, plain Q(λ) erroneously backs up on-task reward; off-task Q(λ) and its conservative variant cut the trace and propagate only from the off-task rewarding state. Bottom: when an unseen action's Q-value is overestimated, both Q(λ) variants back up from that inflated value; only the conservative variant suppresses inflated estimates while still propagating from the true reward.

Both corrections are drop-in on a Dyna backbone. Full derivations + ablations in §Methods of the paper. Reference implementation in github.com/wcarvalho/multitask_preplay.

Cite

BibTeX

@article{carvalho2025preemptive,
  author    = {Carvalho, Wilka and Hall-McMaster, Sam and Lee, Honglak and Gershman, Samuel J.},
  title     = {Preemptive Solving of Future Problems: Multitask Preplay in Humans and Machines},
  journal   = {arXiv preprint arXiv:2507.05561},
  year      = {2025},
}