Banyan ยท Preprint

Task diversity produces systematic transfer
but inhibits continual reinforcement learning

Task variety is supposed to make reinforcement learning agents generalize. Across a changing sequence of tasks, we find the reverse: past a point, more variety makes an agent plateau at a lower level. It keeps learning each new set of tasks, but stops gaining from one set to the next.

Purab Seth1,2,* Neil Shah1,2,* Kunal Jha3 Samuel J. Gershman1,4 Max Kleiman-Weiner3 Wilka Carvalho1 โœ‰
*Equal contribution 1Kempner Institute · Harvard 2Harvard College 3University of Washington 4Psychology & CBS · Harvard
๐Ÿ“„  Paper ๐Ÿ’ป  Code (soon) ๐Ÿงต  Thread (soon)

Each Banyan world is a map the agent moves through, paired with a goal that splits into sub-goals arranged as a tree. The agent sees only the top goal, never the tree, and has to discover the order things must be done.

The setup

Variety is the standard route to generalization

In machine learning, a striking pattern has emerged across settings: training on diverse task distributions consistently enables generalization to new ones, without additional training. Tobin et al. (2017) demonstrated this first in robotics, showing that varying an environment's visual and physical properties during simulation training was sufficient for policies to transfer to real robots. At larger scale, the Open Ended Learning Team (2021) trained agents across a vast space of games and found they solved millions of unseen games with no extra training. Jha et al. (2025) extended this logic to multi-agent coordination, showing that diverse task sets during training enabled zero-shot coordination with entirely new partners. Each result points the same direction: the breadth of training experience, not its depth on any single task, is what drives generalization.

In cognitive science, Fodor and Pylyshyn (1988) identified systematicity as a core property of cognition that connectionist models were thought to lack. Hill et al. (2019) provided early evidence that diversity in the training environment, not architecture, drives systematicity in situated agents. Lake and Baroni (2023) then answered the challenge definitively, showing that the right training task distribution induces human-like systematic generalization in neural networks, with direct comparisons to human behavior. This body of evidence converges on a clear principle: task diversity is the lever. What it leaves open is whether that lever works only at training time.

However, almost all of this evidence tests an agent after training has stopped, with its weights frozen. Far less is known about what task variety does while the agent is still learning, as it moves through one new environment after another. Banyan is built to study exactly that.

The idea

Banyan: turn the ingredients of a task into dials

A Banyan task is a small world the agent moves through, plus a goal. The goal splits into sub-goals arranged as a tree. We control three ingredients independently: the map layouts (๐•ƒ) the agent navigates and the task-tree topologies (๐•‹) that set how sub-goals depend on one another. A third axis, task-tree object assignments (๐•†), decides which concrete objects fill each sub-goal. This design lets us pin an effect on a specific kind of variety rather than on variety in general.

Overview of Banyan: a task is a map plus a goal that splits into sub-goals arranged as a tree; the layouts, task-tree topologies, and object assignments can each be dialed independently; procedural generation yields billions of tasks.
Figure 1 A Banyan task is a map plus a goal that splits into sub-goals arranged as a tree. Three ingredients can be dialed on their own: the layouts (๐•ƒ) the agent navigates, the task-tree topologies (๐•‹) that set how sub-goals depend on one another, and the object assignments (๐•†) that fill them. Procedural generation yields billions of tasks.
One distribution shift

Across a single distribution shift, diversity produces systematic transfer

Start by changing the agent's tasks just once, from a first set to a second. The moment the tasks change, performance usually drops and the agent has to relearn. We call the size of that drop the transfer gap. Increasing the variety along any one of the three ingredients shrinks this gap toward zero. The agent begins the second set near the performance it reached on the first, even when the best strategy for the two sets is different. This holds for both PPO and PQN.

Plots showing the transfer gap shrinking toward zero as variety increases along layouts, object assignments, and topologies, for both PPO and PQN.
Figure 2 More variety along any one ingredient shrinks the transfer gap toward zero: the agent begins a new set of tasks near where it ended the last. Shown for both PPO (top) and PQN (bottom).

Only varying the number of tree shapes improves backward transfer

Backward transfer reveals a sharper asymmetry. Expanding the number of layouts or object assignments leaves performance on earlier tasks unchanged. Changing topology diversity is the exception: adding more topologies produces a large gain. This pattern holds for both PPO and PQN. What the agent carries forward is compositional structure, not surface detail.

Backward transfer to the first set of tasks improves as the number of topologies increases, but stays flat for layouts and object assignments.
Figure 3 Only varying the number of topologies improves backward transfer to the first set of tasks; varying layouts or object assignments does not. Shown for both PPO (top) and PQN (bottom).

The effect is not an artifact of grid-worlds

We built a continuous-control version of Banyan where the agent drives a point mass to push objects around. The effect persists: more variety still improves both initial performance on new tasks and retention of old ones.

Results in the continuous-control point-mass version of Banyan, showing the same variety effect on starting new tasks and holding onto old ones.
Figure 4 The same effect appears in a continuous-control version of Banyan, where the agent drives a point mass into objects. More variety improves how it both starts on new tasks and holds onto old ones.
The paradox

Across ten sets of tasks, variety holds the agent back

We jointly vary the number of layouts and the number of topologies across ten successive task sets, training a single agent through all ten in turn. We track the success rate the agent achieves by the end of its tenth set. Plotting that success rate against task diversity reveals an inversion: performance climbs to a peak at a moderate level of diversity, then falls as diversity increases further.

At that middle amount of variety (16 layouts by 16 topologies) the final success climbs with each new set of tasks added to training. Past that point, more task diversity steadily lowers it. As variety grows from 16 toward 256 layouts-by-topologies, final success falls across the full range from roughly 0.95 to 0.80. The diversity that helps within any single curriculum ultimately limits what the curriculum can build.

Across ten sets of tasks, final success climbs at a middle amount of variety (16 by 16) but plateaus at the highest variety (256).
Figure 5 Across ten sets of tasks, a middle amount of variety keeps the agent improving, while the highest variety makes long-run progress flatten. Final success rate on each set in the sequence. The climbing curve is 16 layouts by 16 topologies; the flat one is the highest-variety condition.
Success rate through training across the ten sets of tasks, for the deepest tasks alone. Low-variety curves relearn from near zero at each new set and spike high; high-variety curves stay flat at a lower level.
Figure 6 The same pattern is sharpest on the deepest tasks, those with the longest chains of sub-goals. Success rate through all ten sets for depth 6 alone. Each new set begins at a dashed line. The low-variety curves (red) relearn from near zero at every set and climb steeply; the high-variety curves (green) stay flat at a lower level, gaining little from one set to the next.
What stalls is not learning, but specialization.

The more task variety an agent has seen, the better it keeps improving on its first set of tasks, even as gains on new ones level off. At the highest variety, the agent is still advancing on old tasks while its progress on new ones plateaus.

The two halves fit together once you see what the agent is learning. The agent picks up the structure shared across tasks, and that shared structure is exactly what lifts the old sets. It loses the ability to specialize to whatever is new in the set in front of it. Varied early sets interfere with later ones. Variety buys a general solution at the cost of a sharp one.

Success on the first set of tasks keeps rising across the ten-set sequence as variety increases, even as progress on new sets flattens.
Figure 7 The more variety the agent sees, the more its success on the very first set keeps climbing, even as its progress on new sets flattens.

Interested in learning more?

๐Ÿ“„  Read the paper
Cite

BibTeX

@misc{seth2026banyan,
  title         = {Task diversity produces systematic transfer but
                   inhibits continual reinforcement learning},
  author        = {Seth, Purab and Shah, Neil and Jha, Kunal and
                   Gershman, Samuel J. and Kleiman-Weiner, Max and
                   Carvalho, Wilka},
  year          = {2026},
  eprint        = {2606.00880},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2606.00880}
}