June 26, 2026 | Research RL Skills

OPID Teaches Agents From Their Own Mistakes, No External Memory Needed

Training agents with reinforcement learning has an annoying split. Outcome-based RL, where you only reward the final result, is stable but the reward is brutally sparse: the agent does fifty things and gets one bit of feedback at the end. The fix everyone reaches for is skill distillation, which gives denser guidance but usually means bolting on a costly external memory bank. OPID, introduced in a new paper from a team that includes Jianhua Tao's group, gets the supervision for free, straight out of the agent's own completed runs.

Here's the move in plain terms. After the agent finishes a trajectory, OPID mines two kinds of hindsight skills from it. Episode-level skills capture the big-picture workflow and how to avoid the failure that just happened. Step-level skills capture the local call you should make at a critical moment. A router prioritizes the step-level knowledge exactly when a decision is pivotal and falls back to episode-level guidance otherwise. The routed skill gets injected back into the history, the policy re-scores itself, and you get a token-level self-distillation signal layered on top of the sparse outcome reward.
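To make the routing and re-scoring steps concrete, here is a minimal Python sketch. It is not the paper's implementation. Every name in it (`HindsightSkills`, `route_skill`, `inject_skill`, `token_distill_signal`) is hypothetical, the log-probabilities are made-up stand-ins for what a real policy would return, and treating the per-token signal as a simple log-probability gap is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HindsightSkills:
    """Hypothetical container for skills mined from one finished trajectory."""
    episode_skill: str  # big-picture workflow, or the failure to avoid
    step_skills: Dict[int, str] = field(default_factory=dict)  # pivotal step -> local advice


def route_skill(skills: HindsightSkills, step: int) -> str:
    """Prefer step-level advice at a pivotal step, else fall back to episode-level."""
    return skills.step_skills.get(step, skills.episode_skill)


def inject_skill(history: List[str], skill: str) -> List[str]:
    """Return a copy of the history with the routed skill appended as a hint."""
    return history + [f"[hindsight] {skill}"]


def token_distill_signal(plain_logprobs: List[float],
                         hinted_logprobs: List[float]) -> List[float]:
    """Per-token gap between skill-conditioned and plain log-probs.

    Assumption: a positive value marks a token the hinted policy likes more
    than the plain policy, i.e. a token worth pushing the plain policy toward.
    """
    if len(plain_logprobs) != len(hinted_logprobs):
        raise ValueError("log-prob sequences must align token for token")
    return [hinted - plain for plain, hinted in zip(plain_logprobs, hinted_logprobs)]


# Toy trajectory: step 3 was flagged as pivotal, so it has its own skill.
skills = HindsightSkills(
    episode_skill="Find the object before heading to the target receptacle.",
    step_skills={3: "Open the drawer before trying to take the key."},
)
history = ["obs: you are in a kitchen", "act: go to drawer 1"]
hinted_history = inject_skill(history, route_skill(skills, step=3))

# Stand-in numbers: log-probs of the same action tokens, scored by the same
# policy without the hint (plain) and with the hint in the history (hinted).
plain = [-1.0, -0.25, -2.0]
hinted = [-0.5, -0.25, -1.0]
signal = token_distill_signal(plain, hinted)

print(hinted_history[-1])
print(signal)
```

In a real system the two log-prob lists would come from scoring the same action tokens twice with the same policy, once on the original history and once on the skill-injected one. How that per-token signal is weighted against the outcome reward is not covered here, so the sketch stops at computing it.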

The results: better performance, better sample efficiency and better robustness than outcome-only baselines across ALFWorld, WebShop and search-based QA. Code is on GitHub. It's a clean idea executed without a lot of moving parts, and that is usually the kind of idea that sticks.

The thread worth watching is that this is one more entry in the agents-learn-from-their-own-rollouts story. The expensive part of agent training has been collecting good demonstrations. Papers like this keep finding ways to squeeze free supervision out of rollouts the agent has already produced. The hindsight is already sitting in the data. You just have to mine it.

Link: https://arxiv.org/abs/2606.26790
