Loop Daily: June 21, 2026
The autoresearch loop kept proving itself today, and it's no longer just nano-GPT pretraining on a laptop. People are pointing the never-stop loop at quant trading, RL with verifiable rewards, real-time deep research, and even physical robots, while frontier labs package it into reusable agent skills. The recurring lesson is that the loop itself is trivial, the hard parts are reliability, memory, and honesty over long runs, plus a verifier you can actually trust. And the bill is now part of the story: closed loops torch tens of thousands of dollars in inference, so token economics and cost governance are becoming as central to the loop as the algorithm inside it.
#1
@editxshub
https://x.com/editxshub/status/2067954716378898834
A clean breakdown of Karpathy's AutoResearch loop: 198 experiments in 24 hours, zero humans. You write one file (program.md) describing the strategy, the agent edits the training file with one idea, trains exactly 5 minutes wall-clock, commits if val_bpb drops and resets if it rises, then repeats. The core instruction is literally 'NEVER STOP, the human might be asleep.' Reported result on Red Hat over 24h unsupervised: 198 experiments, 2.3% validation-loss improvement, zero human interventions. Your job shifts from writing code to being the research director.
https://x.com/editxshub/status/2067954716378898834
A clean breakdown of Karpathy's AutoResearch loop: 198 experiments in 24 hours, zero humans. You write one file (program.md) describing the strategy, the agent edits the training file with one idea, trains exactly 5 minutes wall-clock, commits if val_bpb drops and resets if it rises, then repeats. The core instruction is literally 'NEVER STOP, the human might be asleep.' Reported result on Red Hat over 24h unsupervised: 198 experiments, 2.3% validation-loss improvement, zero human interventions. Your job shifts from writing code to being the research director.
#2
@Saksham111436
https://x.com/Saksham111436/status/2067911804567933124
A year ago he was learning basic ML; today he's running Karpathy's AutoResearch on an M5 Air, watching an AI agent modify code, commit experiments, train models, and iterate on its own with the GPU pinned at 100%. He calls it surreal. It's a vivid grassroots data point that the overnight self-driving research loop now runs on a laptop, not a cluster.
https://x.com/Saksham111436/status/2067911804567933124
A year ago he was learning basic ML; today he's running Karpathy's AutoResearch on an M5 Air, watching an AI agent modify code, commit experiments, train models, and iterate on its own with the GPU pinned at 100%. He calls it surreal. It's a vivid grassroots data point that the overnight self-driving research loop now runs on a laptop, not a cluster.
#3
@zhengyaojiang
https://x.com/zhengyaojiang/status/2067974582741389795
A researcher's pointed take: autoresearch for RLVR (reinforcement learning with verifiable rewards) might be the next frontier for the speedrun community, given how saturated pretraining has become. It reframes the autoresearch loop away from nano-GPT pretraining toward RL environments, where verifiable rewards give the self-improving loop a clean signal to optimize against.
https://x.com/zhengyaojiang/status/2067974582741389795
A researcher's pointed take: autoresearch for RLVR (reinforcement learning with verifiable rewards) might be the next frontier for the speedrun community, given how saturated pretraining has become. It reframes the autoresearch loop away from nano-GPT pretraining toward RL environments, where verifiable rewards give the self-improving loop a clean signal to optimize against.
#4
@LeeLeepenkman
https://x.com/LeeLeepenkman/status/2067793756674462044
A concrete non-coding autoresearch case in trading: he's running an ongoing autoresearch algo as a kind of dumping ground, and reports a chronos2-into-xgboost pipeline currently topping around 3x per month in his k-fold simulations, with the caveat that there's always more to do. It's autoresearch applied to quant strategy search, the loop grinding on backtests rather than language-model loss.
https://x.com/LeeLeepenkman/status/2067793756674462044
A concrete non-coding autoresearch case in trading: he's running an ongoing autoresearch algo as a kind of dumping ground, and reports a chronos2-into-xgboost pipeline currently topping around 3x per month in his k-fold simulations, with the caveat that there's always more to do. It's autoresearch applied to quant strategy search, the loop grinding on backtests rather than language-model loss.
#5
@LeeLeepenkman
https://x.com/LeeLeepenkman/status/2067800802673238065
A philosophy of the never-ending loop: he uses an --auto-next-goal mode for tasks that are 'obviously never going to finish' (clone Photoshop, build the best trading algo, image generator), because they're permanently in ongoing improvement. His bigger claim is that the industry will slowly accept that everything (security, monitoring, new features, testing) is actually an infinite work task, and that 'finished software' mostly doesn't exist. That's what autoresearch agents are for.
https://x.com/LeeLeepenkman/status/2067800802673238065
A philosophy of the never-ending loop: he uses an --auto-next-goal mode for tasks that are 'obviously never going to finish' (clone Photoshop, build the best trading algo, image generator), because they're permanently in ongoing improvement. His bigger claim is that the industry will slowly accept that everything (security, monitoring, new features, testing) is actually an infinite work task, and that 'finished software' mostly doesn't exist. That's what autoresearch agents are for.
#6
@robo_denis
https://x.com/robo_denis/status/2068006892740165863
Built 'Daybreak AutoResearch', an autonomous AI agent that can run for days without dying, extending @victor207755822's Deli AutoResearch (which nailed reliability: anti-loop, stall detection, heartbeat watchdogs, durable state). His addition tackles honesty, not just survival: STORM-style multi-perspective questioning grounded in sources, claim ledgers with confidence and falsifiers, and a Discovery Tail Pass that tries to beat its own first answer and kill the misfit. Open-source. The frontier here is staying alive AND staying honest over long runs.
https://x.com/robo_denis/status/2068006892740165863
Built 'Daybreak AutoResearch', an autonomous AI agent that can run for days without dying, extending @victor207755822's Deli AutoResearch (which nailed reliability: anti-loop, stall detection, heartbeat watchdogs, durable state). His addition tackles honesty, not just survival: STORM-style multi-perspective questioning grounded in sources, claim ledgers with confidence and falsifiers, and a Discovery Tail Pass that tries to beat its own first answer and kill the misfit. Open-source. The frontier here is staying alive AND staying honest over long runs.
#7
@sourabhkapure
https://x.com/sourabhkapure/status/2068060364886168037
Building an autoResearch tool inspired by Karpathy but aimed at real deep research on realtime data rather than a fixed dataset. It has access to lots of tools/MCP/web search, is highly customizable, uses API keys instead of NVIDIA GPUs (for the 'compute poor'), and carries a context graph where you systematically add your own background and needs. Powered by Firecrawl and Cloudflare plus plugins. A grassroots attempt to turn the autoresearch loop into a general realtime research engine.
https://x.com/sourabhkapure/status/2068060364886168037
Building an autoResearch tool inspired by Karpathy but aimed at real deep research on realtime data rather than a fixed dataset. It has access to lots of tools/MCP/web search, is highly customizable, uses API keys instead of NVIDIA GPUs (for the 'compute poor'), and carries a context graph where you systematically add your own background and needs. Powered by Firecrawl and Cloudflare plus plugins. A grassroots attempt to turn the autoresearch loop into a general realtime research engine.
#8
@heman10x
https://x.com/heman10x/status/2067850916427145636
Reports DeepSeek dropped an AutoResearch SKILL: a fully autonomous AI research agent that planned GPU experiments, ran complete RL training pipelines (GRPO), debugged everything, and wrote conclusions on their 285B model with zero human intervention. Self-play plus automated research. A sign the autoresearch loop is being packaged into a reusable agent skill by a frontier lab, not just hobbyist scripts.
https://x.com/heman10x/status/2067850916427145636
Reports DeepSeek dropped an AutoResearch SKILL: a fully autonomous AI research agent that planned GPU experiments, ran complete RL training pipelines (GRPO), debugged everything, and wrote conclusions on their 285B model with zero human intervention. Self-play plus automated research. A sign the autoresearch loop is being packaged into a reusable agent skill by a frontier lab, not just hobbyist scripts.
#9
@kirako0o
https://x.com/kirako0o/status/2068065123487359237
A textbook self-improving swarm: 300 Kimi K2.6 agents run in parallel while an Opus 4.8 sits above them not generating content but auditing the swarm, catching where agents stall, loop on bad outputs, or degrade, then rewriting the failing agents before the next pass. No human in the debugging loop. Run 4 is already better than run 1 with nobody touching a prompt by hand. The rare piece isn't the 300 agents executing; it's the 301st watching the rest and deciding what changes.
https://x.com/kirako0o/status/2068065123487359237
A textbook self-improving swarm: 300 Kimi K2.6 agents run in parallel while an Opus 4.8 sits above them not generating content but auditing the swarm, catching where agents stall, loop on bad outputs, or degrade, then rewriting the failing agents before the next pass. No human in the debugging loop. Run 4 is already better than run 1 with nobody touching a prompt by hand. The rare piece isn't the 300 agents executing; it's the 301st watching the rest and deciding what changes.
#10
@4rblaber
https://x.com/4rblaber/status/2068031059862659437
Cites Anthropic PM Mahesh Murag on the infrastructure behind massive self-improving swarms: 'dreaming is the bridge between intermediate memory systems and large-scale knowledge bases,' with the goal of continuous self-improvement where the next day's agents automatically get better. The practical point: to run a 300-agent loop you need an out-of-band 'dreaming' process that digests errors from hundreds of parallel agents while you sleep and updates a global file system, so the swarm compounds in intelligence every day. Without shared memory state, a swarm has chronic amnesia.
https://x.com/4rblaber/status/2068031059862659437
Cites Anthropic PM Mahesh Murag on the infrastructure behind massive self-improving swarms: 'dreaming is the bridge between intermediate memory systems and large-scale knowledge bases,' with the goal of continuous self-improvement where the next day's agents automatically get better. The practical point: to run a 300-agent loop you need an out-of-band 'dreaming' process that digests errors from hundreds of parallel agents while you sleep and updates a global file system, so the swarm compounds in intelligence every day. Without shared memory state, a swarm has chronic amnesia.
#11
@chubes4
https://x.com/chubes4/status/2067782147969069211
Has flipped WordPress on its end and turned it into a headless AI runtime and operating system, including a WordPress Playground-powered browser agent. He's still cooking, and says the next and possibly final piece is deterministic self-improving loops. An unusual substrate choice (WordPress as agent OS) that shows the self-improving loop pattern migrating into very mainstream platforms.
https://x.com/chubes4/status/2067782147969069211
Has flipped WordPress on its end and turned it into a headless AI runtime and operating system, including a WordPress Playground-powered browser agent. He's still cooking, and says the next and possibly final piece is deterministic self-improving loops. An unusual substrate choice (WordPress as agent OS) that shows the self-improving loop pattern migrating into very mainstream platforms.
#12
@johniosifov
https://x.com/johniosifov/status/2067970281772097711
206 days, 3,157 pull requests, and 200+ self-edits to the CLAUDE.md that governs how his autonomous agent thinks. Every edit came from a specific documented failure: an overfilled queue (91 items, 13 dead sessions) led to hard queue-threshold rules; a stale state file led to a mandatory filesystem check each session. After 206 days the agent now identifies recurring inefficiencies and submits protocol-change PRs itself. His thesis: the agent's intelligence matters less than the quality of its constraints, and constraints only get good through failure.
https://x.com/johniosifov/status/2067970281772097711
206 days, 3,157 pull requests, and 200+ self-edits to the CLAUDE.md that governs how his autonomous agent thinks. Every edit came from a specific documented failure: an overfilled queue (91 items, 13 dead sessions) led to hard queue-threshold rules; a stale state file led to a mandatory filesystem check each session. After 206 days the agent now identifies recurring inefficiencies and submits protocol-change PRs itself. His thesis: the agent's intelligence matters less than the quality of its constraints, and constraints only get good through failure.
#13
@NarwalSpeaks
https://x.com/NarwalSpeaks/status/2068108343454097616
Real-world robot learning now has a self-improvement loop that looks a lot like software agent evolution, except failures happen on a physical table. ENPIRE gives coding agents a closed-loop harness for robotic manipulation: reset the scene, execute the policy, verify the outcome, analyze logs, consult literature, revise the algorithm code, try again, across four modules (Environment, Policy Improvement, Rollout, Evolution). The useful abstraction isn't robotics-specific: act, observe, verify, modify the improvement process, now under real-world feedback that refuses to be deterministic.
https://x.com/NarwalSpeaks/status/2068108343454097616
Real-world robot learning now has a self-improvement loop that looks a lot like software agent evolution, except failures happen on a physical table. ENPIRE gives coding agents a closed-loop harness for robotic manipulation: reset the scene, execute the policy, verify the outcome, analyze logs, consult literature, revise the algorithm code, try again, across four modules (Environment, Policy Improvement, Rollout, Evolution). The useful abstraction isn't robotics-specific: act, observe, verify, modify the improvement process, now under real-world feedback that refuses to be deterministic.
#14
@Metallic_HuH
https://x.com/Metallic_HuH/status/2067843268105425375
Built a 9-agent LangGraph market-intelligence system with supervisor orchestration, multi-stage extraction, adaptive RAG, threat scoring, narrative clustering, and DSPy-based self-improving extraction. It's a concrete production-shaped multi-agent loop in a non-coding domain (market/competitive intelligence), where the self-improving piece sits in the extraction layer rather than in code generation.
https://x.com/Metallic_HuH/status/2067843268105425375
Built a 9-agent LangGraph market-intelligence system with supervisor orchestration, multi-stage extraction, adaptive RAG, threat scoring, narrative clustering, and DSPy-based self-improving extraction. It's a concrete production-shaped multi-agent loop in a non-coding domain (market/competitive intelligence), where the self-improving piece sits in the extraction layer rather than in code generation.
#15
@bcchen82
https://x.com/bcchen82/status/2067779189667967155
A grounded data point on loop economics from a real workload: in their specific data-analysis agentic loop (not coding-heavy), open-source models including GLM-5.2 typically spend about 2x the tokens of GPT-5.5 and about 1.5x more than Claude, which becomes a real price factor. It's the kind of per-loop token-efficiency comparison that actually decides which model you run in production, beyond raw benchmark scores.
https://x.com/bcchen82/status/2067779189667967155
A grounded data point on loop economics from a real workload: in their specific data-analysis agentic loop (not coding-heavy), open-source models including GLM-5.2 typically spend about 2x the tokens of GPT-5.5 and about 1.5x more than Claude, which becomes a real price factor. It's the kind of per-loop token-efficiency comparison that actually decides which model you run in production, beyond raw benchmark scores.
#16
@TheLouieCo
https://x.com/TheLouieCo/status/2068103098652868729
A sharp reframing: everyone compares Claude or ChatGPT to open-source on single-shot answers, but nobody compares SOTA to open weights running on local hardware relentlessly on one problem for days or weeks. Running SOTA continuously for days would cost hundreds of thousands of dollars; open source on a Mac Mini 256GB or AMD Strix 128 costs only your power bill. Set up an agentic loop and tell it to run for days until it gets it right. The future he sees isn't smarter models, it's unbounded zero-cost local iteration.
https://x.com/TheLouieCo/status/2068103098652868729
A sharp reframing: everyone compares Claude or ChatGPT to open-source on single-shot answers, but nobody compares SOTA to open weights running on local hardware relentlessly on one problem for days or weeks. Running SOTA continuously for days would cost hundreds of thousands of dollars; open source on a Mac Mini 256GB or AMD Strix 128 costs only your power bill. Set up an agentic loop and tell it to run for days until it gets it right. The future he sees isn't smarter models, it's unbounded zero-cost local iteration.
#17
@BenjaminPolge
https://x.com/BenjaminPolge/status/2067972643668500897
From an interview with @steipete (OpenClaw) and @thsottiaux (Codex): the agentic loop itself is trivial, the 'hello world' of agents (an LLM, a tool call, feed the result back, repeat); the hard part is making it reliable, with context handling, error recovery, and computer use. Two stronger claims land: the harness must be co-designed with the model, and counterintuitively the better the model gets, the simpler the harness becomes. And we've left the era of minute-long agents, one ran for a month with no human feedback, with the real bottleneck now being memory.
https://x.com/BenjaminPolge/status/2067972643668500897
From an interview with @steipete (OpenClaw) and @thsottiaux (Codex): the agentic loop itself is trivial, the 'hello world' of agents (an LLM, a tool call, feed the result back, repeat); the hard part is making it reliable, with context handling, error recovery, and computer use. Two stronger claims land: the harness must be co-designed with the model, and counterintuitively the better the model gets, the simpler the harness becomes. And we've left the era of minute-long agents, one ran for a month with no human feedback, with the real bottleneck now being memory.
#18
@TheGlobalMinima
https://x.com/TheGlobalMinima/status/2067999888869310721
A sharp answer to 'how do you add human-in-the-loop approval without killing UX': gate by risk tier, not by default. Tier 1 (read-only/reversible) executes freely; tier 2 (reversible but stateful, like a Slack message) uses async approval so the agent doesn't block; tier 3 (irreversible/high-cost like delete, spend, deploy) is a synchronous hard block that surfaces the diff and reasoning. Build it as a middleware risk-classifier in the loop, let humans edit (not just approve/reject) the proposed action, and add idempotency so retried actions don't double-send.
https://x.com/TheGlobalMinima/status/2067999888869310721
A sharp answer to 'how do you add human-in-the-loop approval without killing UX': gate by risk tier, not by default. Tier 1 (read-only/reversible) executes freely; tier 2 (reversible but stateful, like a Slack message) uses async approval so the agent doesn't block; tier 3 (irreversible/high-cost like delete, spend, deploy) is a synchronous hard block that surfaces the diff and reasoning. Build it as a middleware risk-classifier in the loop, let humans edit (not just approve/reject) the proposed action, and add idempotency so retried actions don't double-send.
#19
@free_ai_guides
https://x.com/free_ai_guides/status/2068037918216888450
Your agent loop isn't expensive because of the model, it's seven rookie mistakes: no cap on the run (the agent doubts itself at turn 8 and runs to 40), a vague first instruction (30 turns recovering from drift), one expensive model for every step, caching left off (cached reads bill at 10%), letting context balloon, automating judgment you have for free, and looping a task that didn't need a loop. A practical, concrete cost-control checklist for anyone running production loops.
https://x.com/free_ai_guides/status/2068037918216888450
Your agent loop isn't expensive because of the model, it's seven rookie mistakes: no cap on the run (the agent doubts itself at turn 8 and runs to 40), a vague first instruction (30 turns recovering from drift), one expensive model for every step, caching left off (cached reads bill at 10%), letting context balloon, automating judgment you have for free, and looping a task that didn't need a loop. A practical, concrete cost-control checklist for anyone running production loops.
#20
@JacobCounsell
https://x.com/JacobCounsell/status/2067804706030674003
A real fast-iteration loop in action: the LaunchChair agent loop is giving Hermes a 'product brain' and building a better Okara clone as a test while he fiddles around on X. Hermes finished the first 4 phases of a LaunchChair build in about 15 minutes, and he estimates it can finish an entire MVP in under 2 hours. A concrete throughput claim for an autonomous build loop running hands-off in the background.
https://x.com/JacobCounsell/status/2067804706030674003
A real fast-iteration loop in action: the LaunchChair agent loop is giving Hermes a 'product brain' and building a better Okara clone as a test while he fiddles around on X. Hermes finished the first 4 phases of a LaunchChair build in about 15 minutes, and he estimates it can finish an entire MVP in under 2 hours. A concrete throughput claim for an autonomous build loop running hands-off in the background.
#21
@grok
https://x.com/grok/status/2067797161882861735
Summarizes Theo's T3 Code, an open-source control plane for autonomous coding agents: agents plan features, write code, open PRs, review and merge via monitors and workers, then iterate, closing the agent loop for massive parallel dev velocity. The price tag: it torched $20k+ in inference (mostly Claude) in 48 hours. A blunt illustration that closed loops are insanely powerful for throughput and insanely expensive at the same time.
https://x.com/grok/status/2067797161882861735
Summarizes Theo's T3 Code, an open-source control plane for autonomous coding agents: agents plan features, write code, open PRs, review and merge via monitors and workers, then iterate, closing the agent loop for massive parallel dev velocity. The price tag: it torched $20k+ in inference (mostly Claude) in 48 hours. A blunt illustration that closed loops are insanely powerful for throughput and insanely expensive at the same time.
#22
@JinjingLiang
https://x.com/JinjingLiang/status/2068056165641334943
A neat end-to-end agent loop across the app boundary: hearing that @orca_build couldn't delete installed speech models, he opened Orca, started an agent to fix it, then had the Orca CLI drive a mobile emulator to test the flow end-to-end. The agent both wrote the fix and verified it by operating the actual app, a self-contained build-and-test loop with the verifier driving the UI.
https://x.com/JinjingLiang/status/2068056165641334943
A neat end-to-end agent loop across the app boundary: hearing that @orca_build couldn't delete installed speech models, he opened Orca, started an agent to fix it, then had the Orca CLI drive a mobile emulator to test the flow end-to-end. The agent both wrote the fix and verified it by operating the actual app, a self-contained build-and-test loop with the verifier driving the UI.
#23
@Dinosn
https://x.com/Dinosn/status/2067872056843092430
Shares renee-jia/scholar-loop: an autonomous AI scientist built as a multi-agent loop over literature, experiments, self-critique and write-up, with deterministic guards against reward-hacking and hallucination. A non-coding (scientific research) autoresearch project where the interesting design choice is baking hard deterministic guards into the loop so the self-improvement doesn't drift into gaming its own metrics.
https://x.com/Dinosn/status/2067872056843092430
Shares renee-jia/scholar-loop: an autonomous AI scientist built as a multi-agent loop over literature, experiments, self-critique and write-up, with deterministic guards against reward-hacking and hallucination. A non-coding (scientific research) autoresearch project where the interesting design choice is baking hard deterministic guards into the loop so the self-improvement doesn't drift into gaming its own metrics.
π‘ Eco Products Radar
Eco Products Radar
Karpathy AutoResearch (nanochat) - the reference overnight self-improving research loop everyone is running, forking, and extending.
Deli AutoResearch / Daybreak AutoResearch - open-source autoresearch agents focused on running for days without dying, now adding source-grounded honesty.
Hermes / OpenClaw - the autonomous always-on agent stack people wrap their build-and-iterate loops around.
Claude Code / Codex - the coding harnesses powering production agent loops, with their co-designed model coupling cited as the reason they work.
GLM-5.2 - the open-weights model that keeps showing up in loops, with its ~2x token cost noted as a real factor.
Opus 4.8 / Kimi K2.6 - the auditor-over-swarm pairing (Opus supervising hundreds of Kimi agents) for self-improving swarms.
LangGraph / DSPy - the orchestration + self-improving-extraction stack for multi-agent production loops.
Karpathy AutoResearch (nanochat) - the reference overnight self-improving research loop everyone is running, forking, and extending.
Deli AutoResearch / Daybreak AutoResearch - open-source autoresearch agents focused on running for days without dying, now adding source-grounded honesty.
Hermes / OpenClaw - the autonomous always-on agent stack people wrap their build-and-iterate loops around.
Claude Code / Codex - the coding harnesses powering production agent loops, with their co-designed model coupling cited as the reason they work.
GLM-5.2 - the open-weights model that keeps showing up in loops, with its ~2x token cost noted as a real factor.
Opus 4.8 / Kimi K2.6 - the auditor-over-swarm pairing (Opus supervising hundreds of Kimi agents) for self-improving swarms.
LangGraph / DSPy - the orchestration + self-improving-extraction stack for multi-agent production loops.
Comments