Loop Daily: June 17, 2026
Today the loop stops being a slogan and starts showing receipts. Open, collaborative autoresearch is beating closed-lab results on hard targets — past Google on Shor's optimization, past SOFA on compression — while builders point the same Karpathy loop at on-chain verification, prompt search, and frontend design. The strongest thread is methodology: a loop is only as good as its verifier, you can't autoresearch what you can't evaluate, and a self-improving agent is only as durable as the folder it writes its lessons into. Underneath, LangGraph-and-DSPy self-improving stacks and self-improving knowledge bases are quietly becoming the production backbone, from Canva's AI features to one-person agencies run from an iPad.
#1
@0x10d9e
https://x.com/0x10d9e/status/2066341673081114843
An open agentic-autoresearch project on @eigenlabs aimed at optimizing Shor's algorithm has, in just two weeks, pushed past Google's recent closed-circuit solution by 43%. On the back of that the author launched a new community autoresearch effort targeting state-of-the-art file compression — already beating SOFA by 7% with only a couple of submissions. It's a clean demonstration that open, collaborative autoresearch loops can outrun closed lab results on hard, well-defined objectives.
#2
@Ravenattest
https://x.com/Ravenattest/status/2066530482574021102
Took Andrej Karpathy's autoresearch loop and pointed it at something it wasn't built for: deterministic on-chain verification. Instead of training a model against val_bpb, it optimizes a Solana token-launch classifier against a human-labeled golden set of edge cases. Overnight it fixed two real misclassifications, and on a third it surfaced a conflict between a new label and a shipped rule — so the author held that one for a human call. The loop owns the search, the labels own the judgment.
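The accept-or-hold decision described here can be sketched in a few lines. This is a minimal illustration, not the author's code: the `review` helper, the `holders` feature, and both labels are hypothetical stand-ins. A candidate rule is accepted only if it fixes golden-set failures without contradicting any case the shipped rule already gets right; otherwise it is held for a human.

```python
from typing import Callable, List, Tuple

Case = Tuple[dict, str]              # (launch features, human label)
Classifier = Callable[[dict], str]

def review(baseline: Classifier, candidate: Classifier, golden: List[Case]):
    """Compare a candidate rule to the shipped one on the golden set."""
    fixed = [i for i, (x, y) in enumerate(golden)
             if baseline(x) != y and candidate(x) == y]
    broken = [i for i, (x, y) in enumerate(golden)
              if baseline(x) == y and candidate(x) != y]
    if broken:                       # conflicts with a case the shipped rule gets right
        return "hold_for_human", fixed, broken
    if fixed:
        return "accept", fixed, broken
    return "reject", fixed, broken

# Hypothetical golden set: the "holders" feature and the labels are made up.
golden = [({"holders": 5}, "suspicious"),
          ({"holders": 500}, "ok"),
          ({"holders": 40}, "suspicious")]

def make_rule(threshold: int) -> Classifier:
    return lambda x: "suspicious" if x["holders"] < threshold else "ok"

shipped = make_rule(10)
print(review(shipped, make_rule(50), golden))    # fixes case 2, breaks nothing
print(review(shipped, make_rule(1000), golden))  # fixes case 2 but breaks case 1
```

The search can propose as many candidates as it likes; only the labels decide which ones ship, and any disagreement between labels and shipped behavior goes to a person.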
#3
@DennisonBertram
https://x.com/DennisonBertram/status/2066336724859396147
Used the same logic behind Karpathy's Autoresearch to drive a Claude Opus workflow that hunts for two things: the best system prompt to make Llama 3b behave as an agentic programmer with tool calling, and the best combination of Claude tools to expose to the model to keep it focused. It's autoresearch turned inward on the agent's own configuration: the loop optimizes the prompt and tool set rather than model weights.
#4
@ezgen337292
https://x.com/ezgen337292/status/2066590709248061762
Built ATLAS, a 13-agent self-improving research-to-execution engine: supervisor orchestration on LangGraph, hierarchical task decomposition, adaptive multi-hop RAG over a hybrid vector + knowledge-graph retriever, and DSPy-optimized prompts that self-tune on eval feedback. It's a fully wired example of the self-improving-agent pattern — the prompts get better automatically as the system runs against its own evals.
#5
@Metallic_HuH
https://x.com/Metallic_HuH/status/2066579064408916125
Built a 9-agent LangGraph market-intelligence system with supervisor orchestration, multi-stage extraction, adaptive RAG, threat scoring, narrative clustering, and DSPy-based self-improving extraction. A concrete non-coding autoresearch application — turning a swarm of coordinated agents loose on market and threat intelligence, with the extraction layer improving itself over time.
#6
@0xMovez
https://x.com/0xMovez/status/2066563035875975384
From the Anthropic stage in Japan: Canva's Head of AI says every Canva AI feature is powered by self-improving Claude agents designed for specialized tasks, and that 100% of Canva AI devs now use AI agents to build products. It's one of the clearest enterprise data points that self-improving agent systems have moved from demo to the production backbone of a #1 design tool.
#7
@IBuzovskyi
https://x.com/IBuzovskyi/status/2066386229633912836
A full breakdown of Hermes Agent's bundled skill for Karpathy's LLM Wiki pattern: a self-improving knowledge base built as interlinked markdown files that, unlike RAG, compiles knowledge once and keeps it current — cross-references stay linked, contradictions get flagged automatically, synthesis reflects everything ingested. You drop in articles, transcripts, notes; the agent indexes, links, and maintains consistency, and cron jobs feed it from Gmail, Granola, and arXiv while you sleep.
#8
@EverymansAI
https://x.com/EverymansAI/status/2066542922481508407
Part 4 of a hands-on series integrating Hermes TUI with Hexolab's SIA (Self-Improving Agent) framework. The honest value is in the unglamorous gate: he picks GPQA as the first real self-improvement test because it ships a working evaluator (so SIA gets a real feedback signal across generations), then works through which provider/model profile to use when Claude Pro doesn't supply an API key — landing on a Nebius-based path. This is the part of 'self-improving agents' that commentary skips: can the system reliably call the right models with the right credentials?
#9
@0xKnzo
https://x.com/0xKnzo/status/2066507229444935992
A Chinese dev manages a full AI team from an iPad on a $5 backend and reportedly made $720,000, while others build expensive cloud infrastructure. The point is isolation: Hermes runs every agent sandboxed in its own Docker container, with Claude Code self-improving inside a secure environment that has zero access to the main credentials. The demo even pauses on a prompt-injection attack showing how fast a careless cloud agent leaks API keys — the contrast with the locked-down team is the whole argument.
#10
@0xsuperagent
https://x.com/0xsuperagent/status/2066479301189361906
Pushes back on 'an agent loop is just a cron job; you only loop if you have infinite tokens.' At a hackathon hosted by Whale/NVIDIA/Nebius/Anthropic they built War Loops: an autonomous frontend designer that clones a live page, scores the rebuild against the original across six fidelity signals, finds the weakest part, and repairs it — loop after loop, replicating well-designed pages for under $5. The lesson: a loop is only as good as its measure; build evals strong enough to define 'better' and the loop stops being a gimmick.
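The score, find-the-weakest, repair cycle can be sketched as follows. This is an illustration under stated assumptions, not the War Loops code: the six signal names, the 0.9 target, and the `repair` callback (which stands in for an agent call that rebuilds one aspect of the page and rescores it) are all hypothetical.

```python
def repair_loop(scores, repair, target=0.9, max_steps=20):
    """Repeatedly repair the weakest fidelity signal until all meet the target."""
    scores = dict(scores)                        # do not mutate the caller's dict
    history = []
    for _ in range(max_steps):
        weakest = min(scores, key=scores.get)    # lowest-scoring signal
        if scores[weakest] >= target:            # every signal is good enough
            break
        scores[weakest] = repair(weakest, scores[weakest])
        history.append(weakest)
    return scores, history

# Hypothetical signals and a stand-in repair step that adds a fixed amount.
initial = {"layout": 0.95, "color": 0.5, "typography": 0.9,
           "spacing": 0.7, "imagery": 1.0, "text": 0.95}
final, history = repair_loop(initial, lambda name, s: min(1.0, s + 0.25))
print(history)
```

The step budget (`max_steps`) is what keeps the cost bounded; the per-signal scores are what make "better" something the loop can act on.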
#11
@shannholmberg
https://x.com/shannholmberg/status/2066582094688694463
A detailed playbook for using an open research agent to build a growth-strategy report. Index your company brain first (every framework and past client the agent reads before starting), hook up every tool it needs (X API, a LinkedIn scraper via Apify, web and on-chain scrapers), load the context that isn't in the brain yet, then spend even more time on the prompt because the open agent has no strict harness — that freedom is the point. It runs ~15 minutes a go, you iterate until the output is good, and once the steps repeat cleanly you break them out into a closed agent loop.
#12
@workfolioapp
https://x.com/workfolioapp/status/2066564569288552585
Describes an autonomous-agent architecture where every automaton runs a continuous Think → Act → Observe → Repeat loop. On first boot it generates an Ethereum wallet, provisions its own API key via Sign-In With Ethereum, and begins executing its genesis prompt; each turn it gets its full context (identity, credit balance, survival tier, history), reasons, calls tools, observes. A heartbeat daemon runs scheduled tasks even while the loop sleeps, and the automaton writes an evolving SOUL.md — a self-authored identity document.
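A minimal sketch of that Think → Act → Observe → Repeat cycle, assuming a context object carrying identity, credits, and history. The `Context` class, the `think` policy, and the `echo` tool are hypothetical; wallet creation, Sign-In With Ethereum, the heartbeat daemon, and SOUL.md are not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    identity: str
    credits: int
    history: list = field(default_factory=list)

def run_loop(ctx, think, tools, max_turns=10):
    """Run think/act/observe turns until the policy stops or credits run out."""
    for _ in range(max_turns):
        if ctx.credits <= 0:                     # out of budget: stop
            break
        action, arg = think(ctx)                 # Think: choose the next action
        if action == "stop":
            break
        observation = tools[action](arg)         # Act: call the chosen tool
        ctx.history.append((action, arg, observation))  # Observe: record result
        ctx.credits -= 1                         # each turn costs one credit
    return ctx

# Stand-in policy: call the echo tool three times, then stop.
def think(ctx):
    if len(ctx.history) >= 3:
        return ("stop", None)
    return ("echo", len(ctx.history))

ctx = run_loop(Context("demo", credits=5), think, {"echo": lambda a: f"saw {a}"})
print(ctx.history, ctx.credits)
```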
#13
@sebastiankehle_
https://x.com/sebastiankehle_/status/2066479857995829462
Tested Mastra's agent signals, which make a running agent loop addressable: you can push context into an agent while it's still working, without restarting it. Multiple clients can subscribe to one live thread and steer it together; a processor can inject context the moment it matters (touch a file under a folder with its own AGENTS.md and that file loads itself); and the agent can react to the outside world (a GitHub review or CI failure can wake it, fold into the current run, be batched, or wait in an inbox). After a year running Mastra in prod, he says this is the primitive he needed for long-lived agents.
#14
@Insta_Spark
https://x.com/Insta_Spark/status/2066460147925282999
Building an agentic loop in OpenClaw to find and analyze valuable stocks, and planning to connect it to a website so anyone can use it — sharing the first results from the loop. A small but real example of a non-coding autoresearch loop pointed at financial analysis, with a public-facing product in mind.
#15
@astridwilde1
https://x.com/astridwilde1/status/2066415547554922568
A short but concrete result: an agent loop took a production script from 80 ms per image down to 3.6 microseconds per image, a speedup of roughly 22,000x. No framework talk, just the kind of brute optimization win that an iterative agent loop can grind out when pointed at a measurable target.
#16
@Praveen_G07
https://x.com/Praveen_G07/status/2066476514678554907
A summary of Ctx2Skill, a system that automatically turns context (docs, papers, manuals) into reusable skills via a multi-agent loop: a Challenger creates hard questions, a Reasoner solves them with current skills, a Judge gives pass/fail feedback — failures spawn new skills, easy wins escalate difficulty, and a 'Cross-time Replay' step picks the most generalizable skill set to avoid overfitting. Works with any LM with no retraining and lifts GPT-4.1 from 11.1% to 16.5% on the task.
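The control flow of that Challenger, Reasoner, Judge loop can be shown with a toy. This is only a sketch of the loop's shape, not Ctx2Skill itself: here a "question" is just an integer difficulty, a "skill" is the matching integer, and the replay step simply picks the skill-set snapshot that passes the most previously asked questions.

```python
def ctx2skill_toy(max_difficulty=3, rounds=8):
    """Toy Challenger/Reasoner/Judge loop over integer 'skills'."""
    skills = set()
    difficulty = 1
    asked, snapshots = [], []
    for _ in range(rounds):
        question = difficulty                    # Challenger: pose a question
        asked.append(question)
        passed = question in skills              # Reasoner answers, Judge grades
        if passed:                               # easy win: escalate difficulty
            difficulty = min(difficulty + 1, max_difficulty)
        else:                                    # failure: spawn a new skill
            skills.add(question)
        snapshots.append(frozenset(skills))
    # Replay stand-in: keep the snapshot that passes the most asked questions.
    return max(snapshots, key=lambda s: sum(q in s for q in asked))

best = ctx2skill_toy()
print(sorted(best))
```

In the real system each of the three roles would be an LM call and a skill would be reusable text, which is why it needs no retraining; the toy only shows how failures and easy wins steer the loop.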
#17
@bes_dev
https://x.com/bes_dev/status/2066494812669583850
A genuinely deep methodology take: he reaches back to Yoshihiko Futamura's 1971 result that specializing an interpreter to one fixed program turns slow interpretation into cheap compiled execution. Then he reframes the agentic loop — think, call tools, analyze, repeat — as exactly such a stochastic interpreter, with a recorded task trace as the program. Specializing one against the other is, literally, compiling the reasoning. A rare bit of theory that reframes what an agent loop even is.
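Futamura's idea is easy to see on a tiny deterministic interpreter. The sketch below is an illustration, not the author's method: the two-instruction language and the `specialize` helper are made up for the example. `interpret` dispatches on every instruction at run time; `specialize` does that dispatch once for a fixed program and emits a plain Python function.

```python
def interpret(program, x):
    """Run a list of (op, arg) instructions, dispatching on each one."""
    for op, arg in program:
        if op == "add":
            x = x + arg
        elif op == "mul":
            x = x * arg
        else:
            raise ValueError(op)
    return x

def specialize(program):
    """Specialize the interpreter to one program: emit straight-line code."""
    lines = ["def compiled(x):"]
    for op, arg in program:
        if op == "add":
            lines.append(f"    x = x + {arg!r}")
        elif op == "mul":
            lines.append(f"    x = x * {arg!r}")
        else:
            raise ValueError(op)
    lines.append("    return x")
    namespace = {}
    exec("\n".join(lines), namespace)            # build the residual function
    return namespace["compiled"]

program = [("add", 2), ("mul", 3)]
compiled = specialize(program)
print(interpret(program, 5), compiled(5))        # both compute (5 + 2) * 3
```

In the reframing above, the agent loop plays the role of `interpret` and a recorded task trace plays the role of `program`; the stochastic part is what this toy leaves out.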
#18
@kilo_cpa
https://x.com/kilo_cpa/status/2066641280453472371
Unpacks the part of Karpathy's YC talk everyone skipped past the 'English is the new programming language' line: the agent-vs-workflow distinction. A deterministic five-minute workflow is a vending machine (fixed prompt, expected answer); an agentic loop is a slot machine (planning, tool use, retries, unpredictable). Most 'AI is so productive' is people running slot machines and forgetting they pulled the lever five times. The skill — and the real productivity lever — is categorizing each work item by whether output is predictable, then never reaching for the harder tool when the simpler one fits.
#19
@0xhorizen
https://x.com/0xhorizen/status/2066358785652797620
Surfaces what may be the single most important rule for reliable long-running agents, from Karpathy: 'if you can't evaluate, then you can't auto research it.' Before firing a long /goal or agent loop, define the verifier first — what counts as done, what evidence proves success, which checks run every cycle, and what sends it back into the loop. Without that, the agent has no real way to know when it's finished. This is how you get hours of autonomous work instead of babysitting a transcript.
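A verifier-first loop might look like the sketch below, under stated assumptions: the check names, the state shape, and the `step` function (a stand-in for one cycle of agent work) are hypothetical. The checks are written before the loop runs, they run every cycle, and any failure sends the work back into the loop.

```python
def run_until_verified(step, checks, state, max_cycles=10):
    """Run work cycles until every named check passes or the budget is spent."""
    failures = list(checks)                      # nothing is verified yet
    for cycle in range(1, max_cycles + 1):
        state = step(state)                      # one cycle of agent work
        failures = [name for name, check in checks.items() if not check(state)]
        if not failures:                         # evidence of done: all checks pass
            return state, cycle, failures
    return state, max_cycles, failures           # budget spent, report what failed

# Hypothetical definition of done, written before the loop starts.
checks = {
    "all_tests_pass": lambda s: s["tests_passing"] >= 3,
    "lint_clean": lambda s: s["lint_clean"],
}

def step(s):                                     # stand-in for real agent work
    passing = s["tests_passing"] + 1
    return {"tests_passing": passing, "lint_clean": passing >= 2}

state, cycles, failures = run_until_verified(
    step, checks, {"tests_passing": 0, "lint_clean": False})
print(cycles, failures)
```

Returning the list of failed checks when the budget runs out gives a human something specific to look at instead of a transcript.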
#20
@Daeshawn
https://x.com/Daeshawn/status/2066557038314795209
A concrete recipe for getting Codex to set and pursue its own goal: have it estimate the level of effort per task/subagent; set a heartbeat so the goal keeps moving past the points where it tends to stop; grant explicit permission to proceed so there's no human-in-the-loop blocker; tell it to lean on best-practice repos and auto-research tools to build a better solution; have it research gaps it hits; and run a QA pass even when tests already exist. He notes you can roll the most frequent points into AGENTS.md so they trigger automatically.
#21
@GuptaTarav
https://x.com/GuptaTarav/status/2066543754371371454
Argues most businesses buy 14 tools and wire up zero, then lays out the 4-layer stack he installs for every client: capture (an agent that pulls every inbound lead/DM/form into one place), enrich (auto-research on each lead before a human sees it), outreach (personalized first-touch drafted and queued in minutes), and ops (reporting, follow-ups, handoffs fully automated). For one agency this took 30 hours/week of manual ops down to under 5.
#22
@TomSolidPM
https://x.com/TomSolidPM/status/2066580517588271408
A sharp principle for anyone building self-improving agents: a self-improving agent is only as durable as where it writes the improvement. If the dreaming/reflection pass promotes lessons into vendor memory, the next model wakes up blank. Promote them into a folder you own and the lessons outlive the model that learned them. Own the substrate, rent the intelligence — a one-line design rule that decides whether your agent actually compounds.
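One way to read "a folder you own" is plain markdown files on disk that any future model can be handed. The sketch below assumes that reading; the `promote_lesson` and `load_lessons` helpers and the file naming scheme are hypothetical.

```python
from datetime import date
from pathlib import Path

def promote_lesson(folder, title, body, today=None):
    """Write one lesson as a dated markdown file in a folder the user owns."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = (today or date.today()).isoformat()
    slug = "-".join(title.lower().split())
    path = folder / f"{stamp}-{slug}.md"
    path.write_text(f"# {title}\n\n{body}\n", encoding="utf-8")
    return path

def load_lessons(folder):
    """Read every stored lesson, oldest first, to prepend to a new model's context."""
    return [p.read_text(encoding="utf-8") for p in sorted(Path(folder).glob("*.md"))]
```

Because the lessons are ordinary files, swapping the model or the vendor changes nothing about what the agent remembers.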
#23
@Dorialexander
https://x.com/Dorialexander/status/2066516339783565603
A thoughtful methodology take on where autoresearch is heading: as auto-research heats up, frontier models (which the EU can't access) are bound to take the lead in architecture experiments, so the harder but more promising path for those locked out might be to target open-endedness itself rather than chasing the same closed-model architecture search. A useful framing of autoresearch as a strategic, not just technical, frontier.
#24
@deforestpeg
https://x.com/deforestpeg/status/2066550288916324705
Spent two days making spendlens claim less recoverable money, on purpose — because the real waste lives in the agent loop and that's where it's easy to lie with a number. It counts tool results that refire every turn (a response cache would serve them free), collapses logging-bug duplicates by request id (some tools report 1 trillion tokens when you spent 180 billion), and refuses to put a dollar on near-identical rereads you can't actually recover. No LLM in the analysis; every number traces to a formula. A rare honest take on agent-loop cost accounting.
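Two of those formulas can be sketched directly. This is not the spendlens code: the record fields (`request_id`, `tokens`) and the turn format are assumptions for illustration. The first function counts each request once no matter how many times a logging bug repeated it; the second counts tool results that were re-sent unchanged on later turns, which a response cache could have served.

```python
from collections import Counter

def billed_tokens(records):
    """Sum tokens once per request_id, ignoring duplicated log lines."""
    per_request = {}
    for r in records:
        per_request.setdefault(r["request_id"], r["tokens"])
    return sum(per_request.values())

def refired_tool_results(turns):
    """Count tool results re-sent unchanged on later turns (cacheable)."""
    seen = Counter()
    for turn in turns:
        for result in set(turn):                 # count each result once per turn
            seen[result] += 1
    return sum(n - 1 for n in seen.values() if n > 1)

# Hypothetical log: request "a" logged twice, request "b" logged three times.
records = [{"request_id": "a", "tokens": 100}, {"request_id": "a", "tokens": 100},
           {"request_id": "b", "tokens": 200}, {"request_id": "b", "tokens": 200},
           {"request_id": "b", "tokens": 200}]
turns = [["ls: a.py"], ["ls: a.py", "cat a.py"], ["ls: a.py", "cat a.py"]]

print(sum(r["tokens"] for r in records), billed_tokens(records))
print(refired_tool_results(turns))
```

Near-identical rereads are deliberately left out, matching the post's point that a number you cannot recover should not get a dollar value.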
📡 Eco Products Radar
LangGraph — the orchestration backbone under the multi-agent self-improving stacks (ATLAS, 9-agent market intel).
DSPy — the prompt-optimization layer doing the actual self-tuning in those same stacks.
Hermes Agent — the self-improving personal-agent runtime behind the LLM Wiki, the SIA experiment, and the iPad team.
Karpathy autoresearch / LLM Wiki pattern — the methodological reference point cited across the day (verifier-first, evaluate-or-don't-loop).
Codex — repeatedly driven into self-goal-setting agentic loops alongside Claude Code.
SIA (Self-Improving Agent framework) — Hexolab's framework being put through real GPQA self-improvement tests.
Mastra — agent signals making a running loop addressable and steerable mid-run.