Loop Daily: June 25, 2026
Autoresearch graduated from method-tuning to genuine discovery today, with a human-AI system finding a new SOTA crystal-structure algorithm rather than just refining an old one. The loop discourse split into two honest camps: the believers shipping real systems (Brex auto-filing bugs that another agent fixes overnight, a coding agent run 60 hours straight, the open Fast Gemma Challenge pitting dozens of agents against a problem humans haven't cracked) and the realists drawing the lines, like the agentic loop that lifted a Gemini kernel score then plateaued after about 20 steps because feedback fixes syntax, not architecture. The clearest practical thread: a loop is only as good as its verifiable numeric goal and its hard stop condition, and the cost lives in how many times you re-invoke the model, not how smart it is.
#1
@rldudrldbs2
https://x.com/rldudrldbs2/status/2069429961275080953
A team built a Human-AI co-discovery system called HACO that discovered MaskGXT, a new SOTA algorithm for crystal structure prediction. The key distinction they draw: where agentic research systems like Karpathy's autoresearch focus on refining a fixed method, HACO aimed at finding a new generative principle altogether. This is one of the strongest signals of the day that autoresearch loops can produce genuinely novel science, not just tune existing approaches.
#2
@cyrilXBT
https://x.com/cyrilXBT/status/2069367416325603598
Google and Hugging Face launched the Fast Gemma Challenge: dozens of autonomous agents racing to optimize Gemma 4 E4B inference speed on a fixed A10G, with a live leaderboard scored on tokens per second and a perplexity guardrail so no agent can win by quietly degrading the model. The agents coordinate through a shared message board, each claiming a research direction (vLLM, quantization, torch.compile, speculative decoding, custom kernels) and publishing results in real time. This is autoresearch as an open, multi-agent contest on a problem humans haven't fully solved.
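The scoring rule described here (rank by tokens per second, but disqualify any entry that degrades perplexity) can be sketched in a few lines. This is an illustration only, not the challenge's actual scorer: the `Submission` type, the agent names, the baseline perplexity, and the 1% tolerance are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    agent: str
    tokens_per_second: float
    perplexity: float


def rank_submissions(subs, baseline_ppl, tolerance=0.01):
    """Rank by throughput, dropping entries whose perplexity breaks the guardrail.

    baseline_ppl and tolerance are hypothetical values, not the challenge's own.
    """
    limit = baseline_ppl * (1 + tolerance)
    valid = [s for s in subs if s.perplexity <= limit]
    return sorted(valid, key=lambda s: s.tokens_per_second, reverse=True)


subs = [
    Submission("quantizer", 210.0, 9.80),    # fastest, but quality degraded
    Submission("compiler", 150.0, 8.02),
    Submission("spec-decode", 175.0, 8.05),
]
leaderboard = rank_submissions(subs, baseline_ppl=8.0)
print([s.agent for s in leaderboard])
```

The point of the guardrail is visible in the toy data: the fastest entry never reaches the leaderboard because its speed was bought with model quality.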
#3
@VukRosic99
https://x.com/VukRosic99/status/2069368710356467725
A DeepSeek researcher open-sourced his own autonomous research system, and this breakdown walks through its design and why it generalizes far beyond AI research to any agentic project (coding, finance, etc.). The core insight is naming the three failure modes that kill long-running agents (cognitive loops, stalling, and runtime fragility) and the fix for each, including a main orchestrator that spawns a fresh agent per task because fresh sessions beat resuming accumulated context. You can install the whole thing into your own agent from a single prompt.
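The fresh-agent-per-task idea is easy to picture in code. The sketch below is not the open-sourced system itself: `AgentSession` and `orchestrate` are hypothetical names, and the "agent" only records its context instead of calling a model.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    """Hypothetical stand-in for one agent run; `context` is its message history."""
    context: list = field(default_factory=list)

    def run(self, task: str) -> str:
        self.context.append(task)
        # A real session would call a model here; we only report context size.
        return f"{task}: done with {len(self.context)} message(s) in context"


def orchestrate(tasks):
    """Spawn a fresh session per task so no task inherits accumulated context."""
    results = []
    for task in tasks:
        session = AgentSession()  # new session, empty context
        results.append(session.run(task))
    return results


results = orchestrate(["reproduce baseline", "sweep learning rate", "write report"])
for line in results:
    print(line)
```

Every task sees a context of exactly one message, which is the property the breakdown argues for: nothing carried over from earlier tasks can push the next one into a cognitive loop.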
#4
@SynScience
https://x.com/SynScience/status/2069495691278528957
A team is building Atlas, where reproduction stops being a project: give it an arXiv id and a reproduce skill rebuilds the result as a graph stripped to what actually runs, then asks you for what the PDF left out instead of guessing. That graph becomes a starting point you can point autoresearch at (take any node, extend the experiment, optimize for a metric, push in directions the authors ruled out), and your runs join the same graph for others to fork. A concrete vision of autoresearch as a shared, branching experiment graph rather than isolated paper-chasing.
#5
@GabrielAsher02
https://x.com/GabrielAsher02/status/2069252214879560114
Scientists' three complaints about LLMs (hallucination, mediocre results, dragging human-in-the-loop cycles) get answered here not by a smarter model but by wrapping it in a loop: draft, critique against a rubric, verify against the literature, revise. The concrete demo: the same figure request that on raw Claude truncates a y-axis and invents accuracies, run through a scientific-figure loop, fixes the axis in two self-critique steps, calls a literature-search tool to pull each number from the original papers, and renders a clean colorblind-safe figure. It's one of 20+ drop-in agentic loops in the open agent-loop-skills repo (autoresearch, data analysis, scientific writing, red-teaming).
#6
@nasqret
https://x.com/nasqret/status/2069446813040439691
A researcher reports finally pushing past the 900-qubit frontier in their simulation, crediting auto-research that is gaining real momentum. Their takeaway on what makes auto-loops work: it comes down to how you design the scaffolding at the very beginning and then steer it gently. Set up the strategy and little hints and directives, but still let the agent explore the problem. A concrete result paired with an honest account of the human's role as designer-and-nudger rather than micromanager.
#7
@confinedape
https://x.com/confinedape/status/2069372089295925441
In a thread on trading automation, a builder describes letting a cheap agent (something like GPT-5.4 Nano) run a closed fix-loop modeled on Karpathy's autoresearch: as alerts trigger, the agent rates each opportunity against your given standards and strategy, fed rolling market metrics. It's a small but concrete example of porting the autoresearch loop out of AI research and into live financial signal-rating, with an honest note that he was happy with it though no longer uses it.
#8
@JanKoritak
https://x.com/JanKoritak/status/2069494658246316378
While building a multi-turn voice agent in late 2025, this developer needed to evaluate three things (the correct sequence of tool calls, that tools are called with correct arguments, and intent per turn) and couldn't find a fitting solution, so they hacked a prototype that let them unlock Karpathy's autoresearch loop to fine-tune the voice agent's behavior against given scenarios. Their reflection six months later: agent evals and observability have matured fast, and Karpathy's autoresearch methodology now has near first-class support in the frontier harnesses.
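Two of the three checks, tool-call sequence and tool-call arguments, reduce to plain comparisons against an expected scenario. The sketch below is a guess at the shape of such an eval, not the developer's prototype: `score_turn`, the tool names, and the order IDs are invented for illustration, and intent-per-turn scoring is left out because it typically needs a classifier or a model grader.

```python
def score_turn(expected, actual):
    """Compare expected vs. actual tool calls for one scenario.

    Each call is a (tool_name, arguments_dict) pair. Returns whether the tool
    sequence matches, and whether the arguments match as well.
    """
    sequence_ok = [name for name, _ in expected] == [name for name, _ in actual]
    args_ok = sequence_ok and all(e[1] == a[1] for e, a in zip(expected, actual))
    return {"sequence_ok": sequence_ok, "args_ok": args_ok}


# Hypothetical scenario: right tools in the right order, one wrong argument.
expected = [("lookup_order", {"order_id": "A17"}), ("issue_refund", {"order_id": "A17"})]
actual = [("lookup_order", {"order_id": "A17"}), ("issue_refund", {"order_id": "A71"})]

result = score_turn(expected, actual)
print(result)
```

Scoring sequence and arguments separately gives an optimization loop a finer signal than a single pass/fail per scenario.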
#9
@noobkunalx
https://x.com/noobkunalx/status/2069468306013368701
A coding-agent power user describes realizing the compounding benefits of small improvements to agent harnesses via these 'outer loops', inspired by things like Hermes and Karpathy's autoresearch. His easy trick for anyone living in coding agents: hook up tracing for your agents, then run an outer loop over the session traces across a time period to improve the harness itself. He built exactly that for his own outer loop, a grounded example of self-improvement applied to your own daily agent usage.
#10
@Vtrivedy10
https://x.com/Vtrivedy10/status/2069453528234447123
A LangChain engineer argues autoresearch-style proposal loops should be data-driven: they work best only when data, evals, and feedback give a useful gradient to hill-climb against, and increasingly autoresearch is a very good tool for general agent optimization. He lays out how the LangChain ecosystem supports running these self-improvement loops (a customizable harness in deepagents or create_agent, any model provider, BYO tools/prompts/skills, plus eval tooling in OpenEvals and LangSmith) and notes that starting today gives you an earlier feel for the data your specific use case needs.
#11
@togethercompute
https://x.com/togethercompute/status/2069515320466059549
A concrete data point on the limits of agentic loops: an agentic loop (compile, test, profile, revise) helped Gemini 3 Pro go from 24 to 35 out of 87 correct on a GPU-kernel task, then plateaued after about 20 steps. The honest finding is that the loop's feedback fixes syntax but not the deeper problems (rank coordination, collective ordering, transfer-mechanism choice), and hardware features like TMA and NVLS stay almost unused. A useful reality check on what compile-test-revise loops can and can't crack.
#12
@sandy4kad
https://x.com/sandy4kad/status/2069557165753504174
A clean four-phase breakdown of loop engineering, aimed at de-mystifying the hype. Phase 1 Trigger (schedule, file drop, API call); Phase 2 Execution tied to a skill file so it runs identically every time; Phase 3 Goal plus Verification, the only phase that matters and the one where most people fail because they pick subjective goals ('was this post good?') instead of numeric ones ('make this script run faster'); Phase 4 Output plus Memory, logging every run and delta so the loop gets better on its own, the same principle behind Karpathy's autoresearch. The takeaway: the only question worth asking before building a loop is whether the goal can be verified with a number.
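The four phases map onto a very small program once the goal is a number. The sketch below is one possible reading of the breakdown, not the author's code: `measure` is a made-up metric standing in for something like a script's runtime, and the "execution" step is a random perturbation rather than an agent following a skill file.

```python
import random


def measure(candidate: float) -> float:
    """Stand-in numeric metric (lower is better), e.g. a script's runtime."""
    return (candidate - 3.0) ** 2


def run_loop(start: float, steps: int, seed: int = 0):
    rng = random.Random(seed)
    best, best_score = start, measure(start)
    log = []  # Phase 4: memory of every run and its delta
    for step in range(steps):                       # Phase 1: trigger (fixed schedule)
        candidate = best + rng.uniform(-1.0, 1.0)   # Phase 2: execution
        score = measure(candidate)                  # Phase 3: numeric verification
        delta = best_score - score
        kept = delta > 0
        if kept:
            best, best_score = candidate, score
        log.append({"step": step, "score": score, "delta": delta, "kept": kept})
    return best, best_score, log


best, best_score, log = run_loop(start=0.0, steps=50)
print(f"best score {best_score:.4f} after {len(log)} runs")
```

Swap `measure` for a subjective judgment and the `delta > 0` test stops meaning anything, which is the failure the post warns about.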
#13
@rewind02
https://x.com/rewind02/status/2069464178671170046
Sierra's Head of Product breaks down how they rebuilt their entire voice agent loop from scratch. This is not a chatbot: it runs millions of real customer conversations for Fortune 20 companies. Concrete details: parallelizing thinking, listening, and talking in real time; running two transcription models in parallel and letting silence decide which one to trust; and a self-improving loop that detects an issue, suggests a fix, has a human approve, and repeats. His sharpest warning: multi-agent loops usually destroy value instead of creating it.
#14
@sarvamcode
https://x.com/sarvamcode/status/2069406115809656959
sarvam-code is being built with sarvam-code: the agent reads its own repo, checks the live web, and ships the patches that make the next version better. v0.1.10 added web search, and a good chunk of it was written by the tool itself: MVP to recursively self-improving coding agent in about a week. They acknowledge Codex and Claude Code users already have web search; what they're proud of is the loop, the tool being its own best contributor and getting sharper every release. A clean, concrete instance of a self-improving coding agent dogfooding itself.
#15
@DJ_CURFEW
https://x.com/DJ_CURFEW/status/2069499429292568919
A founder reports their '100x org' is now approaching a 5:1 agent-to-human ratio (5,000 agents for 1,000 people) and insists it's tokenSAVING not tokenmaxxing, because most of their agents don't bother a single human and just run as triggers and loops in the background. The substantive bit under the hype: their edge isn't more tokens or a better model, it's a 'live intelligence' layer where every one of ~100,000 daily company events runs through a cheap LLM that summarizes, organizes, triggers, and rolls up, plus self-improving orchestration that focuses on orchestrating context, not models. A real org-scale account of background self-improving loops.
#16
@plutos_eth
https://x.com/plutos_eth/status/2069545539973369952
Brex CEO Pedro Franceschi puts it bluntly: every good AI product you've ever used is just an agent loop with tools. At Brex, when a customer's chat with their expense agent goes badly, it files a bug automatically, and that bug triggers another agent that rewrites the code and the prompts until the case passes, with a human stepping in only if it can't. The stated goal: a system that watches everything overnight and re-learns by morning, a company that debugs itself. One of the clearest real-company statements of the self-improving-loop thesis.
#17
@predotdev
https://x.com/predotdev/status/2069517791087018310
A team explains why a continuous agent loop just bleeds tokens without a long-term execution graph, and how they sustained a coding agent running for 60 hours straight. Their answer: treat it like managing a human engineering team. Start with a plan broken into user stories, milestones, and acceptance criteria up front, so the harness can dynamically configure isolated cloud sandboxes and route each subtask to the right model tier. On top of that sits recursive memory compaction that compresses interaction history on the fly without breaking the loop, letting the agent handle tens of thousands of messages in one continuous session.
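The post does not spell out how the compaction works, so the following is only one plausible reading of "recursive": whenever the history grows past a limit, everything except the most recent messages is folded into a single summary, and that summary is itself folded into the next one. `summarize`, `max_len`, and `keep_recent` are hypothetical, and a real harness would call a model where the stub builds a placeholder string.

```python
def summarize(messages):
    """Hypothetical summarizer; a real harness would ask a model to do this."""
    return f"[summary of {len(messages)} messages]"


def compact(history, max_len=8, keep_recent=4):
    """Fold older messages into one summary whenever history exceeds max_len.

    Earlier summaries sit inside the folded span, so compaction is recursive:
    each new summary covers the previous summary plus newer messages.
    """
    if len(history) <= max_len:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = []
for i in range(100):
    history.append(f"msg {i}")
    history = compact(history)  # runs inline, so the loop never has to stop

print(len(history), history[0])
```

The history stays bounded no matter how long the session runs, at the cost of whatever detail each round of summarization drops.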
#18
@shivsakhuja
https://x.com/shivsakhuja/status/2069251255189520478
A founder describes two closed-loop systems they built for video ads. Loop 1, video quality: every time the agent makes a video, it watches the video, learns from how the operator prompted it, grades it, and updates its own skills, so it catches issues like captions colliding with a face, voiceover-visual timing mismatch, or a watch that appears then disappears, without the operator flagging them next time. Loop 2, ad performance: it learns from Meta Ads data which hooks stop the scroll, which offers get clicks, which styles convert, and makes more of what works. A concrete, non-coding self-improving loop where the faster the feedback signal, the more powerful the loop.
#19
@deshrajdry
https://x.com/deshrajdry/status/2069466865588711846
Mem0 launched a plugin for Pi Code that gives the terminal coding agent persistent, scoped, semantic memory across sessions and projects. It adds three things to the coding loop: memory scopes (project default with git-root detection for monorepos, plus session and global), memory controls (/mem0-remember, /mem0-search, /mem0-dream, /mem0-pin, /mem0-status), and an agent tool so Pi can search, add, update, and delete memories while working. The explicit goal is to make memory part of the terminal agent loop itself, not a separate dashboard or context file.
#20
@0xine
https://x.com/0xine/status/2069483519907426774
A security engineer working on Semgrep Guardian makes a sharp point: being inside the agent loop is an unfair advantage for security. Last year he assumed AI-generated code would be scanned the same way as human code; what he realized is that you can ask the agent to switch to a secure library (say defusedxml for Python) and it will happily do it before the code ever lands. Doing the equivalent in a CI code review means a lot more context switching and work for a developer. A concrete argument that the loop is the right place to enforce security, not the gate after it.
#21
@synestiq
https://x.com/synestiq/status/2069394315164398074
A detailed, numbers-backed answer on running agent loops cheaply. The advice: implement the trigger as a simple while-true consuming a queue, then merge queued events into a single prompt and batch tool calls whenever possible, because if your context is 100k tokens and you make 10 separate invocations, each carries that same 100k, so you pay for roughly a million cached input tokens even if each request adds only 1k. In his experience prompt cache accounts for most of the cost (a worked example totals $516.86, $311 of it cached), so reducing the number of model invocations often matters more than shrinking each new input.
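The "roughly a million" figure is quick to check. The sketch below counts tokens only and assumes every call replays the full prior context from cache; `cached_tokens_read` is a helper invented for this example, and no prices are assumed because the post does not give per-token rates.

```python
def cached_tokens_read(context_tokens, new_tokens_per_call, calls):
    """Total cached input tokens re-read when each call replays the prior context.

    Call i re-reads the base context plus everything added by earlier calls.
    """
    return sum(context_tokens + i * new_tokens_per_call for i in range(calls))


# Ten separate invocations, each adding 1k new tokens to a 100k context.
separate = cached_tokens_read(100_000, 1_000, calls=10)
# The same ten events merged into one prompt: the context is read once.
batched = cached_tokens_read(100_000, 10_000, calls=1)

print(f"10 separate calls: {separate:,} cached tokens")
print(f"1 batched call:    {batched:,} cached tokens")
```

Under these assumptions the ten separate calls re-read 1,045,000 cached tokens against 100,000 for the batched call, so the invocation count, not the 10k of new input, drives the bill.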
#22
@sunaiuse
https://x.com/sunaiuse/status/2069564603407987008
A developer built a Claude loop that rewrites video scripts until they score 9 out of 10 on all 9 criteria, running with no human in the loop until the output matches the definition of done. The post uses it to explain the architecture in three layers (the LLM itself, the app layer with tools, and the agent layer where you give it skills, memory, context and one specific goal and let it run) and stresses the critical detail: every loop needs a hard stop condition, or it runs forever and burns tokens. Above that sit always-on agents like OpenClaw and Hermes on long-horizon goals, and meta-loops orchestrating multiple agents.
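A loop like this needs two exits: the definition of done, and a cap that fires even if done is never reached. The sketch below is a toy, not the developer's Claude loop: `rewrite_until_done`, the three criteria (the post uses nine), and the 20-iteration cap are assumptions, and the scorer and rewriter are stubs where a real loop would call a model.

```python
def rewrite_until_done(script, score_fn, rewrite_fn, threshold=9, max_iters=20):
    """Rewrite until every criterion meets the threshold or the iteration cap hits."""
    for iteration in range(max_iters):
        scores = score_fn(script)
        if all(s >= threshold for s in scores.values()):
            return script, iteration, True      # definition of done reached
        script = rewrite_fn(script, scores)
    return script, max_iters, False             # hard stop: cap reached


# Toy stand-ins: a "script" is just a dict of criterion scores, and each
# rewrite raises the weakest criterion by one point.
def toy_score(script):
    return dict(script)


def toy_rewrite(script, scores):
    weakest = min(scores, key=scores.get)
    return {**script, weakest: script[weakest] + 1}


draft = {"hook": 7, "pacing": 8, "clarity": 9}
final, iterations, done = rewrite_until_done(draft, toy_score, toy_rewrite)
print(final, iterations, done)
```

If the rewriter never improves anything, the function still returns after `max_iters` passes with `done` set to `False`, which is the behavior that keeps a stuck loop from burning tokens indefinitely.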
Eco Products Radar
autoresearch (Karpathy): the reference loop everyone benchmarks their own against
agent-loop-skills: open repo of 20+ drop-in agentic loops (autoresearch, figures, red-teaming)
LangChain / deepagents: harness and eval tooling for running data-driven self-improvement loops
Mem0: persistent scoped memory wired directly into the coding agent loop
Hermes / OpenClaw: the always-on runtimes people layer long-horizon loops on top of
Codex / Claude Code: the default coding harnesses now with first-class loop support