Loop Daily: July 3, 2026
The theme crystallizing this week is that the vocabulary itself has changed — "prompt engineering" became context engineering, "research" became autoresearch — and the people doing real work are the ones who stopped prompting and started designing loops. Physical AutoResearch pushed into robots teaching themselves skills, financial and energy-system loops ran overnight while their owners slept, and adversarial builder-plus-reviewer pipelines matured into shipping open-source tools. Cutting through the trading-bot hype narratives, the durable insight kept repeating: a loop only earns autonomy when something external can reject its bad output, and the eval that catches the real failure is the hard part.
#1
@DrJimFan
https://x.com/DrJimFan/status/2072337488782475390
Announced ENPIRE → ASPIRE, the second work in a Physical AutoResearch series aimed at robot self-improvement "one /skill at a time." The framing treats a robot's capability growth like an autoresearch loop, building the components that let embodied agents improve themselves rather than being hand-tuned by engineers.
#2
@FTayAI
https://x.com/FTayAI/status/2072138123052523814
A striking end-to-end account of a plan to automate a whole business with a legion of agents by the end of 2026. He has "hired" a board of directors for decisions and agents for customer support, and has just put his first AI sales specialist on a live prospect conversation: unscripted, it acknowledges each message and then steers the conversation via a custom sales skill loaded on every incoming message. With 22 paying users of his premium product reached mostly without touching sales, he plans to let autoresearch recursively self-optimize the sales agent from logged conversations.
#3
@R_E_Beer
https://x.com/R_E_Beer/status/2072319364821606747
Describes setting up Karpathy-style auto-research loops in the background for both work and curiosity. His example draws on energy-system design experience: point the agent at a starting question (say, optimizing an investment into a battery storage system) and iteratively add complexity with expert prompts while the analyses run 24/7 — "you can just point the agent and let the analyses run while you play with your kids." Setup takes 15-30 minutes when you already understand the fundamentals.
#4
@natashamalpani
https://x.com/natashamalpani/status/2072196402071962026
A sharp essay arguing the AI-x-science discourse conflates execution and discovery. Karpathy's autoresearch — 700 experiments in 48 hours, 20 improvements, no human in the loop — worked because success was unambiguous and fast to measure; that's compression, i.e. execution. The OpenAI model that disproved a 1946 Erdős conjecture is different: it explored more of mathematics simultaneously than any career allows, with no verifier telling it which step to take. Span versus compression, discovery versus execution.
#5
@softwaredoug
https://x.com/softwaredoug/status/2072363880345436448
A thoughtful take that loop engineering is extremely close to autoresearch, and maybe the best use of coding agents is just autoresearch. He sketches a spectrum from autoresearch to hand-coding with a "yawning gap" in the middle — poorly-built, underspecified autoresearchers that inconsistently succeed. That gap is why you get good days where the agent seems smart and bad days where it seems dumb, until you push all the way to the autoresearch side.
#6
@stretchcloud
https://x.com/stretchcloud/status/2072401100133855452
Breaks down AI Auto-Work, an open-source adversarial pipeline: Claude Code as the builder, Codex as the adversarial reviewer, each stage isolated with files only passing forward and compile/test gates before review. The structural edge is that a different model reviews with fresh eyes and a different blind-spot distribution — what Claude misses, Codex tends to catch. Systematic errors update a shared .ai/ knowledge base so corrections compound across sessions, not just within the code.
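The builder, gate, and reviewer stages described above can be sketched roughly as follows. This is a minimal illustration, not AI Auto-Work's actual code: `run_pipeline`, `Review`, and the `demo_*` stubs are hypothetical names, the builder and reviewer are plain callables standing in for two different models, and the `knowledge` list stands in for the shared .ai/ knowledge base.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Review:
    approved: bool
    findings: List[str] = field(default_factory=list)


def run_pipeline(
    task: str,
    build: Callable[[str, List[str]], str],
    gate: Callable[[str], bool],
    review: Callable[[str], Review],
    knowledge: List[str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Builder -> compile/test gate -> independent reviewer, with shared notes."""
    for _ in range(max_rounds):
        artifact = build(task, knowledge)   # builder sees accumulated notes
        if not gate(artifact):              # gate runs before any review
            knowledge.append("gate failed: fix build/tests before review")
            continue
        verdict = review(artifact)          # reviewer sees only the artifact
        if verdict.approved:
            return artifact
        knowledge.extend(verdict.findings)  # corrections persist across rounds
    return None


# Hypothetical stubs standing in for the two models and the test gate.
def demo_build(task: str, notes: List[str]) -> str:
    return "safe_div" if any("zero" in n for n in notes) else "naive_div"


def demo_gate(artifact: str) -> bool:
    return True  # pretend the artifact compiles and its tests pass


def demo_review(artifact: str) -> Review:
    if artifact == "naive_div":
        return Review(False, ["handle division by zero"])
    return Review(True)


notes: List[str] = []
result = run_pipeline("divide", demo_build, demo_gate, demo_review, notes)
print(result, notes)
```

In this toy run the reviewer rejects the first artifact, its finding lands in the shared notes, and the second build passes. Because the notes list outlives the call, the same correction would be available to later tasks as well.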
#7
@nisedo_
https://x.com/nisedo_/status/2072216335514276222
A practical trick from building a dozen agentic workflows: add a supervisor agent whose only job is to monitor the run, step in to unblock things, and file a GitHub issue every time it hits a bug or edge case. After a few runs most bugs are patched and the supervisor becomes a self-improving loop. He notes he still prefers to review the GH issues before acting, sometimes implementing the fix differently than the supervisor suggested.
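One way to picture the supervisor pattern is a thin wrapper that keeps the run moving and records an issue for every failure. The sketch below is an assumption-laden toy: `Supervisor`, `Issue`, and the demo steps are hypothetical, and issues are stored in memory rather than filed through any real GitHub API, so a human can still review them before anything is changed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Issue:
    title: str
    body: str
    occurrences: int = 1


class Supervisor:
    """Runs workflow steps, unblocks on failure, and records one issue per bug."""

    def __init__(self) -> None:
        self.issues: Dict[str, Issue] = {}

    def file_issue(self, title: str, body: str) -> None:
        # Deduplicate by title so a recurring bug bumps a counter
        # instead of opening a new issue every run.
        if title in self.issues:
            self.issues[title].occurrences += 1
        else:
            self.issues[title] = Issue(title, body)

    def run(self, steps: List[Callable[[], str]]) -> List[str]:
        results = []
        for step in steps:
            try:
                results.append(step())
            except Exception as exc:
                self.file_issue(f"{step.__name__}: {type(exc).__name__}", str(exc))
                results.append("skipped")  # unblock the run and keep going
        return results


# Hypothetical workflow steps for illustration.
def fetch() -> str:
    return "ok"


def parse() -> str:
    raise ValueError("bad date format")


supervisor = Supervisor()
outcome = supervisor.run([fetch, parse, parse])
print(outcome, supervisor.issues)
```

Keeping the issue log separate from any automatic fix matches the caveat in the post: the supervisor surfaces the problem, and a person decides how to patch it.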
#8
@kepochnik
https://x.com/kepochnik/status/2072284018725605630
Details a Hermes + Obsidian second-brain setup running 24/7 that learns patterns itself. The self-improvement loop runs every 10 turns, reviewing conversations and updating memory files automatically; a 4am stale-session sweep extracts learning from old conversations before they expire. He notes 200 pre-built skills actually broke his agent until he fixed it with a curated set, and isolated Discord channels give each area of life its own agent personality and skillset.
#9
@tom_doerr
https://x.com/tom_doerr/status/2072251701608784049
Shares a self-improving meta-skill that drafts and updates AI agent skills by observing actual work sessions — instead of you guessing the prompt, the system watches execution and refines the skill definition based on what actually succeeds. A concrete answer to the "how do skills get written" bottleneck in self-improving agents.
#10
@evermind
https://x.com/evermind/status/2072297085203017898
Introduces Raven, a memory-first self-improving agent harness powered by EverOS that keeps user memory, agent memory, tools, skills, policies, and execution context together. The key loop primitive: successful workflows become reusable agent templates, so the system accumulates capability from what has already worked rather than restarting each task.
#11
@algoxstonk
https://x.com/algoxstonk/status/2072389899396034885
Open-sourced blcli, an agentic infrastructure stack battle-tested at 30M+ user scale that lets coding agents manage whole cloud infra through code, PRs, dry-runs, and deterministic apply workflows. The thesis is pointed: agents build toy infra not because they can't handle real systems but because real infra needs enormous expert context, so blcli packages that production knowledge code-first for agents to read, reason about, and operate safely across an 18-month iteration horizon.
#12
@anton_iades
https://x.com/anton_iades/status/2072396614287962156
Presents Heuresis, a composable framework combining coding agents with arbitrary search algorithms inside a flexible loop, built specifically to improve the exploration capabilities of Auto Research Agents making novel AI discoveries. It targets the part most autoresearch setups underinvest in — not just running experiments, but searching the space of what to try intelligently.
#13
@Dinosn
https://x.com/Dinosn/status/2072238179805962355
Surfaces anti-autoresearch, a reviewer-side integrity-forensics tool: don't trust an autoresearch paper at face value. It runs self-consistency and fabrication checks for a deterministic verdict, with 61 signals including 46 integrity hack-patterns across families A–H. A necessary counterweight as autoresearch output floods in — a loop that audits the loops.
#14
@SwishMoe
https://x.com/SwishMoe/status/2072303928180416527
Describes autoresearch-explorer, where a Qwen3-8B GRPO agent learns which document chunks are worth opening and how to answer under a budget — training models for workflow judgment, not just better prompting. He built the research-docs version in parallel to a financial-docs one; same direction: domain-specific judgment models learned through a loop rather than hand-specified.
#15
@JustinAngel
https://x.com/JustinAngel/status/2072401192907612481
Proposes a new autoresearch benchmark for the NanoGPT speedrun with three tiers — with arxiv, with internet, and with nothing. Early findings: Claude leads on refusals and token use, Codex doesn't refuse and gets similar results, and Kimi K2.5 uses the fewest tokens. The notable meta-result: all advances came from reusing existing techniques, with some novel adaptations but no genuinely novel techniques.
#16
@doronkatz
https://x.com/doronkatz/status/2072353986594898331
A deep operational analysis of Perplexity's Brain, arguing the important thing is what it remembers: not who you are but what the agent did, what succeeded and what failed, refined overnight into a context graph (+25% correctness, +16% recall, -13% cost on repeated tasks). He reframes agent memory from a UX feature to a system surface with versions, rollback, an audit trail, and a cost model, and gives a five-point checklist (including source traceability, retention/forgetting, ownership, and a cost model) for treating a self-improving memory loop as infrastructure.
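To make "memory as a system surface" concrete, here is a small sketch of a store with versions, rollback, an audit trail, and a source tag on every entry. It is not how Perplexity's Brain is built; `VersionedMemory`, `Entry`, and the sample keys and run IDs are hypothetical, and a cost model is left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Entry:
    value: str
    source: str  # which run or document produced this memory (traceability)


class VersionedMemory:
    """Append-only snapshots: every write or rollback creates a new version."""

    def __init__(self) -> None:
        self._versions: List[Dict[str, Entry]] = [{}]
        self.audit: List[Tuple[int, str, str]] = []  # (version, action, key)

    @property
    def version(self) -> int:
        return len(self._versions) - 1

    def write(self, key: str, value: str, source: str) -> int:
        snapshot = dict(self._versions[-1])  # copy, never mutate history
        snapshot[key] = Entry(value, source)
        self._versions.append(snapshot)
        self.audit.append((self.version, "write", key))
        return self.version

    def read(self, key: str) -> Optional[Entry]:
        return self._versions[-1].get(key)

    def rollback(self, version: int) -> int:
        if not 0 <= version <= self.version:
            raise ValueError(f"unknown version {version}")
        # Rolling back is itself a new version, so the audit trail stays complete.
        self._versions.append(dict(self._versions[version]))
        self.audit.append((self.version, f"rollback_to_{version}", "*"))
        return self.version


mem = VersionedMemory()
mem.write("deploy_cmd", "make deploy", source="run-101")
mem.write("deploy_cmd", "make deploy --force", source="run-102")  # a bad lesson
mem.rollback(1)
print(mem.read("deploy_cmd"), mem.audit)
```

Treating a rollback as a new version rather than deleting history is a design choice in this sketch: it keeps the record of what the loop once believed, which is what an audit trail is for.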
#17
@vadwarp
https://x.com/vadwarp/status/2072336532564836861
Building a self-improvement loop where an autonomous agent treats its own source code as a workspace — learning, remembering, and eventually editing itself — arguing Opus 4.8+ models are capable of self-analysis given careful setup. He's drawing on recent papers (A Self-Improving Coding Agent, Darwin Gödel Machine, HyperAgents) and shares infographics ahead of an article covering his first real-world experiment.
#18
@0thernet
https://x.com/0thernet/status/2072284840121286793
Sick of using other people's harnesses and IDEs, he built his own self-improving coding agent and app in a single overnight session (11pm to 4:30am). It's baked into their repo and reuses the same stack as an unannounced product — a small data point on how quickly a bespoke self-improving harness can now be stood up.
#19
@Ivory_Towerz
https://x.com/Ivory_Towerz/status/2072464516546252892
A detailed walk-through of Hermes token economics for long-running agents: most cost comes not from conversations you intend but from background processes, context bloat, and defaults that load 90+ skills and full tool schemas every turn. The 10 concrete cuts — cheap auxiliary models for background tasks, routing sub-agents to cheaper models, aggressive compression thresholds, trimming unused skills, enabling on-demand tool search — are the practical tuning layer for keeping a self-improving loop affordable while keeping auto-memory on.
#20
@ariccio
https://x.com/ariccio/status/2072322063449080156
Points out an agentic loop nobody's using yet: GPT-5.5 Codex is good at writing arbitrary static checks, so when you observe an issue, have it write a static check, then have Codex fix all the other instances the check surfaces, then wire it into your git hooks or CI/CD. A tidy self-reinforcing loop where each observed bug becomes a permanent guardrail.
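As an illustration of what such a generated static check might look like, the sketch below uses Python's standard `ast` module to flag bare `except:` clauses. The specific rule and the names `find_bare_excepts` and `SAMPLE` are assumptions chosen for the example; the post does not say which checks were written.

```python
import ast
from typing import List, Tuple


def find_bare_excepts(source: str, filename: str = "<string>") -> List[Tuple[str, int]]:
    """Return (filename, line) for every bare `except:` clause in the source."""
    tree = ast.parse(source, filename=filename)
    return [
        (filename, node.lineno)
        for node in ast.walk(tree)
        # A bare `except:` is an ExceptHandler with no exception type attached.
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]


SAMPLE = '''\
try:
    risky()
except:
    pass

try:
    risky()
except ValueError:
    pass
'''

findings = find_bare_excepts(SAMPLE)
print(findings)
```

A script built around this function can exit non-zero when `findings` is non-empty, which is all a git hook or CI job needs in order to turn the observed bug into a standing guardrail.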
#21
@hrswatigupta
https://x.com/hrswatigupta/status/2072313463465279523
Describes AxDafny, which uses an agentic loop where the Dafny verifier acts as the teacher: it generates code and proofs, runs them through the verifier, and repairs whatever fails. A clean example of a loop with a genuinely external, unambiguous judge — the verifier — which is precisely what lets it run autonomously.
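The generate, verify, repair cycle can be sketched generically. The code below is not AxDafny and does not call Dafny: `verify_and_repair` is a hypothetical helper, and `toy_verify` is a stand-in that only checks for required clauses as text, where a real setup would invoke the actual verifier and feed its error output back to the model.

```python
from typing import Callable, Optional, Tuple


def verify_and_repair(
    generate: Callable[[], str],
    repair: Callable[[str, str], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Loop until the external verifier accepts, or the attempt budget runs out."""
    candidate = generate()
    for attempt in range(1, max_attempts + 1):
        ok, feedback = verify(candidate)         # the verifier is the only judge
        if ok:
            return candidate, attempt
        candidate = repair(candidate, feedback)  # repair is driven by its feedback
    return None, max_attempts


# Stand-in verifier: accepts only when both clauses are present in the text.
REQUIRED = ["requires n >= 0", "ensures r * r <= n"]


def toy_verify(candidate: str) -> Tuple[bool, str]:
    for clause in REQUIRED:
        if clause not in candidate:
            return False, clause
    return True, ""


def toy_generate() -> str:
    return "method Sqrt(n: int) returns (r: int)"


def toy_repair(candidate: str, feedback: str) -> str:
    return candidate + "\n  " + feedback  # naive fix: add the missing clause


accepted, attempts = verify_and_repair(toy_generate, toy_repair, toy_verify)
print(attempts)
print(accepted)
```

The loop never decides for itself that the output is good. Acceptance comes only from `verify`, and a run that exhausts its budget returns `None` rather than an unverified candidate.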
#22
@PixelRainbowNFT
https://x.com/PixelRainbowNFT/status/2072414229412651321
Open-sourced the DwarfStar DS4 dashboard with a built-in MCP server so a model in an agentic loop can drive the dashboard programmatically: spin up the engine, run a benchmark, inspect metrics, tweak a config profile, re-run, and iterate. The model can effectively train, tune, and test itself against its own runtime — a self-tuning loop where DeepSeek4 can tune DS4's own settings.
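The benchmark, tweak, re-run cycle amounts to a simple search over configuration values. The sketch below shows that shape only: `tune` and `fake_benchmark` are hypothetical, and nothing here talks to the DS4 dashboard or its MCP server, whose tool names are not given in the post.

```python
from typing import Callable, Dict, Iterable, Tuple

Config = Dict[str, int]


def tune(
    benchmark: Callable[[Config], float],
    base: Config,
    key: str,
    candidates: Iterable[int],
) -> Tuple[Config, float]:
    """Try each candidate value for one setting and keep the best-scoring config."""
    best_config = dict(base)
    best_score = benchmark(best_config)    # baseline run
    for value in candidates:
        trial = {**best_config, key: value}
        score = benchmark(trial)           # re-run with the tweaked profile
        if score > best_score:             # keep a change only if it measurably helps
            best_config, best_score = trial, score
    return best_config, best_score


# Hypothetical benchmark: pretend throughput peaks at batch_size == 32.
def fake_benchmark(cfg: Config) -> float:
    return -abs(cfg["batch_size"] - 32)


best, score = tune(fake_benchmark, {"batch_size": 8}, "batch_size", [16, 32, 64])
print(best, score)
```

In an agentic setup the model would choose the candidate values and the benchmark would be a real run, but the accept-only-if-better rule is what keeps the loop from drifting.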
#23
@julie_bush
https://x.com/julie_bush/status/2072211761781223492
Built Cursus Publicus for the Hermes Agent hackathon, answering Patrick Collison's ask for an LLM workflow tool: files, context, stored prompts, coding agents, compiled outputs, shareable results. It's a "post office for agents" — free AI on NVIDIA Nemotron 3 Ultra, Hermes agents, Stripe payments, loops, autoresearch, memory, and receipts.
#24
@eliebakouch
https://x.com/eliebakouch/status/2072202622027833618
While preparing conference slides, asked Claude to change the color of his autoresearch plots and got blocked by the safety filter — "everything is fine :)". A small but telling friction anecdote from a researcher presenting autoresearch work, and a reminder that safety classifiers are catching benign autoresearch-adjacent requests.
📡 Eco Products Radar
Karpathy's autoresearch pattern is the gravitational center this week, referenced across nearly every serious loop discussion. Hermes and OpenClaw recur as the persistent-agent harnesses people build loops on top of, with Claude Code and Codex the default builder/reviewer pair in adversarial pipelines. Obsidian shows up repeatedly as the memory substrate for self-improving second-brain loops, and Perplexity Brain is the week's most-discussed productized memory loop. On the frameworks side, EverOS/Raven, blcli, Heuresis, and darwin-agents all pitch themselves as loop or self-improvement infrastructure, while DeepSeek, Qwen, and Kimi K2.5 recur as the models people run inside these loops.