June 26, 2026 | loop

Loop Daily: June 27, 2026

The loop conversation matured this week from "look, an agent that runs itself" into hard results and honest limits. The headline is robotics: 8 Codex autoresearch agents brought a real robot fleet to life end-to-end with no human bridge. Underneath, the same skeleton keeps reappearing (fitness function, executor, keep-or-revert, lessons log, budget) whether the domain is LoRA fine-tuning over a weekend, drug-discovery property prediction, jailbreak research, or autonomous trading. Two themes cut through. First, "self-improving" almost always means a sharper memory loop around the same model (Anthropic's Dreaming, Hermes' wiki, Perplexity's Brain), not new weights. Second, people are getting honest that naive single-agent loops plateau and that a machine grading its own homework is the real risk. Here's what people actually ran.
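The recurring skeleton named above can be sketched in a few lines. This is a toy illustration, not any team's actual harness: the `fitness` and `propose` functions, the `lr` hyperparameter, and the `LoopState` class are all hypothetical stand-ins chosen so the example runs on its own.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LoopState:
    best_config: dict
    best_score: float
    lessons: list = field(default_factory=list)  # the lessons log


def fitness(config: dict) -> float:
    """Toy fitness function (hypothetical): best possible score at lr = 0.1."""
    return -abs(config["lr"] - 0.1)


def propose(config: dict, rng: random.Random) -> dict:
    """Toy executor step (hypothetical): rescale one hyperparameter."""
    return {"lr": max(1e-4, config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))}


def run_loop(start: dict, budget: int, seed: int = 0) -> LoopState:
    rng = random.Random(seed)
    state = LoopState(best_config=start, best_score=fitness(start))
    for step in range(budget):  # budget: hard cap on the number of experiments
        candidate = propose(state.best_config, rng)
        score = fitness(candidate)
        if score > state.best_score:
            # Keep: the candidate becomes the new champion.
            state.best_config, state.best_score = candidate, score
            state.lessons.append(f"step {step}: kept {candidate}")
        else:
            # Revert: the champion is left untouched, only the lesson is logged.
            state.lessons.append(f"step {step}: reverted {candidate}")
    return state


state = run_loop({"lr": 1.0}, budget=20)
print(state.best_config, state.best_score, len(state.lessons))
```

The design point is that the champion only ever changes on a strict improvement, so a bad experiment costs budget but never loses ground.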
💡 #1
@SciTechera
https://x.com/SciTechera/status/2070055664123339001
For the first time, 8 Codex-based AutoResearch agents brought a robot fleet to life end-to-end with no human bridge in between, via NVIDIA's ENPIRE framework. The agents analyze failures, rewrite code, retrain policies, read research papers, and launch new experiments autonomously, learning GPU installation, pin organization, zip-tie cutting, and Push-T manipulation at up to 99% success through continuous self-improvement on real hardware. They even found a new form of physical scaling, where larger robot fleets generate more real-world experience and accelerate learning. Instead of humans running the robotics research loop, the agents now run most of the cycle themselves.
💡 #2
@HITESHJ20841451
https://x.com/HITESHJ20841451/status/2070279999413002702
A clean, concrete autoresearch run: 4 LLM agents ran 108 LoRA fine-tuning experiments over a long weekend on gpt-oss-20b against HotpotQA, raising accuracy by 59 percentage points with context extended to 16K. The human's only job was overseeing and cutting dead ends. It's exactly the Karpathy-style loop made real: overnight, unattended, with a measurable jump at the end.
💡 #3
@willbork_
https://x.com/willbork_/status/2070228824797979129
A sharp comparison from the AutoScientists team: against autoresearch on GPT nanochat training optimization, AutoScientists reached the same validation bits-per-byte with fewer experiments. The harder test started from an already-strong champion. There the single-agent autoresearch loop saturated with 0 accepted improvements over 100 experiments, while AutoScientists kept making progress with 7 accepted improvements over 93. A useful data point on where naive single-agent loops plateau and what beats them.
💡 #4
@anmorgan2414
https://x.com/anmorgan2414/status/2070252560947007796
Hands-on AutoResearch in the wild: in WecoAI's "Cracking OpenAI's Parameter Golf" workshop, the task is to train the best language model in 16MB (int8 + zlib), 10 minutes on 8xH100s, scored on bits-per-byte. Weco's autoresearch agent outperformed 1,000+ human participants, and the workshop has you run the agent yourself. A concrete, reproducible case of an autonomous research loop beating a large field of humans on a constrained optimization problem.
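For readers new to the two constraints in this task, here is a rough sketch of how they can be computed. It is not the workshop's scoring code: the function names are made up for illustration, the loss is assumed to be a summed negative log-likelihood in nats, and reading "16MB" as 16,000,000 bytes is an assumption.

```python
import math
import zlib


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) over a text
    into bits per byte of that text."""
    return total_nll_nats / (math.log(2) * total_bytes)


def compressed_size(int8_weights: bytes) -> int:
    """Size in bytes of int8 weights after zlib compression."""
    return len(zlib.compress(int8_weights, level=9))


# Placeholder artifact: one million zero-valued int8 weights.
weights = bytes(1_000_000)
size = compressed_size(weights)
fits = size <= 16_000_000  # assumed reading of the "16MB" limit

# A model that spends exactly one bit per byte on an 8-byte text.
bpb = bits_per_byte(total_nll_nats=8 * math.log(2), total_bytes=8)
print(size, fits, bpb)
```

Bits-per-byte normalizes by raw text length rather than token count, which is what lets models with different tokenizers be compared on one scale.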
💡 #5
@maksym_andr
https://x.com/maksym_andr/status/2070141465674604984
A meaty update to the Claudini paper (agents autonomously improving jailbreak algorithms): stronger results (claude_v100-oss hits an 80% attack success rate, or ASR, on GPT-OSS-Safeguard-20B), plus a key ablation finding: for autoresearch loops, the context you give the agent really matters (feeding all GCG variants beats feeding GCG only). Most strikingly, when the experiments were repeated across models, Kimi-K2.6 turned out to be the best agent for the task, with a collaborator noting it "did everything right." Another data point that Chinese open-weight models are very strong for autoresearch-style loops.
💡 #6
@askalphaxiv
https://x.com/askalphaxiv/status/2070159537093677095
A practical "run it yourself" moment: alphaXiv showed an open model that can be trusted on difficult research tasks, and you can run autoresearch yourself with GLM 5.2 just by changing 'arxiv' to 'autoarxiv' in any arXiv URL. It frames autoresearch not as a lab-only capability but as a URL trick anyone can trigger over a paper. The accessibility angle (an open model plus a one-word URL change) is the real story here.
💡 #7
@heyDhavall
https://x.com/heyDhavall/status/2070029911520358830
Autoresearch at scale via Atlas Graph: give Atlas an arXiv ID and let your agent turn that paper into a working graph; once the result is reproduced, the graph becomes a starting point. The agent can take one node and keep going (extend an experiment, test a direction the authors gave up on, optimize for a different metric, or branch into a new idea), and every run feeds back into the same graph. So one paper stops being an endpoint and becomes a launchpad for hundreds of follow-up experiments.
💡 #8
@rudzinskimaciej
https://x.com/rudzinskimaciej/status/2070209038269182376
A candid first-hand autoresearch experiment: he spent 3 days on a small nanogpt-scale autoresearch setup and found DeepSeek underperforms even Qwen Max 3.7, with a GLM comparison pending. His sharp observation: without a little help from Opus, these autoresearch runs often make little sense, because neither DeepSeek nor Qwen reliably chooses values or handles implementation details. They are broader and generalize better on ideas, yet they fumble tiny details and throw away the gains. His ask: at least one open model with attention good enough to attend to 100K context sensibly, like Opus/GPT.
💡 #9
@teortaxesTex
https://x.com/teortaxesTex/status/2070172894328463610
A widely-shared framing: this is the beginning of serious autoresearch that can't be withdrawn with the push of a button by some responsible AI safety committee. No matter what happens next, we already have early AI scientist assistants, and recursive self-improvement can be local, even if slow. A concise statement of why the autoresearch genie doesn't go back in the bottle.
💡 #10
@ChrisHayduk
https://x.com/ChrisHayduk/status/2070183659383214359
A thoughtful methodology argument for autoresearch in biology: biology is significantly more complex than text. It's a story of exceptions with few true generalizations, so there are almost limitless subproblems in Bio ML. Because of that, solving biology should need at least as many researchers and as much compute as solving text, which is exactly why he's more bullish on autoresearch in the bio domain than in LLMs. The field is intelligence-constrained in a way text isn't, making it the natural home for autonomous research loops.
💡 #11
@BiologyAIDaily
https://x.com/BiologyAIDaily/status/2070161121663967346
A deep, careful paper on closed-loop Auto Research for molecular property prediction, and crucially on reliability: when LLM agents adaptively edit features, model code, and training data to improve validation, which improvements actually generalize to an unseen held-out test? The method separates discovery from certification: each validation-selected config is frozen, retrained from scratch, and scored once on an isolated test split, with a file-level ablation lock so each trial edits only one axis (Feature, Model, or Data). Across 36 endpoints it shows positive held-out gains, but also exposes two non-transfer signatures (selection-variance and distribution-shift), a sober reminder that validation gains aren't automatically real.
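The discovery/certification split described here can be illustrated with a toy example. This is not the paper's code: the shrunken-mean "model", the candidate list, and the `SealedTestSplit` class are hypothetical, chosen only to show the loop reading validation freely while the test split can be scored exactly once.

```python
import random
from statistics import fmean


def fit(train, shrink):
    """Toy 'model' (hypothetical): a shrunken mean of the training targets."""
    return shrink * fmean(train)


def mse(prediction, targets):
    return fmean((t - prediction) ** 2 for t in targets)


class SealedTestSplit:
    """Held-out targets that can be scored exactly once."""

    def __init__(self, targets):
        self._targets = list(targets)
        self._used = False

    def score_once(self, prediction):
        if self._used:
            raise RuntimeError("test split already used")
        self._used = True
        return mse(prediction, self._targets)


rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(300)]
train, val, test = data[:100], data[100:200], SealedTestSplit(data[200:])

# Discovery: the loop may consult validation as often as it likes.
candidates = [0.0, 0.5, 0.9, 1.0, 1.1]
best_shrink = min(candidates, key=lambda s: mse(fit(train, s), val))

# Certification: freeze the config, retrain from scratch, score once.
frozen = best_shrink
final_model = fit(train, frozen)
test_mse = test.score_once(final_model)
print(frozen, test_mse)
```

Making the second `score_once` call raise turns "score once" from a convention into something the agent cannot quietly violate.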
💡 #12
@jaseweston
https://x.com/jaseweston/status/2070117091521204521
A research claim worth tracking: autoresearch that moves the frontier will be about better data (call it Autodata). The key idea is agentic data creation as a way to convert increased inference compute into higher-quality model training, with gains shown on computer-science, legal, and math problems over classical synthetic-data methods. They also show how to meta-optimize the data-scientist agent so it creates even stronger data. A concrete framing of autoresearch as a data-generation loop, not just an experiment-running loop.
💡 #13
@nodescribe89
https://x.com/nodescribe89/status/2070134011251540252
A detailed look at MiMoCode, Xiaomi's MIT-licensed terminal coding agent (0 to 10k stars in two weeks), whose tagline is "models and agents co-evolve." It's a fork of OpenCode with a self-improvement layer: it keeps a persistent MEMORY.md, and two commands feed it. /dream scans recent sessions and pulls durable knowledge into memory while dropping stale entries, and /distill watches for repeated manual steps and packages high-confidence ones into reusable skills, subagents, or commands. There's also a /goal stop-condition checked by a separate judge model, a direct shot at the optimistic "done" that wrecks autonomous runs. A real self-improving loop bolted onto a coding agent.
💡 #14
@nosp321
https://x.com/nosp321/status/2070140491878764588
A non-coding domain loop: building a fully autonomous trading system as loop engineering. The breakdown names the 6 building blocks of any working loop (automation, skill, state file, verifier, worktrees, connectors) and the 5 stages of a trading loop (Data → Signal Generation → Verification → Execution → Risk Monitoring), with a maker-checker separation where a separate agent verifies signals. The self-improving piece: every loss is recorded as a lesson that improves the system, moving you from "I'll run a few tests" to "the system runs while I sleep." A clear template for applying autoresearch loops to markets.
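The five stages and the maker-checker split can be laid out as a paper-trading toy. Nothing here comes from the original post's code: the `TradingLoop` class, both rules, and the position limit are hypothetical and far too simple for real markets; the point is only that the checker applies a rule the maker did not write.

```python
from dataclasses import dataclass, field


@dataclass
class TradingLoop:
    max_position: int = 1
    position: int = 0
    lessons: list = field(default_factory=list)  # one entry per losing trade

    def signal(self, prices):
        """Maker (hypothetical rule): buy when the last price ticked up."""
        return "buy" if prices[-1] > prices[-2] else "hold"

    def verify(self, sig, prices):
        """Checker (independent rule): price must also exceed the window mean."""
        return sig == "buy" and prices[-1] > sum(prices) / len(prices)

    def step(self, prices):
        # Data -> Signal Generation -> Verification -> Execution -> Risk Monitoring
        sig = self.signal(prices)
        if not self.verify(sig, prices):
            return "skipped"
        if self.position >= self.max_position:  # risk limit
            return "blocked"
        self.position += 1  # paper execution only
        return "executed"

    def close(self, entry_price, exit_price):
        pnl = exit_price - entry_price
        self.position -= 1
        if pnl < 0:  # every loss becomes a lesson
            self.lessons.append(
                f"loss {pnl:.2f}: entered at {entry_price}, exited at {exit_price}"
            )
        return pnl


loop = TradingLoop()
r1 = loop.step([100, 101, 103])  # maker and checker agree
r2 = loop.step([100, 101, 103])  # position limit already reached
r3 = loop.step([103, 102, 101])  # no signal
pnl = loop.close(103, 101)
print(r1, r2, r3, pnl, loop.lessons)
```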
💡 #15
@IBuzovskyi
https://x.com/IBuzovskyi/status/2070067409130537316
A concrete self-improving knowledge system: Hermes Agent ships with an LLM Wiki skill (Karpathy's pattern) that compiles knowledge once and keeps it current instead of rediscovering it per query like RAG. Feed it a source and it writes a structured markdown page, auto-links to related pages, flags contradictions, and updates affected pages, so the graph grows denser with every entry. You automate the growth with cron jobs ("every day ingest today's sessions," "every week check arXiv for new papers"), so it goes from 50 scattered entries in month one to 1,000+ cross-referenced entries by month six. The agent gets smarter because the knowledge base got smarter, overnight, on its own.
💡 #16
@0xbelorix
https://x.com/0xbelorix/status/2070124935532445932
A clear explanation of what "self-improving" actually means, with a real result: Harvey turned on Anthropic's new Dreaming feature and watched its legal agents hit 6x task-completion rates. Dreaming is a scheduled job that runs between agent sessions: it reads up to 100 past transcripts, merges duplicates, replaces stale notes, and writes a clean memory store for the next run. Anthropic's own framing: the model isn't changing, no weights, no fine-tuning; what improves is the curated set of plain-text notes the agent reads at the start of every job. Not a smarter model, but a sharper loop wrapped around the same model.
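To make the "merge duplicates, replace stale notes" step concrete, here is a minimal consolidation pass. It is an illustration of the idea only and not Anthropic's implementation: the `Note` structure, the topic keys, and the "newest note per topic wins" rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    topic: str
    text: str
    session: int  # higher means more recent


def consolidate(transcripts, max_transcripts=100):
    """Keep the newest note per topic from the most recent transcripts."""
    recent = transcripts[-max_transcripts:]
    memory = {}
    for notes in recent:
        for note in notes:
            current = memory.get(note.topic)
            # Duplicates collapse onto one key; a newer note replaces a stale one.
            if current is None or note.session >= current.session:
                memory[note.topic] = note
    return sorted(memory.values(), key=lambda n: n.topic)


transcripts = [
    [Note("filing deadline", "Due on the 15th", 1)],
    [Note("filing deadline", "Due on the 15th", 2),
     Note("citation style", "Use Bluebook", 2)],
    [Note("filing deadline", "Moved to the 20th", 3)],
]
memory = consolidate(transcripts)
for note in memory:
    print(note.topic, "->", note.text)
```

The model that reads `memory` at the start of the next job is unchanged; only the notes it starts from are cleaner.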
💡 #17
@luckeyfaraday
https://x.com/luckeyfaraday/status/2070130307903246428
On a self-evolving-agents paper, he makes the case that the Build → Reflect → Curate → Reuse cycle is exactly why structure beats one-off prompting: the agent actually gets better over time instead of resetting every session. He also open-sourced athena-loops as a practical harness for running these structured, self-improving loops with proper memory, verification, and orchestration. A real tool for anyone wanting to tinker with the loop pattern rather than just read about it.
💡 #18
@SciTechera
https://x.com/SciTechera/status/2070197654144135241
Perplexity launched Brain, a self-improving memory system for its Computer AI agent that learns from completed tasks and workflows. Instead of just storing preferences, Brain logs user actions into a Context Graph and builds a personal LLM Wiki, so the agent starts each new task with deeper understanding of the user's projects, sources, and past decisions. Perplexity says Brain improved answer accuracy by 25% and recall by 16% in internal testing. Another major product converging on the same idea: agents that learn from experience, not just remember.
💡 #19
@VenelinVidenov
https://x.com/VenelinVidenov/status/2070158968006082785
A striking sovereign-AI setup: the entire AI layer of his company runs on 4x GB10 with zero frontier API bills and nothing leaving his network. Each box runs one brain: a 30B MoE brand brain with 8 hot-swap LoRAs, a 36B agent conductor, a 27B long-context model, plus media and training. The number that made him stop paying for inference: 1,432 tok/s at 256 concurrent on a single node, and when he cut over to local, ~57 background agents that had been silently dying on API limits all came back to life. On top sits a persistent agent that conducts the whole fleet 24/7, with a team of self-improving workers coming next.
💡 #20
@dragosroua
https://x.com/dragosroua/status/2070013949245055226
A grounded first-hand agent loop: he has 10 apps in the App Store, each with a settings section cross-promoting the other 9, and he wanted 10 distinct promo codes per app to track redemptions from Settings. Total prompting to build the loop: 15 minutes; agent work: 35 minutes including versioning, archiving, and uploading to the App Store; generating the codes in parallel: 25 minutes. Run sequentially that is ~1h15m; in practice it took ~50 minutes because he generated the codes while the agent worked, saving an estimated 4-5 hours plus the cognitive load of switching between 10 app contexts.
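The two totals follow directly from the three durations in the post. The only assumption in the sketch below is that code generation overlapped entirely with the agent's work, which is what makes the ~50 minute figure come out.

```python
# Durations in minutes, as reported in the post.
prompting, agent_work, code_generation = 15, 35, 25

# Everything done one after another.
sequential = prompting + agent_work + code_generation

# Assumed overlap: codes were generated while the agent was working,
# so only the longer of the two parallel tasks counts.
overlapped = prompting + max(agent_work, code_generation)

print(sequential, overlapped)
```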
💡 #21
@Stoff81
https://x.com/Stoff81/status/2070056953586593892
A neat personal agent loop he calls "code-gardening": a weekly Fable (Opus for now) audit commits to the repo, then Rowan picks up items off that in downtime and makes PRs, Quinn reviews, and Rowan handles review comments and merges on approval. The result is a self-healing codebase. It's a small, concrete example of wiring named agents into a continuous maintenance loop rather than a one-off task.
💡 #22
@ruima
https://x.com/ruima/status/2070198639004459212
An honest reflection mid-build: about three hours into building his first fairly complex long-running agent loop, he was getting a little frustrated. Then it hit him that he's basically asking it to do an MBA-level research project that would take a well-educated generalist several weeks of full-time work. Meanwhile he could have it working by the weekend while parallel-processing several other smaller projects. A grounded reminder of how easy it is to lose sight of how crazy the new baseline already is.
💡 #23
@dxverm
https://x.com/dxverm/status/2070250719261704523
A useful tool for closing the loop cheaply: Qwen released Qwen-AgentWorld-35B-A3B, a 35B MoE with ~3B active params trained not as a chat model but as a language world model that predicts what an environment returns after an agent takes an action. It covers MCP/tool-calling, search, terminal, software engineering, Android, web, and OS GUI. The value is simulating the environment side of the agent loop: given action history and a new tool, it predicts the outcome, reducing the need to actually execute every action to verify state and burning fewer tokens on full-context reads.
💡 #24
@_philschmid
https://x.com/_philschmid/status/2070177135453434183
A concrete, reproducible agent-loop quickstart: after Gemini 3.5 Flash got native computer use across browser, mobile, and desktop, he put together a guide to control an Android phone. A single script installs the emulator from the terminal, a basic agent loop uses the interactions API with `adb` to control the phone, it connects to remote devices too, and the same pattern works for iOS via simctl. Useful for anyone wanting to actually stand up a real device-control loop rather than watch a demo.
💡 #25
@DMMeacham
https://x.com/DMMeacham/status/2069970883804815692
A clear-eyed explainer on why loops change who's valuable: a prompt is you typing one instruction and reading the answer, so you are the loop; an agent loop hands that whole cycle to the machine, which pulls in work, acts with real tools, checks its own result, and goes again with nobody in the chair. That's why serious users run dozens or hundreds at once, clearing tickets overnight. But a machine grading its own homework gives itself an A every time, so the bottleneck quietly moves from doing the work to catching the machine when it's confidently wrong, and that's true for any kind of work, not just coding.
💡 #26
@marfinxx
https://x.com/marfinxx/status/2070086349193883785
A concrete local-agent-loop setup: an Apple developer's WWDC session shows the entire agentic loop running locally on Apple silicon, where the setup reasons, calls tools, and decides next steps with no cloud APIs. Hardware performance hinges on memory bandwidth: a Mac Mini M4 runs Gemma 27B headlessly via unified memory, a GMKtec EVO X2 allocates 96GB VRAM in BIOS to load Llama 70B, an NVIDIA DGX Spark does FP4 enterprise workloads. MLX accelerates the local execution loop, dropping per-token cost to zero and keeping your database offline against prompt-injection attacks. The privacy-and-cost case for running the loop on hardware you own.
💡 #27
@dunik_7
https://x.com/dunik_7/status/2070161645897212375
A useful map of the field: a 55-page survey, pulled together by 15 researchers, of every known way an AI agent can rewrite itself after you've shipped it. The problem it names is that almost every agent today is frozen: you hand-craft the config once and it never improves on its own. Self-evolving agents flip that into one loop: Input → Agent acts → Environment pushes back → Optimizer rewrites the agent → Repeat, and it lays out every part the agent learns to rewrite (the prompt, the memory, the tools, the workflow). The thesis: the static, hand-configured agent is already the old way.
💡 #28
@TeksCreate
https://x.com/TeksCreate/status/2070106086669992195
A concrete offline agentic-loop tool in security: METATRON is an AI-powered penetration-testing assistant that runs 100% offline, with no cloud and no API keys. You give it a target IP or domain, it runs real recon tools (nmap, whois, whatweb, curl, dig, nikto), feeds everything to a local Qwen 3.5 via Ollama, and analyzes vulnerabilities, suggests exploits, and recommends fixes. The interesting part is the loop: the AI can request more tool runs mid-analysis based on what it discovers, with full scan history saved to MariaDB. A clean picture of what local-first, self-driving security tooling looks like.
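The "request more tool runs mid-analysis" pattern can be shown with stubs. This is not METATRON's code: the `analyze` function stands in for the local model, the two tools return canned strings instead of shelling out to real scanners, the history lives in a list rather than a database, and the target is a reserved documentation address.

```python
def analyze(findings):
    """Stub analyst (hypothetical): stands in for the local model.

    Returns the name of another tool to run, or None when analysis is done.
    """
    if "port_scan" in findings and "web_fingerprint" not in findings:
        return "web_fingerprint"
    return None


# Stub tools returning canned text; a real setup would invoke recon tools.
TOOLS = {
    "port_scan": lambda target: f"{target}: 80/tcp open",
    "web_fingerprint": lambda target: f"{target}: example web server",
}


def run(target, first_tool="port_scan", max_tool_runs=5):
    findings, history = {}, []
    next_tool = first_tool
    # The cap keeps a confused analyst from requesting tools forever.
    while next_tool is not None and len(history) < max_tool_runs:
        findings[next_tool] = TOOLS[next_tool](target)
        history.append(next_tool)      # scan history (kept in memory here)
        next_tool = analyze(findings)  # the analyst may ask for another run
    return findings, history


findings, history = run("192.0.2.10")
print(history)
```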
📡 Eco Products Radar

Tools and frameworks mentioned 3+ times across today's autoresearch / loop posts:

alphaXiv (autoarxiv / argithub): the URL-trick autoresearch entry point; change one word in any arXiv or GitHub URL to run an agent over it.
Hermes Agent (Nous Research): the open self-improving agent with an LLM Wiki / second-brain that grows overnight via cron.
Codex: the agent powering the ENPIRE robot-fleet autoresearch and the constant cross-model audit partner.
GLM 5.2: the open model people actually run autoresearch with for cost reasons.
Atlas Graph: turns an arXiv paper into a reproducible, forkable experiment graph for scaled autoresearch.
Karpathy's nanochat / nanogpt: the shared benchmark substrate nearly every autoresearch experiment optimizes against.
Anthropic Dreaming: the between-sessions memory job behind the "self-improving without weight updates" results.
← Previous
Super User Daily: June 27, 2026
Next β†’
Ideas Radar: June 27, 2026
← Back to all articles

Comments

Loading...
>_