Many people approach LLMs with something close to the expectation of a wish-granting machine: say a few words and it will read your mind, conjuring up exactly what you had in your head. But it is not a wish-granting machine. What you actually say is usually only a small fraction of what you have in mind; everything you leave unsaid — the things you assumed were too obvious to mention, even the things you never consciously registered yourself — is invisible to it. So what it hands back is often not what you truly wanted, but the most probable version that the sliver of information you happened to utter can support.
To understand why it is not a wish-granting machine, you first have to see what it actually is.
At bottom, an LLM behaves like a cloud of probability: when there aren’t enough constraints, it contains many possible outputs; once a concrete task begins, the cloud collapses and one particular result emerges. The preferences, materials, context, and constraints a user supplies narrow the range of that collapse, making the result more precise and closer to the real intent.
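The narrowing described above can be sketched with a toy model. Everything here is an assumption made for illustration: the candidate list, the attribute names, and the choice to treat every remaining candidate as equally likely. A real model works over a vastly larger space that is never enumerated like this.

```python
import math

# Hypothetical interpretations of the bare request "write a sort function".
CANDIDATES = [
    {"lang": "python", "algo": "quicksort", "in_place": True},
    {"lang": "python", "algo": "mergesort", "in_place": False},
    {"lang": "python", "algo": "builtin", "in_place": False},
    {"lang": "go", "algo": "quicksort", "in_place": True},
    {"lang": "go", "algo": "builtin", "in_place": True},
    {"lang": "rust", "algo": "mergesort", "in_place": False},
    {"lang": "rust", "algo": "builtin", "in_place": True},
    {"lang": "c", "algo": "quicksort", "in_place": True},
]


def narrow(candidates, **constraints):
    """Keep only the candidates consistent with every stated constraint."""
    return [c for c in candidates
            if all(c[key] == value for key, value in constraints.items())]


def uncertainty_bits(candidates):
    """Entropy of a uniform distribution over what is still possible."""
    return math.log2(len(candidates)) if candidates else 0.0


for stated in ({}, {"lang": "python"}, {"lang": "python", "in_place": False},
               {"lang": "python", "in_place": False, "algo": "mergesort"}):
    left = narrow(CANDIDATES, **stated)
    print(f"{len(stated)} constraints -> {len(left)} candidates, "
          f"{uncertainty_bits(left):.2f} bits")
```

Each stated constraint removes candidates, so the remaining uncertainty falls from 3 bits with nothing stated to 0 bits once three constraints are given.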
From an information-theoretic point of view, the flow of information falls roughly into three categories: compression, transcription, and completion.
Compression: From More Information to Less
In Shannon’s information theory, information can be compressed. But the “compression” at issue here is closer to summarizing, distilling, and abstracting: semantic, lossy compression rather than a reversible encoding that can perfectly reconstruct the original. Put more directly, going “from more information to less” is lossy: it can preserve the core structure and the main conclusions, but it inevitably loses some of the detail and context, and with them the ability to recover the original. A compressed result can be derived from a large body of information, but it usually cannot be restored to all of that original detail.[1][2]
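The difference between reversible encoding and summarizing can be shown in a few lines. This is only a sketch: the sample text is invented, and `summarize` is a deliberately crude stand-in (keep the first sentence) for what a model does far more intelligently.

```python
import zlib

text = ("The deploy failed at 02:14 UTC. The cause was an expired TLS certificate "
        "on the staging proxy. Rollback took nine minutes.")

# Lossless: the round trip returns the original text exactly.
packed = zlib.compress(text.encode("utf-8"))
restored = zlib.decompress(packed).decode("utf-8")


def summarize(t):
    """Crude lossy compression: keep only the first sentence."""
    return t.split(". ")[0] + "."


summary = summarize(text)

print(restored == text)           # the encoding is reversible
print(summary)                    # the conclusion survives
print("certificate" in summary)   # the detail that explains it does not
```

Nothing in `summary` allows the certificate detail to be reconstructed, which is the sense in which the paragraph above calls this direction unrecoverable.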
LLMs are very good at work in this direction. Give one enough relevant material and it can compress a large block of information into a smaller expression — a summary, an outline, a conclusion, a set of key points, a recommendation. Here the model’s strength is not creation out of thin air; it is helping a human reduce redundancy and extract structure from information that already exists.
Conveying intent is itself lossy, so when you need an LLM to understand material accurately, give it the original text rather than a second-hand paraphrase. A paraphrase has already been through one round of lossy compression; by the time the model processes it, many of the details that could have informed a judgment are already gone.
Transcription: Conversion Between Roughly Equal Amounts of Information
Besides compressing from large to small, there is a kind of conversion in which the amount of information stays roughly the same. Call it transcription. Translating, rewriting, rephrasing, turning speech into prose, recasting the same content in a different style — these all belong to this category.
LLMs are well suited to these tasks too, because the goal is neither to create a great deal of new information nor to discard a great deal of it, but to change the form while keeping the original meaning as intact as possible. The key to transcription is equivalence: the information content should stay about the same, and only the medium, language, structure, or style should change.
Completion: From Less Information to More
The genuinely tricky case is the third one: from less information to more. If you provide only a tiny bit of information yet ask the LLM to do something “big” — write code from a single idea, design a whole product from a one-line requirement, produce a long essay from just a title — it cannot possibly infer all the real details from so little.
This is where the probability machine kicks in. Every part you did not explicitly supply gets filled in by the model according to patterns it learned in training. Sometimes this completion is beneficial: the model may reach for common best practices, sensible structures, well-worn forms of expression. Sometimes it is harmful: the model may misread the goal, fabricate facts, miss a crucial constraint, or treat as a default something that should never have been assumed.
The two faces of completion are creation and hallucination. It used to be that only a human could draw on experience under a sparse prompt, infer the blanks, and expand a small idea into a large artifact; now an LLM can do similar work. When the task permits open-ended generation, this completion shows up as creation; when the task demands fidelity to fact or constraint, an incorrect completion shows up as hallucination. The problem is not that it completes; the problem is that we have to stay aware that completion is not the recovery of fact. It is the selection of one possibility out of a probability space.[3]
Creation and Hallucination: The Difference Is Whether It Can Be Verified
Having said that creation and hallucination are the two faces of completion, I should add one thing: at the moment of generation, the two are actually indistinguishable. When the model emits a passage, it may be an apt creation or it may be an authoritative-sounding hallucination — the difference does not lie in the generation itself, but only afterward, in whether that passage can withstand external verification.
Science has long had the maxim “make bold hypotheses, verify them carefully.” Completion corresponds to the bold hypothesis: offering one possibility where information is missing. Turning a hypothesis into a dependable fact relies on the second half — the careful verification. I suspect the human brain has a similar probability-cloud mechanism: those fleeting thoughts and associations are not reliable in themselves; a memory gets to be treated as a fact about reality precisely because it can be verified repeatedly and from multiple angles, rather than remaining a hallucination.
So hallucination is less a flaw in the capacity to generate than a completion that has not yet been verified. The remedy, accordingly, is not to suppress generation but to pair it with enough verification, which is exactly why the sections that follow keep returning to feedback and verification.[4]
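Pairing generation with verification can be sketched as "propose several completions, keep only those that survive an external check". The three guesses and the reference table below are illustrative assumptions; the leap-year facts themselves follow the Gregorian calendar rule.

```python
# Three plausible completions of the underspecified request "write is_leap_year".
def guess_a(year):
    return year % 4 == 0


def guess_b(year):
    return year % 4 == 0 and year % 100 != 0


def guess_c(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# External reference: facts checked outside the generator.
KNOWN = {1900: False, 2000: True, 2023: False, 2024: True}


def verified(candidates, facts):
    """Names of the candidates that agree with every reference fact."""
    return [fn.__name__ for fn in candidates
            if all(fn(year) == expected for year, expected in facts.items())]


print(verified([guess_a, guess_b, guess_c], KNOWN))
```

All three guesses look equally confident at generation time. Only the check against `KNOWN` separates them, leaving `guess_c` as the one that holds up.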
Model Knowledge: A Compressed Phantom of the World
An LLM is not a blank slate. Training folds a portion of world knowledge into the model, so you don’t have to supply every premise explicitly. It knows a great deal of common sense, patterns, linguistic habits, and technical conventions, and this knowledge participates in completion.
But this kind of knowledge is inherently unstable. It is more like a phantom of humanity’s collective knowledge after compression, akin to the fragments in a human brain that you wouldn’t ordinarily recall but that produce a flicker of déjà vu when you hit a related cue. When the context explicitly mentions the relevant information, this knowledge can be summoned effectively; without an obvious cue, it may not be used correctly, and may even be misassociated with something else.[5]
Letting the Model Ask Questions: Filling the Gap on the Intent Side
So when an LLM cannot do a task well, it is very likely missing the key information the task requires. Some of that information can be retrieved from the internet or from documents; some exists only in the relevant person’s head. We obviously can’t dump everything in our brains — that would be too vast, and it would drift out of sync with reality almost immediately. But we can let the model “distill” us in reverse: let the LLM ask questions, and through that back-and-forth quickly narrow the space of possibilities, transferring intent and knowledge in the process.
The real difficulty here is “not knowing what you don’t know”: much of the crucial background seems so obvious to the person involved that it never occurs to them to say it, and the model has no way of telling which piece it is missing. This blind spot exists for both humans and LLMs; it’s just that humans have usually already absorbed enough background that they hit it less often, while the context handed to an LLM is typically far thinner, so its blind spot is larger. For exactly this reason, rather than hoping to say everything in one shot, it is better to encourage the model to ask — letting the party that lacks information point to the gap is one of the few ways to break the deadlock.
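Why a good question narrows things so quickly can be sketched with a toy heuristic: ask about whichever attribute the possible readings disagree on most. The readings, attribute names, and the entropy-based scoring are all assumptions for illustration; a model does not literally compute this, but the effect of a well-chosen question is similar.

```python
import math
from collections import Counter

# Hypothetical readings of the one-line request "build me a login page".
READINGS = [
    {"platform": "web", "auth": "password", "dark_mode": False},
    {"platform": "web", "auth": "oauth", "dark_mode": False},
    {"platform": "web", "auth": "password", "dark_mode": False},
    {"platform": "mobile", "auth": "oauth", "dark_mode": False},
]


def answer_entropy(readings, attribute):
    """How unpredictable the user's answer about this attribute is, in bits."""
    counts = Counter(r[attribute] for r in readings)
    total = len(readings)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def best_question(readings, attributes):
    """Ask about the attribute whose answer would remove the most uncertainty."""
    return max(attributes, key=lambda a: answer_entropy(readings, a))


print(best_question(READINGS, ["platform", "auth", "dark_mode"]))
```

Asking about `dark_mode` would teach nothing because every reading already agrees, while asking about `auth` splits the readings in half.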
Search Tools: Injecting Reality into the Context
Beyond having the model distill the human, there is another crucial supplement: wiring the outside world directly into the context. Tools like Exa (search and web-extraction tools built for AI agents) are essentially injecting reality into the LLM. They can search the web in real time, fetch documents, extract passages relevant to the question, and even feed the salient content to the model in a relatively token-efficient form. This way the model need not rely solely on the stale knowledge compressed into its weights at training time, nor wait for the user to manually assemble all the materials.[6][7]
This matters especially for correcting hallucination. As noted, a hallucination is a completion that hasn’t been verified; what a search tool provides is exactly the external reference needed for verification. When the model is guessing about some API, news item, company, project status, version change, or factual detail, retrieval can rewrite “I think it’s probably like this” into “the current material says it’s like this.” It doesn’t make the model omniscient, but it lets the probability cloud be narrowed again by fresh signals from the real world, reducing the chances of the model forcing stale knowledge or a similar-looking pattern onto reality.[8][9]
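The shift from "I think" to "the current material says" comes down to placing retrieved text in front of the question. The sketch below is not the API of Exa or any real search tool: `fake_search`, `CORPUS`, and the library name `widgetlib` are all made up so the example can run on its own.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str
    text: str


# Hypothetical stand-in for a real search tool; "widgetlib" is a made-up library.
CORPUS = [
    Passage("docs/changelog", "widgetlib 3.0 renamed connect() to open_session()"),
    Passage("docs/faq", "billing questions go to support"),
]


def fake_search(keywords):
    """Return passages that mention any keyword (toy substring match)."""
    terms = [k.lower() for k in keywords]
    return [p for p in CORPUS if any(t in p.text.lower() for t in terms)]


def build_prompt(question, passages):
    """Put retrieved text ahead of the question so the answer can cite it."""
    lines = ["Answer from the sources below. If they do not cover the question, say so."]
    lines += [f"[{i}] {p.source}: {p.text}" for i, p in enumerate(passages, start=1)]
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_prompt("How do I open a connection in widgetlib 3.0?",
                      fake_search(["widgetlib", "connect"]))
print(prompt)
```

The numbered source lines give the model something to cite and give the reader something to check, which is the verification role the paragraph above assigns to search.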
It also makes up for the inefficiency of dumping one’s brain. What’s in the user’s head matters, of course, but many of the facts a task needs aren’t in the user’s head, and even those that are shouldn’t have to be relayed word for word by the user. Making a person manually copy over all the background, documents, web pages, version differences, and competitor material is itself another high-loss information chain; a search tool lets the model fetch the original material on demand, then compress, cite, and reason over it in context. The human only needs to handle the higher-level intent, judgment, and trade-offs, rather than serving as an inefficient courier of information.[10]
So search is not some accessory feature bolted onto the side of an LLM; it is part of the feedback loop — it gives the model senses on the reality side. Questions and clarification address the information gap on the “user intent” side; tools like Exa address the gap on the “external state” side. Put together, they come closer to a usable cognitive system: one that knows both what the human wants and what the world currently looks like.
Why Unbounded /goal Is Unreliable
Using a /goal command to let an LLM act indefinitely until it reaches a target is not, in my view, categorically unworkable — it’s unreliable in the absence of continuous external feedback. The reason is that the knowledge handed to the model at the outset is usually incomplete. An LLM has to fill in blanks throughout the work; the more it fills in, the more its direction tends to drift.
There is an important exception here: if /goal can obtain a steady stream of external feedback while it runs, the picture changes. The human brain doesn’t act on the initial goal description alone; through limbs, senses, and bodily feeling it continuously takes in signals, and these inputs rapidly narrow the probabilities, letting reality keep correcting the course of action. If an LLM agent had a similar feedback loop (environmental observation, test results, user confirmation, tool return values, error signals, staged verification), then running for a long time might not be a problem.[11][12]
Imagine the most extreme case in reverse: an LLM stripped of every tool and feedback channel is like a brain in a vat severed from its body — cut off from limbs and senses, deprived of every source of input, left with nothing but the thoughts circling in its own head. Such a brain likewise cannot tell hallucination from reality, because it no longer has any external signal to check against and calibrate by. The reason human cognition doesn’t run off the rails is precisely the unending stream of input from the outside; cut that chain and even the cleverest brain can only spin in its own echoes.
So what matters is not just the length of the run, but whether the feedback loop is dense, real, and timely enough. Without feedback, it may still get something done, but that is more like a result randomly dredged up from an ocean of probability than a stable, reproducible, dependable process of execution. The bigger the goal, the less context, and the thinner the feedback, the larger the random-dredging component becomes.
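The role of feedback density can be sketched with a deliberately simple model. It assumes, purely for illustration, that every unverified step adds a fixed amount of drift and that each check resets the drift to zero; real agents drift unevenly and checks rarely correct everything.

```python
def worst_drift(steps, drift_per_step, check_every=None):
    """Largest distance from the intended path seen during the run."""
    error = 0.0
    worst = 0.0
    for step in range(1, steps + 1):
        error += drift_per_step            # one more unverified completion
        worst = max(worst, error)
        if check_every and step % check_every == 0:
            error = 0.0                    # an external signal pulls the run back on course
    return worst


print(worst_drift(100, 0.5))                  # no feedback at all
print(worst_drift(100, 0.5, check_every=50))  # sparse feedback
print(worst_drift(100, 0.5, check_every=10))  # dense feedback
```

In this toy model the worst-case drift is bounded by the gap between checks rather than by the length of the run, which is the point made above: the run can be long as long as the feedback is dense.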
Plan Mode: Align on Purpose First, Then Allow Action
Claude Code’s Plan Mode works precisely because it splits “doing the work” into two stages: first explore and form a plan in read-only mode, then execute after the user confirms. Claude Code’s official docs describe the planning stage as restricted to read-only tools: the model can read files, understand the current state, and propose an approach, but it cannot directly change code. The user can review the plan, keep asking questions, and request revisions until the plan is close enough to their intent, then approve moving to implementation. Claude Code’s Ultraplan takes this further, turning the mechanism into a plan-review flow that can be commented on, revised repeatedly, and run in a chosen location.[13][14]
From an information-theoretic angle, this is not a simple permission toggle but an “intent-alignment gate.” If the initial request is short and the model just starts writing, it is in fact filling in a great deal of missing information; Plan Mode forces it to make that completion explicit first: what it thinks the goal is, what it intends to change, why, and what verification steps it will take. When the user reviews the plan, they are pulling the assumptions hidden in the probability cloud out into the open before the model actually acts, so a human can correct them with their own background knowledge. That way, an incorrect completion does not immediately solidify into code; it first becomes an intermediate artifact that can be discussed, vetoed, or revised.
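The gate itself can be sketched in a few lines. This is not how Claude Code implements Plan Mode; `Plan`, `review`, and `execute` are hypothetical names used only to show the shape of the idea: assumptions are written down, a reviewer can correct them, and nothing runs until that has happened.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    goal: str
    assumptions: list
    steps: list
    approved: bool = False


def review(plan, corrections):
    """Fold the reviewer's corrections into the plan and mark it approved."""
    plan.assumptions = [corrections.get(a, a) for a in plan.assumptions]
    plan.approved = True
    return plan


def execute(plan, run_step):
    """Refuse to act until a human has seen the assumptions."""
    if not plan.approved:
        raise PermissionError("plan not approved: " + "; ".join(plan.assumptions))
    return [run_step(step) for step in plan.steps]


plan = Plan(
    goal="add retry to the HTTP client",
    assumptions=["target is Python 3.8", "no new dependencies"],
    steps=["edit client.py", "run tests"],
)
review(plan, {"target is Python 3.8": "target is Python 3.12"})
print(execute(plan, str.upper))
```

The wrong guess about the Python version is corrected while it is still a line of text, before any step has touched a file.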
This also explains why “plan first, then execute” is more stable than “just execute indefinitely.” The plan is not there to add ceremony; it is there to create a low-cost feedback point. Once code is written, errors enter files, tests, dependencies, and downstream reasoning; an error at the planning stage still lives at the level of language, where it is far cheaper to fix. In essence it is doing a round of compression and transcription before acting: compressing a complex task into a reviewable path, then transcribing the user’s corrections back into execution constraints, so that subsequent completion happens within a narrower, more correct range.
Skill methodologies like Superpowers are a heavier version of the same principle. They require the agent to load a mandatory workflow before relevant tasks: first probe what the user actually wants, distill a spec through conversation, and confirm it with the user in chunks; once the design is confirmed, write a sufficiently detailed implementation plan; during execution it can also pair with TDD, YAGNI, DRY, batched checkpoints, sub-agent review, and so on. Claude Code’s Skills mechanism itself supports this kind of practice: a skill uses SKILL.md to describe when to activate and which instructions to follow, and the model can load the relevant skill automatically based on the task. In other words, a skill amounts to freezing a process for aligning with user intent ahead of time, so the model doesn’t have to decide how to proceed by gut feeling every time.[15][16]
So Superpowers is workable: it institutionalizes the “first figure out what I want” step. But that is also where its problem lies — the detail can be excessive. For large tasks, long stretches of autonomous coding, and collaborative multi-person agent flows, fine-grained plans, test-first development, checkpoints, and code review all provide a denser feedback loop; but for a very small change, it may cram too many secondary conclusions, process constraints, and execution details into the context, adding to the attention burden instead. The heavier the process, the more it resembles bolting an entire corporate bureaucracy onto the model; it can reduce drift, but it can also leave simple tasks over-governed.
So the shared essence of Plan Mode and Superpowers is not some mystical agent trick, but “getting the LLM aligned with your purpose.” Lightweight Plan Mode is well suited as a default gate: expose assumptions first, then let a human approve action. A heavyweight skill workflow suits high-risk, large-scope tasks that require long stretches of autonomy: write the feedback points, the verification criteria, and the behavioral habits into a process. Both are solving the same problem: don’t let the model land its guesswork directly when information is insufficient — turn the guesswork into something reviewable first.
Periodic Check-ins: Re-aligning Mid-Action
Even from a boss’s or manager’s point of view, work is not “assign the task and then walk away.” In a real company, the boss holds regular meetings, looks at progress, listens to reports, confirms the current state, and then, based on the new situation, adds judgment, adjusts priorities, even issues new orders. This is because a task runs into information that wasn’t exposed at the start, and the external environment may change too; if the manager stops engaging entirely, the subordinate can only keep extrapolating from that initial, incomplete instruction, and the deviation naturally grows.
The same goes for an LLM agent. Plan Mode handles pre-action alignment, but during execution you still need to observe what it actually did, what it ran into, and what it plans to do next. This matters all the more because an LLM acts much faster than a human: it can read many files, change a lot of code, and make tool call after tool call in a very short time, which means error accumulates faster too. A human team can sync up after a while; an LLM agent may need denser checkpoints: report status after each batch of changes, request confirmation at each critical fork, and stop to explain its reasoning before each high-risk operation.[17][18]
This “meeting” needn’t be a formal meeting at all. It can be a plan review, a stage summary, a task-list update, a diff review, a test-result report, having the agent restate its understanding first, or having it list the next three steps it plans to take. The form is open; what matters is keeping the model’s actions continuously exposed to human observation, and keeping the human’s fresh judgments flowing back into the context. In other words, continuous alignment is not a one-time prompt but a string of small syncs: the model reports reality, the human corrects the goal, and then the model acts again.
Logs: Wiring Up a Feedback Loop by Hand
I said the feedback loop is the key, but in many situations the environment won’t actively hand feedback to the model, and there you can wire one up by hand. One of the most useful practices I’ve found is logging. When a bug appears, first have the AI add logs at the suspicious spots, then reproduce the problem and feed the logs printed at the moment it fires back to the model verbatim; from there it rules out a batch of guesses, narrows the range, and adds more targeted logs. Round after round, the scope of the problem shrinks bit by bit until the root cause is cornered.[12][19]
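One round of this loop might look like the sketch below, which uses Python's standard `logging` module. The function, the discount codes, and the suspected bug (a code arriving with stray whitespace) are hypothetical; what matters is that inputs and intermediate decisions are logged, so the output can be pasted back to the model unedited.

```python
import logging

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("cart")

RATES = {"SPRING": 10, "VIP": 20}   # percent off; hypothetical discount codes


def apply_discount(total_cents, code):
    # Log the raw input with %r so stray whitespace or wrong types are visible.
    log.debug("apply_discount total_cents=%r code=%r", total_cents, code)
    normalized = code.strip().upper()
    percent = RATES.get(normalized, 0)
    # Log the intermediate decision, not just the final value.
    log.debug("normalized=%r percent=%r", normalized, percent)
    result = total_cents * (100 - percent) // 100
    log.debug("result=%r", result)
    return result


apply_discount(10000, " vip ")
```

If the reported symptom were "VIP customers sometimes pay full price", the `code=%r` line would show at once whether the code arrived malformed or the lookup failed, and the next round of logs could go only where the evidence points.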
In the terms used earlier, each round of logs is a beam of real external signal that keeps correcting the model’s conjectures toward reality, gradually closing in, replacing what could only have been guessed with what was actually observed on the scene. This is exactly the feedback loop demanded back in “Why Unbounded /goal Is Unreliable” — except this one we wired up for it with our own hands, which is also like opening a window for that brain in a vat: the logs are its only sense organ for this particular matter.
This approach also incidentally lifts a burden off the human. Otherwise, locating the problem would depend on a person exhaustively describing every feature of the failure — but the person often doesn’t know which detail is the crucial one (again, “not knowing what you don’t know”), and no matter how much they describe, they may still miss the one fatal clue. Logs let the on-scene data speak for itself: pull in whichever segment you need, no need to enumerate everything at once, and no extra round of information loss from passing through a human’s retelling.
A Programmer’s Job: An Information Pipeline
Bringing the view back to reality, a programmer’s job is really an information pipeline: it starts from a vague requirement the boss or a leader raises in a meeting, passes through rounds of refinement by product, design, and various stakeholders, gradually narrows into a definite spec, gets implemented as code, and is finally confirmed through verification to match the original intent. Every step along this chain is one of the information operations described above — compressing a sprawling discussion into a requirement, completing a terse requirement into an implementation, transcribing a spec into code, and then verifying in reverse whether the result is still faithful to its source.
The reason an LLM still can’t run this pipeline on its own is not that it can’t write code, but that it lacks the ability to participate in the whole flow. Its collection of information is neither complete nor continuous: the LLM isn’t in the room during the meeting, it doesn’t sit in on the private requirement discussions, and the company’s overall goals, the team’s historical baggage, each person’s life experience and tacit judgment — all the background that constitutes the real intent — is almost entirely missing for it. What it usually gets is just a short, much-compressed segment from the tail end of the chain, yet it’s asked to complete the full set of details.
The verification side is missing a piece too. Without unit tests, without machine-readable assertions, whether a lot of software is “right” still ultimately depends on the human eye: whether the UI is misaligned, whether the animation is smooth, whether the interaction behaves as expected. This dynamic, continuous, video-grade understanding is exactly what current LLMs lack. So to this day humans still bear three things — completing the missing context, transcribing intent between forms, and using their eyes for the final acceptance.
This is the conclusion the preceding sections kept returning to, seen in practice: missing information cannot be filled back in by the collapse of a probability cloud. The model can guess a plausible-looking version from patterns in its training, but that is an approximation dredged from an ocean of probability; it cannot substitute for the original data that never entered its field of view. To truly hand the pipeline over to it, what’s missing is not stronger generation, but wiring it into the place where information is produced, and giving it eyes that can do dynamic verification.
Information Error Accumulates Along the Chain
The pipeline hides another hazard: every information operation is lossy, and the loss accumulates along the chain. A little context dropped during compression, a slight shift of tone during transcription, a wrong default guessed during completion: each step looks “basically fine” on its own, but string a dozen-plus links together and the result can end up far from the original intent. Information theory calls this the data processing inequality: no later step in a chain can recover information about the source that an earlier step has already discarded, so fidelity to the original intent can hold steady or fall, but never rise.[20]
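The accumulation can be written down compactly. The sketch below states the standard data processing inequality for a three-link chain, followed by a toy multiplicative model. The symbols are illustrative assumptions: X stands for the original intent, Y and Z for successive retellings, and r for a per-link retention rate that real pipelines do not actually hold constant.

```latex
% If each retelling depends on the intent only through the previous retelling,
% the chain is Markov, and mutual information with the source cannot increase:
X \to Y \to Z \quad\Longrightarrow\quad I(X;Z) \le I(X;Y)

% Toy model (assumption): every link independently keeps a fraction r
% of the relevant detail, so after n links the retained fraction is r^n.
r = 0.95,\; n = 12 \quad\Longrightarrow\quad r^{n} = 0.95^{12} \approx 0.54
```

Under that toy assumption, twelve links that each look 95% faithful leave only about half of the relevant detail intact.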
This progressive distortion is not unique to LLMs. The XY problem that recurs in software development — the asker shows up with a half-derived solution of their own while hiding the real original requirement — is at bottom the same kind of information loss.
The way to deal with it is to preserve the original information as much as possible. Since an LLM can already compress on demand with ease, downstream conclusions at each level should be derived temporarily when needed, rather than persisted and circulated everywhere as facts. In other words, we need a dedicated place to store the original intent as the single source of truth; any secondary conclusion can be re-derived from it and can also be traced back to it. That way, even if one derivation goes wrong, it won’t pollute the source, much less let the error amplify down the chain.
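One possible shape for such a store is sketched below. `IntentStore` and `Derived` are hypothetical names, and the in-memory dictionary stands in for whatever durable storage a real system would use. The idea is only that each derived conclusion records which source and which version it came from, so it can be traced back and flagged when the source changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derived:
    text: str
    source_id: str
    source_version: int


class IntentStore:
    """Keeps original statements; conclusions are derived on demand, never stored here."""

    def __init__(self):
        self._docs = {}   # source_id -> (version, original text)

    def put(self, source_id, text):
        version = self._docs.get(source_id, (0, ""))[0] + 1
        self._docs[source_id] = (version, text)

    def derive(self, source_id, fn):
        """Compute a secondary conclusion and tag it with its provenance."""
        version, text = self._docs[source_id]
        return Derived(fn(text), source_id, version)

    def trace(self, derived):
        """Return the current original text a conclusion points back to."""
        return self._docs[derived.source_id][1]

    def is_stale(self, derived):
        """True when the source has changed since the conclusion was derived."""
        return self._docs[derived.source_id][0] != derived.source_version


store = IntentStore()
store.put("req-1", "Export must include refunds. Finance reads it monthly.")
headline = store.derive("req-1", lambda t: t.split(". ")[0])
store.put("req-1", "Export must exclude refunds.")
print(headline.text, "| stale:", store.is_stale(headline))
```

Because the conclusion is never written back into the store, a bad derivation cannot overwrite the source, and a changed source immediately marks older conclusions as needing to be re-derived.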
Bureaucracy: Built to Solve Information Overload, but It Manufactures Information Loss
The reason human organizations develop multi-level bureaucracies is not only a lust for power; there is a very practical information-processing reason. The number of people one person can directly manage is limited: they cannot simultaneously grasp the states, details, emotions, abilities, and progress of dozens or hundreds of people. So the organization must layer: a large amount of concrete information is first compressed at the bottom, then summarized, judged, and filtered by the middle, and finally turned into a handful of summaries and decision options that a few people at the top can process. A bureaucratic hierarchy is, in essence, an information-compression system humans invented to cope with insufficient cognitive bandwidth.
But the cost of this system is just as obvious. Every extra level adds another round of compression, transcription, and completion; every level drops some of the original context and adds its own judgment, preferences, and interests. Orders in a centralized state getting more and more distorted as they pass down level by level are the extreme example of this mechanism: each level derives once more on top of the previous level’s secondary and tertiary conclusions, drifting ever further from the original intent. A subordinate may report the good and hide the bad to avoid blame; the middle may stress certain problems to win resources; the top then keeps deciding based on already-deformed summaries. So the information chain is not merely lossy; it is not neutral either, because it mixes in each node’s own objective function. By the end, many layers of filters may separate the original reality from the version that finally reaches the decision-making level.
This is especially worth watching out for with LLM teams. If you organize multiple agents into a human-company-like hierarchy (one agent reporting to another, then summarized by a higher agent, and only then passed to the user or a decision agent), it is all too easy to replay the same information loss. An LLM’s compression ability is strong, but so is its completion ability; once a middle layer fills things in on its own to make a summary smoother and more conclusion-like, the error gets packaged as a definite judgment and keeps propagating upward.[21][22]
So an LLM team needs more direct ways of conveying information. It’s not that there can be no division of labor or hierarchy at all, but that key facts, original evidence, user intent, tool return values, test results, and error logs should, as much as possible, be traceable directly by the final decision-maker, not seen only through the verbal summaries of multiple agent levels. Middle agents can be responsible for organizing, compressing, and proposing; they should not become the sole channel for information. The ideal state is: every conclusion can be clicked back to its original material, every summary can be expanded into a chain of evidence, and every decision can return to the user’s original purpose. Only then can you exploit the LLM’s capacity for parallel collaboration without also importing the information loss and interest-driven distortion of human bureaucracy.[20]
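A minimal version of "every conclusion can be clicked back to its original material" is a check that each claim in a summary cites evidence that actually exists. The evidence IDs, the claims, and the dictionary layout below are hypothetical, and the check only confirms that a pointer resolves, not that the evidence really supports the claim.

```python
# Hypothetical raw material the final decision-maker can open directly.
EVIDENCE = {
    "log-17": "TimeoutError in payment_client after 30s",
    "test-4": "test_refund passed",
}

# A middle agent's summary, one entry per claim.
CLAIMS = [
    {"text": "Payments time out after 30 seconds", "evidence": ["log-17"]},
    {"text": "The refund test passes", "evidence": ["test-4"]},
    {"text": "The database is the bottleneck", "evidence": []},
    {"text": "Retries are already enabled", "evidence": ["log-99"]},
]


def unsupported(claims, evidence):
    """Claims with no citation, or with a citation that points at nothing."""
    return [c["text"] for c in claims
            if not c["evidence"] or any(ref not in evidence for ref in c["evidence"])]


for text in unsupported(CLAIMS, EVIDENCE):
    print("needs evidence:", text)
```

The two flagged claims are exactly the kind of smooth, conclusion-like filler a middle layer can add on its own; surfacing them keeps the summary from becoming the only channel.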
A Principle for Use: Provide Plenty of Information, but Keep It Relevant
What we can do is give the LLM as much relevant information as possible while consciously controlling its quality. With too little information, the model has to complete a great deal on its own; with a lot of irrelevant information, the model gets drowned in noise, its attention is spread thin, and it ends up unable to grasp the point.[23]
Concretely, when providing information, try to lay out the full story of a matter, roughly following a 5W1H frame: what is to be done (What), why (Why), for whom, in what scenario, at what time (Who / Where / When), and how it is expected to be done (How). Together these dimensions form the task’s constraints; the more completely they are spelled out, the smaller the range the probability cloud can collapse to, and the fewer blanks left for the model to fill in on its own.
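The 5W1H frame can be turned into a small checklist that shows which dimensions are still blank before a request is sent. `Brief` and its field names are hypothetical, and the rendered layout is just one arbitrary choice.

```python
from dataclasses import dataclass, fields


@dataclass
class Brief:
    what: str = ""
    why: str = ""
    who: str = ""
    where: str = ""
    when: str = ""
    how: str = ""

    def missing(self):
        """Dimensions left blank, which the model would have to guess."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        """Plain-text block of the dimensions that were actually stated."""
        return "\n".join(f"{f.name.capitalize()}: {getattr(self, f.name)}"
                         for f in fields(self) if getattr(self, f.name).strip())


brief = Brief(what="Add CSV export to the reports page",
              why="Finance needs monthly figures in a spreadsheet")
print(brief.render())
print("Still unstated:", ", ".join(brief.missing()))
```

Each name in the "still unstated" list is a blank the model would otherwise fill with its own default, so the list doubles as a prompt for what to say next or what to let the model ask about.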
The same goes for correction. When an LLM gets something wrong, just telling it “what it should have done” is often not enough — that only hands it an isolated downstream conclusion. The better move is to make clear why it should be done that way, why the original approach doesn’t work, and which principle exactly it violated. Get the principle across thoroughly, and the model can generalize to new situations that are similar but not identical, instead of mechanically memorizing one disconnected special case after another.
An LLM does not need to know everything at all times. Neither a human nor an LLM is capable of remembering and processing infinitely much background at once. This is exactly the value of secondary and tertiary conclusions: taking an already-converged judgment and using it directly lowers the human’s cognitive burden and saves the model’s attention, letting both sides focus on more important things. So when I said earlier “don’t persist downstream conclusions,” it was not to forbid them — it was to stress that they must always stay tethered to the original intent.
Put another way, downstream conclusions can be used with confidence, on two conditions: first, that when you need to interrogate one in reverse, the system can trace it correctly all the way back to that single source of truth; and second, that it doesn’t quietly drift from its original meaning over the course of being cited again and again. Traceable and undistorted, a secondary conclusion is a tool for lightening the load; once it loses its root or warps, it degenerates into exactly the kind of progressively accumulating error described earlier.
Post-training: Humans Do It for a Lifetime; LLMs Can’t Yet Afford To
All the methods above rest on “feeding the information back into the context every time.” But there is another path: burning the information directly into the model itself. Human sleep is a bit like this — it consolidates some of the high-frequency content from the day’s context into the cortex itself; an LLM’s post-training has a similar flavor, settling temporary context into long-term weights.
Seen this way, a human is in fact doing post-training their whole life: experience is continually digested, settled, and finally grown into cognition, becoming an instinct that no longer needs to be deliberately invoked. This is exactly where humans differ most from today’s LLMs — we update ourselves every day, while the model is frozen between two trainings. Running a separate post-training for each user every day would cost too much, and the resulting model would quickly drift out of sync with an ever-changing reality.
So until continuous post-training becomes affordable, the more workable method remains: provide high-quality, relevant, traceable information in the current context. Writing down the conventions you use over and over, along with the reasoning behind each correction (for instance, in a spec file like CLAUDE.md that gets auto-loaded every time), is the most cost-effective form of this method: it uses a single human-written document to approximate expensive post-training, letting past experience keep informing future behavior, and letting the probability cloud collapse, every time, within a smaller and more correct range.[7]
In the end, an LLM is not a wish-granting machine but a probability machine. What it hands over is always only the most probable version that the existing information can support: what goes unsaid it can only guess at; an unverified guess can settle into hallucination. So rather than making wishes at it, do the two things that run through this whole essay — feed it enough relevant information, and wire up the feedback chain. The former determines how small a range it guesses from; the latter determines whether its guesses can be continuously corrected by reality.
Footnotes
1. Claude E. Shannon, “A Mathematical Theory of Communication”, 1948. https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
2. Inferara, “The Fundamental Architecture of LLMs: A Perspective Through Information Theory and Lossy Compression”. https://inferara.com/blog/llm-information-theory-lossy-compression/
3. Gerus Team, “LLM Hallucinations Are Compression Artifacts — And That Changes Everything About How We Build AI Products”. https://dev.to/gerus_team/llm-hallucinations-are-compression-artifacts-and-that-changes-everything-about-how-we-build-ai-3gae
4. Oxford Applied and Theoretical Machine Learning Group, “Detecting hallucinations in large language models using semantic entropy”. https://oatml.cs.ox.ac.uk/blog/2024/06/19/detecting_hallucinations_2024.html
5. Yue Zhang et al., “Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models”. https://arxiv.org/abs/2309.01219
6. Exa, “Web Search API, AI Search Engine, & Website Crawler”. https://exa.ai/
7. Anthropic Engineering, “Effective context engineering for AI agents”. https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
8. “Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review”. https://www.mdpi.com/2227-7390/13/5/856
9. Peng et al., “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback”. https://export.arxiv.org/pdf/2302.12813v3.pdf
10. Elasticsearch Labs, “Context engineering vs. prompt engineering”. https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering
11. Manus, “Context Engineering for AI Agents: Lessons from Building Manus”. https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
12. Daniel Demmel, “Feedback loop engineering”. https://www.danieldemmel.me/blog/feedback-loop-engineering
13. Claude Code Docs, “How Claude Code works”. https://code.claude.com/docs/en/how-claude-code-works
14. Claude Code Docs, “Plan in the cloud with ultraplan”. https://code.claude.com/docs/en/ultraplan
15. obra/superpowers, “Superpowers”. https://github.com/obra/superpowers
16. Claude Code Docs, “Extend Claude with skills”. https://code.claude.com/docs/en/skills
17. Arize AI, “How to Build Planning Into Your Agent”. https://arize.com/blog/how-to-build-planning-into-your-agent/
18. Anthropic, “Building Effective Agents”. https://www.anthropic.com/research/building-effective-agents
19. Andrej Karpathy, “LLMs as the kernel process of a new Operating System”. https://x.com/karpathy/status/1707437820045062561
20. “Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems”. https://arxiv.org/html/2505.23352
21. “Communication Overhead in Multi-Agent LLM Systems Grows Quadratically with Agent Count”. https://www.clawrxiv.io/abs/2604.00736
22. “Beyond Tokens: A Unified Framework for Latent Communication in LLM-based Multi-Agent Systems”. https://arxiv.org/html/2606.05711
23. Prompt Engineering Guide, “Context Engineering Guide”. https://www.promptingguide.ai/guides/context-engineering-guide