Thought Leadership · May 2, 2026 · 13 min read

The Copilot illusion: why consumer AI holds up on ten lines and collapses on a hundred pages

Microsoft Copilot, ChatGPT, and Gemini actually run on long-context models (128k+ tokens). But the vendors deliberately throttle them in their consumer interfaces -- the limit is not technical, it is economic. The apparent free access is subsidised by venture capital, and the available tools are cognitively disappointing in the majority of cases where depth is needed, because nobody agreed to pay its price. The consequence: tenders, long sensitive meetings, and cross-cutting document analysis fall outside the legitimate use case of consumer chatbots.

By The TenderGraph team

Third article in the "cognition / doctrine" block. If framing is the most profitable human act in a tender response, the economic throttling of consumer tools is precisely what prevents its execution. This article makes the mechanism visible. It also extends the diagnosis laid out in April 2026 on adoption blockers -- the organisation pictures AI as a chatbot and ends up with a product that is one.

A scene witnessed in large accounts throughout 2025-2026. A sales director shows, in a team meeting, the latest Microsoft Copilot demo. A thirty-minute Teams call has just ended. Three clicks, one prompt: "Summarise this meeting with the actions to take." Fifteen seconds later, a clean report appears, structured, with the names of the speakers and the commitments made. The room is convinced. "This tool is going to change the way we work."

A few weeks later, the same tool is set loose on an object of a different nature. A four-hour board meeting, twenty participants, where -- without a single word saying so head-on -- the trajectory of a subsidiary's divestment was decided. The Copilot report is delivered. It is clean, structured, with the names of the speakers and the commitments made.

It missed what mattered.

The tension between the CFO and the COO, which has shaped every shift on divestment topics for the past eighteen months, appears nowhere. The general counsel conceded on the timeline, in exchange for a victory secured three weeks earlier on scope -- a concession invisible in the report. The president's seemingly innocuous sentence -- "we are going to have to think about this matter differently" -- which for insiders marks the burial of the strategy defended for six months by the strategy director, is rendered as encouragement toward creativity.

The report that comes out exposes the company legally. It is false by omission, and a president signing off on it believing he is validating the reality of his own meeting is performing an act that few lawyers would recommend.

This is the gap between an AI sized for email and an AI sized for the dossier.

The economic throttling of consumer tools

Microsoft Copilot, ChatGPT, Gemini, and most consumer chatbots actually run on long-context models. The underlying versions -- GPT, Claude, Gemini -- have windows of at least 128,000 tokens, sometimes one million. That is already very respectable.

But the end user does not have access to this capacity. The vendors deliberately throttle the models in their consumer interfaces. The engine can technically process 200,000 tokens of input and 64,000 of output; the Copilot product exposes only about 30,000 tokens of input and 4,000 of output. The gap is economic, not technical.

The arithmetic is simple. A Copilot licence at thirty dollars per user per month does not cover the inference cost of intensive long-context usage. If Microsoft let Copilot ingest the raw verbatim of a four-hour meeting and produce a fifty-page report, the inference cost would far exceed the licence's monthly revenue. Throttling protects the product's margin.
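The arithmetic can be made concrete with a back-of-the-envelope sketch. Only the thirty-dollar licence and the two token profiles come from this article; the per-token prices are hypothetical placeholders chosen for illustration, not vendor list prices, and all costs other than inference are ignored.

```python
# Back-of-the-envelope sketch: how many requests does one licence pay for?
# Only the 30-dollar licence and the token profiles come from the article;
# the per-token prices are hypothetical placeholders, not vendor list prices.

LICENCE_USD_PER_MONTH = 30.0
PRICE_IN_USD_PER_MTOK = 15.0    # assumed price per million input tokens
PRICE_OUT_USD_PER_MTOK = 75.0   # assumed price per million output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Inference cost in USD of one request at the assumed prices."""
    total = input_tokens * PRICE_IN_USD_PER_MTOK + output_tokens * PRICE_OUT_USD_PER_MTOK
    return total / 1_000_000


def requests_covered(cost_per_request: float) -> int:
    """Whole requests per month that the licence fee pays for, ignoring all other costs."""
    return int(LICENCE_USD_PER_MONTH // cost_per_request)


throttled = request_cost(30_000, 4_000)    # profile served by the consumer product
full = request_cost(200_000, 64_000)       # profile the engine can technically handle

print(f"throttled: ${throttled:.2f} per request, {requests_covered(throttled)} per month")
print(f"full window: ${full:.2f} per request, {requests_covered(full)} per month")
```

Under these assumed prices, the fee pays for forty throttled requests a month but only three full-window ones. The absolute figures move with the prices; the order-of-magnitude gap between the two profiles is what the throttling protects.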

This logic deserves to be spelled out, because it is massively ignored. The general public today feels that generative AI is free, or nearly so. This apparent free access is partly real -- inference costs have fallen sharply in two years -- and partly subsidised by venture capital, which burns tens of billions a year to drive adoption ahead of profitability. As soon as the user intensifies usage -- long context, extended reasoning, multimodality, agentic workflows -- the real costs reappear. The vendors then have two options: charge at the right level, or throttle the product so usage stays within the plan. For the general public, it is almost always the latter. The result: the available tools are cognitively disappointing in the majority of cases where depth is needed, because nobody agreed to pay its price.

Throttling takes, in practice, the form of a two-stage architecture: RAG, or Retrieval-Augmented Generation. The term, formalised by Lewis et al. (NeurIPS 2020), denotes a setup in which the complete document is not sent to the model. When the user asks a question, a search engine first extracts a few relevant fragments, and the language model generates its answer only from those fragments. Because only those fragments reach the model, RAG divides the inference cost by a factor of twenty to a hundred. For a question whose answer fits in a single paragraph -- "what is the contract's expiry date?", "who is responsible for lot 3?" -- it works well. The answer is correct, fast, inexpensive.
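The two stages can be sketched in a few lines. This is a toy illustration, not the pipeline of any particular product: the keyword-overlap score stands in for a real search engine, the `generate` parameter is a hypothetical stand-in for a language model call, and the four-sentence document is invented.

```python
import re
from typing import Callable

STOPWORDS = {"a", "is", "of", "on", "s", "the", "what", "who"}


def keywords(text: str) -> set[str]:
    """Lowercased words of the text, minus a few stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(question: str, fragments: list[str], k: int = 2) -> list[str]:
    """Stage 1: keep the k fragments sharing the most keywords with the question."""
    q = keywords(question)
    return sorted(fragments, key=lambda f: len(keywords(f) & q), reverse=True)[:k]


def answer(question: str, fragments: list[str],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Stage 2: the model only ever sees the retrieved fragments, never the full document."""
    excerpts = "\n".join(f"- {f}" for f in retrieve(question, fragments, k))
    prompt = f"Answer using only these excerpts:\n{excerpts}\n\nQuestion: {question}"
    return generate(prompt)


# Invented four-fragment document.
document = [
    "The contract expires on 31 December 2027.",
    "Lot 3 covers maintenance of the heating systems.",
    "Payment is due within thirty days of invoicing.",
    "The contract may be renewed once for twelve months.",
]

# Stand-in for a language model call: it simply echoes the prompt it received.
print(answer("What is the contract's expiry date?", document, generate=lambda p: p))
```

The echoed prompt shows what the model would actually receive: two fragments out of four. On a one-paragraph factual question, that is enough.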

RAG nonetheless rests on a hidden assumption: that the answer to any useful question is found in a small number of contiguous fragments. The assumption holds for one-off factual questions. It collapses the moment a question calls for a cross-cutting connection.

Three structural flaws

The loss of inter-document relationships. A complex tender response typically aggregates a CCTP (technical specifications), an RC (consultation rules), a BPU (unit price schedule), a DQE (estimate of quantities), a DPGF (breakdown of the lump-sum price), an AE (deed of commitment), two or three lots, twelve technical appendices, and the technical bid of the previous incumbent obtained through public channels. A typical strategic question from the bid manager -- "where does the scoring formula structurally favour the outgoing incumbent?" -- has no answer in a single fragment. The answer comes from cross-referencing the formula in the RC, the volumes in the DQE, the references required in the CCTP, and the values of the previous contract. RAG, which retrieves paragraphs by semantic similarity to the question, has no way of performing this cross-referencing. It selects a few paragraphs containing the word "weighting", and misses the analysis.

The loss of meta-cognition. A model that sees five fragments retrieved by a search engine cannot know what it does not see. It is unaware that somewhere else in the corpus there is a paragraph that contradicts or qualifies the ones in front of it. It answers confidently on the partial basis it has. Its tone of authority, inherited from RLHF, masks the incompleteness. On a closed question, this is inconsequential. On an open question that demands a comprehensive view, it is disastrous: the answer is at once fluent and insufficient.

The loss of the dynamics of long text. A four-hour meeting is not a half-hour meeting stretched longer. It has distinct phases -- exposition, debate, tacit negotiation, apparent consensus, reversal, political closure -- that reveal themselves only by reading the whole. A participant returning to a point raised two hours earlier lends that point an intensity that is legible only with the complete sequence. A RAG that retrieves, on demand, "the commitments made" presents a flat list. It strips the meeting of its politics -- in the sense that a board meeting is, fundamentally, a political act before it is a deliberative one.
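The first of these flaws can be illustrated with a toy retrieval over fragments tagged by source document. Everything here is hypothetical: the fragments are invented, and a keyword-overlap score stands in for semantic similarity, which is why the question spells out the word "weighting" that a semantic retriever would associate with "scoring formula" on its own.

```python
import re

STOPWORDS = {"a", "and", "at", "does", "for", "of", "per", "s", "the", "where"}

# Invented fragments, each tagged with the document it comes from.
CORPUS = [
    ("RC", "Award criteria weighting: price 40 percent, technical value 60 percent."),
    ("RC", "The technical value weighting includes 20 points for references of comparable size."),
    ("DQE", "Estimated volumes: 12,000 interventions per year across 45 sites."),
    ("CCTP", "Candidates must present three references covering at least 40 sites."),
    ("PREVIOUS", "The current contract covers 45 sites and 11,800 interventions per year."),
]

# Documents that must be cross-referenced to answer the strategic question.
NEEDED = {"RC", "DQE", "CCTP", "PREVIOUS"}


def keywords(text: str) -> set[str]:
    """Lowercased words of the text, minus a few stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def documents_covered(question: str, corpus: list[tuple[str, str]], k: int) -> set[str]:
    """Source documents represented among the k best-scoring fragments."""
    q = keywords(question)
    ranked = sorted(corpus, key=lambda item: len(keywords(item[1]) & q), reverse=True)
    return {doc for doc, _ in ranked[:k]}


question = "Where does the scoring formula weighting favour the outgoing incumbent?"
covered = documents_covered(question, CORPUS, k=2)
print("covered:", sorted(covered))
print("missing:", sorted(NEEDED - covered))
```

Both retrieved fragments come from the RC, because that is where the word "weighting" lives. The volumes, the required references, and the previous contract never reach the model, so the cross-referencing cannot happen.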

What works on small, what breaks on large

The illusion arises from an error of generalisation. The performance of consumer tools on small tasks is real: drafting a two-paragraph email, summarising a five-page note, rephrasing a three-hundred-word brief, brainstorming on a closed question. On these objects, the context window is amply sufficient, RAG is useless (the document fits in a single pass), and the model can allocate all of its inference capacity to the quality of the output.

The trap is that this performance, experienced daily, grounds an implicit conviction: "this tool has mastered written language, so it will master my serious subjects." That is the error. The tool masters written language only on objects the size of its window. As soon as the object exceeds that size, the architecture switches to RAG mode. And the tool loses the capacity for exploration, connection, and meta-cognition that it never truly had, but that it simulated correctly on small formats.

Three professional zones concentrate this switch.

The tender response. A complete dossier weighs between three hundred and one thousand five hundred pages. The strategic question is rarely factual. It looks like "what frame is this client adopting without realising it, and where are my levers of differentiation?". RAG cannot answer it. No fragment contains it; the answer emerges from the cross-referencing.

The minutes of long, sensitive meetings. Board meetings, executive committees, prolonged commercial negotiations, multi-hour oral defences. Everyone who has tried knows the threshold: beyond thirty minutes of transcript, Copilot can no longer produce a detailed report. A quick synthesis remains possible. A fine-grained report, one that traces the commitments and lets each person prepare for the next milestone, is not.

The technical cause is precise, and it is poorly understood: the dominant constraint lies in the output window, more than in the input window. Even if Copilot swallowed the four-hour verbatim, it could only write a report of a few thousand tokens -- a few pages at most. It is forced to compress, and at that rate, the operational detail disappears. The result is short by construction. It suits the executive who skims the subject and wants to grasp it at a glance. It is not enough for whoever has to dig in, nor for whoever has to decide on the basis of that report.
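The compression forced by the output window can be estimated with simple arithmetic. The sketch below rests on stated assumptions: a speaking rate of 150 words per minute, about 1.3 tokens per word, and 500 words per page are rough illustrative values, and the 4,000-token cap is the throttled output figure quoted earlier in this article.

```python
WORDS_PER_MINUTE = 150       # assumed average speaking rate
TOKENS_PER_WORD = 1.3        # assumed ratio for English prose
WORDS_PER_PAGE = 500         # assumed dense page
OUTPUT_CAP_TOKENS = 4_000    # throttled output figure quoted in the article


def transcript_tokens(minutes: int) -> int:
    """Approximate token count of a verbatim transcript of the given length."""
    return round(minutes * WORDS_PER_MINUTE * TOKENS_PER_WORD)


def compression_ratio(minutes: int, output_cap: int = OUTPUT_CAP_TOKENS) -> float:
    """How many transcript tokens each token of the report has to stand for."""
    return transcript_tokens(minutes) / output_cap


def max_report_pages(output_cap: int = OUTPUT_CAP_TOKENS) -> float:
    """Longest report, in pages, that fits in the output cap."""
    return output_cap / TOKENS_PER_WORD / WORDS_PER_PAGE


for minutes in (30, 240):
    print(minutes, "min:", transcript_tokens(minutes), "tokens,",
          "compression x", round(compression_ratio(minutes), 1))
print("max report length:", round(max_report_pages(), 1), "pages")
```

Under these assumptions, a thirty-minute call is compressed by a factor of about 1.5, a four-hour meeting by a factor of about 12, and the report tops out at roughly six pages whatever the length of the meeting.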

To this is added the point already named: the substance of a long meeting is not in the sentences spoken, it is in the sequences, the reversals, the silences. A RAG does not see what is not verbalised. And a short output window could not render what a RAG might, by luck, have spotted.

Cross-cutting document analysis. Portfolio audit, competitive analysis across thirty public documents, acquisition due diligence, risk assessment on a contractual corpus. The added value is born from cross-reading. A RAG that retrieves five fragments per query stops at the apparent summary, without reaching the actual analysis.

The other architecture: long context and exploration

The alternative architecture exists, and it is accessible -- provided one accepts the real cost of long-context inference, rather than seeking the margin in throttling. Anthropic opened the way in 2023 with a hundred-thousand-token window on Claude 2, extended to two hundred thousand in 2024 on Claude 3, then to one million tokens on the Opus versions of the 4 series. The change is architectural more than quantitative: with a million tokens, a complete tender dossier, a four-hour meeting verbatim, or a portfolio of thirty competitive documents fits in a single pass. No RAG. No prior selection. No retrieved fragment. The model sees the whole simultaneously, and can perform the connections that the short architecture does not allow.

The difference is measurable. The "Needle in a Haystack" benchmark offers a simple test: a precise piece of information is inserted into a long corpus, and the model is asked to retrieve it. Long-context models (Claude Opus, Gemini Pro, GPT) reach retrieval rates above 95% on contexts of several hundred thousand tokens. RAG architectures, on the same test, depend entirely on the quality of the retrieval: if the question and the needle do not share the right vocabulary, the needle is not retrieved.
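The vocabulary dependence can be reproduced with a toy version of the test. This sketch is not the benchmark itself: the real test scores a model's answer, whereas the code below only checks whether the needle is present in the context a model would be given. The filler, the needle, the questions, and the keyword-overlap retriever are all invented for illustration.

```python
import re

STOPWORDS = {"at", "is", "of", "the", "what"}

FILLER = "The committee reviewed the quarterly figures without further comment."
NEEDLE = "The penalty ceiling is capped at 8 percent of the annual amount."
EXPECTED = "8 percent"


def build_haystack(n_sentences: int, depth: float) -> list[str]:
    """Return n filler sentences with the needle inserted at a relative depth in [0, 1]."""
    position = int(depth * n_sentences)
    filler = [FILLER] * n_sentences
    return filler[:position] + [NEEDLE] + filler[position:]


def keywords(text: str) -> set[str]:
    """Lowercased words of the text, minus a few stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def lexical_retriever(question: str, sentences: list[str], k: int = 3) -> list[str]:
    """Toy retrieval stage: keep the k sentences sharing the most keywords with the question."""
    q = keywords(question)
    return sorted(sentences, key=lambda s: len(keywords(s) & q), reverse=True)[:k]


def needle_in_context(context: list[str]) -> bool:
    """True when the expected information is present in what the model would see."""
    return any(EXPECTED in sentence for sentence in context)


haystack = build_haystack(1000, depth=0.5)
same_vocabulary = "What is the penalty ceiling?"
other_vocabulary = "What is the maximum late-delivery sanction?"

print("full context:", needle_in_context(haystack))
print("retrieval, same vocabulary:", needle_in_context(lexical_retriever(same_vocabulary, haystack)))
print("retrieval, other vocabulary:", needle_in_context(lexical_retriever(other_vocabulary, haystack)))
```

With the full context, the needle is always in front of the model, and the benchmark then measures whether the model finds it. With retrieval, the second question never gives the model the chance. A semantic retriever narrows this gap without removing the dependence.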

The work of Liu et al. (TACL 2024), "Lost in the Middle: How Language Models Use Long Contexts", documented a nuance: even on technically long-context models, attention declines on the median portions of the document. Performance remains structurally superior to a RAG, but the calibration of long context is not uniform. One more reason to combine long context with protocols of explicit exploration. This is what the agentic architectures increasingly used in professional bid management do: the agent identifies upstream the zones of the corpus that deserve reinforced reading, rather than letting attention dilute across the whole.

The practical test to tell them apart

A simple test makes it possible to distinguish a tool sized for real work from a tool sized for the demo: pose to the tool a question whose answer is in no single document taken in isolation, but emerges from the connection of at least three documents.

On a tender response: "given the timeline imposed by the RC, the minimum headcount required in the CCTP, and the references requested in appendix 4, which candidates were structurally eligible before publication?". No document contains the answer. It emerges from the cross-referencing.

On a board meeting report: "which positions expressed in this meeting contradict those the same participants defended in the two previous meetings?". The answer requires holding three multi-hour corpora simultaneously.

On a competitive audit: "across the thirty public documents analysed, which competitors show a commercial trajectory that signals a strategic repositioning not yet announced?". The answer lodges in the gaps between documents, outside any particular document.

If the tool produces a fluent answer that could not survive an audit, because no document grounds it, it is a RAG tool in the middle of hallucinating. If the tool honestly says "I did not see this information" when it is in the complete corpus, it is a tool whose window is too small. If the tool produces an answer substantiated by the explicit connection of three identified documents, it is a tool sized for real work.
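This three-way reading can be written down as a small decision rule. The sketch below is one possible encoding, with hypothetical names (`ToolAnswer`, `classify`); it assumes the question was built so that the answer does exist in the corpus, and that the evaluator has already noted which documents the tool explicitly cites. It only checks that the cited documents exist, so a human audit of the connection itself remains necessary.

```python
from dataclasses import dataclass, field


@dataclass
class ToolAnswer:
    text: str
    cited_documents: set[str] = field(default_factory=set)
    declined: bool = False   # True when the tool says it did not see the information


def classify(answer: ToolAnswer, corpus: set[str], min_documents: int = 3) -> str:
    """Apply the three-way reading of the test to one answer.

    Assumes the question was built so that the answer does exist in the corpus.
    """
    if answer.declined:
        return "window too small"
    grounded = answer.cited_documents & corpus   # ignore citations of unknown documents
    if len(grounded) >= min_documents:
        return "sized for real work"
    return "fluent but ungrounded"


corpus = {"RC", "CCTP", "Appendix 4", "DQE"}

print(classify(ToolAnswer("Three candidates were eligible."), corpus))
print(classify(ToolAnswer("I did not see this information.", declined=True), corpus))
print(classify(ToolAnswer("Eligibility follows from the timeline, the headcount, and the references.",
                          cited_documents={"RC", "CCTP", "Appendix 4"}), corpus))
```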

Operational consequence

The lesson, for an executive, a bid manager, a general counsel, or a head of strategy, is precise: a good tool is not the same thing as a good use, and the two must be judged separately.

Microsoft Copilot, ChatGPT, Gemini, Claude.ai in its chat interface are excellent tools for tasks whose object fits in the short window their vendor has chosen to serve: email, internal note, quick synthesis, brainstorming, first draft of a short document. On these tasks, their performance is real, their productivity is measurable, their use is legitimate.

On tasks whose object exceeds the window -- a complete tender, a long sensitive meeting, cross-cutting document analysis, due diligence, a complex defence memo -- these tools switch to RAG mode. They lose the capacity for exploration and meta-cognition that would precisely justify using them there. On these tasks, the illusion of performance is more dangerous than the absence of a tool, because it produces deliverables that are fluent, structured, authoritative, and structurally insufficient.

The category error is not neutral. It exposes the company legally. It wastes weeks redoing AI-generated dossiers. And more deeply, it weakens trust in AI applied to real work: the failures of the short window are charged against a reputation that long-context architectures are in the process of earning.

The right tool for the right task. And, in bid management as in strategic steering, the right tool for real dossiers is sized in millions of tokens, not in chats of a few dozen pages.


Two follow-ups extend this diagnosis: first the real cost of inference (how much a dossier handled with a premium model and a serious human loop actually costs, and why the current period is paradoxically the cheapest we will see for a long time), then the question of sovereignty opened by DeepSeek V4 (for the large organisations that can deploy a SOTA-class model on proprietary infrastructure).


Primary sources: Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", NeurIPS 2020. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts", Transactions of the Association for Computational Linguistics (TACL), 2024. Karpukhin et al., "Dense Passage Retrieval for Open-Domain Question Answering", EMNLP 2020. Anthropic, "Introducing 100K Context Windows" (May 2023), "Claude 3 family" (March 2024), "Claude Opus 4 with 1M context" (2025). Bai et al., "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding", arXiv 2308.14508, 2023. Greg Kamradt, "Needle in a Haystack: pressure testing LLMs", 2023.

Tags

#AI #LLM #Copilot #RAG #long context #bid management #AI economics
