Epistemic marking: the human signature that LLMs do not reproduce

First article in a new series devoted to the human cognitive signatures that LLMs do not reproduce. The previous series, on rhetorical figures, closed on aposiopesis: the restraint of commitment through form. Here we descend beneath rhetoric, to the restraint of commitment through probabilistic calibration. Two earlier threads sit in the background: what the article on litotes had already named the "active restraint" against the RLHF incentive toward verbal sprawl, and the inventory of human and AI cognitive biases on the terrain of overconfidence.

In October 1962, the Executive Committee of the U.S. National Security Council deliberated for thirteen days over the Soviet missiles in Cuba. On the table: reconnaissance photographs, divergent estimates, operational hypotheses. Over the following forty years, declassified records would reveal a trait common to every intervention by the senior analysts: no assertion left the room without its certainty marker. "We know that…", "we put the probability at 70% that…", "we cannot rule out that…", "there is nothing to indicate that…". The informational content of the sentences is inseparable from their epistemic charge.

Some would read this as a bureaucrat's affectation. It is, on the contrary, the very condition of possibility for any rational decision under incomplete information. And it is precisely the competence that the architecture of large language models does not reproduce.

Tetlock and the CIA's embarrassing revelation

In 2015, Philip Tetlock and Dan Gardner published Superforecasting: The Art and Science of Prediction, which synthesizes Tetlock's long-running research on expert judgment and the results of the Good Judgment Project. The project, funded by IARPA, the research arm of the Office of the Director of National Intelligence, set two populations side by side on geopolitical questions: professional intelligence analysts and volunteer amateurs, the best of whom were identified solely by their scored forecasting record.

The result remains one of the most uncomfortable in the recent history of American intelligence: the best amateurs reportedly beat the professional analysts by roughly 30%, measured in Brier score. Tetlock identifies the trait common to these superforecasters: they are not dramatically more intelligent, they are not better informed, and they have no access to classified data. They share one meta-competence, probabilistic calibration: the capacity to say "I am 65% confident" rather than "I am almost certain," and to see their 65% forecasts come true in about 65% of cases over the long run.

This meta-competence is built through practice, systematic feedback, and the discipline of explicit marking. Superforecasters operate less like oracles than like accountants of their own uncertainty.

Wittgenstein, Russell, and the marker / operator distinction

The philosophical tradition had posed the problem a century earlier. Wittgenstein, in the closing proposition of the Tractatus (1921), formulates the most quoted aphorism in analytic philosophy: "Whereof one cannot speak, thereof one must be silent." The sentence is almost always misread as an injunction to keep silent about the mystical. Its reach is more precise: an assertion that strays outside the zone where epistemology is tenable is no longer an assertion — it is another speech act, and must be treated as such.

Bertrand Russell, in An Inquiry into Meaning and Truth (1940), pushes the analysis further by positing the concept of the epistemic operator. For him, every assertion implicitly carries a prefix — "I know that p," "I believe that p," "It is probable that p," "I suppose that p." Confusing these operators irreversibly degrades the quality and rationality of discourse.

This distinction, which is central, is almost always conflated with the mere presence of markers.

An epistemic marker is a surface fact — a word, an adverb, a modulation: "perhaps," "it seems that," "plausibly," "in all likelihood." It is stylistic matter: it signals an intention of caution to the reader without altering the nature of the assertion it accompanies.

An epistemic operator intervenes on the substance. It transforms the truth value of the proposition it prefixes and, in a contractual context, its legal scope. To say "we guarantee GDPR compliance" and "we assess ourselves to be compliant with GDPR requirements" is not a matter of stylistic variation. These are two assertions of different legal natures: the first commits the signature, the second takes a position. A contractor bound by the first exposes itself to an action for performance if a non-compliance is found; a contractor bound by the second has taken a verifiable positioning act, not a locked-in contractual commitment.

A tender response is a system of epistemic operators disguised as natural language. The apparent fluency of professional prose masks a stack of prefixes — explicit or implicit — that determine, sentence by sentence, what the contractor is bound to deliver in execution. An experienced buyer does not read the prose; they read the operators.

The formal metric: the Brier score

Epistemic marking has a formal measure. Glenn Brier, an American meteorologist, proposed it in 1950 in an article in the Monthly Weather Review. The Brier score is the mean squared difference between the announced probability and the outcome, coded 1 if the event occurred and 0 if it did not; the lower the score, the better. A forecaster who announces "80% probability of rain" across a hundred days is calibrated if rain falls on seventy-eight to eighty-two of them; they drift if it falls on fifty or on ninety-five. What the score rewards over the long run is the alignment between announced confidence and observed frequency, more than the luck of any single forecast.
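A minimal sketch of the computation, assuming the standard binary form of the score (mean squared difference between the announced probability and an outcome coded 1 or 0). The function name and the sample figures are illustrative and do not come from Brier's article.

```python
def brier_score(forecasts, outcomes):
    """Mean squared gap between forecast probabilities and binary outcomes.

    0.0 is a perfect score; always answering 0.5 scores 0.25.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical example: "80% probability of rain" announced on 100 days.
calibrated = brier_score([0.8] * 100, [1] * 80 + [0] * 20)  # rain on 80 days
drifting = brier_score([0.8] * 100, [1] * 50 + [0] * 50)    # rain on 50 days
```

With these made-up series the calibrated forecaster scores about 0.16 and the drifting one about 0.34: the same announced confidence, penalized more heavily when the observed frequency moves away from it.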

The metric has since been widely adopted across the literature on calibration: electoral polling (Nate Silver, FiveThirtyEight), medical assessment, macroeconomic forecasting, and now the evaluation of large language models. It imposes an objective criterion on what appears subjective: a discourse is epistemically honest if the realization frequency of its 70% claims sits around 70%, and that of its 95% claims around 95%. Overconfidence can be measured; it can be corrected; it cannot be masked for long.
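The criterion above can be checked by grouping claims by their announced probability and comparing each group with its observed frequency. The sketch below is one simple way to do that; the function name and the track record are hypothetical.

```python
from collections import defaultdict


def calibration_table(forecasts, outcomes):
    """Group forecasts by announced probability and report the observed frequency."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[round(p, 2)].append(o)
    return {p: sum(hits) / len(hits) for p, hits in sorted(buckets.items())}


# Hypothetical track record: ten claims made at 70%, twenty at 95%.
forecasts = [0.7] * 10 + [0.95] * 20
outcomes = [1] * 7 + [0] * 3 + [1] * 19 + [0] * 1
table = calibration_table(forecasts, outcomes)
```

A well-calibrated author yields a table whose keys and values roughly coincide; an overconfident one shows observed frequencies sitting below the announced probabilities.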

Four operator levels in bid management

A tender response is an act of contractual commitment under incomplete information. Across eighty criteria, say, the bid manager may have certainty on only half. On the other half, the epistemic operator prefixed to each sentence is a legal act as much as a rhetorical one.

Four operator levels structure a professional technical proposal.

Level 1 — the anchored factual assertion. "Our team has delivered forty-three similar projects since 2019." Verifiable, dated, quantified. The only level at which a buyer can take the assertion as given. A serious tender response contains around twenty of these, no more.

Level 2 — the calibrated estimate. "We estimate the implementation timeline at twelve weeks, based on consolidated feedback from comparable configurations." The estimate carries its source. The reader knows what it rests on. The margin of error is implicit but not denied.

Level 3 — the hypothetical modalization. "Subject to the availability of business stakeholders during the framing phase, deployment could be completed in eight weeks." The commitment is conditional, the condition is named. The register of points where one takes a position without guaranteeing it.

Level 4 — the admission of operational ignorance. "Compatibility with the specific configurations mentioned in section 4.7.3 of the CCTP will require additional investigation during the framing phase." We don't know. We say so. We name the mode of resolution. This admission paradoxically functions as a powerful signal of seniority, because only someone who masters a subject can precisely identify the zone they do not yet master.

A technical proposal that flattens these four levels into one — whether it is "we guarantee" everywhere or "we would be in a position to" everywhere — loses all informational value for the experienced reader. The system of operators collapses, and with it the legal legibility of the file.
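As an illustration, the four levels can be represented as a small enumeration, with a check for the flattening described above. This is only a sketch: it assumes a reviewer has tagged each sentence by hand, and the names `OperatorLevel`, `is_flattened` and the 0.9 threshold are hypothetical choices, not an established method.

```python
from collections import Counter
from enum import IntEnum


class OperatorLevel(IntEnum):
    ANCHORED_FACT = 1        # verifiable, dated, quantified
    CALIBRATED_ESTIMATE = 2  # estimate that carries its source
    CONDITIONAL = 3          # commitment subject to a named condition
    ADMITTED_UNKNOWN = 4     # ignorance stated, with a mode of resolution


def is_flattened(levels, threshold=0.9):
    """True when one operator level covers at least `threshold` of the tagged sentences."""
    if not levels:
        return False
    most_common_count = Counter(levels).most_common(1)[0][1]
    return most_common_count / len(levels) >= threshold


# Hypothetical hand-tagged proposals.
uniform = [OperatorLevel.ANCHORED_FACT] * 20
mixed = (
    [OperatorLevel.ANCHORED_FACT] * 5
    + [OperatorLevel.CALIBRATED_ESTIMATE] * 8
    + [OperatorLevel.CONDITIONAL] * 5
    + [OperatorLevel.ADMITTED_UNKNOWN] * 2
)
```

Here `is_flattened(uniform)` is true and `is_flattened(mixed)` is false. The check says nothing about whether each tag is justified; it only flags a file in which the system of operators has collapsed into one.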

Why LLMs are structurally overconfident

Three converging mechanisms explain why this discipline does not survive in an AI-generated response.

The training distribution flattens the operators. During pre-training, the model learns the joint distribution of tokens across a massive corpus. But direct factual sentences are vastly more frequent in it than epistemically marked ones: "the capital of France is Paris" outweighs "it is probable, at 99.9%, that the capital of France is Paris" by several orders of magnitude. The model learns to produce the dominant form. When asked for an uncertain assertion, it produces the certain form — that is the most probable completion.

RLHF amplifies the bias. Ouyang et al. (NeurIPS 2022) laid out the reference architecture for Reinforcement Learning from Human Feedback. Human annotators, often paid per task and rarely experts in the domain under evaluation, tend to prefer clear, complete, assertive answers. A modalized answer ("I'm not sure, but I think that…") risks being rated down as "evasive" or "unhelpful." The training gradient can therefore push the model to increase apparent confidence even as effective knowledge declines. This is, point for point, the inverse of what a superforecaster learns.

The absence of an exposed internal calibration signal. Kadavath et al. (Anthropic 2022), in "Language Models (Mostly) Know What They Know," published a study long read as reassuring: LLMs can, internally, distinguish the questions where they have the right answer from those where they invent it. The probability assigned to the correct token is higher in the first case. But this distinction remains internal and unexposed. The model does not output the probability. It outputs the sentence, with the same intonation of authority whether the probability that the sentence is accurate is 95% or 30%. Lin, Hilton, and Evans (TMLR 2022), in "Teaching Models to Express Their Uncertainty in Words," attempted to correct this by training a model to produce explicit verbal confidence estimates. The result is nuanced: the improvement is measurable, but calibration remains far below that of a trained human analyst.

The consequence in bid management. An LLM asked to draft a technical proposal produces, by default, text at operator level 1 throughout — "we guarantee," "our solution fully meets," "our approach enables." Where the human prefixes different operators according to the degree of knowledge, the AI prefixes a single operator according to the statistical mode of the corpus. A senior buyer recognizes the signature immediately: permanent overconfidence is one of the clearest AI markers, just ahead of tricolon saturation and the stacking of corrections.

Calibration as a marker of seniority

In consulting, epistemic calibration is one of the seniority markers hardest to imitate. A junior consultant, faced with a difficult engagement, writes:

"This transformation presents major risks requiring particular attention."

The tone is uniform, the verb is flat, the implicit epistemic operator is the bald statement ("it is the case that"). No source, no calibration, no delimitation. The sentence signs the absence of metacognition.

A senior partner, on the same engagement, writes:

"Based on five comparable engagements conducted between 2019 and 2024, we estimate that this transformation carries a high operational risk over the first six months; two variables will determine success — the quality of business-side steering and the level of preparation of the legacy data."

The second sentence contains four explicit operators — the source (five engagements, a dated period), the estimate (we estimate, not we know), the temporal delimitation (the first six months, not the long term), and the identification of the structuring unknowns (two named variables). The experienced reader extracts more useful information from it than from the first, because every word is calibrated against an observable reality.

This discipline is less an option than the very substance of senior consulting — precisely what LLMs, at this stage of their architecture, do not reproduce, owing to the construction of the training gradient more than to any lack of data.

Three operational practices in AI-assisted bid management

Identify the zones of mandatory operator before generating. A technical proposal typically has six to ten passages where the prefixed operator carries a critical legal charge: timeline commitments, cost commitments, client references, certifications, technical compatibilities, team capacity, regulatory compliance. These passages must be drafted or reviewed by hand. The LLM can produce the skeleton of the chapter, never the final operator of those particular sentences.

Audit overconfidence passage by passage. For each paragraph produced by the AI, ask the question: "what is the real probability that this sentence is accurate if read literally by the buyer?" Probability below 90% → the operator must be adjusted downward. Below 60% → the sentence must be rewritten or deleted. This audit takes three to five minutes per critical passage. It is non-negotiable.
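The two thresholds of the audit can be written down as a small triage rule. This is a sketch: the probability is the reviewer's own estimate, and the function name and action labels are hypothetical.

```python
def audit_action(probability):
    """Map the reviewer's estimated probability that a sentence holds, read literally, to an action."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < 0.60:
        return "rewrite or delete"
    if probability < 0.90:
        return "downgrade operator"
    return "keep"


# Hypothetical reviewer estimates for three AI-drafted sentences.
actions = [audit_action(p) for p in (0.95, 0.75, 0.40)]
```

The rule is deliberately coarse. Its value lies less in the exact cut-offs than in forcing a number onto each critical sentence before it goes to the buyer.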

Prepare the instantiation for the oral defense. Every level 2 or 3 operator placed in the proposal must be prepared to be instantiated if the buyer pushes on it in clarification. "We estimate twelve weeks" must be backed by a list of five comparable projects with their actual durations. "Subject to the availability of business stakeholders" must be translatable into "two weekly interviews of one hour each over the first six weeks." A modalization that cannot be instantiated is a trap laid for oneself, disguised as a figure of caution.
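One way to prepare that instantiation is to keep the comparables next to the estimate and check that the figure is actually supported. The sketch below assumes a simple rule (at least five comparables, estimate within their observed range); the rule, the function name and the durations are all hypothetical.

```python
from statistics import median


def backs_estimate(estimate_weeks, actual_durations, min_projects=5):
    """Check that an estimate has enough comparables and falls within their observed range."""
    if len(actual_durations) < min_projects:
        return False
    return min(actual_durations) <= estimate_weeks <= max(actual_durations)


# Hypothetical actual durations of five comparable projects, in weeks.
comparables = [10, 11, 12, 13, 15]
summary = {
    "median_weeks": median(comparables),
    "range_weeks": (min(comparables), max(comparables)),
    "supported": backs_estimate(12, comparables),
}
```

With these invented figures the twelve-week estimate is supported, and the summary gives the bid manager the median and range to quote if the buyer pushes in clarification. An estimate with only two comparables behind it would fail the check.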

What remains to the human author

Epistemic marking is one of the last grounds where the human signature remains structurally more reliable than machine output. Model size changes nothing here. Neither does the transformer architecture. The determining factor lies in the vocation of the training gradient: RLHF was designed to produce answers that are useful, clear, and complete. Probabilistic calibration, which demands precisely that one withhold commitment as knowledge declines, runs head-on against that vocation.

For a bid manager, a consultant, a negotiator, epistemic marking constitutes the condition of possibility for contractual commitment. A tender response belongs to the order of the legal act more than to a demonstration of conversational fluency. As long as models are optimized for fluency rather than calibration, it falls to the human to place the final operator.

The machine can draft the explanatory paragraphs.

The epistemic operator that signs contractual authority — that one must still place oneself.

Principal sources: Tetlock & Gardner, Superforecasting: The Art and Science of Prediction, Crown, 2015. Brier, "Verification of Forecasts Expressed in Terms of Probability," Monthly Weather Review, 1950. Kadavath et al., "Language Models (Mostly) Know What They Know," arXiv 2207.05221, Anthropic 2022. Lin, Hilton & Evans, "Teaching Models to Express Their Uncertainty in Words," Transactions on Machine Learning Research, 2022. Ouyang et al., "Training language models to follow instructions with human feedback," NeurIPS 2022. Wittgenstein, Tractatus Logico-Philosophicus, 1921. Russell, An Inquiry into Meaning and Truth, Allen & Unwin, 1940.
