Thought Leadership · May 7, 2026 · 21 min read

Evaluating an AI output when you are not the expert: the reasoning-pattern path

The classic advice for evaluating an AI output -- check the sources, run an internal red team, multiply the sessions -- has aged badly by 2026. None of it answers the real question: how do you produce an excellent output on a subject you have not mastered, and how do you verify that you reached the objective when you cannot judge the content on its merits? The answer shifts planes. Knowledge and reasoning are two distinct objects. The AI possesses the first; it has no intrinsic preference about the second. Left to its own devices, it applies the median reasoning of its corpus -- an average with no particular superiority. Superiority has to be imposed on it. And the human is the only possible source of that superiority -- provided they have recognised their own reasoning pattern and agreed to put it to work.

By Aléaume Muller

Sixth article in the cognition / doctrine block. With the real cost of inference, the nature of genuine agentics and method design as a competency already laid out, one question remains -- the one every AI training course sidesteps by offering a catalogue of surface techniques in its place: how do you produce and sign off on an AI output when you are not the expert on the subject it covers?

Three storeys have stacked up over three years in the AI competency as it is taught inside organisations. None answers the real question, and each has made people believe it did.

First storey -- the schoolbook grammar of the prompt. Every beginner course today teaches the same four words, covered by four scrupulously applied paragraphs: role, context, example, task. A few variants add format and constraints. The trained user dutifully writes "you are an expert in X", "here is situation Y", "here is an example of the expected answer", "produce Z in such-and-such format". The result is technically correct. It clears the bar of a usable deliverable, barely exceeds the average, and gives the commissioner the reassuring sense of having mastered their tool. It is the rigour of application that reassures; it is not the quality of the deliverable that rises.

Second storey -- the tricks of apparent sophistication. The user who wants to move up a notch applies the techniques that the "professional" courses of the vendors, consultancies and cloud hosts -- AWS, Azure, Google -- popularised in 2024-2025: chain of thought (asking the AI to reason step by step before concluding), tree of thought (making it explore several branches of hypotheses then select the best), "having several experts debate" (impersonating three expert personae and producing a collegial deliberation), pitting an adversarial agent against itself, generating a self-critique at the end of the answer. Each of these techniques has its literature, its certification, its encouraging field reports. Each produces, on a concrete deliverable, the appearance of deliberation and the feeling of added rigour.

None substantially raises the quality of the substance, because all rest on an anthropomorphic overlay -- the idea that multiplying "agents" the way you multiply humans would make a democracy of AIs emerge, superior to the autocracy of a single agent. That projection is wrong at the root. Where debating humans each bring a situated, differentiated background, training and experience -- and are called upon precisely for those differences -- several AI personae are nothing but a single machine inference talking to itself. The "debate" between simulated experts is a generation loop on the same substrate, with no real opponent, no outside experience, no history of its own. Running that inference in circles, with no explicit reasoning pattern imposed, produces an averaged consensus that looks like a debate while having none of its productive properties. It is the illusion of debate, the illusion of control, the illusion of a mastered consensus -- all three resting on the central illusion that the AI agent is some kind of human.

The only possible rescue is for a competent human to arbitrate, as the final authority, between the simulated voices. But then the benefit of agentics evaporates: in a near-automated process that has to scale, the human who arbitrates is precisely the bottleneck you were trying to remove. Either the human arbitrates and scalability collapses, or they do not arbitrate and the average prevails. In both cases, the detour through multiplying agents has produced nothing.

Third storey -- the post-hoc evaluation arsenal. For the executive who has to sign off on the produced deliverable, the catalogue of countermeasures has filled out: check the sources, multiply sessions and triangulate, summon a fictitious expert evaluator into the prompt, ask a second model to audit the first, apply the rule of three sources, demand a proof through an explicit chain. All these techniques quietly assume that a competent evaluator exists somewhere in the loop -- oneself, another agent, a summoned expert. When that evaluator does not exist -- and it almost never exists in a zone of user ignorance -- these techniques produce noise that resembles rigour.

An important clarification on what these techniques are no longer meant to solve in 2026. The risk of factual hallucination, which in 2024 and early 2025 justified a good part of the post-hoc evaluation arsenal, has been massively reduced by the anti-hallucination machinery built into SOTA models -- grounding on verified sources, calibration of uncertainties, native RAG, explicit refusal to answer when confidence is low. The model's massive knowledge is, in 2026, relatively reliable -- not perfect, but an order of magnitude beyond what it was eighteen months ago. The problem left to address is therefore no longer occasional factual inaccuracy. It lies elsewhere: in the quality of the path by which this reliable knowledge is mobilised to answer a complex question.

The three storeys share a structural flaw. They treat production and evaluation as two separate moments. They assume the quality of an output is decided in how the prompt is framed upstream, or in critical vigilance downstream. Neither is true. The quality of an AI output is decided in the path by which the model's massive knowledge is mobilised to reach the answer. And that path lies neither in the role, nor the context, nor the example, nor the triangulation. It lies in the reasoning pattern the user imposes on the model -- whose recognition, mastery and imposition constitute the real AI competency of 2026.

We need to shift planes.

The distinction missing from most AI training

An AI output is the culmination of two very different things.

The first is the knowledge mobilised. Everything the model knows about the subject: facts, references, vocabulary, conceptual structures, comparable examples. On the SOTA models of 2026 -- Opus 4.7, GPT-5.5, Gemini Pro -- the amount of available knowledge exceeds, by several orders of magnitude, what the best human expert holds in active memory on a given subject. Knowledge is, in the majority of practical cases, a battle lost in advance for the human, and one they have no reason to fight again.

The second is the reasoning applied. The path by which this knowledge is mobilised to reach the output. Which inferences are made, in what order, on which hypotheses, eliminating which options, confronting which tensions. On this terrain, the AI has no intrinsic preference. The model learns, through exposure to its corpus, multiple reasoning patterns -- deductive, inductive, abductive, dialectical, narrative, first principles, Bayesian. When you ask it a question with no reasoning framing, it applies the pattern that is statistically dominant in the corpus for that type of question. It is an average. And an average, by construction, has no particular superiority.

It is this distinction that changes everything.

The user who asks the AI "answer me on this subject" gets an answer rich in knowledge, fluent in exposition, median in reasoning. They validate the knowledge they do not have, they validate the fluency they take for rigour, they validate a median that has no particular reason to be good for their specific objective. The result is what it is: a deliverable that looks competent, that does not survive an audit, and that the user themselves has no way to defend if challenged on the reasoning path that produced it.

This observation matters because it lets us locate exactly where the quality of an AI output is decided. Knowledge is a given. Reasoning remains to be imposed. And it is imposed neither by accumulating experts, nor by triangulating sessions, nor by red teaming. It is imposed through the human's awareness of the reasoning pattern they have chosen.

The main ways of reasoning

For centuries, epistemology has mapped the productive reasoning patterns. A dozen archetypes coexist, proven across varied disciplines and available for a human to appropriate.

Deductive reasoning starts from general premises and draws the necessary consequences for a particular case. The dominant form of law, mathematics, compliance checking. Productive when the premises are reliable; a trap when one mistakes a convention for a truth.

Inductive reasoning starts from particular observations and draws a probable generalisation. The dominant form of the experimental sciences, market studies, consolidated lessons learned. Productive when the sample is representative; a trap when one extrapolates to an out-of-distribution case.

Abductive reasoning starts from a surprising fact and seeks the most economical hypothesis that would explain it. The dominant form of medical diagnosis, criminal investigation, senior strategic consulting. Productive when one holds the tree of hypotheses without stopping at the first seductive one.

Steelmanning consists of reconstructing the opposing argument in its strongest version before criticising it. A discipline of practical epistemology, indispensable in pre-sales, in negotiation, in adversarial debate.

Dialectics sets thesis and antithesis in tension to produce a synthesis that transcends both. The dominant form of transformation consulting, political philosophy, arbitration strategy in complex situations.

First principles consists of decomposing a problem down to its irreducible building blocks, then reconstructing the solution without permitting any shortcut by analogy. The favoured form of disruptive engineering, product innovation, cost rationalisation.

Bayesian reasoning updates a probabilistic belief as new information arrives. The dominant form of forecasting, intelligence, diagnosis under uncertainty.

Scenario reasoning explores several coherent futures to prepare a decision robust to more than one of them. The dominant form of strategic planning, war gaming, risk analysis.

This list is not exhaustive. It is enough to establish the main point: there is no single good way of reasoning, there are about a dozen. Each has its domains of validity and its own traps. And every human, by temperament, training and experience, practises two or three better than the others -- to the point of understanding them from the inside, knowing how to calibrate them, knowing how to criticise them, and being able to impose them on others.

It is on that competency that the quality of an AI output is decided, in 2026.
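Of the patterns listed in this section, Bayesian updating is the easiest to make concrete in a few lines. The sketch below is a minimal illustration of Bayes' rule applied to a single hypothesis; the function name and all the probabilities are assumptions chosen for the example, not figures from a real case.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return the posterior P(H | E) given P(H), P(E | H) and P(E | not H)."""
    numerator = prior * p_e_given_h
    evidence = numerator + (1.0 - prior) * p_e_given_not_h
    if evidence == 0.0:
        raise ValueError("the evidence has zero probability under both hypotheses")
    return numerator / evidence


# Illustrative numbers only (assumptions, not data from a real case):
# a 30% prior belief, and a new fact four times likelier if the hypothesis holds.
belief = 0.30
belief = bayes_update(belief, p_e_given_h=0.80, p_e_given_not_h=0.20)
print(f"after first fact:  {belief:.3f}")

# A second fact that fits the hypothesis poorly pulls the belief back down.
belief = bayes_update(belief, p_e_given_h=0.10, p_e_given_not_h=0.50)
print(f"after second fact: {belief:.3f}")
```

The point of the example is the shape of the pattern rather than the numbers: each new fact moves the belief by an amount that depends on how much better one hypothesis explains it than the other.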

The reasoning contract

The human who lays their reasoning pattern before the AI enters into a contract with it of a different nature than those the courses teach. Not a contract of expertise ("you are an expert in X"). Not a contract of format ("answer in six paragraphs"). Not a contract of validation ("check your sources"). A contract of path: "you may know everything better than me and much faster. But on this subject, you will follow this process, because it is in this process that your knowledge will maximise its relevance to reach the objective."

This contract is not reducible to naming an archetype -- "take a deductive approach", "reason by abduction". Reducing the pattern to a label would mean falling back into the very schoolbook grammar we are trying to transcend. Useful reasoning is polymorphic and multidimensional. It combines several strata that must be made explicit separately, because each engages a distinct decision by the user.

The nature of the main reasoning -- deductive, abductive, first principles, dialectical -- sets the general direction. But that direction is not enough. One must still specify: which premises the reasoning rests on, and which must be verified before being taken for granted. Which intermediate steps are mandatory and in what order, which may be merged. Which evaluation criteria each intermediate hypothesis must clear to be retained, in the form of a genuine grid -- not a soft "check that it's right", but a discriminating list ("does this hypothesis also explain fact Y? does it survive counter-example Z? is it compatible with constraint W?"). Which points of return are permitted -- feedback loops that re-evaluate a hypothesis in light of the consequences deduced, or even abandon a branch to start again from higher up. Which known traps of the pattern must be explicitly neutralised (for instance, on abduction: stopping at the first seductive hypothesis without holding the tree of alternatives). Which exit conditions mark the end of the reasoning -- reaching a calibrated conviction, exhaustion of admissible hypotheses, a clear signal of irreducible uncertainty to be exposed as such.

A complete reasoning contract then looks less like an instruction than a detailed operating procedure. On certain short tasks, two or three lines suffice. On strategic tasks -- analysing a tender, complex diagnosis, arbitration under constraint -- it takes the form of a multi-step process with its verification loops, its criteria grids, its permitted breaks. It is this density that distinguishes a genuine contract of path from a mere method label.
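One way to picture such a procedure is as a structured record with one field per stratum, rendered into plain-text instructions for the agent. The sketch below is a hypothetical illustration: the class name, the field names and the rendering format are assumptions, and the sample content simply reuses the abduction example from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningContract:
    """Hypothetical container: one field per stratum of a contract of path."""
    pattern: str                    # nature of the main reasoning
    premises_to_verify: List[str]   # premises not to be taken for granted
    ordered_steps: List[str]        # mandatory intermediate steps, in order
    criteria_grid: List[str]        # discriminating questions for each hypothesis
    return_points: List[str]        # permitted feedback loops
    traps: List[str]                # known traps of the pattern to neutralise
    exit_conditions: List[str]      # what marks the end of the reasoning

    def render(self) -> str:
        """Turn the contract into plain-text instructions for the agent."""
        sections = [
            ("Premises to verify first", self.premises_to_verify),
            ("Mandatory steps, in order", self.ordered_steps),
            ("Criteria each hypothesis must clear", self.criteria_grid),
            ("Permitted points of return", self.return_points),
            ("Traps to neutralise", self.traps),
            ("Exit conditions", self.exit_conditions),
        ]
        lines = [f"Main reasoning pattern: {self.pattern}"]
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"  {n}. {item}" for n, item in enumerate(items, start=1))
        return "\n".join(lines)


contract = ReasoningContract(
    pattern="abductive",
    premises_to_verify=["The surprising fact is actually established in the source material"],
    ordered_steps=[
        "List the candidate hypotheses",
        "Test each hypothesis against the criteria grid",
        "Retain, rank or discard each hypothesis, stating why",
    ],
    criteria_grid=[
        "Does this hypothesis also explain fact Y?",
        "Does it survive counter-example Z?",
        "Is it compatible with constraint W?",
    ],
    return_points=["Re-open a discarded hypothesis if a deduced consequence contradicts the retained one"],
    traps=["Stopping at the first seductive hypothesis without holding the tree of alternatives"],
    exit_conditions=[
        "A calibrated conviction is reached",
        "Admissible hypotheses are exhausted",
        "Irreducible uncertainty is stated as such",
    ],
)
prompt_text = contract.render()
print(prompt_text)
```

Keeping each stratum in its own field forces a separate decision for each one, which is the difference between this and a bare method label such as "reason by abduction".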

That contract sentence, and the procedure that extends it, change everything, because they put the user back in a role they can actually hold. They no longer judge the content -- they have given up judging it. They no longer validate the fluency -- they know it proves nothing. They validate the path by which the knowledge they do not possess was mobilised to reach the output. And that path is of their own making. They hold its internal grammar. They know, step by step, whether the trajectory they laid down was followed, or whether the agent drifted toward the statistical median.

A decisive clarification: the user does not validate this path a priori, on the basis of a promise or a declaration of intent. They validate it within the AI's production itself -- in the structure of the sentences, the chaining of paragraphs, the unfolding of the description or the argumentation, the explicit markers of the steps cleared. The imposed pattern must surface on the face of the produced text, in a form recognisable by whoever laid it down. The user then reads two things simultaneously: the result, which they cannot always judge on its merits; and the demonstration, which they can verify because they wrote its grammar.
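A rough sketch of that second reading is shown below: it checks that the step markers the user asked for actually appear in the produced text, and in the order laid down. The marker format and the function name are hypothetical assumptions, and the check is deliberately superficial; it says nothing about whether the reasoning inside each step is sound.

```python
from typing import Dict, List


def check_path(output: str, expected_markers: List[str]) -> Dict[str, object]:
    """Report whether the laid-down step markers surface in the output, and in order.

    This only inspects the surface of the text; it does not judge whether
    the reasoning inside each step is sound.
    """
    positions = [output.find(marker) for marker in expected_markers]
    missing = [m for m, pos in zip(expected_markers, positions) if pos == -1]
    found = [pos for pos in positions if pos != -1]
    in_order = all(a < b for a, b in zip(found, found[1:]))
    return {"missing": missing, "in_order": in_order, "followed": not missing and in_order}


# Hypothetical marker format, chosen by the user when laying down the pattern.
markers = ["[STEP 1: premises]", "[STEP 2: hypotheses]", "[STEP 3: criteria]", "[STEP 4: exit]"]

# Invented agent output in which the criteria step was skipped.
agent_output = (
    "[STEP 1: premises] Two premises are stated in the brief; one is unverified.\n"
    "[STEP 2: hypotheses] Three candidate explanations are kept open.\n"
    "[STEP 4: exit] The second explanation is retained.\n"
)

report = check_path(agent_output, markers)
print(report)
```

In this invented output the criteria step is missing, so the path is reported as not followed even though the conclusion reads fluently.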

TenderGraph TITAN illustrates this mechanism concretely. The model gives central importance to the distinction between what is explicitly written in the specifications and the hypotheses inferred from them. That boundary is drawn in the output itself -- each assertion carries the mark of its epistemic regime: literal citation, faithful paraphrase, or inference flagged as such. The model never loses track of its own reasoning: it indicates where each element comes from, at which step it was produced, on what basis it holds. For its own inferences, it first studies alternative leads, then looks in the DCE (the dossier de consultation des entreprises, i.e. the tender documentation) for factual or at least plausible verification, and it states explicitly when it retains a hypothesis for lack of a discriminating element.

This is, in the end, an approach close to the scientific method applied to agentic production: never admit as given what has not been proven, and surround the handling of the hypotheses that follow from it with extreme caution. The bid manager does not need to be an expert on the subject of the DCE to validate this discipline. It is enough to read the epistemic-regime markers in the output -- and to observe that the reasoning pattern they contracted with the agent is indeed present, step by step, in the production before their eyes.
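To make the idea of epistemic-regime markers tangible, here is a minimal sketch of assertions tagged with one of the three regimes named above, plus an audit that flags incomplete markings. It is not TenderGraph TITAN's data model or implementation: every class, field and function name is a hypothetical assumption, and the sample assertions and section numbers are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Regime(Enum):
    LITERAL_CITATION = "literal citation"
    FAITHFUL_PARAPHRASE = "faithful paraphrase"
    INFERENCE = "inference"


@dataclass
class Assertion:
    text: str
    regime: Regime
    source_ref: Optional[str] = None  # where in the source document it comes from
    basis: Optional[str] = None       # for an inference: what it rests on


def audit(assertions: List[Assertion]) -> List[str]:
    """Return one message per assertion whose regime marking is incomplete."""
    problems = []
    for i, a in enumerate(assertions, start=1):
        if a.regime is Regime.INFERENCE:
            if not a.basis:
                problems.append(f"assertion {i}: inference with no stated basis")
        elif not a.source_ref:
            problems.append(f"assertion {i}: {a.regime.value} with no source reference")
    return problems


# Invented sample content; section numbers are placeholders.
sample = [
    Assertion("Delivery is due within 90 days.", Regime.LITERAL_CITATION, source_ref="section 4.2"),
    Assertion("The buyer expects on-site support.", Regime.FAITHFUL_PARAPHRASE),
    Assertion("The buyer probably favours incumbents.", Regime.INFERENCE),
    Assertion("Price is likely the deciding criterion.", Regime.INFERENCE,
              basis="weighting stated in section 6.1; no contrary element found"),
]

for message in audit(sample):
    print(message)
```

A reader who knows nothing about the tender's subject can still run this kind of check, because it bears on the marking discipline and not on the content.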

This posture resolves the central paradox of AI evaluation in a zone of ignorance. The user does not become an expert -- they never will in the time of a session. But they become the architect of the reasoning that will produce the answer. Their competency ceases to be the content, which will always escape them. Their competency becomes meta-cognitive -- lucidity about their own form of thought, and the discipline of imposing it on an agent that, without it, would fall back into the median.

This shift is the contemporary translation of an old Socratic posture, but reinvented. To recognise that one does not know -- gnoseological humility. To recognise nonetheless how one reasons, and to accept that it is on this terrain alone that one can legitimately govern -- strategic requirement. To impose this reasoning on the machine, and to validate the only terrain one is in a position to validate -- taking responsibility for the result.

AI as a device for cognitive elevation

There is, in this posture, an effect that few people describe in 2026, and that is probably the most profound consequence of well-conducted AI use.

When a human lays their reasoning pattern before the AI, and the AI produces conclusions according to that pattern with its massive knowledge, the human can do three things in succession. They can verify that the pattern was indeed followed. They can read the conclusion they partially validate. And above all, they can confront the depth their own pattern reaches when it is applied with knowledge they do not have.

This confrontation is new in the history of human-machine collaboration. The user discovers, in their own reasoning, potentialities they had never explored alone, for lack of knowledge. The pattern becomes a producer of depth. The AI ceases to be a producer of deliverables and becomes a device for personal cognitive elevation, whose quality is exactly proportional to the quality of the pattern the user imposes.

A mediocre pattern produces a mediocre elevation. A rigorous pattern -- applied abduction, systematic steelmanning, strict hypothetico-deductive method -- produces an elevation that, over time, transforms the user themselves. Their practice rises. Their reading of situations rises. Their judgement rises. This effect is not immediate; it unfolds over dozens, then hundreds of sessions. It is the most profitable personal investment that AI makes possible in 2026, and it is what distinguishes the users who rise through their use from those who go numb.

Conversely, the user who lays down no pattern, or lays down a pattern they are not aware of holding, follows a bell-shaped trajectory. At first, the AI sharpens their intuition -- they reach more material faster. Then, as use intensifies without a frame, the AI collapses their intuition. The plausible fluency of the outputs degrades the approximation detector the user had built over a twenty-year career. They get used to validating without discernment. They get used to no longer seeking the path. They become, through cognitive anaesthesia, the mirror of the median that comes out of the machine.

To choose your reasoning pattern, to master it, to impose it, is to choose the rising side of this curve. To give up laying it down is to choose the descending side.

The case of semi-autonomous agentics

The terrain where this discipline becomes absolutely decisive is that of agentics, and more particularly semi-autonomous agentics -- the dominant mode in 2026 on complex missions that the human can neither automate end to end, nor restart from scratch at each step.

On a tender, for example, the human must judge, correct and steer the AI all along the chain of analysis and production. They cannot redo alone what the agent does -- otherwise the agent serves no purpose. Nor can they validate everything after the fact -- otherwise they validate blind. They must rise to the level of the agent during the mission, which presupposes that they elevate themselves continuously, and that they make the agent the very instrument of that elevation.

This requires, on a tender, that they be aware of the reasoning pattern to impose at each phase. Abductive reasoning on the mapping phase, where the buyer's implicit strategy must be reconstructed. First principles reasoning on the pricing-simulation phase, where the weighting formula must be reconstructed from its building blocks. Steelmanning on the argument-review phase, where each thesis must be tested against its strongest version of the counter-argument. Scenario reasoning on the defence phase, where several trajectories of the evaluator's questioning must be anticipated. Without this discipline, the agent applies on each phase the median of its corpus, which resembles reasoning and is not.
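The phase-by-phase assignment described above can be written down as a small lookup table. This is only a sketch: the four phases and patterns are the ones named in the preceding paragraph, while the dictionary, the function name and the choice to raise an error are illustrative assumptions.

```python
# Phase names and patterns come from the paragraph above; the structure is an assumption.
PHASE_PATTERNS = {
    "mapping": "abductive",
    "pricing simulation": "first principles",
    "argument review": "steelmanning",
    "defence": "scenario",
}


def pattern_for(phase: str) -> str:
    """Return the reasoning pattern laid down for a phase.

    Raises instead of returning a default, so that a phase with no
    explicit pattern is noticed rather than silently left to the median.
    """
    try:
        return PHASE_PATTERNS[phase]
    except KeyError:
        raise ValueError(f"no reasoning pattern laid down for phase {phase!r}") from None


print(pattern_for("mapping"))

# A phase that has not been given a pattern is reported, not guessed.
try:
    pattern_for("chapter production")
except ValueError as err:
    print(err)
```

Refusing to fall back on a default is the design choice that matters here: a missing entry is treated as a decision the user still has to make.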

And it is precisely this discipline that elevates the user over time. A bid manager who spends twelve months imposing these patterns on an agent over real cases ends up practising them themselves better than they did before. The machine, by executing their reasoning with knowledge they do not have, reveals to them the depth their reasoning potentially contains. They become, through exposure to their own amplified pattern, a better thinker than they were when they started.

This is what TenderGraph TITAN was designed to make possible. The platform orchestrates, on a tender, a sequence of eleven semi-deterministic phases -- from exploring the DCE to the final revision, by way of mapping, strategy, solutioning, chapter production, review, simulated defence. Each phase carries an explicit reasoning frame, calibrated on what the phase demands: exploratory and abductive at the start, deductive and rigorous in the middle, steelmanning and adversarial on the reviews, scenario-based on the defence preparation. The bid manager does not have to reconstruct this intellectual discipline alone for each case. It is built into the infrastructure, and it grows richer as the bid manager inscribes their own reasoning style into it.

This integration produces two cumulative effects. In the short term, it averts the degradation through cognitive anaesthesia -- each session with TITAN carries an explicit reasoning frame, which keeps the user on the rising side of the curve, without their having to invent the discipline themselves. In the medium term, it accelerates personal elevation -- the bid manager who collaborates intensively with an agent whose every phase is governed by a rigorous reasoning pattern ends up internalising those patterns. Their reading of a tender rises, their quality of questioning rises, their strategic judgement rises. The AI transformation they benefit from becomes, without their having to put it into words, a personal transformation.

Operational consequence

For a management team that has invested in AI training for eighteen months and sees results below expectations, the diagnosis can be stated clearly. The classic techniques of evaluation and prompting have hit their ceiling, because they assume a competent evaluator who does not exist in a zone of user ignorance. The competency that takes their place -- the awareness of a chosen, owned, intelligible, imposable reasoning pattern -- belongs to a deeper discipline, and has almost nothing to do with what is called prompt engineering.

Stop training in post-hoc evaluation. Beyond basic vigilance against gross hallucinations, evaluation by the user alone hits its ceiling quickly, and the marginal return on further investment is disappointing.

Commit to training in awareness of reasoning. For bid managers, consultants, analysts, this training bears on their own cognitive practice -- which reasoning pattern fits them, how to recognise it, how to impose it on an agent, how to verify it was followed. This training has a name in the philosophical tradition: applied practical epistemology. It has almost never been taught outside the academic circuits of the philosophy of science. It becomes, in 2026, the most profitable AI-transformation investment of mature organisations.

Equip this discipline within an agentic infrastructure that embodies it. An isolated intellectual discipline remains a dead document. Inscribed in a semi-deterministic agentic chain -- where each phase carries its reasoning frame and where the user can intervene at each step -- it becomes a productive asset and a device for continuous elevation.

AI has not taken power over knowledge -- it has socialised it.

It has not taken power over reasoning -- it is waiting for someone to lay it down.

The competency that will distinguish, two years from now, the transformed organisations from those that will merely have consumed AI is neither expertise, nor prompting, nor method. It is the awareness each employee has of their own reasoning pattern, and the discipline of imposing that pattern on the machine. So that it amplifies the pattern without distorting it. And so that, through it, that pattern becomes more accurate with every session.


Main sources

Epistemology & reasoning: Peirce, "Deduction, Induction and Hypothesis", Popular Science Monthly, 1878. Polya, How to Solve It, Princeton University Press, 1945. Toulmin, The Uses of Argument, Cambridge University Press, 1958. Kuhn, The Structure of Scientific Revolutions, University of Chicago Press, 1962. Hempel, Aspects of Scientific Explanation, Free Press, 1965. Lakatos, The Methodology of Scientific Research Programmes, Cambridge University Press, 1978.

Cognitive science & dual process: Kahneman, Thinking, Fast and Slow, FSG, 2011. Stanovich & West, "Individual Differences in Reasoning", Behavioral and Brain Sciences, 2000. Evans, "Dual-Process Theories of Higher Cognition", Perspectives on Psychological Science, 2008.

Heuristics & bounded rationality: Simon, "A Behavioral Model of Rational Choice", Quarterly Journal of Economics, 1955. Gigerenzer & Todd, Simple Heuristics That Make Us Smart, Oxford University Press, 1999.

Tacit knowledge & expertise: Polanyi, The Tacit Dimension, University of Chicago Press, 1966. Dreyfus & Dreyfus, Mind over Machine, Free Press, 1986.

Applied methods: Schwartz, The Art of the Long View (scenario reasoning), Doubleday, 1991. Tetlock & Gardner, Superforecasting (probabilistic calibration), Crown, 2015. Galef, The Scout Mindset (steelmanning), Portfolio, 2021.

AI & alignment: Bai et al., "Constitutional AI: Harmlessness from AI Feedback", arXiv 2212.08073, Anthropic 2022. Anthropic, "Building effective agents", anthropic.com, 2024.
