Episode Four · Measurement

How to measure what you can't see

GEO metrics for the B2B practitioner. When your buyer's research journey runs through ChatGPT, Perplexity, and Gemini before arriving at your domain, the measurement problem is fundamental — not methodological. Here's what to track, how to calculate it, and what early data reveals.

MICHAEL ZACHRAU · FEBRUARY 2026 · 12 MIN READ

geo_audit.py · rhinegold.be

$ python geo_audit.py --client rhinegold --prompts 440

# Executing 440 prompts across 4 LLM providers

# 7-phase buyer journey coverage

Providers: OpenAI · Google · Anthropic · Perplexity

Languages: DE (primary) · EN (secondary)

Prompts executed: 1,760

─── RESULTS ───────────────────────────

SOV_overall: 34.2%

discovery_rate: 41.8%

list_presence: 28.6%

win_probability_score: 0.52

position_score: 0.61

─── PHASE BREAKDOWN ───────────────────

awareness: 61% (strong)

consideration: 29% (moderate)

decision: 8% (critical gap)

─── ALERTS ────────────────────────────

⚠ switching_housebank: 0 citations

⚠ criteria_selection: 2 citations

✗ competitor_compare: hallucination risk detected

Report: output/geo_report_rhinegold_feb26.xlsx

// 01

The measurement problem, stated precisely

Before we discuss what to measure, let's be precise about what the problem actually is. It is not that AI search is unmeasurable. It is that the measurement framework developed for Google — impressions, clicks, rankings, CTR — is categorically inapplicable to how LLMs work.

Search engines produce result pages. Those pages are measurable. LLMs produce synthesized responses. The synthesis process is opaque. No impression is registered. No referral is logged. The buyer who spent ninety seconds reading a Perplexity summary of your category left no trace in any system you own.

The implication is uncomfortable but necessary: you cannot measure LLM-mediated research the way you measure search traffic. You can only measure the outputs of that research — through systematic prompting, response analysis, and citation detection. GEO measurement is active, not passive. You must query the LLMs yourself, as your buyers do, and analyze what comes back.

GEO measurement inverts the traditional model. Instead of waiting for signals from buyers, you simulate the buyer — systematically, at scale, across every relevant question in the journey.
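A minimal sketch of what "simulating the buyer" could look like in code. It assumes each provider is wrapped in a plain callable that takes a prompt and returns text; the `Response` record, the `run_prompts` helper, and the stub provider are hypothetical names for illustration, and no real provider API is called.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable, List


@dataclass
class Response:
    provider: str
    prompt: str
    text: str
    captured_at: str  # ISO 8601 timestamp, UTC


def run_prompts(prompts: Iterable[str],
                providers: Dict[str, Callable[[str], str]]) -> List[Response]:
    """Send every prompt to every provider and keep the raw answers."""
    results = []
    for prompt in prompts:
        for name, ask in providers.items():
            stamp = datetime.now(timezone.utc).isoformat()
            results.append(Response(name, prompt, ask(prompt), stamp))
    return results


# Stub provider for illustration only; swap in a real client call here.
def fake_provider(prompt: str) -> str:
    return f"Stub answer to: {prompt}"


responses = run_prompts(
    ["Which banks suit a mid-sized exporter?"],  # hypothetical prompt
    {"stub": fake_provider},
)
```

Keeping the raw text and a timestamp per response is a design choice, not a requirement: it lets you re-parse old answers later without paying for the queries again.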

// 02

The five core GEO metrics

After tracking LLM visibility for B2B financial services clients across multiple providers in 2025, five metrics have proven consistently meaningful. They measure different dimensions of the same underlying phenomenon: how prominently and accurately your brand appears in AI-generated responses to buyer-relevant queries.

KPI_01

Share of Voice (SOV)

The percentage of relevant prompts in which your brand is mentioned at least once.

Brand_Mentions / Total_Prompts × 100

Benchmark (B2B Financial DE)

<20% weak · 20–50% mid · >50% strong
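The SOV formula and its benchmark bands can be sketched as follows. This assumes a naive, case-insensitive substring match counts as a mention, which will miss abbreviations and misspellings; the brand name and sample responses are invented for illustration.

```python
from typing import List


def share_of_voice(responses: List[str], brand: str) -> float:
    """Percentage of responses that mention the brand at least once."""
    if not responses:
        return 0.0
    needle = brand.lower()
    mentioned = sum(1 for text in responses if needle in text.lower())
    return mentioned / len(responses) * 100


def sov_band(sov: float) -> str:
    """Map an SOV percentage onto the benchmark bands quoted above."""
    if sov < 20:
        return "weak"
    if sov <= 50:
        return "mid"
    return "strong"


# Hypothetical responses; "Acme Bank" is a made-up brand.
sample = ["Acme Bank and two others", "Try Beta Bank", "Gamma", "Delta"]
sov = share_of_voice(sample, "Acme Bank")
band = sov_band(sov)
```

With one mention in four responses, `sov` comes out at 25.0 and lands in the "mid" band.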

KPI_02

Discovery Rate

Of all prompts where your brand could plausibly be mentioned, the percentage whose responses actually contain a brand mention.

Prompts_with_Brand / Eligible_Prompts × 100

Interpretation

High discovery + low list presence = mentioned but not recommended. The gap between the two reveals the quality problem.

KPI_03

List Presence

Frequency of brand appearance in recommendation lists or ranked shortlists — weighted higher than casual mentions.

Brand_in_List / List_Prompts × 100

Why it matters

Buyers who ask "which providers should I consider" receive list answers. Appearing in that list is categorically more valuable than being mentioned in a definition.
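A small sketch of the gap between Discovery Rate and List Presence, assuming parsed responses are available as records with the hypothetical fields `mentioned`, `is_list_prompt`, and `in_list`. For simplicity it treats every sample prompt as one where the brand could be mentioned; the data is invented.

```python
def rate(hits: int, total: int) -> float:
    """Percentage helper that tolerates an empty denominator."""
    return hits / total * 100 if total else 0.0


# Hypothetical parsed signals, one record per response.
signals = [
    {"mentioned": True,  "is_list_prompt": True,  "in_list": True},
    {"mentioned": True,  "is_list_prompt": True,  "in_list": False},
    {"mentioned": True,  "is_list_prompt": False, "in_list": False},
    {"mentioned": False, "is_list_prompt": True,  "in_list": False},
]

discovery = rate(sum(s["mentioned"] for s in signals), len(signals))

# List Presence only counts prompts that ask for a list or shortlist.
list_prompts = [s for s in signals if s["is_list_prompt"]]
list_presence = rate(sum(s["in_list"] for s in list_prompts), len(list_prompts))

# A large positive gap suggests "mentioned but not recommended".
gap = discovery - list_presence
```

Here the brand is discovered in three of four responses but appears in only one of three list answers, so the gap is clearly positive.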

KPI_04

Win Probability Score

Composite score combining position, sentiment framing, and context quality of brand mentions.

(Position × 0.4)
+ (Sentiment × 0.3)
+ (Context × 0.3)

Score range: 0.0 – 1.0

A WPS of 0.7+ indicates the LLM is positioning your brand as a primary recommendation, not merely an option.
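The composite can be sketched directly from the weights above. The sketch assumes all three inputs have already been normalised to the 0–1 range (how sentiment and context quality are scored is not specified here), and the input values are illustrative rather than taken from any audit.

```python
def win_probability_score(position: float, sentiment: float, context: float) -> float:
    """Weighted composite: 0.4 position + 0.3 sentiment + 0.3 context."""
    for name, value in (("position", position),
                        ("sentiment", sentiment),
                        ("context", context)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    return 0.4 * position + 0.3 * sentiment + 0.3 * context


# Illustrative inputs only.
wps = win_probability_score(position=0.8, sentiment=0.7, context=0.6)
is_primary_recommendation = wps >= 0.7
```

These inputs give a score of roughly 0.71, just above the 0.7 threshold. Because the weights sum to 1.0, the result stays within the same 0–1 range as the inputs.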

KPI_05

Position Score

Where in the response the brand first appears. First mention carries disproportionate weight in buyer interpretation.

1 − (First_Mention_Position / Response_Length)

LLM behavior note

Models tend to place what they interpret as the "best fit" first. First position is rarely coincidental: it reflects the model's implicit ranking.
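A sketch of the Position Score, assuming position and response length are both measured in characters (tokens or sentences would work equally well, as long as both use the same unit). Returning `None` when the brand is absent is a choice made here, so that "not mentioned" is not confused with "mentioned last".

```python
from typing import Optional


def position_score(response: str, brand: str) -> Optional[float]:
    """1 - (first mention offset / response length); None if brand is absent."""
    if not response:
        return None
    idx = response.lower().find(brand.lower())
    if idx == -1:
        return None
    return 1 - idx / len(response)


# Hypothetical responses; "Acme" is a made-up brand.
early = position_score("Acme Bank is a common choice.", "Acme")
middle = position_score("x" * 50 + "Acme" + "y" * 46, "acme")
absent = position_score("No relevant providers named.", "Acme")
```

A brand that opens the response scores 1.0, and one that first appears exactly halfway through scores 0.5.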

// 03

From metrics to audit: what a GEO assessment looks like

The metrics above require a structured prompt library to produce meaningful results. An ad-hoc query to ChatGPT tells you almost nothing. A systematic audit across 440 structured prompts, executed across four LLM providers in two languages, tells you quite a lot.

The prompt library must cover all seven buyer journey phases — not just the obvious provider-selection queries. The discovery that your brand is well-cited in awareness-phase prompts ("what is engineering consulting") but almost absent from decision-phase prompts ("which engineering firm is right for a manufacturing company with €2M project budget") is among the most actionable findings a GEO audit produces.

// Anatomy of a GEO Audit — Four Phases

Prompt Architecture

Build structured prompt library covering all 7 information situations, 4 LLM providers, 2+ languages. Minimum 80 prompts; 440+ for comprehensive coverage.

Output: prompt_library.json

Multi-Provider Execution

Run all prompts across OpenAI, Google, Anthropic, Perplexity simultaneously. Capture raw responses with timestamps and provider metadata.

Output: raw_responses.json

Response Parsing

Extract brand mentions, position, sentiment framing, context quality, competitor co-mentions, and hallucination flags from each response.

Output: parsed_signals.csv

Metric Aggregation

Calculate the 5 core KPIs per phase, per provider, per language. Identify gaps, risks, and high-priority content opportunities.

Output: geo_report.xlsx

// 04

What early data reveals

Running GEO audits across B2B financial services clients over the past six months has produced findings that are consistent enough to treat as early patterns rather than isolated observations.

Pattern 1: Awareness strength, decision weakness

Every audited company shows significantly higher brand visibility in awareness-phase prompts than in decision-phase prompts. This is structurally expected — awareness content is definitional, and definitional content is what LLMs absorb most readily. The concerning finding is the magnitude of the gap.

A company with 150 well-structured glossary pages may achieve 60%+ SOV on awareness prompts while sitting below 10% on decision-phase provider queries. This is not a content volume problem. It is a content function problem. The decision-phase content simply does not exist in a form that LLMs can accurately synthesize and cite.

Pattern 2: The hallucination risk concentration

LLM hallucinations are not uniformly distributed across query types. They concentrate in two areas: specific company claims (ownership, market position, product availability) and comparative statements ("X is better than Y because..."). The first type is dangerous to your brand. The second is structurally unavoidable, but the risk varies based on how clearly your positioning is communicated in source content.

// Old measurement framework

Organic sessions: Misses all pre-session LLM research; declining metric that shows the problem, not the cause

Keyword rankings: LLMs don't expose query signals; ranking for "what is engineering consulting" tells you nothing about LLM citation rate

Page impressions: Counts after-the-fact website visits; invisible to the synthesis sessions that precede them

Content engagement: Measures what your existing visitors do; says nothing about the research journey of buyers who never arrive

// GEO measurement framework

Share of Voice by phase: Reveals where in the buyer journey you're visible vs. invisible; actionable by content type

Discovery Rate by provider: Shows which LLMs are your brand advocates and which are systematically overlooking you

Win Probability Score: Distinguishes between being mentioned and being recommended, the quality dimension of LLM visibility

Hallucination risk flags: Identifies specific factual claims about your company that LLMs are getting wrong; proactively addressable

Pattern 3: Provider divergence

The four major LLM providers do not produce consistent brand visibility results. A company that appears in 45% of prompts on OpenAI may appear in 22% on Google Gemini and 61% on Perplexity. The divergence is not random — it reflects different training data compositions, different recency weighting, and different approaches to synthesizing competitive comparisons.

Provider divergence is diagnostic. When you are well-cited on Perplexity (which uses live web search) but poorly cited on ChatGPT (which leans more heavily on training data), the implication is that your content is crawlable and current but was underrepresented in the training corpus. The remediation is different from what it would be if the pattern were reversed.
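One way to turn this reading into a repeatable check, sketched under stated assumptions: the 10-point margin is arbitrary, and the interpretation of the reversed pattern is an inference rather than something established above. The SOV figures are the illustrative ones from the provider example.

```python
from typing import Dict


def provider_spread(sov_by_provider: Dict[str, float]) -> float:
    """Gap in percentage points between the best and worst provider."""
    values = sov_by_provider.values()
    return max(values) - min(values)


def diagnose(live_search_sov: float, training_based_sov: float,
             margin: float = 10.0) -> str:
    """Rough reading of live-search vs. training-data visibility."""
    diff = live_search_sov - training_based_sov
    if diff > margin:
        return "crawlable and current, but underrepresented in training data"
    if diff < -margin:
        # Assumed reading of the reversed pattern.
        return "present in training data, but weak in live retrieval"
    return "no clear divergence"


sov = {"openai": 45.0, "gemini": 22.0, "perplexity": 61.0}
spread = provider_spread(sov)
reading = diagnose(sov["perplexity"], sov["openai"])
```

With these figures the spread is 39 points, and the Perplexity-over-OpenAI gap of 16 points exceeds the margin, so the first reading applies.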

// Actionable insight

A GEO audit produces a prioritized content gap report — not by topic, but by buyer situation and LLM provider. The finding "your switching_housebank content produces 0 citations across all providers" is specific, verifiable, and directly addressable with a single well-structured article. This is a fundamentally different quality of insight than "your organic traffic declined 12% this quarter."

The attribution model it suggests

Early data from clients running parallel GEO tracking and traditional analytics produces a consistent pattern: non-brand organic clicks decline, brand clicks increase, and brand impressions grow substantially. The most plausible interpretation is that content is being absorbed by LLMs and driving brand awareness, but that the awareness is materializing as branded search rather than organic click-through.

This reframes the attribution model entirely. The organic content that produced no direct traffic in 2025 may have been responsible for the branded search spike in 2025. Traditional last-click attribution misses this completely. GEO metrics, cross-referenced against branded search trends, begin to reconstruct the invisible journey.

In the next episode, we'll turn from measurement to architecture — what content structure, semantic precision, and internal linking strategy give your content the highest probability of becoming the source that LLMs absorb and reproduce accurately.