The measurement problem, stated precisely
Before we discuss what to measure, let's be precise about what the problem actually is. It is not that AI search is unmeasurable. It is that the measurement framework developed for Google — impressions, clicks, rankings, CTR — is categorically inapplicable to how LLMs work.
Search engines produce result pages. Those pages are measurable. LLMs produce synthesized responses. The synthesis process is opaque. No impression is registered. No referral is logged. The buyer who spent ninety seconds reading a Perplexity summary of your category left no trace in any system you own.
The implication is uncomfortable but necessary: you cannot measure LLM-mediated research the way you measure search traffic. You can only measure the outputs of that research — through systematic prompting, response analysis, and citation detection. GEO measurement is active, not passive. You must query the LLMs yourself, as your buyers do, and analyze what comes back.
GEO measurement inverts the traditional model. Instead of waiting for signals from buyers, you simulate the buyer — systematically, at scale, across every relevant question in the journey.
The five core GEO metrics
Tracking LLM visibility for B2B financial services clients across multiple providers in 2025 has shown five metrics to be consistently meaningful. They measure different dimensions of the same underlying phenomenon: how prominently and accurately your brand appears in AI-generated responses to buyer-relevant queries.
KPI_01
Share of Voice (SOV)
The percentage of relevant prompts in which your brand is mentioned at least once.
Brand_Mentions / Total_Prompts × 100
Benchmark (B2B Financial DE)
<20% weak | 20–50% mid | >50% strong
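As a minimal sketch, the SOV formula and the benchmark bands could be computed along these lines. It assumes responses are plain strings and that a case-insensitive substring match counts as a mention; the brand names and sample responses are invented for illustration, and a real audit would likely need alias and entity handling on top of this.

```python
def share_of_voice(responses, brand):
    """Percentage of responses that mention `brand` at least once."""
    if not responses:
        return 0.0
    hits = sum(1 for text in responses if brand.lower() in text.lower())
    return 100.0 * hits / len(responses)


def sov_band(sov):
    """Map an SOV percentage onto the benchmark bands: weak, mid, strong."""
    if sov < 20:
        return "weak"
    if sov <= 50:
        return "mid"
    return "strong"


# Hypothetical responses to four buyer-relevant prompts.
responses = [
    "For mid-sized firms, Acme Bank and Beta Finance are common choices.",
    "Factoring converts receivables into immediate liquidity.",
    "Beta Finance is often cited for export financing.",
    "Options include Acme Bank, Gamma Credit, and Beta Finance.",
]

sov = share_of_voice(responses, "Acme Bank")
print(sov, sov_band(sov))
```

Here the hypothetical brand appears in two of four responses, so it lands at the upper edge of the mid band.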
KPI_02
Discovery Rate
Of all prompts where your brand could be mentioned, what percentage actually contain a brand mention.
Prompts_with_Brand / Total_Prompts × 100
Interpretation
High discovery + low list presence = mentioned but not recommended. The gap between the two reveals the quality problem.
KPI_03
List Presence
Frequency of brand appearance in recommendation lists or ranked shortlists — weighted higher than casual mentions.
Brand_in_List / List_Prompts × 100
Why it matters
Buyers who ask "which providers should I consider" receive list answers. Appearing in that list is categorically more valuable than being mentioned in a definition.
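The gap between Discovery Rate and List Presence can be sketched as follows. This is an illustration under stated assumptions: a "list prompt" is approximated as any response containing bullet or numbered lines, mentions are detected by substring match, and all brand names and responses are made up.

```python
import re

# Matches bullet ("-", "*", "•") or numbered ("1.", "2)") list lines.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*)$")


def list_items(text):
    """Return the text of every list item found in a response."""
    items = []
    for line in text.splitlines():
        match = LIST_ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items


def mentions(text, brand):
    return brand.lower() in text.lower()


def discovery_rate(responses, brand):
    """Prompts_with_Brand / Total_Prompts x 100."""
    if not responses:
        return 0.0
    return 100.0 * sum(mentions(r, brand) for r in responses) / len(responses)


def list_presence(responses, brand):
    """Brand_in_List / List_Prompts x 100, counting only list-style responses."""
    list_responses = [r for r in responses if list_items(r)]
    if not list_responses:
        return 0.0
    hits = sum(
        any(mentions(item, brand) for item in list_items(r))
        for r in list_responses
    )
    return 100.0 * hits / len(list_responses)


responses = [
    "Providers to consider:\n1. Beta Finance\n2. Acme Bank\n3. Gamma Credit",
    "Shortlist:\n- Beta Finance\n- Gamma Credit",
    "Acme Bank is one example of a factoring provider.",
    "Beta Finance and Acme Bank both offer leasing.",
]

discovery = discovery_rate(responses, "Acme Bank")
presence = list_presence(responses, "Acme Bank")
print(discovery, presence, discovery - presence)
```

A large positive difference between the two numbers is the "mentioned but not recommended" signal described above.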
KPI_04
Win Probability Score
Composite score combining position, sentiment framing, and context quality of brand mentions.
WPS = (Position × 0.4) + (Sentiment × 0.3) + (Context × 0.3)
Score range: 0.0 – 1.0
A WPS of 0.7+ indicates the LLM is positioning your brand as a primary recommendation, not merely an option.
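The weighted sum can be written as a small function. The weights and the 0.7 threshold come from the text; how Sentiment and Context are scored is not specified there, so this sketch simply assumes all three inputs have already been normalized to the range 0 to 1.

```python
# Weights as given in the formula above.
WEIGHTS = {"position": 0.4, "sentiment": 0.3, "context": 0.3}


def win_probability_score(position, sentiment, context):
    """Weighted composite of three component scores, each assumed to be in [0, 1]."""
    components = {"position": position, "sentiment": sentiment, "context": context}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in components.items())


# Hypothetical component scores for one brand mention.
wps = win_probability_score(position=0.9, sentiment=0.6, context=0.5)
is_primary_recommendation = wps >= 0.7
print(round(wps, 2), is_primary_recommendation)
```

Because the weights sum to 1.0, the composite stays within the stated 0.0 to 1.0 range whenever the inputs do.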
KPI_05
Position Score
Where in the response the brand first appears. First mention carries disproportionate weight in buyer interpretation.
1 − (First_Mention_Position / Response_Length)
LLM behavior note
Models consistently position what they interpret as "best fit" first. First position is not coincidental — it reflects the model's implicit ranking.
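A minimal sketch of the Position Score follows. The text does not say whether position and length are measured in characters, words, or tokens; this version assumes character offsets, and the sample sentences are invented.

```python
def position_score(response, brand):
    """1 - (first mention offset / response length); None if the brand is absent."""
    index = response.lower().find(brand.lower())
    if index < 0 or not response:
        return None
    return 1.0 - index / len(response)


early = "Acme Bank is a common first choice for export financing."
late = "Common choices for export financing include Beta Finance and Acme Bank."

print(position_score(early, "Acme Bank"))
print(round(position_score(late, "Acme Bank"), 2))
print(position_score(late, "Gamma Credit"))
```

A brand that opens the response scores 1.0, and the score falls toward 0 the later the first mention appears.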
From metrics to audit: what a GEO assessment looks like
The metrics above require a structured prompt library to produce meaningful results. An ad-hoc query to ChatGPT tells you almost nothing. A systematic audit across 440 structured prompts, executed across four LLM providers in two languages, tells you quite a lot.
The prompt library must cover all seven buyer journey phases — not just the obvious provider-selection queries. The discovery that your brand is well-cited in awareness-phase prompts ("what is engineering consulting") but almost absent from decision-phase prompts ("which engineering firm is right for a manufacturing company with €2M project budget") is among the most actionable findings a GEO audit produces.
// Anatomy of a GEO Audit — Four Phases
Phase 1: Prompt Architecture
Build structured prompt library covering all 7 information situations, 4 LLM providers, 2+ languages. Minimum 80 prompts; 440+ for comprehensive coverage.
Output: prompt_library.json
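One possible shape for the prompt library is a flat list of records, each tagged with its phase and language. Everything in this sketch is an assumption: the field names, the ID scheme, the templates, and the choice of English and German. Only two of the seven phases are named in the text, so only those two appear here.

```python
import json
from itertools import product

# Only "awareness" and "decision" are named in the text; the remaining phases
# would be added under whatever names the audit uses.
TEMPLATES = {
    "awareness": {
        "en": "What is {topic}?",
        "de": "Was ist {topic}?",
    },
    "decision": {
        "en": "Which {topic} provider is right for a mid-sized manufacturer?",
        "de": "Welcher {topic}-Anbieter passt zu einem mittelständischen Hersteller?",
    },
}


def build_library(topics, templates):
    """Expand phase x topic x language into a flat list of prompt records."""
    library = []
    for (phase, by_language), topic in product(templates.items(), topics):
        for language, template in by_language.items():
            library.append({
                "prompt_id": f"{phase}-{language}-{len(library) + 1:03d}",
                "phase": phase,
                "language": language,
                "text": template.format(topic=topic),
            })
    return library


library = build_library(["factoring", "leasing"], TEMPLATES)
print(len(library))
print(json.dumps(library[:2], indent=2, ensure_ascii=False))
```

Providers are deliberately left out of the records, since the same prompt is sent to every provider at execution time.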
Phase 2: Multi-Provider Execution
Run all prompts across OpenAI, Google, Anthropic, Perplexity simultaneously. Capture raw responses with timestamps and provider metadata.
Output: raw_responses.json
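The execution step might look like the sketch below. It uses stub functions in place of real provider clients, since each vendor's SDK has its own call signature; the `ask` callables, record fields, and worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone


def run_one(provider, ask, prompt):
    """Send one prompt to one provider and keep the raw response with metadata."""
    return {
        "provider": provider,
        "prompt_id": prompt["prompt_id"],
        "response": ask(prompt["text"]),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def execute(prompts, providers, max_workers=8):
    """Run every prompt against every provider concurrently, preserving job order."""
    jobs = [(name, ask, prompt) for prompt in prompts for name, ask in providers.items()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: run_one(*job), jobs))


# Stub clients standing in for real provider SDK calls.
providers = {
    "openai": lambda text: f"[stub openai answer to: {text}]",
    "google": lambda text: f"[stub google answer to: {text}]",
    "anthropic": lambda text: f"[stub anthropic answer to: {text}]",
    "perplexity": lambda text: f"[stub perplexity answer to: {text}]",
}

prompts = [{"prompt_id": "awareness-en-001", "text": "What is factoring?"}]
raw_responses = execute(prompts, providers)
for record in raw_responses:
    print(record["provider"], record["timestamp"])
```

Threads suit this step because the work is dominated by waiting on network calls; a production run would also need retries and rate limiting, which are omitted here.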
Phase 3: Response Parsing
Extract brand mentions, position, sentiment framing, context quality, competitor co-mentions, and hallucination flags from each response.
Output: parsed_signals.csv
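A reduced version of the parsing step is sketched below. It covers only brand mention, position, and competitor co-mentions, all via substring matching; sentiment framing, context quality, and hallucination flags need more than string matching and are left out. Column names and sample data are assumptions.

```python
import csv
import io


def parse_response(record, brand, competitors):
    """Extract simple signals from one raw response record."""
    text = record["response"]
    lowered = text.lower()
    first = lowered.find(brand.lower())
    mentioned = first >= 0 and bool(text)
    return {
        "prompt_id": record["prompt_id"],
        "provider": record["provider"],
        "brand_mentioned": mentioned,
        # Character-offset position score; empty when the brand is absent.
        "position_score": round(1 - first / len(text), 3) if mentioned else "",
        "competitors": ";".join(c for c in competitors if c.lower() in lowered),
    }


def to_csv(rows):
    """Serialize parsed rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


raw = [
    {"prompt_id": "decision-en-001", "provider": "openai",
     "response": "Beta Finance and Acme Bank are common choices."},
    {"prompt_id": "decision-en-001", "provider": "google",
     "response": "Gamma Credit is frequently recommended."},
]
rows = [parse_response(r, "Acme Bank", ["Beta Finance", "Gamma Credit"]) for r in raw]
print(to_csv(rows))
```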
Phase 4: Metric Aggregation
Calculate the 5 core KPIs per phase, per provider, per language. Identify gaps, risks, and high-priority content opportunities.
Output: geo_report.xlsx
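Aggregation per phase, provider, or language reduces to a grouped percentage. The sketch below computes Share of Voice for any combination of grouping keys; the row schema and the sample rows are hypothetical and chosen only to show an awareness versus decision gap.

```python
from collections import defaultdict


def sov_by(rows, *keys):
    """Share of Voice (percent) grouped by the given row keys."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        group = tuple(row[key] for key in keys)
        totals[group] += 1
        hits[group] += bool(row["brand_mentioned"])
    return {group: 100.0 * hits[group] / totals[group] for group in totals}


# Hypothetical parsed signals, one row per prompt and provider.
rows = [
    {"phase": "awareness", "provider": "openai", "brand_mentioned": True},
    {"phase": "awareness", "provider": "openai", "brand_mentioned": True},
    {"phase": "decision", "provider": "openai", "brand_mentioned": False},
    {"phase": "decision", "provider": "openai", "brand_mentioned": False},
    {"phase": "decision", "provider": "perplexity", "brand_mentioned": True},
    {"phase": "decision", "provider": "perplexity", "brand_mentioned": False},
]

by_phase = sov_by(rows, "phase")
by_phase_provider = sov_by(rows, "phase", "provider")

# Flag groups below the 20% "weak" benchmark as content gaps.
gaps = [group for group, sov in by_phase_provider.items() if sov < 20]
print(by_phase)
print(gaps)
```

The same function handles a language breakdown by adding a `language` field to each row and passing it as an additional key.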
What early data reveals
Running GEO audits across B2B financial services clients over the past six months has produced findings that are consistent enough to treat as early patterns rather than isolated observations.
Pattern 1: Awareness strength, decision weakness
Every audited company shows significantly higher brand visibility in awareness-phase prompts than in decision-phase prompts. This is structurally expected — awareness content is definitional, and definitional content is what LLMs absorb most readily. The concerning finding is the magnitude of the gap.
A company with 150 well-structured glossary pages may achieve 60%+ SOV on awareness prompts while sitting below 10% on decision-phase provider queries. This is not a content volume problem. It is a content function problem. The decision-phase content simply does not exist in a form that LLMs can accurately synthesize and cite.
Pattern 2: The hallucination risk concentration
LLM hallucinations are not uniformly distributed across query types. They concentrate in two areas: specific company claims (ownership, market position, product availability) and comparative statements ("X is better than Y because..."). The first type is dangerous to your brand. The second is structurally unavoidable, but the risk varies based on how clearly your positioning is communicated in source content.
// Old measurement framework
Organic sessions
Misses all pre-session LLM research; declining metric that shows the problem, not the cause
Keyword rankings
LLMs don't expose query signals; ranking for "what is engineering consulting" tells you nothing about LLM citation rate
Page impressions
Counts after-the-fact website visits; invisible to the synthesis sessions that precede them
Content engagement
Measures what your existing visitors do; says nothing about the research journey of buyers who never arrive
// GEO measurement framework
Share of Voice by phase
Reveals where in the buyer journey you're visible vs. invisible — actionable by content type
Discovery Rate by provider
Shows which LLMs are your brand advocates and which are systematically overlooking you
Win Probability Score
Distinguishes between being mentioned and being recommended — the quality dimension of LLM visibility
Hallucination risk flags
Identifies specific factual claims about your company that LLMs are getting wrong — proactively addressable
Pattern 3: Provider divergence
The four major LLM providers do not produce consistent brand visibility results. A company that appears in 45% of prompts on OpenAI may appear in 22% on Google Gemini and 61% on Perplexity. The divergence is not random — it reflects different training data compositions, different recency weighting, and different approaches to synthesizing competitive comparisons.
Provider divergence is diagnostic. When you are well-cited on Perplexity (which uses live web search) but poorly cited on ChatGPT (which relies on training data), the implication is that your content is crawlable and current, but was underrepresented in the training corpus. The remediation is different than if the pattern were reversed.
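The diagnostic reading of provider divergence can be sketched as a simple comparison. The SOV figures are the example values from the text; the 10-point threshold, the function names, and the wording of the diagnoses are assumptions, and the reversed pattern is only flagged, since its remediation is not described here.

```python
def spread(sov_by_provider):
    """Best and worst provider plus the gap between them in percentage points."""
    best = max(sov_by_provider, key=sov_by_provider.get)
    worst = min(sov_by_provider, key=sov_by_provider.get)
    return {
        "best": best,
        "worst": worst,
        "spread": sov_by_provider[best] - sov_by_provider[worst],
    }


def live_vs_training(sov_by_provider, live="perplexity", trained="openai", min_gap=10.0):
    """Compare a live-search provider with a training-data provider."""
    gap = sov_by_provider[live] - sov_by_provider[trained]
    if gap >= min_gap:
        return "crawlable and current, but likely underrepresented in training data"
    if gap <= -min_gap:
        return "reversed pattern: stronger in training data than in live search"
    return "no clear divergence"


# Example values from the text above.
sov_by_provider = {"openai": 45.0, "google": 22.0, "perplexity": 61.0}

print(spread(sov_by_provider))
print(live_vs_training(sov_by_provider))
```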
// Actionable insight
A GEO audit produces a prioritized content gap report — not by topic, but by buyer situation and LLM provider. The finding "your switching_housebank content produces 0 citations across all providers" is specific, verifiable, and directly addressable with a single well-structured article. This is a fundamentally different quality of insight than "your organic traffic declined 12% this quarter."
The attribution model it suggests
Early data from clients running GEO tracking in parallel with traditional analytics shows a consistent pattern: non-brand organic clicks decline, brand clicks increase, and brand impressions grow substantially. The most plausible interpretation is that content is being absorbed by LLMs and is driving brand awareness, but that awareness materializes as branded search rather than as non-brand organic click-through.
This reframes the attribution model entirely. The organic content that produced no direct traffic in 2025 may have been responsible for the branded search spike in 2025. Traditional last-click attribution misses this completely. GEO metrics, cross-referenced against branded search trends, begin to reconstruct the invisible journey.
In the next episode, we'll turn from measurement to architecture — what content structure, semantic precision, and internal linking strategy give your content the highest probability of becoming the source that LLMs absorb and reproduce accurately.