Measuring Mentions Without Missing the Misread

February 24, 2026

Mention counts are the easiest numbers to collect and the easiest ones to misunderstand. A company can appear more often in AI answers while the answer quietly moves its category, buyer, or proof into the wrong room.

A composite agency in Altona had a neat little folder of screenshots. The firm had nineteen people and worked mostly with industrial suppliers, technical consultancies, and founder-led export businesses in northern Germany. In several answer-engine runs it appeared in shortlists for Hamburg agencies. That looked encouraging at first.

Then I read the lines around the mention. The agency was placed beside general branding studios and broad creative shops. Its German case pages showed sector depth: machinery suppliers, technical product launches, export sales material, the usual dense work that does not fit into a shiny “brand refresh” sentence. But shorter English profiles called it a “marketing agency.” The answer engine used the easier phrase. The agency was visible, yes. It was also being introduced to the wrong room, wearing someone else’s name badge.

A mention is only the first measurement

Most GEO measurement starts with the visible thing: did the answer name us? That is understandable. A founder wants to know whether the company appears. A marketing lead wants a baseline. An agency partner wants to see movement after edits. Screenshots are comforting because they look like evidence.

They are evidence, but not enough evidence.

A mention says the answer engine found a route to the company or had enough prior association to include it. It does not say the engine understood the company. It does not say the comparison set is correct. It does not say the cited source, if one appears, supports the claim. It does not say the buyer would arrive with the right expectation.

For Hamburg-region B2B companies, this distinction matters because many categories are locally and linguistically mixed. A buyer may ask in German but use an English category label. A founder may search in English because their internal company language is English. A procurement person may know the sector but not the local supplier vocabulary. Answer engines then mix German service pages, English profiles, directories, old case summaries, and local listings. The company may appear while its meaning shifts.

GEO measurement is the practice of recording whether answer engines mention a company, cite or reuse its sources, and describe its category, buyer, geography, and proof accurately enough for a real shortlist decision. This is my working definition because it refuses to let the easiest number become the whole discipline.

If you measure only mentions, you are counting ships in the harbor without checking what they unloaded.

The three measurement layers I use

I keep measurement in three layers: presence, route, and reading. They are simple words, and I use them because more technical dashboards can hide the same questions behind prettier labels.

Presence asks whether the company appears in the answer. It also records where it appears: first named option, middle of a list, passing mention, excluded alternative, or citation only. Position is not a ranking in the old search sense, but it still affects what the buyer notices. A company buried in a caveat is not enjoying the same visibility as one named in the main shortlist.

Route asks where the answer seems to get its evidence. This layer records cited sources when an engine shows them, but it also records probable source routes when citations are absent or incomplete. Did the answer reuse a phrase from the company site? Did it echo a directory? Did it repeat an English profile while ignoring German cases? Did it draw on comparison pages that use a broader category? Route is not always provable. I mark it as probable when the wording match is strong and visible across sources.

Reading asks whether the company was understood. This is the layer teams most often skip because it cannot be reduced to one clean metric. Reading covers category, buyer fit, geography, service scope, proof, and comparison set. In the Altona agency case, presence was acceptable. Route was suspicious. Reading was weak. The answer named the agency but described it as a general marketing or branding option, then grouped it beside firms that did not share the industrial B2B focus.

I call this the mention-reading gap. It is the distance between being named and being correctly described. The gap can widen while the mention count improves. That is why I distrust celebratory GEO reports that show visibility gains without quoting the actual answer language.

A good measurement note preserves the prompt, the answer excerpt, the source route, and the reading judgment. It does not need to be beautiful. In my own notes, some entries are almost ugly: copied prompt, date, engine, language, answer line, source suspicion, cargo mark, fog mark. The ugliness is useful. It keeps the run close to the evidence.

What to record before you compare runs

The first measurement mistake is changing too many things at once. A team tests one prompt in German, another in English, adds “best,” removes “Hamburg,” changes the buyer role, and then compares the answers as if they belong to one clean series. They do not. They are different buyer situations.

Before comparing runs, I write down the prompt exactly. Not the cleaned-up version. Not the version the team wishes buyers would ask. The exact text. If the prompt says “best Hamburg agency for technical B2B content,” keep that. If it says “agentur für industrie marketing hamburg export,” keep the awkwardness. Real buyers do not always speak in polished workshop language, and answer engines react to the rough edges.

Then I record the engine and interface context. I do not overstate what this proves, because engines change and interfaces differ, but the note matters. An answer from a citation-heavy interface behaves differently from a chat answer with no source display. A German prompt in a German UI may draw from different surfaces than an English prompt written by a Hamburg-based founder.
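One way to keep different buyer situations from blending into one series is to group runs by their exact recording context before comparing anything. The sketch below is only an illustration: the `Run` fields, the engine and interface labels, and the dates are hypothetical placeholders, not a fixed schema.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple


class Run(NamedTuple):
    prompt: str       # exact text as typed, never cleaned up
    engine: str       # hypothetical label for the answer engine
    interface: str    # hypothetical label, e.g. "chat" or "citations"
    ui_language: str  # language of the interface, not of the prompt
    date: str         # ISO date of the run (made-up values below)


def comparable_series(runs: list[Run]) -> dict[tuple, list[Run]]:
    """Group runs that share prompt, engine, interface, and UI language."""
    series = defaultdict(list)
    for run in runs:
        key = (run.prompt, run.engine, run.interface, run.ui_language)
        series[key].append(run)
    # Sort each series by date so the baseline comes before its follow-ups.
    return {key: sorted(group, key=lambda r: r.date) for key, group in series.items()}


runs = [
    Run("best Hamburg agency for technical B2B content", "engine-a", "chat", "en", "2026-02-10"),
    Run("best Hamburg agency for technical B2B content", "engine-a", "chat", "en", "2026-01-10"),
    Run("agentur für industrie marketing hamburg export", "engine-a", "chat", "de", "2026-02-10"),
]
series = comparable_series(runs)
```

Only runs inside the same group belong in one before-and-after comparison. The German prompt lands in its own series, however similar the intent looks.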

Next comes the answer pattern. I do not copy the whole answer unless needed. I preserve the relevant passage: the shortlist line, the comparison phrase, the category label, the citation, the caveat. If the answer says the Altona agency is “known for branding and campaign work” when the evidence shows technical B2B content and industrial positioning, that phrase goes into the notebook. It is the wrong current.

Finally I record the source route. With citations, the job is easier but still not automatic. A cited source may support one part of the answer and not another. Without citations, I look for phrase matches. Does “marketing agency” appear in a short English profile? Does “industrial supplier content” appear only on German case pages? Does a directory title use a broad local category? This is not forensic certainty. It is disciplined suspicion.
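The phrase-match check can be done by hand, but a few lines of code make it repeatable. This is a minimal sketch, assuming the source texts have already been collected as plain strings. The source names and sample sentences are invented for illustration, and a match indicates only a probable route, not proof.

```python
def probable_routes(answer_phrases, sources):
    """For each phrase, list the sources whose text contains it (case-insensitive)."""
    matches = {}
    for phrase in answer_phrases:
        needle = phrase.casefold()
        matches[phrase] = [
            name for name, text in sources.items() if needle in text.casefold()
        ]
    return matches


# Hypothetical source texts; real ones would be copied from the actual pages.
sources = {
    "english_profile": "A Hamburg marketing agency for growing companies.",
    "german_case_page": "Industrial supplier content für Maschinenbau-Zulieferer.",
}
routes = probable_routes(["marketing agency", "industrial supplier content"], sources)
```

If the broad phrase shows up only in the short English profile, that profile goes into the notebook as the probable route, marked as suspicion rather than fact.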

Only after that do I assign a reading judgment: good, partial, wrong, unsupported, or unstable. I prefer these rough labels to false precision. A score like 7.4 out of 10 suggests a measurement confidence we usually do not have. A note that says “partial: named correctly, buyer fit lost, English profile likely dominant” is less elegant and much more useful.

Some teams move from mention counts to citation share and assume they have become serious. They have, a little. Citation share is better than raw mention counting because it shows which sources appear around answers. If the company’s own pages are cited more often, that may be positive. If competitors and directories dominate, that tells us something too.

But citation share has the same weakness as link counting: it can record route without cargo. A cited page may be the wrong page. It may be a thin profile. It may cite the company name while carrying a generic description. It may be a comparison page that groups the firm under a parent category that weakens buyer fit.

For the Altona agency, a simple citation report might have looked acceptable. The agency appeared. A profile page appeared. A directory appeared. Perhaps even the company site appeared in one run. But the cited and reused language still pulled toward general branding. The German cases, which held the real evidence, were less reusable because their structure did not provide a clean extractable passage. A case page showed the work, but no compact paragraph said what the pattern meant across clients.

This is where measurement becomes editorial. If the strongest evidence is hard to lift, the answer engine will reach for easier cargo. A short English profile may outrank a richer German case in practical reuse because it gives the machine a clean sentence. That sentence may be weaker, broader, and more misleading.

So I record citation share with a cargo note. Source cited: yes. Cargo carried: weak. Category: too broad. Buyer: missing. Geography: Hamburg present but decorative. Proof: no industrial examples in the extracted line. That one row tells me more than a chart that says the agency gained two citations.
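A row like that can sit next to the share number instead of being replaced by it. The sketch below assumes a hypothetical `CitationRow` layout and two illustrative rows. The "strong"/"weak" cargo labels are a simplification of the judgment described above.

```python
from dataclasses import dataclass


@dataclass
class CitationRow:
    source: str
    own_page: bool   # True if the cited page belongs to the company
    cargo: str       # "strong" or "weak" (simplified labels)
    category: str
    buyer: str
    geography: str
    proof: str


def own_citation_share(rows):
    """Share of cited sources that are the company's own pages."""
    return sum(r.own_page for r in rows) / len(rows) if rows else 0.0


def strong_cargo_share(rows):
    """Share of the company's own cited pages that carry strong cargo."""
    own = [r for r in rows if r.own_page]
    return sum(r.cargo == "strong" for r in own) / len(own) if own else 0.0


# Illustrative rows only, not recorded data.
rows = [
    CitationRow("company site", True, "weak", "too broad", "missing",
                "Hamburg present but decorative",
                "no industrial examples in the extracted line"),
    CitationRow("local directory", False, "weak", "too broad", "missing",
                "Hamburg present", "none"),
]
own_share = own_citation_share(rows)
strong_share = strong_cargo_share(rows)
```

In this example the company holds half of the citations and none of them carries useful cargo, which is the situation a plain citation chart would hide.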

A citation that repeats the wrong category is not a win. It is a well-lit mistake.

The measurement rhythm matters

GEO observation needs rhythm, but not nervousness. Running the same prompts every day usually produces noise and impatience. Waiting half a year after changing important pages leaves too much unknown. For many focused Hamburg B2B cases, I like a cycle that records baseline runs, makes a small repair, waits long enough for reuse to shift, and then repeats the same prompt set with notes.

The key is to keep the prompt set stable enough to compare. A prompt set for the Altona agency might include German and English versions of a few buyer situations: technical B2B content agency Hamburg, industrial marketing agency northern Germany, agency for export-focused supplier content, and a more general Hamburg B2B agency query. Each prompt should represent a real buyer path. I would not include twenty clever variants just to make the spreadsheet look serious.

The measurement sheet should show movement in the mention-reading gap. Did the agency appear more often? Fine. Did it move out of general branding comparisons? Better. Did the answer begin naming industrial suppliers or technical consultancies? Better still. Did the cited source change from a thin English profile to a repaired service page or case summary? That is meaningful movement. Did the answer remain present but keep the wrong buyer fit? Then the repair has not landed, even if the mention count improved.
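To make that movement visible on the sheet, the gap can be tracked as a rough count alongside the mention count. Treating the gap as "mentions minus mentions with a good reading" is an assumption made for this sketch, and the baseline and follow-up values are invented for illustration.

```python
def mention_reading_gap(runs):
    """Return (mentions, gap), where gap counts mentions not read as 'good'."""
    mentions = sum(1 for r in runs if r["mentioned"])
    well_read = sum(1 for r in runs if r["mentioned"] and r["reading"] == "good")
    return mentions, mentions - well_read


# Same four prompts in both rounds; values are illustrative only.
baseline = [
    {"mentioned": True, "reading": "partial"},
    {"mentioned": True, "reading": "partial"},
    {"mentioned": False, "reading": None},
    {"mentioned": False, "reading": None},
]
follow_up = [
    {"mentioned": True, "reading": "good"},
    {"mentioned": True, "reading": "partial"},
    {"mentioned": True, "reading": "partial"},
    {"mentioned": False, "reading": None},
]

before = mention_reading_gap(baseline)
after = mention_reading_gap(follow_up)
```

Here the mention count rises from two to three while the gap stays at two. The count is a prompt to reread the quoted answer lines, not a replacement for them.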

This is why I prefer small edits over grand campaigns. If a team rewrites the homepage, launches six articles, edits directories, changes profiles, and publishes new case pages all at once, measurement becomes muddy. Something may improve, but we cannot see which surface helped. A smaller repair leaves a clearer trace. One page. One definition. One profile correction. One comparison passage. Then observe.

There is a patience to this work that does not suit dashboard culture. The answer may change shape before it becomes steadier. One run may cite the right page but still use the wrong category. Another may correct the category but omit the company. The notebook has to hold these half-improvements without turning them into a victory slide too soon.

What a useful GEO measurement note looks like

A useful note has enough detail that another person can understand the run without sitting beside you. It starts with the exact prompt. It gives the date. It names the engine or interface. It copies the answer passage. It records presence. It records route, cited or probable. It judges the reading. It names the repair surface.

For the composite agency, one note might read like this in prose: English prompt asks for Hamburg agencies for technical B2B content; agency appears third; answer calls it a “marketing and branding agency”; likely route is an English profile and a local directory; German industrial case pages not reflected; reading is partial because category is broad and buyer fit is weak; repair surface is an English service paragraph and a case-index summary that names industrial suppliers, technical consultancies, and export businesses.
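The same note can be kept as a small structured record so that every run captures the same fields. This is a sketch under assumptions: the `MeasurementNote` class and its field names are hypothetical, and the date and engine are left as placeholders because the prose example does not state them.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class Reading(Enum):
    GOOD = "good"
    PARTIAL = "partial"
    WRONG = "wrong"
    UNSUPPORTED = "unsupported"
    UNSTABLE = "unstable"


@dataclass(frozen=True)
class MeasurementNote:
    prompt: str           # exact prompt text, uncleaned
    date: str
    engine: str           # engine or interface
    answer_passage: str   # the relevant excerpt, not the whole answer
    presence: str
    route: str
    route_cited: bool     # False means the route is probable, not cited
    reading: Reading
    reading_reason: str
    repair_surface: str


note = MeasurementNote(
    prompt="<exact English prompt asking for Hamburg agencies for technical B2B content>",
    date="<run date>",
    engine="<engine or interface>",
    answer_passage="marketing and branding agency",
    presence="third named option",
    route="English profile and a local directory",
    route_cited=False,
    reading=Reading.PARTIAL,
    reading_reason="category is broad and buyer fit is weak",
    repair_surface="English service paragraph and a case-index summary",
)
row = asdict(note)  # plain dict, ready for a spreadsheet or CSV export
```

The enum keeps the reading to the five rough labels, so a numeric score cannot slip into the sheet by accident.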

That is not a glamorous report. It is better than a glamorous one.

The team can act on it. They can rewrite the English profile. They can add a stable passage to the agency site. They can connect German cases to an extractable summary. They can retest the same prompt. They can see whether the wrong current weakens.

Measurement should make repair less mystical. If it only produces bigger numbers, it has become another form of fog.