Methodology

From search mutation to AI-prescriptive influence.

A rigorous, reproducible methodology to measure and grow your share of voice in AI-generated answers — the way the engines actually work.

How does SkuLift measure and improve AI visibility?

SkuLift runs a five-phase loop — measure, analyze, recommend, execute, re-measure — scoring a position-weighted share of voice across four engines with N=5 sampling, an A/B/C query classification and a four-level SOV pyramid.

AI visibility is only manageable if it is measured rigorously, and most attempts to measure it are not rigorous at all. A single prompt to a single engine on a single day is an anecdote, not a metric. The SkuLift methodology exists to turn that anecdote into a number you can trust, trend and act on.

Rigour starts with reproducibility. Because generative answers vary from one sampling to the next, any honest measurement has to sample repeatedly, define its question set explicitly, and compute the same way every time. The methodology below is built around that constraint: multi-sampling to tame variance, a fixed query taxonomy to keep comparisons fair, and a position-weighted scoring formula drawn from peer-reviewed generative-engine research rather than a convenient heuristic.

Everything downstream depends on the measurement being sound. The analysis that finds your gaps, the recommendations that close them, and the re-measurement that proves the lift are only as credible as the baseline they rest on. This page documents the full method end to end so that a technical reader can verify it, a sceptical reader can challenge it, and an AI engine can cite it — which, fittingly, is exactly the standard we hold our clients' content to.

Consider what the engines are doing under the hood. When a buyer asks a question, the model either answers from what it absorbed during training — its parametric knowledge — or it retrieves live sources and grounds its answer in them. These are two different surfaces, won by two different kinds of work, and a methodology that ignores the distinction will optimize blindly. SkuLift measures both, because a brand can be strong in a model's trained knowledge yet absent from its live retrieval, or the reverse, and only seeing the split tells you which fight you are in.

Just as important is the query taxonomy. Not all questions are equal: some name your category directly, some describe a problem your product solves without naming it, and some are adjacent explorations a buyer makes on the way to a decision. SkuLift classifies the query set so that share of voice is read in context — winning the product-adjacent questions a buyer actually starts with often matters more than winning the obvious branded query everyone already optimizes for. The classification is what stops the method from chasing the easy, low-value wins.
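A minimal sketch of reading share of voice in context, assuming the three query kinds described above map onto A, B and C labels in that order. The mapping, the `ScoredQuery` type, the example queries and the scores are all illustrative assumptions, not SkuLift's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class QueryClass(Enum):
    # Hypothetical mapping of the three query kinds onto A/B/C labels.
    A = "names the category directly"
    B = "describes the problem without naming the product"
    C = "adjacent exploration on the way to a decision"


@dataclass(frozen=True)
class ScoredQuery:
    text: str
    query_class: QueryClass
    share_of_voice: float  # 0.0 to 1.0, assumed to be computed elsewhere


def mean_sov_by_class(queries):
    """Average share of voice within each class, never across classes."""
    buckets = defaultdict(list)
    for q in queries:
        buckets[q.query_class].append(q.share_of_voice)
    return {cls: sum(vals) / len(vals) for cls, vals in buckets.items()}


# Made-up queries and scores, for illustration only.
queries = [
    ScoredQuery("best inventory software", QueryClass.A, 0.50),
    ScoredQuery("top inventory tools for retailers", QueryClass.A, 0.30),
    ScoredQuery("how do I stop overselling online", QueryClass.B, 0.10),
    ScoredQuery("how to plan seasonal stock", QueryClass.C, 0.00),
]
by_class = mean_sov_by_class(queries)
```

Keeping the classes separate makes a weak result on problem-led or adjacent questions visible even when the category-named queries look healthy.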

Throughout, the human stays in the loop. The agents measure, analyze and draft at machine speed, but every recommendation that would change your public presence passes an explicit human gate before anything is produced or published. The methodology is automated where automation is safe and human where judgement is required, which is what makes it both fast enough to keep pace with the engines and accountable enough to trust with brand-critical decisions.

The optimization loop

Measure, analyze, recommend, execute, re-measure.

The methodology is a closed five-phase loop. Each phase has a defined input, a defined output and a defined success criterion, so the whole cycle is auditable rather than aspirational.

Measurement establishes the baseline across engines. Analysis decomposes every answer to find why a competitor was cited instead of you. Recommendation translates the gaps into a prioritized, answer-first backlog. Execution produces and publishes the approved work under human validation. Re-measurement re-probes the same engines to attribute the lift and feed the next cycle. The names of the five phases are deliberately the vocabulary that your team, our platform and this page all share, because a shared language is what keeps a method honest.

1. Measure: Probe each engine with the defined query set under N=5 sampling to establish a position-weighted share-of-voice baseline.
2. Analyze: Decompose every answer: cited sources, mention prominence, sentiment, and the competitor sources that displaced you.
3. Recommend: Translate gaps into a ranked, answer-first backlog of content and authority moves weighted by expected impact.
4. Execute: Produce and publish the approved work under human validation, from on-site answer blocks to off-site authority signals.
5. Re-measure: Re-probe the same engines on cadence to attribute the lift and feed the result into the next cycle.

The first phase, measurement, is more than running a few prompts. It means assembling a representative query set that reflects how real buyers actually ask, sampling each query enough times to expose a stable signal rather than a single noisy draw, and recording not just whether your brand appeared but where, how prominently, and alongside which competitors and sources. A weak measurement phase quietly corrupts every phase after it, so the methodology front-loads its rigour here.

Analysis is where measurement becomes insight. A raw share-of-voice number tells you that you are losing a query; analysis tells you why. By decomposing the winning answers, the methodology surfaces the specific sources the engine trusted, the structure of the passages it lifted, and the authority signals that backed them. That diagnosis is what makes the recommendation phase targeted rather than speculative — you are not guessing what might help, you are addressing the exact reasons a competitor was preferred.

Recommendation and execution are deliberately separated by the human gate, and that separation is part of the method rather than an afterthought. The recommendation phase produces a clear, reviewable proposal: what to change, why, and what it should move. The execution phase acts only on what a human approved. Keeping the two distinct means the loop can move at machine speed up to the point of judgement and then slow to human speed exactly where judgement is required, which is the only responsible way to operate a system that touches a brand's public presence.
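The separation between proposing and acting can be sketched as a filter on an approval flag. This is an illustrative sketch only: the `Recommendation` fields, the `execute_approved` helper and the backlog items are hypothetical names and data, not the platform's API.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    change: str             # what to change
    rationale: str          # why
    expected_effect: str    # what it should move
    approved: bool = False  # assumed to be set only by a human reviewer


def execute_approved(backlog, publish):
    """Run `publish` on approved items only; return the changes that went out."""
    done = []
    for rec in backlog:
        if not rec.approved:
            continue  # the gate: unapproved work is never produced or published
        publish(rec)
        done.append(rec.change)
    return done


# Hypothetical backlog: one approved item, one still awaiting review.
backlog = [
    Recommendation("Add answer-first FAQ block", "no liftable passage",
                   "citation rate", approved=True),
    Recommendation("Rewrite pricing page", "competitor cited instead",
                   "prominence"),
]
published = []
executed = execute_approved(backlog, published.append)
```

Defaulting `approved` to `False` means a recommendation that nobody has reviewed can never reach the publish step by accident.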

Re-measurement, the fifth phase, is where the method earns its credibility. Anyone can claim a change helped; proving it requires re-probing the same engines, with the same query set and the same formula, and showing the score moved. Because the baseline and the re-measurement are produced identically, the difference between them is attributable to the work rather than to a change in how it was counted. That discipline — measure, change one thing, measure again the same way — is the scientific core of the method, and it is why a SkuLift result reads as evidence rather than as a marketing assertion.
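As a rough sketch of that comparison, assuming each cycle yields one aggregated score per query: the function below refuses to compare two cycles whose query sets differ, since the difference would then reflect a change in what was counted. The function name and the scores are illustrative assumptions.

```python
def lift_per_query(baseline, remeasured):
    """Score difference per query; both cycles must cover the same queries."""
    if baseline.keys() != remeasured.keys():
        raise ValueError("query sets differ; the two cycles are not comparable")
    return {q: remeasured[q] - baseline[q] for q in baseline}


# Made-up scores: q1 was changed between cycles, q2 was left alone.
baseline = {"q1": 0.20, "q2": 0.50}
remeasured = {"q1": 0.45, "q2": 0.50}
lift = lift_per_query(baseline, remeasured)
```

In this toy data the untouched query does not move, which is the kind of control reading that makes the movement on the changed query easier to attribute.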

Taken as a whole, the five phases form a discipline rather than a tool: a repeatable way to know where you stand, why, what to do about it, and whether it worked, run continuously and under human control. Everything else on this page — the pyramid, the formula, the cadence, the rigour — is in service of making each phase trustworthy enough that the loop can be relied upon to compound a position rather than merely report on one.

What makes this a loop rather than a checklist is the feedback edge from re-measurement back to measurement. A static optimization assumes the world holds still; a loop assumes it moves, which is the realistic assumption for generative engines that are retrained, re-ranked and contested by competitors continuously. Closing that edge — proving each change and learning from it — is what compounds a first lift into a durable, defended position over successive cycles.

The SOV pyramid

A four-level share-of-voice pyramid.

Not every mention is worth the same. The pyramid orders four levels of AI visibility from mere presence to being the recommended default, so a score reflects quality of citation, not just quantity.

At the base is presence: your brand appears somewhere in an answer for a relevant question. One level up is citation: the engine not only mentions you but attributes a claim to your source, which is materially stronger because it signals the model trusts you enough to quote. Higher still is prominence: your brand is featured early and centrally rather than tacked on at the end. And at the apex is the default: for the strategic question, the engine recommends you first, as the obvious answer.

Ordering visibility this way matters because optimizing for the wrong level wastes effort. A brand can inflate raw mention counts while never climbing past the base of the pyramid, looking busy on a vanity dashboard while losing the decisions that count. The pyramid keeps the method honest by tying the score to altitude: progress means moving up levels for your strategic queries, not merely accumulating more low-value mentions. It is also how we communicate goals to clients — we name the level you are at and the level we are climbing toward, in plain terms.

The four levels also map cleanly onto strategy. Moving from absence to presence is a content problem: you must exist, in a quotable form, on the question. Moving from presence to citation is a trust problem: the model must believe your source enough to attribute a claim to it. Moving from citation to prominence is a structure-and-authority problem combined. And reaching the default — being recommended first — is the compounding result of doing the lower three well over time, reinforced every cycle the engine is sampled. Knowing which level a given query sits at tells you exactly which lever to pull next.

Crucially, the pyramid is computed per query and per engine, not as one blurred site-wide average. A brand might sit at the default level for one strategic question on Perplexity while languishing at mere presence for the same question on Gemini. Averaging those into a single number would hide both the win worth defending and the gap worth closing. The methodology keeps the resolution high precisely so that effort is spent where the altitude is lowest and the strategic value is highest.

Reading the pyramid over time is also how progress is communicated honestly. A monthly report that simply shows a rising mention count can flatter a team that is busy without being effective. A pyramid view shows movement between levels — how many strategic queries climbed from presence to citation, or from citation to prominence — which is a far truer picture of whether the work is changing the decisions that matter. It is the difference between reporting activity and reporting outcomes, and the methodology insists on the latter.

Default choice: For the strategic question, the engine recommends you first.
Prominence: You are featured early and centrally, not tacked on at the end.
Citation: The engine attributes a claim to your source, quoting you as credible.
Presence: Your brand surfaces somewhere in the answer for a relevant question.
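The four levels can be sketched as an ordered enum, with one level recorded per query and per engine. The `Observation` flags and the rule that the highest satisfied level wins are assumptions made for illustration, and the example query is invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    ABSENT = 0
    PRESENCE = 1
    CITATION = 2
    PROMINENCE = 3
    DEFAULT = 4


@dataclass(frozen=True)
class Observation:
    mentioned: bool          # brand appears somewhere in the answer
    source_attributed: bool  # a claim is attributed to the brand's source
    featured_early: bool     # brand appears early and centrally
    recommended_first: bool  # brand is the first recommendation


def classify(obs):
    """Return the highest pyramid level the observation supports (assumed rule)."""
    if not obs.mentioned:
        return Level.ABSENT
    if obs.recommended_first:
        return Level.DEFAULT
    if obs.featured_early:
        return Level.PROMINENCE
    if obs.source_attributed:
        return Level.CITATION
    return Level.PRESENCE


# One level per (query, engine) pair rather than a single site-wide average.
levels = {
    ("best pim for retailers", "Perplexity"):
        classify(Observation(True, True, True, True)),
    ("best pim for retailers", "Gemini"):
        classify(Observation(True, False, False, False)),
}
```

Because `Level` is an `IntEnum`, levels compare directly, so a climb from presence to citation between two cycles is simply a larger value.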

The scoring formula

Position-weighted citation, with N=5 sampling.

The core metric is a position-weighted citation score adapted from peer-reviewed generative-engine research, sampled five times per query to control the inherent variance of generative answers.

A naive share of voice counts mentions and stops there, which is misleading because position carries meaning. A model that names your brand first is recommending it; a model that names you last, behind three competitors, is hedging. The position-weighted citation formula assigns more weight to earlier, more prominent citations, so the score rewards the kind of mention that actually shifts a buyer's shortlist rather than treating a leading recommendation and a trailing aside as equal.

Because a single sampling of a generative engine is noisy — ask the same question twice and the answer can differ — the methodology samples each query five times and aggregates. N=5 is a deliberate balance: enough repetitions to tame the variance and expose a stable signal, few enough to keep the measurement affordable at the scale of hundreds of strategic queries across four engines. The aggregate is what we trend; a single run is never reported as a result.

On top of the position-weighted score sits a Citation Authority Score that captures how authoritative the cited sources are, distinguishing a citation of your own controlled domain from a borrowed mention in a third-party article. Only valid responses are scored: an engine that errored or returned nothing is excluded rather than silently distorting the average. Each of these choices is a guard against a flattering but false number, and together they are what let the formula function as a metric a board can trust rather than a marketing claim.
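The sampling and valid-only rules can be sketched as follows. The text says the five samples are aggregated without naming the aggregate, so the arithmetic mean used here is an assumption, as is the convention of marking an errored or empty response with `None`.

```python
from statistics import mean

N_SAMPLES = 5  # sampling depth stated in the text


def aggregate(sample_scores):
    """Mean of the valid samples; None marks an errored or empty response."""
    if len(sample_scores) != N_SAMPLES:
        raise ValueError(f"expected {N_SAMPLES} samples, got {len(sample_scores)}")
    valid = [s for s in sample_scores if s is not None]
    if not valid:
        return None  # nothing to score, so report no value rather than a zero
    return mean(valid)


# Made-up per-sample scores; the third probe errored and is excluded.
score = aggregate([1.0, 0.5, None, 1.0, 0.5])
```

Had the failed probe been counted as a zero, the same data would average 0.6 instead of 0.75, which is the silent distortion the exclusion rule guards against.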

It is worth being explicit about why a raw mention count is actively misleading rather than merely crude. Two brands can have identical mention counts while one is recommended first in every answer and the other is named last in a hedged list of alternatives. To a buyer those outcomes could not be more different, yet a naive metric scores them the same and would even congratulate the trailing brand for matching the leader. Position weighting exists to make the metric agree with reality: the score goes up when you are recommended, not merely when you are mentioned.

The choice of N=5 is itself a metrological decision rather than an arbitrary one. Too few samples and the variance of generative answers swamps the signal, so a brand looks to have surged or collapsed when nothing real changed. Too many and the cost of measuring hundreds of queries across four engines, several times each, becomes prohibitive and slows the cadence the loop depends on. Five samples sit at the knee of that trade-off for the question volumes we run, stabilising the estimate enough to trend confidently without making continuous re-measurement unaffordable.

All of this connects to the four-level pyramid directly. The position-weighted score is, in effect, the pyramid expressed as a number: a brand that is merely present scores low, a brand that is cited scores higher, and a brand recommended first on its strategic queries scores highest. That coherence is intentional, so that the visual model a stakeholder understands intuitively and the metric an analyst trends are two views of the same underlying truth rather than two disconnected systems that have to be reconciled by hand.

A worked intuition helps. Imagine two queries where your brand is mentioned exactly once. In the first, the engine opens its answer by recommending you and attributes a specific capability to your documentation; in the second, it lists four competitors and adds your name at the very end with no attribution. A flat count records one mention in each case and declares a tie. The position-weighted score, sampled five times to confirm the pattern is stable, records a strong result for the first and a weak one for the second — which is exactly how a human reading those two answers would judge them, and exactly the judgement the metric is built to reproduce.
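The two-query comparison above can be reproduced in a few lines. The reciprocal-rank weight is an assumption chosen only because it decays with position; the page does not state the exact weights, and the brand names are placeholders.

```python
def position_weight(rank):
    """Assumed decay: reciprocal of the 1-based rank."""
    return 1.0 / rank


def pwc(brands_in_order, brand):
    """Weight of the brand's first mention; 0.0 if it is not mentioned."""
    for rank, name in enumerate(brands_in_order, start=1):
        if name == brand:
            return position_weight(rank)
    return 0.0


# Brands in the order each answer names them (placeholder names).
first_answer = ["YourBrand", "RivalA", "RivalB"]
second_answer = ["RivalA", "RivalB", "RivalC", "RivalD", "YourBrand"]

flat_counts = (first_answer.count("YourBrand"), second_answer.count("YourBrand"))
weighted = (pwc(first_answer, "YourBrand"), pwc(second_answer, "YourBrand"))
```

The flat count is 1 for both answers, while the weighted score is 1.0 for the leading recommendation and 0.2 for the trailing mention under this assumed decay.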

None of these choices is exotic; each is simply the honest version of a measurement that is easy to fudge. Counting only valid responses, weighting by position, sampling enough to be stable, and separating parametric from web-grounded results are the differences between a number engineered to flatter and a number engineered to be true. Insisting on all of them is unglamorous, but it is the reason the methodology produces figures a sceptical executive can rely on rather than a dashboard that looks impressive and means little.

Position-Weighted Citation (PWC)

wᵢ: Rank weight
cᵢ: Citation at position i

Earlier citations carry more weight (decaying by rank).
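The symbols are named here but the expression itself is not printed. One generic form consistent with them, offered as an assumption and not necessarily SkuLift's exact formula, is a rank-weighted sum over the k positions in an answer, where cᵢ is taken to be 1 if the brand is cited at position i and 0 otherwise:

```latex
\mathrm{PWC} \;=\; \sum_{i=1}^{k} w_i \, c_i,
\qquad w_1 \ge w_2 \ge \dots \ge w_k \ge 0,
\qquad c_i \in \{0, 1\}
```

Under this reading, the per-query score that gets trended would be the aggregate of PWC over the five valid samples.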

AEO vs GEO

Answer engine optimization versus generative engine optimization.

AEO and GEO are complementary disciplines, not synonyms. AEO engineers your content to be quotable; GEO builds the authority that makes the quote credible. The methodology applies both, deliberately.

Answer Engine Optimization is the on-page, technical craft: structuring content so a model can lift a clean, self-contained, attributable answer from it. It leads with the conclusion, states facts a model can extract without distortion, and marks up the page so the relevant passage is unambiguous. AEO is what makes your content easy to cite; without it, even an authoritative brand is passed over because its pages are hard to quote.

Generative Engine Optimization is the off-page, strategic half: building the authority signals that make an engine confident enough to cite you. Consistent entity identity, presence in the structured knowledge graphs models lean on, reference profiles and corroborating third-party mentions all raise the probability of a citation. AEO without GEO produces quotable content nobody trusts; GEO without AEO produces a trusted brand whose pages are too messy to quote. The methodology insists on both because the engines weigh both, and optimizing one while neglecting the other leaves half of the win on the table.

A useful way to hold the distinction is that AEO changes what you publish and GEO changes how the open web corroborates it. AEO is largely within your control: you own the pages, the structure and the markup, so a disciplined team can improve citability directly. GEO is more diffuse and slower, because authority is earned across properties you do not fully own — directories, knowledge graphs, third-party mentions — and cannot simply be willed into existence. The methodology sequences them accordingly, fixing the controllable AEO foundations early so that the slower GEO authority work compounds on solid ground.

The two also fail in instructive ways when separated. A brand that pours effort into authority while neglecting AEO ends up trusted but unquotable: the engine would happily cite it, but its pages offer no clean passage to lift. A brand that perfects AEO while ignoring authority ends up quotable but untrusted: its pages are immaculate, but the model has no corroborating reason to prefer them over a competitor the open web vouches for more loudly. Only the combination wins reliably, which is why the methodology treats AEO and GEO as two halves of one discipline rather than a choice.

In practice the two disciplines also operate on different clocks, and managing that is part of the method. An AEO improvement to a page can be reflected in the next re-measurement cycle, because the engine can re-read a cleaner page quickly. A GEO authority gain is slower to register, because corroborating signals accrue and propagate across the open web over weeks and months. Setting expectations around those different time constants is how the methodology keeps a programme on course: early wins tend to be AEO, durable leadership tends to be GEO, and confusing the two leads to either impatience or complacency.

It is worth stating plainly that neither AEO nor GEO is SEO rebranded. Classic SEO optimizes for rank on a results page through links and relevance signals; both AEO and GEO optimize for being cited inside a generated answer, which the engine composes rather than ranks. Some SEO foundations still help as one input, but treating AI visibility as SEO with new keywords is the single most common mistake brands make, and it leaves them optimizing for a results page the buyer never sees. The methodology is built for the answer, not the list of links beneath it.

Objective
AEO: Be the answer cited.
GEO: Be the source quoted.

Key signals
AEO: Answer-first content, FAQ markup, brand entity.
GEO: Authority backlinks, knowledge graphs, owned media.

Primary KPI
AEO: Citation rate per query.
GEO: Source share of voice.

Cycle
AEO: 4 to 8 weeks.
GEO: 12 to 24 weeks.

Re-measurement cadence

Four engines, re-measured on a fixed cadence.

SkuLift re-probes ChatGPT, Claude, Gemini and Perplexity on a regular cadence — roughly every six hours — so a regression is caught within a cycle rather than discovered months later.

A measurement taken once is a snapshot; a position is a moving thing. Engines are retrained, retrieval is re-ranked, and competitors publish, so a share of voice won in one cycle can erode in the next without anything you did causing it. Re-measuring on a fixed, frequent cadence turns that volatility from an invisible risk into a managed signal: the curve is watched continuously, and a dip is visible while it is still cheap to fix.

Probing four engines on a six-hourly rhythm is also what makes attribution honest. Because the same query set is re-run against the same engines on a schedule, a lift that follows a published change can be tied to that change rather than to luck or seasonality, and a regression can be isolated to its cause. The cadence is the difference between claiming an improvement and proving one, which is the standard the rest of the methodology is built to meet. It is automated rather than manual precisely because no team can sustain that rhythm by hand.

Frequent re-measurement also disciplines the recommendation side of the loop. When you know a change will be re-measured within hours rather than quietly forgotten, every recommendation carries an implicit prediction — this should move that metric on those queries — and the next cycle either confirms it or does not. Over time this builds an evidence base specific to your brand and category: the methodology learns which kinds of moves actually earn citations for you, so the backlog gets smarter rather than merely longer. A cadence you can trust turns the whole loop into an experiment that compounds.

There is a cost discipline embedded in the cadence as well. Probing four engines, several samples per query, across hundreds of strategic queries, on a six-hourly rhythm, is a non-trivial amount of paid AI usage, and the platform meters every call so the measurement itself stays within an agreed budget. The cadence is therefore not maximised for its own sake but tuned: frequent enough to catch regressions while they are cheap to fix, restrained enough that the act of measuring never becomes the dominant cost of the programme.
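The scale of that usage follows from simple multiplication. The sketch below takes the engine count, sampling depth and cadence from the text; the figure of 200 queries and the per-call cost passed to the budget check are illustrative assumptions, not SkuLift's actual volumes or prices.

```python
ENGINES = 4            # ChatGPT, Claude, Gemini, Perplexity
SAMPLES_PER_QUERY = 5  # N=5
HOURS_PER_CYCLE = 6    # "roughly every six hours"


def calls_per_day(num_queries):
    """Engine calls per day for a query set probed on the stated cadence."""
    cycles_per_day = 24 // HOURS_PER_CYCLE
    return num_queries * ENGINES * SAMPLES_PER_QUERY * cycles_per_day


def within_budget(num_queries, cost_per_call, daily_budget):
    """True if a day's probing stays at or under the agreed budget."""
    return calls_per_day(num_queries) * cost_per_call <= daily_budget


daily_calls = calls_per_day(200)  # 200 queries is an illustrative figure
```

For 200 queries this comes to 200 × 4 × 5 × 4 = 16,000 calls per day, which is why metering each call matters.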

The cadence also turns the whole programme into a feedback system rather than a sequence of deliverables. Each re-measurement is not just a status check but the input to the next planning phase, so the backlog is continuously re-prioritised against fresh evidence rather than frozen at the start of an engagement. A query that was losing badly and is now contested gets less attention; a query that was won and is slipping gets defended. That dynamic re-prioritisation, driven by frequent measurement, is what keeps effort pointed at where it currently matters most rather than where it mattered at kickoff.

The four engines are also not interchangeable, and the cadence respects that. ChatGPT, Claude, Gemini and Perplexity weight sources differently, ground their answers differently, and update on different schedules, so a result on one is not a proxy for the others. Re-measuring all four on the same rhythm is what lets the methodology report a true cross-engine picture rather than over-generalising from whichever engine happened to be sampled, and it is what reveals when a brand is strong on one engine and quietly losing on another.

A fixed cadence, finally, is what allows the methodology to make promises it can keep. Because the engines are re-probed on a known schedule, a client always knows when the next reading lands and what it will measure, so progress is reported on a rhythm rather than whenever someone remembers to look. Predictability of measurement is itself a feature: it turns AI visibility from an occasional audit into an instrumented, always-on channel.

Reproducibility and rigour

Reproducibility and metrological rigour.

A methodology is only as good as its reproducibility. Same query set, same sampling, same formula, every cycle — so a number means the same thing in April as it does in July.

Metrological rigour is the unglamorous foundation that makes everything else trustworthy. The query set is defined and version-controlled, not improvised per run. The sampling depth is fixed at N=5. The scoring formula is constant. Invalid responses are excluded by an explicit rule rather than by whoever happens to read the dashboard. These choices sound pedantic until you realise that without them, two people looking at the same brand reach different conclusions and a trend line becomes an artefact of inconsistent measurement rather than a real signal.
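One way to make "same query set, same sampling, same formula" checkable is to freeze the measurement definition and fingerprint it. This is a sketch under assumptions: `MeasurementSpec`, its fields and the `pwc-v1` label are hypothetical, and the page does not say how the platform versions its query sets.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MeasurementSpec:
    queries: tuple                   # the versioned query set, in a fixed order
    engines: tuple
    samples_per_query: int = 5       # N=5, as stated in the text
    formula_version: str = "pwc-v1"  # hypothetical label for the scoring formula

    def fingerprint(self):
        """Stable hash of the spec; equal fingerprints mean comparable cycles."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ENGINES = ("ChatGPT", "Claude", "Gemini", "Perplexity")
april = MeasurementSpec(("q1", "q2"), ENGINES)
july = MeasurementSpec(("q1", "q2"), ENGINES)
changed = MeasurementSpec(("q1", "q2", "q3"), ENGINES)
```

Storing the fingerprint next to each cycle's scores lets a reader confirm that two numbers were produced the same way before comparing them.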

This is also why the methodology resists vanity. Because every metric is defined precisely and computed the same way, it cannot be quietly reshaped to flatter a result, and a regression cannot be hidden by changing how it is counted. The discipline cuts both ways: it stops a bad cycle from looking good and stops a good cycle from being doubted. For a brand investing real budget in AI visibility, that auditability is the point — a method you can verify is a method you can defend to a board, and a method an engine can cite is a method that demonstrates the very answer-first, source-grade rigour we sell.

Reproducibility has a second, subtler payoff: it makes disagreement productive. When the query set, sampling and formula are all fixed and documented, a sceptical stakeholder who doubts a result can interrogate the method rather than dispute the number, and either find a genuine flaw worth fixing or come away convinced. A method that cannot be inspected invites endless arguments about whether the measurement is real; a method that can be inspected moves the conversation to what to do about it. For a discipline as new as AI visibility, that auditability is most of what earns it a seat at a serious table.

Finally, rigour is what makes the method portable across very different brands and sectors. Because the measurement does not depend on a particular industry's jargon or a single engine's quirks, the same disciplined loop applies whether the buyer is shopping for software, a hotel, a consumer product or a regulated service. The query set and lexical field are tailored to each case, but the metrological backbone — defined sampling, position weighting, valid-only scoring, fixed cadence — stays constant, which is precisely what lets results be compared and trusted across a portfolio rather than treated as a collection of incomparable anecdotes.

In the end, the methodology is the product as much as any dashboard. A brand does not buy a number; it buys a disciplined, reproducible way of turning AI visibility into a managed asset, with each phase open to inspection and each result open to challenge. That is the standard a serious channel has to meet, and it is the standard this page is written to demonstrate as much as to describe.

Why length and lexical field matter

Why length and lexical coverage change citations.

Thin pages are hard to cite. A source that covers a topic's full lexical field — the terms, questions and entities a model associates with it — is far more likely to be the one an engine quotes.

Models retrieve and cite based on semantic match, so a page that mentions a concept once and moves on is a weak candidate next to one that genuinely covers the field around it. Depth here does not mean padding; it means addressing the real sub-questions a buyer asks, naming the related entities and terms, and answering each cleanly. A page that does this gives a model many places to match a query and a clean passage to lift for each, which is why competitor-grade pillar pages run to thousands of words rather than a few hundred.

This is the principle SkuLift applies to its own site, including this page. We practise the answer-first structure, the lexical coverage and the depth we recommend, because a vendor whose own pages are thin and uncitable has no business advising others on citability. The lesson generalises: winning AI citations is rarely about a single clever trick and almost always about being the most genuinely useful, well-structured and authoritative source on the question. The methodology is simply the disciplined way to become that source and to prove, cycle after cycle, that you have.

There is a temptation to read "longer pages win" as license to pad, and the methodology is explicit that this is wrong. Length is a side effect of genuine coverage, not a target in itself. A page that bloats to thousands of words with restated filler is worse than a tight one, because a model trying to lift a clean answer finds the signal buried in noise and may pass over the page entirely. The goal is to cover the field — every real sub-question, every relevant entity, each answered cleanly — and the word count follows from doing that honestly. Coverage earns citations; padding repels them.

This is also where the methodology connects back to the rest of the platform. The lexical field for a brand is not guessed; it is derived from the queries that actually matter, the entities that recur in winning answers, and the gaps the analysis phase surfaces. So the depth and coverage of a page are not a writer's intuition about what might help but a direct response to what the measurement showed the engines reward. Content depth, in this method, is an engineered consequence of evidence rather than a stylistic preference, which is what keeps it honest and what keeps it working.
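A crude way to check a page against a derived lexical field is to count which field terms it mentions at all. Engines match semantically, so the case-insensitive substring test below is a deliberately simple proxy, and the field terms and page text are invented for illustration.

```python
import re


def coverage(page_text, lexical_field):
    """Fraction of field terms found in the page, plus the terms still missing."""
    if not lexical_field:
        raise ValueError("lexical field is empty")
    normalized = re.sub(r"\s+", " ", page_text.lower())
    missing = [t for t in lexical_field if t.lower() not in normalized]
    covered = len(lexical_field) - len(missing)
    return covered / len(lexical_field), missing


# Illustrative field and page, not derived from real measurement data.
field = ["share of voice", "citation", "knowledge graph", "FAQ markup"]
page = "Our guide explains share of voice and how a citation is earned."
ratio, gaps = coverage(page, field)
```

The list of missing terms is the useful output: it names the sub-topics the page has yet to address, which is a prompt for genuine coverage and not for padding.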

Depth, finally, is a durability play as much as a discovery one. A thin page that wins a citation today is easy for a competitor to displace tomorrow with a slightly better source. A page that genuinely owns the lexical field around a question is far harder to dislodge, because the competitor would have to out-cover it across the whole topic rather than beat it on a single passage. Coverage therefore does double duty: it earns the citation and it defends it, which is precisely the compounding the loop is designed to produce.

Proof in practice

From baseline to measured lift.

Applied end to end, the loop moves real numbers: a brand that starts near-invisible in AI answers can reach a meaningful, position-weighted share of voice within a handful of cycles, with every gain attributed to a specific change and proven by re-measurement rather than asserted. The case studies show the method working on real catalogues across the four engines.

Methodology questions

Questions about how we measure.

What is share of voice in AI answers?

Share of voice is how often your brand is named relative to competitors across a defined set of strategic questions, computed per engine. SkuLift uses a position-weighted version so a leading recommendation counts for more than a trailing mention, reflecting how a citation actually influences a buyer.

How does the position-weighted citation formula work?

It scores each citation by its prominence and position in the answer, weighting earlier and more central mentions more heavily, then aggregates across N=5 samples per query. A Citation Authority Score layers on how authoritative the cited sources are, and only valid responses are included.

Should I focus on AEO or GEO?

Both. AEO is the on-page craft that makes your content quotable; GEO is the off-page authority work that makes the quote credible. Engines weigh both, so optimizing one while neglecting the other leaves half of the win unclaimed. The methodology applies them together.

How often do you re-measure?

SkuLift re-probes ChatGPT, Claude, Gemini and Perplexity on a fixed cadence of roughly every six hours, so regressions are caught within a cycle and a lift can be attributed to the change that caused it rather than to chance or seasonality.