Analysis

Why Asking ChatGPT “Who’s the Best?” Isn’t an AI Visibility Benchmark

Two versions of the same exercise show up every week. A vendor posts a LinkedIn screenshot: “we asked ChatGPT who’s the best in our category and, humbled, the answer was us.” A dealer principal or GM types “who is the best [brand] dealer in [city]” into ChatGPT from their desk and forwards the result to their marketing team. Both rely on the same flawed methodology. Here is why, and what actually measures AI visibility.

Quick Answer (Last updated April 2026)

Asking an LLM like ChatGPT "who is the best in [category]" and screenshotting the answer is not a measurement of AI visibility. The same flawed exercise runs two ways: publicly (a vendor posting the screenshot to LinkedIn) and privately (a dealer principal or GM typing "who is the best [brand] dealer in [city]" from their desk and forwarding the answer to their marketing team). LLM outputs are non-deterministic, shaped by temperature sampling, model version drift, retrieval augmentation, and personalization from account memory. A headless, incognito prompt is the least representative query a real user could make. There is no stable AI ranking to track because LLMs are not search engines. Real AI visibility is measured through structured data coverage, server-log evidence of AI crawlers, AI referral traffic in analytics, branded search volume, and entity mentions across authoritative sources. Not screenshots.

TL;DR
  • LLMs are non-deterministic. The same prompt returns different answers seconds apart.
  • A headless, incognito prompt is the least representative query a real user could make.
  • LLMs are not search engines. There is no stable ranking to appear in.
  • A logged-in account is worse: persistent memory turns “who’s the best” into a leading question.
  • This is not only a vendor posturing problem. Dealer principals running the same prompt on their own stores make real decisions (vendor firings, budget cuts, promotions) from output that is not a measurement.
  • Measure AI visibility through structured data coverage, server-log crawler evidence, AI referral traffic, branded search volume, and entity mentions. Not screenshots.
Two Concepts to Know First

Before the screenshots, a quick mechanics check.

Most of this page only lands if you have two ideas in hand. If you have already internalized these, skim ahead. If they are new, give them sixty seconds.

1. Deterministic vs Non-Deterministic

A deterministic system returns the same answer every time for the same input. A calculator. Traditional Google search at a given moment. An LLM is non-deterministic by design. The same prompt can produce different answers seconds apart because the model samples from probability distributions to generate fluent text. That is not a defect. It is the mechanism. Every screenshot is one roll of a weighted die, not a reading off a ranking.
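The "weighted die" is easy to demonstrate without any LLM at all. A minimal Python sketch, with a made-up answer distribution standing in for a model's next-token probabilities (the vendor names and weights are invented purely for illustration):

```python
import random
from collections import Counter

# Made-up answer distribution standing in for a model's next-token
# probabilities. The vendor names and weights are illustrative only.
candidates = ["Vendor X", "Vendor Y", "Vendor Z", "Vendor W"]
weights = [0.35, 0.30, 0.20, 0.15]

def sample_answer(rng: random.Random) -> str:
    """One 'screenshot': a single weighted sample, like temperature sampling."""
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random()  # unseeded on purpose: every run differs, by design
answers = [sample_answer(rng) for _ in range(5)]
print(Counter(answers))  # five "users", usually more than one distinct answer
```

Run it twice and the tallies change. That is the entire LinkedIn comment section, reproduced in fourteen lines.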

2. Semantic Search

Old Google matched the literal words of your query against the literal words on a page. AI search matches meaning. Queries and pages are converted into vectors of numbers (embeddings) that represent what each is about. “Best Honda dealer,” “top-rated Honda dealership,” and “where should I buy my CR-V” all point to similar regions of that numerical space. The system retrieves from that region and generates an answer. There is no stable ranked list to appear in.
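A toy sketch of that numerical space, with hypothetical four-dimensional embeddings (real models use hundreds of dimensions; the numbers here are invented purely to show the idea):

```python
import math

# Hypothetical 4-dimensional embeddings. Real embedding models produce
# hundreds of dimensions; these vectors are made up for illustration.
embeddings = {
    "best Honda dealer":          [0.9, 0.8, 0.1, 0.0],
    "top-rated Honda dealership": [0.8, 0.9, 0.2, 0.1],
    "where should I buy my CR-V": [0.7, 0.7, 0.3, 0.2],
    "how to bake sourdough":      [0.0, 0.1, 0.9, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

q = embeddings["best Honda dealer"]
for text, vec in embeddings.items():
    print(f"{text}: {cosine(q, vec):.2f}")
```

The three car queries score close to 1.0 against each other; the sourdough query does not. Retrieval pulls from that nearby region, which is why there is no ranked list to "appear in."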

Those two concepts, together, are enough to make the rest of this page obvious. Want the complete vocabulary (retrieval-augmented generation, temperature, structured data, AI crawlers, hallucination, personalization)?

The Pattern

The Post You’ve Already Seen a Hundred Times

An executive at a vendor posts a screenshot. They asked ChatGPT “who is the best [their category]” and the model returned a flattering answer. The caption says something like “humbled but not surprised.” The hashtags include their industry, their role, and sometimes their own brand.

The comment section, without fail, immediately breaks the premise. Five different people run the same prompt and post five different answers. Half of them name their own employer. One person remarks, dryly, that “every LLM is biased. I realize that’s a surprise to everyone.” Another jokes about running it through a “Van Damme LLM” for a second opinion.

That comment section is the proof. If the exercise were a measurement, the answers would converge. They never do. What’s actually being demonstrated, every single time, is that the methodology produces noise and presents it as signal.

“This post has certainly shown the vast array of answers an LLM can provide different users. I guess we’re all ‘best’ in our own ways.”

(the original poster, in his own comment thread)

The Version Nobody Posts

When Dealers Run It on Themselves

The louder version of this exercise is the LinkedIn post. The quieter one causes more damage. Dealer principals, GMs, and marketing directors open ChatGPT at their desk and type “who is the best [brand] dealer in [their city]” or “where should I buy a [model] near [their zip].” If their store shows up, the screenshot gets forwarded to the marketing team as proof the program is working. If it does not, the same screenshot gets forwarded as proof the program is broken.

Neither interpretation survives contact with reality. The prompt is non-deterministic. The model is personalized to that executive’s own browsing and prior ChatGPT memory. The answer is drawn from a corpus that favors whichever competing store has the biggest PR footprint in the metro, not the store with the best conversion rate. A dealership dominant in Google organic, the local pack, and GBP engagement can still be missing from a single ChatGPT response. A dealership with poor organic visibility can still get name-dropped because one inventory feed happened to be indexed last month.

The damage is what gets decided next. Vendors get fired. Vendors get defended. Budgets get cut or expanded. Marketing directors get promoted or removed. Because the exercise feels like a measurement, the decisions made from it carry the weight of data without the reliability of data. That is the real failure mode of LLM self-search at a dealership, and it is worse than any vanity post.

The rule for dealer principals and GMs:

Do not make vendor, budget, or personnel decisions from a ChatGPT prompt. If the question is “is my store visible in AI search,” the answer lives in your analytics referrers, server logs, structured data coverage, GBP health, and branded search trend. Not in a screenshot.

The Breakdown

Five Reasons the Methodology Breaks Down

Each of these reasons alone would disqualify the screenshot as evidence. Stacked, they explain why every comment thread diverges.

LLMs are non-deterministic by design

The same prompt sent to the same model seconds apart can return completely different answers. Temperature, top-p sampling, and retrieval reordering are not bugs. They are core to how these systems work. A single screenshot captures one roll of the dice, not a ranking.

Model version + RLHF drift

ChatGPT, Gemini, Claude, and Perplexity all ship model updates weekly. "The best provider in category X" last Monday will not be the same answer next Monday, not because the market changed, but because the model did.

Personalization the asker forgets they enabled

Logged-in ChatGPT accounts have persistent memory across sessions. Gemini pulls from Google account history. Copilot uses Microsoft telemetry. A founder asking "who is the best in my space" is not running a neutral benchmark. They are asking a model that has been trained on their own prior prompts.

Headless is the least representative state

Real users bring location, device locale, prior turns in the conversation, account history, and (increasingly) persistent memory. An incognito, one-shot prompt is the opposite of a real user. If anything, it is the query least like the one your actual customers run.

Training data bias toward well-branded entities

LLMs were trained on public web text dominated by press releases, directory pages, and the entities with the most prior coverage, not the businesses with the best outcomes. A brand that spends on PR will get name-dropped more often than a better competitor with quieter coverage. That is a marketing artifact, not a quality signal.

Variance, Demonstrated

Five People Run the Same Prompt. Five Answers.

What a typical “who’s the best in [category]” prompt actually produces across five users, all asking within the same hour. Names are illustrative; the pattern is not.

User   | Context                                          | ChatGPT’s answer to “who is the best in [category]?”
User A | Logged in, past prompts about Vendor X           | Vendor X
User B | Logged in, past prompts about Vendor Y           | Vendor Y
User C | Incognito, US East coast IP                      | Vendor Z (a large, heavily-PR’d brand)
User D | Incognito, US West coast IP                      | Vendor W (different regional coverage)
User E | Logged in, Plus account, same prompt 5 min later | Different ranked list than their first attempt

If the methodology produced a measurement, rows A through E would all return the same answer. They don’t, because there is no stable ranking to return.

Category Error

An LLM Is Not a Search Engine

Google has an index you can petition, a crawler you can submit URLs to, a console that reports which queries drove clicks, and a ranking that, while opaque, is at least reproducible on a per-query basis. A brand can invest in ranking for “best [category] in [city]” and measure its position over time.

An LLM has none of that. It is a next-token predictor with retrieval layered on top. There is no ranking, no index you rank in, no canonical list of “the best X” that the model consults. When you ask “who is the best?” the model does not look up an answer. It generates a plausible-sounding one from a weighted sample over its training distribution and (sometimes) retrieved snippets.

Treating an LLM’s generated answer as a ranking is a category error. It is the difference between a scoreboard and a poet asked who the best team in the league is. One has a stable answer. The other has a confident one.

The correct comparison to traditional search is not “we used to rank in Google, now we rank in ChatGPT.” It is “we invest in the signals that make authoritative sources citable, and measure whether those signals drive citation-adjacent outcomes.” That is a measurable program. A ChatGPT screenshot is not.

Training Data Is Not Reality

The “Best” Answer Rewards PR, Not Quality

Large language models are trained on public web text. That corpus is dominated by press releases, award listicles, directory pages, G2 and Capterra summaries, forum threads, and the long tail of SEO content. The entities that appear most often in that corpus are, by definition, the ones that spent the most effort on getting covered, not the ones that produced the best outcomes for their customers.

A vendor with a large PR budget and five years of “Top 10” features in trade publications will be name-dropped in an LLM’s response more reliably than a better competitor with quieter coverage. The LLM is not endorsing the loud vendor. It is reciting what the web told it to recite.

This matters because the people most likely to post the “ChatGPT said we’re the best” screenshot are the ones whose brands are already over-represented in the training corpus. They are not discovering a new signal; they are taking a screenshot of the historical one.

Measure This Instead

What a Real AI Visibility Measurement Framework Tracks

Every signal below is auditable, reproducible, and linked to an actual outcome. None of them require a screenshot. For the hands-on setup, see the GA4 and server-log measurement playbook.

Structured data coverage

Which schema types (Organization, LocalBusiness, AutoDealer, Product, FAQPage, Article) are deployed, on which templates, and whether the entity resolves cleanly via @id. This is auditable and reproducible.
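As an illustration, a minimal AutoDealer JSON-LD object built in Python. The domain, @id, and address values are placeholders; the schema.org types and property names are real:

```python
import json

# Minimal AutoDealer JSON-LD sketch. The domain, @id, and address values
# are placeholders; the schema.org types and property names are real.
dealer = {
    "@context": "https://schema.org",
    "@type": "AutoDealer",
    "@id": "https://www.example-dealer.com/#dealer",  # stable entity identifier
    "name": "Example Honda",
    "url": "https://www.example-dealer.com/",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Springfield",
        "addressRegion": "IL",
    },
    "sameAs": [
        # cross-references that help the entity resolve across surfaces
        "https://www.facebook.com/exampledealer",
    ],
}
print(json.dumps(dealer, indent=2))
```

The point of the stable `@id` is that every template on the site can reference the same entity node instead of re-declaring it, which is what makes the coverage auditable.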

Structured Data for AI Visibility
AI referral traffic in analytics

Track visits from chat.openai.com, perplexity.ai, copilot.microsoft.com, gemini.google.com, and bing.com/search (AI Answers). This is your real measurement of AI-driven demand, not whether a prompt hallucinated your name.
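A sketch of the classification step, assuming a set of referrer hostnames drawn from the surfaces named above. Treat the set as a starting point, not an exhaustive one: hostnames shift over time (chatgpt.com now serves ChatGPT alongside chat.openai.com), and Bing AI Answers shares bing.com with ordinary search, so it needs path-level inspection rather than a hostname match:

```python
from urllib.parse import urlparse

# Referrer hostnames for the AI surfaces named above. A starting point,
# not an exhaustive or stable list. bing.com is omitted because AI
# Answers shares the hostname with ordinary Bing search.
AI_REFERRERS = {
    "chat.openai.com", "chatgpt.com",
    "perplexity.ai", "www.perplexity.ai",
    "copilot.microsoft.com", "gemini.google.com",
}

def is_ai_referral(referrer_url: str) -> bool:
    """Classify a session's referrer URL as AI-driven or not."""
    host = urlparse(referrer_url).hostname or ""
    return host in AI_REFERRERS

sessions = [
    "https://chat.openai.com/",
    "https://www.google.com/",
    "https://perplexity.ai/search?q=best+honda+dealer",
]
print([is_ai_referral(s) for s in sessions])  # → [True, False, True]
```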

AI crawler activity in server logs

GPTBot, PerplexityBot, ClaudeBot, OAI-SearchBot, and Google-Extended all identify themselves in user-agent headers. Log analysis shows what AI systems are actually fetching from your site, and how often.
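A first pass at that log analysis: substring-match the self-identifying tokens against each request line. The log lines below are invented; the bot tokens are the ones listed above:

```python
from collections import Counter

# Self-identifying AI crawler tokens listed above. User-agent strings
# vary across versions, so substring matching on the token is a
# reasonable first pass before anything fancier.
AI_BOTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "OAI-SearchBot", "Google-Extended"]

# Two invented access-log lines in common log format.
log_lines = [
    '1.2.3.4 - - [01/Apr/2026:10:00:00 +0000] "GET /inventory HTTP/1.1" 200 512 '
    '"-" "Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [01/Apr/2026:10:05:00 +0000] "GET /about HTTP/1.1" 200 312 '
    '"-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]

hits = Counter()
for line in log_lines:
    for bot in AI_BOTS:
        if bot in line:  # at most one hit per bot per request line
            hits[bot] += 1
print(hits)  # which AI systems fetched pages, and how often
```

Run against a month of real logs, the counter becomes a reproducible answer to "are AI systems actually fetching my site," which is the question the screenshot pretends to answer.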

Branded-query volume in GSC

If AI visibility is working, more people encounter your brand and then search for it directly. Branded search volume is a leading indicator of citation-driven awareness, measurable in Google Search Console and Google Trends.
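One way to make "leading indicator" concrete: track week-over-week growth in branded impressions. The weekly numbers below are hypothetical stand-ins for a Google Search Console export:

```python
# Hypothetical weekly branded-query impression counts, standing in for a
# Google Search Console export. The numbers are invented for illustration.
weekly = [1200, 1260, 1310, 1405, 1490, 1600]

# Week-over-week growth. A sustained positive trend alongside rising AI
# referral traffic is the citation-driven awareness signal described above.
growth = [(b - a) / a for a, b in zip(weekly, weekly[1:])]
avg_growth = sum(growth) / len(growth)
print(f"avg week-over-week growth: {avg_growth:.1%}")
```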

Entity mention graph

Who is citing you where, using what anchor text, on which surfaces (Reddit threads, YouTube descriptions, news articles, Q&A sites). Entity co-occurrence is how AI models learn which brands belong to which categories.

Reddit & AI Citations
GBP engagement + review velocity

Google Business Profile remains the single strongest local entity signal for both traditional search and AI-generated local answers. Review volume, response rate, and profile completeness correlate directly with inclusion in local AI responses.

Google Business Profile for Dealerships
Vendor Diagnostic

Before You Pay for an “AI Rank” Report

These questions cut both ways. Ask them of any vendor pitching an “AI ranking,” “AI citation monitoring,” or an “AI visibility score” as its own line item. Ask them of yourself before you forward a ChatGPT screenshot to your marketing team as evidence of anything. If any single question cannot be answered cleanly, the output is decorative, not diagnostic.

1. Is the ranking reproducible? If I run the same prompt right now, will I get the same answer?

2. How many distinct prompts and model versions are in your panel? If the answer is "one prompt, one model, one snapshot," it is not a measurement.

3. Do your "AI rankings" correlate with any outcome (leads, traffic, branded search volume, demos) or only with themselves?

4. What changes on my site when my "AI rank" improves from #3 to #1? If the answer is "the report," you are paying for the report.

5. Are you controlling for model personalization, account memory, and geographic IP routing in your test panel?

6. If I switch to a different LLM next quarter, does your methodology adapt, or does your entire metric break?

7. What does "my content was cited" actually look like in your data: a URL match, an entity name match, or a fuzzy string match?

FAQ

Common Questions

Is a screenshot of ChatGPT saying my company is 'the best' meaningful?

No. It is a single roll of a non-deterministic system, from a headless session with no user context, on a model that may have been updated an hour later. Another person running the same prompt from the same machine five minutes later will often get a different answer. The screenshot proves that the prompt ran. Nothing more.

Why do LLMs give different answers to the same question?

Four stacked reasons: (1) temperature + top-p sampling introduce intentional randomness, (2) the model version may have changed between requests, (3) retrieval-augmented systems like ChatGPT web search pull different live results each time, and (4) A/B experiments on the production surface route users to different system prompts. All four are invisible to the end user.

I'm a dealer principal or GM. Should I stop running these prompts on my own store?

You can keep running them for curiosity, but stop treating the output as a measurement. A ChatGPT answer about your dealership reflects that specific model, that specific day, your personal account memory, and whatever got scraped into the training corpus last indexing cycle. It does not reflect whether a real buyer searching for your brand in your city will find you. If a single prompt is about to drive a vendor firing, a content budget change, or a staffing decision, stop and pull the actual data: AI referral traffic, server logs for GPTBot and PerplexityBot, structured data coverage, branded search trend in GSC, and GBP engagement. Every one of those is reproducible and tied to an outcome.

Why is a logged-in account even worse for self-benchmarking?

ChatGPT's persistent memory remembers your past prompts, your business, your preferences, and the people you have discussed. Asking 'who is the best provider in my space' from a logged-in account that has spent months discussing your business is not a neutral question. It is a leading one. The model has every incentive to produce a flattering, coherent-sounding answer that aligns with what it already knows about you.

If LLM self-search is not a metric, what is?

The measurable inputs to AI visibility are: structured data coverage (Organization, LocalBusiness, AutoDealer, Product, FAQ), server-log evidence of AI crawlers (GPTBot, PerplexityBot, ClaudeBot) actually fetching your pages, AI referral traffic (chat.openai.com, perplexity.ai, copilot.microsoft.com) in analytics, branded search volume trends, and entity mentions across authoritative third-party sources. All of these are reproducible, auditable, and correlated with real business outcomes.

What about agencies selling "AI rank tracking" for my brand?

Ask them these two questions: (1) Is the ranking reproducible? Can I run it again right now and get the same number? (2) Has my "AI rank" ever correlated with a business outcome (leads, traffic, branded search) or only with itself? If the honest answer to either is no, you are paying for a decorative report. See our full list of questions every dealership should ask a vendor claiming to sell AI visibility.

Does this mean AI visibility does not matter?

The opposite. AI visibility matters enormously, which is exactly why it deserves a rigorous measurement framework instead of a screenshot. The signals that actually drive AI citations (structured data, E-E-A-T, topical authority, entity coverage, real user engagement) are the same signals that drive traditional search. You can invest in them, measure them, and compound them over time. What you cannot do is reduce them to a single LLM prompt.

Keep Reading

The Honest AI Visibility Stack

If you skip LLM self-search, what replaces it? These are the resources that describe the actual signals AI systems use to decide whether to cite your business, and how to invest in them.

Don't Wait

Build Before You Need To

The teams gaining ground aren't reacting faster. They're building a content system that works for them even when they're not working on it.

That advantage grows every month.

Start Free

We Rise Together.