What the web actually marks up — and why it matters for AI search

Schema.org and Google just revealed how the web really uses structured data. The picture is surprising — and a roadmap for staying visible in AI search.

A grid of 958 dots, one per schema.org type — just twelve lit in slate, the only types used on more than ten million sites

Abstract

For the first time, Schema.org and Google have shown us how the web actually uses structured data — the invisible labels that tell machines what a page is. The picture is surprising: most of the web marks up plumbing, not the things that win visibility. In the age of AI answers, that gap is the opportunity.

I've spent twenty years in and around SEO — long enough to watch "structured data" go from a niche trick to table stakes, and now into something bigger. So when Schema.org and Google quietly published a dataset I'd wanted for a decade, I went straight to the numbers. They tell a better story than I expected.

First — what is schema markup?

To a machine, your web page is a wall of text. Schema markup (or "structured data") is a thin layer of invisible labels you add to the HTML that says, in a vocabulary machines agree on: this is a product, this is its price, this is the author, this is the business and here's its address.

It's the difference between a search engine guessing what your page is and being told. Get it right and you earn the enhanced results — the star ratings, the prices, the breadcrumbs. Get the entity bits right — who you are, what you're called, where else you live online — and you become a thing search engines, and increasingly AI assistants, can recognise and cite.

That last part is why this is having a moment. More on it shortly.
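
To make the idea of "invisible labels" concrete, here is a minimal sketch in Python that builds one such label set and wraps it in the script tag that JSON-LD markup normally lives in. The product, seller and town are made-up placeholders, and the helper names are mine, not part of any library.

```python
import json


def build_product_markup(name, price, currency, seller_name, seller_locality):
    """Return a schema.org Product description as a plain dict (JSON-LD)."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",           # "this is a product"
        "name": name,
        "offers": {
            "@type": "Offer",
            "price": price,           # "this is its price"
            "priceCurrency": currency,
            "seller": {
                "@type": "Organization",  # "this is the business"
                "name": seller_name,
                "address": {              # "and here's its address"
                    "@type": "PostalAddress",
                    "addressLocality": seller_locality,
                },
            },
        },
    }


def to_script_tag(markup):
    """Serialise the markup into the tag you would place in the page's HTML."""
    payload = json.dumps(markup, indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical values, for illustration only.
example = build_product_markup("Example Kettle", "39.00", "GBP", "Example Ltd", "Leeds")
print(to_script_tag(example))
```

Nothing in that output is visible to a human visitor. It only restates, in an agreed vocabulary, what the page already says.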

What just happened

On 4 June 2026, Schema.org and Google did something they'd never done: they published a public dataset showing how often each piece of schema vocabulary is actually used across the web. Until now this was guesswork and vendor estimates. Now it's an open, monthly file on GitHub.

The first release covers May 2026: 958 types and 4,587 properties. Here's what it says.

The web runs on a handful of types

Of those 958 types, almost nobody uses most of them. Usage collapses into a long tail:

  • < 1K: 485 types (50.6%)
  • 1K – 10K: 236 types (24.6%)
  • 10K – 100K: 151 types (15.8%)
  • 100K – 1M: 39 types (4.1%)
  • 1M – 10M: 35 types (3.7%)
  • 10M+: 12 types (1.3%)
How the 958 schema.org types split by how many domains use them — count and share of the vocabulary (May 2026).

Look at the slate sliver on the right. Just twelve types — barely 1% of the vocabulary — are used on more than ten million sites. At the other end, half of it (485 types) appears on fewer than a thousand. The web isn't drawing on a rich, varied vocabulary. It's leaning hard on a tiny core and ignoring almost everything else.
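
If you want to check the shares yourself, the arithmetic is short. This sketch hard-codes the six bucket counts quoted above rather than reading the published file, whose exact layout I am not assuming here.

```python
# Bucket counts as quoted in this article (May 2026 release).
BUCKETS = [
    ("< 1K", 485),
    ("1K - 10K", 236),
    ("10K - 100K", 151),
    ("100K - 1M", 39),
    ("1M - 10M", 35),
    ("10M+", 12),
]

total = sum(count for _, count in BUCKETS)

# Share of the vocabulary in each bucket, rounded to one decimal place.
shares = {label: round(100 * count / total, 1) for label, count in BUCKETS}

for label, count in BUCKETS:
    print(f"{label:>10}  {count:>3} types  {shares[label]:>4.1f}%")

# Types that appear on fewer than ten thousand domains (first two buckets).
under_10k = sum(count for _, count in BUCKETS[:2])
print(f"Under 10K domains: {under_10k} of {total} types")
```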

So what's in the core?

The ubiquitous twelve are plumbing, not rich results

Here are the only twelve types that clear ten million domains:

Type | What it says | Earns a rich result?
WebSite | "This is a site" | No (powered the retired search box)
WebPage | "This is a page" | No
Thing | The root of everything | No
Organization | "This is the company" | Logo & knowledge panel
Person | "This is a person" | Knowledge panel / author
ImageObject | Image metadata | Image features
BreadcrumbList | The trail to this page | Yes (breadcrumbs)
ListItem | A row in a list | No (builds breadcrumbs)
SearchAction | "You can search this site" | No (retired in 2024)
EntryPoint | Where an action points | No
ReadAction | "You can read this page" | No
PropertyValueSpecification | Defines a search input | No

Notice the pattern. Exactly one of the twelve — BreadcrumbList — reliably earns an enhanced result today. The rest is scaffolding: "this is a page," "this is a site," "you can read this."

There's a simple reason you see the same twelve everywhere: you didn't add most of them — your CMS did. WordPress and plugins like Yoast emit most of this bundle automatically, on every page, the moment you switch them on. The most-marked-up vocabulary on earth isn't a strategy. It's a default.

The web's structured-data foundation isn't strategy. It's defaults.

And some of those defaults now point at nothing. Look at three rows near the bottom of the table: SearchAction, EntryPoint and PropertyValueSpecification. They all come from a single code snippet: Google's own sitelinks-search-box example, copy-pasted across millions of sites. Google retired that feature in November 2024. The snippet lives on, quietly describing a box that no longer appears.
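
For reference, the snippet in question looks roughly like the dict below. I have reconstructed it from memory of Google's documented example, so treat the exact shape and the example.com URLs as assumptions. The helper is a hypothetical sketch of how you might strip the leftover action from a WebSite node while leaving everything else alone.

```python
import json

# Approximate shape of the legacy sitelinks-search-box markup (assumed, not copied).
LEGACY_SITE = {
    "@context": "https://schema.org",
    "@type": "WebSite",
    "url": "https://www.example.com/",
    "potentialAction": {
        "@type": "SearchAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://www.example.com/search?q={search_term_string}",
        },
        # In schema.org terms, this search input is a PropertyValueSpecification.
        "query-input": "required name=search_term_string",
    },
}


def strip_search_action(node):
    """Return a copy of the node with any SearchAction removed from potentialAction."""
    cleaned = dict(node)
    actions = cleaned.get("potentialAction")
    if actions is None:
        return cleaned
    as_list = actions if isinstance(actions, list) else [actions]
    kept = [
        action for action in as_list
        if not (isinstance(action, dict) and action.get("@type") == "SearchAction")
    ]
    if not kept:
        del cleaned["potentialAction"]
    elif isinstance(actions, list):
        cleaned["potentialAction"] = kept
    else:
        cleaned["potentialAction"] = kept[0]
    return cleaned


cleaned_site = strip_search_action(LEGACY_SITE)
print(json.dumps(cleaned_site, indent=2))
```

The function copies rather than mutates, so you can diff the before and after before shipping anything.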

The types that actually win are under-used

Put the most-used markup in one view — bar length for reach, colour for what Google actually does with it — and the gap jumps out:

Legend (bar fill):
  • Google rich result
  • Recognised, no rich result
  • Deprecated / restricted

Markup | Domains using it
Breadcrumbs | 10M+
Organization | 10M+
Person | 10M+
Image | 10M+
WebPage | 10M+
WebSite | 10M+
Thing | 10M+
ListItem | 10M+
SearchAction | 10M+
Product · Offer · Review | 1M – 10M
Article | 1M – 10M
Video | 1M – 10M
FAQ | 1M – 10M
Event | 100K – 1M
Job posting | 100K – 1M
Recipe | 10K – 100K
Course · Dataset · Q&A | 10K – 100K
The most-adopted schema types by reach (bar length) and what Google does with them (fill). May 2026.

Notice where the slate sits. The types that earn a visible result make up a growing share as the bars get shorter: the further down the adoption ladder you go, the more of the markup is the stuff Google actually rewards. Up top, where everyone marks up, it's mostly grey plumbing and a couple of dead features. The markup that wins is the markup the web hasn't bothered with.

Zoom into just the rich types and the drop is plain:

Markup | Google result | Domains using it
Breadcrumbs | Breadcrumb trail | 10M+
Organization | Logo, knowledge panel | 10M+
Product · Offer · Review | Price & star snippets | 1M – 10M
Article | Article / Top stories | 1M – 10M
Video | Video thumbnail | 1M – 10M
Event | Event listings | 100K – 1M
Job posting | Google Jobs | 100K – 1M
Recipe | Recipe cards | 10K – 100K
Course · Dataset · Q&A | Dedicated results | 10K – 100K

Product and Review markup — the stuff that puts a star rating beside your listing — sits on a tier with roughly ten times fewer domains than the CMS plumbing. Recipes, courses, job postings: the features exist, the adoption is a rounding error.

Two ways the web wastes effort

FAQ markup is still on over a million domains — but Google restricted FAQ rich results to government and health sites back in 2023. How-to is the same story: the feature was removed, the markup lingers. A lot of the web is maintaining schema that pays out nothing.
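
Checking whether a page still carries this kind of markup takes a few lines. The sketch below uses only Python's standard library to pull JSON-LD blocks out of an HTML string and flag FAQPage and HowTo, the schema.org type names behind the two features. The sample page is invented, and a real audit would fetch your own URLs.

```python
import json
from html.parser import HTMLParser

# Types whose rich results were restricted or removed, per the paragraph above.
LOW_PAYOFF_TYPES = {"FAQPage", "HowTo"}


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip blocks that do not parse; a validator will report those.


def collect_types(node, found=None):
    """Walk nested JSON-LD and gather every @type value."""
    found = set() if found is None else found
    if isinstance(node, dict):
        declared = node.get("@type", [])
        found.update([declared] if isinstance(declared, str) else declared)
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)
    return found


def audit(html):
    """Return the low-payoff types present in the page's JSON-LD, sorted."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return sorted(collect_types(parser.blocks) & LOW_PAYOFF_TYPES)


SAMPLE_PAGE = """
<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "FAQPage",
 "mainEntity": [{"@type": "Question", "name": "Do you deliver?",
                 "acceptedAnswer": {"@type": "Answer", "text": "Yes."}}]}
</script></head><body><h1>Help</h1></body></html>
"""

print(audit(SAMPLE_PAGE))
```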

Why this matters now

For a decade, structured data was an SEO trick — a way to get a star rating into a blue link. That framing is too small now.

Search is becoming answers. Google's AI Overviews, ChatGPT, Perplexity — they don't just rank ten links, they read the web, decide what's true, and cite a handful of sources. So the obvious question: does marking up your pages get you into those answers?

Here's where I'll be straight with you, because plenty of people selling "GEO" won't. A large language model doesn't read your schema. It reads your words. These models were built to make sense of the messy, unlabelled web — there's no parser inside them hunting for <tags>. The model understood your sentence the same way you did: by reading it.

So where does structured data actually earn its keep for AI? Upstream. The entity labels — Organization, Person, and the sameAs links that tie you to your other profiles — feed Google's Knowledge Graph: the entity records answer engines lean on to recognise a brand, tell it apart from the other firms with your name, and quote it correctly. That isn't the model reading your tags. It's the machine you're describing yourself to getting a cleaner picture of who you are.

Even Google says as much. In its first official guidance on optimising for AI search, it lists "over-focusing on structured data" as something you don't need for AI features — while noting, same breath, that it's still worth doing for rich results. Read honestly, that's not "skip schema." It's "don't expect markup to be the thing that makes a model love your prose."

Two things to hold at once, though. This is early — barely into answer engines being a mass habit — and the direction of travel is clear: the more the machines do, the more they reward clean, unambiguous signals about the world. AI is built to read the mess, yes, but anything that makes the machine's job easier is an edge, not a wash. Structured data is about as close to zero-downside as optimisation gets.

And here's the quiet lesson in the data: the entity markup that drives recognition (the Organization and Person types, plus the sameAs property) is common, but most sites do the bare minimum. A logo and a name. Not the full picture a machine needs to trust you.
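
One way to see the difference between bare minimum and full picture is a simple completeness check. The field list below is my own working assumption of what "complete" means for an Organization node, not an official requirement, and the example company is fictional.

```python
# Properties I would expect on a well-described Organization (an assumption).
ENTITY_FIELDS = ("name", "legalName", "url", "logo", "sameAs")


def missing_entity_fields(org):
    """List the expected properties that are absent or empty on an entity node."""
    missing = []
    for field in ENTITY_FIELDS:
        value = org.get(field)
        if value in (None, "", []):
            missing.append(field)
    return missing


# The typical default: a name and a logo, nothing else.
bare_minimum = {
    "@type": "Organization",
    "name": "Example Ltd",
    "logo": "https://www.example.com/logo.png",
}

# A fuller description, with sameAs tying the brand to its other profiles.
fuller = {
    **bare_minimum,
    "legalName": "Example Limited",
    "url": "https://www.example.com/",
    "sameAs": [
        "https://www.linkedin.com/company/example",
        "https://en.wikipedia.org/wiki/Example",
    ],
}

print(missing_entity_fields(bare_minimum))
print(missing_entity_fields(fuller))
```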

Your markup is a claim — your page has to back it

Structured data mostly restates, in a format machines prefer, what your page already says in prose. That's the job. But it cuts both ways: mark up a price, a rating or an author your content doesn't actually support, and the best case is it gets ignored — the worst case is Google treats the mismatch as a violation. The labels and the words have to tell the same story.
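
A crude way to catch mismatches before anyone else does is to check that each marked-up value also appears in the text a visitor can see. This is a rough sketch using plain substring matching. It is not how Google evaluates pages, and the page and claim names are invented for the example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gather text nodes, ignoring anything inside script or style tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def unsupported_claims(html, claims):
    """Return the claim labels whose values never appear in the visible text."""
    parser = VisibleText()
    parser.feed(html)
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [label for label, value in claims.items() if str(value).lower() not in text]


PAGE = (
    "<html><body><h1>Example Kettle</h1><p>Now GBP 39.00</p>"
    '<script type="application/ld+json">'
    '{"price": "39.00", "ratingValue": "4.8"}'
    "</script></body></html>"
)

# The values the markup asserts; the rating is never shown to the visitor.
marked_up = {"price": "39.00", "ratingValue": "4.8"}
print(unsupported_claims(PAGE, marked_up))
```

Anything the check returns is a label the prose does not back up, which is exactly the situation to fix or remove.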

What I'd actually do with this

You don't need to mark up all 958 types. You need to do a few things on purpose instead of by accident:

  1. Own your entity. Make Organization (or Person) complete: legal name, logo, and sameAs links to every official profile — LinkedIn, Crunchbase, Wikipedia, your socials. It's the cheapest brand-visibility work there is, and most sites half-do it.
  2. Add the rich types that fit your content — and that Google still rewards. Product and Review if you sell. Article if you publish. Video if you produce it. Event, JobPosting, Recipe if they're literally what you do.
  3. Stop maintaining markup that pays nothing. Audit your FAQ and How-to schema; if you're not a government or health site, it's dead weight.
  4. Use the new dataset as a competitive lens. A type that's rare and relevant to your niche is a chance to be the rich result no competitor has bothered with.
  5. Validate, then move on. Google's Rich Results Test and the Schema.org validator take minutes. Mark it up, confirm it parses, ship it.
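
As a rough illustration of point 4, the sketch below ranks the types relevant to a site from rarest to most common, using the reach tiers quoted in this article. Mapping the display labels to schema.org type names (VideoObject for Video, JobPosting for Job posting) is my assumption, and the tiers are hard-coded rather than read from the published file.

```python
# Reach tiers from this article (May 2026), ordered rarest first.
TIER_ORDER = ["10K - 100K", "100K - 1M", "1M - 10M", "10M+"]

RICH_TYPE_TIERS = {
    "BreadcrumbList": "10M+",
    "Organization": "10M+",
    "Product": "1M - 10M",
    "Review": "1M - 10M",
    "Article": "1M - 10M",
    "VideoObject": "1M - 10M",
    "Event": "100K - 1M",
    "JobPosting": "100K - 1M",
    "Recipe": "10K - 100K",
    "Course": "10K - 100K",
    "Dataset": "10K - 100K",
}


def rarest_relevant(relevant_types):
    """Order the types that fit your content from least to most adopted."""
    known = [name for name in relevant_types if name in RICH_TYPE_TIERS]
    return sorted(known, key=lambda name: (TIER_ORDER.index(RICH_TYPE_TIERS[name]), name))


# Hypothetical site: a training provider that publishes articles and runs events.
shortlist = rarest_relevant({"Article", "Course", "Event", "Organization"})
print(shortlist)
```

Rarity only matters when the type genuinely describes what is on the page, so the input set should come from your content, not from the dataset.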

Key takeaways

  • For the first time, we can measure how the web uses structured data — and it leans on a tiny core of twelve types.
  • Those twelve are mostly CMS defaults; only breadcrumbs reliably earn a rich result, and one (SearchAction) is already obsolete.
  • The markup that wins visibility — Product, Review, Article, Video, Event — is a full tier less adopted.
  • In the AI era, the entity signals — Organization, Person, sameAs — are the highest-leverage, most-neglected work.
  • Do a few types deliberately. Drop the ones that pay nothing.

This is the first month of what should become a standard reference — a monthly read on how the machine-readable web is actually built. The headline won't shift quickly: most of the web marks up plumbing on autopilot. Which means the gap between default and deliberate is wide open. In a world where machines increasingly decide who gets seen, that isn't a technical footnote. It's a head start.