LLM-Friendly Markup Guide

Document Type: Implementation Guide
Section: Docs
Repository: https://aio.fabledsky.com
Maintainer: Fabled Sky Research
Last updated: April 2025

Scope and Intent

This guide standardizes HTML and semantic-web markup patterns so Large Language Models (LLMs) can ingest, contextualize, and reason over web content with minimal hallucination and maximal recall. It is P1 in the AIO Standards & Protocols category and is required reading for all engineers, technical writers, SEO strategists, and content architects integrating AIO principles.

Terminology

• LLM-Friendly — Markup that is syntactically valid, semantically explicit, and disambiguated enough for an LLM to parse deterministically.
• Canonical Node — The single, authoritative representation of an entity in a document (e.g., Article, Product).
• Semantic Neighborhood — The immediate graph of properties and linked entities surrounding a canonical node.
• Dual Embedding — Publishing schema.org data in both JSON-LD and visible HTML to serve crawlers without penalizing accessibility.
• Fence — Any code block (for example, one delimited by ```) that prevents the LLM from interpreting its contents as prose.

Core Principles

Determinism over Density — Prefer fewer, unambiguous triples to verbose, overlapping vocabularies.  
Canonical First — Each page must expose one—and only one—mainEntity.  
Symmetric Visibility — Ensure that information displayed to users is represented in structured data, and vice-versa.  
Explicit Units & Context — Always qualify numbers, times, currencies, and geographies.  
Immutable Identifiers — Use stable @id IRIs or full URLs to prevent entity drift in model embeddings.

Baseline HTML Structure
<!DOCTYPE html>
<html lang="en" data-aiotype="primary">
  <head>
    <meta charset="utf-8" />
    <title>LLM-Friendly Markup Explained</title>
    <link rel="canonical" href="https://example.com/llm-friendly-markup" />
    <!-- Placeholder: the full JSON-LD graph appears under "JSON-LD Canonical Pattern" -->
    <script type="application/ld+json" id="aio-schema">
      {}
    </script>
  </head>
  <body vocab="https://schema.org/" typeof="Article" resource="#article">
    <header>
      <h1 property="headline">LLM-Friendly Markup Explained</h1>
      <p property="author" typeof="Person">
        <span property="name">Ada Lovelace</span>
      </p>
    </header>
    <article property="articleBody">
      <!-- visible content goes here -->
    </article>
  </body>
</html>
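
The baseline above can be spot-checked mechanically. The following is a minimal sketch using only Python's standard library; the class and function names (BaselineInspector, inspect_baseline) are hypothetical and not part of any AIO tooling. It only reports whether a page declares a language, a canonical URL, and at least one JSON-LD block.

```python
from html.parser import HTMLParser


class BaselineInspector(HTMLParser):
    """Collects the page language, canonical URL, and JSON-LD script bodies."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.canonical = None
        self.jsonld_blocks = []
        self._in_jsonld = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html":
            self.lang = attr.get("lang")
        elif tag == "link" and attr.get("rel") == "canonical":
            self.canonical = attr.get("href")
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld = True
            self.jsonld_blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Script contents arrive as raw text; keep only JSON-LD bodies.
        if self._in_jsonld:
            self.jsonld_blocks[-1] += data


def inspect_baseline(html_text):
    """Hypothetical helper: summarizes the baseline features of one page."""
    parser = BaselineInspector()
    parser.feed(html_text)
    parser.close()
    return {
        "lang": parser.lang,
        "canonical": parser.canonical,
        "jsonld_count": len(parser.jsonld_blocks),
    }
```

A page that follows the baseline should report a non-empty lang, a canonical URL, and a jsonld_count of at least 1.
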
JSON-LD Canonical Pattern
{
  "@context": ["https://schema.org", { "aio": "https://aio.fabledsky.com/vocab#" }],
  "@id": "https://example.com/llm-friendly-markup#article",
  "@type": "Article",
  "mainEntityOfPage": { "@id": "https://example.com/llm-friendly-markup" },
  "headline": "LLM-Friendly Markup Explained",
  "author": {
    "@type": "Person",
    "name": "Ada Lovelace",
    "aio:contributorRole": "Technical Writer"
  },
  "datePublished": "2025-04-16",
  "dateModified": "2025-04-16",
  "keywords": [
    "LLM",
    "Semantic HTML",
    "schema.org",
    "AIO Standards"
  ],
  "aio:priority": "P1"
}
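
Generating this object from code, rather than hand-editing it, keeps the @id and mainEntityOfPage values in step with the page URL. A minimal sketch follows; build_article_jsonld is a hypothetical helper, and it covers only a subset of the fields shown above.

```python
import json


def build_article_jsonld(page_url, headline, author_name, published, modified=None):
    """Hypothetical builder for the canonical Article node shown above."""
    return {
        # The array form keeps schema.org as the base vocabulary and adds
        # the aio prefix next to it.
        "@context": ["https://schema.org", {"aio": "https://aio.fabledsky.com/vocab#"}],
        "@id": f"{page_url}#article",
        "@type": "Article",
        "mainEntityOfPage": {"@id": page_url},
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": published,
        "dateModified": modified or published,
    }


if __name__ == "__main__":
    node = build_article_jsonld(
        "https://example.com/llm-friendly-markup",
        "LLM-Friendly Markup Explained",
        "Ada Lovelace",
        "2025-04-16",
    )
    print(json.dumps(node, indent=2))
```
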
Correct vs. Incorrect Patterns
Correct: Use @context as an array so custom vocab (aio) does not shadow schema.org.
Incorrect:
{ "@context": "https://schema.org aio:https://aio.fabledsky.com/vocab#" }
A space-delimited context is syntactically valid JSON but invalid JSON-LD: a string-valued @context must be a single IRI, so processors cannot resolve it, the aio prefix is never defined, and LLMs may tokenize the value incorrectly.

Correct: Expose the same author once in visible HTML and once in JSON-LD.
Incorrect: Hide a different author in structured data for SEO manipulation.

Correct: Declare the language on every root <html> element (lang="en").
Incorrect: Omit the language; an LLM may default to an incorrect locale and mis-stem tokens.
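
The @context rule in the first pair lends itself to an automated check. The sketch below is a simplified shape check, not a JSON-LD processor; context_problems is a hypothetical name, and it only looks for the space-delimited mistake and for unsupported entry types.

```python
def context_problems(context):
    """Hypothetical check: returns a list of problems found in an @context value."""
    problems = []
    if isinstance(context, str):
        # A string context must be one IRI; whitespace means several values
        # were squeezed into a single string.
        if any(ch.isspace() for ch in context):
            problems.append("string @context contains whitespace; use an array instead")
    elif isinstance(context, list):
        for entry in context:
            problems.extend(context_problems(entry))
    elif not isinstance(context, dict):
        problems.append(f"unsupported @context entry of type {type(context).__name__}")
    return problems


if __name__ == "__main__":
    good = ["https://schema.org", {"aio": "https://aio.fabledsky.com/vocab#"}]
    bad = "https://schema.org aio:https://aio.fabledsky.com/vocab#"
    print(context_problems(good))
    print(context_problems(bad))
```
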
Semantic Neighborhood Design

1. Identify the Canonical Node (Article, Product, Dataset).
2. Map first-degree properties: headline, name, description, image, url.
3. Attach second-degree entities (Person, Organization) with explicit @ids.
4. Validate using both the W3C RDF validator and the AIO linter (aio lint markup).
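
The mapping steps above can be sketched as a walk over the canonical node. This is an illustrative helper (describe_neighborhood is a hypothetical name) that lists first-degree properties and flags typed second-degree entities that lack an @id; it does not replace the validators named in the last step.

```python
def describe_neighborhood(node):
    """Hypothetical helper: returns (first_degree_properties, properties_missing_ids)."""
    first_degree = []
    missing_ids = []
    for key, value in node.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords are not properties of the entity
        first_degree.append(key)
        values = value if isinstance(value, list) else [value]
        for item in values:
            # A nested object with @type but no @id is an anonymous entity.
            if isinstance(item, dict) and "@type" in item and "@id" not in item:
                missing_ids.append(key)
                break
    return first_degree, missing_ids


if __name__ == "__main__":
    article = {
        "@id": "https://example.com/llm-friendly-markup#article",
        "@type": "Article",
        "headline": "LLM-Friendly Markup Explained",
        "author": {"@type": "Person", "name": "Ada Lovelace"},
    }
    print(describe_neighborhood(article))
```

In this usage example the author is reported as missing an @id, which is the situation the third step is meant to catch.
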

Microdata + RDFa Hybrid Example
<section vocab="https://schema.org/" typeof="Product" resource="#prod42">
  <h2 property="name">Photon Stabilizer 3000</h2>
  <img property="image" src="/img/photon.jpg" alt="Photon Stabilizer 3000" />
  <p>
    <span property="description">
      A compact, LLM-optimized photon stabilizer for quantum workloads.
    </span>
  </p>
  <data property="sku" value="PS-3000-LLM">PS-3000-LLM</data>
  <span property="offers" typeof="Offer">
    $<span property="price" content="1999.00">1,999</span>
    <meta property="priceCurrency" content="USD" />
  </span>
</section>
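This hybrid approach gives crawlers inline, machine-parseable attributes while preserving JSON-LD as the canonical graph. The attributes shown (vocab, typeof, property) are RDFa Lite; the Microdata equivalents are itemscope, itemtype, and itemprop.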
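
To see what a simple consumer might recover from the inline attributes, the sketch below collects property values with Python's standard-library HTML parser. It is a deliberately simplified reading of the markup, not an RDFa processor: the names PropertyCollector and extract_properties are hypothetical, machine-readable attributes (content, value, src) are preferred over visible text, and nested entities such as the Offer are skipped rather than modeled.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "meta", "link", "br", "hr", "input"}


class PropertyCollector(HTMLParser):
    """Collects literal values for elements that carry a property attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}   # property name -> literal value
        self._open = []    # stack of [tag, property-or-None, text parts]

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        prop = attr.get("property")
        literal = attr.get("content") or attr.get("value") or attr.get("src")
        if prop and literal is not None:
            self.values[prop] = literal  # explicit machine-readable value wins
            prop = None
        elif prop and "typeof" in attr:
            prop = None  # nested entity, not a literal
        if tag not in VOID_TAGS:
            self._open.append([tag, prop, []])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        while self._open:
            open_tag, prop, parts = self._open.pop()
            if prop:
                # Fall back to the visible text, with whitespace collapsed.
                self.values[prop] = " ".join("".join(parts).split())
            if open_tag == tag:
                break

    def handle_data(self, data):
        for entry in self._open:
            if entry[1]:
                entry[2].append(data)


def extract_properties(html_text):
    """Hypothetical helper: returns a dict of property name to literal value."""
    collector = PropertyCollector()
    collector.feed(html_text)
    collector.close()
    return collector.values
```

Applied to the product markup above, the price comes back as the content value (1999.00) rather than the display text (1,999), which is the reason for carrying both.
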
Disambiguation Strategies
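• Units: Annotate measurements with unitCode or unitText on a full QuantitativeValue (e.g., <meta property="unitText" content="kg">), and express currencies with priceCurrency using ISO 4217 codes (e.g., USD).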

• Time: ISO-8601 only (2025-04-16T05:01:00Z).

• Person names: first-class Person entity rather than a string whenever a biography link exists.

• Acronyms: When an acronym is domain-specific, provide its expansion as an rdfs:label in the vocabulary that @context points to.
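
The time and unit rules above are easy to enforce at serialization time. A minimal sketch, assuming values are produced in Python; iso_utc and quantitative_value are hypothetical helper names, and KGM is the UN/CEFACT code for kilogram.

```python
from datetime import datetime, timezone


def iso_utc(moment):
    """Formats an aware datetime as an ISO-8601 UTC timestamp ending in Z."""
    if moment.tzinfo is None:
        # Refuse to guess a time zone for naive values.
        raise ValueError("naive datetime: attach a time zone first")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def quantitative_value(value, unit_code):
    """Builds a schema.org QuantitativeValue with an explicit unit code."""
    return {"@type": "QuantitativeValue", "value": value, "unitCode": unit_code}


if __name__ == "__main__":
    print(iso_utc(datetime(2025, 4, 16, 5, 1, tzinfo=timezone.utc)))  # 2025-04-16T05:01:00Z
    print(quantitative_value(1.2, "KGM"))
```
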
Fenced Code Blocks
LLMs treat fenced content as literal text rather than prose. Use fences for examples, not for critical semantic cues.

Bad:
<article>
```html <!-- nested fence confusing LLM -->
...
```
</article>

Good: Separate narrative from fenced blocks, and never nest fences that use the same delimiter at the same depth.
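
A linter can catch the most common nesting mistake: a line that looks like a fence opener (delimiter plus a language tag) while a fence with the same delimiter is still open. The sketch below is a simplified heuristic; find_nested_fences is a hypothetical name and the regular expression covers only backtick and tilde fences with up to three spaces of indentation.

```python
import re

# A fence line: up to three spaces, then three or more backticks or tildes,
# followed by an optional info string such as a language tag.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def find_nested_fences(text):
    """Returns 1-based line numbers that try to open a fence inside an open fence."""
    open_fence = None
    nested = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            open_fence = marker
        elif marker[0] == open_fence[0] and len(marker) >= len(open_fence):
            if info:
                nested.append(number)  # an opener-style line inside an open fence
            else:
                open_fence = None      # a bare delimiter closes the fence
    return nested
```

An empty result means no same-delimiter nesting was detected; it does not prove the document renders as intended.
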
Accessibility Alignment
Because LLM training data is likely to include accessible, well-structured pages, meeting WCAG 2.2 AA guidelines tends to increase parse fidelity:

• Always pair aria-label with visual cues.

• Use <figure>/<figcaption> for images; map figcaption to schema:image.caption.

• Provide language alternatives linked via inLanguage.
Testing Workflow

1. Run npm run build && aio lint markup ./dist.
2. Inspect the generated triples in Turtle: aio export ./dist --to turtle.
3. Load the graph into a local SPARQL endpoint and verify the canonical node count, which must equal 1: PREFIX schema: <https://schema.org/> SELECT (COUNT(?s) AS ?count) WHERE { ?s a schema:Article }
4. Pass the output through an open-source LLM (e.g., Llama-3-instruct) and ask factual recall questions. Expect a hallucination rate of ≤1%.
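
Before a SPARQL endpoint is available, the canonical node count can be approximated directly on the JSON-LD. The sketch below is a stand-in for the query in the workflow, not an equivalent: count_nodes_of_type is a hypothetical helper that matches compact type names as written and does not expand prefixes or resolve contexts.

```python
def count_nodes_of_type(document, type_name):
    """Recursively counts JSON-LD nodes whose @type includes type_name."""
    if isinstance(document, list):
        return sum(count_nodes_of_type(item, type_name) for item in document)
    if not isinstance(document, dict):
        return 0
    types = document.get("@type", [])
    if isinstance(types, str):
        types = [types]
    own = 1 if type_name in types else 0
    # Descend into nested values, including @graph arrays.
    return own + sum(count_nodes_of_type(value, type_name) for value in document.values())


if __name__ == "__main__":
    page = {
        "@context": "https://schema.org",
        "@type": "Article",
        "author": {"@type": "Person", "name": "Ada Lovelace"},
    }
    # The guide requires exactly one canonical node per page.
    assert count_nodes_of_type(page, "Article") == 1
```
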

Versioning and Change Management
• Increment aio:schemaVersion in @context on breaking changes.

• Keep an immutable archive in /schema/versions/{semver}/.

• Notify the AIO Schema mailing list 14 days before deprecation.
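
The three rules above reduce to small calculations. The sketch below assumes that aio:schemaVersion holds a MAJOR.MINOR.PATCH string and that a breaking change bumps the major component; both are assumptions, since the guide only says to increment the value. The function names are hypothetical.

```python
from datetime import date, timedelta


def bump_major(version):
    """Returns the next major version for a MAJOR.MINOR.PATCH string."""
    major, _minor, _patch = (int(part) for part in version.split("."))
    return f"{major + 1}.0.0"


def archive_path(version):
    """Archive location for one schema version, following the pattern above."""
    return f"/schema/versions/{version}/"


def notice_deadline(deprecation_date, lead_days=14):
    """Latest date on which the mailing-list notice can be sent."""
    return deprecation_date - timedelta(days=lead_days)


if __name__ == "__main__":
    next_version = bump_major("1.4.2")
    print(next_version, archive_path(next_version))
    print(notice_deadline(date(2025, 5, 1)))
```
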
Security and Integrity
To prevent prompt injection via embedded JSON-LD:

• Sanitize user-generated properties server-side.

• Validate IRIs against an allow-list to block malicious @id references (e.g., javascript: URIs).

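• Set a Content-Security-Policy header that includes script-src 'self'. JSON-LD blocks are data and are never executed, but the policy stops injected markup that escapes the block from running inline scripts.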
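
The first two measures can be sketched in a few lines. The allow-list contents and the function names below are assumptions chosen for illustration. The escaping step replaces "<" with its JSON escape so a user-supplied value cannot close the surrounding script element, while the parsed data stays unchanged.

```python
import json
from urllib.parse import urlsplit

# Hypothetical allow-list; a real deployment would load this from configuration.
ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"example.com"}


def is_allowed_iri(iri):
    """Accepts only IRIs whose scheme and host are both on the allow-list."""
    parts = urlsplit(iri)
    return parts.scheme in ALLOWED_SCHEMES and parts.hostname in ALLOWED_HOSTS


def to_script_safe_json(data):
    """Serializes JSON-LD for embedding inside a script element."""
    # \u003c decodes back to "<" in any JSON parser, but the HTML parser
    # never sees a literal "</script>" in the serialized text.
    return json.dumps(data).replace("<", "\\u003c")


if __name__ == "__main__":
    print(is_allowed_iri("javascript:alert(1)"))
    print(to_script_safe_json({"name": "</script><script>alert(1)</script>"}))
```
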
Reference Implementations
• AIO Demo Portal: https://demo.aio.fabledsky.com

• Fabled Sky GitHub Template: https://github.com/FabledSky/aio-markup-starter  
Additional Resources
• W3C HTML Living Standard

• schema.org Full Hierarchy (CSV dump updated nightly)

• “Designing Data-Intensive Applications” — Chapter 12 for data contracts  
Adhering to these patterns ensures that your content becomes a first-class citizen in LLM knowledge graphs, reduces semantic drift, and aligns fully with the AIO vision of deterministic, machine-interpretable web experiences.
