Fabled Sky Research

AIO Standards & Frameworks

LLM-Friendly Markup Guide

Document Type: Implementation Guide
Section: Docs
Repository: https://aio.fabledsky.com
Maintainer: Fabled Sky Research
Last updated: April 2025

Scope and Intent

This guide standardizes HTML and semantic-web markup patterns so Large Language Models (LLMs) can ingest, contextualize, and reason over web content with minimal hallucination and maximal recall. It is P1 in the AIO Standards & Protocols category and is required reading for all engineers, technical writers, SEO strategists, and content architects integrating AIO principles.

Terminology

• LLM-Friendly — Markup that is syntactically valid, semantically explicit, and disambiguated enough for an LLM to parse deterministically.
• Canonical Node — The single, authoritative representation of an entity in a document (e.g., Article, Product).
• Semantic Neighborhood — The immediate graph of properties and linked entities surrounding a canonical node.
• Dual Embedding — Publishing schema.org data in both JSON-LD and visible HTML to serve crawlers without penalizing accessibility.
• Fence — Any code block (an HTML <pre> element, a ``` fence, etc.) that prevents the LLM from interpreting contents as prose.

Core Principles

  1. Determinism over Density — Prefer fewer, unambiguous triples to verbose, overlapping vocabularies.
  2. Canonical First — Each page must expose one—and only one—mainEntity.
  3. Symmetric Visibility — Ensure that information displayed to users is represented in structured data, and vice-versa.
  4. Explicit Units & Context — Always qualify numbers, times, currencies, and geographies.
  5. Immutable Identifiers — Use stable @id IRIs or full URLs to prevent entity drift in model embeddings.

Baseline HTML Structure

<!DOCTYPE html>
<html lang="en" data-aiotype="primary">
  <head>
    <meta charset="utf-8" />
    <title>LLM-Friendly Markup Explained</title>
    <link rel="canonical" href="https://example.com/llm-friendly-markup" />
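    <!-- JSON does not allow comments, so the placeholder is an empty object.
         Replace it with the JSON-LD from the "JSON-LD Canonical Pattern" section. -->
    <script type="application/ld+json" id="aio-schema">
      {}
    </script>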
  </head>
  <body vocab="https://schema.org/" typeof="Article" resource="#article">
    <header>
      <h1 property="headline">LLM-Friendly Markup Explained</h1>
      <p property="author" typeof="Person">
        <span property="name">Ada Lovelace</span>
      </p>
    </header>
    <article property="articleBody">
      <!-- visible content goes here -->
    </article>
  </body>
</html>
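
To confirm that a page built on this skeleton exposes machine-readable JSON-LD, the script block can be pulled out and parsed. The following is a minimal sketch using only the Python standard library; the class name JsonLdExtractor, the helper extract_jsonld, and the sample page are illustrative assumptions, not part of any AIO tooling.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every application/ld+json script block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            # json.loads raises json.JSONDecodeError on invalid JSON.
            self.blocks.append(json.loads("".join(self._buffer)))


def extract_jsonld(html_text):
    """Return a list with one parsed object per JSON-LD script block."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.blocks


# Hypothetical sample page, reduced to the parts this sketch needs.
SAMPLE_PAGE = """<!DOCTYPE html>
<html lang="en">
  <head>
    <script type="application/ld+json" id="aio-schema">
      {"@context": "https://schema.org", "@type": "Article",
       "headline": "LLM-Friendly Markup Explained"}
    </script>
  </head>
  <body></body>
</html>"""

blocks = extract_jsonld(SAMPLE_PAGE)
print(blocks[0]["headline"])
```

Because the block is parsed as strict JSON, any comment or trailing comma inside the script element surfaces as a json.JSONDecodeError rather than passing silently.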

JSON-LD Canonical Pattern

{
  "@context": ["https://schema.org", { "aio": "https://aio.fabledsky.com/vocab#" }],
  "@id": "https://example.com/llm-friendly-markup#article",
  "@type": "Article",
  "mainEntityOfPage": { "@id": "https://example.com/llm-friendly-markup" },
  "headline": "LLM-Friendly Markup Explained",
  "author": {
    "@type": "Person",
    "name": "Ada Lovelace",
    "aio:contributorRole": "Technical Writer"
  },
  "datePublished": "2025-04-16",
  "dateModified": "2025-04-16",
  "keywords": [
    "LLM",
    "Semantic HTML",
    "schema.org",
    "AIO Standards"
  ],
  "aio:priority": "P1"
}
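
A quick completeness check on the canonical node can catch missing fields before publishing. The sketch below assumes a minimum field set (REQUIRED_KEYS) chosen for illustration; the guide itself does not publish an official list, and the helper names are hypothetical.

```python
import json

# Assumed minimum field set for illustration only, not an official AIO list.
REQUIRED_KEYS = ("@context", "@id", "@type", "headline", "datePublished", "dateModified")

# A trimmed copy of the canonical pattern above.
ARTICLE = {
    "@context": ["https://schema.org", {"aio": "https://aio.fabledsky.com/vocab#"}],
    "@id": "https://example.com/llm-friendly-markup#article",
    "@type": "Article",
    "mainEntityOfPage": {"@id": "https://example.com/llm-friendly-markup"},
    "headline": "LLM-Friendly Markup Explained",
    "author": {"@type": "Person", "name": "Ada Lovelace"},
    "datePublished": "2025-04-16",
    "dateModified": "2025-04-16",
}


def missing_keys(node):
    """Return the required keys that are absent from a JSON-LD node."""
    return [key for key in REQUIRED_KEYS if key not in node]


# Round-trip through json to confirm the node serializes as strict JSON.
serialized = json.dumps(ARTICLE, indent=2, ensure_ascii=False)
round_tripped = json.loads(serialized)

print(missing_keys(ARTICLE))
print(missing_keys({"@type": "Article"}))
```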

Correct vs. Incorrect Patterns

Correct: Use @context as an array so custom vocab (aio) does not shadow schema.org.
Incorrect:

{ "@context": "https://schema.org aio:https://aio.fabledsky.com/vocab#" }

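Reason: this string is valid JSON but not a valid JSON-LD context. A processor reads it as a single malformed IRI, the aio prefix is never defined, and an LLM has no reliable way to separate the two vocabularies.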

Correct: Expose author once in visible HTML and JSON-LD.
Incorrect: Hide a different author in structured data for SEO manipulation.

Correct: Declare language on every root <html> element (lang="en").
Incorrect: Omit language; LLM may default to incorrect locale and mis-stem tokens.
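
The @context rule above lends itself to an automated check. The following sketch flags context strings that contain whitespace and entries that are neither strings nor objects; the function name context_problems is an assumption, and the check is far narrower than what a real JSON-LD processor enforces.

```python
def context_problems(context):
    """Return a list of human-readable problems found in an @context value."""
    problems = []
    entries = context if isinstance(context, list) else [context]
    for entry in entries:
        if isinstance(entry, str):
            # A context string must be a single IRI, so whitespace is a red flag.
            if any(ch.isspace() for ch in entry):
                problems.append(f"context string contains whitespace: {entry!r}")
        elif not isinstance(entry, dict):
            problems.append(f"unsupported context entry: {entry!r}")
    return problems


GOOD_CONTEXT = ["https://schema.org", {"aio": "https://aio.fabledsky.com/vocab#"}]
BAD_CONTEXT = "https://schema.org aio:https://aio.fabledsky.com/vocab#"

print(context_problems(GOOD_CONTEXT))
print(context_problems(BAD_CONTEXT))
```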

Semantic Neighborhood Design

  1. Identify the Canonical Node (Article, Product, Dataset).
  2. Map first-degree properties: headline, name, description, image, url.
  3. Attach second-degree entities (Person, Organization) with explicit @ids.
  4. Validate using both the W3C RDF validator and the AIO linter (aio lint markup).
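
Step 3 can be checked mechanically by walking the canonical node and listing nested typed entities that have no @id. This is a minimal sketch over plain Python dictionaries; the function name entities_missing_id and the path notation are assumptions made for the example.

```python
def entities_missing_id(node, path="$"):
    """Return the paths of nested typed entities that lack an @id."""
    missing = []
    for key, value in node.items():
        is_list = isinstance(value, list)
        children = value if is_list else [value]
        for index, child in enumerate(children):
            if not isinstance(child, dict):
                continue
            child_path = f"{path}.{key}[{index}]" if is_list else f"{path}.{key}"
            # Only typed nodes are entities; bare {"@id": ...} references are fine.
            if "@type" in child and "@id" not in child:
                missing.append(child_path)
            missing.extend(entities_missing_id(child, child_path))
    return missing


ARTICLE = {
    "@id": "https://example.com/llm-friendly-markup#article",
    "@type": "Article",
    "mainEntityOfPage": {"@id": "https://example.com/llm-friendly-markup"},
    "author": {"@type": "Person", "name": "Ada Lovelace"},
    "publisher": {
        "@id": "https://example.com/#org",  # hypothetical identifier
        "@type": "Organization",
        "name": "Example Org",
    },
}

print(entities_missing_id(ARTICLE))
```

Here the author entity is reported because it carries a @type but no @id, while the publisher and the mainEntityOfPage reference pass.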

RDFa + JSON-LD Hybrid Example

<section vocab="https://schema.org/" typeof="Product" resource="#prod42">
  <h2 property="name">Photon Stabilizer 3000</h2>
  <img property="image" src="/img/photon.jpg" alt="Photon Stabilizer 3000" />
  <p>
    <span property="description">
      A compact, LLM-optimized photon stabilizer for quantum workloads.
    </span>
  </p>
  <data property="sku" value="PS-3000-LLM">PS-3000-LLM</data>
  <span property="offers" typeof="Offer">
    $<span property="price" content="1999.00">1,999</span>
    <meta property="priceCurrency" content="USD" />
  </span>
</section>

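The attributes used here (vocab, typeof, property, resource) are RDFa Lite rather than Microdata, which would use itemscope, itemtype, and itemprop instead. Pairing inline attributes with JSON-LD gives crawlers values they can read directly from the visible markup while preserving JSON-LD as the canonical graph.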
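
To see what a consumer can recover from the inline attributes, the sketch below collects property values from the product markup with the standard-library HTML parser. It is a deliberate simplification and not a conforming RDFa processor: it ignores nesting, prefixes, and resource resolution, and the class name PropertyCollector is an assumption.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collects literal values for elements that carry a property attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._pending = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("property")
        if prop is None or "typeof" in attrs:
            return  # untyped containers and nested entities hold no literal
        # Machine-readable attributes take precedence over visible text.
        for literal_attr in ("content", "value", "src", "href"):
            if literal_attr in attrs:
                self.values[prop] = attrs[literal_attr]
                return
        self._pending = prop  # fall back to the element's text

    def handle_data(self, data):
        text = " ".join(data.split())
        if self._pending and text:
            self.values[self._pending] = text
            self._pending = None


PRODUCT_HTML = """
<section vocab="https://schema.org/" typeof="Product" resource="#prod42">
  <h2 property="name">Photon Stabilizer 3000</h2>
  <img property="image" src="/img/photon.jpg" alt="Photon Stabilizer 3000" />
  <data property="sku" value="PS-3000-LLM">PS-3000-LLM</data>
  <span property="offers" typeof="Offer">
    $<span property="price" content="1999.00">1,999</span>
    <meta property="priceCurrency" content="USD" />
  </span>
</section>
"""

collector = PropertyCollector()
collector.feed(PRODUCT_HTML)
collector.close()
print(collector.values)
```

Note that the price comes from the content attribute (1999.00) rather than the formatted display text (1,999), which is the reason the markup carries both.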

Disambiguation Strategies

• Units: Annotate measurements with unitCode or unitText on a full QuantitativeValue (e.g., <meta property="unitCode" content="KGM">), and annotate monetary amounts with priceCurrency (e.g., USD).
• Time: ISO 8601 only, with an explicit UTC offset for timestamps (2025-04-16T05:01:00Z).
• Person names: Use a first-class Person entity rather than a string whenever a biography link exists.
• Acronyms: Expand domain-specific acronyms explicitly, for example with an rdfs:label on the entity and the rdfs prefix mapped in @context.
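
The time rule can be enforced at build time. The following sketch parses a timestamp, rejects values without an explicit offset, and normalizes the result to UTC; the function name parse_utc_timestamp is an assumption. It targets full timestamps only, so date-only values such as 2025-04-16 are rejected by design.

```python
from datetime import datetime, timezone


def parse_utc_timestamp(text):
    """Parse an ISO 8601 timestamp with an explicit offset and return it in UTC."""
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so rewrite it as an explicit offset for older interpreters.
    normalized = text[:-1] + "+00:00" if text.endswith("Z") else text
    parsed = datetime.fromisoformat(normalized)  # raises ValueError if malformed
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {text!r}")
    return parsed.astimezone(timezone.utc)


def is_valid_timestamp(text):
    """Return True when the text passes parse_utc_timestamp."""
    try:
        parse_utc_timestamp(text)
    except ValueError:
        return False
    return True


print(parse_utc_timestamp("2025-04-16T05:01:00Z").isoformat())
print(is_valid_timestamp("04/16/2025 5:01 AM"))
```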

Fenced Code Blocks

LLMs generally treat fenced content as literal text rather than prose. Use fences for examples, never for critical semantic cues.

Bad:

<article>
```html <!-- nested fence confusing LLM -->
...

Good:
Separate narrative and fenced blocks, and never nest fences of the same delimiter depth.
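
A simple scanner can flag the two failure modes above before content ships. This sketch assumes triple-backtick fences only (longer fences and tilde fences are out of scope) and uses a hypothetical function name, fence_problems.

```python
FENCE = "`" * 3  # built at runtime so this example never contains a literal fence


def fence_problems(text):
    """Return descriptions of unclosed fences and fences opened inside another fence."""
    problems = []
    open_line = None
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        info = stripped[len(FENCE):].strip()
        if open_line is None:
            open_line = number  # opening fence
        elif info:
            # A fence with an info string cannot close a block, so it is nested.
            problems.append(f"line {number}: fence opened inside the fence from line {open_line}")
        else:
            open_line = None  # closing fence
    if open_line is not None:
        problems.append(f"line {open_line}: fence is never closed")
    return problems


BAD = f"<article>\n{FENCE}html\n...\n"
GOOD = f"Intro text.\n{FENCE}html\n<p>Example</p>\n{FENCE}\nClosing text.\n"
NESTED = f"{FENCE}md\n{FENCE}html\n<p>Example</p>\n{FENCE}\n{FENCE}\n"

print(fence_problems(BAD))
print(fence_problems(GOOD))
print(fence_problems(NESTED))
```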

Accessibility Alignment

LLM training corpora plausibly include a large share of accessibility-conformant markup, and accessible markup makes structure explicit, so meeting WCAG 2.2 AA guidelines should increase parse fidelity:
• Always pair aria-label with visual cues.
• Use <figure>/<figcaption> for images; map figcaption to schema:image.caption.
• Provide language alternatives linked via inLanguage.
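
One small part of this alignment is easy to automate: finding images that have no alt attribute at all. The sketch below is an illustrative check, not a WCAG audit, and the class name ImgAltAudit is an assumption. An empty alt value is accepted because it is the conventional marker for decorative images.

```python
from html.parser import HTMLParser


class ImgAltAudit(HTMLParser):
    """Records the src of every <img> element that lacks an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "alt" not in attrs:
            self.missing_alt.append(attrs.get("src", "(no src)"))


SAMPLE = """
<figure>
  <img src="/img/photon.jpg" alt="Photon Stabilizer 3000" />
  <figcaption>Photon Stabilizer 3000, front view.</figcaption>
</figure>
<img src="/img/divider.png" alt="" />
<img src="/img/unlabeled.png" />
"""

audit = ImgAltAudit()
audit.feed(SAMPLE)
audit.close()
print(audit.missing_alt)
```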

Testing Workflow

  1. Run npm run build && aio lint markup ./dist.
  2. Inspect generated triples in Turtle: aio export ./dist --to turtle.
  3. Load the graph into a local SPARQL endpoint and verify the canonical node count (SELECT (COUNT(?s) AS ?count) WHERE { ?s a schema:Article }, with the schema: prefix bound to the schema.org namespace used in the exported triples). The count must equal 1.
  4. Pass the output through an open-source LLM (e.g., Llama-3-instruct) and ask factual recall questions. Expect ≤1% hallucination.
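
When no SPARQL endpoint is at hand, the canonical-node check in step 3 can be approximated directly on the JSON-LD. The sketch below is a stand-in, not an equivalent: it counts typed occurrences in the JSON tree without expanding contexts or merging nodes that share an @id. The function name count_nodes_of_type is an assumption.

```python
def count_nodes_of_type(document, wanted_type):
    """Count JSON objects whose @type includes wanted_type, at any depth."""
    count = 0
    stack = [document]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            types = current.get("@type", [])
            if isinstance(types, str):
                types = [types]
            if wanted_type in types:
                count += 1
            stack.extend(current.values())
        elif isinstance(current, list):
            stack.extend(current)
    return count


PAGE = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "LLM-Friendly Markup Explained",
    "author": {"@type": "Person", "name": "Ada Lovelace"},
}

# A page that violates the one-canonical-node rule.
DUPLICATED = {"@context": "https://schema.org", "@graph": [PAGE, dict(PAGE)]}

print(count_nodes_of_type(PAGE, "Article"))
print(count_nodes_of_type(DUPLICATED, "Article"))
```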

Versioning and Change Management

• Increment aio:schemaVersion in @context on breaking changes.
• Keep an immutable archive in /schema/versions/{semver}/.
• Notify the AIO Schema mailing list 14 days before deprecation.
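
These three rules reduce to a little version arithmetic. The sketch below assumes the usual semantic-versioning convention that a breaking change bumps the major number, and reads "14 days before deprecation" as a latest-notice date; all function names are hypothetical.

```python
from datetime import date, timedelta


def parse_semver(version):
    """Split a MAJOR.MINOR.PATCH string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_breaking(old_version, new_version):
    """Treat a major-version increase as a breaking change (semver convention)."""
    return parse_semver(new_version)[0] > parse_semver(old_version)[0]


def archive_path(version):
    """Build the immutable archive path for a schema version."""
    parse_semver(version)  # raises ValueError if the version is malformed
    return f"/schema/versions/{version}/"


def latest_notice_date(deprecation_date):
    """Return the last day on which the mailing-list notice can be sent."""
    return deprecation_date - timedelta(days=14)


print(is_breaking("1.4.2", "2.0.0"))
print(archive_path("2.0.0"))
print(latest_notice_date(date(2025, 5, 1)))
```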

Security and Integrity

To prevent prompt injection via embedded JSON-LD:
• Sanitize user-generated properties server-side.
• Validate IRIs against an allow-list to block malicious @id references (e.g., javascript: URIs).
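• Set a Content-Security-Policy header that includes script-src 'self'. JSON-LD script blocks are data blocks and are never executed, so the policy does not restrict the JSON-LD itself; it stops injected inline scripts from running if a user-supplied value breaks out of the block.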
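
Two of these safeguards can be sketched in a few lines: an allow-list check for @id values and an embedding helper that escapes "<" so a value containing </script> cannot terminate the script element early. The allowed scheme and host sets and both function names are assumptions for illustration.

```python
import json
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}
ALLOWED_HOSTS = {"example.com"}  # hypothetical allow-list


def is_allowed_iri(iri):
    """Accept only IRIs whose scheme and host are both on the allow-list."""
    parts = urlsplit(iri)
    return parts.scheme in ALLOWED_SCHEMES and parts.hostname in ALLOWED_HOSTS


def jsonld_for_script_tag(node):
    """Serialize JSON-LD so it is safe to place inside a <script> element."""
    # "\u003c" is a valid JSON escape for "<", so parsers still recover the value.
    return json.dumps(node, ensure_ascii=False).replace("<", "\\u003c")


hostile = {"headline": "</script><script>alert(1)</script>"}
embedded = jsonld_for_script_tag(hostile)

print(is_allowed_iri("https://example.com/llm-friendly-markup#article"))
print(is_allowed_iri("javascript:alert(1)"))
print(embedded)
```

Escaping at serialization time complements server-side sanitization; it does not replace it.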

Reference Implementations

• AIO Demo Portal: https://demo.aio.fabledsky.com
• Fabled Sky GitHub Template: https://github.com/FabledSky/aio-markup-starter

Additional Resources

• W3C HTML Living Standard
• schema.org Full Hierarchy (CSV dump updated nightly)
• “Designing Data-Intensive Applications” — Chapter 12 for data contracts

Adhering to these patterns ensures that your content becomes a first-class citizen in LLM knowledge graphs, reduces semantic drift, and aligns fully with the AIO vision of deterministic, machine-interpretable web experiences.