Document Type: Implementation Guide
Section: Docs
Repository: https://aio.fabledsky.com
Maintainer: Fabled Sky Research
Last updated: April 2025
Overview
Long-form digital assets—whitepapers, knowledge bases, e-books, and multipart API references—often exceed the effective context window of today’s Large Language Models (LLMs). Naïve pagination or arbitrary text chunking can break semantic continuity, causing hallucinations, diminished answer quality, and loss of referential integrity (e.g., tables, footnotes, figure references).
This guide defines AIO-compliant methods for paginating and splitting content so that:
- Each chunk remains independently comprehensible.
- Forward and backward references survive across chunk boundaries.
- Automated scoring pipelines can evaluate whether the split content preserves LLM comprehension.
Terminology
| Term | Definition |
|---|---|
| Chunk | The smallest text unit purposely fed into an LLM window. |
| Continuity Token (CT) | A structured anchor string that denotes parent/child chunk relationships. |
| Lead-in | A deterministic recap sentence prepended to every chunk after the first. |
| Lead-out | A deterministic preview sentence appended to every chunk before the last. |
| Overlap Window | A repeated character or token sequence shared between consecutive chunks to maintain context (see the sketch after this table). |
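The Overlap Window can be illustrated with a short sketch. This is illustrative only, assuming tiktoken's `cl100k_base` encoding (as in the reference implementation below) and the 5 % cap recommended in the pitfalls table at the end of this guide:

```python
# Sketch of an Overlap Window: repeat the tail of chunk n at the head of chunk n+1.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")

def add_overlap(chunks: list[str], nmax: int, ratio: float = 0.05) -> list[str]:
    overlap_budget = int(nmax * ratio)   # tokens repeated between neighbours (<= 5 % of NMAX)
    out = [chunks[0]]
    for prev, curr in zip(chunks, chunks[1:]):
        tail = ENC.decode(ENC.encode(prev)[-overlap_budget:])
        out.append(tail + curr)          # repeated context preserves continuity across the boundary
    return out
```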
Architectural Requirements
- Hard cap of `Nmax = ctx_size - buf_size` tokens per chunk, where `buf_size` = reserved tokens for system prompts + CTs (default 512 for GPT-4-Turbo-class models); a validation sketch follows this list.
- Bidirectional CT graph stored in a machine-readable manifest (`chunks.json`).
- Deterministic overflow strategy: when a semantic unit (e.g., a list or code block) straddles `Nmax`, shift the entire unit to the next chunk; never split intra-unit.
- Schema.org/CreativeWork representation for each chunk to maintain metadata parity.
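A minimal validation sketch of the hard cap, assuming the same `cl100k_base` encoding and the default 16 384-token context and 512-token buffer used in the reference implementation below (other model classes will differ):

```python
# A chunk is AIO-valid only if its full token count, including CTs and
# lead-in/lead-out text, fits inside ctx_size - buf_size.
import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
CTX_SIZE = 16_384          # model context window (GPT-4-Turbo-class default)
BUF_SIZE = 512             # reserved for system prompt + Continuity Tokens
NMAX = CTX_SIZE - BUF_SIZE

def fits_hard_cap(chunk: str) -> bool:
    return len(ENC.encode(chunk)) <= NMAX
```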
Content Splitting Algorithm
```mermaid
flowchart TD
    A[Load Markdown] --> B[Tokenize via tiktoken]
    B --> C{Tokens > Nmax?}
    C -- No --> D[Emit Chunk 01]
    C -- Yes --> E[Find Last Safe Breakpoint ≤ Nmax]
    E --> F[Insert Lead-out + CT Fwd]
    F --> G[Emit Chunk n]
    G --> H[Create Lead-in + CT Bwd]
    H --> C
```
Reference Implementation (Python 3.11)
```python
from pathlib import Path
import json

import tiktoken

ENC = tiktoken.get_encoding("cl100k_base")
CTX_SIZE = 16384
BUF_SIZE = 512
NMAX = CTX_SIZE - BUF_SIZE

# Token IDs that count as safe Markdown breakpoints (newline / heading marker).
BOUNDARY_TOKENS = set(ENC.encode("\n#"))


def split_markdown(path: str) -> list[str]:
    text = Path(path).read_text()
    tokens = ENC.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + NMAX, len(tokens))
        # Backtrack to the nearest Markdown boundary so no semantic unit is split.
        while start < end < len(tokens) and tokens[end] not in BOUNDARY_TOKENS:
            end -= 1
        if end <= start:
            # No boundary inside the window: fall back to a hard cut at NMAX.
            end = min(start + NMAX, len(tokens))
        chunks.append(ENC.decode(tokens[start:end]))
        start = end
    return chunks


def add_ct_graph(chunks: list[str]) -> list[dict]:
    enriched = []
    for idx, content in enumerate(chunks):
        ct = {
            "id": f"chunk-{idx + 1:03}",
            "prev": None if idx == 0 else f"chunk-{idx:03}",
            "next": None if idx == len(chunks) - 1 else f"chunk-{idx + 2:03}",
        }
        enriched.append({"ct": ct, "content": content})
    return enriched


md_chunks = add_ct_graph(split_markdown("guide.md"))
Path("chunks.json").write_text(json.dumps(md_chunks, indent=2))
```
Continuity Token Design
Continuity Tokens are placed inside HTML comments so they remain invisible to human readers yet machine-parsable.
Example injection for chunk 002:
```html
<!--CT {"id":"chunk-002","prev":"chunk-001","next":"chunk-003"}-->
<p>[Lead-in] Continuing from chunk-001 — …</p>
...
<p>[Lead-out] Upcoming in chunk-003: Advanced use cases.</p>
```
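The injection can be automated from a manifest entry. The sketch below is illustrative: the `render_chunk_html` helper and its `summary`/`preview` arguments are not part of the reference implementation above.

```python
import json

def render_chunk_html(entry: dict, summary: str, preview: str) -> str:
    """Wrap one manifest entry in its CT comment plus lead-in/lead-out paragraphs."""
    ct = entry["ct"]
    parts = [f"<!--CT {json.dumps(ct, separators=(',', ':'))}-->"]
    if ct["prev"]:                                   # every chunk after the first
        parts.append(f"<p>[Lead-in] Continuing from {ct['prev']}: {summary}</p>")
    parts.append(entry["content"])
    if ct["next"]:                                   # every chunk before the last
        parts.append(f"<p>[Lead-out] Upcoming in {ct['next']}: {preview}</p>")
    return "\n".join(parts)
```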
Metadata Schema (JSON-LD)
```json
{
  "@context": "https://schema.org",
  "@type": "CreativeWork",
  "@id": "https://example.com/ebook/chunk-002",
  "name": "Chapter 1 — Part 2",
  "position": 2,
  "isBasedOn": "https://example.com/ebook",
  "pagination": "2/10",
  "continuityToken": {
    "@type": "PropertyValue",
    "name": "CT",
    "value": "{\"id\":\"chunk-002\",\"prev\":\"chunk-001\",\"next\":\"chunk-003\"}"
  }
}
```
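The JSON-LD block can be derived mechanically from the manifest. A sketch, assuming the placeholder base URL from the example above and leaving the human-readable `name` to the caller:

```python
import json

def jsonld_for_chunk(ct: dict, position: int, total: int,
                     base_url: str = "https://example.com/ebook") -> str:
    """Emit Schema.org/CreativeWork metadata for one chunk of the manifest."""
    doc = {
        "@context": "https://schema.org",
        "@type": "CreativeWork",
        "@id": f"{base_url}/{ct['id']}",
        "position": position,
        "isBasedOn": base_url,
        "pagination": f"{position}/{total}",
        "continuityToken": {
            "@type": "PropertyValue",
            "name": "CT",
            "value": json.dumps(ct, separators=(",", ":")),
        },
    }
    return json.dumps(doc, indent=2)
```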
Lead-in / Lead-out Templates
- Lead-in: `Continuing from {prev_id}: {one-sentence summary}`
- Lead-out: `Upcoming in {next_id}: {one-sentence preview}`
Generate summaries and previews with an LLM at a low temperature (`T = 0.2`) to maximize determinism.
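A sketch of the summary-generation step, assuming the OpenAI Python client; any chat-completion API with a temperature parameter works, and the prompt wording here is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def one_sentence_summary(chunk_text: str, model: str = "gpt-4o-mini") -> str:
    """Low-temperature, single-sentence recap used to fill the Lead-in template."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.2,
        messages=[
            {"role": "system",
             "content": "Summarize the following content in exactly one sentence."},
            {"role": "user", "content": chunk_text},
        ],
    )
    return response.choices[0].message.content.strip()
```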
Testing Protocols
- Data Set: At minimum, 100 documents spanning 1k–200k tokens.
- Control Group: Unsplit input passed through the same LLM.
- Experimental Group: Chunked + CT-stitched input.
- Queries: 25 Q&A pairs per document focusing on cross-chunk references.
- Metrics captured:
  - Answer F1 (exact match & semantic).
  - Context Recall (percentage of required tokens restored via the CT graph).
  - Pagination Integrity Score (PIS): `PIS = (continuity_hits / (continuity_hits + continuity_misses)) * 100` (see the sketch after the thresholds below).
  - Latency Delta (ms) between control and experimental groups.
AIO compliance thresholds (P2 priority):
- Answer F1 ≥ 0.92
- Context Recall ≥ 0.97
- PIS ≥ 98
- Latency Delta ≤ +15 %
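A minimal sketch of the PIS computation and the threshold gate; the hit/miss counts are assumed to come from the evaluation harness, and the example numbers are illustrative only:

```python
def pagination_integrity_score(continuity_hits: int, continuity_misses: int) -> float:
    """PIS = hits / (hits + misses) * 100, per the formula above."""
    total = continuity_hits + continuity_misses
    return 100.0 * continuity_hits / total if total else 0.0

def meets_aio_thresholds(f1: float, recall: float, pis: float,
                         latency_delta_pct: float) -> bool:
    """Gate against the P2 compliance thresholds listed above."""
    return (f1 >= 0.92 and recall >= 0.97 and pis >= 98
            and latency_delta_pct <= 15.0)

# Example: 196 cross-chunk references resolved, 4 missed -> PIS = 98.0
assert pagination_integrity_score(196, 4) == 98.0
```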
Scoring Methodology (Automated)
```bash
aio-scorer \
  --manifest chunks.json \
  --queries queries.yaml \
  --model gpt-4o-mini \
  --metrics f1,recall,pis,latency \
  --target-thresholds 0.92 0.97 98 15
```
The scorer reconstructs context by following CT links. Any dangling or circular CT-graph edges trigger immediate failure (`EXIT_CODE=13`).
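The CT-graph validation can be reproduced outside the scorer. The sketch below mirrors the failure condition described above; it is not the aio-scorer source itself:

```python
import json
from pathlib import Path

def validate_ct_graph(manifest_path: str = "chunks.json") -> None:
    """Exit with code 13 on dangling or circular CT edges, mirroring aio-scorer."""
    entries = json.loads(Path(manifest_path).read_text())
    if not entries:
        return
    ids = {e["ct"]["id"] for e in entries}
    for e in entries:
        for edge in ("prev", "next"):
            target = e["ct"][edge]
            if target is not None and target not in ids:
                raise SystemExit(13)          # dangling edge
    # Detect cycles by walking `next` links from the first chunk.
    by_id = {e["ct"]["id"]: e["ct"] for e in entries}
    seen, current = set(), entries[0]["ct"]["id"]
    while current is not None:
        if current in seen:
            raise SystemExit(13)              # circular edge
        seen.add(current)
        current = by_id[current]["next"]
```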
Common Pitfalls & Mitigations
| Pitfall | Symptom | Mitigation |
|---|---|---|
| Overlapping code block fences | LLM mis-renders or truncates code | Validate the Markdown AST after splitting (see the fence-check sketch below). |
| CT collision in merged docs | Duplicate IDs | Use UUID v7 or a repo-wide monotonic counter. |
| Excessive overlap window | Token budget erosion | Keep overlap ≤ 5 % of NMAX. |
| Async chunk loading | Missing lead-in content | Defer render until all adjacent chunks are fetched. |
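For the first pitfall, a lightweight stand-in for a full Markdown AST check is to verify that every chunk closes the code fences it opens; an AST library such as markdown-it-py gives a stricter check, and the fence-count heuristic below is only a sketch:

```python
FENCE = "`" * 3   # the Markdown code-fence marker

def has_balanced_fences(chunk: str) -> bool:
    """A chunk that opens a code fence must also close it, or code will truncate."""
    return chunk.count(FENCE) % 2 == 0

def validate_chunks(chunks: list[str]) -> list[int]:
    """Return indices of chunks whose code fences were split across a boundary."""
    return [i for i, c in enumerate(chunks) if not has_balanced_fences(c)]
```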
Deployment Checklist
- [ ] All chunks ≤ NMAX tokens with CTs included
- [ ] `chunks.json` present at content root with SHA-256 checksum (see the sketch after this checklist)
- [ ] HTML build passes W3C validator after CT comment stripping
- [ ] `aio-scorer` run passes on ≥ 3 consecutive CI builds
- [ ] Incident playbook updated with pagination rollback steps
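A sketch of the checksum step in the checklist; the `.sha256` sidecar-file convention is an assumption, so adapt it to the repository's existing checksum tooling:

```python
import hashlib
from pathlib import Path

def write_manifest_checksum(manifest_path: str = "chunks.json") -> str:
    """Write a SHA-256 sidecar file next to chunks.json and return the digest."""
    digest = hashlib.sha256(Path(manifest_path).read_bytes()).hexdigest()
    Path(manifest_path + ".sha256").write_text(f"{digest}  {manifest_path}\n")
    return digest
```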
By following the guidelines above, content strategists and engineers can keep paginated assets logically cohesive for both human readers and LLM workflows, preserving high retrieval accuracy and stable downstream reasoning across the entire document set.