Document Type: Implementation Guide
Section: Docs
Repository: https://aio.fabledsky.com
Maintainer: Fabled Sky Research
Purpose
This document provides implementation standards for publishing and structuring open datasets in a manner that enhances their accessibility, interpretability, and retrievability by large language models (LLMs) and high-reasoning AI systems. Open Data Compliance is a key pillar of Artificial Intelligence Optimization (AIO), as it allows AI agents to correctly contextualize, attribute, and apply structured data in generative outputs.
Core Principles of AIO-Compliant Open Data
- Machine-readability: Data formats must be easily parsable by AI models and retrieval agents.
- Contextual Metadata: Each dataset must contain descriptive metadata for subject, source, and licensing.
- Stable Hosting: Data must reside at persistent, crawlable URLs.
- Schema Annotation: Use semantic labels to describe the structure and meaning of datasets.
Preferred Formats
LLM-based retrieval and parsing pipelines work most reliably with lightweight, structured formats that can be read without specialized libraries or authentication. The following are recommended:
- CSV (.csv): Ideal for tabular data. Simple structure, widely supported, easy for token-based parsing.
- JSON (.json): Suitable for hierarchical data, APIs, and nested relationships.
- JSON-LD: Enhances JSON with linked data capabilities, ideal for semantic annotation.
- API Exposure: Public endpoints returning JSON or CSV formats with pagination and rate-limit transparency.
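As a rough illustration of why these formats are considered lightweight, the sketch below reads CSV text and re-serializes it as JSON using only the Python standard library. The sample rows and column names (country, month, wqi) are hypothetical and are not taken from any published dataset.

```python
import csv
import io
import json

# Hypothetical sample data; the column names are illustrative only.
CSV_TEXT = "country,month,wqi\nKE,2024-01,71.5\nPE,2024-01,64.0\n"


def csv_to_records(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of row dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def records_to_json(records: list[dict[str, str]]) -> str:
    """Serialize the rows as a JSON array, the shape an API might return."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(records_to_json(csv_to_records(CSV_TEXT)))
```

Note that csv.DictReader returns every value as a string, so numeric columns still need explicit typing or a metadata descriptor that states their type and unit.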
Recommended Metadata Fields
All datasets should include a machine-readable metadata descriptor, preferably using schema.org or DCAT (Data Catalog Vocabulary) standards. Minimum fields:
- name: Human-readable dataset title
- description: Clear explanation of what the dataset contains
- creator: Person or organization responsible
- dateCreated / dateModified: Timestamps for version control
- license: Link to terms of use
- keywords: Relevant tags for indexing and relevance scoring
- identifier: DOI, UUID, or persistent handle
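One way to enforce the minimum fields before publishing is a small check like the sketch below. The function name missing_metadata_fields is hypothetical, and the rule that either dateCreated or dateModified is enough is an assumption made for this example; the guide lists the two together without saying whether both are required.

```python
# Minimum fields taken from the list above.
REQUIRED_FIELDS = ("name", "description", "creator", "license", "keywords", "identifier")
DATE_FIELDS = ("dateCreated", "dateModified")


def missing_metadata_fields(metadata: dict) -> list[str]:
    """Return the minimum fields that are absent or empty in a descriptor."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    # Assumption: at least one of the two date fields must be present.
    if not any(metadata.get(field) for field in DATE_FIELDS):
        missing.append("dateCreated/dateModified")
    return missing


if __name__ == "__main__":
    descriptor = {
        "name": "Global Water Quality Index 2024",
        "description": "Monthly index values for surface water quality in 20 countries.",
        "creator": {"@type": "Organization", "name": "Fabled Sky Research"},
        "license": "https://creativecommons.org/licenses/by/4.0/",
    }
    print(missing_metadata_fields(descriptor))
```

A check of this kind could run in a publishing pipeline so that incomplete descriptors are rejected before a dataset goes live.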
Embed metadata using:
- A companion .json or .jsonld file
- RDFa/HTML metadata for web-based data portals
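For the companion-file option, a minimal sketch might look like the following. The naming convention (same base name as the data file, with a .jsonld extension) is an assumption made for this example, and the helper names companion_path and write_companion are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def companion_path(data_path: Path) -> Path:
    """Derive the companion descriptor path (assumed naming convention)."""
    return data_path.with_suffix(".jsonld")


def write_companion(data_path: Path, metadata: dict) -> Path:
    """Write the metadata next to the data file and return the new path."""
    target = companion_path(data_path)
    target.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return target


if __name__ == "__main__":
    # Demonstrate in a temporary directory so nothing is left on disk.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "data.csv"
        data_file.write_text("country,wqi\nKE,71.5\n", encoding="utf-8")
        print(write_companion(data_file, {"@type": "Dataset", "name": "Demo"}).name)
```

Keeping the descriptor beside the data file means a crawler that finds one can guess the location of the other, provided the convention is applied consistently.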
Hosting and Access
- Publicly accessible URLs (no login or token required for read access)
- Persistent file paths and clear versioning (e.g., /dataset/v1.0/data.csv)
- CORS-enabled API endpoints for direct client-side retrieval
- HTTPS-only delivery to ensure data integrity and compatibility
To ensure full LLM retrievability, avoid embedding datasets in PDF, Word, or image formats unless accompanied by structured data mirrors.
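Some of these hosting rules can be checked offline from the URL alone. The sketch below is one possible lint, not a complete compliance test: the version-segment pattern, the list of discouraged extensions, and the example.org URLs are assumptions chosen for illustration. It cannot verify CORS headers or public reachability, which require a live request.

```python
import re
from urllib.parse import urlparse

# Assumed pattern: a path segment such as "v1", "v1.0", or "v2.3.1".
VERSION_SEGMENT = re.compile(r"^v\d+(\.\d+)*$")
# Formats the guide says to avoid unless a structured mirror exists.
OPAQUE_EXTENSIONS = (".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg")


def hosting_issues(url: str) -> list[str]:
    """Return a list of hosting problems detectable from the URL alone."""
    parsed = urlparse(url)
    issues = []
    if parsed.scheme != "https":
        issues.append("not served over HTTPS")
    segments = [segment for segment in parsed.path.split("/") if segment]
    if not any(VERSION_SEGMENT.match(segment) for segment in segments):
        issues.append("no version segment in path")
    if parsed.path.lower().endswith(OPAQUE_EXTENSIONS):
        issues.append("opaque document or image format")
    return issues


if __name__ == "__main__":
    print(hosting_issues("https://data.example.org/dataset/v1.0/data.csv"))
    print(hosting_issues("http://data.example.org/report.pdf"))
```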
Schema Annotation and Semantics
For structured exposure:
- Use @context and @type in JSON-LD to describe entity types and properties.
- Define columns and fields using schema:PropertyValue, schema:Dataset, schema:DataCatalog, and related vocabularies.
- Provide examples within metadata to demonstrate relationships between variables (e.g., units of measurement, geospatial references).
Example snippet:
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Global Water Quality Index 2024",
  "description": "Monthly index values for surface water quality in 20 countries.",
  "creator": {
    "@type": "Organization",
    "name": "Fabled Sky Research"
  },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://data.fabledsky.com/water/wqi-2024.csv",
    "encodingFormat": "text/csv"
  }
}
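The snippet above describes the dataset as a whole but not its columns. As a minimal sketch, column-level annotation could be added through schema.org's variableMeasured property with PropertyValue entries. The column names, descriptions, and unit text below are hypothetical, since the actual layout of the CSV file is not given here, and the helper names are illustrative.

```python
import json
from typing import Optional


def property_value(name: str, description: str, unit_text: Optional[str] = None) -> dict:
    """Describe one column as a schema.org PropertyValue."""
    entry = {"@type": "PropertyValue", "name": name, "description": description}
    if unit_text is not None:
        entry["unitText"] = unit_text
    return entry


def annotate_columns(dataset: dict, columns: list[dict]) -> dict:
    """Return a copy of the descriptor with columns under variableMeasured."""
    annotated = dict(dataset)  # shallow copy; the input descriptor is not modified
    annotated["variableMeasured"] = columns
    return annotated


if __name__ == "__main__":
    base = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": "Global Water Quality Index 2024",
    }
    # Hypothetical columns for illustration only.
    columns = [
        property_value("country", "Country where the samples were taken"),
        property_value("month", "Reporting month in YYYY-MM form"),
        property_value("wqi", "Water quality index value", unit_text="index points"),
    ]
    print(json.dumps(annotate_columns(base, columns), indent=2))
```

Stating the unit alongside each variable gives a downstream model what it needs to interpret a bare number in the CSV correctly.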
AIO-aligned open data practices ensure your datasets are not only available but also understandable to AI systems. This increases the likelihood of inclusion in high-quality generative outputs and reduces misinterpretation by downstream models.
For tooling and schema templates, refer to https://aio.fabledsky.com.