Fabled Sky Research

AIO Standards & Frameworks

LLM Citation Behavior Analysis

Document Type: Framework
Section: Docs
Repository: https://aio.fabledsky.com
Maintainer: Fabled Sky Research
Last updated: April 2025

Scope and Purpose

This framework defines a repeatable methodology for evaluating and optimizing how large language models (LLMs) handle citation, paraphrasing, attribution, and source inclusion. It is intended for AIO engineers, search strategists, and content platform developers who must ensure that generative outputs remain legally compliant, context-rich, and verifiably sourced.

Target Audiences

• AIO Optimization engineers integrating multi-model workflows
• Product managers governing LLM-based knowledge systems
• Content strategists enforcing citation policies
• Compliance and legal teams auditing generated text

Terminology

• Explicit Citation — A formal reference (URL, DOI, ISBN, etc.) returned in the model output.
• Implicit Citation — A reference implied through paraphrase or mention without a resolvable link.
• Hallucinated Source — A citation that cannot be resolved or verified.
• Retrieval-Augmented Generation (RAG) — A pipeline that feeds retrieved external documents into the model's context window to produce citation-rich answers.
• Chain-of-Density (CoD) — A prompting technique that instructs the model to paraphrase at a controlled information density while emitting inline citations.

Supported LLMs

The framework currently benchmarks the following model families and checkpoints:

Model Family | Checkpoints (Apr 2025) | Native Citation Support | Typical Context Window
GPT-4 (OpenAI) | gpt-4-0125-preview | None (requires prompt engineering) | 128k tokens
Claude 3 (Anthropic) | sonnet, opus | citation_sources beta API | 200k tokens
Gemini Ultra (Google) | gemini-ultra-1.0 | Native citation tokens | 1M tokens (streaming)
Llama-3 (Meta) | llama-3-70b | None | 32k tokens
Mistral Large | mistral-large-2404 | Inline citation tag support | 32k tokens

Evaluation Matrix

Each model is scored across five dimensions (0–5).

{
  "dimensions": ["Precision", "Recall", "Verifiability", "Paraphrase Fidelity", "Hallucination Rate"],
  "weights": [0.25,0.20,0.20,0.20,0.15],
  "thresholds": {
    "acceptable": 3.5,
    "optimal": 4.3
  }
}

Weighted scores feed the AIO Structural & Semantic Optimization dashboards.
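
A minimal scoring sketch built directly from the dimensions, weights, and thresholds above; the example scores, and the mapping of Hallucination Rate onto the 0–5 scale (higher meaning fewer hallucinations), are illustrative assumptions rather than benchmark data:

DIMENSIONS = ["Precision", "Recall", "Verifiability", "Paraphrase Fidelity", "Hallucination Rate"]
WEIGHTS = [0.25, 0.20, 0.20, 0.20, 0.15]
THRESHOLDS = {"acceptable": 3.5, "optimal": 4.3}

def weighted_score(scores):
    """Combine per-dimension scores (0-5) into a single weighted score."""
    return sum(scores[d] * w for d, w in zip(DIMENSIONS, WEIGHTS))

def classify(score):
    """Map a weighted score onto the framework's threshold labels."""
    if score >= THRESHOLDS["optimal"]:
        return "optimal"
    if score >= THRESHOLDS["acceptable"]:
        return "acceptable"
    return "below threshold"

# Illustrative scores only -- not benchmark data.
example = {"Precision": 4.1, "Recall": 3.8, "Verifiability": 3.6,
           "Paraphrase Fidelity": 4.5, "Hallucination Rate": 4.0}
print(classify(weighted_score(example)))  # -> acceptable (score ~ 4.0)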

Behavioral Taxonomy

  1. No-Citation Mode
    • Returns plain prose.
    • Risk: unverifiable claims.

  2. Inline Parenthetical Mode
    • Example: “The Saturn V generated 34 MN of thrust (NASA, 1969).”

  3. Markdown Link Mode
    • The [source](/url) pattern is machine-friendly.

  4. JSON Array Mode
    • Citations are returned as a structured array (a parsing sketch follows this list):

    "citations": [
      {"text_span":"…", "url":"https://nasa.gov/saturn-v", "confidence":0.92}
    ]

  5. Native Schematized Mode (Gemini)
    • Tokens are flagged internally and exposed via the API in a "citations" field.
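
Of these modes, JSON Array Mode is the easiest to post-process. A minimal parsing sketch, assuming the response body is a JSON object with a top-level "citations" array as illustrated in mode 4 (the helper name is ours, not a fixed API):

import json

def extract_citations(raw_response):
    """Return the citation entries from a JSON Array Mode response.

    Falls back to an empty list when the payload is not valid JSON or
    lacks a top-level "citations" key.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get("citations", [])

Each extracted entry keeps its text_span, url, and confidence fields for the verification steps described later in this framework.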

Standard Test Cases

ID | Prompt Intent | Expected Behavior | Failure Mode
TC-001 | Ask for numerical fact | Provide fact + explicit URL | Hallucinated number
TC-004 | Request summary of paywalled PDF | Paraphrase + “no access” notice | Direct reproduction of excerpt
TC-007 | Multi-source synthesis | Distinct citations per claim | Source blending

Sample Python harness (AIO TestKit):

# AIO TestKit harness: run a single prompt against a model with a JSON
# citation style so the response can be machine-verified downstream.
from aio_testkit import run_test

def test_case(prompt, model_id):
    config = {"citation_style": "json", "max_tokens": 2048}
    return run_test(prompt, model_id, config)

results = [test_case("List 3 papers that introduced transformers.", "claude-3-opus")]

Prompt Engineering Patterns

• “Cite all sources using Markdown links after each sentence.”
• “For every numeric claim include (source) with a valid URL.”
• CoD example:

Write concise answer <density=0.7>. Provide inline citations <format=parenthetical>.

• RAG example:

# Illustrative RAG call (rag and vector_store are this pipeline's own objects):
# retrieve the eight most similar documents, then instruct the model to cite
# each doc.id after every bullet.
rag.run(
  query="What is the Marburg virus case fatality rate?",
  documents=vector_store.similarity_search(
    "What is the Marburg virus case fatality rate?", top_k=8
  ),
  prompt="Answer with bullet points; cite doc.id after each bullet."
)

Retrieval Mapping Guidelines

  1. Map each candidate document to a unique, canonical identifier (CID).
  2. Pass CID to the model context within delimiters:
    <<CID:uuid4>> …content… <</CID>>
  3. Instruct the model to return CIDs inline. A post-processor then replaces CIDs with human-readable citations (see the sketch below).
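
A minimal sketch of steps 1–3, assuming documents arrive as dicts with title, url, and content fields and that the model echoes CIDs back inside <<CID:…>> delimiters; the helper names are illustrative:

import re
import uuid

def wrap_documents(documents):
    """Steps 1-2: assign a CID to each document and wrap its content in delimiters."""
    cid_index, wrapped = {}, []
    for doc in documents:  # doc: {"title": ..., "url": ..., "content": ...}
        cid = str(uuid.uuid4())
        cid_index[cid] = doc
        wrapped.append(f"<<CID:{cid}>> {doc['content']} <</CID>>")
    return cid_index, "\n".join(wrapped)

def resolve_cids(model_output, cid_index):
    """Step 3 post-processing: swap inline CIDs for human-readable citations."""
    def replace(match):
        doc = cid_index.get(match.group(1))
        return f"({doc['title']}, {doc['url']})" if doc else match.group(0)
    return re.sub(r"<<CID:([0-9a-f\-]+)>>", replace, model_output)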

Edge-Case Handling

• Duplicate URLs → deduplicate by canonical host, preserve longest path (sketched after this list).
• Non-resolvable DOI → flag confidence:0.0 and mark for human review.
• Cross-lingual paraphrase → ensure ISO 639-1 language code in citation JSON.
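
A minimal sketch of the duplicate-URL rule, operating on citation dicts like those shown earlier; "canonical host" is interpreted here as the lower-cased hostname with a leading www. stripped, which is an assumption:

from urllib.parse import urlparse

def dedupe_citations(citations):
    """Keep one citation per canonical host, preferring the longest URL path."""
    best = {}
    for citation in citations:
        parsed = urlparse(citation["url"])
        host = parsed.netloc.lower().removeprefix("www.")
        current = best.get(host)
        if current is None or len(parsed.path) > len(urlparse(current["url"]).path):
            best[host] = citation
    return list(best.values())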

Recommended Monitoring Pipeline

  1. Ingest each production response into a citation verifier microservice.
  2. Verify URL status code < 400.
  3. Run semantic similarity between cited text and source snippet (≥ 0.85 cosine threshold); see the verifier sketch after this list.
  4. Persist metrics to aio_citation_metrics BigQuery table.
  5. Trigger PagerDuty if hallucination rate > 8% over rolling hour.
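
A minimal verifier sketch covering steps 2–3, assuming the requests and sentence-transformers libraries are available; the embedding model choice and function names are illustrative, and the cited_text / source_snippet pairs are expected from the verifier's upstream extraction:

import requests
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model choice is illustrative

def url_is_live(url):
    """Step 2: the cited URL must return a status code below 400."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).status_code < 400
    except requests.RequestException:
        return False

def citation_is_supported(cited_text, source_snippet, threshold=0.85):
    """Step 3: cosine similarity between cited text and source snippet must meet the threshold."""
    embeddings = _model.encode([cited_text, source_snippet])
    return float(util.cos_sim(embeddings[0], embeddings[1])) >= threshold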

Example Grafana alert expression (PromQL):

sum(rate(hallucination_total[5m])) / sum(rate(response_total[5m])) > 0.08

JSON-LD Annotation Templates

Embed machine-readable provenance in downstream content:

<script type="application/ld+json">
{
  "@context":"https://schema.org",
  "@type":"TechArticle",
  "name":"Marburg Virus Case Fatality Rate",
  "citation":[
    {
      "@type":"CreativeWork",
      "url":"https://www.who.int/Marburg_fact_sheet.pdf",
      "datePublished":"2023-11-10"
    }
  ],
  "isBasedOn":[
    "urn:cid:6f7b2c0e-9a8d-4fae-912e-77b6ad9c2af4"
  ],
  "maintainer":{
    "@type":"Organization",
    "name":"Fabled Sky Research"
  }
}
</script>
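
Where this markup is generated programmatically, a minimal sketch that assembles the same structure from verified citation records; the field names mirror the citation JSON used earlier, and the helper itself is illustrative:

import json

def build_jsonld(title, citations, cids):
    """Assemble a TechArticle JSON-LD block from verified citation records."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "name": title,
        "citation": [
            {"@type": "CreativeWork", "url": c["url"], "datePublished": c.get("datePublished")}
            for c in citations
        ],
        "isBasedOn": [f"urn:cid:{cid}" for cid in cids],
        "maintainer": {"@type": "Organization", "name": "Fabled Sky Research"},
    }, indent=2)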

Compliance Checklist

☐ All numeric claims contain at least one explicit citation.
☐ Markdown or JSON format matches project specification.
☐ Source URLs return HTTP 2xx or 3xx status.
☐ Paraphrased text shows less than 80% textual overlap with the source.
☐ Hallucination rate ≤ 5% over trailing 24 h.

Appendix: Example Evaluation Report (Excerpt)

model: "gpt-4-0125-preview"
run_id: "2025-04-06T14:00:07Z"
cases_executed: 32
results:
  precision: 4.1
  recall: 3.8
  verifiability: 3.6
  paraphrase_fidelity: 4.5
  hallucination_rate: 0.07
status: "acceptable"
recommendations:
  - Increase RAG top_k from 6 to 10 for historical data queries.
  - Switch citation style to JSON for machine post-processing.

With these guidelines, teams can systematically benchmark, monitor, and improve citation behavior across heterogeneous LLM deployments while aligning with AIO structural and semantic optimization standards.