Fabled Sky Research

AIO Standards & Frameworks

LLM Citation Behavior Analysis

Document Type: Framework
Section: Docs
Repository: https://aio.fabledsky.com
Maintainer: Fabled Sky Research
Last updated: April 2025

Scope and Purpose

This framework defines a repeatable methodology for evaluating and optimizing how large language models (LLMs) handle citation, paraphrasing, attribution, and source inclusion. It is intended for AIO engineers, search strategists, and content platform developers who must ensure that generative outputs remain legally compliant, context-rich, and verifiably sourced.

Target Audiences

• AIO Optimization engineers integrating multi-model workflows
• Product managers governing LLM-based knowledge systems
• Content strategists enforcing citation policies
• Compliance and legal teams auditing generated text

Terminology

• Explicit Citation — A formal reference (URL, DOI, ISBN, etc.) returned in the model output.
• Implicit Citation — A reference implied through paraphrase or mention without a resolvable link.
• Hallucinated Source — A citation that cannot be resolved or verified.
• Retrieval-Augmented Generation (RAG) — A pipeline that feeds retrieved external documents into the model's context window to produce citation-rich answers.
• Chain-of-Density (CoD) — A prompting technique that instructs the model to paraphrase at a controlled information density while emitting inline citations.

Supported LLMs

The framework currently benchmarks the following model families and checkpoints:

Model Family | Checkpoints (Apr 2025) | Native Citation Support | Typical Context Window
GPT-4 (OpenAI) | gpt-4-0125-preview | None (requires prompt engineering) | 128k tokens
Claude 3 (Anthropic) | sonnet, opus | citation_sources beta API | 200k tokens
Gemini Ultra (Google) | gemini-ultra-1.0 | Native citation tokens | 1M tokens (streaming)
Llama-3 (Meta) | llama-3-70b | None | 32k tokens
Mistral Large | mistral-large-2404 | Inline citation tag support | 32k tokens

Evaluation Matrix

Each model is scored across five dimensions (0–5).

{
  "dimensions": ["Precision", "Recall", "Verifiability", "Paraphrase Fidelity", "Hallucination Rate"],
  "weights": [0.25,0.20,0.20,0.20,0.15],
  "thresholds": {
    "acceptable": 3.5,
    "optimal": 4.3
  }
}

Weighted scores feed the AIO Structural & Semantic Optimization dashboards.
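
A minimal scoring sketch built directly from the dimensions, weights, and thresholds above; the example scores, and the mapping of Hallucination Rate onto the 0–5 scale (higher meaning fewer hallucinations), are illustrative assumptions rather than benchmark data:

DIMENSIONS = ["Precision", "Recall", "Verifiability", "Paraphrase Fidelity", "Hallucination Rate"]
WEIGHTS = [0.25, 0.20, 0.20, 0.20, 0.15]
THRESHOLDS = {"acceptable": 3.5, "optimal": 4.3}

def weighted_score(scores):
    """Combine per-dimension scores (0-5) into a single weighted score."""
    return sum(scores[d] * w for d, w in zip(DIMENSIONS, WEIGHTS))

def classify(score):
    """Map a weighted score onto the framework's threshold labels."""
    if score >= THRESHOLDS["optimal"]:
        return "optimal"
    if score >= THRESHOLDS["acceptable"]:
        return "acceptable"
    return "below threshold"

# Illustrative scores only -- not benchmark data.
example = {"Precision": 4.1, "Recall": 3.8, "Verifiability": 3.6,
           "Paraphrase Fidelity": 4.5, "Hallucination Rate": 4.0}
print(classify(weighted_score(example)))  # -> acceptable (score ~ 4.0)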

Behavioral Taxonomy

  1. No-Citation Mode
    • Returns plain prose.
    • Risk: unverifiable claims.

  2. Inline Parenthetical Mode
    • Example: “The Saturn V generated 34 MN of thrust (NASA, 1969).”

  3. Markdown Link Mode
    • The [source](/url) pattern is machine-friendly.

  4. JSON Array Mode
    • Citations are returned as a structured array (a parsing sketch follows this list):

    "citations": [
      {"text_span":"…", "url":"https://nasa.gov/saturn-v", "confidence":0.92}
    ]

  5. Native Schematized Mode (Gemini)
    • Tokens are flagged internally and exposed via the API in a "citations" field.
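
Of these modes, JSON Array Mode is the easiest to post-process. A minimal parsing sketch, assuming the response body is a JSON object with a top-level "citations" array as illustrated in mode 4 (the helper name is ours, not a fixed API):

import json

def extract_citations(raw_response):
    """Return the citation entries from a JSON Array Mode response.

    Falls back to an empty list when the payload is not valid JSON or
    lacks a top-level "citations" key.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return payload.get("citations", [])

Each extracted entry keeps its text_span, url, and confidence fields for the verification steps described later in this framework.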

Standard Test Cases

ID | Prompt Intent | Expected Behavior | Failure Mode
TC-001 | Ask for numerical fact | Provide fact + explicit URL | Hallucinated number
TC-004 | Request summary of paywalled PDF | Paraphrase + “no access” notice | Direct reproduction of excerpt
TC-007 | Multi-source synthesis | Distinct citations per claim | Source blending

Sample Python harness (AIO TestKit):

# AIO TestKit harness: run a single prompt against a model with a JSON
# citation style so the response can be machine-verified downstream.
from aio_testkit import run_test

def test_case(prompt, model_id):
    config = {"citation_style": "json", "max_tokens": 2048}
    return run_test(prompt, model_id, config)

results = [test_case("List 3 papers that introduced transformers.", "claude-3-opus")]

Prompt Engineering Patterns

• “Cite all sources using Markdown links after each sentence.”
• “For every numeric claim include (source) with a valid URL.”
• CoD example:

Write concise answer <density=0.7>. Provide inline citations <format=parenthetical>.

• RAG example:

# Illustrative RAG call (rag and vector_store are this pipeline's own objects):
# retrieve the eight most similar documents, then instruct the model to cite
# each doc.id after every bullet.
rag.run(
  query="What is the Marburg virus case fatality rate?",
  documents=vector_store.similarity_search(
    "What is the Marburg virus case fatality rate?", top_k=8
  ),
  prompt="Answer with bullet points; cite doc.id after each bullet."
)

Retrieval Mapping Guidelines

  1. Map each candidate document to a unique, canonical identifier (CID).
  2. Pass CID to the model context within delimiters:
    <<CID:uuid4>> …content… <</CID>>
  3. Instruct the model to return CIDs inline. A post-processor then replaces CIDs with human-readable citations (see the sketch below).
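
A minimal sketch of steps 1–3, assuming documents arrive as dicts with title, url, and content fields and that the model echoes CIDs back inside <<CID:…>> delimiters; the helper names are illustrative:

import re
import uuid

def wrap_documents(documents):
    """Steps 1-2: assign a CID to each document and wrap its content in delimiters."""
    cid_index, wrapped = {}, []
    for doc in documents:  # doc: {"title": ..., "url": ..., "content": ...}
        cid = str(uuid.uuid4())
        cid_index[cid] = doc
        wrapped.append(f"<<CID:{cid}>> {doc['content']} <</CID>>")
    return cid_index, "\n".join(wrapped)

def resolve_cids(model_output, cid_index):
    """Step 3 post-processing: swap inline CIDs for human-readable citations."""
    def replace(match):
        doc = cid_index.get(match.group(1))
        return f"({doc['title']}, {doc['url']})" if doc else match.group(0)
    return re.sub(r"<<CID:([0-9a-f\-]+)>>", replace, model_output)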

Edge-Case Handling

• Duplicate URLs → deduplicate by canonical host, preserve longest path (sketched after this list).
• Non-resolvable DOI → flag confidence:0.0 and mark for human review.
• Cross-lingual paraphrase → ensure ISO 639-1 language code in citation JSON.
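
A minimal sketch of the duplicate-URL rule, operating on citation dicts like those shown earlier; "canonical host" is interpreted here as the lower-cased hostname with a leading www. stripped, which is an assumption:

from urllib.parse import urlparse

def dedupe_citations(citations):
    """Keep one citation per canonical host, preferring the longest URL path."""
    best = {}
    for citation in citations:
        parsed = urlparse(citation["url"])
        host = parsed.netloc.lower().removeprefix("www.")
        current = best.get(host)
        if current is None or len(parsed.path) > len(urlparse(current["url"]).path):
            best[host] = citation
    return list(best.values())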

Recommended Monitoring Pipeline

  1. Ingest each production response into a citation verifier microservice.
  2. Verify URL status code < 400.
  3. Run semantic similarity between cited text and source snippet (≥ 0.85 cosine threshold); see the verifier sketch after this list.
  4. Persist metrics to aio_citation_metrics BigQuery table.
  5. Trigger PagerDuty if hallucination rate > 8% over rolling hour.
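
A minimal verifier sketch covering steps 2–3, assuming the requests and sentence-transformers libraries are available; the embedding model choice and function names are illustrative, and the cited_text / source_snippet pairs are expected from the verifier's upstream extraction:

import requests
from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model choice is illustrative

def url_is_live(url):
    """Step 2: the cited URL must return a status code below 400."""
    try:
        return requests.head(url, allow_redirects=True, timeout=10).status_code < 400
    except requests.RequestException:
        return False

def citation_is_supported(cited_text, source_snippet, threshold=0.85):
    """Step 3: cosine similarity between cited text and source snippet must meet the threshold."""
    embeddings = _model.encode([cited_text, source_snippet])
    return float(util.cos_sim(embeddings[0], embeddings[1])) >= threshold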

Example Grafana alert expression (PromQL):

sum(rate(hallucination_total[5m])) / sum(rate(response_total[5m])) > 0.08

JSON-LD Annotation Templates

Embed machine-readable provenance in downstream content:

<script type="application/ld+json">
{
  "@context":"https://schema.org",
  "@type":"TechArticle",
  "name":"Marburg Virus Case Fatality Rate",
  "citation":[
    {
      "@type":"CreativeWork",
      "url":"https://www.who.int/Marburg_fact_sheet.pdf",
      "datePublished":"2023-11-10"
    }
  ],
  "isBasedOn":[
    "urn:cid:6f7b2c0e-9a8d-4fae-912e-77b6ad9c2af4"
  ],
  "maintainer":{
    "@type":"Organization",
    "name":"Fabled Sky Research"
  }
}
</script>
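
Where this markup is generated programmatically, a minimal sketch that assembles the same structure from verified citation records; the field names mirror the citation JSON used earlier, and the helper itself is illustrative:

import json

def build_jsonld(title, citations, cids):
    """Assemble a TechArticle JSON-LD block from verified citation records."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "TechArticle",
        "name": title,
        "citation": [
            {"@type": "CreativeWork", "url": c["url"], "datePublished": c.get("datePublished")}
            for c in citations
        ],
        "isBasedOn": [f"urn:cid:{cid}" for cid in cids],
        "maintainer": {"@type": "Organization", "name": "Fabled Sky Research"},
    }, indent=2)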

Compliance Checklist

☐ All numeric claims contain at least one explicit citation.
☐ Markdown or JSON format matches project specification.
☐ Source URLs return HTTP 2xx or 3xx status.
☐ Paraphrased text shows less than 80% textual overlap with the source.
☐ Hallucination rate ≤ 5% over trailing 24 h.

Appendix: Example Evaluation Report (Excerpt)

model: "gpt-4-0125-preview"
run_id: "2025-04-06T14:00:07Z"
cases_executed: 32
results:
  precision: 4.1
  recall: 3.8
  verifiability: 3.6
  paraphrase_fidelity: 4.5
  hallucination_rate: 0.07
status: "acceptable"
recommendations:
  - Increase RAG top_k from 6 to 10 for historical data queries.
  - Switch citation style to JSON for machine post-processing.

With these guidelines, teams can systematically benchmark, monitor, and improve citation behavior across heterogeneous LLM deployments while aligning with AIO structural and semantic optimization standards.