Fabled Sky Research

AIO Standards & Frameworks

Trust Integrity Score (TIS)

Fabled Sky Research | AIO v1.2.7
Last updated: April 2025


Definition

The Trust Integrity Score is a composite metric that rates how structurally reliable a content artifact appears to a large language model (LLM). It answers a single question: How confident can an LLM be that this document is factually grounded, internally consistent, and deliberately reinforced rather than accidentally repetitive?

Mathematical formulation

TIS = λ1 * C + λ2 * S + λ3 * R

  • C — Citation Depth
    Scalar in [0, 1]. Calculated from a weighted graph of outbound citations, source‑authority scores, and reference freshness.
  • S — Semantic Coherence
    Scalar in [0, 1]. Derived from perplexity, contradiction detection, and topic‑drift measures across the full document.
  • R — Redundancy Alignment
    Scalar in [0, 1]. Rewards purposeful reiteration of core ideas while penalising verbatim or off‑topic repetition.

Default weights

λ1 = 0.40 | λ2 = 0.30 | λ3 = 0.30

Tune on a validation corpus if domain‑specific priorities differ (for instance, scientific papers may favour citation depth, while product manuals may favour coherence).
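
A minimal sketch of the combination step, in Python (the function name and the weight-validation check are illustrative, not part of the standard):

  def trust_integrity_score(c, s, r, weights=(0.40, 0.30, 0.30)):
      """Combine the three sub-scores into a TIS value in [0, 1].

      c, s, r -- Citation Depth, Semantic Coherence, Redundancy Alignment,
                 each already normalised to [0, 1].
      weights -- (λ1, λ2, λ3); kept summing to 1.0 so TIS stays normalised.
      """
      l1, l2, l3 = weights
      if abs(l1 + l2 + l3 - 1.0) > 1e-9:
          raise ValueError("λ weights must sum to 1.0")
      return round(l1 * c + l2 * s + l3 * r, 2)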

Score range

Normalised to 0.00–1.00. A score above 0.80 is typically considered “high‑trust” in benchmarking studies.

Computation pipeline

  1. Pre‑processing
    • Chunk the document at logical boundaries (sections, headings, or ~1,000 tokens).
    • Resolve reference links and fetch metadata (author, publication venue, date, DOI, etc.).
  2. Citation Depth (C)
    • Build a citation graph.
    • Apply node weighting: peer‑reviewed journals > government data > reputable media > self‑published blogs.
    • Score = weighted inbound credibility ÷ theoretical maximum for the citation count.
  3. Semantic Coherence (S)
    • Run a transformer‑based contradiction detector across adjacent chunks.
    • Calculate topic drift using embedding similarity between the introduction and each chunk (a sketch follows this list).
    • Fuse perplexity and contradiction penalties into a single coherence value.
  4. Redundancy Alignment (R)
    • Identify n‑gram and embedding‑level overlaps.
    • Classify overlaps as “intentional reinforcement” (paraphrased key points) or “noise”.
    • R = aligned‑reinforcement tokens ÷ total redundant tokens.
  5. Linear combination
    • Multiply each sub‑score by its λ weight and sum.
    • Round to two decimals for reporting.
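
As a minimal sketch of the topic-drift measure in step 3 (the model choice mirrors the toolchain below; the helper name is illustrative, and e5 query/passage prefixes are omitted for brevity):

  from sentence_transformers import SentenceTransformer
  from sentence_transformers.util import cos_sim

  # Any sentence-embedding model with an encode() method works the same way.
  model = SentenceTransformer("intfloat/e5-large-v2")

  def topic_drift(intro, chunks):
      """Mean cosine similarity between the introduction and each chunk.

      Values near 1.0 mean the document stays on topic; a low mean feeds
      the drift penalty that is fused into S.
      """
      intro_emb = model.encode(intro, normalize_embeddings=True)
      chunk_embs = model.encode(chunks, normalize_embeddings=True)
      return float(cos_sim(intro_emb, chunk_embs).mean())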

Recommended toolchain

  • Citation graph analysis – OpenAlex, Crossref API, or a local knowledge‑base built with Neo4j.
  • Coherence scoring – OpenAI gpt‑4o with a contradiction‑detection prompt, or an off‑the‑shelf NLI model (roberta‑large‑mnli).
  • Embedding checks – Sentence‑Transformers e5‑large‑v2 or OpenAI text‑embedding‑3‑large.
  • Redundancy classifier – Simple cosine‑similarity threshold plus a ruleset that distinguishes paraphrase from exact duplication (a sketch follows this list).
  • Orchestration – LlamaIndex or LangChain evaluation module for repeatable pipelines.
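
A minimal sketch of that threshold-plus-ruleset classifier (the cut-off values are assumptions and should be tuned on labelled overlap pairs from your own corpus):

  DUPLICATE_FLOOR = 0.97    # near-verbatim repetition (assumed cut-off)
  PARAPHRASE_FLOOR = 0.80   # same idea, new surface form (assumed cut-off)

  def classify_overlap(similarity):
      """Label the embedding similarity between two overlapping chunks.

      'reinforcement' counts toward R's numerator; 'noise' (verbatim or
      off-topic repetition) counts only toward the denominator.
      """
      if PARAPHRASE_FLOOR <= similarity < DUPLICATE_FLOOR:
          return "reinforcement"   # purposeful paraphrase of a core idea
      return "noise"               # verbatim duplication or off-topic echo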

Interpreting scores

TIS band      Practical meaning                          Typical action
0.90 – 1.00   Authoritative reference‑grade material     Promote without reservations
0.75 – 0.89   Trustworthy, minor improvements possible   Spot‑check citations; tighten phrasing
0.50 – 0.74   Acceptable but uneven                      Add sources, remove contradictions
< 0.50        Low‑trust                                  Rewrite or discard for critical applications
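
A small helper (illustrative name) makes the banding reproducible in reports:

  def tis_band(score):
      """Map a rounded TIS value to the bands in the table above."""
      if score >= 0.90:
          return "authoritative"
      if score >= 0.75:
          return "trustworthy"
      if score >= 0.50:
          return "acceptable"
      return "low-trust"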

Best practices

  • Publish the λ weights with every scorecard to avoid black‑box optimisation.
  • Keep citations fresh – outdated links degrade C rapidly.
  • Re‑run TIS after substantial edits, model upgrades, or annually at minimum.
  • Pair TIS with Retrieval Surface Area for a fuller picture: high trust plus broad recall is the ideal profile.

Common pitfalls

  • Over‑optimising R can lead to circular writing. Ensure real information gain accompanies reinforcement.
  • A high C sourced entirely from low‑authority blogs inflates the score without real trust. Authority weighting must be transparent.
  • Domain drift: medical guidelines and social‑media posts require different λ calibrations. Never apply global weights blindly.

Worked example

A 3,200‑token research brief cites eight peer‑reviewed articles and two news reports from reputable, neutral outlets.

Given:
C = 0.82
S = 0.88
R = 0.75
λ1 = 0.40
λ2 = 0.30
λ3 = 0.30
TIS = 0.40 * 0.82 + 0.30 * 0.88 + 0.30 * 0.75
= 0.328 + 0.264 + 0.225
= 0.817 → 0.82 (rounded)
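
Using the trust_integrity_score sketch from the Mathematical formulation section, trust_integrity_score(0.82, 0.88, 0.75) likewise returns 0.82.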

Interpretation: High‑trust. Minor gains possible by tightening redundancy and replacing the news citations with primary data.