Fabled Sky Research

AIO Standards & Frameworks

Embedding Salience Index (ESI)

Fabled Sky Research | AIO v1.2.7
Last updated: April 2025


Definition

ESI quantifies how “topic‑central” a document (or content chunk) is within a predefined semantic cluster. While RSA shows breadth and TYQ shows depth, ESI shows relevance density: the closer the embedding of the document sits to the centroid of its topic, the higher its salience.

Mathematical Formulation

ESI = 1 – ( Σ D(E_d , E_t) ) / N

E_d – embedding vector of the document or chunk being evaluated

E_t – the t‑th embedding vector in the set of reference (topic‑representative) documents; the sum runs over all N of them

D(· , ·) – cosine distance function

N – number of reference embeddings in the topic cluster
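
In code, the formula reduces to a few lines. A minimal Python sketch (the function names and the NumPy dependency are illustrative, not part of the standard):

    import numpy as np

    def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
        # 0 = identical direction, 1 = orthogonal, 2 = opposite
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def esi(e_d: np.ndarray, e_t: list[np.ndarray]) -> float:
        # ESI = 1 - average cosine distance to the N reference embeddings
        avg_dist = sum(cosine_distance(e_d, e) for e in e_t) / len(e_t)
        return 1.0 - avg_dist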

Score Range

0.00 – 1.00 (higher = more semantically central)

At perfect overlap (E_d identical to the centroid), the average cosine distance approaches 0 → ESI ≈ 1.
If E_d is unrelated to the cluster (orthogonal embeddings), the average distance approaches 1 → ESI approaches 0. Distances beyond 1 (anti‑aligned embeddings, up to 2) would push the raw value below 0; floor such scores at 0.00 to stay within the published range.
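
A quick sanity check of those limits, using the sketch above with toy 2‑D vectors:

    e_d = np.array([1.0, 0.0])
    esi(e_d, [np.array([1.0, 0.0])])    # identical   -> 1.0
    esi(e_d, [np.array([0.0, 1.0])])    # orthogonal  -> 0.0
    esi(e_d, [np.array([-1.0, 0.0])])   # opposite    -> -1.0 raw, floored to 0.00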

Computation Pipeline

  1. Topic‑cluster creation
    • Select N representative documents (manual curation or automated topic modeling).
    • Embed each document with a consistent model (e.g., e5-large-v2, OpenAI text‑embedding‑3‑large).
  2. Centroid / reference embedding set
    • Either compute a single centroid embedding, or keep the entire set E_t to preserve variance.
    • Store these vectors for reproducibility.
  3. Calculate cosine distances
    • distances = [ cosine_distance(E_d, e_t) for e_t in E_t ]
    • avg_dist = sum(distances) / N
  4. Compute ESI
    • ESI = 1 – avg_dist
  5. Chunk‑level vs document‑level
    • Chunk the document (≈ 1 000 tokens) and compute ESI for each chunk if you need granular salience mapping.
    • The document‑level ESI can be the mean or a top‑k average of the chunk scores (see the sketch after this list).
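
The full pipeline might look like the following sketch, assuming Sentence‑Transformers and pre‑chunked text. The model choice and the placeholder lists are assumptions, not prescriptions:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Steps 1-2: one consistent embedding model for references and document alike.
    model = SentenceTransformer("intfloat/e5-large-v2")
    reference_docs = ["..."]   # N topic-representative texts (your own curation)
    chunks = ["..."]           # ~1,000-token chunks of the document under evaluation

    def esi_per_chunk(chunks: list[str], references: list[str]) -> np.ndarray:
        # normalize_embeddings=True yields unit vectors, so a dot product
        # is cosine similarity and (1 - dot) is cosine distance.
        e_t = model.encode(references, normalize_embeddings=True)
        e_d = model.encode(chunks, normalize_embeddings=True)
        avg_dist = (1.0 - e_d @ e_t.T).mean(axis=1)   # step 3: mean distance per chunk
        return 1.0 - avg_dist                         # step 4: ESI per chunk

    # Step 5: aggregate chunk scores into a document-level ESI.
    chunk_scores = esi_per_chunk(chunks, reference_docs)
    doc_esi_mean = float(chunk_scores.mean())
    doc_esi_top3 = float(np.sort(chunk_scores)[-3:].mean())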

Interpreting ESI Values

ESI           Practical meaning                       Recommended action
0.85 – 1.00   Highly central; exemplar of the topic   Use as canonical reference; link extensively
0.60 – 0.84   On‑topic but with tangents              Tighten focus or split into sub‑articles
0.30 – 0.59   Mixed relevance                         Remove off‑topic sections or retarget
< 0.30        Peripheral                              Reclassify under a different topic or rewrite

Note: Thresholds assume unit‑normalised embeddings and cosine distance (0 = identical, 2 = opposite).
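
For triage, the bands above translate directly into a small helper (band edges copied from the table; the action strings are abbreviated):

    def esi_band(score: float) -> str:
        if score >= 0.85:
            return "highly central: use as canonical reference"
        if score >= 0.60:
            return "on-topic with tangents: tighten focus or split"
        if score >= 0.30:
            return "mixed relevance: remove off-topic sections or retarget"
        return "peripheral: reclassify or rewrite"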

Recommended Toolchain

Task                          Suggested Tools
Embedding generation          Sentence‑Transformers (e5-large-v2, all-mpnet-base-v2) or OpenAI text‑embedding‑3‑large
Topic modeling / clustering   BERTopic, HAC + silhouette scoring, or LDA for seed selection
Cosine similarity matrix      scikit-learn cosine_similarity, FAISS IndexFlatIP
Visualization (optional)      UMAP or t‑SNE for 2‑D cluster plots
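
As one example of the FAISS route, an IndexFlatIP over unit‑normalised vectors returns cosine similarities directly (the shapes and random stand‑in data are illustrative):

    import faiss
    import numpy as np

    e_t = np.random.rand(50, 1024).astype("float32")   # stand-in reference embeddings
    e_d = np.random.rand(1, 1024).astype("float32")    # stand-in document embedding

    faiss.normalize_L2(e_t)   # unit-normalise in place
    faiss.normalize_L2(e_d)

    index = faiss.IndexFlatIP(e_t.shape[1])    # inner product == cosine on unit vectors
    index.add(e_t)
    sims, _ = index.search(e_d, e_t.shape[0])  # similarities to every reference
    esi_value = 1.0 - (1.0 - sims).mean()      # 1 - average cosine distance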

Best‑Practice Guidelines

  • Use the same embedding model for E_d and E_t. Mixing models distorts distances.
  • Keep reference sets balanced. Over‑representing one subtopic skews the centroid.
  • Document preprocessing matters. Lower‑casing and punctuation stripping should be consistent between E_d and E_t (see the sketch after this list).
  • Re‑evaluate after major topic drift. New sub‑domains may warrant a fresh cluster.
  • Combine with TIS. High salience plus high trust indicates authoritative, on‑topic content.
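
One way to keep preprocessing consistent is to route every text, document and reference alike, through a single function before embedding (a sketch; the exact normalisation steps are a project‑level decision):

    import re

    def preprocess(text: str) -> str:
        text = text.lower()                       # consistent lower-casing
        text = re.sub(r"[^\w\s]", "", text)       # consistent punctuation stripping
        return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

    # Apply to BOTH the document and every reference text,
    # so E_d and E_t are produced under identical conditions.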

Common Pitfalls

  • Single reference vector. Using only one exemplar makes ESI brittle; keep at least 5–10 reference embeddings when possible.
  • Ignoring dimensionality changes. Switching embedding models mid‑pipeline invalidates historical scores.
  • Confusing distance with similarity. Remember ESI subtracts the average distance from 1; higher scores mean closer to the topic center.

Worked Example

Setup:
N = 5 reference embeddings
distances(E_d) = [0.18, 0.20, 0.22, 0.19, 0.21]

avg_dist = (0.18+0.20+0.22+0.19+0.21) / 5
= 1.00 / 5
= 0.20

ESI = 1 - 0.20
= 0.80
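
The same arithmetic, checked in plain Python:

    distances = [0.18, 0.20, 0.22, 0.19, 0.21]
    avg_dist = sum(distances) / len(distances)   # 0.20
    esi_value = 1 - avg_dist                     # 0.80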

Interpretation – An ESI of 0.80 indicates the document is strongly central to its topic cluster, though not a perfect match for the centroid. Small refinements could push it into exemplar territory (ESI > 0.85).