Fabled Sky Research

AIO Standards & Frameworks

Embedding Salience Index (ESI)

Fabled Sky Research | AIO v1.2.7
Last updated: April 2025


Definition

ESI quantifies how “topic‑central” a document (or content chunk) is within a predefined semantic cluster. While RSA shows breadth and TYQ shows depth, ESI shows relevance density: the closer the embedding of the document sits to the centroid of its topic, the higher its salience.

Mathematical Formulation

ESI = 1 – ( Σ D(E_d , E_t) ) / N

E_d – embedding vector of the document or chunk being evaluated

E_t – the t‑th embedding vector in the set of reference (topic‑representative) documents; the sum runs over all N of them

D(· , ·) – cosine distance function

N – number of reference embeddings in the topic cluster
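
In code, the formula reduces to a few lines. A minimal Python sketch (the function names and the NumPy dependency are illustrative, not part of the standard):

    import numpy as np

    def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
        # 0 = identical direction, 1 = orthogonal, 2 = opposite
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def esi(e_d: np.ndarray, e_t: list[np.ndarray]) -> float:
        # ESI = 1 - average cosine distance to the N reference embeddings
        avg_dist = sum(cosine_distance(e_d, e) for e in e_t) / len(e_t)
        return 1.0 - avg_dist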

Score Range

0.00 – 1.00 (higher = more semantically central)

At perfect overlap (E_d identical to the centroid), the average cosine distance approaches 0 → ESI ≈ 1.
If E_d is unrelated to the cluster (orthogonal embeddings), the average distance approaches 1 → ESI approaches 0. Distances beyond 1 (anti‑aligned embeddings, up to 2) would push the raw value below 0; floor such scores at 0.00 to stay within the published range.
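
A quick sanity check of those limits, using the sketch above with toy 2‑D vectors:

    e_d = np.array([1.0, 0.0])
    esi(e_d, [np.array([1.0, 0.0])])    # identical   -> 1.0
    esi(e_d, [np.array([0.0, 1.0])])    # orthogonal  -> 0.0
    esi(e_d, [np.array([-1.0, 0.0])])   # opposite    -> -1.0 raw, floored to 0.00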

Computation Pipeline

  1. Topic‑cluster creation
    • Select N representative documents (manual curation or automated topic modeling).
    • Embed each document with a consistent model (e.g., e5-large-v2, OpenAI text‑embedding‑3‑large).
  2. Centroid / reference embedding set
    • Either compute a single centroid embedding, or keep the entire set E_t to preserve variance.
    • Store these vectors for reproducibility.
  3. Calculate cosine distances
    • distances = [ cosine_distance(E_d, e_t) for e_t in E_t ]
    • avg_dist = sum(distances) / N
  4. Compute ESI
    • ESI = 1 – avg_dist
  5. Chunk‑level vs document‑level
    • Chunk the document (≈ 1 000 tokens) and compute ESI for each chunk if you need granular salience mapping.
    • The document‑level ESI can be the mean or a top‑k average of the chunk scores (see the sketch after this list).
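
The full pipeline might look like the following sketch, assuming Sentence‑Transformers and pre‑chunked text. The model choice and the placeholder lists are assumptions, not prescriptions:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Steps 1-2: one consistent embedding model for references and document alike.
    model = SentenceTransformer("intfloat/e5-large-v2")
    reference_docs = ["..."]   # N topic-representative texts (your own curation)
    chunks = ["..."]           # ~1,000-token chunks of the document under evaluation

    def esi_per_chunk(chunks: list[str], references: list[str]) -> np.ndarray:
        # normalize_embeddings=True yields unit vectors, so a dot product
        # is cosine similarity and (1 - dot) is cosine distance.
        e_t = model.encode(references, normalize_embeddings=True)
        e_d = model.encode(chunks, normalize_embeddings=True)
        avg_dist = (1.0 - e_d @ e_t.T).mean(axis=1)   # step 3: mean distance per chunk
        return 1.0 - avg_dist                         # step 4: ESI per chunk

    # Step 5: aggregate chunk scores into a document-level ESI.
    chunk_scores = esi_per_chunk(chunks, reference_docs)
    doc_esi_mean = float(chunk_scores.mean())
    doc_esi_top3 = float(np.sort(chunk_scores)[-3:].mean())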

Interpreting ESI Values

ESI           Practical meaning                       Recommended action
0.85 – 1.00   Highly central; exemplar of the topic   Use as canonical reference; link extensively
0.60 – 0.84   On‑topic but with tangents              Tighten focus or split into sub‑articles
0.30 – 0.59   Mixed relevance                         Remove off‑topic sections or retarget
< 0.30        Peripheral                              Reclassify under a different topic or rewrite

Note: Thresholds assume unit‑normalised embeddings and cosine distance (0 = identical, 2 = opposite).
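
For triage, the bands above translate directly into a small helper (band edges copied from the table; the action strings are abbreviated):

    def esi_band(score: float) -> str:
        if score >= 0.85:
            return "highly central: use as canonical reference"
        if score >= 0.60:
            return "on-topic with tangents: tighten focus or split"
        if score >= 0.30:
            return "mixed relevance: remove off-topic sections or retarget"
        return "peripheral: reclassify or rewrite"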

Recommended Toolchain

Task                          Suggested Tools
Embedding generation          Sentence‑Transformers (e5-large-v2, all-mpnet-base-v2) or OpenAI text‑embedding‑3‑large
Topic modeling / clustering   BERTopic, HAC + silhouette scoring, or LDA for seed selection
Cosine similarity matrix      scikit-learn cosine_similarity, FAISS IndexFlatIP
Visualization (optional)      UMAP or t‑SNE for 2‑D cluster plots
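
As one example of the FAISS route, an IndexFlatIP over unit‑normalised vectors returns cosine similarities directly (the shapes and random stand‑in data are illustrative):

    import faiss
    import numpy as np

    e_t = np.random.rand(50, 1024).astype("float32")   # stand-in reference embeddings
    e_d = np.random.rand(1, 1024).astype("float32")    # stand-in document embedding

    faiss.normalize_L2(e_t)   # unit-normalise in place
    faiss.normalize_L2(e_d)

    index = faiss.IndexFlatIP(e_t.shape[1])    # inner product == cosine on unit vectors
    index.add(e_t)
    sims, _ = index.search(e_d, e_t.shape[0])  # similarities to every reference
    esi_value = 1.0 - (1.0 - sims).mean()      # 1 - average cosine distance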

Best‑Practice Guidelines

  • Use the same embedding model for E_d and E_t. Mixing models distorts distances.
  • Keep reference sets balanced. Over‑representing one subtopic skews the centroid.
  • Document preprocessing matters. Lower‑casing and punctuation stripping should be consistent between E_d and E_t (see the sketch after this list).
  • Re‑evaluate after major topic drift. New sub‑domains may warrant a fresh cluster.
  • Combine with TIS. High salience plus high trust indicates authoritative, on‑topic content.
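
One way to keep preprocessing consistent is to route every text, document and reference alike, through a single function before embedding (a sketch; the exact normalisation steps are a project‑level decision):

    import re

    def preprocess(text: str) -> str:
        text = text.lower()                       # consistent lower-casing
        text = re.sub(r"[^\w\s]", "", text)       # consistent punctuation stripping
        return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

    # Apply to BOTH the document and every reference text,
    # so E_d and E_t are produced under identical conditions.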

Common Pitfalls

  • Single reference vector. Using only one exemplar makes ESI brittle; keep at least 5–10 reference embeddings when possible.
  • Ignoring dimensionality changes. Switching embedding models mid‑pipeline invalidates historical scores.
  • Confusing distance with similarity. Remember ESI subtracts the average distance from 1; higher scores mean closer to the topic center.

Worked Example

Setup:
N = 5 reference embeddings
distances(E_d) = [0.18, 0.20, 0.22, 0.19, 0.21]

avg_dist = (0.18+0.20+0.22+0.19+0.21) / 5
= 1.00 / 5
= 0.20

ESI = 1 - 0.20
= 0.80
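
The same arithmetic, checked in plain Python:

    distances = [0.18, 0.20, 0.22, 0.19, 0.21]
    avg_dist = sum(distances) / len(distances)   # 0.20
    esi_value = 1 - avg_dist                     # 0.80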

Interpretation – An ESI of 0.80 indicates the document is strongly central to its topic cluster, though not a perfect match for the centroid. Small refinements could push it into exemplar territory (ESI > 0.85).