Fabled Sky Research | AIO v1.2.7
Last updated: April 2025
Definition
ESI quantifies how “topic‑central” a document (or content chunk) is within a predefined semantic cluster. While RSA shows breadth and TYQ shows depth, ESI shows relevance density: the closer the embedding of the document sits to the centroid of its topic, the higher its salience.
Mathematical formulation
ESI = 1 – ( Σ_{e_t ∈ E_t} D(E_d, e_t) ) / N
E_d – embedding vector of the document or chunk being evaluated
E_t – set of embedding vectors of reference (topic‑representative) documents; e_t denotes one member
D(·,·) – cosine distance function
N – number of reference embeddings in the topic cluster
Score range
0.00 – 1.00 (higher = more semantically central)
At perfect overlap (E_d identical to the centroid), average cosine distance approaches 0 → ESI ≈ 1.
If E_d is orthogonal to every reference, average distance approaches 1 → ESI approaches 0; anti‑correlated embeddings would push the raw value below 0, so implementations typically clip scores to the stated 0.00 – 1.00 range.
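A tiny numeric check of these endpoints (a sketch in plain NumPy; vectors are assumed unit‑normalised, and `esi` here is an illustrative helper, not part of any published API):

```python
import numpy as np

def esi(e_d, E_t):
    """ESI = 1 - mean cosine distance; for unit vectors, distance = 1 - dot product."""
    sims = E_t @ e_d                 # cosine similarities to each reference
    return 1.0 - np.mean(1.0 - sims)

e_d = np.array([1.0, 0.0])
print(esi(e_d, np.array([[1.0, 0.0]])))   # identical:   1.0
print(esi(e_d, np.array([[0.0, 1.0]])))   # orthogonal:  0.0
print(esi(e_d, np.array([[-1.0, 0.0]])))  # opposite:   -1.0 (clipped to 0 in practice)
```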
Computation Pipeline
- Topic‑cluster creation
  - Select N representative documents (manual curation or automated topic modeling).
  - Embed each document with a consistent model (e.g., `e5-large-v2`, OpenAI `text-embedding-3-large`).
- Centroid / reference embedding set
  - Either compute a single centroid embedding, or keep the entire set E_t to preserve variance.
  - Store these vectors for reproducibility.
- Calculate cosine distances
  - `distances = [cosine_distance(E_d, e_t) for e_t in E_t]`
  - `avg_dist = sum(distances) / N`
- Compute ESI
  - `ESI = 1 - avg_dist`
- Chunk‑level vs document‑level
  - Chunk the document (≈1,000 tokens) and compute ESI for each chunk if you need granular salience mapping.
  - The document‑level ESI can be the mean or top‑k average of chunk scores.
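The steps above, end to end, as a short Python sketch. It assumes the sentence-transformers package with the `all-mpnet-base-v2` model from the toolchain below; the document texts are placeholders, and `esi_score` / `chunked_esi` are illustrative names:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")

# Step 1: embed N representative documents with one consistent model.
topic_docs = [
    "First curated reference document about the topic.",
    "Second curated reference document about the topic.",
    "Third curated reference document about the topic.",
]
E_t = model.encode(topic_docs, normalize_embeddings=True)  # shape (N, dim)

# Step 2: persist the reference set for reproducibility.
np.save("topic_reference_embeddings.npy", E_t)

# Steps 3-4: average cosine distance, then ESI = 1 - avg_dist.
def esi_score(text: str) -> float:
    e_d = model.encode([text], normalize_embeddings=True)[0]
    distances = 1.0 - E_t @ e_d      # cosine distance = 1 - similarity (unit vectors)
    return float(1.0 - distances.mean())

# Step 5: naive word-based chunking as a stand-in for ~1,000-token chunks.
def chunked_esi(document: str, words_per_chunk: int = 750) -> list[float]:
    words = document.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)]
    return [esi_score(chunk) for chunk in chunks]

print(f"document-level ESI: {esi_score('Candidate document text to evaluate.'):.2f}")
```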
Interpreting ESI Values
| ESI | Practical meaning | Recommended action |
|---|---|---|
| 0.85 – 1.00 | Highly central; exemplar of the topic | Use as canonical reference; link extensively |
| 0.60 – 0.84 | On‑topic but with tangents | Tighten focus or split into sub‑articles |
| 0.30 – 0.59 | Mixed relevance | Remove off‑topic sections or retarget |
| < 0.30 | Peripheral | Reclassify under a different topic or rewrite |
Note: Thresholds assume unit‑normalised embeddings and cosine distance in the 0 – 2 range (0 = identical, 2 = opposite).
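Applied in code, the bands reduce to a simple threshold ladder (a sketch; the function name and messages are illustrative):

```python
def esi_band(score: float) -> str:
    """Map an ESI score onto the interpretation bands in the table above."""
    if score >= 0.85:
        return "highly central - use as canonical reference"
    if score >= 0.60:
        return "on-topic with tangents - tighten focus or split"
    if score >= 0.30:
        return "mixed relevance - remove off-topic sections or retarget"
    return "peripheral - reclassify or rewrite"

print(esi_band(0.80))  # on-topic with tangents - tighten focus or split
```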
Recommended Toolchain
| Task | Suggested Tools |
|---|---|
| Embedding generation | Sentence‑Transformers (`e5-large-v2`, `all-mpnet-base-v2`) or OpenAI `text-embedding-3-large` |
| Topic modeling / clustering | BERTopic, HAC + silhouette scoring, or LDA for seed selection |
| Cosine similarity matrix | scikit-learn `cosine_similarity`, FAISS `IndexFlatIP` |
| Visualization (optional) | UMAP or t‑SNE for 2‑D cluster plots |
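For illustration, the two similarity routes in the table produce the same numbers (a sketch assuming scikit-learn and faiss-cpu are installed; FAISS's inner‑product index equals cosine similarity only for unit‑normalised vectors):

```python
import numpy as np
import faiss
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
E_t = rng.normal(size=(5, 384)).astype("float32")
E_t /= np.linalg.norm(E_t, axis=1, keepdims=True)  # unit-normalise rows
e_d = E_t.mean(axis=0)
e_d /= np.linalg.norm(e_d)

# scikit-learn: dense cosine-similarity matrix.
sims_sklearn = cosine_similarity(e_d[None, :], E_t)[0]

# FAISS: inner product == cosine similarity for unit vectors.
index = faiss.IndexFlatIP(E_t.shape[1])
index.add(E_t)
sims_faiss, _ = index.search(e_d[None, :], len(E_t))

print(1.0 - sims_sklearn.mean())   # average cosine distance
print(1.0 - sims_faiss.mean())     # same value via FAISS
```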
Best‑Practice Guidelines
- Use the same embedding model for E_d and E_t. Mixing models distorts distances.
- Keep reference sets balanced. Over‑representing one subtopic skews the centroid.
- Document preprocessing matters. Lower‑casing and punctuation stripping should be consistent.
- Re‑evaluate after major topic drift. New sub‑domains may warrant a fresh cluster.
- Combine with TIS. High salience plus high trust indicates authoritative, on‑topic content.
Common Pitfalls
- Single reference vector. Using only one exemplar makes ESI brittle; keep at least 5–10 reference embeddings when possible.
- Ignoring dimensionality changes. Switching embedding models mid‑pipeline invalidates historical scores.
- Confusing distance with similarity. Remember ESI subtracts the average distance from 1; higher scores mean closer to the topic center.
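The third pitfall in one minimal check (plain NumPy; a single reference vector is used here only to keep the arithmetic visible, which itself trips the first pitfall):

```python
import numpy as np

a = np.array([1.0, 0.0])           # document embedding (unit vector)
b = np.array([0.6, 0.8])           # lone reference embedding (unit vector)

similarity = float(a @ b)          # 0.6  (higher = closer)
distance = 1.0 - similarity        # 0.4  (lower  = closer)
esi = 1.0 - distance               # 0.6  (ESI subtracts distance from 1)
print(similarity, distance, esi)
```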
Worked Example
Setup:
N = 5 reference embeddings
distances = [0.18, 0.20, 0.22, 0.19, 0.21]
avg_dist = (0.18+0.20+0.22+0.19+0.21) / 5
= 1.00 / 5
= 0.20
ESI = 1 - 0.20
= 0.80
Interpretation – An ESI of 0.80 indicates the document is strongly central to its topic cluster, though it does not sit at the exact centroid. Small refinements could push it into exemplar territory (> 0.85).
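The same arithmetic as a short script, for anyone wiring this check into tests:

```python
import numpy as np

distances = np.array([0.18, 0.20, 0.22, 0.19, 0.21])
avg_dist = distances.mean()   # 0.20
print(1.0 - avg_dist)         # ESI = 0.80
```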