Definition
RSA measures breadth of eligibility: how many distinct prompt classes can successfully retrieve a document. A high RSA means the content is semantically versatile and discoverable across varied user intents; a low RSA means it answers only a narrow set of questions.
Mathematical formulation
RSA = | { p ∈ P : R(p, d) = 1 } |
where:
- P is the predefined prompt set (your evaluation battery)
- R(p, d) is the binary retrieval function:
  R(p, d) = 1 → document d is retrieved for prompt p (top‑k, similarity threshold, or ranking rule met)
  R(p, d) = 0 → document d is not retrieved for prompt p
Score range
Integer ≥ 0 (higher = broader surface area)
Computation Pipeline
- Prompt‑set design
  - Build a taxonomy of intents (definition, how‑to, comparison, critique, etc.).
  - Generate or hand‑craft representative prompts for each intent.
  - Typical size: 500–2,000 prompts for reliable Monte‑Carlo estimates.
- Retrieval function R(p, d)
  - Choose a retrieval method (vector similarity, BM25, hybrid).
  - Define a success rule: top‑k (document appears within the top‑k results) or threshold (similarity ≥ τ).
  - Run the retrieval engine for every prompt.
- Surface‑area count
  RSA = count_successes(P)
  The final integer is simply the number of prompts for which R = 1.
- Normalisation (optional)
  RSA_norm = RSA / |P| # value in 0.00–1.00
  Use this if you want a proportional metric instead of a raw count.
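The pipeline above can be sketched end to end in a few lines. This is a minimal illustration, not a reference implementation: the corpus, the prompts, the `toy_search` word-overlap scorer, and the helper names (`hit_top_k`, `hit_threshold`, `rsa`) are all hypothetical stand-ins for a real retrieval engine and prompt battery.

```python
from typing import Callable, List, Tuple

Ranking = List[Tuple[str, float]]  # (doc_id, score), best match first

# Hypothetical three-document corpus used only for this illustration.
CORPUS = {
    "doc-a": "how to bake sourdough bread at home",
    "doc-b": "comparison of bread flour and all purpose flour",
    "doc-c": "history of the bicycle",
}


def toy_search(prompt: str) -> Ranking:
    """Stand-in retrieval engine: rank documents by Jaccard word overlap."""
    query = set(prompt.lower().split())
    scored = []
    for doc_id, text in CORPUS.items():
        words = set(text.split())
        scored.append((doc_id, len(query & words) / len(query | words)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def hit_top_k(ranking: Ranking, doc_id: str, k: int) -> bool:
    """Success rule 1: the document appears within the top-k results."""
    return any(d == doc_id for d, _ in ranking[:k])


def hit_threshold(ranking: Ranking, doc_id: str, tau: float) -> bool:
    """Success rule 2: the document's similarity score is at least tau."""
    return any(d == doc_id and score >= tau for d, score in ranking)


def rsa(
    prompts: List[str],
    search: Callable[[str], Ranking],
    doc_id: str,
    rule: Callable[[Ranking, str], bool],
) -> Tuple[int, float]:
    """Return (RSA, RSA_norm): the success count and its share of |P|."""
    successes = sum(1 for p in prompts if rule(search(p), doc_id))
    norm = successes / len(prompts) if prompts else 0.0
    return successes, norm


PROMPTS = [
    "how to bake bread",
    "bread flour comparison",
    "history of the bicycle",
    "sourdough at home",
]

raw_top1, norm_top1 = rsa(
    PROMPTS, toy_search, "doc-a", lambda r, d: hit_top_k(r, d, k=1)
)
raw_tau, norm_tau = rsa(
    PROMPTS, toy_search, "doc-a", lambda r, d: hit_threshold(r, d, tau=0.1)
)
print(raw_top1, norm_top1)  # top-1 rule
print(raw_tau, norm_tau)    # threshold rule, tau = 0.1
```

Passing the success rule in as a function keeps the count itself independent of the retrieval method, so the same prompt set can be scored under top‑k and threshold rules and the two results compared.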
Interpreting RSA Values
RSA (raw) | RSA_norm | Practical meaning | Typical action |
---|---|---|---|
> 800 | > 0.80 | Exceptional breadth | Prioritise for indexing; surface in more answer contexts |
500–800 | 0.50–0.80 | Broad coverage | Good; refine for niche queries |
200–500 | 0.20–0.50 | Moderate | Expand content to cover additional intents |
< 200 | < 0.20 | Narrow | Add sections/examples; improve metadata |
(Assumes |P| ≈ 1 000 prompts; adjust thresholds proportionally.)
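Because the bands are defined on RSA_norm, they can be applied to any prompt‑set size by normalising first. The helper below is a sketch of that lookup; the function name `rsa_band` is hypothetical, and it assumes that a value sitting exactly on a shared boundary (0.50 or 0.20) belongs to the higher band, which the table does not specify.

```python
def rsa_band(raw: int, n_prompts: int) -> str:
    """Map a raw RSA count to the interpretation band for its RSA_norm."""
    if n_prompts <= 0:
        raise ValueError("n_prompts must be positive")
    if not 0 <= raw <= n_prompts:
        raise ValueError("raw must be between 0 and n_prompts")

    norm = raw / n_prompts
    if norm > 0.80:
        return "Exceptional breadth"
    if norm >= 0.50:  # assumption: boundary value goes to the higher band
        return "Broad coverage"
    if norm >= 0.20:  # assumption: boundary value goes to the higher band
        return "Moderate"
    return "Narrow"


print(rsa_band(643, 1000))  # Broad coverage
print(rsa_band(90, 500))    # RSA_norm = 0.18, so Narrow
```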
Recommended Toolchain
Task | Suggested Tools |
---|---|
Prompt generation | LlamaIndex QuestionGenerator, GPT‑4o bulk prompt scripts |
Retrieval engine | FAISS / Elasticsearch hybrid; or OpenAI embeddings + rerank |
Shadow testing | LangChain BenchmarkRetrievalEvaluator |
Synthetic expansion | GPT‑4o “rewrite prompt in N variant intents” loop |
Best‑Practice Guidelines
- Publish the prompt taxonomy. RSA only has meaning when others can inspect or replicate P.
- Include domain‑specific prompts. Generic Q&A often over‑estimates surface area.
- Refresh annually. New user intents emerge; obsolete prompts skew the metric.
- Combine with TIS. High RSA + high Trust Integrity Score = ideal retrieval profile.
- Guard against prompt stuffing. Inflating |P| with near‑duplicates makes RSA easier to boost without real breadth.
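One way to guard against prompt stuffing is to filter near‑duplicate prompts before scoring. The sketch below uses token‑set Jaccard similarity from the standard library only; the function names and the 0.8 cut‑off are illustrative assumptions, and an embedding‑based similarity would catch paraphrases that share few words.

```python
import re
from typing import FrozenSet, List


def tokens(prompt: str) -> FrozenSet[str]:
    """Lower-case the prompt and keep its alphanumeric words as a set."""
    return frozenset(re.findall(r"[a-z0-9]+", prompt.lower()))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of tokens the two sets have in common (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_prompts(prompts: List[str], max_similarity: float = 0.8) -> List[str]:
    """Keep a prompt only if it is not too similar to one already kept.

    max_similarity is an assumed cut-off; tune it on your own prompt set.
    The pairwise comparison is O(n^2), which is fine for a few thousand prompts.
    """
    kept: List[str] = []
    kept_tokens: List[FrozenSet[str]] = []
    for prompt in prompts:
        current = tokens(prompt)
        if all(jaccard(current, seen) < max_similarity for seen in kept_tokens):
            kept.append(prompt)
            kept_tokens.append(current)
    return kept


battery = [
    "What is retrieval surface area?",
    "what is Retrieval Surface Area",
    "What is the retrieval surface area?",
    "How do I compute RSA for a document?",
]
unique = dedupe_prompts(battery)
print(len(battery), "->", len(unique))
```

Reporting |P| after de‑duplication, alongside the raw size, makes it harder for padded prompt sets to pass unnoticed.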
Common Pitfalls
- Over‑broad success rule (e.g., top‑50 instead of top‑5) inflates RSA artificially.
- Ignoring multilingual prompts; if your audience is global, RSA should test other languages.
- Assuming bigger is always better; a highly specialised document can be useful even at low RSA—context matters.
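The first pitfall is easy to check by sweeping k and watching how the count moves. The sketch below assumes you have already recorded, for each prompt, the rank at which the document appeared (or `None` if it was not returned at all); the `ranks` list is made‑up data for illustration.

```python
from typing import Optional, Sequence


def rsa_at_k(ranks: Sequence[Optional[int]], k: int) -> int:
    """Count prompts whose document rank is within the top-k (1-based ranks)."""
    return sum(1 for rank in ranks if rank is not None and rank <= k)


# Hypothetical ranks of one document across ten prompts; None = not retrieved.
ranks = [1, 3, 7, 12, 48, None, 2, 30, 5, None]

sweep = {k: rsa_at_k(ranks, k) for k in (5, 10, 50)}
for k, count in sweep.items():
    print(f"top-{k}: RSA = {count} ({count / len(ranks):.2f} normalised)")
```

On this made‑up data the count doubles between top‑5 and top‑50 even though nothing about the document has changed, which is why the success rule should be fixed and reported with every RSA figure.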
Worked Example
Setup:
|P| = 1,000 prompts
Success rule: document must appear in top‑10 results
Results:
Retrieval successes = 643
RSA (raw) = 643
RSA_norm = 643 / 1,000 ≈ 0.64
Interpretation – The document is retrieved for about 64 % of the tested prompts, which places it in the "Broad coverage" band: strong, but with room to expand into the remaining niches.