Fabled Sky Research

AIO Standards & Frameworks

Retrieval Surface Area (RSA)

Definition

RSA measures breadth of eligibility: how many distinct prompt classes can successfully retrieve a document. A high RSA means the content is semantically versatile and discoverable across varied user intents; a low RSA means it answers only a narrow set of questions.

Mathematical formulation

RSA = | { p ∈ P : R(p, d) = 1 } |

P — predefined prompt set (your evaluation battery)

R(p, d) — binary retrieval function

  • 1 → document d is retrieved (top‑k, similarity threshold, or ranking rule met)
  • 0 → document d is not retrieved for prompt p

Score range

Integer from 0 to |P| (higher = broader surface area)
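The formulation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `rsa` and `toy_retrieved` are hypothetical, and the keyword check only stands in for a real retrieval function R(p, d).

```python
from typing import Callable, Iterable


def rsa(prompts: Iterable[str], retrieved: Callable[[str], bool]) -> int:
    """Count the prompts p in P for which R(p, d) = 1."""
    return sum(1 for p in prompts if retrieved(p))


def toy_retrieved(prompt: str) -> bool:
    # Hypothetical stand-in for R(p, d): a real system would query a
    # retrieval engine and apply a top-k or threshold rule here.
    return "rsa" in prompt.lower()


P = [
    "What is RSA?",
    "How do I compute RSA?",
    "Compare BM25 and vector search",
]
score = rsa(P, toy_retrieved)  # two of the three toy prompts match
```

Because the score is a plain count over P, it can never exceed |P|, which is why the choice of prompt set matters as much as the retrieval rule.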

Computation Pipeline

  1. Prompt‑set design
    • Build a taxonomy of intents (definition, how‑to, comparison, critique, etc.).
    • Generate or hand‑craft representative prompts for each intent.
    • Typical size: 500–2,000 prompts for reliable Monte‑Carlo estimates.
  2. Retrieval function R(p, d)
    • Choose a retrieval method (vector similarity, BM25, hybrid).
    • Define a success rule:
      Top‑k (document appears within top‑k results) — or —
      Threshold (similarity ≥ τ).
    • Run the retrieval engine for every prompt.
  3. Surface‑area count

RSA = count_successes(P)

The final integer is simply the number of prompts for which R = 1.

  4. Normalisation (optional)
RSA_norm = RSA / |P|     # value in 0.00–1.00

Use this if you want a proportional metric instead of a raw count.
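The pipeline steps can be sketched as follows, assuming the retrieval engine has already been run and its output is available as a ranked list of (doc_id, similarity) pairs per prompt. The names `Ranking`, `hit_top_k`, `hit_threshold` and `rsa_scores`, and the toy data, are illustrative assumptions rather than part of any particular library.

```python
from typing import Dict, List, Tuple

# Ranked results for one prompt: (doc_id, similarity), best match first.
Ranking = List[Tuple[str, float]]


def hit_top_k(ranking: Ranking, doc_id: str, k: int) -> bool:
    """Top-k rule: the document appears within the first k results."""
    return any(d == doc_id for d, _ in ranking[:k])


def hit_threshold(ranking: Ranking, doc_id: str, tau: float) -> bool:
    """Threshold rule: the document's similarity is at least tau."""
    return any(d == doc_id and s >= tau for d, s in ranking)


def rsa_scores(results: Dict[str, Ranking], doc_id: str, k: int) -> Tuple[int, float]:
    """Return (RSA, RSA_norm) under the top-k success rule."""
    raw = sum(hit_top_k(r, doc_id, k) for r in results.values())
    norm = raw / len(results) if results else 0.0
    return raw, norm


# Toy results for three prompts (made-up scores, for illustration only).
results = {
    "what is X": [("doc-a", 0.91), ("doc-b", 0.80)],
    "how to do X": [("doc-b", 0.77), ("doc-a", 0.62)],
    "X vs Y": [("doc-c", 0.70), ("doc-b", 0.55)],
}
raw, norm = rsa_scores(results, "doc-a", k=1)
```

Swapping `hit_top_k` for `hit_threshold` inside `rsa_scores` gives the threshold variant; whichever rule is used should be reported alongside the score.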

Interpreting RSA Values

RSA (raw) | RSA_norm  | Practical meaning   | Typical action
----------|-----------|---------------------|---------------------------------------------------------
> 800     | > 0.80    | Exceptional breadth | Prioritise for indexing; surface in more answer contexts
500–800   | 0.50–0.80 | Broad coverage      | Good; refine for niche queries
200–500   | 0.20–0.50 | Moderate            | Expand content to cover additional intents
< 200     | < 0.20    | Narrow              | Add sections/examples; improve metadata

(Assumes |P| ≈ 1 000 prompts; adjust thresholds proportionally.)
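Working from RSA_norm avoids rescaling the raw thresholds for each prompt-set size. The sketch below maps a normalised score to the bands in the table. The function name `rsa_band` is hypothetical, and because the table's ranges share their endpoints, the choice to put 0.50 in "Broad coverage" and 0.20 in "Moderate" is an assumption.

```python
def rsa_band(rsa_norm: float) -> str:
    """Map a normalised RSA score to the interpretation bands above."""
    if not 0.0 <= rsa_norm <= 1.0:
        raise ValueError("rsa_norm must lie between 0 and 1")
    if rsa_norm > 0.80:
        return "Exceptional breadth"
    if rsa_norm >= 0.50:  # assumption: shared boundary goes to the higher band
        return "Broad coverage"
    if rsa_norm >= 0.20:
        return "Moderate"
    return "Narrow"


band = rsa_band(643 / 1000)
```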

Recommended Toolchain

Task                | Suggested Tools
--------------------|------------------------------------------------------------
Prompt generation   | LlamaIndex QuestionGenerator, GPT‑4o bulk prompt scripts
Retrieval engine    | FAISS / Elasticsearch hybrid; or OpenAI embeddings + rerank
Shadow testing      | LangChain BenchmarkRetrievalEvaluator
Synthetic expansion | GPT‑4o “rewrite prompt in N variant intents” loop

Best‑Practice Guidelines

  • Publish the prompt taxonomy. RSA only has meaning when others can inspect or replicate P.
  • Include domain‑specific prompts. Generic Q&A often over‑estimates surface area.
  • Refresh annually. New user intents emerge; obsolete prompts skew the metric.
  • Combine with TIS. High RSA + high Trust Integrity Score = ideal retrieval profile.
  • Guard against prompt stuffing. Padding P with near‑duplicate prompts raises the raw count without adding real breadth.
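One way to guard against near-duplicate prompts is to filter P before scoring. The sketch below uses token-set Jaccard similarity, which is a swapped-in illustrative technique and not something the framework prescribes. The function names and the 0.8 cut-off are assumptions, and embedding-based similarity may suit paraphrased prompts better.

```python
import re
from typing import List, Set


def tokens(prompt: str) -> Set[str]:
    """Lower-cased alphanumeric tokens of a prompt."""
    return set(re.findall(r"[a-z0-9]+", prompt.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(prompts: List[str], max_similarity: float = 0.8) -> List[str]:
    """Keep a prompt only if it is not too similar to any prompt already kept."""
    kept: List[str] = []
    kept_tokens: List[Set[str]] = []
    for p in prompts:
        t = tokens(p)
        if all(jaccard(t, seen) < max_similarity for seen in kept_tokens):
            kept.append(p)
            kept_tokens.append(t)
    return kept


prompts = [
    "What is retrieval surface area?",
    "what is Retrieval Surface Area",
    "How do I compute retrieval surface area?",
]
unique = drop_near_duplicates(prompts)  # the second prompt is dropped
```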

Common Pitfalls

  • Over‑broad success rule (e.g., top‑50 instead of top‑5) inflates RSA artificially.
  • Ignoring multilingual prompts; if your audience is global, RSA should test other languages.
  • Assuming bigger is always better; a highly specialised document can be useful even at low RSA—context matters.
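The first pitfall is easy to check by sweeping the cut-off. The sketch below assumes each prompt's result has been reduced to the 1-based rank at which the document appeared, or None if it was not returned. The name `rsa_at_k` and the rank values are made up for illustration.

```python
from typing import List, Optional


def rsa_at_k(ranks: List[Optional[int]], k: int) -> int:
    """Count prompts whose document rank is within the top k."""
    return sum(1 for r in ranks if r is not None and r <= k)


# Hypothetical ranks of one document across eight prompts.
ranks = [1, 3, 7, 12, 48, None, 5, 30]

# RSA under progressively looser success rules.
sweep = {k: rsa_at_k(ranks, k) for k in (5, 10, 50)}
```

On this toy data the count rises from 3 at top-5 to 7 at top-50 with no change to the document, so a score reported without its k is hard to compare.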

Worked Example

Setup:
|P| = 1,000 prompts
Success rule: document must appear in top‑10 results

Results:
Retrieval successes = 643
RSA (raw) = 643
RSA_norm = 643 / 1,000 ≈ 0.64

Interpretation – The document is retrieved for about 64% of the tested prompts: strong coverage, with room to expand into the remaining niches.
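As a rough check on how much weight the 0.64 figure can bear, the sketch below attaches a sampling interval to RSA_norm. It rests on assumptions the framework does not state: the prompts are treated as independent draws from the population of user intents, and the normal approximation to the binomial is used. The function name `rsa_norm_interval` is hypothetical.

```python
import math
from typing import Tuple


def rsa_norm_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (RSA_norm, lower, upper) using a normal-approximation interval."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    return p, max(0.0, p - z * se), min(1.0, p + z * se)


# Figures from the worked example: 643 successes out of 1,000 prompts.
p, low, high = rsa_norm_interval(643, 1000)
```

Under these assumptions the interval spans roughly 0.61 to 0.67, so small differences in RSA_norm between documents scored on the same 1,000 prompts may not be meaningful.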