Fabled Sky Research | AIO v1.2.7
Last updated: April 2025
Definition
TYQ quantifies how much usable material a large language model actually lifts from a document each time it is prompted. Whereas RSA measures breadth of retrieval, TYQ measures depth: the average number of tokens an LLM cites, quotes, or paraphrases per query.
Mathematical Formulation
TYQ = ( Σ tokens(d_p) ) / |P|
d_p – token span drawn from document d when answering prompt p; tokens(d_p) is its length in tokens
P – prompt set used in the evaluation battery
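The same definition in display notation, with the sum running over every prompt p in the battery:

```latex
\mathrm{TYQ} = \frac{\sum_{p \in P} \mathrm{tokens}(d_p)}{\lvert P \rvert}
```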
Score Range
Float ≥ 0.0 (higher = more content utilised)
Computation Pipeline
Prompt‑set execution
- Run each prompt in P through your retrieval‑augmented (RAG) pipeline.
- Capture the final generation plus any intermediate retrieved context, as in the sketch below.
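A minimal capture loop might look like the following sketch; `pipeline`, `run`, `.text`, and `.context_chunks` are placeholder names for whatever your stack exposes, not a specific library's API:

```python
# Sketch: execute the battery and record what the later steps need.
records = []
for prompt in prompts:                       # prompts = the evaluation set P
    result = pipeline.run(prompt)            # hypothetical call; adapt to your stack
    records.append({
        "prompt": prompt,
        "answer": result.text,               # final generation
        "retrieved": result.context_chunks,  # intermediate retrieved context
    })
```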
Token attribution
- Use the model’s token‑level attribution (e.g., OpenAI logprobs or Anthropic “source text mapping”) to identify which output tokens originate from the document.
- Count those tokens → tokens(d_p). Where no provider‑level attribution is available, a fallback aligner is sketched below.
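A crude exact‑match aligner can stand in for provider attribution; this sketch assumes whitespace tokenisation and exact n‑gram overlap, so it will miss paraphrases (see Common Pitfalls):

```python
# Sketch: count answer tokens that fall inside an n-gram also present in
# the document. Real aligners are more robust; this is a floor estimate.
def attributed_tokens(answer: str, document: str, n: int = 5) -> int:
    doc_tokens = document.split()            # swap in a real tokenizer
    ans_tokens = answer.split()
    doc_ngrams = {tuple(doc_tokens[i:i + n])
                  for i in range(len(doc_tokens) - n + 1)}
    attributed = [False] * len(ans_tokens)
    for i in range(len(ans_tokens) - n + 1):
        if tuple(ans_tokens[i:i + n]) in doc_ngrams:
            for j in range(i, i + n):
                attributed[j] = True
    return sum(attributed)
```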
Aggregate
total_tokens = Σ tokens(d_p) for all p in P
TYQ = total_tokens / |P|
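In code, the aggregation is a plain mean over the battery:

```python
# Sketch: raw TYQ from per-prompt attribution counts.
def tyq(token_counts: list[int]) -> float:
    """token_counts[i] = tokens(d_p) for the i-th prompt in P."""
    return sum(token_counts) / len(token_counts)
```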
Optional quality weighting
To avoid rewarding irrelevant padding, multiply each tokens(d_p) by a relevance score r_p (e.g., a binary gate that sets r_p = 1 when semantic similarity ≥ τ and 0 otherwise):
TYQ_rel = ( Σ r_p * tokens(d_p) ) / |P|
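The weighted variant is the same mean with precomputed r_p values, however you derive them:

```python
# Sketch: quality-weighted TYQ_rel. relevance[i] = r_p in [0, 1] for the
# i-th prompt, e.g. an embedding similarity gated at threshold tau.
def tyq_rel(token_counts: list[int], relevance: list[float]) -> float:
    return sum(r * t for r, t in zip(relevance, token_counts)) / len(token_counts)
```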
Interpreting TYQ Values
Because TYQ is sensitive to both document length and prompt mix, interpret it relative to expected yield bands for your domain:
| Typical length of d | Prompt‑type mix | Low TYQ | Moderate TYQ | High TYQ |
| --- | --- | --- | --- | --- |
| ~1,000 tokens (brief) | factual Q&A | < 40 | 40–120 | > 120 |
| ~5,000 tokens (report) | explanatory + how‑to | < 80 | 80–250 | > 250 |
| ~20,000 tokens (white paper) | deep‑dive queries | < 150 | 150–500 | > 500 |
(Adjust bands for your own corpus size and prompt battery.)
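One way to operationalise the bands; the profile names are invented for this sketch and the cut‑offs simply mirror the table above, so recalibrate both for your corpus:

```python
# Sketch: map a raw TYQ score to a qualitative band.
BANDS = {
    "brief":       (40, 120),   # ~1,000-token documents
    "report":      (80, 250),   # ~5,000-token documents
    "white_paper": (150, 500),  # ~20,000-token documents
}

def tyq_band(score: float, profile: str) -> str:
    low, high = BANDS[profile]
    if score < low:
        return "low"
    return "moderate" if score <= high else "high"
```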
General Rule
High TYQ + high relevance → rich, reusable source.
High TYQ + low relevance → token stuffing; revise content.
Recommended Toolchain
| Task | Suggested Tools |
| --- | --- |
| Token attribution | OpenAI logprobs, Anthropic token_logprobs, or a diffing aligner such as TracingLLM |
| Prompt orchestration | LangChain or LlamaIndex evaluation modules |
| Quality weighting | Embedding similarity (e5-large-v2, OpenAI text-embedding-3-large) |
| Windowed sampling | Sliding‑window or chunk‑based evaluators to avoid context‑window bias |
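A sketch of the quality‑weighting row using sentence‑transformers with e5-large-v2; the `query:`/`passage:` prefixes are an e5‑family convention, and scoring the whole document as a single passage is a simplification:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

def relevance(answer: str, document: str) -> float:
    # Cosine similarity between answer and document embeddings -> r_p.
    emb = model.encode(
        [f"query: {answer}", f"passage: {document}"],
        normalize_embeddings=True,
    )
    return float(util.cos_sim(emb[0], emb[1]))
```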
Best‑Practice Guidelines
- Publish prompt context length. Very long prompts inflate TYQ mechanically.
- Pair with relevance. Report TYQ_rel (quality‑weighted) alongside raw TYQ.
- Watch for padding. Repeated boilerplate raises TYQ without adding value.
- Update after model upgrades. Newer models may quote less and summarise more, reducing TYQ even if coverage is unchanged.
- Combine with RSA and TIS. Together they describe breadth (RSA), trust (TIS), and depth (TYQ).
Common Pitfalls
- Counting prompt tokens instead of answer tokens. TYQ should reflect output usage, not input size.
- Ignoring paraphrase drift. If attribution relies solely on exact n‑gram matches, paraphrased content is missed. Use embedding similarity thresholds, as in the sketch after this list.
- Single‑query bias. TYQ computed on one flagship prompt is almost meaningless; always average across a representative set.
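A sketch of threshold‑based soft attribution, reusing `model` and `util` from the relevance sketch above; the naive sentence split and the τ = 0.80 default are illustrative assumptions:

```python
# Sketch: credit an answer sentence's tokens to the document when its best
# embedding match against the document's chunks clears tau, so paraphrases
# count even without exact n-gram overlap.
def attributed_tokens_soft(answer: str, doc_chunks: list[str],
                           tau: float = 0.80) -> int:
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    chunk_emb = model.encode([f"passage: {c}" for c in doc_chunks],
                             normalize_embeddings=True)
    count = 0
    for sent in sentences:
        sent_emb = model.encode([f"query: {sent}"], normalize_embeddings=True)
        if float(util.cos_sim(sent_emb, chunk_emb).max()) >= tau:
            count += len(sent.split())
    return count
```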
Worked Example
Setup:
|P| = 500 prompts
total_tokens = 62,000 tokens attributed to document d
Optional r_p = not applied (raw TYQ only)
TYQ = 62,000 / 500 = 124.0
Interpretation – The model extracts an average of 124 tokens per query from the document, healthy depth for a mid‑length technical report. Next step: compute TYQ_rel to confirm those tokens are on‑topic.
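The same arithmetic as a quick check:

```python
total_tokens = 62_000              # Σ tokens(d_p) across the battery
num_prompts = 500                  # |P|
print(total_tokens / num_prompts)  # 124.0
```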