Fabled Sky Research

AIO Standards & Frameworks

Token Yield per Query (TYQ)

Fabled Sky Research | AIO v1.2.7
Last updated: April 2025


Definition

TYQ quantifies how much usable material a large language model actually lifts from a document each time it is prompted. Whereas RSA measures breadth of retrieval, TYQ measures depth: the average number of tokens an LLM cites, quotes, or paraphrases per query.

Mathematical formulation

TYQ = ( Σ tokens(d_p) ) / |P|

d_p – token span drawn from document d when answering prompt p

P – prompt set used in the evaluation battery

Score range

Float ≥ 0.0 (higher = more content utilised)

Computation Pipeline

Prompt‑set execution

  • Run each prompt in P through your retrieval‑augmented generation (RAG) pipeline.
  • Capture the final generation plus any intermediate retrieved context.
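
A minimal capture loop, assuming a generic run_pipeline callable as a stand-in for whatever RAG framework is in use (the PromptRecord fields simply mirror the two capture points above):

```python
# Hypothetical capture loop: run_pipeline is a placeholder for your own RAG
# stack; each record keeps the final answer plus the retrieved context.

from dataclasses import dataclass

@dataclass
class PromptRecord:
    prompt: str
    answer: str
    retrieved_context: list[str]

def run_battery(prompts: list[str], run_pipeline) -> list[PromptRecord]:
    records = []
    for prompt in prompts:
        # Assumed contract: run_pipeline returns (answer_text, retrieved_chunks).
        answer, context = run_pipeline(prompt)
        records.append(PromptRecord(prompt, answer, list(context)))
    return records
```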

Token attribution

  • Use the model’s token‑level attribution (e.g., OpenAI logprobs or Anthropic “source text mapping”) to identify which output tokens originate from the document.
  • Count those tokens → tokens(d_p).
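
Where provider-level attribution is not available, a rough stand-in is n‑gram overlap between the answer and the source document. The whitespace tokenizer and the 5‑gram window in the sketch below are simplifying assumptions, not part of the metric definition:

```python
# Approximate attribution via n-gram overlap: counts answer tokens that sit
# inside at least one n-gram also present in the document. A stand-in for
# provider token-level attribution, not a replacement for it.

def tokenize(text: str) -> list[str]:
    # Whitespace tokenization is a simplification; swap in your model's tokenizer.
    return text.lower().split()

def attributed_token_count(document: str, answer: str, n: int = 5) -> int:
    doc_tokens = tokenize(document)
    ans_tokens = tokenize(answer)
    doc_ngrams = {tuple(doc_tokens[i:i + n]) for i in range(len(doc_tokens) - n + 1)}

    covered = [False] * len(ans_tokens)
    for i in range(len(ans_tokens) - n + 1):
        if tuple(ans_tokens[i:i + n]) in doc_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered)  # tokens(d_p) for this prompt
```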

Aggregate

total_tokens = Σ tokens(d_p) for all p in P
TYQ = total_tokens / |P|
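
A minimal aggregation sketch, assuming tokens(d_p) has already been computed for every prompt (zero for prompts that draw nothing from d):

```python
# tokens_per_prompt holds tokens(d_p) for every prompt p in P, zeros included.

def compute_tyq(tokens_per_prompt: list[int]) -> float:
    if not tokens_per_prompt:
        raise ValueError("The prompt set P must not be empty.")
    return sum(tokens_per_prompt) / len(tokens_per_prompt)

# Example: four prompts drawing 110, 0, 240 and 96 tokens from document d.
print(compute_tyq([110, 0, 240, 96]))  # 111.5
```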

Optional quality weighting

To avoid rewarding irrelevant padding, multiply each tokens(d_p) by a relevance score r_p (e.g., semantic similarity ≥ a threshold τ):

TYQ_rel = ( Σ r_p * tokens(d_p) ) / |P|
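
The weighted variant is a small change to the same sketch; the relevance scores below are illustrative, and how r_p is derived (raw similarity, or a binary gate at τ) is left to the evaluator:

```python
# r_per_prompt holds a relevance score r_p in [0, 1] for each prompt, e.g. a
# semantic-similarity score zeroed out below a chosen threshold tau.

def compute_tyq_rel(tokens_per_prompt: list[int], r_per_prompt: list[float]) -> float:
    if len(tokens_per_prompt) != len(r_per_prompt):
        raise ValueError("Need exactly one relevance score per prompt.")
    if not tokens_per_prompt:
        raise ValueError("The prompt set P must not be empty.")
    weighted = sum(r * t for r, t in zip(r_per_prompt, tokens_per_prompt))
    return weighted / len(tokens_per_prompt)

print(compute_tyq_rel([110, 0, 240, 96], [0.9, 0.0, 0.7, 0.2]))  # 71.55
```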

Interpreting TYQ Values

Because TYQ is sensitive to both document length and prompt mix, interpret it relative to expected yield bands for your domain:

Typical length of d            Prompt type mix        Low TYQ   Moderate TYQ   High TYQ
~1,000 tokens (brief)          factual Q&A            < 40      40–120         > 120
~5,000 tokens (report)         explanatory + how‑to   < 80      80–250         > 250
~20,000 tokens (white paper)   deep‑dive queries      < 150     150–500        > 500

(Adjust bands for your own corpus size and prompt battery.)
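
For convenience the band boundaries can be encoded directly; the cut-offs below copy the illustrative values in the table and should be recalibrated for your own corpus:

```python
# (low_cutoff, high_cutoff) per document-length class, taken from the table above.
BANDS = {
    "brief": (40, 120),         # ~1,000-token documents, factual Q&A
    "report": (80, 250),        # ~5,000-token documents, explanatory + how-to
    "white_paper": (150, 500),  # ~20,000-token documents, deep-dive queries
}

def tyq_band(tyq: float, doc_class: str) -> str:
    low, high = BANDS[doc_class]
    if tyq < low:
        return "low"
    if tyq <= high:
        return "moderate"
    return "high"

print(tyq_band(124.0, "report"))  # moderate
```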

General Rule

  • High TYQ + high relevance → rich, reusable source.
  • High TYQ + low relevance → token stuffing; revise content.

Recommended Toolchain

Task                   Suggested Tools
Token attribution      OpenAI logprobs, Anthropic token_logprobs, or a diffing aligner such as TracingLLM
Prompt orchestration   LangChain or LlamaIndex evaluation modules
Quality weighting      Embedding similarity (e5-large-v2, OpenAI text‑embedding‑3‑large)
Windowed sampling      Sliding‑window or chunk‑based evaluators to avoid context‑window bias
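
For the windowed-sampling row, a simple overlapping chunker is often enough to keep long documents from being truncated by the context window; the window and stride sizes below are placeholders, not recommendations:

```python
# Overlapping sliding windows so every part of a long document gets evaluated.

def sliding_windows(tokens: list[str], window: int = 2048, stride: int = 1024) -> list[list[str]]:
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already reaches the end of the document
        start += stride
    return windows

# A 5,000-token document becomes four overlapping 2,048-token slices.
print(len(sliding_windows(["tok"] * 5000)))  # 4
```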

Best‑Practice Guidelines

  • Publish prompt context length. Very long prompts inflate TYQ mechanically.
  • Pair with relevance. Report TYQ_rel (quality‑weighted) alongside raw TYQ.
  • Watch for padding. Repeated boilerplate raises TYQ without adding value.
  • Update after model upgrades. Newer models may quote less and summarise more, reducing TYQ even if coverage is unchanged.
  • Combine with RSA and TIS. Together they describe breadth (RSA), trust (TIS), and depth (TYQ).

Common Pitfalls

  • Counting prompt tokens instead of answer tokens. TYQ should reflect output usage, not input size.
  • Ignoring paraphrase drift. If attribution relies solely on exact n‑gram matches, paraphrased content is missed. Use embedding similarity thresholds (see the sketch after this list).
  • Single‑query bias. TYQ computed on one flagship prompt is almost meaningless; always average across a representative set.
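
One way to catch paraphrase drift, sketched under the assumption that sentence-transformers and the all-MiniLM-L6-v2 checkpoint are acceptable stand-ins for the embedding models listed earlier:

```python
# Flags an answer sentence as drawn from the document when its embedding is
# close to any document sentence, even without an exact n-gram match.
# Model name and threshold are assumptions, not prescriptions.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def paraphrase_attributed(doc_sentences: list[str], answer_sentences: list[str],
                          threshold: float = 0.75) -> list[bool]:
    doc_emb = model.encode(doc_sentences, convert_to_tensor=True)
    ans_emb = model.encode(answer_sentences, convert_to_tensor=True)
    sims = util.cos_sim(ans_emb, doc_emb)  # one row per answer sentence
    return [bool(row.max() >= threshold) for row in sims]

# Token counting can then be restricted to answer sentences flagged True.
```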

Worked Example

Setup:
  |P| = 500 prompts
  total_tokens = 62,000 tokens attributed to document d
  r_p not applied (raw TYQ only)

TYQ = 62,000 / 500 = 124.0
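
The same arithmetic in code:

```python
# Worked-example check: 62,000 attributed tokens across 500 prompts.
total_tokens = 62_000
num_prompts = 500
print(total_tokens / num_prompts)  # 124.0
```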

Interpretation – The model extracts an average of 124 tokens per query from the document, a healthy depth for a mid‑length technical report. Next step: compute TYQ_rel to confirm those tokens are on‑topic.