Fabled Sky Research

AIO Standards & Frameworks

Token Yield per Query (TYQ)

Fabled Sky Research | AIO v1.2.7
Last updated: April 2025


Definition

TYQ quantifies how much usable material a large language model actually lifts from a document each time it is prompted. Whereas RSA measures breadth of retrieval, TYQ measures depth: the average number of tokens an LLM cites, quotes, or paraphrases per query.

Mathematical formulation

TYQ = ( Σ tokens(d_p) ) / |P|

d_p – token span drawn from document d when answering prompt p

P – prompt set used in the evaluation battery

Score range

Float ≥ 0.0 (higher = more content utilised)

Computation Pipeline

Prompt‑set execution

  • Run each prompt in P through your retrieval‑augmented generation (RAG) pipeline.
  • Capture the final generation plus any intermediate retrieved context.
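
A minimal capture loop, assuming a generic run_pipeline callable as a stand-in for whatever RAG framework is in use (the PromptRecord fields simply mirror the two capture points above):

```python
# Hypothetical capture loop: run_pipeline is a placeholder for your own RAG
# stack; each record keeps the final answer plus the retrieved context.

from dataclasses import dataclass

@dataclass
class PromptRecord:
    prompt: str
    answer: str
    retrieved_context: list[str]

def run_battery(prompts: list[str], run_pipeline) -> list[PromptRecord]:
    records = []
    for prompt in prompts:
        # Assumed contract: run_pipeline returns (answer_text, retrieved_chunks).
        answer, context = run_pipeline(prompt)
        records.append(PromptRecord(prompt, answer, list(context)))
    return records
```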

Token attribution

  • Use the model’s token‑level attribution (e.g., OpenAI logprobs or Anthropic “source text mapping”) to identify which output tokens originate from the document.
  • Count those tokens → tokens(d_p).
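
Where provider-level attribution is not available, a rough stand-in is n‑gram overlap between the answer and the source document. The whitespace tokenizer and the 5‑gram window in the sketch below are simplifying assumptions, not part of the metric definition:

```python
# Approximate attribution via n-gram overlap: counts answer tokens that sit
# inside at least one n-gram also present in the document. A stand-in for
# provider token-level attribution, not a replacement for it.

def tokenize(text: str) -> list[str]:
    # Whitespace tokenization is a simplification; swap in your model's tokenizer.
    return text.lower().split()

def attributed_token_count(document: str, answer: str, n: int = 5) -> int:
    doc_tokens = tokenize(document)
    ans_tokens = tokenize(answer)
    doc_ngrams = {tuple(doc_tokens[i:i + n]) for i in range(len(doc_tokens) - n + 1)}

    covered = [False] * len(ans_tokens)
    for i in range(len(ans_tokens) - n + 1):
        if tuple(ans_tokens[i:i + n]) in doc_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered)  # tokens(d_p) for this prompt
```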

Aggregate

total_tokens = Σ tokens(d_p) for all p in P
TYQ = total_tokens / |P|
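
A minimal aggregation sketch, assuming tokens(d_p) has already been computed for every prompt (zero for prompts that draw nothing from d):

```python
# tokens_per_prompt holds tokens(d_p) for every prompt p in P, zeros included.

def compute_tyq(tokens_per_prompt: list[int]) -> float:
    if not tokens_per_prompt:
        raise ValueError("The prompt set P must not be empty.")
    return sum(tokens_per_prompt) / len(tokens_per_prompt)

# Example: four prompts drawing 110, 0, 240 and 96 tokens from document d.
print(compute_tyq([110, 0, 240, 96]))  # 111.5
```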

Optional quality weighting

To avoid rewarding irrelevant padding, multiply each tokens(d_p) by a relevance score r_p (e.g., semantic similarity ≥ a threshold τ):

TYQ_rel = ( Σ r_p * tokens(d_p) ) / |P|
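
The weighted variant is a small change to the same sketch; the relevance scores below are illustrative, and how r_p is derived (raw similarity, or a binary gate at τ) is left to the evaluator:

```python
# r_per_prompt holds a relevance score r_p in [0, 1] for each prompt, e.g. a
# semantic-similarity score zeroed out below a chosen threshold tau.

def compute_tyq_rel(tokens_per_prompt: list[int], r_per_prompt: list[float]) -> float:
    if len(tokens_per_prompt) != len(r_per_prompt):
        raise ValueError("Need exactly one relevance score per prompt.")
    if not tokens_per_prompt:
        raise ValueError("The prompt set P must not be empty.")
    weighted = sum(r * t for r, t in zip(r_per_prompt, tokens_per_prompt))
    return weighted / len(tokens_per_prompt)

print(compute_tyq_rel([110, 0, 240, 96], [0.9, 0.0, 0.7, 0.2]))  # 71.55
```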

Interpreting TYQ Values

Because TYQ is sensitive to both document length and prompt mix, interpret it relative to expected yield bands for your domain:

Typical length of d            Prompt type mix        Low TYQ   Moderate TYQ   High TYQ
~1,000 tokens (brief)          factual Q&A            < 40      40–120         > 120
~5,000 tokens (report)         explanatory + how‑to   < 80      80–250         > 250
~20,000 tokens (white paper)   deep‑dive queries      < 150     150–500        > 500

(Adjust bands for your own corpus size and prompt battery.)
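
For convenience the band boundaries can be encoded directly; the cut-offs below copy the illustrative values in the table and should be recalibrated for your own corpus:

```python
# (low_cutoff, high_cutoff) per document-length class, taken from the table above.
BANDS = {
    "brief": (40, 120),         # ~1,000-token documents, factual Q&A
    "report": (80, 250),        # ~5,000-token documents, explanatory + how-to
    "white_paper": (150, 500),  # ~20,000-token documents, deep-dive queries
}

def tyq_band(tyq: float, doc_class: str) -> str:
    low, high = BANDS[doc_class]
    if tyq < low:
        return "low"
    if tyq <= high:
        return "moderate"
    return "high"

print(tyq_band(124.0, "report"))  # moderate
```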

General Rule

  • High TYQ + high relevance → rich, reusable source.
  • High TYQ + low relevance → token stuffing; revise content.

Recommended Toolchain

Task                   Suggested Tools
Token attribution      OpenAI logprobs, Anthropic token_logprobs, or a diffing aligner such as TracingLLM
Prompt orchestration   LangChain or LlamaIndex evaluation modules
Quality weighting      Embedding similarity (e5-large-v2, OpenAI text‑embedding‑3‑large)
Windowed sampling      Sliding‑window or chunk‑based evaluators to avoid context‑window bias
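
For the windowed-sampling row, a simple overlapping chunker is often enough to keep long documents from being truncated by the context window; the window and stride sizes below are placeholders, not recommendations:

```python
# Overlapping sliding windows so every part of a long document gets evaluated.

def sliding_windows(tokens: list[str], window: int = 2048, stride: int = 1024) -> list[list[str]]:
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already reaches the end of the document
        start += stride
    return windows

# A 5,000-token document becomes four overlapping 2,048-token slices.
print(len(sliding_windows(["tok"] * 5000)))  # 4
```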

Best‑Practice Guidelines

  • Publish prompt context length. Very long prompts inflate TYQ mechanically.
  • Pair with relevance. Report TYQ_rel (quality‑weighted) alongside raw TYQ.
  • Watch for padding. Repeated boilerplate raises TYQ without adding value.
  • Update after model upgrades. Newer models may quote less and summarise more, reducing TYQ even if coverage is unchanged.
  • Combine with RSA and TIS. Together they describe breadth (RSA), trust (TIS), and depth (TYQ).

Common Pitfalls

  • Counting prompt tokens instead of answer tokens. TYQ should reflect output usage, not input size.
  • Ignoring paraphrase drift. If attribution relies solely on exact n‑gram matches, paraphrased content is missed. Use embedding similarity thresholds (see the sketch after this list).
  • Single‑query bias. TYQ computed on one flagship prompt is almost meaningless; always average across a representative set.
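
One way to catch paraphrase drift, sketched under the assumption that sentence-transformers and the all-MiniLM-L6-v2 checkpoint are acceptable stand-ins for the embedding models listed earlier:

```python
# Flags an answer sentence as drawn from the document when its embedding is
# close to any document sentence, even without an exact n-gram match.
# Model name and threshold are assumptions, not prescriptions.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def paraphrase_attributed(doc_sentences: list[str], answer_sentences: list[str],
                          threshold: float = 0.75) -> list[bool]:
    doc_emb = model.encode(doc_sentences, convert_to_tensor=True)
    ans_emb = model.encode(answer_sentences, convert_to_tensor=True)
    sims = util.cos_sim(ans_emb, doc_emb)  # one row per answer sentence
    return [bool(row.max() >= threshold) for row in sims]

# Token counting can then be restricted to answer sentences flagged True.
```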

Worked Example

Setup:
  |P| = 500 prompts
  total_tokens = 62,000 tokens attributed to document d
  r_p not applied (raw TYQ only)

TYQ = 62,000 / 500 = 124.0
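
The same arithmetic in code:

```python
# Worked-example check: 62,000 attributed tokens across 500 prompts.
total_tokens = 62_000
num_prompts = 500
print(total_tokens / num_prompts)  # 124.0
```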

Interpretation – The model extracts an average of 124 tokens per query from the document, a healthy depth for a mid‑length technical report. Next step: compute TYQ_rel to confirm those tokens are on‑topic.