The RQ Score
A scientific standard for AI evaluation: transparent, reproducible, and designed to never become obsolete. Every model on Rankly AI receives a Rankly Quotient derived from four orthogonal dimensions that no provider can simultaneously game.
IG, F, U, and f(A) measure independent aspects of a model. A breakthrough in intelligence cannot compensate for systematic hallucinations. The geometry enforces this.
Each additional RQ point requires exponentially greater capability. The scale has no upper bound: a superintelligent system would score above 2000, not merely reach 1000.
Models are penalized when their SWE-bench Verified score far exceeds their SWE-bench Pro score. Benchmark inflation is a measurable and punishable offence in RQ v2.
How the formula works
Three core dimensions combine via geometric mean, then pass through a logarithmic scale, then get adjusted by the Accessibilitas modulator. Each step has a scientific justification.
Architecture RQ v2.0: RQ_effective = 1000 × log₁₀(1 + 9 × (IG×F×U)^(1/3)) × f(A)
The 4 dimensions
Three structural dimensions (IG, F, U) contribute equally. One contextual modulator (A) adjusts the result by −8% to +2%.
IG: Raw cognitive capacity measured across multiple independent benchmarks. We use the geometric mean of GPQA Diamond, AIME, FrontierMath, and HLE, never a single benchmark that can be gamed.
Quarterly · GPQA Diamond · AIME · FrontierMath · HLE
F: Trustworthiness of outputs across three independent sub-dimensions: factual accuracy (SimpleQA + FActScore), output stability under variation, and epistemic honesty (does the model know what it doesn't know?). This is the dimension no existing leaderboard measures seriously.
Weekly · SimpleQA · FActScore · Stability protocol
U: Real-world surplus utility: not what benchmarks predict a model should achieve, but what it actually delivers. Measured via SWE-bench Pro (long-horizon tasks), GDPval (economic value with expert panels), and an anti-gaming penalty for models that score high on contaminated benchmarks but fail on fresh tasks.
Quarterly · SWE-bench Pro · GDPval · Anti-gaming gap
A (Accessibilitas): Price, latency, and availability. Bounded between 0.92 and 1.02, accessibility adjusts the score but cannot create or destroy value that isn't there. A free, fast model gets a slight boost; an expensive, slow one gets a slight penalty.
8× daily · Artificial Analysis · Status pages
Why a brilliant but unreliable model scores low
Geometric aggregation means that any dimension near zero collapses the composite score. The table below compares what arithmetic vs. geometric averaging would give for the same models.
| Model type | IG | F | U | Arithmetic avg | Geometric (RQ) |
|---|---|---|---|---|---|
| Frontier model | 0.93 | 0.84 | 0.87 | 950.36 RQ | 950.02 RQ |
| Unreliable model | 0.93 | 0.05 | 0.87 | 816.24 RQ | 611.68 RQ |
| Narrow specialist | 0.50 | 0.50 | 0.50 | 740.36 RQ | 740.36 RQ |
A model that hallucinates systematically (F = 0.05) collapses from ~950 RQ to ~612 RQ under geometric aggregation. Under a naïve arithmetic mean it would still score ~816 RQ, masking a critical failure.
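The table's figures can be checked with a few lines of Python. This is a minimal sketch rather than Rankly's own code: the function names are illustrative, and it assumes k_epoch = 1 and leaves out the f(A) modulator, as the table does.

```python
import math

def rq_from_composite(g: float) -> float:
    """Map a composite in [0, 1] onto the RQ scale: 1000 * log10(1 + 9g)."""
    return 1000 * math.log10(1 + 9 * g)

def arithmetic_composite(ig: float, f: float, u: float) -> float:
    """Naive alternative: plain average of the three dimensions."""
    return (ig + f + u) / 3

def geometric_composite(ig: float, f: float, u: float) -> float:
    """Aggregation used by RQ v2.0: cube root of the product."""
    return (ig * f * u) ** (1 / 3)

# (IG, F, U) triples copied from the table above.
models = {
    "Frontier model": (0.93, 0.84, 0.87),
    "Unreliable model": (0.93, 0.05, 0.87),
    "Narrow specialist": (0.50, 0.50, 0.50),
}

for name, dims in models.items():
    arith = rq_from_composite(arithmetic_composite(*dims))
    geo = rq_from_composite(geometric_composite(*dims))
    print(f"{name:18s} arithmetic={arith:7.2f}  geometric={geo:7.2f}")
```

When all three dimensions are equal the two aggregations coincide, which is why the third row shows the same value in both columns.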
Score breakdown, current top models
RQ_intrinsic (RQi) is the raw formula result. RQ_effective (RQe) applies the f(A) modulator. The score displayed on the leaderboard is always RQe.
The Accessibilitas modulator
f(A) adjusts based on price, latency, and availability. Bounded at [0.92, 1.02], it can reorder models of similar capability but cannot lift a weak model past a much stronger one. A premium, fast frontier model gets a small penalty; a free, instant model gets a small boost.
f(A) range [0.92, 1.02]; impact on RQ: −8% to +2%. Vertical line = neutral (1.00).
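The bounds follow directly from the f(A) definition in the formula section, f(A) = 0.92 + 0.10 × (A1 × A2 × A3)^(1/3). The sketch below takes the three sub-scores as already normalized to [0, 1]. The function name and the input check are our own additions for illustration.

```python
def f_a(a1: float, a2: float, a3: float) -> float:
    """Accessibilitas modulator: 0.92 + 0.10 * geometric mean of A1, A2, A3."""
    for a in (a1, a2, a3):
        if not 0.0 <= a <= 1.0:
            raise ValueError("sub-scores must lie in [0, 1]")
    return 0.92 + 0.10 * (a1 * a2 * a3) ** (1 / 3)

print(f_a(0.0, 1.0, 1.0))  # floor: any sub-score at zero pins f(A) to 0.92
print(f_a(1.0, 1.0, 1.0))  # ceiling: all sub-scores at one gives about 1.02
print(f_a(0.8, 0.8, 0.8))  # neutral point: a geometric mean of 0.8 gives about 1.00
```

Because the three sub-scores are themselves combined geometrically, a single zero (for example, a price at the upper bound) is enough to reach the floor regardless of the other two.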
Scores go down, too
The RQ is not a marketing index. A model that released a breakthrough version eighteen months ago and has shipped nothing since will see its Momentum sub-score decline. Outages lower A3 (availability). Price increases lower A1. Discovered hallucination patterns lower F1 (factual accuracy).
This is intentional. The most informative moments on a model's history chart are often the drops: when a competitor surpassed it, when something went wrong, or when a benchmark was discovered to be contaminated and scores were revised.
Update schedule
Frequently asked questions
Can AI companies pay to improve their RQ score?
No. Paid placements are clearly labeled as 'Sponsored' and do not affect the algorithmic score. The RQ computation has no commercial inputs.
Why don't you publish the exact dimension weights?
The weights are equal (1/3 each for IG, F, U), and they are published in the formula below. What we don't publish is the sub-metric weighting inside each dimension, because published sub-weights become optimization targets. These are audited internally and recalibrated with each version.
Why does a model with more votes sometimes have a lower RQ?
Community votes are not part of the RQ v2.0 formula. They feed the communityRating (0–5 stars) separately. The RQ is based entirely on benchmarks, reliability measurements, and real-world task performance.
How do you handle brand-new models?
New models enter with a data_completeness flag. Their RQ is marked provisional for the first 4–6 weeks while benchmark and reliability data accumulates. The chart shows confidence intervals during this period.
Why did you change from 7 blocs to 4 dimensions?
The 7-bloc system had overlapping dimensions and arbitrary weights. B1 (Intelligence) and B3 (Quality) measured related things. B5 (Accessibility) had equal weight to B1 despite lower scientific grounding. The 4-dimension system is built on scientific orthogonality: each dimension measures something the others cannot. The formula is derived from first principles, not from editorial judgment.
Is the formula open source?
Yes. The complete formula, normalization functions, and data sources are documented on this page and in the whitepaper. You can reproduce any RQ score independently with public benchmark data.
The Formula
Every decision in the RQ v2.0 formula is justified below. You should be able to reproduce any score independently from public data.
// RQ Score v2.0, complete formula
RQ_intrinsic = 1000 × log₁₀(1 + 9 × (IG × F × U)^(1/3)) × k_epoch
f(A) = 0.92 + 0.10 × (A1 × A2 × A3)^(1/3)
A1 = price == 0 ? 1.0 : max(0, 1 − log₁₀(1+price) / log₁₀(101))
A2 = max(0, 1 − log₁₀(1+TTFT_s) / log₁₀(31))
A3 = min(1, uptime × (1.1 if open_weights else 1.0))
RQ_effective = RQ_intrinsic × f(A)
rq_score = RQ_effective // displayed on leaderboard
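As a rough illustration, the formula above can be transcribed into Python as follows. This is a sketch under stated assumptions, not the production implementation: the function and parameter names are ours, `price_per_mtok` is taken to be in dollars per million tokens, and IG, F, and U are assumed to arrive already normalized to [0, 1].

```python
import math

def a1_cost(price_per_mtok: float) -> float:
    """A1: 1.0 when free, falling logarithmically to 0 at 100 per Mtok.
    The price == 0 special case is implicit, since log10(1) = 0."""
    return max(0.0, 1 - math.log10(1 + price_per_mtok) / math.log10(101))

def a2_latency(ttft_s: float) -> float:
    """A2: 1.0 at zero time-to-first-token, falling to 0 at 30 seconds."""
    return max(0.0, 1 - math.log10(1 + ttft_s) / math.log10(31))

def a3_availability(uptime: float, open_weights: bool) -> float:
    """A3: uptime fraction, with a 10% bonus for open weights, capped at 1."""
    return min(1.0, uptime * (1.1 if open_weights else 1.0))

def rq_scores(ig, f, u, price_per_mtok, ttft_s, uptime,
              open_weights=False, k_epoch=1.0):
    """Return (RQ_intrinsic, RQ_effective) for one model."""
    rq_intrinsic = 1000 * math.log10(1 + 9 * (ig * f * u) ** (1 / 3)) * k_epoch
    access = (a1_cost(price_per_mtok)
              * a2_latency(ttft_s)
              * a3_availability(uptime, open_weights)) ** (1 / 3)
    f_a = 0.92 + 0.10 * access
    return rq_intrinsic, rq_intrinsic * f_a

# Hypothetical inputs, chosen only to exercise the function.
rqi, rqe = rq_scores(ig=0.93, f=0.84, u=0.87,
                     price_per_mtok=10.0, ttft_s=0.5, uptime=0.999)
print(f"RQi={rqi:.2f}  RQe={rqe:.2f}")
```

Returning both values mirrors the leaderboard's distinction between RQi (raw formula result) and RQe (the displayed score).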
Why geometric mean, not arithmetic?
A model that hallucinates systematically (F ≈ 0) cannot compensate with high benchmark scores. Under geometric aggregation, any dimension near zero collapses the composite. This is intentional: a brilliant but unreliable model is not a high-RQ model. The Human Development Index has used the same principle since 2010.
Why logarithmic scale?
Each additional RQ point requires exponentially greater capability gain. The scale has no upper bound: a superintelligent system would score above 2000, not merely reach 1000. Current frontier models range from approximately 850 to 965. Because log₁₀(1 + 9x) is concave, a fixed gain in the composite moves the score more at the low end than at the frontier, so points become progressively harder to earn as models improve.
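The "exponentially greater" claim can be read off by inverting the scale. Writing G = (IG × F × U)^(1/3) for the composite, and holding k_epoch = 1 and f(A) = 1 (simplifying assumptions made only for this derivation):

```latex
\begin{aligned}
\mathrm{RQ} &= 1000 \,\log_{10}\!\left(1 + 9G\right) \\
\Longrightarrow\quad G &= \frac{10^{\mathrm{RQ}/1000} - 1}{9} \\
\frac{dG}{d\,\mathrm{RQ}} &= \frac{\ln 10}{9000}\; 10^{\mathrm{RQ}/1000}
\end{aligned}
```

The composite gain needed for one more RQ point is therefore proportional to 10^(RQ/1000), which grows exponentially with the score already reached.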
Why f(A) bounded at [0.92, 1.02]?
Accessibility matters but should not dominate. A free, fast, mediocre model should not outscore a premium frontier model. The bounds ensure accessibility lowers a score by at most 8% and raises it by at most 2%. This is intentional: accessibility is a contextual modulator, not a structural dimension of capability.
Why equal weights (1/3 each)?
In the absence of empirical evidence that one dimension predicts user satisfaction better than others, assigning unequal weights introduces unjustifiable bias. This is the parsimony axiom. Weights will be derived empirically in RQ v2.1 from regression on satisfaction data collected via rankly-ai.com.
Normalization
All sub-metrics are normalized to [0,1] using fixed theoretical bounds, not observed data bounds. This ensures scores remain stable when new extreme models appear: no score retroactively changes because a new frontier model was added.
// A1 (cost), logarithmic, reflects human perception of price
A1 = price == 0 ? 1.0 : max(0, 1 - log10(1+price) / log10(101)) // zero at $100/Mtok
// A2 (latency), zero at 30s TTFT
A2 = max(0, 1 - log10(1+TTFT_s) / log10(31))
// F2 (stability), coefficient of variation over 100 questions
F2 = 1 - (CV / 0.5) // CV = std/mean across repeated runs
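To make the F2 definition concrete, here is a minimal sketch that turns a list of per-run scores into a stability value. Several details are assumptions on our part, because the page does not specify them: each entry in `run_scores` is taken to be one run's aggregate score on the question set, the population standard deviation is used, and the result is clamped to [0, 1] in line with the statement that all sub-metrics are normalized to that range.

```python
import statistics

CV_CEILING = 0.5  # fixed theoretical bound taken from the F2 formula

def f2_stability(run_scores: list[float]) -> float:
    """F2 = 1 - CV / 0.5, where CV = std / mean across repeated runs.
    Clamping to [0, 1] is an assumption, not part of the stated formula."""
    mean = statistics.fmean(run_scores)
    if mean == 0:
        return 0.0  # CV is undefined; treat as maximally unstable (assumption)
    cv = statistics.pstdev(run_scores) / mean
    return min(1.0, max(0.0, 1 - cv / CV_CEILING))

print(f2_stability([0.80, 0.80, 0.80]))  # identical runs: no variation
print(f2_stability([0.90, 0.70]))        # mean 0.8, std 0.1, CV 0.125
```

Because the ceiling is fixed at 0.5 rather than derived from the observed models, adding a new, unusually erratic model does not change anyone else's F2.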
Temporal calibration, k_epoch
RQ scores are comparable across time via a dual calibration mechanism:
- k_prog corrects for AI capability inflation using a fixed secret basket of pre-2020 tasks never used in model training.
- k_bench corrects for score discontinuities when a saturated benchmark is replaced by a harder one, computed as the ratio of mean scores on a 20-model reference panel before and after the transition.
k_epoch_2026 = 1.000 (reference year). Temporal calibration activates after 6 months of data.
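The k_bench correction described above can be sketched as a simple ratio. The direction of the ratio (before divided by after) is our assumption, chosen so that scores on a harder replacement benchmark are scaled back up to the old level. The function name is hypothetical, and how k_bench and k_prog combine into k_epoch is not specified here, so the sketch stops at k_bench.

```python
from statistics import fmean

def k_bench(panel_before: list[float], panel_after: list[float]) -> float:
    """Ratio of the reference panel's mean score on the retired benchmark
    to its mean score on the replacement (direction is an assumption)."""
    if len(panel_before) != len(panel_after):
        raise ValueError("both lists must cover the same reference models")
    return fmean(panel_before) / fmean(panel_after)

# Hypothetical two-model panel for brevity; the text specifies 20 models.
print(k_bench([0.80, 0.60], [0.40, 0.30]))  # panel mean halves, so the factor is about 2
```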
Data sources
| Dimension | Source | Frequency |
|---|---|---|
| IG (text) | GPQA Diamond, AIME, FrontierMath, HLE | Quarterly |
| F1 (accuracy) | SimpleQA, FActScore, internal curated QA | Weekly |
| F2 (stability) | Internal protocol, 100q × 3 days | Monthly |
| U1 (long tasks) | SWE-bench Pro | Quarterly |
| U2 (real utility) | GDPval-AA, expert panels | Semi-annual |
| A1 (cost) | Artificial Analysis | 8× daily |
| A2 (latency) | Artificial Analysis live TTFT | 8× daily |
| A3 (availability) | Status pages + independent measurement | Continuous |
Known limitations
We document our limitations explicitly, as a mark of scientific integrity.
U sub-metrics are proxies, not ground truth.
SWE-bench Pro and GDPval cover coding and professional tasks well; creative and emotional tasks are underrepresented. U weights will be revised in v2.1.
Cross-modal comparability is assumed, not proven.
A text model's RQ and an image model's RQ are calculated with modality-specific instruments but normalized to the same scale. This assumption will be tested empirically in 2026.
k_epoch is currently 1.0.
Temporal calibration activates when 6+ months of data accumulate. Scores before that point are not cross-temporally normalized.
Whitepaper
The full methodology, axioms, governance process, and revision protocol are documented in the RQ Score Whitepaper v1.0 (June 2026).