RanklyAI
4 orthogonal dimensions · Open logarithmic scale · Anti-gaming by design · Updated hourly on accessibility

The RQ Score

A scientific standard for AI evaluation: transparent, reproducible, and designed to never become obsolete. Every model on Rankly AI receives a Rankly Quotient (RQ) derived from four orthogonal dimensions that no provider can simultaneously game.

4 orthogonal dimensions

IG, F, U, and f(A) measure independent aspects of a model. A breakthrough in intelligence cannot compensate for systematic hallucinations. The geometry enforces this.

Logarithmic, open scale

Each additional RQ point requires exponentially greater capability. The scale has no upper bound: a superintelligent system would score above 2000, not merely reach 1000.

Anti-gaming by design

Models are penalized when their SWE-bench Verified score far exceeds their SWE-bench Pro score. Benchmark inflation is a measurable and punishable offence in RQ v2.

How the formula works

Three core dimensions combine via geometric mean, then pass through a logarithmic scale, then get adjusted by the Accessibilitas modulator. Each step has a scientific justification.

IG, F, U → Geometric mean: (IG×F×U)^(1/3) → Log scale: 1000×log₁₀(1+9×core) → RQ_intrinsic → × f(A) [0.92–1.02] → RQ_effective (≈ 915 RQ)

Architecture RQ v2.0: RQ_effective = 1000 × log₁₀(1 + 9 × (IG×F×U)^(1/3)) × f(A)
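A minimal Python sketch of the three stages, assuming all inputs are already normalized. The function name `rq_effective` and the input values are illustrative, not part of the RQ specification, and the temporal factor k_epoch is left out because it does not appear in the architecture line.

```python
import math

def rq_effective(ig: float, f: float, u: float, f_a: float) -> dict:
    """Walk through the RQ v2.0 pipeline one stage at a time."""
    core = (ig * f * u) ** (1 / 3)                  # geometric mean of IG, F, U
    rq_intrinsic = 1000 * math.log10(1 + 9 * core)  # logarithmic scale
    return {
        "core": core,
        "rq_intrinsic": rq_intrinsic,
        "rq_effective": rq_intrinsic * f_a,         # Accessibilitas modulator
    }

# Hypothetical inputs chosen for illustration, not a real evaluation.
stages = rq_effective(ig=0.90, f=0.80, u=0.85, f_a=1.00)
for name, value in stages.items():
    print(f"{name}: {value:.3f}")
```

Returning every intermediate value makes it easy to see which stage moves the final score when one input changes.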

The 4 dimensions

Three structural dimensions (IG, F, U) contribute equally. One contextual modulator (A) adjusts the result by between −8% and +2%.

IG · Intelligence Generalis · [0, 1]

Raw cognitive capacity measured across multiple independent benchmarks. We use the geometric mean of GPQA Diamond, AIME, FrontierMath, and HLE rather than any single benchmark that could be gamed.

Quarterly · GPQA Diamond · AIME · FrontierMath · HLE
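As a rough illustration of why a geometric mean resists single-benchmark gaming, here is a sketch with made-up normalized scores. The equal weighting is an assumption: the sub-metric weights inside each dimension are not published.

```python
from statistics import geometric_mean

# Hypothetical normalized benchmark scores in [0, 1]; not real results.
benchmarks = {
    "GPQA Diamond": 0.85,
    "AIME": 0.90,
    "FrontierMath": 0.30,
    "HLE": 0.40,
}

# Assumption: unweighted geometric mean (the real sub-weights are unpublished).
ig = geometric_mean(list(benchmarks.values()))
arithmetic = sum(benchmarks.values()) / len(benchmarks)

print(f"geometric IG = {ig:.3f}, arithmetic mean = {arithmetic:.3f}")
```

With these inputs the two weak benchmarks pull the geometric value well below the arithmetic one, so a high score on one benchmark cannot hide a low score on another.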

F · Fiabilitas · [0, 1]

Trustworthiness of outputs across three independent sub-dimensions: factual accuracy (SimpleQA + FActScore), output stability under variation, and epistemic honesty (does the model know what it doesn't know?). This is the dimension no existing leaderboard measures seriously.

Weekly · SimpleQA · FActScore · Stability protocol

U · Utilitas Realis · [0.10, 0.90]

Real-world surplus utility: not what benchmarks predict a model should achieve, but what it actually delivers. Measured via SWE-bench Pro (long-horizon tasks), GDPval (economic value with expert panels), and an anti-gaming penalty for models that score high on contaminated benchmarks but fail on fresh tasks.

Quarterly · SWE-bench Pro · GDPval · Anti-gaming gap

A · Accessibilitas · modulator, not dimension · [0.92, 1.02]

Price, latency, and availability. Bounded between 0.92 and 1.02, accessibility adjusts the score but cannot create or destroy value that isn't there. A free, fast model gets a slight boost; an expensive, slow one a slight penalty.

8× daily · Artificial Analysis · Status pages

Why a brilliant but unreliable model scores low

Geometric aggregation means that any dimension near zero collapses the composite score. The table below compares what arithmetic vs. geometric averaging would give for the same models.

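| Model type | IG | F | U | Arithmetic avg | Geometric (RQ) |
|---|---|---|---|---|---|
| Frontier model | 0.93 | 0.84 | 0.87 | 950.36 RQ | 950.02 RQ |
| Unreliable model | 0.93 | 0.05 | 0.87 | 816.24 RQ | 611.68 RQ |
| Narrow specialist | 0.50 | 0.50 | 0.50 | 740.36 RQ | 740.36 RQ |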

A model that hallucinates systematically (F = 0.05) collapses from ~950 RQ to ~612 RQ under geometric aggregation. Under a naïve arithmetic mean it would still score ~816 RQ, masking a critical failure.
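The comparison can be recomputed from the three rows of the table. This sketch applies the same log scale to an arithmetic and a geometric aggregate; the helper name `rq` is illustrative.

```python
import math

def rq(core: float) -> float:
    """Map an aggregate in [0, 1] onto the RQ log scale."""
    return 1000 * math.log10(1 + 9 * core)

# (IG, F, U) triples taken from the comparison table.
models = {
    "Frontier model": (0.93, 0.84, 0.87),
    "Unreliable model": (0.93, 0.05, 0.87),
    "Narrow specialist": (0.50, 0.50, 0.50),
}

results = {}
for name, (ig, f, u) in models.items():
    arithmetic = rq((ig + f + u) / 3)        # naive average
    geometric = rq((ig * f * u) ** (1 / 3))  # aggregation used by RQ
    results[name] = (arithmetic, geometric)
    print(f"{name:18s} arithmetic {arithmetic:7.2f}  geometric {geometric:7.2f}")
```

Only the unreliable model shows a large gap between the two columns, because only its dimensions are unbalanced.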

Score breakdown, current top models

RQ_intrinsic (RQi) is the raw formula result. RQ_effective (RQe) applies the f(A) modulator. The score displayed on the leaderboard is always RQe.

| Model | RQi | RQe (displayed) | IG | F | U | f(A) |
|---|---|---|---|---|---|---|
| Claude Code | 948 | 961 RQ | 0.93 | 0.82 | 0.88 | 1.0140 (+1.4%) |
| Midjourney V8 | 939 | 925 RQ | 0.91 | 0.86 | 0.80 | 0.9840 (-1.6%) |
| Claude Mythos | 951 | 921 RQ | 0.95 | 0.84 | 0.86 | 0.9680 (-3.2%) |
| Cursor | 932 | 920 RQ | 0.89 | 0.82 | 0.81 | 0.9880 (-1.2%) |
| Gemini 3.1 Pro | 922 | 912 RQ | 0.94 | 0.80 | 0.72 | 0.9890 (-1.1%) |

The Accessibilitas modulator

f(A) adjusts based on price, latency, and availability. Bounded at [0.92, 1.02], it can reorder closely matched models but cannot lift a weak model above a far stronger one. A premium, fast frontier model gets a small penalty; a free, instant model gets a small boost.

  • Free + instant: 1.018
  • Cheap + fast: 1.005
  • Standard frontier: 0.990
  • Expensive + slow: 0.942

f(A) range [0.92, 1.02]; impact on RQ: from −8% to +2%. Neutral = 1.00.

Scores go down, too

The RQ is not a marketing index. A model that released a breakthrough version eighteen months ago and has shipped nothing since will see its Momentum sub-score decline. Outages lower A3 (availability). Price increases lower A1. Discovered hallucination patterns lower F1 (factual accuracy).

This is intentional. The most informative moments on a model's history chart are often the drops: when a competitor surpassed it, when something went wrong, or when a benchmark was discovered to be contaminated and scores were revised.

Update schedule

Hourly: f(A) recomputed; price, TTFT, and uptime updated from Artificial Analysis and status pages.
Weekly: F1 (factual accuracy) re-evaluated on fresh curated questions.
Quarterly: Full IG and U recomputation using new benchmark runs. Version bump issued if any sub-metric is replaced.
On event: Major releases, outages, and pricing changes trigger an early recomputation and a chart annotation.

Frequently asked questions

Can AI companies pay to improve their RQ score?

No. Paid placements are clearly labeled as 'Sponsored' and do not affect the algorithmic score. The RQ computation has no commercial inputs.

Why don't you publish the exact dimension weights?

The weights are equal (1/3 each for IG, F, U), and they are published in the formula below. What we don't publish is the sub-metric weighting inside each dimension, because published sub-weights become optimization targets. These are audited internally and recalibrated with each version.

Why does a model with more votes sometimes have a lower RQ?

Community votes are not part of the RQ v2.0 formula. They feed the communityRating (0–5 stars) separately. The RQ is based entirely on benchmarks, reliability measurements, and real-world task performance.

How do you handle brand-new models?

New models enter with a data_completeness flag. Their RQ is marked provisional for the first 4–6 weeks while benchmark and reliability data accumulates. The chart shows confidence intervals during this period.

Why did you change from 7 blocs to 4 dimensions?

The 7-bloc system had overlapping dimensions and arbitrary weights. B1 (Intelligence) and B3 (Quality) measured related things. B5 (Accessibility) had equal weight to B1 despite lower scientific grounding. The 4-dimension system is built on scientific orthogonality: each dimension measures something the others cannot. The formula is derived from first principles, not from editorial judgment.

Is the formula open source?

Yes. The complete formula, normalization functions, and data sources are documented on this page and in the whitepaper. You can reproduce any RQ score independently with public benchmark data.

For researchers and sceptics

The Formula

Every decision in the RQ v2.0 formula is justified below. You should be able to reproduce any score independently from public data.

// RQ Score v2.0, complete formula

RQ_intrinsic = 1000 × log₁₀(1 + 9 × (IG × F × U)^(1/3)) × k_epoch

f(A) = 0.92 + 0.10 × (A1 × A2 × A3)^(1/3)

A1 = price=0 ? 1.0 : max(0, 1 − log₁₀(1+price) / log₁₀(101))

A2 = max(0, 1 − log₁₀(1+TTFT_s) / log₁₀(31))

A3 = min(1, uptime × (1.1 if open_weights else 1.0))

RQ_effective = RQ_intrinsic × f(A)

rq_score = RQ_effective // displayed on leaderboard
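The accessibility terms translate almost line for line into Python. This is a sketch under stated assumptions: the function names are illustrative, price is taken to be in US dollars per million tokens, TTFT in seconds, and uptime a fraction in [0, 1].

```python
import math

def a1_cost(price_usd_per_mtok: float) -> float:
    """Cost term: 1.0 when free, falling to 0 at $100 per million tokens."""
    if price_usd_per_mtok == 0:
        return 1.0
    return max(0.0, 1 - math.log10(1 + price_usd_per_mtok) / math.log10(101))

def a2_latency(ttft_s: float) -> float:
    """Latency term: 1.0 at zero time-to-first-token, 0 at 30 seconds."""
    return max(0.0, 1 - math.log10(1 + ttft_s) / math.log10(31))

def a3_availability(uptime: float, open_weights: bool = False) -> float:
    """Availability term: uptime fraction, with a 10% bonus for open weights, capped at 1."""
    return min(1.0, uptime * (1.1 if open_weights else 1.0))

def f_a(price: float, ttft_s: float, uptime: float, open_weights: bool = False) -> float:
    """Accessibilitas modulator, bounded to [0.92, 1.02]."""
    product = a1_cost(price) * a2_latency(ttft_s) * a3_availability(uptime, open_weights)
    return 0.92 + 0.10 * product ** (1 / 3)

# Hypothetical model: $3 per million tokens, 0.8 s TTFT, 99.9% uptime.
example = f_a(price=3.0, ttft_s=0.8, uptime=0.999)
print(f"f(A) = {example:.4f}")
```

Because the three terms are multiplied, a single zero term (for example a price of $100 per million tokens) pins f(A) at the 0.92 floor regardless of the other two.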

Why geometric mean, not arithmetic?

A model that hallucinates systematically (F ≈ 0) cannot compensate with high benchmark scores. Under geometric aggregation, any dimension near zero collapses the composite. This is intentional: a brilliant but unreliable model is not a high-RQ model. The Human Development Index has used the same principle since 2010.

Why logarithmic scale?

Each additional RQ point requires exponentially greater capability gain. The scale has no upper bound: a superintelligent system would score above 2000, not merely reach 1000. Current frontier models range from approximately 850 to 965. The log also compresses differences at the top of the scale, so gains at the frontier must be large to register, while differences between weaker models are spread out.
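Inverting the scale shows how much composite capability a given score requires. The sketch below assumes k_epoch = 1 and ignores f(A); `core_required` is an illustrative name.

```python
def core_required(rq: float) -> float:
    """Invert RQ = 1000 * log10(1 + 9 * core) to recover the composite core value."""
    return (10 ** (rq / 1000) - 1) / 9

for target in (850, 900, 950):
    print(f"RQ {target}: core = {core_required(target):.3f}")

# Size of each 50-point step, measured in core units.
step_low = core_required(900) - core_required(850)
step_high = core_required(950) - core_required(900)
print(f"850 -> 900 needs +{step_low:.3f}, 900 -> 950 needs +{step_high:.3f}")
```

Each 50-point step needs a larger increase in the core value than the step before it, which is the sense in which points become more expensive higher up the scale.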

Why f(A) bounded at [0.92, 1.02]?

Accessibility matters but should not dominate. A free, fast, mediocre model should not outscore a premium frontier model. The bounds ensure accessibility lowers the score by at most 8% and raises it by at most 2%. This is intentional: accessibility is a contextual modulator, not a structural dimension of capability.
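A quick check of what the bounds imply, using hypothetical intrinsic scores of 800 and 950 (not taken from the leaderboard):

```python
F_A_MIN, F_A_MAX = 0.92, 1.02

# Extreme case: the weaker model gets the best possible modulator,
# the stronger model gets the worst possible one.
mediocre_free = 800 * F_A_MAX
frontier_premium = 950 * F_A_MIN

# Ratio of intrinsic scores at which the two effective scores would tie.
break_even = F_A_MIN / F_A_MAX

print(f"mediocre + free:    {mediocre_free:.0f} RQ")
print(f"frontier + premium: {frontier_premium:.0f} RQ")
print(f"break-even ratio:   {break_even:.3f}")
```

Even in this extreme case the weaker model stays behind. It could only overtake if its intrinsic score were at least about 90% of the stronger model's.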

Why equal weights (1/3 each)?

In the absence of empirical evidence that one dimension predicts user satisfaction better than others, assigning unequal weights introduces unjustifiable bias. This is the parsimony axiom. Weights will be derived empirically in RQ v2.1 from regression on satisfaction data collected via rankly-ai.com.

Normalization

All sub-metrics are normalized to [0,1] using fixed theoretical bounds, not observed data bounds. This ensures scores remain stable when new extreme models appear: no score retroactively changes because a new frontier model was added.

// A1 (cost), logarithmic, reflects human perception of price

A1 = price=0 ? 1.0 : max(0, 1 - log10(1+price) / log10(101)) // zero at $100/Mtok

// A2 (latency), zero at 30s TTFT

A2 = max(0, 1 - log10(1+TTFT_s) / log10(31))

// F2 (stability), coefficient of variation over 100 questions

F2 = 1 - (CV / 0.5) // CV = std/mean across repeated runs
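The stability term can be sketched as follows. Two assumptions are made here that the page does not specify: the values being compared are per-run accuracy scores, and the standard deviation is the population form (`pstdev`) rather than the sample form.

```python
from statistics import mean, pstdev
from typing import Sequence

def f2_stability(run_scores: Sequence[float], cv_ceiling: float = 0.5) -> float:
    """F2 = 1 - CV / 0.5, where CV = std / mean across repeated runs."""
    cv = pstdev(run_scores) / mean(run_scores)  # assumption: population std
    return 1 - cv / cv_ceiling

# Hypothetical accuracy on the same 100 questions over three days.
runs = [0.80, 0.78, 0.82]
f2 = f2_stability(runs)
print(f"F2 = {f2:.3f}")
```

Identical runs give F2 = 1. As written, the formula has no lower clamp, so a coefficient of variation above 0.5 would produce a negative value.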

Temporal calibration, k_epoch

RQ scores are comparable across time via a dual calibration mechanism:

  • k_prog corrects for AI capability inflation using a fixed secret basket of pre-2020 tasks never used in model training.
  • k_bench corrects for score discontinuities when a saturated benchmark is replaced by a harder one, computed as the ratio of mean scores on a 20-model reference panel before and after the transition.

k_epoch_2026 = 1.000 (reference year); temporal calibration activates after 6 months of data.
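One plausible reading of k_bench is sketched below. The direction of the ratio (old mean divided by new mean) is an assumption, as are the function name and the scores. A four-model panel is used for brevity where the description calls for twenty.

```python
from statistics import mean
from typing import Sequence

def k_bench(panel_old: Sequence[float], panel_new: Sequence[float]) -> float:
    """Ratio of reference-panel mean scores before and after a benchmark swap.

    Assumption: old / new, so that multiplying scores on the harder
    benchmark by k_bench restores the previous average level.
    """
    return mean(panel_old) / mean(panel_new)

# Hypothetical reference panel scored on the saturated and the replacement benchmark.
old_scores = [0.92, 0.88, 0.85, 0.75]
new_scores = [0.50, 0.45, 0.40, 0.35]

k = k_bench(old_scores, new_scores)
print(f"k_bench = {k:.3f}")
```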

Data sources

| Dimension | Source | Frequency |
|---|---|---|
| IG (text) | GPQA Diamond, AIME, FrontierMath, HLE | Quarterly |
| F1 (accuracy) | SimpleQA, FActScore, internal curated QA | Weekly |
| F2 (stability) | Internal protocol, 100q × 3 days | Monthly |
| U1 (long tasks) | SWE-bench Pro | Quarterly |
| U2 (real utility) | GDPval-AA, expert panels | Semi-annual |
| A1 (cost) | Artificial Analysis | 8× daily |
| A2 (latency) | Artificial Analysis live TTFT | 8× daily |
| A3 (availability) | Status pages + independent measurement | Continuous |

Known limitations

We document our limitations explicitly, a mark of scientific integrity.

1. U sub-metrics are proxies, not ground truth. SWE-bench Pro and GDPval cover coding and professional tasks well; creative and emotional tasks are underrepresented. U weights will be revised in v2.1.

2. Cross-modal comparability is assumed, not proven. A text model's RQ and an image model's RQ are calculated with modality-specific instruments but normalized to the same scale. This assumption will be tested empirically in 2026.

3. k_epoch is currently 1.0. Temporal calibration activates when 6+ months of data accumulate. Scores before that point are not cross-temporally normalized.

Whitepaper

The full methodology, axioms, governance process, and revision protocol are documented in the RQ Score Whitepaper v1.0 (June 2026).
