Meta | May 15, 2025 | 4 min read

How We Rank AI Models: Our Updated Methodology

We revamped our scoring system to better handle multimodal models and updated our community vote weighting to reduce gaming. Here is what changed and why.

When we launched the first version of the Rankly scoring system in early 2024, we were tracking around thirty models. Today that number is over two hundred, and includes text models, image generators, video tools, and specialized coding assistants. The original methodology had gaps that were manageable at small scale and became serious problems at large scale. This post explains the updated system we deployed in March 2025 and the reasoning behind each decision.

The Five Scoring Criteria

Community Votes (30%)

Community votes remain the largest single component because Rankly is fundamentally a community product. No editorial team, however diligent, can cover every model across every use case. Actual users know things we do not. That said, community voting systems are vulnerable to manipulation, and we made substantial changes to how we weight votes. A vote from an account with a verified API key now carries more weight than a vote from a newly created account. We also apply a time decay that reduces the influence of vote spikes, which typically indicate coordinated activity.
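As a rough illustration of how account trust and time decay could combine into a single vote weight, here is a minimal sketch. The constants, the exponential half-life form, and the names `Vote`, `vote_weight`, and `weighted_vote_score` are all assumptions made for this example; the actual values and decay function are not published here.

```python
from dataclasses import dataclass

# Hypothetical constants chosen for illustration only.
VERIFIED_WEIGHT = 1.0       # account with a verified API key
NEW_ACCOUNT_WEIGHT = 0.25   # newly created, unverified account
HALF_LIFE_DAYS = 30.0       # a vote loses half its influence every 30 days


@dataclass
class Vote:
    value: float      # e.g. +1.0 for an upvote, -1.0 for a downvote
    verified: bool    # True if the account has a verified API key
    age_days: float   # time since the vote was cast


def vote_weight(vote: Vote) -> float:
    """Base weight from account trust, scaled by exponential time decay."""
    base = VERIFIED_WEIGHT if vote.verified else NEW_ACCOUNT_WEIGHT
    return base * 0.5 ** (vote.age_days / HALF_LIFE_DAYS)


def weighted_vote_score(votes: list[Vote]) -> float:
    """Weighted mean of vote values; 0.0 when there are no votes."""
    total_weight = sum(vote_weight(v) for v in votes)
    if total_weight == 0:
        return 0.0
    return sum(v.value * vote_weight(v) for v in votes) / total_weight
```

Under these assumed numbers, a fresh vote from a verified account counts four times as much as a fresh vote from a new account, and any burst of votes fades as it ages.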

Output Quality (25%)

This component is based on a standardized prompt suite we run against every model on a weekly basis. The suite covers writing, reasoning, coding, and instruction-following tasks. We use a combination of automated scoring (for tasks with deterministic correct answers) and blind human evaluation (for open-ended writing and reasoning). The human evaluation panel rotates monthly to avoid rater drift.

Speed (20%)

Speed is measured as median time-to-first-token and median time-to-complete for a standardized 500-token generation. We run these measurements from three geographic regions and average the results, so models that are fast in one region but slow in others score lower than models that are consistently fast everywhere. We also measure p99 latency, because a model that is usually fast but occasionally stalls makes for a worse production experience than one that is uniformly moderate.
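The aggregation described above can be sketched as follows. This is an illustrative example, not the production pipeline: the function names, the nearest-rank percentile method, and the choice to report the worst regional p99 are assumptions, since the post only says that medians are averaged across regions and that p99 is measured.

```python
import math
from statistics import mean, median


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(samples_by_region):
    """samples_by_region maps a region name to a list of latencies in seconds."""
    region_medians = [median(s) for s in samples_by_region.values()]
    region_p99s = [percentile_nearest_rank(s, 99) for s in samples_by_region.values()]
    return {
        # Average of per-region medians, as described in the text.
        "mean_of_region_medians": mean(region_medians),
        # Assumption: surface the worst regional tail rather than averaging it away.
        "worst_region_p99": max(region_p99s),
    }
```

Taking the median within each region before averaging keeps a single slow request from skewing the headline number, while the separate p99 figure is what exposes occasional stalls.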

Value for Money (15%)

Value is normalized output quality per dollar of API cost. A model that scores 80 on quality at $5 per million tokens scores higher on this dimension than a model that scores 90 at $75 per million tokens. We use list pricing rather than negotiated rates, since negotiated rates are not publicly available. Free tiers are included separately, and a model with a generous free tier gets partial credit here.
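To make the arithmetic in that example concrete, here is a minimal sketch of the raw quality-per-dollar ratio. The function name is hypothetical, and the sketch leaves out both the normalization step and the free-tier credit, since neither formula is spelled out in this post.

```python
def quality_per_dollar(quality_score: float, price_per_million_tokens: float) -> float:
    """Raw quality points per dollar of list price (per million tokens)."""
    if price_per_million_tokens <= 0:
        raise ValueError("price must be positive; free tiers are handled separately")
    return quality_score / price_per_million_tokens


cheaper = quality_per_dollar(80, 5)    # 16 quality points per dollar
pricier = quality_per_dollar(90, 75)   # about 1.2 quality points per dollar
```

The cheaper model delivers roughly thirteen times as many quality points per dollar, which is why it wins on this dimension despite the lower quality score.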

Freshness (10%)

Freshness measures how recently a model was updated and whether its training data cutoff is current. A model with a training cutoff from two years ago will score lower than a comparable model with a recent cutoff. We also track changelog cadence: models that receive regular bug fixes and capability updates score better than those that have been abandoned since their initial release.

What Changed from 2024

The most significant change is that we reduced the weight of community votes from 40% to 30% and redistributed that weight to output quality and speed. The reason is straightforward: as Rankly grew, the community vote component became increasingly gameable, and we were spending significant time on fraud detection rather than improving the parts of the scoring that we actually control.
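With the current weights, the overall score can be read as a weighted sum of the five components. The sketch below assumes each component has already been scaled to a common 0 to 100 range; that scaling, the dictionary keys, and the function name are assumptions made for illustration.

```python
# Current weights as listed in the section headings.
WEIGHTS = {
    "community_votes": 0.30,
    "output_quality": 0.25,
    "speed": 0.20,
    "value_for_money": 0.15,
    "freshness": 0.10,
}


def composite_score(components: dict) -> float:
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
```

Because the weights sum to one, a model that scores 100 on every component gets a composite of 100, and ten points of community vote score now move the total by three points rather than four.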

We also added the multimodal capability flag, which is not a scoring component but a filter that lets users see only models relevant to their task. A text-only model and a vision model are not directly comparable in most use cases, and the old system did not handle that distinction cleanly.
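The difference between a filter and a scoring component can be shown in a few lines. In this sketch the `Model` type, the modality labels, and the example entries are all hypothetical; the point is only that the flag decides which models appear in a ranking and never changes their scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: float            # composite score, unaffected by the flag
    modalities: frozenset   # e.g. frozenset({"text", "vision"})


def rank_for_task(models, required_modality):
    """Keep only models that support the modality, then sort by score."""
    eligible = [m for m in models if required_modality in m.modalities]
    return sorted(eligible, key=lambda m: m.score, reverse=True)


# Hypothetical catalogue entries for illustration.
catalogue = [
    Model("text-only-a", 91.0, frozenset({"text"})),
    Model("vision-b", 84.0, frozenset({"text", "vision"})),
    Model("vision-c", 88.0, frozenset({"vision"})),
]
vision_ranking = [m.name for m in rank_for_task(catalogue, "vision")]
```

Here the text-only model has the highest score but is left out of the vision ranking entirely, so it is never compared against models built for a different task.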

Rankly AI editorial team