Review | June 3, 2025 | 6 min read

DeepSeek R2: The Open-Source Model Shaking Up the Rankings

DeepSeek's latest release jumped 40 points on our RQ scale. We tested it against GPT-4o and Claude 3.5 on reasoning, math, and coding.

DeepSeek has done it again. Six months after their R1 model shocked the AI world by matching GPT-4 performance at a fraction of the cost, their R2 release has landed with an even bigger splash, jumping 40 points on our RQ scale in a single update cycle.

What Changed in R2

DeepSeek R2 brings a context window four times larger (128k tokens, up from 32k), improved instruction following, and what the team describes as “chain-of-thought distillation”: training the model to reason more explicitly before answering. On our internal coding benchmark, R2 scored 87/100 versus R1's 74/100, a 13-point gain.

Benchmarks Against GPT-4o and Claude 3.5

On math reasoning (our modified MATH dataset), R2 scored 91% accuracy, ahead of GPT-4o at 88% and Claude 3.5 Sonnet at 86%. On our creative writing test, however, R2 fell short: GPT-4o and Claude 3.5 Sonnet both produced more nuanced prose with better tonal control.

The Verdict

DeepSeek R2 is the best open-weight model available today, and on pure reasoning and coding tasks, it challenges the best closed models from OpenAI and Anthropic. It is not yet competitive on creative writing or nuanced instruction following. For developers who want frontier-level reasoning without API costs, R2 is the obvious choice.

Rankly AI editorial team
