The university side project that became the world’s referee of AI models
In the constantly shifting world of AI, founders often wake up to the same daily ritual. New models launch overnight, benchmarks appear on X, and yet the real question remains unanswered – which one should I actually use?
The founders behind AI evaluation platform Arena, formerly known as LMArena, have turned that confusion into a business now worth $1.7 billion.
What began as a university research project has since grown into a billion-dollar startup attempting to redefine how artificial intelligence models are measured – not by static, corporate-led benchmarks, but by ordinary users interacting with them in the wild.
The everyday pain behind the hype
For most people using AI today, the problem isn’t finding a capable model – it’s figuring out which one actually works best for what they want to do.
New versions come out constantly, each claiming top performance based on benchmark tests. But those tests usually look like school exams, while real use is messy, conversational, and context-heavy.
In practice, the differences often matter less than advertised, and the best choice depends more on the everyday task than on the leaderboard.
If you’re a founder choosing a model for customer support, or a solo developer wiring AI into your side project, it’s a familiar headache. One model might be great on a math benchmark, while another dominates a coding test, and a third looks cheaper on paper.
Two roommates, one side project
In 2023, UC Berkeley PhD students and roommates Anastasios Angelopoulos and Wei-Lin Chiang decided to attack that gap from a different angle. Instead of yet another lab-designed test, they launched Chatbot Arena, a simple website where anyone could pit AI models against each other.
The idea was disarmingly straightforward. You type a question and the site sends it to two anonymous models. You see two answers side-by-side, pick the one you prefer, and only after you vote do you see which systems you just judged.
Behind the scenes, a rating system turns millions of these “blind taste tests” into a live leaderboard of which models people actually like to use.
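The core mechanic is the kind of pairwise rating system familiar from chess, in an Elo or Bradley-Terry style. As a rough illustration only, here is a minimal Elo-style sketch in Python; the model names, starting rating of 1000, and K-factor of 32 are illustrative assumptions, not Arena's actual parameters or methodology.

```python
# Minimal sketch: turning pairwise "blind taste test" votes into a ranking.
# Illustrative only; model names, the 1000 starting rating, and K=32 are
# assumptions, not Arena's actual parameters.

ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
K = 32  # how strongly a single vote moves a rating

def expected_score(r_first: float, r_second: float) -> float:
    """Expected probability that the first model beats the second."""
    return 1.0 / (1.0 + 10 ** ((r_second - r_first) / 400))

def record_vote(winner: str, loser: str) -> None:
    """Update both ratings after one anonymous side-by-side comparison."""
    exp = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - exp)
    ratings[loser] -= K * (1 - exp)

# Simulated stream of user votes: (preferred model, rejected model)
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
for winner, loser in votes:
    record_vote(winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)
```

With enough votes, upsets against highly rated models move the rankings more than expected wins, which is what lets a crowd of individually noisy preferences settle into a stable ordering.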
It may have started as yet another research project in Berkeley’s Sky Computing Lab, but it wasn’t long before it escaped the campus bubble. By May 2025, Chatbot Arena had already reached around 1 million monthly users.
Developers, researchers and curious everyday users flocked to it as the most intuitive way to compare which model worked best.
From lab project to industry scoreboard
As usage grew, something important also shifted. AI labs themselves began to care deeply about where they ranked on the Arena leaderboard.
OpenAI, Google, Anthropic and others supplied models so the community could put them through real-world prompts. The site became a kind of public scoreboard for the AI arms race – watched not just by enthusiasts, but by executives and investors alike.
In 2025, the team spun the project out of the university as LMArena, with Angelopoulos and Chiang joined by Berkeley professor Ion Stoica as co-founder. A seed round of $100 million valued the fledgling startup at about $600 million, a rare leap for a fresh spinout – and a signal that “who can you trust?” had become one of AI’s defining business questions.
Just months later, LMArena launched AI Evaluations, a commercial service where companies pay to have their own use cases run through the same crowd-driven gauntlet. Within about four months, that product had reached an annualised revenue rate of $30 million.
By January 2026, investors doubled down. LMArena raised $150 million in Series A funding at a $1.7 billion valuation, bringing total capital to roughly $250 million in about seven months.
Today, the platform – recently rebranded simply as Arena – serves more than 5 million monthly users across 150 countries and processes around 60 million conversations each month.
That kind of visibility, however, has also brought criticism. Researchers and commentators have questioned whether any single leaderboard should wield so much influence, and whether its rankings can be subtly “gamed” by labs that optimise for Arena scores.
The founders have responded by adding safeguards and publishing data to defend their methods – but the debate still underscores how high the stakes have become.
Why this matters beyond AI labs
Perhaps Arena’s story, from university side project to one of the latest unicorns hitting the headlines, isn’t just about the valuation. It’s about who gets to decide what “good AI” means.
While traditional benchmarks measure what’s easy to score with code, the system Arena built tries to capture something closer to how AI feels in real life. Helpfulness, clarity, nuance – the things that matter when you’re fixing a spreadsheet, answering a customer, or drafting a contract.
In a world where it’s getting harder to see through the day-to-day noise, Angelopoulos, Chiang and their team are betting that trust will come from letting millions of ordinary people vote, one comparison at a time, on which models actually earn a place in their daily workflows.