🏆 LMSys AI Model Performance by Category

PLUS: Llama 3 70B Category Rankings


LMSys AI Leaderboard Ratings by Prompt Categories

The Summary: The LMSys Elo Leaderboard recently introduced a filter by prompt category, revealing the strengths and weaknesses of different AI models depending on the type of question users ask. New data on Llama 3 70B was released today. Let’s unpack…
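As background, the leaderboard's ratings come from pairwise "battles" between models, scored with an Elo-style system. The sketch below is illustrative only: the function names and the K-factor are assumptions, and LMSys's actual pipeline fits ratings over all battles at once rather than updating them one battle at a time.

```python
# Minimal online Elo update for pairwise model battles.
# Illustrative sketch, not LMSys's actual rating pipeline.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start level at 1200 and A wins one battle,
# so A gains 16 points and B loses 16 (with K = 32).
ra, rb = elo_update(1200.0, 1200.0, 1.0)
```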

Key insights:

  • Claude 3 Opus ties with GPT-4 Turbo when refused answers are excluded

  • GPT-4 Turbo leads in coding prompts

  • Claude 3 Opus ranks highest for Chinese

  • Llama 3 70B excels in English writing but underperforms on complex prompts and in non-English languages

Charts and Analysis: 

To visualize this data, we plotted the top 10 models’ scores for each prompt category.
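For readers who want to reproduce this kind of chart, a minimal matplotlib sketch follows. The scores below are hypothetical placeholders, not the actual leaderboard numbers; only the charting approach is the point.

```python
# Illustrative only: the scores in this dict are made-up placeholders,
# not real LMSys leaderboard values.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

scores = {  # hypothetical Arena scores for one prompt category
    "GPT-4 Turbo": 1260,
    "Gemini 1.5 Pro": 1250,
    "Claude 3 Opus": 1245,
    "Llama 3 70B": 1210,
}

# Sort ascending so the best model lands at the top of the bar chart.
names, values = zip(*sorted(scores.items(), key=lambda kv: kv[1]))

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(names, values)
ax.set_xlabel("Arena score (Elo)")
ax.set_title("Top models, one prompt category (placeholder data)")
ax.set_xlim(1150, 1300)
fig.tight_layout()
fig.savefig("category_scores.png")
```

Swapping in the scores from any category filter reproduces the per-category charts shown below.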

  • Overall: This represents the standard ranking without category filtering. GPT-4 Turbo is the leader, followed by Gemini 1.5 Pro API Preview and Claude 3.

Data: LMSys filtered by Overall category (10 May 2024)

  • Coding: The lead of GPT-4 Turbo over Claude 3 is more pronounced on coding tasks.

Data: LMSys filtered by Coding category (10 May 2024)

  • Longer Query: Llama 3 70B’s score degrades as prompts grow longer and more complex, while GPT-4 Turbo and Claude 3 maintain their strength.

Data: LMSys filtered by Longer Query category (10 May 2024)

  • English: GPT-4 Turbo keeps its lead in English writing, but Gemini 1.5 Pro and Llama 3 are two very strong contenders. Claude 3’s weaker showing is puzzling here, as writing is considered one of its strengths.

Data: LMSys filtered by English category (10 May 2024)

  • Chinese: Claude 3 Opus has by far the best score in Chinese writing. GPT-4 Turbo and Gemini 1.5 Pro API follow, closely tied with each other. Llama 3 70B does not rank in the top 10.

Data: LMSys filtered by Chinese category (10 May 2024)

  • French: Gemini 1.5 Pro API earns an excellent score in French writing. Claude 3 and GPT-4 Turbo follow, closely tied with each other.

Data: LMSys filtered by French category (10 May 2024)

  • Excluding refusals: When all refused answers are excluded (for example, refusals for safety reasons; Claude 3 is the most cautious model), GPT-4 Turbo and Claude 3 have nearly identical scores.

Data: LMSys filtered by Exclude refusals category (10 May 2024)

LMSys just published a new analysis of Llama 3 70B’s performance across even more fine-grained categories, and it expects to introduce these new categories to the leaderboard soon.

Data: LMSys, Llama 3 70B’s win rate (excluding ties) against the top 5 models
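The "win rate excluding ties" metric simply drops tied battles from the denominator, counting only decisive outcomes. A minimal sketch, with an assumed function name and made-up battle counts:

```python
# Hedged sketch of "win rate excluding ties": tied battles are dropped,
# so the rate is wins over decisive (win-or-loss) battles only.
# Function name and the example counts are illustrative assumptions.

def win_rate_excluding_ties(wins: int, losses: int) -> float:
    """Fraction of decisive battles won (ties removed entirely)."""
    decisive = wins + losses
    if decisive == 0:
        raise ValueError("no decisive battles to score")
    return wins / decisive

# Example with made-up counts: 60 wins, 40 losses (20 ties ignored)
rate = win_rate_excluding_ties(60, 40)  # → 0.6
```

Note that excluding ties can flatter models that tie often against strong opponents, which is worth keeping in mind when reading the win-rate chart.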

Why it matters: The prompt-category filtering on the LMSys Elo Leaderboard offers valuable insight into the strengths and weaknesses of various AI models. The latest data on Llama 3 can help users choose the most suitable model for a specific task, optimizing performance and efficiency. It also identifies areas where models may need further work, guiding developers’ efforts to advance AI capabilities across categories.