🏆 LMSys AI Model Performance by Category

PLUS: Llama 3 70B Category Rankings


LMSys AI Leaderboard Ratings by Prompt Categories

The Summary: The LMSys Elo Leaderboard recently introduced a filter by prompt category, revealing the strengths and weaknesses of different AI models depending on the type of question users ask. New data on Llama 3 70B was released today. Let’s unpack…
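As background, the leaderboard's ratings come from pairwise "battles" between models, scored with an Elo-style system. The sketch below is illustrative only: the function names and the K-factor are assumptions, and LMSys's actual pipeline fits ratings over all battles at once rather than updating them one battle at a time.

```python
# Minimal online Elo update for pairwise model battles.
# Illustrative sketch, not LMSys's actual rating pipeline.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one battle.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start level at 1200 and A wins one battle,
# so A gains 16 points and B loses 16 (with K = 32).
ra, rb = elo_update(1200.0, 1200.0, 1.0)
```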

Key insights:

  • Claude 3 Opus ties with GPT-4 Turbo when refused answers are excluded

  • GPT-4 Turbo leads in coding prompts

  • Claude 3 Opus ranks highest for Chinese

  • Llama 3 70B excels in English writing but underperforms on complex prompts and in non-English languages

Charts and Analysis: 

To visualize this data, we plotted the top 10 models’ scores for each prompt category.
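For readers who want to reproduce this kind of chart, a minimal matplotlib sketch follows. The scores below are hypothetical placeholders, not the actual leaderboard numbers; only the charting approach is the point.

```python
# Illustrative only: the scores in this dict are made-up placeholders,
# not real LMSys leaderboard values.
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

scores = {  # hypothetical Arena scores for one prompt category
    "GPT-4 Turbo": 1260,
    "Gemini 1.5 Pro": 1250,
    "Claude 3 Opus": 1245,
    "Llama 3 70B": 1210,
}

# Sort ascending so the best model lands at the top of the bar chart.
names, values = zip(*sorted(scores.items(), key=lambda kv: kv[1]))

fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(names, values)
ax.set_xlabel("Arena score (Elo)")
ax.set_title("Top models, one prompt category (placeholder data)")
ax.set_xlim(1150, 1300)
fig.tight_layout()
fig.savefig("category_scores.png")
```

Swapping in the scores from any category filter reproduces the per-category charts shown below.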

  • Overall: This represents the standard ranking without category filtering. GPT-4 Turbo is the leader, followed by Gemini 1.5 Pro API Preview and Claude 3.

Data: LMSys filtered by Overall category (10 May 2024)

  • Coding: The lead of GPT-4 Turbo over Claude 3 is more pronounced on coding tasks.

Data: LMSys filtered by Coding category (10 May 2024)

  • Longer Query: Llama 3 70B’s score degrades as prompts grow longer and more complex, while GPT-4 Turbo and Claude 3 maintain their strength.

Data: LMSys filtered by Longer Query category (10 May 2024)

  • English: GPT-4 Turbo keeps its lead in English writing, but Gemini 1.5 Pro and Llama 3 are two very strong contenders. Claude 3’s weaker showing is puzzling here, as writing is considered one of its strengths.

Data: LMSys filtered by English category (10 May 2024)

  • Chinese: Claude 3 Opus has by far the best score in Chinese writing. GPT-4 Turbo and Gemini 1.5 Pro API follow, closely tied with each other. Llama 3 70B does not rank in the top 10.

Data: LMSys filtered by Chinese category (10 May 2024)

  • French: Gemini 1.5 Pro API earns an excellent score in French writing. Claude 3 and GPT-4 Turbo follow, closely tied with each other.

Data: LMSys filtered by French category (10 May 2024)

  • Excluding refusals: When all refused answers are excluded (for example, refusals for safety reasons; Claude 3 is the most cautious model), GPT-4 Turbo and Claude 3 have nearly identical scores.

Data: LMSys filtered by Exclude refusals category (10 May 2024)

LMSys just published a new analysis of Llama 3 70B’s performance across even more fine-grained categories, and it expects to introduce these new categories to the leaderboard soon.

Data: LMSys, Llama 3 70B’s win rate (excluding ties) against the top 5 models
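The "win rate excluding ties" metric simply drops tied battles from the denominator, counting only decisive outcomes. A minimal sketch, with an assumed function name and made-up battle counts:

```python
# Hedged sketch of "win rate excluding ties": tied battles are dropped,
# so the rate is wins over decisive (win-or-loss) battles only.
# Function name and the example counts are illustrative assumptions.

def win_rate_excluding_ties(wins: int, losses: int) -> float:
    """Fraction of decisive battles won (ties removed entirely)."""
    decisive = wins + losses
    if decisive == 0:
        raise ValueError("no decisive battles to score")
    return wins / decisive

# Example with made-up counts: 60 wins, 40 losses (20 ties ignored)
rate = win_rate_excluding_ties(60, 40)  # → 0.6
```

Note that excluding ties can flatter models that tie often against strong opponents, which is worth keeping in mind when reading the win-rate chart.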

Why it matters: The prompt-category filtering on the LMSys Elo Leaderboard offers valuable insight into the strengths and weaknesses of various AI models. The latest data on Llama 3 can help users choose the most suitable model for a specific task, optimizing performance and efficiency. It also identifies areas where models may need further work, guiding developers’ efforts to advance AI capabilities across categories.