Anthropic has introduced a new large language model, Claude 3.5 Sonnet. It is now available in the Claude.ai chatbot and through the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. Claude 3.5 Sonnet outperforms GPT-4o on key benchmarks, including GPQA and HumanEval, and matches it on MMLU. At the time of writing, the model is not yet listed on the Chatbot Arena. The model's context window is 200K tokens.

Claude 3.5 Sonnet Test Results

Claude 3.5 Sonnet has shown impressive results in key benchmarks:

  • Graduate-Level Reasoning (GPQA): Scored 59.4% on complex reasoning tasks, outperforming Claude 3 Opus by 9 percentage points and GPT-4o by 5.8 points.
  • Undergraduate-Level Knowledge (MMLU): Scored 88.7% on tests covering a wide range of knowledge areas, surpassing Claude 3 Opus by 2 percentage points and matching GPT-4o.
  • Programming Skills (HumanEval): Achieved 92.0% accuracy on programming tasks, exceeding Claude 3 Opus by 7.2 percentage points and GPT-4o by 1.8 points.
  • Extended Text Reasoning (DROP, F1 score): Scored 87.1% on extended text reasoning tests, outperforming Claude 3 Opus by 4 percentage points and GPT-4o by 3.7 points.
  • Mixed Evaluations (BIG-Bench-Hard): Scored 93.1% on mixed evaluation tests, surpassing Claude 3 Opus by 6.3 percentage points and GPT-4o by 3.9 points.
  • Elementary Math (GSM8K): Achieved 96.4% accuracy on elementary math tasks, exceeding Claude 3 Opus by 1.4 percentage points and GPT-4o by 5.6 points.
  • Scientific Diagrams (AI2D): Scored 94.7% on interpreting scientific diagrams, outperforming Claude 3 Opus by 6.6 percentage points and GPT-4o by 0.5 points.
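
The margins in the list above read most naturally as absolute differences in percentage points, not relative improvements. The following is a minimal Python sketch, assuming that reading, that backs out the scores implied for the other two models and contrasts the absolute margin with the relative gain. The names `RESULTS`, `implied_scores`, and `relative_gain` are illustrative, and the implied scores are derived from the figures quoted here, not independently reported numbers.

```python
# Scores and margins copied from the list above.
# Tuple layout: (Claude 3.5 Sonnet score in %, margin over Claude 3 Opus,
#                margin over GPT-4o), margins assumed to be percentage points.
RESULTS = {
    "GPQA": (59.4, 9.0, 5.8),
    "MMLU": (88.7, 2.0, 0.0),  # "matching GPT-4o" is encoded as a 0.0 margin
    "HumanEval": (92.0, 7.2, 1.8),
    "DROP (F1)": (87.1, 4.0, 3.7),
    "BIG-Bench-Hard": (93.1, 6.3, 3.9),
    "GSM8K": (96.4, 1.4, 5.6),
    "AI2D": (94.7, 6.6, 0.5),
}


def implied_scores(sonnet, opus_margin, gpt4o_margin):
    """Back out the other models' scores by subtracting the stated margins."""
    return round(sonnet - opus_margin, 1), round(sonnet - gpt4o_margin, 1)


def relative_gain(sonnet, margin):
    """Relative improvement over the implied baseline, in percent."""
    baseline = sonnet - margin
    return round(100 * margin / baseline, 1)


for name, (sonnet, opus_margin, gpt4o_margin) in RESULTS.items():
    opus, gpt4o = implied_scores(sonnet, opus_margin, gpt4o_margin)
    print(f"{name}: Sonnet {sonnet}, implied Opus {opus}, implied GPT-4o {gpt4o}")
```

Under this reading, the 9-point GPQA margin over Claude 3 Opus corresponds to a relative gain of roughly 18% over the implied baseline, which is why the two ways of stating a difference should not be mixed.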

Artifacts — A New Way to Use Claude AI

Anthropic has also introduced Artifacts, a new feature that enhances how users interact with Claude. When users ask for content such as code snippets, text documents, or web designs, these artifacts appear in a separate window next to the conversation. This creates a dynamic workspace where users can work directly on the content Claude generates.

This preview feature marks Claude's evolution from a conversational AI into a full-fledged work environment. As part of a broader vision, future versions will support team collaboration. Teams and organizations will be able to centralize their knowledge, documents, and ongoing work in one shared space, with Claude acting as an assistant.

