From 92965cd1428090842da5368e3724698ff11ecb3b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Erik=20Bj=C3=A4reholt?=
Date: Tue, 25 Feb 2025 20:09:04 +0000
Subject: [PATCH] docs: update the evals page to mention recommended model
 (sonnet) and available evals (#451)

* docs: update the evals page to mention recommended model (sonnet) and available evals

* Update docs/evals.rst

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
---
 docs/evals.rst     | 56 +++++++++++++++++++++++++++++++++++++++++-----
 docs/providers.rst |  4 +++-
 2 files changed, 53 insertions(+), 7 deletions(-)

diff --git a/docs/evals.rst b/docs/evals.rst
index a0332b5b8..c1a9df73f 100644
--- a/docs/evals.rst
+++ b/docs/evals.rst
@@ -3,19 +3,36 @@ Evals
 
 gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?
 
-To answer these questions, we have created a evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
+To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
 
 .. note::
-    The evaluation suite is still under development, but the eval harness is mostly complete.
+    The evaluation suite is still tiny and under development, but the eval harness is fully functional.
+
+Recommended Model
+-----------------
+
+The recommended model is **Claude 3.7 Sonnet** (``anthropic/claude-3-7-sonnet-20250219``) for its:
+
+- Strong coder capabilities
+- Strong performance across all tool types
+- Reasoning capabilities
+- Vision & computer use capabilities
+
+Decent alternatives include:
+
+- GPT-4o (``openai/gpt-4o``)
+- Llama 3.1 405B (``openrouter/meta-llama/llama-3.1-405b-instruct``)
+- DeepSeek V3 (``deepseek/deepseek-chat``)
+- DeepSeek R1 (``deepseek/deepseek-reasoner``)
 
 Usage
 -----
 
-You can run the simple ``hello`` eval with gpt-4o like this:
+You can run the simple ``hello`` eval with Claude 3.7 Sonnet like this:
 
 .. code-block:: bash
 
-    gptme-eval hello --model openai/gpt-4o
+    gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219
 
 However, we recommend running it in Docker to improve isolation and reproducibility:
 
@@ -23,9 +40,36 @@ However, we recommend running it in Docker to improve isolation and reproducibil
 
     make build-docker
    docker run \
-        -e "OPENAI_API_KEY=<your-api-key>" \
+        -e "ANTHROPIC_API_KEY=<your-api-key>" \
         -v $(pwd)/eval_results:/app/eval_results \
-        gptme-eval hello --model openai/gpt-4o
+        gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219
+
+Available Evals
+---------------
+
+The current evaluations test basic tool use in gptme, such as the ability to: read, write, patch files; run code in ipython, commands in the shell; use git and create new projects with npm and cargo. It also has basic tests for web browsing and data extraction.
+
+.. This is where we want to get to:
+
+   The evaluation suite tests models on:
+
+   1. Tool Usage
+      - Shell commands and file operations
+      - Git operations
+      - Web browsing and data extraction
+      - Project navigation and understanding
+
+   2. Programming Tasks
+      - Code completion and generation
+      - Bug fixing and debugging
+      - Documentation writing
+      - Test creation
+
+   3. Reasoning
+      - Multi-step problem solving
+      - Tool selection and sequencing
+      - Error handling and recovery
+      - Self-correction
 
 Results
 -------
diff --git a/docs/providers.rst b/docs/providers.rst
index d5548d9aa..fee41bafd 100644
--- a/docs/providers.rst
+++ b/docs/providers.rst
@@ -1,7 +1,9 @@
 Providers
 =========
 
-We support several LLM providers, including OpenAI, Anthropic, OpenRouter, Deepseek, Azure, and any OpenAI-compatible server (e.g. ``ollama``, ``llama-cpp-python``).
+We support LLMs from several providers, including OpenAI, Anthropic, OpenRouter, Deepseek, Azure, and any OpenAI-compatible server (e.g. ``ollama``, ``llama-cpp-python``).
+
+You can find our model recommendations on the :doc:`evals` page.
 
 To select a provider and model, run ``gptme`` with the ``-m``/``--model`` flag set to ``<provider>/<model>``, for example:
 
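A minimal sketch of trying the patched docs end to end (not part of the patch itself): it reuses only the ``gptme-eval`` invocation, the model IDs, and the ``eval_results`` directory shown above, and assumes the CLI takes one model per run, as in the examples:

.. code-block:: bash

    # Sketch: run the same eval against the recommended model and one listed alternative.
    # Assumes gptme-eval is on PATH and the relevant API keys are exported.
    for model in anthropic/claude-3-7-sonnet-20250219 openai/gpt-4o; do
        gptme-eval hello --model "$model"
    done

    # The Docker example mounts $(pwd)/eval_results, so results should appear here.
    ls eval_results/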