docs: update the evals page to mention recommended model (sonnet) and available evals (#451)

* docs: update the evals page to mention recommended model (sonnet) and available evals

* Update docs/evals.rst

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
ErikBjare and ellipsis-dev[bot] authored Feb 25, 2025
1 parent 161451e commit 92965cd
Showing 2 changed files with 53 additions and 7 deletions.
56 changes: 50 additions & 6 deletions docs/evals.rst
@@ -3,29 +3,73 @@ Evals

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

-To answer these questions, we have created a evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
+To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.

.. note::
-   The evaluation suite is still under development, but the eval harness is mostly complete.
+   The evaluation suite is still tiny and under development, but the eval harness is fully functional.

Recommended Model
-----------------

The recommended model is **Claude 3.7 Sonnet** (``anthropic/claude-3-7-sonnet-20250219``) for its:

- Strong coding capabilities
- Strong performance across all tool types
- Reasoning capabilities
- Vision & computer use capabilities

Decent alternatives include:

- GPT-4o (``openai/gpt-4o``)
- Llama 3.1 405B (``openrouter/meta-llama/llama-3.1-405b-instruct``)
- DeepSeek V3 (``deepseek/deepseek-chat``)
- DeepSeek R1 (``deepseek/deepseek-reasoner``)
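
The model identifiers above follow the ``<provider>/<model>`` convention described on the providers page. A minimal sketch of how such a spec could be split (the function name and error handling are illustrative assumptions, not gptme's API):

```python
# Hypothetical sketch: splitting the "<provider>/<model>" strings used above.
# The helper name and behavior are assumptions for illustration, not gptme's API.

def split_model_spec(spec: str) -> tuple[str, str]:
    """Split a "<provider>/<model>" spec into its two parts.

    The provider is everything before the first "/", so model names that
    themselves contain slashes (e.g. OpenRouter paths) stay intact.
    """
    provider, _, model = spec.partition("/")
    if not model:
        raise ValueError(f"expected '<provider>/<model>', got {spec!r}")
    return provider, model

print(split_model_spec("anthropic/claude-3-7-sonnet-20250219"))
print(split_model_spec("openrouter/meta-llama/llama-3.1-405b-instruct"))
```

Splitting on only the first slash matters for the OpenRouter entry, whose model part itself contains slashes.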

Usage
-----

-You can run the simple ``hello`` eval with gpt-4o like this:
+You can run the simple ``hello`` eval with Claude 3.7 Sonnet like this:

.. code-block:: bash

-   gptme-eval hello --model openai/gpt-4o
+   gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219

However, we recommend running it in Docker to improve isolation and reproducibility:

.. code-block:: bash

    make build-docker
    docker run \
        -e "OPENAI_API_KEY=<your api key>" \
        -e "ANTHROPIC_API_KEY=<your api key>" \
        -v $(pwd)/eval_results:/app/eval_results \
-       gptme-eval hello --model openai/gpt-4o
+       gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219

Available Evals
---------------

The current evaluations test basic tool use in gptme, such as the ability to read, write, and patch files; run code in ipython and commands in the shell; and use git and create new projects with npm and cargo. It also has basic tests for web browsing and data extraction.
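
A minimal eval of this kind typically checks whether the agent produced the expected files or output. The sketch below is a hypothetical illustration of such a check; the names and structure are assumptions, not the gptme-eval harness:

```python
# Hypothetical sketch of a minimal file-based eval check; the helper name and
# workspace representation are illustrative assumptions, not gptme-eval's API.

def check_output_contains(files: dict[str, str], path: str, needle: str) -> bool:
    """Pass if the agent produced `path` and its contents include `needle`."""
    return needle in files.get(path, "")

# Example: a "hello"-style eval expecting hello.py to print a greeting.
workspace = {"hello.py": 'print("Hello, world!")'}
print(check_output_contains(workspace, "hello.py", "Hello"))    # -> True
print(check_output_contains(workspace, "missing.py", "Hello"))  # -> False
```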

.. This is where we want to get to:

   The evaluation suite tests models on:

   1. Tool Usage

      - Shell commands and file operations
      - Git operations
      - Web browsing and data extraction
      - Project navigation and understanding

   2. Programming Tasks

      - Code completion and generation
      - Bug fixing and debugging
      - Documentation writing
      - Test creation

   3. Reasoning

      - Multi-step problem solving
      - Tool selection and sequencing
      - Error handling and recovery
      - Self-correction

Results
-------
4 changes: 3 additions & 1 deletion docs/providers.rst
@@ -1,7 +1,9 @@
Providers
=========

-We support several LLM providers, including OpenAI, Anthropic, OpenRouter, Deepseek, Azure, and any OpenAI-compatible server (e.g. ``ollama``, ``llama-cpp-python``).
+We support LLMs from several providers, including OpenAI, Anthropic, OpenRouter, Deepseek, Azure, and any OpenAI-compatible server (e.g. ``ollama``, ``llama-cpp-python``).

You can find our model recommendations on the :doc:`evals` page.

To select a provider and model, run ``gptme`` with the ``-m``/``--model`` flag set to ``<provider>/<model>``, for example:

