docs: update the evals page to mention recommended model (sonnet) and available evals (#451)

* docs: update the evals page to mention recommended model (sonnet) and available evals

* Update docs/evals.rst

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>

---------

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
ErikBjare and ellipsis-dev[bot] authored Feb 25, 2025
1 parent 161451e commit 92965cd
Showing 2 changed files with 53 additions and 7 deletions.
56 changes: 50 additions & 6 deletions docs/evals.rst
@@ -3,29 +3,73 @@ Evals

gptme provides LLMs with a wide variety of tools, but how well do models make use of them? Which tasks can they complete, and which ones do they struggle with? How far can they get on their own, without any human intervention?

-To answer these questions, we have created a evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.
+To answer these questions, we have created an evaluation suite that tests the capabilities of LLMs on a wide variety of tasks.

.. note::
-   The evaluation suite is still under development, but the eval harness is mostly complete.
+   The evaluation suite is still tiny and under development, but the eval harness is fully functional.

Recommended Model
-----------------

The recommended model is **Claude 3.7 Sonnet** (``anthropic/claude-3-7-sonnet-20250219``) for its:

- Strong coding capabilities
- Strong performance across all tool types
- Reasoning capabilities
- Vision & computer use capabilities

Decent alternatives include:

- GPT-4o (``openai/gpt-4o``)
- Llama 3.1 405B (``openrouter/meta-llama/llama-3.1-405b-instruct``)
- DeepSeek V3 (``deepseek/deepseek-chat``)
- DeepSeek R1 (``deepseek/deepseek-reasoner``)
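
The model identifiers above follow the ``<provider>/<model>`` convention described on the providers page. A minimal sketch of how such a spec could be split (the function name and error handling are illustrative assumptions, not gptme's API):

```python
# Hypothetical sketch: splitting the "<provider>/<model>" strings used above.
# The helper name and behavior are assumptions for illustration, not gptme's API.

def split_model_spec(spec: str) -> tuple[str, str]:
    """Split a "<provider>/<model>" spec into its two parts.

    The provider is everything before the first "/", so model names that
    themselves contain slashes (e.g. OpenRouter paths) stay intact.
    """
    provider, _, model = spec.partition("/")
    if not model:
        raise ValueError(f"expected '<provider>/<model>', got {spec!r}")
    return provider, model

print(split_model_spec("anthropic/claude-3-7-sonnet-20250219"))
print(split_model_spec("openrouter/meta-llama/llama-3.1-405b-instruct"))
```

Splitting on only the first slash matters for the OpenRouter entry, whose model part itself contains slashes.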

Usage
-----

-You can run the simple ``hello`` eval with gpt-4o like this:
+You can run the simple ``hello`` eval with Claude 3.7 Sonnet like this:

.. code-block:: bash

-   gptme-eval hello --model openai/gpt-4o
+   gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219

However, we recommend running it in Docker to improve isolation and reproducibility:

.. code-block:: bash

    make build-docker
    docker run \
        -e "OPENAI_API_KEY=<your api key>" \
        -e "ANTHROPIC_API_KEY=<your api key>" \
        -v $(pwd)/eval_results:/app/eval_results \
-       gptme-eval hello --model openai/gpt-4o
+       gptme-eval hello --model anthropic/claude-3-7-sonnet-20250219

Available Evals
---------------

The current evaluations test basic tool use in gptme, such as the ability to read, write, and patch files; run code in ipython and commands in the shell; and use git and create new projects with npm and cargo. It also has basic tests for web browsing and data extraction.
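
A minimal eval of this kind typically checks whether the agent produced the expected files or output. The sketch below is a hypothetical illustration of such a check; the names and structure are assumptions, not the gptme-eval harness:

```python
# Hypothetical sketch of a minimal file-based eval check; the helper name and
# workspace representation are illustrative assumptions, not gptme-eval's API.

def check_output_contains(files: dict[str, str], path: str, needle: str) -> bool:
    """Pass if the agent produced `path` and its contents include `needle`."""
    return needle in files.get(path, "")

# Example: a "hello"-style eval expecting hello.py to print a greeting.
workspace = {"hello.py": 'print("Hello, world!")'}
print(check_output_contains(workspace, "hello.py", "Hello"))    # -> True
print(check_output_contains(workspace, "missing.py", "Hello"))  # -> False
```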

.. This is where we want to get to:

   The evaluation suite tests models on:

   1. Tool Usage

      - Shell commands and file operations
      - Git operations
      - Web browsing and data extraction
      - Project navigation and understanding

   2. Programming Tasks

      - Code completion and generation
      - Bug fixing and debugging
      - Documentation writing
      - Test creation

   3. Reasoning

      - Multi-step problem solving
      - Tool selection and sequencing
      - Error handling and recovery
      - Self-correction

Results
-------
4 changes: 3 additions & 1 deletion docs/providers.rst
@@ -1,7 +1,9 @@
Providers
=========

-We support several LLM providers, including OpenAI, Anthropic, OpenRouter, Deepseek, Azure, and any OpenAI-compatible server (e.g. ``ollama``, ``llama-cpp-python``).
+We support LLMs from several providers, including OpenAI, Anthropic, OpenRouter, Deepseek, Azure, and any OpenAI-compatible server (e.g. ``ollama``, ``llama-cpp-python``).

You can find our model recommendations on the :doc:`evals` page.

To select a provider and model, run ``gptme`` with the ``-m``/``--model`` flag set to ``<provider>/<model>``, for example:

