
๐Ÿ† LLM Tournament Arena

A comprehensive benchmarking platform for evaluating and comparing Large Language Models
Real-time scoring • Test suite management • Collaborative evaluation • Advanced analytics

📦 Single Binary Deployment • ⚡ WebSocket Real-Time Updates • 📊 Interactive Dashboards

Program screenshots (in the repository): Results, Evaluate, Stats, Prompts, Edit Prompt, and Profiles views.

## 🚀 Quick Start

# Clone & Run
git clone https://github.com/lavantien/llm-tournament.git
cd llm-tournament
make run

Access at http://localhost:8080

## 🌟 Key Features

### 🧪 Evaluation Engine

- 🎯 Real-time scoring with WebSocket updates (0-100 scale across six levels)
- 📈 Automatic model ranking with a real-time leaderboard
- 🧮 Granular scoring system (0/5, 1/5, 2/5, 3/5, 4/5, 5/5)
- 📉 Pass-percentage calculation and visualization (see the sketch after this list)
- 🔄 Instant updates across all connected clients
- 🔀 Random score generation for prototyping
- ⏪ State backup and restore functionality

### 📚 Test Suite Management

  • ๐Ÿ—‚๏ธ Create/rename/delete independent prompt suites
  • ๐Ÿ”— Isolated profiles and results per suite
  • โšก One-click suite switching with instant UI updates
  • ๐Ÿ“ฆ Complete suite export/import (JSON)
  • ๐Ÿท๏ธ Profile-based prompt categorization and filtering

โœ๏ธ Prompt Workshop

  • ๐Ÿ“ Rich Markdown editing with live preview
  • ๐Ÿ–‡๏ธ Profile assignment for prompt categorization
  • ๐Ÿงฉ Bulk selection, deletion, and export operations
  • ๐ŸŽš๏ธ Drag-and-drop reordering with automatic saving
  • ๐Ÿ” Real-time search and multi-criteria filtering
  • ๐Ÿ“‹ One-click copy functionality for prompt text
  • ๐Ÿ“ค JSON export/import with validation

### 🤖 Model Arena

- ➕ Quick model addition with automatic score initialization
- ✏️ In-place model renaming with result preservation
- 🗑️ Model deletion with confirmation
- 📊 Color-coded scoring visualization (red-to-blue gradient)
- 🔄 Consistent state persistence across sessions
- 🔍 Model search and filtering capabilities

### 👤 Profile System

- 📋 Create reusable evaluation profiles
- 🔖 Associate profiles with prompts for categorization
- 🔄 Automatic prompt updates when profiles are renamed
- 🔍 Profile-based filtering in the prompt view
- 📝 Markdown description support with preview

### 📊 Analytics Suite

- 📊 Detailed score breakdowns with Chart.js visualizations
- 🏆 Comprehensive tier classification system (a lookup sketch follows this list):
  - Transcendent (1900-2000) 🌌
  - Super-Grandmaster (1800-1899) 🌟
  - Grandmaster (1700-1799) 🥇
  - International Master (1600-1699) 🎖️
  - Master (1500-1599) 🏅
  - Expert (1400-1499) 🎓
  - Pro Player (1200-1399) 🎮
  - Advanced Player (1000-1199) 🎯
  - Intermediate Player (800-999) 📈
  - Veteran (600-799) 👨‍💼
  - Beginner (0-599) 🐣
- 📈 Score distribution visualization
- 📋 Tier-based model grouping
- 📑 Performance comparison across models
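
As a rough illustration, the tier lookup implied by those ranges can be a simple ordered scan; the `tierFor` function and `tier` struct here are hypothetical, not the project's code:

```go
package main

import "fmt"

type tier struct {
	min  int    // lowest score that still qualifies for this tier
	name string
}

// tiers is ordered from highest to lowest minimum score, mirroring
// the ranges listed above.
var tiers = []tier{
	{1900, "Transcendent"},
	{1800, "Super-Grandmaster"},
	{1700, "Grandmaster"},
	{1600, "International Master"},
	{1500, "Master"},
	{1400, "Expert"},
	{1200, "Pro Player"},
	{1000, "Advanced Player"},
	{800, "Intermediate Player"},
	{600, "Veteran"},
	{0, "Beginner"},
}

// tierFor returns the first tier whose minimum the score meets.
func tierFor(score int) string {
	for _, t := range tiers {
		if score >= t.min {
			return t.name
		}
	}
	return "Beginner"
}

func main() {
	fmt.Println(tierFor(1650)) // International Master
}
```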

### 💻 Evaluation Interface

- 🎯 Streamlined scoring with color-coded buttons
- 📝 Full prompt and solution display with Markdown rendering
- ⬅️➡️ Previous/Next navigation between prompts
- 📋 One-click copying of raw prompt text
- 🔍 Clear visualization of current scores
- 🏃‍♂️ Rapid evaluation workflow

### 🔄 Real-Time Collaboration

  • ๐ŸŒ WebSocket-based instant updates across all clients
  • ๐Ÿ“ค Simultaneous editing with conflict resolution
  • ๐Ÿ”„ Broadcast of all changes to connected users
  • ๐Ÿ“ก Connection status monitoring
  • ๐Ÿ”„ Automatic reconnection handling
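
A minimal sketch of the hub-and-broadcast pattern this implies, using Gorilla WebSocket; the project's actual hub structure may differ:

```go
package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/gorilla/websocket"
)

var (
	upgrader = websocket.Upgrader{}
	mu       sync.Mutex
	clients  = map[*websocket.Conn]bool{}
)

// broadcast fans a message out to every connected client, dropping
// any connection whose write fails.
func broadcast(msg []byte) {
	mu.Lock()
	defer mu.Unlock()
	for c := range clients {
		if err := c.WriteMessage(websocket.TextMessage, msg); err != nil {
			c.Close()
			delete(clients, c)
		}
	}
}

func wsHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	mu.Lock()
	clients[conn] = true
	mu.Unlock()
	defer func() {
		mu.Lock()
		delete(clients, conn)
		mu.Unlock()
		conn.Close()
	}()
	for {
		// Rebroadcast every incoming update so all clients stay in sync.
		_, msg, err := conn.ReadMessage()
		if err != nil {
			return
		}
		broadcast(msg)
	}
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```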

๐Ÿ› ๏ธ Tech Stack

**Backend**
Go 1.21+ • Gorilla WebSocket • Blackfriday • Bluemonday

**Frontend**
HTML5 • CSS3 • JavaScript ES6+ • Chart.js 4.x • Marked.js

**Data**
JSON Storage • File-based Persistence • JSON Import/Export • State Versioning

**Security**
XSS Sanitization • CORS Protection • Input Validation • Error Handling
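
A minimal sketch of how Blackfriday and Bluemonday typically pair up for the XSS-safe Markdown rendering listed above; this is the standard usage of the two libraries, not necessarily the project's exact code:

```go
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
	blackfriday "github.com/russross/blackfriday/v2"
)

// renderMarkdown converts Markdown to HTML, then sanitizes the result
// so untrusted prompt text cannot inject scripts.
func renderMarkdown(src string) string {
	unsafe := blackfriday.Run([]byte(src))                // Markdown -> HTML
	safe := bluemonday.UGCPolicy().SanitizeBytes(unsafe)  // strip scripts etc.
	return string(safe)
}

func main() {
	fmt.Println(renderMarkdown("**bold** <script>alert(1)</script>"))
}
```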

## 🧰 Complementary Tools

**Text-to-Speech**
`tools/tts/podcast.py` - Generate podcast audio from text scripts using Kokoro ONNX models

**Background Removal**
`tools/bg_batch_eraser/main.py` - Remove backgrounds from images using the BEN2 model
`tools/bg_batch_eraser/vidseg.py` - Extract foreground from videos with alpha-channel support
`tools/bg_batch_eraser/BEN2.py` - Core background-eraser neural network implementation

**LLM Integration**
`tools/openwebui/pipes/anthropic_claude_thinking_96k.py` - OpenWebUI pipe for Claude with thinking mode (96k context)
`tools/ragweb_agent` - RAG capabilities for web-based content

๐Ÿ Getting Started

### Prerequisites

- Go 1.21+
- Make
- Git

### Installation & Running

```bash
# Development mode
./dev.sh

# Production build
make build
./release/llm-tournament
```

## 📚 Usage Guide

1. **Set Up Test Suites**
   - Create a new suite for your evaluation task
   - Configure profiles for different prompt categories
   - Import existing prompts or create new ones
2. **Configure Models**
   - Add each model you want to evaluate
   - Models can represent different LLMs, versions, or configurations
3. **Prepare Prompts**
   - Write prompts with appropriate solutions
   - Assign profiles for categorization
   - Arrange prompts in the desired evaluation order
4. **Run Evaluations**
   - Navigate through prompts and assess each model
   - Use the 0-5 scoring system (0, 20, 40, 60, 80, 100 points)
   - Copy prompts directly to your LLM for testing
5. **Analyze Results**
   - View the results page for summary scores
   - Examine tier classifications on the stats page
   - Compare performance across different prompt types
   - Export results for external analysis

## 🔧 Advanced Features

- **Bulk Operations**: Select multiple prompts for deletion or other actions
- **Drag-and-Drop**: Reorder prompts with an intuitive drag-and-drop interface
- **State Preservation**: Restore the previous state with the "Previous" button
- **Mock Data**: Generate random scores to prototype and test visualizations
- **Search & Filter**: Find specific prompts, models, or profiles quickly

๐Ÿค Contribution

We welcome contributions!
📌 First time? Try tickets labeled `good first issue`.
🔧 Core areas needing help:

- Evaluation workflow enhancements
- Additional storage backends
- Advanced visualization
- CI/CD pipeline improvements

**Contribution Process:**

1. Fork the repository
2. Create a feature branch
3. Submit a PR with a description
4. Address review comments
5. Merge after approval

## 🗺 Roadmap

### Q2 2025

- 🧠 Multi-LLM consensus scoring
- 🌐 Distributed evaluation mode
- 🔍 Advanced search syntax
- 📱 Responsive mobile design

### Q3 2025

- 📊 Custom metric definitions
- 🤖 Auto-evaluation agents
- 🔄 CI/CD integration
- 🔐 User authentication

## 📜 License

MIT License - See LICENSE for details

## 📬 Contact

My work email: cariyaputta@gmail.com