A comprehensive benchmarking platform for evaluating and comparing Large Language Models
Real-time scoring • Test suite management • Collaborative evaluation • Advanced analytics
Single Binary Deployment • WebSocket Real-Time Updates • Interactive Dashboards
Program screenshots: Results, Evaluate, Stats, Prompts, Edit Prompt, and Profiles views.
```bash
# Clone & Run
git clone https://github.com/lavantien/llm-tournament.git
cd llm-tournament
make run
```

Access the app at http://localhost:8080.
- Real-time scoring with WebSocket updates (0-100 scale in 20-point increments)
- Automatic model ranking with a real-time leaderboard
- Granular scoring system (0/5, 1/5, 2/5, 3/5, 4/5, 5/5)
- Pass-percentage calculations and visualization
- Instant updates across all connected clients
- Random score generation for prototyping
- State backup and restore functionality
- Create/rename/delete independent prompt suites
- Isolated profiles and results per suite
- One-click suite switching with instant UI updates
- Complete suite export/import (JSON)
- Profile-based prompt categorization and filtering
- Rich Markdown editing with live preview
- Profile assignment for prompt categorization
- Bulk selection, deletion, and export operations
- Drag-and-drop reordering with automatic saving
- Real-time search and multi-criteria filtering
- One-click copy functionality for prompt text
- JSON export/import with validation
- Quick model addition with automatic score initialization
- In-place model renaming with result preservation
- Model deletion with confirmation
- Color-coded scoring visualization (red-to-blue gradient)
- Consistent state persistence across sessions
- Model search and filtering capabilities
- Create reusable evaluation profiles
- Associate profiles with prompts for categorization
- Automatic prompt updates when profiles are renamed
- Profile-based filtering in the prompt view
- Markdown description support with preview
- Detailed score breakdowns with Chart.js visualizations
- Comprehensive tier classification system (see the score-to-tier sketch after this list):
  - Transcendent (1900-2000)
  - Super-Grandmaster (1800-1899)
  - Grandmaster (1700-1799)
  - International Master (1600-1699)
  - Master (1500-1599)
  - Expert (1400-1499)
  - Pro Player (1200-1399)
  - Advanced Player (1000-1199)
  - Intermediate Player (800-999)
  - Veteran (600-799)
  - Beginner (0-599)
- Score distribution visualization
- Tier-based model grouping
- Performance comparison across models
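For reference, here is a minimal Go sketch of how an aggregate score could be mapped onto the tier names above. The `tierFor` helper and its signature are assumptions for illustration, not the project's actual API.

```go
package main

import "fmt"

// tierFor maps an aggregate score (0-2000) onto the tier names used on the
// stats page. Hypothetical helper for illustration; the project may compute
// tiers differently.
func tierFor(score int) string {
	switch {
	case score >= 1900:
		return "Transcendent"
	case score >= 1800:
		return "Super-Grandmaster"
	case score >= 1700:
		return "Grandmaster"
	case score >= 1600:
		return "International Master"
	case score >= 1500:
		return "Master"
	case score >= 1400:
		return "Expert"
	case score >= 1200:
		return "Pro Player"
	case score >= 1000:
		return "Advanced Player"
	case score >= 800:
		return "Intermediate Player"
	case score >= 600:
		return "Veteran"
	default:
		return "Beginner"
	}
}

func main() {
	fmt.Println(tierFor(1750)) // Grandmaster
}
```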
- Streamlined scoring with color-coded buttons
- Full prompt and solution display with Markdown rendering
- Previous/Next navigation between prompts
- One-click copying of raw prompt text
- Clear visualization of current scores
- Rapid evaluation workflow
- WebSocket-based instant updates across all clients (see the broadcast sketch after this list)
- Simultaneous editing with conflict resolution
- Broadcast of all changes to connected users
- Connection status monitoring
- Automatic reconnection handling
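To illustrate the real-time flow, the sketch below shows how score updates could be fanned out to every connected client with Gorilla WebSocket, the library listed in the tech stack below. The `/ws` route, the `scoreUpdate` message shape, and the handler are assumptions, not the project's actual protocol.

```go
package main

import (
	"log"
	"net/http"
	"sync"

	"github.com/gorilla/websocket"
)

// scoreUpdate is a hypothetical message shape; the real protocol may differ.
type scoreUpdate struct {
	Model  string `json:"model"`
	Prompt int    `json:"prompt"`
	Score  int    `json:"score"` // 0, 20, 40, 60, 80, or 100
}

var (
	upgrader = websocket.Upgrader{}
	mu       sync.Mutex
	clients  = map[*websocket.Conn]bool{}
)

// wsHandler upgrades the HTTP connection, registers the client, and
// rebroadcasts every score update it receives to all connected clients.
func wsHandler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	mu.Lock()
	clients[conn] = true
	mu.Unlock()

	defer func() {
		mu.Lock()
		delete(clients, conn)
		mu.Unlock()
		conn.Close()
	}()

	for {
		var upd scoreUpdate
		if err := conn.ReadJSON(&upd); err != nil {
			return // client disconnected
		}
		mu.Lock()
		for c := range clients {
			c.WriteJSON(upd) // best-effort broadcast; errors ignored for brevity
		}
		mu.Unlock()
	}
}

func main() {
	http.HandleFunc("/ws", wsHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```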
- Backend: Go 1.21+ • Gorilla WebSocket • Blackfriday • Bluemonday
- Frontend: HTML5 • CSS3 • JavaScript ES6+ • Chart.js 4.x • Marked.js
- Data: JSON storage • File-based persistence • JSON import/export • State versioning
- Security: XSS sanitization • CORS protection • Input validation • Error handling
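Blackfriday and Bluemonday pair naturally for the Markdown rendering and XSS sanitization duties listed above. The sketch below is a generic usage example of those two libraries, not the project's exact rendering code.

```go
package main

import (
	"fmt"

	"github.com/microcosm-cc/bluemonday"
	"github.com/russross/blackfriday/v2"
)

// renderMarkdown converts untrusted Markdown (prompts, solutions, profile
// descriptions) to HTML, then strips scripts, event handlers, and other
// dangerous markup before it reaches the browser.
func renderMarkdown(src string) string {
	unsafe := blackfriday.Run([]byte(src))               // Markdown -> HTML
	safe := bluemonday.UGCPolicy().SanitizeBytes(unsafe) // sanitize the HTML
	return string(safe)
}

func main() {
	fmt.Println(renderMarkdown("**bold** <script>alert(1)</script>"))
}
```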
- Text-to-Speech
  - `tools/tts/podcast.py`: generate podcast audio from text scripts using Kokoro ONNX models
- Background Removal
  - `tools/bg_batch_eraser/main.py`: remove backgrounds from images using the BEN2 model
  - `tools/bg_batch_eraser/vidseg.py`: extract foreground from videos with alpha-channel support
  - `tools/bg_batch_eraser/BEN2.py`: core background-eraser neural network implementation
- LLM Integration
  - `tools/openwebui/pipes/anthropic_claude_thinking_96k.py`: OpenWebUI pipe for Claude with thinking mode (96k context)
  - `tools/ragweb_agent`: RAG capabilities for web-based content
- Go 1.21+
- Make
- Git
```bash
# Development mode
./dev.sh

# Production build
make build
./release/llm-tournament
```
1. Set Up Test Suites
   - Create a new suite for your evaluation task
   - Configure profiles for different prompt categories
   - Import existing prompts or create new ones
2. Configure Models
   - Add each model you want to evaluate
   - Models can represent different LLMs, versions, or configurations
3. Prepare Prompts
   - Write prompts with appropriate solutions
   - Assign profiles for categorization
   - Arrange prompts in the desired evaluation order
4. Run Evaluations
   - Navigate through prompts and assess each model
   - Use the 0-5 scoring system (0, 20, 40, 60, 80, 100 points)
   - Copy prompts directly to your LLM for testing
5. Analyze Results
   - View the results page for summary scores
   - Examine tier classifications on the stats page
   - Compare performance across different prompt types
   - Export results for external analysis (see the sketch below)
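If you want to post-process an export outside the app, a short Go program can aggregate the scores. The `results.json` layout assumed below (model name mapped to per-prompt scores) is hypothetical; inspect an actual export before relying on it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// results maps a model name to its per-prompt scores (0-100).
// Assumed layout for illustration only.
type results map[string][]int

func main() {
	data, err := os.ReadFile("results.json") // hypothetical export file name
	if err != nil {
		panic(err)
	}
	var r results
	if err := json.Unmarshal(data, &r); err != nil {
		panic(err)
	}
	for model, scores := range r {
		if len(scores) == 0 {
			continue
		}
		total := 0
		for _, s := range scores {
			total += s
		}
		fmt.Printf("%s: average %.1f over %d prompts\n",
			model, float64(total)/float64(len(scores)), len(scores))
	}
}
```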
- Bulk Operations: Select multiple prompts for deletion or other actions
- Drag-and-Drop: Reorder prompts with intuitive drag-and-drop interface
- State Preservation: Previous state can be restored with the "Previous" button
- Mock Data: Generate random scores to prototype and test visualizations
- Search & Filter: Find specific prompts, models, or profiles quickly
We welcome contributions!

First time? Try tickets labeled `good first issue`.

Core areas needing help:
- Evaluation workflow enhancements
- Additional storage backends
- Advanced visualization
- CI/CD pipeline improvements
Contribution process:
1. Fork the repository
2. Create a feature branch
3. Submit a PR with a description of your changes
4. Address review comments
5. Merge after approval
- Multi-LLM consensus scoring
- Distributed evaluation mode
- Advanced search syntax
- Responsive mobile design
- Custom metric definitions
- Auto-evaluation agents
- CI/CD integration
- User authentication
MIT License - See LICENSE for details
My work email: cariyaputta@gmail.com