# Thabit

An open-source platform to evaluate multiple LLMs and find the best one for your use case.
## The story
Every time a new LLM is announced, there is a flood of discussion and social media posts about how good or bad the new model is.
The idea came to me after seeing a few posts from @1littlecoder on X and @AICodeKing on YouTube comparing LLM results on their own datasets.
I wanted to create a tool that would allow me to evaluate multiple LLMs and find the best one for my use case, while keeping the dataset for future checks and evaluations.
## How it works
- Create a dataset (using the UI).
- Add the LLMs you want to evaluate to the `config.json` file.
- Evaluate the LLMs using the CLI and see the results.
## Installation
```shell
pip install thabit
```
Ensure you have a `.env` file in the same folder, as well as a `config.json` file.
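The `.env` file holds the API keys referenced by `config.json` via `api_key_env_var`. A minimal example matching the sample config below (the values are placeholders):

```
OPEN_AI_API_KEY=sk-...
DEEP_SEEK_API_KEY=...
FIREWORKS_API_KEY=...
CLAUDE_API_KEY=...
COHERE_API_KEY=...
```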
## How to run
To evaluate all models in your config file:

```shell
thabit eval --dataset <dataset_name>
```
To evaluate `GPT-4o` vs `DeepSeek-Chat`, you can use:

```shell
thabit eval --dataset <dataset_name> --models gpt-4o,deepseek-chat
```
## Sample config file
```json
{
  "models": [
    {
      "provider": "OpenAI",
      "model": "gpt-4o",
      "model_name": "GPT-4o",
      "endpoint": "https://api.openai.com/v1/chat/completions",
      "api_key_env_var": "OPEN_AI_API_KEY"
    },
    {
      "provider": "OpenAI",
      "model": "gpt-4o-mini",
      "model_name": "GPT-4o-mini",
      "endpoint": "https://api.openai.com/v1/chat/completions",
      "api_key_env_var": "OPEN_AI_API_KEY"
    },
    {
      "provider": "DeepSeek",
      "model": "deepseek-chat",
      "model_name": "DeepSeek-Chat",
      "endpoint": "https://api.deepseek.com/v1/chat/completions",
      "api_key_env_var": "DEEP_SEEK_API_KEY"
    },
    {
      "provider": "FireworksAI",
      "model": "accounts/fireworks/models/llama-v3p1-405b-instruct",
      "model_name": "Llama 3.1 405b",
      "endpoint": "https://api.fireworks.ai/inference/v1/chat/completions",
      "api_key_env_var": "FIREWORKS_API_KEY"
    },
    {
      "provider": "Anthropic",
      "model": "claude-3-5-sonnet-20240620",
      "model_name": "Claude 3.5 Sonnet",
      "endpoint": "https://api.anthropic.com/v1/messages",
      "api_key_env_var": "CLAUDE_API_KEY"
    },
    {
      "provider": "Cohere",
      "model": "command-r",
      "model_name": "Cohere Command R",
      "endpoint": "https://api.cohere.com/v1/chat",
      "api_key_env_var": "COHERE_API_KEY"
    },
    {
      "provider": "Cohere",
      "model": "command-r-plus",
      "model_name": "Cohere Command R+",
      "endpoint": "https://api.cohere.com/v1/chat",
      "api_key_env_var": "COHERE_API_KEY"
    }
  ],
  "global_parameters": {
    "temperature": 1,
    "max_tokens": 200,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
  }
}
```
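Before running an evaluation, it can help to confirm that `config.json` parses and that every referenced API key is actually set. A minimal sketch of such a check (an illustrative helper, not part of Thabit's CLI):

```python
import json
import os


def check_config(path="config.json"):
    """Load a Thabit-style config and list models whose API key env var is unset.

    Illustrative helper, not part of Thabit itself. Returns the model entries
    and the display names of models with a missing API key.
    """
    with open(path) as f:
        cfg = json.load(f)
    missing = [
        m["model_name"]
        for m in cfg["models"]
        if not os.environ.get(m["api_key_env_var"])
    ]
    return cfg["models"], missing
```

If `missing` is non-empty, add the corresponding keys to your `.env` file before evaluating.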