# Thabit

An open-source platform to evaluate multiple LLMs and find the best one for your use case.
## The story
Every time a new LLM is announced, there is a flood of discussion and social media posts about how good or bad the new model is.
The idea came to me after seeing a few posts from @1littlecoder on X and @AICodeKing on YouTube comparing LLM results on their own datasets.
I wanted to create a tool that would allow me to evaluate multiple LLMs and find the best one for my use case, while keeping the dataset for future checks and evaluations.
## How it works
- Create a dataset (using the UI).
- Add the LLMs you want to evaluate to the `config.json` file.
- Evaluate the LLMs using the CLI and see the results.
## Installation
```shell
pip install thabit
```
Ensure you have a `.env` file in the same folder, as well as a `config.json` file.
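The `.env` file holds the API keys referenced by `config.json` via `api_key_env_var`. A minimal example matching the sample config below (the values are placeholders):

```
OPEN_AI_API_KEY=sk-...
DEEP_SEEK_API_KEY=...
FIREWORKS_API_KEY=...
CLAUDE_API_KEY=...
COHERE_API_KEY=...
```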
## How to run
To evaluate all models in your config file:

```shell
thabit eval --dataset <dataset_name>
```
To evaluate `GPT-4o` vs `DeepSeek-Chat`, you can use:

```shell
thabit eval --dataset <dataset_name> --models gpt-4o,deepseek-chat
```
## Sample config file
```json
{
  "models": [
    {
      "provider": "OpenAI",
      "model": "gpt-4o",
      "model_name": "GPT-4o",
      "endpoint": "https://api.openai.com/v1/chat/completions",
      "api_key_env_var": "OPEN_AI_API_KEY"
    },
    {
      "provider": "OpenAI",
      "model": "gpt-4o-mini",
      "model_name": "GPT-4o-mini",
      "endpoint": "https://api.openai.com/v1/chat/completions",
      "api_key_env_var": "OPEN_AI_API_KEY"
    },
    {
      "provider": "DeepSeek",
      "model": "deepseek-chat",
      "model_name": "DeepSeek-Chat",
      "endpoint": "https://api.deepseek.com/v1/chat/completions",
      "api_key_env_var": "DEEP_SEEK_API_KEY"
    },
    {
      "provider": "FireworksAI",
      "model": "accounts/fireworks/models/llama-v3p1-405b-instruct",
      "model_name": "Llama 3.1 405b",
      "endpoint": "https://api.fireworks.ai/inference/v1/chat/completions",
      "api_key_env_var": "FIREWORKS_API_KEY"
    },
    {
      "provider": "Anthropic",
      "model": "claude-3-5-sonnet-20240620",
      "model_name": "Claude 3.5 Sonnet",
      "endpoint": "https://api.anthropic.com/v1/messages",
      "api_key_env_var": "CLAUDE_API_KEY"
    },
    {
      "provider": "Cohere",
      "model": "command-r",
      "model_name": "Cohere Command R",
      "endpoint": "https://api.cohere.com/v1/chat",
      "api_key_env_var": "COHERE_API_KEY"
    },
    {
      "provider": "Cohere",
      "model": "command-r-plus",
      "model_name": "Cohere Command R+",
      "endpoint": "https://api.cohere.com/v1/chat",
      "api_key_env_var": "COHERE_API_KEY"
    }
  ],
  "global_parameters": {
    "temperature": 1,
    "max_tokens": 200,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0
  }
}
```
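Before running an evaluation, it can help to confirm that `config.json` parses and that every referenced API key is actually set. A minimal sketch of such a check (an illustrative helper, not part of Thabit's CLI):

```python
import json
import os


def check_config(path="config.json"):
    """Load a Thabit-style config and list models whose API key env var is unset.

    Illustrative helper, not part of Thabit itself. Returns the model entries
    and the display names of models with a missing API key.
    """
    with open(path) as f:
        cfg = json.load(f)
    missing = [
        m["model_name"]
        for m in cfg["models"]
        if not os.environ.get(m["api_key_env_var"])
    ]
    return cfg["models"], missing
```

If `missing` is non-empty, add the corresponding keys to your `.env` file before evaluating.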