Efficiently Evaluating LLMs Locally: A Modern Framework for Technical Teams

Cut through the noise of 100s of new LLMs with a systematic local testing strategy.

The Challenge of LLM Evaluation in 2024

Large Language Models (LLMs) are evolving at breakneck speed, with hundreds of new models and fine-tunes released every week. For technical teams, this poses a critical problem: How do you quickly assess which model works best for your use case without burning resources on cloud APIs or unreliable benchmarks?

The answer lies in local testing—a cost-effective, secure, and iterative approach to evaluating LLMs. This blog outlines a battle-tested framework to compare models offline using cutting-edge tools like Ollama, OpenWebUI, and beyond.

Why Local Testing? Key Advantages

  1. Cost Control: Avoid API fees and GPU rental costs.
  2. Data Privacy: Keep sensitive data on-premises.
  3. Customization: Test against domain-specific tasks and datasets.
  4. Speed: Iterate faster without waiting for cloud deployments.

Step 1: Set Up a Local LLM Playground

Tool 1: Ollama (The Engine)

Ollama is a lightweight, open-source tool for running LLMs locally. It supports models like Llama 3, Mistral, Phi-3, and custom quantized variants.

Installation (bash):

curl -fsSL https://ollama.ai/install.sh | sh
ollama run llama3  # Test Meta's Llama 3

Why Ollama?

  • Runs on CPU/GPU with optimized quantization (e.g., 4-bit GGUF).
  • Manages model versions and prompts via a simple CLI/API (see the sketch after this list).
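
The same CLI/API surface is easy to script. As a minimal sketch, the snippet below asks Ollama's local REST API (default port 11434) which models are installed and how large they are; the /api/tags endpoint and its fields follow the public Ollama API docs, so treat them as assumptions if your version differs.

import requests

# List the models Ollama has pulled locally (assumes the default port 11434)
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()
for model in resp.json().get("models", []):
    size_gb = model["size"] / 1e9  # "size" is reported in bytes
    print(f'{model["name"]}: {size_gb:.1f} GB')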

Tool 2: OpenWebUI (The Interface)

OpenWebUI is a self-hosted UI that integrates with Ollama, offering:

  • Chat playgrounds for qualitative testing.
  • Model performance metrics (latency, token throughput).
  • Side-by-side comparison of multiple models.

Advanced Tip: Automate testing by scripting the underlying Ollama REST API (the same endpoint OpenWebUI connects to):

import requests

# "stream": False returns one JSON object instead of a stream of chunks
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain quantum computing", "stream": False},
)
print(response.json()["response"])
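
Building on this, a small loop turns the same endpoint into a rough side-by-side comparison. This is a minimal sketch: it assumes the listed models have already been pulled with Ollama, and that the non-streaming response includes the eval_count and eval_duration fields described in the Ollama API docs; the prompt and model list are placeholders.

import requests

MODELS = ["llama3", "mistral", "phi3"]  # assumes these have been pulled locally
PROMPT = "Summarize the trade-offs of 4-bit quantization for LLMs."

for model in MODELS:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
    )
    data = r.json()
    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{model}: {tokens_per_sec:.1f} tokens/sec")
    print(data["response"][:200], "\n")

Numbers gathered this way feed straight into the decision matrix at the end of this post.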

Step 2: Quantitative Evaluation with Modern Techniques

1. Automated Benchmarking with LM Evaluation Harness

The EleutherAI LM Evaluation Harness provides standardized benchmarks (e.g., HellaSwag, MMLU) to measure LLM capabilities:

  1. Install:
git clone https://github.com/EleutherAI/lm-evaluation-harness
pip install -e lm-evaluation-harness
  2. Run Tests:
lm_eval --model hf \
   --model_args pretrained=meta-llama/Meta-Llama-3-8B-Instruct \
   --tasks hellaswag,gsm8k \
   --device cpu

Custom Tasks: Add domain-specific datasets (e.g., legal contracts, medical Q&A) by defining new tasks under lm_eval/tasks.
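
If you would rather drive the harness from Python (handy while iterating on a custom task), recent versions expose a programmatic entry point, lm_eval.simple_evaluate. The sketch below is written against that interface and should be treated as an assumption if your installed version differs; the limit argument keeps the run short for a quick smoke test.

import lm_eval

# Quick smoke test: a handful of HellaSwag examples on CPU
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3-8B-Instruct",
    tasks=["hellaswag"],
    device="cpu",
    limit=50,  # only 50 documents per task, for fast iteration
)
print(results["results"]["hellaswag"])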

2. Quantization & Hardware Optimization

  • llama.cpp: Run 4-bit quantized models on CPUs. Ideal for older hardware (see the sketch after this list).
  • MLX (Apple Silicon): Leverage Apple’s MLX framework for GPU-accelerated inference on M1/M2/M3 chips.
  • TensorRT-LLM: Optimize models for NVIDIA GPUs with kernel fusion and FP8 quantization.
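
As a concrete example of the llama.cpp route, the sketch below loads a 4-bit GGUF model entirely on CPU through the llama-cpp-python bindings (pip install llama-cpp-python). The model path is a placeholder for whatever GGUF file you have downloaded, and the constructor values are common defaults to tune for your hardware, not prescriptions.

from llama_cpp import Llama

# Load a 4-bit quantized GGUF model on CPU (the path is a placeholder)
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=2048,   # context window
    n_threads=8,  # CPU threads to use
)

out = llm("Q: Explain 4-bit quantization in one sentence. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"].strip())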

Step 3: Advanced Strategies for Real-World Use Cases

1. A/B Testing with LiteLLM

LiteLLM provides a unified API abstraction over 100+ LLM providers. Use it to route the same request to local and cloud models, which makes A/B testing straightforward:

from litellm import completion  
response = completion(  
    model="ollama/llama3",  # Local model  
    messages=[{"content": "Write a Python function for quicksort", "role": "user"}]  
)
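
To make the A/B idea concrete, the sketch below sends the same prompt to a local Ollama model and a cloud model through the same completion call. The cloud model name is only an example and requires the corresponding API key in your environment; swap in whichever providers you actually want to compare.

from litellm import completion

PROMPT = [{"role": "user", "content": "Write a Python function for quicksort"}]

# Same interface for a local Ollama model and a cloud model
candidates = ["ollama/llama3", "gpt-4o-mini"]  # cloud name is an example; needs OPENAI_API_KEY

for model in candidates:
    resp = completion(model=model, messages=PROMPT)
    print(f"--- {model} ---")
    print(resp.choices[0].message.content[:300])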

2. Evaluate Embeddings with MTEB

For retrieval-augmented generation (RAG) use cases, evaluate embedding models with the MTEB benchmark (the same tasks behind the public MTEB Leaderboard):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("local-model-name")  # placeholder for your local embedding model
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")

The Decision Matrix: Choosing Your LLM

| Criteria | Tool/Metric | Example |
| --- | --- | --- |
| Speed | Tokens/sec (OpenWebUI) | Mistral-7B: 45 t/s vs. Llama3: 32 t/s |
| Accuracy | MMLU Score (Eval Harness) | Phi-3-mini: 69% vs. Gemma-7B: 64% |
| Memory Footprint | Model Size (4-bit quantized) | Llama3-8B: 4.2GB vs. TinyLlama: 1.1GB |
| Domain Fit | Custom Task Accuracy | Meditron-7B: 88% on medical QA |

Conclusion: Build Your Own Evaluation Pipeline

Local testing isn’t just a cost-saving measure—it’s a competitive advantage. By combining Ollama for deployment, OpenWebUI for qualitative analysis, and automated evaluation harnesses for quantitative metrics, teams can cut through the LLM noise in hours, not weeks.

  1. Start with Ollama + OpenWebUI for rapid prototyping.
  2. Integrate evaluation harnesses for critical tasks.
  3. Optimize inference with quantization and hardware-specific frameworks.

The best LLM isn’t the one with the most hype—it’s the one that aligns with your latency, accuracy, and infrastructure constraints. Test locally, deploy confidently.
