AI Tools Compared

Running code generation locally keeps your source code off external servers, eliminates per-token costs, and gives you offline capability. In 2026, the best local models have closed much of the gap with cloud APIs for everyday coding tasks.

Models Worth Running Locally in 2026

| Model | Parameters | VRAM (Q4_K_M) | HumanEval | Notes |
|---|---|---|---|---|
| DeepSeek Coder V3 | 7B | ~5GB | 78% | Best quality/size tradeoff |
| Qwen2.5-Coder 7B | 7B | ~5GB | 79% | Strong for Python/JS/TS |
| Qwen2.5-Coder 14B | 14B | ~10GB | 85% | Near-cloud quality |
| CodeLlama 13B | 13B | ~9GB | 62% | Older baseline, still usable |
| Starcoder2 15B | 15B | ~10GB | 65% | Good for less common languages |
| WizardCoder 33B | 33B | ~22GB | 79% | Quality boost at high VRAM cost |

HumanEval scores measure Python code generation on 164 problems. They’re directionally useful but don’t capture performance on real codebases.
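To see why small score differences deserve caution, it helps to convert percentages back into problem counts. The sketch below uses only the 164-problem figure and the scores from the table; the helper name `problems_solved` is made up for illustration.

```python
HUMANEVAL_PROBLEMS = 164  # benchmark size stated above


def problems_solved(score_percent: float) -> int:
    """Approximate number of problems passed for a given HumanEval percentage."""
    return round(score_percent / 100 * HUMANEVAL_PROBLEMS)


# Scores copied from the comparison table
for name, score in [
    ("DeepSeek Coder V3", 78),
    ("Qwen2.5-Coder 7B", 79),
    ("Qwen2.5-Coder 14B", 85),
]:
    print(f"{name}: ~{problems_solved(score)} of {HUMANEVAL_PROBLEMS} problems")
```

One percentage point corresponds to fewer than two problems, so a 78% versus 79% gap is too small to separate two models on its own.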

Hardware Requirements

Consumer GPU (gaming tier):

CPU-only inference: 7B models at ~2-5 tokens/second — usable for batch tasks, frustrating for interactive use.

Setting Up Ollama

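# Install Ollama (macOS, via Homebrew) and start the server
brew install ollama
ollama serve  # keeps running; use a second terminal for the commands below

# Download the models
ollama pull deepseek-coder-v2:7b
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5-coder:14b

# Quick smoke test from the command line
ollama run deepseek-coder-v2:7b "Write a Python function to parse JSON with error handling"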

The Ollama API mirrors the OpenAI API format:

from openai import OpenAI

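# Point the OpenAI client at the local Ollama server.
# The client requires an api_key value, but Ollama ignores it.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)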

def generate_code(prompt: str, model: str = "qwen2.5-coder:7b") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are an expert software engineer. Generate clean, correct code.",
            },
            {"role": "user", "content": prompt},
        ],
        temperature=0.1,
        max_tokens=2048,
    )
    # content can be None, so fall back to an empty string to honor the return type
    return response.choices[0].message.content or ""

models = ["deepseek-coder-v2:7b", "qwen2.5-coder:7b"]
prompt = "Write a TypeScript function that debounces async calls with cancellation support"

for model in models:
    print(f"\n--- {model} ---")
    print(generate_code(prompt, model))
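Chat-tuned models commonly wrap generated code in Markdown fences with some prose around it, so scripts that consume the reply often need to strip that wrapper. The following is a minimal sketch under that assumption; `extract_code` is a hypothetical helper and not part of the OpenAI or Ollama clients.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly

# Matches a fenced block: opening fence, optional language tag, body, closing fence
CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> str:
    """Return the fenced code from a model reply, or the whole reply if unfenced."""
    blocks = CODE_BLOCK.findall(reply)
    if not blocks:
        return reply.strip()
    return "\n\n".join(block.strip() for block in blocks)
```

A typical use would be `extract_code(generate_code(prompt))` before writing the result to a file.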

Model Deep Dives

Qwen2.5-Coder 7B and 14B

Alibaba’s Qwen2.5-Coder models are the current best-in-class for their size tier. The 14B model approaches GPT-4o mini quality on many tasks.

Quality comparison on implementing an LRU cache:

# Qwen2.5-Coder 14B output — correct and idiomatic:
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.cache: OrderedDict[int, int] = OrderedDict()
        self.capacity = capacity

    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key: int, value: int) -> None:
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

The 14B model uses OrderedDict with move_to_end (correct, efficient). The 7B model’s output was correct but used a manual doubly-linked list implementation — verbose but functional.
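For a sense of what the manual approach involves, here is a sketch of an LRU cache built on a dictionary plus a doubly-linked list. It is written for illustration and is not the 7B model's actual output; the names `_Node` and `LinkedListLRUCache` are made up here.

```python
class _Node:
    """One cache entry in the doubly-linked list."""

    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key: int = 0, value: int = 0):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LinkedListLRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.nodes = {}  # key -> _Node
        # Sentinels: head.next is the least recently used, tail.prev the most recent
        self.head = _Node()
        self.tail = _Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node: _Node) -> None:
        node.prev.next = node.next
        node.next.prev = node.prev

    def _append(self, node: _Node) -> None:
        """Insert node just before the tail sentinel (most recently used slot)."""
        last = self.tail.prev
        last.next = node
        node.prev = last
        node.next = self.tail
        self.tail.prev = node

    def get(self, key: int) -> int:
        node = self.nodes.get(key)
        if node is None:
            return -1
        self._unlink(node)
        self._append(node)
        return node.value

    def put(self, key: int, value: int) -> None:
        node = self.nodes.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
            self._append(node)
            return
        node = _Node(key, value)
        self.nodes[key] = node
        self._append(node)
        if len(self.nodes) > self.capacity:
            oldest = self.head.next  # least recently used entry
            self._unlink(oldest)
            del self.nodes[oldest.key]
```

Both versions give O(1) `get` and `put`; the OrderedDict version simply delegates the list bookkeeping to the standard library.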

Integrating with VS Code via Continue.dev

Continue.dev is an open-source VS Code extension that works with any LLM:

{
  "models": [
    {
      "title": "Qwen2.5-Coder 14B",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b",
      "contextLength": 8192
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder 7B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}

With this configuration, tab autocomplete uses the 7B model (fast enough for interactive use) and chat completions use the 14B model (better quality). No external API calls are made.

Model Comparison: DeepSeek Coder V3 vs Qwen2.5-Coder

DeepSeek Coder V3 (7B)

DeepSeek is maintained by a Chinese team and is aggressively optimized for code generation. The 7B model often outperforms larger open models.

// Prompt: Write a TypeScript function that debounces async functions
// with cancellation support

// DeepSeek output:
function debounceAsync<T extends (...args: any[]) => Promise<any>>(
    fn: T,
    delayMs: number
): (...args: Parameters<T>) => Promise<Awaited<ReturnType<T>> | null> {
    let timeout: NodeJS.Timeout | null = null;
    let lastAbortController: AbortController | null = null;

    return async (...args: Parameters<T>) => {
        if (timeout) clearTimeout(timeout);
        if (lastAbortController) lastAbortController.abort();

        lastAbortController = new AbortController();
        const controller = lastAbortController;

        return new Promise((resolve, reject) => {
            timeout = setTimeout(async () => {
                try {
                    const result = await fn(...args);
                    resolve(result);
                } catch (error) {
                    if (!controller.signal.aborted) reject(error);
                }
            }, delayMs);
        });
    };
}

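The type inference is sound and an AbortController tracks superseded calls. Two caveats are worth checking before use: the signal is never passed to fn, so work already in flight is not actually cancelled, and a call superseded before its timer fires returns a promise that never settles.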

Qwen2.5-Coder 14B

Alibaba’s Qwen is tuned for breadth across languages. The 14B model excels at multi-language projects.

The same prompt produces a similar implementation, but sometimes with extra options:

// Qwen output also includes an optional onAbort callback
function debounceAsync<T extends (...args: any[]) => Promise<any>>(
    fn: T,
    delayMs: number,
    options: { onAbort?: () => void } = {}
) {
    let timeout: NodeJS.Timeout | null = null;
    let controller: AbortController | null = null;

    return async (...args: Parameters<T>) => {
        if (timeout) clearTimeout(timeout);
        if (controller?.signal.aborted === false) {
            controller.abort();
            options.onAbort?.();
        }

        controller = new AbortController();
        // ... rest of implementation
    };
}

Qwen’s version includes lifecycle hooks (onAbort callback). For production code, this is helpful. For quick prototyping, extra features can be noise.

Running Quantized Models

The Q4_K_M quantization (4-bit) reduces model size by roughly 75% relative to 16-bit weights, with minimal quality loss. Practical VRAM:

# Check your GPU memory
nvidia-smi

# M2/M3 Mac: unified memory is shared CPU/GPU
# 16GB unified: run 7B models comfortably, 13B slowly

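# Download and run a 14B model on 16GB total RAM.
# The default 14b tag is already a 4-bit build; check the model's page
# in the Ollama library for the exact names of other quantization tags.
ollama pull qwen2.5-coder:14b
ollama run qwen2.5-coder:14b "Write a Python decorator for retry logic"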
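The 75% figure follows from simple arithmetic on bits per weight. The sketch below estimates the size of the weights alone, assuming a nominal 4 bits for the quantized model and 16 bits for the unquantized one. Q4_K_M is a mixed scheme that averages somewhat more than 4 bits per weight, and inference also needs memory for the context cache, which is why the VRAM figures in the table are higher than these estimates. Both function names are illustrative.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def size_reduction(from_bits: float, to_bits: float) -> float:
    """Fraction of size saved when going from from_bits to to_bits per weight."""
    return 1 - to_bits / from_bits


for size in (7, 14, 33):
    fp16 = weight_size_gb(size, 16)
    q4 = weight_size_gb(size, 4)
    print(f"{size}B: 16-bit ~{fp16:.1f} GB, 4-bit ~{q4:.1f} GB")

print(f"Reduction: {size_reduction(16, 4):.0%}")
```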

Batch Code Generation with Local Models

For generating many code snippets, local models enable batch processing without per-token API costs:

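import subprocess

def generate_code_locally(prompts: list[str], model: str = "qwen2.5-coder:7b") -> list[str]:
    """Run each prompt through `ollama run`; a failed prompt yields an empty string."""
    results = []
    for prompt in prompts:
        try:
            response = subprocess.run(
                ["ollama", "run", model, prompt],
                capture_output=True,
                text=True,
                timeout=120,
                check=True,  # raise if ollama exits with an error
            )
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            # Keep the batch going and keep results aligned with prompts
            print(f"Skipping prompt after failure: {exc}")
            results.append("")
            continue
        results.append(response.stdout.strip())
    return results

# Generate 100 function stubs
prompts = [
    f"Write a Python function stub for {fn_name}. Add docstring with args/returns."
    for fn_name in ["validate_email", "parse_json", "fetch_data", ...]  # 100 items
]

generated = generate_code_locally(prompts)
# Cost: $0 in API fees. Time: ~5 minutes on RTX 4090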

Fine-Tuning Local Models

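Some teams fine-tune local models on their codebase for better domain-specific suggestions. Ollama does not train weights itself. What a Modelfile can do is bake a system prompt and default parameters into a named model, which is a lighter form of customization:

# Write a Modelfile, then build a named model from it
cat > Modelfile <<'EOF'
FROM qwen2.5-coder:7b

PARAMETER num_ctx 8192
PARAMETER temperature 0.1

# System prompt baked into the model
SYSTEM "Generate code matching this style: [samples of your code]"
EOF
ollama create custom-coder -f Modelfile

ollama run custom-coder "Write a validator for user profiles"

Actual fine-tuning happens in a separate training framework, and the resulting adapter can then be loaded through the Modelfile's ADAPTER instruction. It requires collecting representative code samples from your repo (500-1000 samples), which takes time but pays off for teams with very specific patterns.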
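Collecting those samples can be automated. The following is a rough sketch for a Python repository that pulls every function definition into a JSONL file using the standard `ast` module. The function names, the record fields (`file`, `name`, `code`), and the JSONL layout are assumptions for illustration; the format a given training tool expects may differ.

```python
import ast
import json
from pathlib import Path


def collect_function_samples(repo_root: str, limit: int = 1000) -> list[dict]:
    """Extract up to `limit` function definitions from the .py files under repo_root."""
    samples = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            tree = ast.parse(source)
        except (SyntaxError, ValueError):
            continue  # skip files that do not decode or parse
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                code = ast.get_source_segment(source, node)
                if code:
                    samples.append(
                        {"file": str(path), "name": node.name, "code": code}
                    )
                if len(samples) >= limit:
                    return samples
    return samples


def write_jsonl(samples: list[dict], out_path: str) -> None:
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as fh:
        for sample in samples:
            fh.write(json.dumps(sample) + "\n")
```

Calling `write_jsonl(collect_function_samples("path/to/repo"), "samples.jsonl")` would produce the dataset; the repo path here is a placeholder.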

Offline Setup for Secure Environments

Local models are required in air-gapped environments (government, defense contractors, financial institutions).

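# One-time setup in a connected environment
ollama pull qwen2.5-coder:7b  # Downloads 4.7GB

# Ollama has no export/import subcommand, so archive its model store instead.
# ~/.ollama/models is the default location for a per-user install;
# a system service install may keep models elsewhere.
tar -cf model.tar -C ~/.ollama models

# Transfer model.tar to the air-gapped system
# On the air-gapped system:
mkdir -p ~/.ollama
tar -xf model.tar -C ~/.ollama
ollama serve  # Runs locally only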

Now all code generation happens on your hardware without internet connectivity.
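When a multi-gigabyte archive crosses an air gap on removable media, it is worth confirming that it arrived intact before loading it. A minimal sketch using only the standard library follows; the helper name `sha256_of_file` is illustrative and `model.tar` is the archive from the steps above.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    import sys

    # Usage: python checksum.py model.tar
    print(sha256_of_file(sys.argv[1]))
```

Run it on both sides of the transfer and compare the two hex strings.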

Latency Reality Check

| Model | GPU | Tokens/sec | Time for 200-token response |
|---|---|---|---|
| Qwen2.5-Coder 7B | RTX 4090 | ~80 tok/s | ~2.5 seconds |
| Qwen2.5-Coder 14B | RTX 4090 | ~45 tok/s | ~4.5 seconds |
| Qwen2.5-Coder 7B | M3 Max (36GB) | ~30 tok/s | ~7 seconds |
| Qwen2.5-Coder 14B | M3 Max (36GB) | ~15 tok/s | ~13 seconds |

For autocomplete (10-20 tokens), even the slower speeds feel interactive. For chat with longer responses, the 14B model's delay on a Mac is noticeable but not painful.
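The response times in the table are just token count divided by generation speed, which makes it easy to estimate other cases. This sketch reuses the throughput figures from the table and ignores prompt-processing time before the first token, so real latency will be somewhat higher. The function name is illustrative.

```python
def response_seconds(tokens: int, tokens_per_second: float) -> float:
    """Generation time only; excludes prompt processing before the first token."""
    return tokens / tokens_per_second


# Throughput figures from the table above
throughput = {
    ("Qwen2.5-Coder 7B", "RTX 4090"): 80,
    ("Qwen2.5-Coder 14B", "RTX 4090"): 45,
    ("Qwen2.5-Coder 7B", "M3 Max"): 30,
    ("Qwen2.5-Coder 14B", "M3 Max"): 15,
}

for (model, gpu), speed in throughput.items():
    autocomplete = response_seconds(20, speed)
    chat = response_seconds(200, speed)
    print(f"{model} on {gpu}: 20 tokens ~{autocomplete:.1f}s, 200 tokens ~{chat:.1f}s")
```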

Configure Model Context Size

cat > ~/.ollama/Modelfile_qwen << 'EOF'
FROM qwen2.5-coder:7b
PARAMETER num_ctx 8192
PARAMETER temperature 0.1
SYSTEM "You are an expert software engineer. Generate production-quality code."
EOF
ollama create qwen-code-custom -f ~/.ollama/Modelfile_qwen
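The context size can also be set per request instead of being baked into a custom model. Ollama's native `/api/generate` endpoint accepts an `options` object with the same parameter names used in a Modelfile. The sketch below uses only the standard library; the helper names are illustrative and the default port 11434 is assumed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes the default port


def build_generate_request(
    prompt: str, model: str = "qwen2.5-coder:7b", num_ctx: int = 8192
) -> urllib.request.Request:
    """Build a non-streaming generate request with a per-request context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "temperature": 0.1},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Larger `num_ctx` values increase memory use, so the per-request form is handy for trying sizes before committing one to a Modelfile.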

When Local Makes Sense

Use local models when:

Stick with cloud APIs when:

Built by theluckystrike — More at zovo.one