How to Set Up Ollama as Private AI Coding Assistant

Running AI-powered code assistance locally gives you complete control over your data. When working with proprietary algorithms, internal APIs, or regulated codebases, sending context to cloud-based AI services may violate compliance requirements or security policies. Ollama provides a solution by running large language models directly on your machine, enabling intelligent code completion and assistance without any data leaving your local environment.

This guide walks through setting up Ollama as a private coding assistant, configuring it with appropriate models, and integrating it into your development workflow.

Installing Ollama

Ollama supports macOS, Linux, and Windows. The installation process is straightforward across all platforms.

macOS and Linux

Download and install Ollama using the official installation script:

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation by checking the version:

ollama --version

Windows

Windows users can install Ollama through the Microsoft Store or by downloading the installer from the official website. After installation, restart your terminal to ensure the ollama command is available.

Selecting and Downloading Models

Ollama supports various models optimized for different tasks. For coding assistance, several models perform well:

CodeLlama: Designed specifically for code generation and understanding
DeepSeek-Coder: Strong performance on code completion and debugging
Qwen2.5-Coder: Balanced capabilities for coding tasks

Pull a model using the ollama pull command:

ollama pull codellama
ollama pull deepseek-coder

Check available models:

ollama list

Models download once and remain cached locally. The storage requirement varies by model—CodeLlama typically needs 3-8GB depending on the specific variant.

Model Comparison for Sensitive Codebases

Choosing the right model matters for both quality and performance. Here is how the leading local coding models compare on the metrics that matter most for private deployments:

Model	Size	RAM Required	Code Quality	Context Window	Best Use Case
CodeLlama 7B	3.8 GB	8 GB	Good	16K tokens	General completion
CodeLlama 13B	7.3 GB	16 GB	Very good	16K tokens	Complex refactoring
DeepSeek-Coder 6.7B	3.8 GB	8 GB	Very good	16K tokens	Multi-language support
Qwen2.5-Coder 7B	4.1 GB	8 GB	Excellent	32K tokens	Long-file edits
Qwen2.5-Coder 14B	8.2 GB	16 GB	Excellent	128K tokens	Large codebase tasks

For most sensitive-codebase scenarios, Qwen2.5-Coder 7B delivers the best quality-to-resource ratio. Its 32K token context window lets you paste entire files without truncation, which is critical when you cannot send code to a cloud API and need the model to see the full picture.

Running Ollama as a Local Server

By default, Ollama runs as a local server on port 11434. Start the server:

ollama serve

Keep this terminal window open, or run it in the background:

ollama serve &

The server listens on http://localhost:11434. All API calls remain local—no external network requests occur during model inference.

Integrating with Code Editors

Several editors support integration with local LLMs through Ollama.

VS Code

Install the Continue extension or Ollama extension from the VS Code marketplace. Configure the extension to connect to your local Ollama instance:

{
  "continue.enableTabAutocomplete": true,
  "continue.llm": {
    "provider": "ollama",
    "model": "codellama"
  }
}

Neovim

For Neovim users, the ollama.nvim plugin provides integration:

require("ollama").setup({
  host = "http://localhost:11434",
  model = "codellama",
  })

Zed Editor

Zed includes native Ollama support. Open settings and enable the Ollama integration:

{
  "ai": {
    "provider": "ollama",
    "model": "codellama"
  }
}

Using Ollama via Command Line

Beyond editor integration, you can interact with Ollama directly through the command line for quick queries:

ollama run codellama "Explain this function: function processData(input) { return input.map(x => x * 2); }"

For more complex interactions, use the REST API:

curl http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Write a Python function to calculate Fibonacci numbers",
  "stream": false
}'

Security Considerations for Sensitive Codebases

Running AI assistance locally provides inherent security benefits, but several practices strengthen your setup:

Network Isolation

Configure your firewall to block outbound connections from the Ollama port. On Linux with UFW:

sudo ufw deny out on lo to 127.0.0.1 port 11434

This ensures Ollama cannot make unexpected network calls.

Model Verification

Review model provenance before use. Ollama models come from Hugging Face and can be verified:

ollama show codellama

This displays the model’s metadata, including SHA256 checksums.

Memory Considerations

LLMs load entirely into RAM during use. Ensure your machine has sufficient memory—16GB is a practical minimum for coding models, with 32GB recommended for larger models or simultaneous use with other applications.

Data Handling

Even with local inference, be mindful of what you paste into prompts. Avoid pasting API keys, credentials, or proprietary algorithms directly into AI queries. Use environment variables or configuration files for sensitive data.

Performance Optimization

Local inference depends heavily on your hardware. Optimize performance with these approaches:

GPU Acceleration

Ollama automatically uses CUDA on NVIDIA GPUs or Metal on Apple Silicon. Verify GPU usage:

ollama list

Model Selection

Smaller models (7B parameters) provide faster responses with less memory. Larger models (13B-34B) offer better reasoning but require more resources. Test different models to find your optimal balance.

Context Window

Limit context window size for faster responses:

curl http://localhost:11434/api/generate -d '{
  "model": "codellama",
  "prompt": "Your prompt here",
  "options": {
    "num_ctx": 2048
  }
}'

Full Setup Walkthrough: Secure Team Environment

This workflow covers setting up Ollama for a small engineering team where all code must stay on-premises—common in fintech, defense contracting, and healthcare software shops.

Step 1: Install on a dedicated inference server. Instead of each developer running their own model, provision a shared Linux server with a GPU. Install Ollama and expose it on an internal network address:

OLLAMA_HOST=0.0.0.0:11434 ollama serve

Restrict access with your corporate firewall to the VPN subnet only. This way, engineers with lower-spec laptops get GPU-accelerated inference without each maintaining their own model cache.

Step 2: Pull the model once, share via NFS. Ollama stores models in ~/.ollama/models. Mount this directory from a NAS share on each machine so model downloads happen once across the team:

# On each developer workstation
sudo mount -t nfs nas.internal:/ollama-models /home/user/.ollama/models
ollama pull qwen2.5-coder:7b  # downloads to NAS, shared by all

Step 3: Configure Continue.dev in VS Code for each developer. The Continue extension is the most flexible client for team setups. Create a shared ~/.continue/config.json that engineers copy:

{
  "models": [
    {
      "title": "Qwen2.5-Coder (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://gpu-server.internal:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b",
    "apiBase": "http://gpu-server.internal:11434"
  }
}

Step 4: Add a Modelfile for your codebase conventions. Ollama supports custom system prompts via Modelfiles. Create /etc/ollama/Modelfile.internal:

FROM qwen2.5-coder:7b

SYSTEM """
You are a coding assistant for internal tooling built on Python 3.12 and FastAPI.
Always use Pydantic v2 models. Prefer async/await patterns. Never suggest
importing third-party packages not in requirements.txt. Do not include
explanatory prose unless explicitly requested.
"""

Build and register the model: ollama create internal-coder -f /etc/ollama/Modelfile.internal

Step 5: Audit inference logs for compliance. Ollama logs prompts and responses to stdout by default. Redirect to a secured log file and rotate daily:

OLLAMA_HOST=0.0.0.0:11434 ollama serve >> /var/log/ollama/inference.log 2>&1

For regulated environments, store these logs for the retention period required by your compliance framework (SOC 2, HIPAA, etc.).

Practical Example: Code Review Workflow

Here’s how a typical code review session works with local Ollama:

Start Ollama in the background: ollama serve &
Paste a function into the CLI or editor extension
Request analysis: “Identify potential bugs and suggest improvements”
Review suggestions locally

The entire interaction happens offline. No code leaves your machine.

FAQ

Q: How does Ollama compare to running llama.cpp directly for coding tasks?

Ollama wraps llama.cpp and adds a REST API, model management, and automatic quantization selection. For coding use cases, Ollama is strictly easier to operate. Use llama.cpp directly only if you need fine-grained quantization control (e.g., GGUF Q4_K_M vs Q5_K_S) or want to integrate into a custom inference pipeline.

Q: Can I use Ollama with JetBrains IDEs like IntelliJ or PyCharm?

Yes. Install the AI Assistant plugin (it supports custom endpoints) or use the Grazie plugin configured for local inference. Alternatively, the Continue plugin for JetBrains supports Ollama directly and mirrors the VS Code configuration.

Q: What happens to model performance when running on CPU only?

Inference on CPU is 10-30x slower than GPU. A 7B model generating a 200-token response takes about 2-3 minutes on a modern laptop CPU versus 5-10 seconds on an M2 Pro or NVIDIA RTX 3080. For interactive autocomplete, CPU-only inference is too slow to be practical. Consider cloud-hosted private inference (Replicate, Modal, or AWS Bedrock with private VPC endpoints) if GPU hardware is unavailable.

Q: Is there a way to prevent Ollama from phoning home for model metadata?

Set OLLAMA_NOPRUNE=1 and OLLAMA_SKIP_UPDATE=1 environment variables. For fully airgapped environments, download GGUF model files directly from Hugging Face on a connected machine, copy them to the isolated server, and import with ollama create mymodel -f ./Modelfile pointing to the local GGUF path.

Built by theluckystrike — More at zovo.one