AI Tools Compared

Open source AI coding assistants have evolved significantly, now offering the ability to learn from your private codebase patterns. These tools can analyze your existing code to provide context-aware suggestions that understand your project’s architecture, coding conventions, and unique patterns. This guide explores practical implementations for developers who want to keep their code private while leveraging AI assistance.

Table of Contents

Why Local Codebase Analysis Matters

When you work with AI assistants that can learn from your codebase, you get suggestions that actually fit your project. Instead of generic recommendations, the AI understands your specific patterns—how you name variables, structure your modules, handle errors, and organize imports. This leads to more relevant completions and fewer manual corrections.

The privacy benefit is equally important. Your proprietary code never leaves your infrastructure when you use local, open source solutions. This makes these tools suitable for enterprise environments with strict data policies or developers working on sensitive projects.

Setting Up Codebase-Aware AI Assistance

Several open source tools now support codebase indexing and pattern learning. Here’s how to implement this with popular options.

Using Continue With Local Models

Continue is an open source copilot that works with local models and supports codebase indexing:

// ~/.continue/config.json
{
  "models": [
    {
      "model": "llama3",
      "provider": "ollama",
      "contextWindow": 128000
    }
  ],
  "embeddings": [
    {
      "provider": "sentence-transformers",
      "model": "all-MiniLM-L6-v2"
    }
  ],
  "indexes": [
    {
      "name": "my-project",
      "repo": "/path/to/your/project",
      "ignore": ["node_modules/", "*.log", "__pycache__/"]
    }
  ]
}

After configuring the index, the assistant can retrieve relevant code context from your project when you ask questions or request code generation.

CodeQwen With Local Indexing

CodeQwen provides strong code understanding capabilities and can be paired with local embedding models:

# Install codeqwen with local embedding support
pip install codeqwen[local]

# Index your codebase
codeqwen index ./your-project \
  --embeddings all-MiniLM-L6-v2 \
  --output .codeqwen-index

The indexing process analyzes your code structure, function signatures, class hierarchies, and comment patterns. Subsequent queries use this index to provide context-aware responses.

Implementing Custom Pattern Learning

For projects with unique requirements, you can build custom pattern learning into your AI workflow.

Extracting Code Patterns Programmatically

import ast
from pathlib import Path
from collections import defaultdict

class CodePatternExtractor:
    def __init__(self, project_path: str):
        self.project_path = Path(project_path)
        self.patterns = defaultdict(list)

    def extract_function_patterns(self) -> dict:
        """Extract common function signatures and patterns."""
        for py_file in self.project_path.rglob("*.py"):
            try:
                tree = ast.parse(py_file.read_text())
                for node in ast.walk(tree):
                    if isinstance(node, ast.FunctionDef):
                        pattern = {
                            'name': node.name,
                            'args': len(node.args.args),
                            'has_return': any(isinstance(n, ast.Return) for n in ast.walk(node)),
                            'has_docstring': ast.get_docstring(node) is not None,
                            'decorators': [d.id for d in node.decorator_list if isinstance(d, ast.Name)]
                        }
                        self.patterns['functions'].append(pattern)
            except:
                continue
        return dict(self.patterns)

    def get_naming_conventions(self) -> dict:
        """Analyze variable and function naming patterns."""
        conventions = {'snake_case': 0, 'camelCase': 0, 'PascalCase': 0}
        # Pattern analysis logic here
        return conventions

# Usage
extractor = CodePatternExtractor("/path/to/your/codebase")
patterns = extractor.extract_function_patterns()
print(f"Found {len(patterns['functions'])} function patterns")

This pattern data can be fed to your AI assistant as context, helping it generate code that matches your project’s style.

Integration With LM Studio

LM Studio enables running local LLMs with context from your codebase:

// Use LM Studio's context injection
const context = await generateContext({
  projectFiles: await scanProject('./src'),
  patterns: extractedPatterns,
  maxTokens: 32000
});

const response = await ollama.chat({
  model: 'codellama',
  messages: [
    { role: 'system', content: context },
    { role: 'user', content: 'Generate a new service module following our patterns' }
  ]
});

Practical Applications

API Client Generation

When your codebase has established API client patterns, an AI that understands these patterns can generate new clients that integrate :

# Your existing pattern
class UserAPIClient:
    def __init__(self, base_url: str, auth_token: str):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({'Authorization': f'Bearer {auth_token}'})

    def get_user(self, user_id: int) -> dict:
        response = self.session.get(f"{self.base_url}/users/{user_id}")
        response.raise_for_status()
        return response.json()

# AI can now generate similar clients for other resources
# following your exact pattern

Test File Creation

AI assistants that learned your testing patterns can generate test files matching your existing test structure, naming conventions, and assertion styles.

Best Practices for Pattern Learning

Keep your pattern indexes updated by re-indexing after significant code changes. Use .gitignore patterns to exclude generated files, dependencies, and sensitive configuration from analysis. For highly sensitive projects, consider running everything in an air-gapped environment.

When selecting embedding models, balance accuracy against resource usage. Smaller models like all-MiniLM-L6-v2 work well for most projects while requiring less memory.

Building a RAG Pipeline for Deep Codebase Understanding

Retrieval-Augmented Generation (RAG) is the foundational technique behind all codebase-aware AI assistants. Understanding how it works lets you build custom pipelines optimized for your specific codebase structure rather than relying entirely on a general-purpose tool’s indexing decisions.

The core pipeline has four stages: chunking your codebase into meaningful units, embedding those units into vector representations, storing embeddings in a searchable database, and retrieving the most relevant chunks at query time. Here is a minimal implementation using ChromaDB and sentence-transformers:

import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import ast

# Initialize embedding model and vector store
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path=".codebase-index")
collection = client.get_or_create_collection("codebase")

def chunk_python_file(file_path: str) -> list:
    """Split a Python file into function-level chunks for better retrieval."""
    source = Path(file_path).read_text()
    chunks = []
    try:
        tree = ast.parse(source)
        lines = source.splitlines()
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                start = node.lineno - 1
                end = node.end_lineno
                chunk_source = "\n".join(lines[start:end])
                chunks.append({
                    "id": f"{file_path}:{node.name}:{start}",
                    "text": chunk_source,
                    "metadata": {
                        "file": file_path,
                        "name": node.name,
                        "type": type(node).__name__,
                        "line": node.lineno
                    }
                })
    except SyntaxError:
        # Fall back to file-level chunking for non-parseable files
        chunks.append({
            "id": file_path,
            "text": source[:4000],  # Truncate very large files
            "metadata": {"file": file_path, "type": "file"}
        })
    return chunks

def index_codebase(project_path: str):
    """Index all Python files in a project."""
    all_chunks = []
    for fp in Path(project_path).rglob("*.py"):
        if any(p in str(fp) for p in ["__pycache__", ".venv", "node_modules"]):
            continue
        all_chunks.extend(chunk_python_file(str(fp)))

    # Batch embed and store
    batch_size = 50
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i:i + batch_size]
        embeddings = embedder.encode([c["text"] for c in batch]).tolist()
        collection.add(
            ids=[c["id"] for c in batch],
            documents=[c["text"] for c in batch],
            embeddings=embeddings,
            metadatas=[c["metadata"] for c in batch]
        )
    print(f"Indexed {len(all_chunks)} code chunks from {project_path}")

def retrieve_context(query: str, n_results: int = 5) -> str:
    """Retrieve the most relevant code chunks for a query."""
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )
    context_pieces = results["documents"][0]
    return "\n\n---\n\n".join(context_pieces)

With this pipeline, every AI query first retrieves the most semantically similar code from your project before generating a response. A question like “how do we handle database connection errors?” retrieves your actual error handling code as context, producing suggestions that match your established patterns exactly.

Keeping Codebase Indexes Current

A stale index is worse than no index because it generates confident suggestions based on outdated patterns. Build an incremental indexing strategy that updates the index as files change rather than reindexing the entire codebase on each update:

import hashlib
import json
from pathlib import Path

def get_file_hash(file_path: str) -> str:
    """Compute MD5 hash of file contents for change detection."""
    return hashlib.md5(Path(file_path).read_bytes()).hexdigest()

def incremental_update(project_path: str, hash_store: str = ".index-hashes.json"):
    """Update only changed files in the codebase index."""
    hashes = {}
    if Path(hash_store).exists():
        hashes = json.loads(Path(hash_store).read_text())

    updated = []
    deleted = []

    for fp in Path(project_path).rglob("*.py"):
        path_str = str(fp)
        current_hash = get_file_hash(path_str)
        if hashes.get(path_str) != current_hash:
            # File is new or changed — re-index it
            chunks = chunk_python_file(path_str)
            # Delete old chunks for this file from vector store
            old_ids = [c["id"] for c in collection.get(
                where={"file": path_str}
            )["ids"]]
            if old_ids:
                collection.delete(ids=old_ids)
            # Insert updated chunks
            if chunks:
                embeddings = embedder.encode([c["text"] for c in chunks]).tolist()
                collection.add(
                    ids=[c["id"] for c in chunks],
                    documents=[c["text"] for c in chunks],
                    embeddings=embeddings,
                    metadatas=[c["metadata"] for c in chunks]
                )
            hashes[path_str] = current_hash
            updated.append(path_str)

    Path(hash_store).write_text(json.dumps(hashes))
    print(f"Updated {len(updated)} files in index")
    return updated

Run this incremental update as a pre-commit hook or integrate it into your file watcher setup. The index stays current without the full reindex cost, keeping suggestions accurate even during rapid development cycles.

Built by theluckystrike — More at zovo.one