Open source AI coding assistants have evolved significantly, now offering the ability to learn from your private codebase patterns. These tools can analyze your existing code to provide context-aware suggestions that understand your project’s architecture, coding conventions, and unique patterns. This guide explores practical implementations for developers who want to keep their code private while leveraging AI assistance.
Table of Contents
- Why Local Codebase Analysis Matters
- Setting Up Codebase-Aware AI Assistance
- Implementing Custom Pattern Learning
- Practical Applications
- Best Practices for Pattern Learning
- Building a RAG Pipeline for Deep Codebase Understanding
- Keeping Codebase Indexes Current
Why Local Codebase Analysis Matters
When you work with AI assistants that can learn from your codebase, you get suggestions that actually fit your project. Instead of generic recommendations, the AI understands your specific patterns—how you name variables, structure your modules, handle errors, and organize imports. This leads to more relevant completions and fewer manual corrections.
The privacy benefit is equally important. Your proprietary code never leaves your infrastructure when you use local, open source solutions. This makes these tools suitable for enterprise environments with strict data policies or developers working on sensitive projects.
Setting Up Codebase-Aware AI Assistance
Several open source tools now support codebase indexing and pattern learning. Here’s how to implement this with popular options.
Using Continue With Local Models
Continue is an open source copilot that works with local models and supports codebase indexing:
```json
// ~/.continue/config.json
{
  "models": [
    {
      "model": "llama3",
      "provider": "ollama",
      "contextWindow": 128000
    }
  ],
  "embeddings": [
    {
      "provider": "sentence-transformers",
      "model": "all-MiniLM-L6-v2"
    }
  ],
  "indexes": [
    {
      "name": "my-project",
      "repo": "/path/to/your/project",
      "ignore": ["node_modules/", "*.log", "__pycache__/"]
    }
  ]
}
```
After configuring the index, the assistant can retrieve relevant code context from your project when you ask questions or request code generation.
CodeQwen With Local Indexing
CodeQwen provides strong code understanding capabilities and can be paired with local embedding models:
```shell
# Install codeqwen with local embedding support
pip install codeqwen[local]

# Index your codebase
codeqwen index ./your-project \
  --embeddings all-MiniLM-L6-v2 \
  --output .codeqwen-index
```
The indexing process analyzes your code structure, function signatures, class hierarchies, and comment patterns. Subsequent queries use this index to provide context-aware responses.
Implementing Custom Pattern Learning
For projects with unique requirements, you can build custom pattern learning into your AI workflow.
Extracting Code Patterns Programmatically
```python
import ast
import re
from pathlib import Path
from collections import defaultdict

class CodePatternExtractor:
    def __init__(self, project_path: str):
        self.project_path = Path(project_path)
        self.patterns = defaultdict(list)

    def extract_function_patterns(self) -> dict:
        """Extract common function signatures and patterns."""
        for py_file in self.project_path.rglob("*.py"):
            try:
                tree = ast.parse(py_file.read_text())
            except (SyntaxError, UnicodeDecodeError):
                # Skip files that cannot be parsed or decoded
                continue
            for node in ast.walk(tree):
                if isinstance(node, ast.FunctionDef):
                    pattern = {
                        'name': node.name,
                        'args': len(node.args.args),
                        'has_return': any(isinstance(n, ast.Return) for n in ast.walk(node)),
                        'has_docstring': ast.get_docstring(node) is not None,
                        'decorators': [d.id for d in node.decorator_list if isinstance(d, ast.Name)]
                    }
                    self.patterns['functions'].append(pattern)
        return dict(self.patterns)

    def get_naming_conventions(self) -> dict:
        """Classify collected function names by naming style."""
        conventions = {'snake_case': 0, 'camelCase': 0, 'PascalCase': 0}
        for pattern in self.patterns['functions']:
            name = pattern['name']
            if re.fullmatch(r'[a-z][a-z0-9_]*', name):
                conventions['snake_case'] += 1
            elif re.fullmatch(r'[a-z]+(?:[A-Z][a-z0-9]*)+', name):
                conventions['camelCase'] += 1
            elif re.fullmatch(r'(?:[A-Z][a-z0-9]*)+', name):
                conventions['PascalCase'] += 1
        return conventions

# Usage
extractor = CodePatternExtractor("/path/to/your/codebase")
patterns = extractor.extract_function_patterns()
print(f"Found {len(patterns.get('functions', []))} function patterns")
```
This pattern data can be fed to your AI assistant as context, helping it generate code that matches your project’s style.
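As a minimal sketch of that step, assuming the `CodePatternExtractor` output above, the pattern data can be serialized into a short system-prompt preamble:

```python
from collections import Counter

def patterns_to_prompt(patterns: dict) -> str:
    """Summarize extracted function patterns into a short, LLM-friendly
    context string. `patterns` is the dict produced by
    CodePatternExtractor.extract_function_patterns()."""
    functions = patterns.get("functions", [])
    if not functions:
        return "No function patterns extracted."
    docstring_rate = sum(f["has_docstring"] for f in functions) / len(functions)
    common_decorators = Counter(
        d for f in functions for d in f["decorators"]
    ).most_common(3)
    lines = [
        f"This project defines {len(functions)} functions.",
        f"{docstring_rate:.0%} of functions carry docstrings; match that habit.",
    ]
    if common_decorators:
        names = ", ".join(f"@{name}" for name, _ in common_decorators)
        lines.append(f"Frequently used decorators: {names}.")
    return "\n".join(lines)

# Hand-made sample input standing in for real extractor output
prompt = patterns_to_prompt({
    "functions": [
        {"name": "get_user", "args": 2, "has_return": True,
         "has_docstring": True, "decorators": ["cached"]},
        {"name": "save_user", "args": 2, "has_return": False,
         "has_docstring": False, "decorators": []},
    ]
})
print(prompt)
```

Passing a compact summary like this as the system message costs far fewer tokens than injecting raw source files.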
Integration With LM Studio
LM Studio exposes an OpenAI-compatible local server (default `http://localhost:1234/v1`), so any OpenAI client library can query a local model with your codebase context. A sketch, where `buildContext` and `extractedPatterns` are placeholders for your own context assembly:

```javascript
import OpenAI from "openai";

// LM Studio ignores the API key, but the SDK requires one
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio",
});

// Hypothetical helper: serialize your extracted patterns into text
const context = buildContext(extractedPatterns);

const response = await client.chat.completions.create({
  model: "codellama", // whichever model is loaded in LM Studio
  messages: [
    { role: "system", content: `Follow these project patterns:\n${context}` },
    { role: "user", content: "Generate a new service module following our patterns" },
  ],
});
```
Practical Applications
API Client Generation
When your codebase has established API client patterns, an AI that understands these patterns can generate new clients that integrate seamlessly with your existing code:
```python
import requests

# Your existing pattern
class UserAPIClient:
    def __init__(self, base_url: str, auth_token: str):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({'Authorization': f'Bearer {auth_token}'})

    def get_user(self, user_id: int) -> dict:
        response = self.session.get(f"{self.base_url}/users/{user_id}")
        response.raise_for_status()
        return response.json()

# AI can now generate similar clients for other resources
# following your exact pattern
```
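For instance, given a prompt like "create a product client", a pattern-aware assistant might produce something close to the following. The resource name and endpoints are hypothetical, purely to illustrate the mirrored structure:

```python
import requests

class ProductAPIClient:
    """Mirrors UserAPIClient: same constructor signature,
    same session-based auth, same error handling."""
    def __init__(self, base_url: str, auth_token: str):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({'Authorization': f'Bearer {auth_token}'})

    def get_product(self, product_id: int) -> dict:
        response = self.session.get(f"{self.base_url}/products/{product_id}")
        response.raise_for_status()
        return response.json()

    def list_products(self, page: int = 1) -> dict:
        response = self.session.get(f"{self.base_url}/products",
                                    params={"page": page})
        response.raise_for_status()
        return response.json()

# No network call happens at construction time
client = ProductAPIClient("https://api.example.com", "test-token")
print(client.session.headers["Authorization"])
```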
Test File Creation
AI assistants that have learned your testing patterns can generate test files matching your existing test structure, naming conventions, and assertion styles.
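As an illustration, with a hypothetical `slugify` helper standing in for your project code, an assistant that has seen your conventions might emit tests like these, matching a `test_<function>_<behavior>` naming pattern and Arrange-Act-Assert style:

```python
import re

def slugify(title: str) -> str:
    """Project helper: lowercase, strip punctuation, hyphenate."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

# Generated tests, mirroring the project's naming and assertion style
def test_slugify_collapses_punctuation_and_spaces():
    result = slugify("Hello, World!  Again")
    assert result == "hello-world-again"

def test_slugify_handles_empty_string():
    assert slugify("") == ""

# pytest would discover these automatically; run directly here
test_slugify_collapses_punctuation_and_spaces()
test_slugify_handles_empty_string()
```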
Best Practices for Pattern Learning
Keep your pattern indexes updated by re-indexing after significant code changes. Use .gitignore patterns to exclude generated files, dependencies, and sensitive configuration from analysis. For highly sensitive projects, consider running everything in an air-gapped environment.
When selecting embedding models, balance accuracy against resource usage. Smaller models like all-MiniLM-L6-v2 work well for most projects while requiring less memory.
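Whichever embedding model you pick, retrieval quality ultimately comes down to cosine similarity between the query vector and each chunk vector. A stdlib-only toy sketch of that ranking step, with hand-made 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hand-made vectors standing in for embedding-model output
chunks = {
    "db_error_handler": [0.9, 0.1, 0.0],
    "http_router":      [0.1, 0.9, 0.1],
    "retry_wrapper":    [0.6, 0.4, 0.2],
}
query = [0.85, 0.15, 0.05]  # e.g. "how do we handle database errors?"

# Rank chunks by similarity to the query, most relevant first
ranked = sorted(chunks, key=lambda k: cosine_similarity(query, chunks[k]),
                reverse=True)
print(ranked[0])
```

Real embedding vectors have hundreds of dimensions, but the ranking logic is exactly this.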
Building a RAG Pipeline for Deep Codebase Understanding
Retrieval-Augmented Generation (RAG) is the foundational technique behind all codebase-aware AI assistants. Understanding how it works lets you build custom pipelines optimized for your specific codebase structure rather than relying entirely on a general-purpose tool’s indexing decisions.
The core pipeline has four stages: chunking your codebase into meaningful units, embedding those units into vector representations, storing embeddings in a searchable database, and retrieving the most relevant chunks at query time. Here is a minimal implementation using ChromaDB and sentence-transformers:
```python
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import ast

# Initialize embedding model and vector store
embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path=".codebase-index")
collection = client.get_or_create_collection("codebase")

def chunk_python_file(file_path: str) -> list:
    """Split a Python file into function-level chunks for better retrieval."""
    source = Path(file_path).read_text()
    chunks = []
    try:
        tree = ast.parse(source)
        lines = source.splitlines()
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                start = node.lineno - 1
                end = node.end_lineno
                chunk_source = "\n".join(lines[start:end])
                chunks.append({
                    "id": f"{file_path}:{node.name}:{start}",
                    "text": chunk_source,
                    "metadata": {
                        "file": file_path,
                        "name": node.name,
                        "type": type(node).__name__,
                        "line": node.lineno
                    }
                })
    except SyntaxError:
        # Fall back to file-level chunking for non-parseable files
        chunks.append({
            "id": file_path,
            "text": source[:4000],  # Truncate very large files
            "metadata": {"file": file_path, "type": "file"}
        })
    return chunks

def index_codebase(project_path: str):
    """Index all Python files in a project."""
    all_chunks = []
    for fp in Path(project_path).rglob("*.py"):
        if any(p in str(fp) for p in ["__pycache__", ".venv", "node_modules"]):
            continue
        all_chunks.extend(chunk_python_file(str(fp)))

    # Batch embed and store
    batch_size = 50
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i:i + batch_size]
        embeddings = embedder.encode([c["text"] for c in batch]).tolist()
        collection.add(
            ids=[c["id"] for c in batch],
            documents=[c["text"] for c in batch],
            embeddings=embeddings,
            metadatas=[c["metadata"] for c in batch]
        )
    print(f"Indexed {len(all_chunks)} code chunks from {project_path}")

def retrieve_context(query: str, n_results: int = 5) -> str:
    """Retrieve the most relevant code chunks for a query."""
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=n_results
    )
    context_pieces = results["documents"][0]
    return "\n\n---\n\n".join(context_pieces)
```
With this pipeline, every AI query first retrieves the most semantically similar code from your project before generating a response. A question like “how do we handle database connection errors?” retrieves your actual error handling code as context, producing suggestions that match your established patterns exactly.
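Tying it together, here is a sketch of the final prompt assembly. The `retrieved_context` argument stands in for the output of a retrieval step like the one above, and the actual chat call to your local model is omitted:

```python
def build_rag_messages(question: str, retrieved_context: str) -> list:
    """Assemble a chat payload that grounds the model in retrieved code."""
    system = (
        "You are a coding assistant for this repository. "
        "Base your answers on the following project code:\n\n"
        f"{retrieved_context}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

# Hand-made context standing in for real retrieval output
messages = build_rag_messages(
    "How do we handle database connection errors?",
    "def connect(dsn):\n    ...\n    raise DatabaseError(dsn)",
)
print(messages[0]["content"][:60])
```

The same message list works unchanged against Ollama, LM Studio, or any other OpenAI-compatible local endpoint.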
Keeping Codebase Indexes Current
A stale index is worse than no index because it generates confident suggestions based on outdated patterns. Build an incremental indexing strategy that updates the index as files change rather than reindexing the entire codebase on each update:
```python
import hashlib
import json
from pathlib import Path

# Reuses `collection`, `embedder`, and `chunk_python_file`
# from the RAG pipeline above.

def get_file_hash(file_path: str) -> str:
    """Compute MD5 hash of file contents for change detection."""
    return hashlib.md5(Path(file_path).read_bytes()).hexdigest()

def incremental_update(project_path: str, hash_store: str = ".index-hashes.json"):
    """Update only changed files in the codebase index."""
    hashes = {}
    if Path(hash_store).exists():
        hashes = json.loads(Path(hash_store).read_text())

    updated = []
    seen = set()
    for fp in Path(project_path).rglob("*.py"):
        path_str = str(fp)
        seen.add(path_str)
        current_hash = get_file_hash(path_str)
        if hashes.get(path_str) != current_hash:
            # File is new or changed -- re-index it
            chunks = chunk_python_file(path_str)
            # Delete old chunks for this file from the vector store
            old_ids = collection.get(where={"file": path_str})["ids"]
            if old_ids:
                collection.delete(ids=old_ids)
            # Insert updated chunks
            if chunks:
                embeddings = embedder.encode([c["text"] for c in chunks]).tolist()
                collection.add(
                    ids=[c["id"] for c in chunks],
                    documents=[c["text"] for c in chunks],
                    embeddings=embeddings,
                    metadatas=[c["metadata"] for c in chunks]
                )
            hashes[path_str] = current_hash
            updated.append(path_str)

    # Remove chunks for files deleted since the last run
    for stale in set(hashes) - seen:
        old_ids = collection.get(where={"file": stale})["ids"]
        if old_ids:
            collection.delete(ids=old_ids)
        del hashes[stale]

    Path(hash_store).write_text(json.dumps(hashes))
    print(f"Updated {len(updated)} files in index")
    return updated
```
Run this incremental update as a pre-commit hook or integrate it into your file watcher setup. The index stays current without the full reindex cost, keeping suggestions accurate even during rapid development cycles.
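As a sketch, a local pre-commit hook wiring this up might look like the following, assuming the incremental updater above is saved as a hypothetical `scripts/update_index.py`:

```yaml
# .pre-commit-config.yaml -- hypothetical local hook
repos:
  - repo: local
    hooks:
      - id: update-codebase-index
        name: Update codebase index
        entry: python scripts/update_index.py
        language: system
        pass_filenames: false
        always_run: true
```

`language: system` runs the script with whatever Python is on your PATH, and `always_run: true` triggers it on every commit regardless of which files are staged; the hash store keeps the actual work proportional to what changed.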
Related Articles
- How to Use AI Assistants for Codebase
- Fine Tune Open Source Code Models for Your Codebase
- How to Audit What Source Code AI Coding Tools Transmit
- Effective Workflow for AI-Assisted Open Source Contribution
- Best AI Assistant for Drafting Open Source Partnership
Built by theluckystrike — More at zovo.one