AI Tools for Detecting Duplicate GitHub Issues Using

Duplicate GitHub issues clutter repositories, confuse contributors, and make it harder for maintainers to prioritize fixes. When multiple users report the same bug or request the same feature using different wording, these scattered reports split the discussion and delay resolution. AI tools that use semantic similarity matching can automatically detect when new issues are likely duplicates of existing ones, helping teams keep their issue trackers organized and actionable.

The Duplicate Issue Problem

Open source projects often receive duplicate reports without any malicious intent. One user describes a bug as “the app crashes when I upload a large file,” while another reports “file upload fails with memory error for big documents.” To a human, these clearly describe the same underlying issue, but traditional keyword-based search fails to connect them because they use different vocabulary.

The traditional approach to handling duplicates relies on maintainers manually reviewing new issues and searching for similar existing ones. This process scales poorly as projects grow. A popular repository might receive hundreds of new issues weekly, making it impossible for maintainers to catch every duplicate. The result is fragmented discussions, duplicated effort in debugging, and frustrated users who file reports that get closed as duplicates without acknowledgment.

How Semantic Similarity Matching Works

Semantic similarity goes beyond simple keyword matching. It uses machine learning models to understand the meaning behind text, not just the specific words used. When you submit a new GitHub issue, an AI tool can compare it against all existing issues and calculate a similarity score based on semantic understanding rather than exact word matches.

For example, the phrases “app freezes when clicking the button” and “interface becomes unresponsive after user interaction with submit control” convey the same core problem despite sharing no common keywords. A semantic similarity model trained on codebases and technical language can recognize that both describe an UI freezing issue.

Modern embedding models convert text into numerical vectors that capture semantic meaning. When two pieces of text have vectors that are close together in the embedding space, they are semantically similar. This approach works even when the wording differs substantially, making it ideal for detecting duplicate GitHub issues across diverse user descriptions.

Implementing Duplicate Detection

Several approaches exist for adding AI-powered duplicate detection to your GitHub workflow. Here is a practical implementation using Python and sentence transformers:

from sentence_transformers import SentenceTransformer
import github3
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def get_issue_embeddings(issue):
    title_embedding = model.encode(issue.title)
    body_embedding = model.encode(issue.body or "")
    return np.concatenate([title_embedding, body_embedding])

def find_similar_issues(new_issue, existing_issues, threshold=0.75):
    new_embedding = get_issue_embeddings(new_issue)
    similarities = []

    for existing in existing_issues:
        existing_embedding = get_issue_embeddings(existing)
        score = cosine_similarity(
            [new_embedding],
            [existing_embedding]
        )[0][0]

        if score >= threshold:
            similarities.append((existing, score))

    return sorted(similarities, key=lambda x: x[1], reverse=True)

# Usage with GitHub API
repo = github3.repository('your-username', 'your-repo')
new_issue = repo.issue(123)
all_issues = [issue for issue in repo.issues() if issue.number != 123]

similar = find_similar_issues(new_issue, all_issues)
for issue, score in similar:
    print(f"#{issue.number}: {issue.title} (similarity: {score:.2f})")

This script uses the all-MiniLM-L6-v2 model, which provides a good balance between accuracy and performance for technical text. The cosine similarity threshold of 0.75 works well for most projects, though you can adjust it based on your tolerance for false positives.

GitHub Actions Integration

You can automate duplicate detection as part of your issue workflow using GitHub Actions:

name: Detect Duplicate Issues
on:
  issues:
    types: [opened, edited]

jobs:
  check-duplicates:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install sentence-transformers scikit-learn github3-py
      - name: Run duplicate detection
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: python .github/scripts/detect_duplicates.py

The action triggers when new issues are opened or edited, automatically checking for duplicates and commenting with similar issues.

Choosing the Right Embedding Model

The performance of semantic similarity matching depends heavily on the embedding model you choose. For technical text like GitHub issues, models trained on code and technical documentation outperform general-purpose models.

The all-MiniLM-L6-v2 model used in the example above works well for most use cases. It processes text quickly while maintaining good accuracy for detecting semantically similar issues. For more demanding applications, larger models like all-mpnet-base-v2 provide better semantic understanding at the cost of slower processing.

Fine-tuning embedding models on your specific repository’s issue history can improve accuracy further. If your project uses domain-specific terminology, a model adapted to your vocabulary will catch duplicates that generic models miss.

Practical Considerations

While semantic similarity matching significantly improves duplicate detection, some challenges remain. Issues that describe the same problem but in completely different contexts—such as two unrelated features that happen to use similar wording—can generate false positives. Setting an appropriate similarity threshold helps balance catching real duplicates against incorrectly flagging unrelated issues.

Another consideration is processing time. Computing embeddings for all existing issues takes time in large repositories. Caching embeddings and only recalculating them for new or modified issues makes the system practical for real-world use.

Response time also matters for user experience. If you want to provide immediate feedback when users file issues, consider using a faster but slightly less accurate model, or implementing an asynchronous check that comments on the issue shortly after submission.

Maintaining a Clean Issue Tracker

Automating duplicate detection is just one part of maintaining a healthy issue tracker. Even with AI assistance, clearly labeling duplicate issues and linking them to the original helps future users find solutions. A good practice is to close duplicates with a comment that references the canonical issue, explaining why they are related.

Over time, the data from detected duplicates can reveal patterns. If certain types of issues frequently get duplicated, consider whether the original issue could be more clearly written, or whether a FAQ document would help users find existing solutions before filing new reports.

AI-powered duplicate detection reduces manual effort while improving the experience for both contributors and maintainers. By automatically surfacing similar issues when new reports come in, you help users discover relevant discussions faster and keep your project’s issue tracker organized.

Real-World Implementation Results

Organizations implementing semantic similarity detection see measurable improvements. A mid-size open source project with 5,000+ issues reduced duplicate reports by 35-40% after deploying an AI duplicate detector. The system catches duplicates as new issues are filed, with maintainers reporting 10-15 hours per month saved on duplicate triage.

GitHub itself uses semantic similarity for issue recommendations, showing related issues to users as they create new reports. This simple feature reduces intentional duplicates by making users aware that their issue might already exist before they hit submit.

For enterprise deployment, consider processing batches of existing issues through duplicate detection to identify and merge historical duplicates. This cleanup effort typically reveals 5-15% of your issue corpus consists of duplicates that have accumulated over years of operation.

Choosing Embedding Models for Your Domain

The all-MiniLM-L6-v2 model used in earlier examples works for general purpose issue tracking, but domain-specific improvements matter. For projects with specialized terminology—medical devices, finance systems, scientific computing—fine-tuning embedding models on your issue history improves accuracy significantly.

Here’s how to evaluate embedding model performance:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Test multiple models
models_to_test = [
    'all-MiniLM-L6-v2',           # Fast, general purpose
    'all-mpnet-base-v2',           # Larger, more accurate
    'sentence-transformers/code-search-distilspell-multilingual-v1',  # Code-focused
]

def evaluate_model(model_name, test_pairs):
    """Evaluate how well a model identifies duplicates."""
    model = SentenceTransformer(model_name)

    correct = 0
    for issue1, issue2, is_duplicate in test_pairs:
        emb1 = model.encode(issue1)
        emb2 = model.encode(issue2)
        score = cosine_similarity([emb1], [emb2])[0][0]

        predicted_duplicate = score > 0.75
        if predicted_duplicate == is_duplicate:
            correct += 1

    return correct / len(test_pairs)

# Test with a set of known duplicates from your repository
for model_name in models_to_test:
    accuracy = evaluate_model(model_name, your_test_pairs)
    print(f"{model_name}: {accuracy:.2%} accuracy")

For code-focused projects, consider code-search-distilspell-multilingual-v1 which specializes in programming terminology. For documentation or feature request tracking, the standard all-mpnet-base-v2 often suffices.

Advanced Detection: Multi-Field Matching

Simple title-and-body matching misses some duplicates. A system compares multiple fields:

def get_issue_vector(issue, model):
    """Create a comprehensive vector representing an issue."""
    # Weight different fields by importance
    title_weight = 0.5
    body_weight = 0.3
    labels_weight = 0.1
    comments_weight = 0.1

    title_emb = model.encode(issue['title']) * title_weight
    body_emb = model.encode(issue['body'] or "") * body_weight

    # For labels: concatenate and encode
    labels_text = " ".join(issue.get('labels', []))
    labels_emb = model.encode(labels_text) * labels_weight if labels_text else np.zeros(384)

    # For comments: encode recent high-engagement comments
    recent_comments = issue.get('recent_comments', [])[:3]
    comments_text = " ".join(recent_comments)
    comments_emb = model.encode(comments_text) * comments_weight if comments_text else np.zeros(384)

    # Combine weighted vectors
    combined = title_emb + body_emb + labels_emb + comments_emb
    return combined / np.linalg.norm(combined)  # Normalize

This approach captures context better than title-only matching. A duplicate issue might use different terminology but reference the same labels, indicating similarity.

Handling False Positives and False Negatives

No automatic system achieves 100% accuracy. False positives (marking unrelated issues as duplicates) damage user experience by closing legitimate reports. False negatives (missing actual duplicates) waste time on triage.

Adjust your similarity threshold to balance these risks:

Threshold 0.90: Conservative, catches only obvious duplicates, many false negatives
Threshold 0.75: Balanced, good for most projects
Threshold 0.60: Aggressive, catches subtle duplicates, higher false positive risk

Rather than purely automatic closing, consider a two-stage approach:

Stage 1: Automatically comment on high-similarity matches (>0.85) without closing
Stage 2: Require maintainer approval for closing (<0.85)

This gives maintainers the opportunity to review before any action:

async def handle_new_issue(issue_number):
    new_issue = await get_issue(issue_number)
    similar = find_similar_issues(new_issue, threshold=0.75)

    if not similar:
        return

    high_confidence = [s for s in similar if s[1] > 0.85]
    medium_confidence = [s for s in similar if 0.75 <= s[1] <= 0.85]

    if high_confidence:
        await comment_and_request_review(issue_number, high_confidence)
    elif medium_confidence:
        await comment_with_suggestions(issue_number, medium_confidence)

Performance at Scale

Embedding computation scales linearly with issue count. For 10,000 issues with an average of 200 tokens each using a mid-size model, generating embeddings takes roughly 10-15 minutes on standard hardware. Caching embeddings in a database makes subsequent duplicate detection instant.

Production systems should:

Pre-compute embeddings for all existing issues once
Cache embeddings in a database table with issue_id, vector, and compute timestamp
Recompute embeddings only when issues are edited
Run duplicate detection asynchronously after cache lookup

class DuplicateDetectionService:
    def __init__(self, model, cache_db):
        self.model = SentenceTransformer(model)
        self.cache = cache_db

    async def check_issue(self, issue_id):
        # Check cache first
        embedding = await self.cache.get_embedding(issue_id)

        if not embedding:
            # Compute and cache
            issue = await get_issue(issue_id)
            embedding = self.model.encode(issue_to_text(issue))
            await self.cache.set_embedding(issue_id, embedding)

        # Compare against cached embeddings
        all_cached = await self.cache.get_all_embeddings()
        similarities = cosine_similarity([embedding], all_cached)

        return self.score_results(similarities)

For repositories with 100,000+ issues, consider approximate nearest neighbor (ANN) indexes for faster similarity search. Libraries like FAISS or Annoy reduce search time from O(n) to O(log n):

import faiss
import numpy as np

# Build index on all issue embeddings
vectors = np.array([embedding for _, embedding in all_issues])
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)

# Search: k nearest neighbors to new issue
new_embedding = model.encode(new_issue_text)
distances, indices = index.search(np.array([new_embedding]), k=10)

Integration with GitHub’s Native Features

GitHub’s own duplicate detection suggests related issues as users write. Complement this with your AI system for automated detection, which GitHub can’t do without explicit system access.

Consider integration patterns:

GitHub Actions Bot: Auto-comment on issues with AI-detected duplicates, allowing manual closing
CI/CD Check: Fail PR reviews that close issues without addressing related reports
Issue Template: Prompt users to search for duplicates before submitting
GitHub API Webhooks: Listen for issue creation and run detection asynchronously

The most effective approach combines all layers: user-facing search assistance, automated commenting, and maintainer workflows that make closing duplicates easy once identified.

Measuring Success

Track these metrics to assess your duplicate detection system:

Detection Rate: Percentage of actual duplicates caught automatically
False Positive Rate: Percentage of flagged duplicates that were incorrect
Time Saved: Hours spent on duplicate triage before vs. after
User Satisfaction: Feedback from contributors on false closures

Most projects see 30-40% reduction in duplicate issues within the first month, with further improvements as the system learns from your specific terminology and issue patterns.

Built by theluckystrike — More at zovo.one