How to Monitor the Dark Web for Data Breaches
When a company’s database is breached, the stolen data ends up on dark web markets, paste sites, and breach-trading forums within hours to weeks. Your credentials may be circulating for months before the company notifies you — or they never do. Monitoring these sources lets you react before attackers do.
Understanding Breach Data
A typical breach dump contains one or more of:
- Email + password hash: The most common — email and bcrypt/MD5/SHA1 hash of password
- Email + plaintext password: Worst case — plaintext if the site stored them poorly
- Email + personal data: Name, address, phone, date of birth, SSN
- API tokens and secrets: From developer credential breaches
- Session tokens: Short-lived but exploitable immediately after breach
Breach data flows from:
- Initial breach → hacker’s private possession
- Sold on dark web markets (days to months later)
- Shared on breach forums
- Indexed by aggregators (HIBP, DeHashed, etc.)
Have I Been Pwned (HIBP)
HIBP is the most reliable public breach index, operated by Troy Hunt. It indexes breaches and lets you check without exposing your full password.
# Quick check via web
# https://haveibeenpwned.com — enter email
# API access (free with rate limiting, or paid for higher limits)
# Check if email is in any breach:
curl -s "https://haveibeenpwned.com/api/v3/breachedaccount/you@example.com" \
-H "hibp-api-key: YOUR_API_KEY" \
-H "User-Agent: YourMonitoringScript/1.0" | python3 -m json.tool
Check Password Hashes (k-Anonymity)
HIBP’s password check uses k-anonymity — you send only the first 5 characters of a SHA-1 hash, so HIBP never sees your full password:
#!/usr/bin/env python3
"""Check if a password appears in HIBP breach database."""
import hashlib
import requests
import sys
def check_password_hibp(password: str) -> int:
"""Returns the number of times this password appears in breaches."""
sha1 = hashlib.sha1(password.encode()).hexdigest().upper()
prefix = sha1[:5]
suffix = sha1[5:]
response = requests.get(
f"https://api.pwnedpasswords.com/range/{prefix}",
headers={"User-Agent": "PasswordChecker/1.0"}
)
response.raise_for_status()
# Look for our suffix in the response
for line in response.text.splitlines():
hash_suffix, count = line.split(":")
if hash_suffix == suffix:
return int(count)
return 0
if __name__ == "__main__":
password = sys.argv[1] if len(sys.argv) > 1 else input("Password to check: ")
count = check_password_hibp(password)
if count > 0:
print(f"FOUND: This password appears {count:,} times in breach data.")
print("Change it immediately everywhere it's used.")
else:
print("Not found in known breach data.")
Automated Email Monitoring
#!/usr/bin/env python3
"""Monitor multiple email addresses for new breach appearances."""
import requests
import time
import json
from pathlib import Path
API_KEY = "your-hibp-api-key"
EMAILS = [
"primary@example.com",
"backup@example.com",
"work@example.com"
]
STATE_FILE = Path("/home/user/.hibp-state.json")
def get_breaches(email):
"""Get list of breach names for an email."""
url = f"https://haveibeenpwned.com/api/v3/breachedaccount/{email}"
headers = {
"hibp-api-key": API_KEY,
"User-Agent": "PersonalBreachMonitor/1.0"
}
resp = requests.get(url, headers=headers)
if resp.status_code == 404:
return []
resp.raise_for_status()
return [b["Name"] for b in resp.json()]
def load_state():
if STATE_FILE.exists():
return json.loads(STATE_FILE.read_text())
return {}
def save_state(state):
STATE_FILE.write_text(json.dumps(state, indent=2))
STATE_FILE.chmod(0o600)
def check_all():
state = load_state()
new_breaches = {}
for email in EMAILS:
current = set(get_breaches(email))
known = set(state.get(email, []))
new = current - known
if new:
new_breaches[email] = list(new)
print(f"NEW BREACH for {email}: {', '.join(new)}")
state[email] = list(current)
time.sleep(1.5) # Respect rate limit
save_state(state)
if new_breaches:
# Send notification (customize for your notification method)
print("\nACTION REQUIRED:")
for email, breaches in new_breaches.items():
print(f" {email}: change password on sites related to {', '.join(breaches)}")
else:
print("No new breaches found.")
if __name__ == "__main__":
check_all()
DeHashed (Paid, More Detail)
DeHashed indexes breach data including plaintext passwords, IP addresses, names, and phone numbers — more detailed than HIBP but costs money.
# API access requires subscription (from $5/month)
curl -s "https://api.dehashed.com/search?query=email%3Ayou%40example.com" \
-H "Authorization: Basic $(echo -n 'youremail:your-api-key' | base64)" \
-H "Content-Type: application/json" | python3 -m json.tool
DeHashed response includes:
- Source breach name
- Email, username, password (hash or plaintext if available)
- IP address associated with account
- Name and phone if in the breach
Self-Hosted Breach Monitoring
For maximum control, download HIBP’s password database locally:
# Download HIBP password hash list (8GB+ compressed)
# Get the torrent from haveibeenpwned.com/Passwords
# This is SHA-1 hashes of 847 million+ compromised passwords
# After download, search locally (no API calls):
# Sort by SHA-1 prefix for binary search efficiency
# Using ripgrep for fast searching:
sha1=$(echo -n "password123" | sha1sum | awk '{print toupper($1)}')
echo "Looking for: $sha1"
rg "^${sha1:5}" pwned-passwords-sha1-ordered-by-hash-v8.txt
# Much faster than per-query API and completely offline
Monitor Paste Sites
Attackers often dump credentials on Pastebin and similar sites before selling on dark web markets:
#!/usr/bin/env python3
"""Monitor Pastebin for references to your domain or email."""
import requests
import re
import time
def check_pastebin_for_domain(domain: str):
"""Search Google cache of Pastebin for domain references."""
# Note: Pastebin's own API requires a subscription for paste search
# This uses a public monitoring approach via HIBP's paste API:
email = f"admin@{domain}" # adjust to your email
url = f"https://haveibeenpwned.com/api/v3/pasteaccount/{email}"
headers = {
"hibp-api-key": "your-api-key",
"User-Agent": "PasteMonitor/1.0"
}
resp = requests.get(url, headers=headers)
if resp.status_code == 404:
print(f"No pastes found for {email}")
return []
pastes = resp.json()
for paste in pastes:
print(f"Found in paste: {paste.get('Source')} - {paste.get('Id')}")
print(f" Posted: {paste.get('Date')}")
print(f" Email count: {paste.get('EmailCount')}")
return pastes
check_pastebin_for_domain("yourdomain.com")
What to Do When You Find Your Data
The priority order matters — act quickly:
1. Identify which password was breached
- Check the breach name → when it happened → which service
- Most passwords are old — you may have already changed them
2. Change the breached password immediately
- Every site where you used that password (password reuse is the real risk)
- Use a password manager to generate unique passwords
3. Enable 2FA everywhere possible
- Even if attacker has your password, 2FA blocks them
- Priority: email, banking, work accounts
4. Check for account activity
- Log into the affected service → check login history
- Look for: unknown locations, unknown devices, recent changes
5. Watch for phishing
- Attackers with your email know your username
- Be skeptical of any email asking you to click/login after a breach
6. If sensitive data exposed (SSN, address, DOB):
- Consider credit freeze (free in US): equifax.com, experian.com, transunion.com
- Monitor credit report for new accounts
Automated Monitoring Setup
# Add to crontab to run weekly
# crontab -e
0 9 * * 1 /usr/local/bin/check-hibp.py >> /var/log/hibp-monitor.log 2>&1
# Runs every Monday at 9am, appends to log
Related Reading
- Anonymous Email Over Tor Setup Guide
- Best Encrypted Messaging for Journalists
- How to Audit Your Cloud Storage Privacy
Built by theluckystrike — More at zovo.one
Frequently Asked Questions
How long does it take to monitor the dark web for data breaches?
For a straightforward setup, expect 30 minutes to 2 hours depending on your familiarity with the tools involved. Complex configurations with custom requirements may take longer. Having your credentials and environment ready before starting saves significant time.
What are the most common mistakes to avoid?
The most frequent issues are skipping prerequisite steps, using outdated package versions, and not reading error messages carefully. Follow the steps in order, verify each one works before moving on, and check the official documentation if something behaves unexpectedly.
Do I need prior experience to follow this guide?
Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.
Can I adapt this for a different tech stack?
Yes, the underlying concepts transfer to other stacks, though the specific implementation details will differ. Look for equivalent libraries and patterns in your target stack. The architecture and workflow design remain similar even when the syntax changes.
Where can I get help if I run into issues?
Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.