Ethical Web Scraping with Chrome Extensions: A Developer's Guide
46 min readEthical Web Scraping with Chrome Extensions: A Developer’s Guide
Web scraping remains one of the most powerful uses for Chrome extensions. Running inside the browser gives you access to fully rendered pages, authenticated sessions, and dynamic content that server-side scrapers struggle with. But with that power comes responsibility: scraping that ignores rate limits, terms of service, or user privacy damages the ecosystem for everyone.
This guide takes an ethics-first approach. Every technique is paired with guidance on when and how to use it responsibly. You will learn to extract data with content scripts, observe DOM mutations in single-page applications, export data in CSV and JSON formats, and implement rate limiting that respects the servers you interact with.
Table of Contents
- Content Script Data Extraction Patterns
- DOM Traversal Techniques
- MutationObserver for Dynamic Content
- Handling Single-Page Applications
- Rate Limiting and Polite Scraping
- Data Export Formats
- Legal and Ethical Considerations
- Responsible Resource Management
- Complete Example: Job Listing Aggregator
Content Script Data Extraction Patterns
Content scripts are the workhorse of extension-based scraping. They execute in the context of web pages and can read anything visible in the DOM. The key difference from server-side scrapers is that content scripts see the page after JavaScript has finished rendering it, which makes them ideal for modern web applications.
Declarative Injection via Manifest
The simplest way to inject a content script is through manifest.json. The extension runtime handles injection automatically when the user navigates to a matching URL:
{
"manifest_version": 3,
"name": "Data Extractor",
"version": "1.0.0",
"content_scripts": [
{
"matches": ["https://example.com/products/*"],
"js": ["content/extractor.ts"],
"run_at": "document_idle"
}
],
"permissions": ["storage", "activeTab"]
}
The run_at: "document_idle" value ensures the script runs after the initial DOM is complete and the page is unlikely to change dramatically. This avoids race conditions where you scrape a half-loaded page.
Programmatic Injection for On-Demand Scraping
When scraping should only happen on user request, inject scripts programmatically from the background service worker:
// background.ts
chrome.action.onClicked.addListener(async (tab) => {
if (!tab.id || !tab.url) return;
// Only inject into allowed origins
const allowedOrigins = ['https://example.com', 'https://shop.example.com'];
const tabOrigin = new URL(tab.url).origin;
if (!allowedOrigins.includes(tabOrigin)) {
console.warn(`Skipping injection for disallowed origin: ${tabOrigin}`);
return;
}
const results = await chrome.scripting.executeScript({
target: { tabId: tab.id },
func: extractPageData,
});
if (results?.[0]?.result) {
await chrome.storage.local.set({
[`scrape_${Date.now()}`]: results[0].result,
});
}
});
function extractPageData(): Record<string, unknown> {
return {
title: document.title,
url: location.href,
timestamp: new Date().toISOString(),
// ... extraction logic
};
}
Programmatic injection gives you fine-grained control and pairs well with the activeTab permission, which avoids broad host permission requests that might alarm users or trigger Chrome Web Store review scrutiny.
Typed Data Models
Define TypeScript interfaces for the data you extract. This prevents accidental schema drift and makes downstream processing predictable:
// models/ScrapedProduct.ts
interface ScrapedProduct {
name: string;
price: number;
currency: string;
availability: 'in_stock' | 'out_of_stock' | 'preorder';
imageUrl: string | null;
rating: number | null;
reviewCount: number;
scrapedAt: string; // ISO 8601
sourceUrl: string;
}
interface ScrapeResult<T> {
success: boolean;
data: T[];
errors: string[];
duration: number; // milliseconds
pageUrl: string;
}
DOM Traversal Techniques
Extracting structured data from the DOM requires reliable selectors and defensive coding. Pages change layouts frequently, so your selectors need to be resilient.
Selector Strategies Ranked by Reliability
Not all CSS selectors are equally durable. Here they are ranked from most to least reliable:
- Data attributes (
[data-product-id]) - Explicitly semantic, rarely change - ARIA attributes (
[aria-label="Add to cart"]) - Accessibility-driven, stable - Semantic HTML (
article,main,nav) - Structurally meaningful - Stable class names (
.product-card,.price-display) - Often stable in component-based apps - Generated class names (
.css-1a2b3c,.sc-hKgILt) - Unstable, change on every build
Always prefer selectors higher in this list. When forced to use lower-ranked selectors, combine them with structural context to increase resilience:
// Fragile - breaks when class names regenerate
const price = document.querySelector('.css-1a2b3c')?.textContent;
// Resilient - uses data attribute with structural fallback
function extractPrice(container: Element): number | null {
// Strategy 1: data attribute
const dataPrice = container.querySelector('[data-price]');
if (dataPrice) {
return parseFloat(dataPrice.getAttribute('data-price')!);
}
// Strategy 2: itemprop microdata
const microdata = container.querySelector('[itemprop="price"]');
if (microdata) {
return parseFloat(
microdata.getAttribute('content') ?? microdata.textContent ?? ''
);
}
// Strategy 3: structural heuristic
const priceElement = container.querySelector(
'.price, [class*="price"], [class*="Price"]'
);
if (priceElement?.textContent) {
const match = priceElement.textContent.match(/[\d,]+\.?\d*/);
return match ? parseFloat(match[0].replace(/,/g, '')) : null;
}
return null;
}
Extracting Tabular Data
Tables are among the easiest structures to scrape reliably because the HTML spec defines their semantics:
function extractTable(tableSelector: string): Record<string, string>[] {
const table = document.querySelector(tableSelector);
if (!table) return [];
const headers = Array.from(table.querySelectorAll('thead th')).map(
(th) => th.textContent?.trim().toLowerCase().replace(/\s+/g, '_') ?? ''
);
return Array.from(table.querySelectorAll('tbody tr')).map((row) => {
const cells = Array.from(row.querySelectorAll('td'));
const record: Record<string, string> = {};
headers.forEach((header, i) => {
record[header] = cells[i]?.textContent?.trim() ?? '';
});
return record;
});
}
Walking the DOM Tree
Sometimes you need to traverse the DOM manually. TreeWalker is more efficient than recursive querySelectorAll for deep trees:
function extractTextNodes(root: Element): string[] {
const texts: string[] = [];
const walker = document.createTreeWalker(
root,
NodeFilter.SHOW_TEXT,
{
acceptNode(node) {
const text = node.textContent?.trim();
if (!text || text.length < 3) return NodeFilter.FILTER_REJECT;
// Skip script/style content
const parent = node.parentElement;
if (parent?.tagName === 'SCRIPT' || parent?.tagName === 'STYLE') {
return NodeFilter.FILTER_REJECT;
}
return NodeFilter.FILTER_ACCEPT;
},
}
);
while (walker.nextNode()) {
texts.push(walker.currentNode.textContent!.trim());
}
return texts;
}
MutationObserver for Dynamic Content
Modern web pages constantly update their DOM. Content loads lazily, infinite scroll adds elements, and SPAs swap entire sections. MutationObserver lets you react to these changes without polling.
Basic Observer Setup
class ContentObserver {
private observer: MutationObserver;
private extractedIds = new Set<string>();
constructor(
private containerSelector: string,
private itemSelector: string,
private onNewItems: (items: Element[]) => void
) {
this.observer = new MutationObserver(this.handleMutations.bind(this));
}
start(): void {
const container = document.querySelector(this.containerSelector);
if (!container) {
// Container not yet in DOM - wait for it
this.waitForContainer();
return;
}
// Process existing items first
this.processExistingItems(container);
this.observer.observe(container, {
childList: true,
subtree: true,
});
}
stop(): void {
this.observer.disconnect();
}
private handleMutations(mutations: MutationRecord[]): void {
const newItems: Element[] = [];
for (const mutation of mutations) {
for (const node of mutation.addedNodes) {
if (node.nodeType !== Node.ELEMENT_NODE) continue;
const element = node as Element;
// Check if the added node itself matches
if (element.matches(this.itemSelector)) {
newItems.push(element);
}
// Check children of the added node
const children = element.querySelectorAll(this.itemSelector);
newItems.push(...Array.from(children));
}
}
// Deduplicate
const unique = newItems.filter((item) => {
const id = this.getItemId(item);
if (this.extractedIds.has(id)) return false;
this.extractedIds.add(id);
return true;
});
if (unique.length > 0) {
this.onNewItems(unique);
}
}
private processExistingItems(container: Element): void {
const items = Array.from(container.querySelectorAll(this.itemSelector));
const unique = items.filter((item) => {
const id = this.getItemId(item);
if (this.extractedIds.has(id)) return false;
this.extractedIds.add(id);
return true;
});
if (unique.length > 0) {
this.onNewItems(unique);
}
}
private waitForContainer(): void {
const bodyObserver = new MutationObserver(() => {
const container = document.querySelector(this.containerSelector);
if (container) {
bodyObserver.disconnect();
this.start();
}
});
bodyObserver.observe(document.body, { childList: true, subtree: true });
}
private getItemId(item: Element): string {
return (
item.getAttribute('data-id') ??
item.id ??
item.textContent?.slice(0, 50) ??
''
);
}
}
Observing Infinite Scroll
Infinite scroll is one of the most common dynamic patterns. Combine IntersectionObserver with MutationObserver for reliable extraction:
function observeInfiniteScroll(
feedSelector: string,
itemSelector: string,
onExtract: (items: Element[]) => void,
options: { maxItems?: number; scrollDelay?: number } = {}
): () => void {
const { maxItems = Infinity, scrollDelay = 1500 } = options;
let totalExtracted = 0;
const contentObserver = new ContentObserver(
feedSelector,
itemSelector,
(items) => {
if (totalExtracted >= maxItems) return;
const batch = items.slice(0, maxItems - totalExtracted);
totalExtracted += batch.length;
onExtract(batch);
if (totalExtracted >= maxItems) {
contentObserver.stop();
}
}
);
contentObserver.start();
// Return cleanup function
return () => contentObserver.stop();
}
Handling Single-Page Applications
SPAs present unique challenges because navigation happens without full page reloads. The URL changes, content swaps, but no new document is created. Your content script stays active across “navigations,” but needs to detect route changes.
Detecting SPA Navigation
class SPANavigationDetector {
private currentUrl = location.href;
private callbacks: Array<(oldUrl: string, newUrl: string) => void> = [];
constructor() {
// Intercept History API calls
this.patchHistoryMethod('pushState');
this.patchHistoryMethod('replaceState');
// Handle back/forward
window.addEventListener('popstate', () => this.checkNavigation());
}
onNavigate(callback: (oldUrl: string, newUrl: string) => void): void {
this.callbacks.push(callback);
}
private patchHistoryMethod(method: 'pushState' | 'replaceState'): void {
const original = history[method].bind(history);
history[method] = (...args: Parameters<typeof history.pushState>) => {
original(...args);
this.checkNavigation();
};
}
private checkNavigation(): void {
const newUrl = location.href;
if (newUrl !== this.currentUrl) {
const oldUrl = this.currentUrl;
this.currentUrl = newUrl;
this.callbacks.forEach((cb) => cb(oldUrl, newUrl));
}
}
}
// Usage
const detector = new SPANavigationDetector();
detector.onNavigate((oldUrl, newUrl) => {
console.log(`SPA navigation: ${oldUrl} -> ${newUrl}`);
// Re-run extraction logic for the new "page"
waitForContent('[data-page-loaded]').then(() => extractPageData());
});
Waiting for Content to Stabilize
After a SPA navigation, the DOM often goes through several intermediate states before settling. Instead of arbitrary setTimeout calls, wait for the DOM to stabilize:
function waitForStableDOM(
targetSelector: string,
stabilityMs: number = 500,
timeoutMs: number = 10000
): Promise<Element> {
return new Promise((resolve, reject) => {
const startTime = Date.now();
let stabilityTimer: ReturnType<typeof setTimeout>;
const observer = new MutationObserver(() => {
clearTimeout(stabilityTimer);
if (Date.now() - startTime > timeoutMs) {
observer.disconnect();
reject(new Error(`Timeout waiting for ${targetSelector} to stabilize`));
return;
}
stabilityTimer = setTimeout(() => {
const target = document.querySelector(targetSelector);
if (target) {
observer.disconnect();
resolve(target);
}
}, stabilityMs);
});
// Check if already present and stable
const existing = document.querySelector(targetSelector);
if (existing) {
stabilityTimer = setTimeout(() => {
observer.disconnect();
resolve(existing);
}, stabilityMs);
}
observer.observe(document.body, { childList: true, subtree: true });
// Overall timeout
setTimeout(() => {
observer.disconnect();
clearTimeout(stabilityTimer);
const fallback = document.querySelector(targetSelector);
if (fallback) {
resolve(fallback);
} else {
reject(new Error(`Timeout: ${targetSelector} never appeared`));
}
}, timeoutMs);
});
}
Shadow DOM Traversal
Some frameworks (and all Web Components) use Shadow DOM, which hides content from normal querySelector calls. You need to traverse shadow roots explicitly:
function queryShadowDom(
root: Element | ShadowRoot | Document,
selector: string
): Element | null {
// Try normal query first
const result = root.querySelector(selector);
if (result) return result;
// Walk into shadow roots
const elements = root.querySelectorAll('*');
for (const el of elements) {
if (el.shadowRoot) {
const shadowResult = queryShadowDom(el.shadowRoot, selector);
if (shadowResult) return shadowResult;
}
}
return null;
}
Rate Limiting and Polite Scraping
This is where ethics and engineering intersect. Scraping without rate limiting can degrade service for other users, trigger anti-bot measures, and may constitute a denial-of-service attack. Responsible scraping treats the target server as a shared resource.
Token Bucket Rate Limiter
The token bucket algorithm is the gold standard for rate limiting. It allows short bursts while enforcing an average rate:
class TokenBucketRateLimiter {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number, // Burst capacity
private refillRate: number, // Tokens per second
private minDelayMs: number = 200 // Minimum delay between requests
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<void> {
this.refill();
if (this.tokens < 1) {
const waitMs = ((1 - this.tokens) / this.refillRate) * 1000;
await this.delay(Math.max(waitMs, this.minDelayMs));
this.refill();
}
this.tokens -= 1;
await this.delay(this.minDelayMs);
}
private refill(): void {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
}
private delay(ms: number): Promise<void> {
return new Promise((resolve) => setTimeout(resolve, ms));
}
}
// Conservative settings: 2 requests/second burst, 1 request/second sustained
const limiter = new TokenBucketRateLimiter(2, 1, 500);
Respecting robots.txt
Even though your extension runs in the browser, checking robots.txt is an important ethical signal:
class RobotsTxtChecker {
private rulesCache = new Map<string, RobotsRules>();
async isAllowed(url: string, userAgent: string = '*'): Promise<boolean> {
const origin = new URL(url).origin;
if (!this.rulesCache.has(origin)) {
try {
const response = await fetch(`${origin}/robots.txt`);
if (response.ok) {
const text = await response.text();
this.rulesCache.set(origin, this.parseRobotsTxt(text));
} else {
// No robots.txt means everything is allowed
this.rulesCache.set(origin, { disallowed: [], crawlDelay: 0 });
}
} catch {
this.rulesCache.set(origin, { disallowed: [], crawlDelay: 0 });
}
}
const rules = this.rulesCache.get(origin)!;
const path = new URL(url).pathname;
return !rules.disallowed.some((pattern) => path.startsWith(pattern));
}
getCrawlDelay(origin: string): number {
return this.rulesCache.get(origin)?.crawlDelay ?? 0;
}
private parseRobotsTxt(text: string): RobotsRules {
const lines = text.split('\n');
const disallowed: string[] = [];
let crawlDelay = 0;
let relevantSection = false;
for (const line of lines) {
const trimmed = line.trim().toLowerCase();
if (trimmed.startsWith('user-agent:')) {
const agent = trimmed.replace('user-agent:', '').trim();
relevantSection = agent === '*';
} else if (relevantSection && trimmed.startsWith('disallow:')) {
const path = trimmed.replace('disallow:', '').trim();
if (path) disallowed.push(path);
} else if (relevantSection && trimmed.startsWith('crawl-delay:')) {
crawlDelay = parseInt(trimmed.replace('crawl-delay:', '').trim(), 10) || 0;
}
}
return { disallowed, crawlDelay };
}
}
interface RobotsRules {
disallowed: string[];
crawlDelay: number;
}
Backoff on Errors
When a server returns 429 (Too Many Requests) or 5xx errors, back off exponentially:
async function fetchWithBackoff(
url: string,
options: RequestInit = {},
maxRetries: number = 3
): Promise<Response> {
let lastError: Error | null = null;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
const response = await fetch(url, options);
if (response.status === 429) {
const retryAfter = response.headers.get('Retry-After');
const delayMs = retryAfter
? parseInt(retryAfter, 10) * 1000
: Math.pow(2, attempt) * 1000;
console.warn(`Rate limited. Waiting ${delayMs}ms before retry.`);
await new Promise((r) => setTimeout(r, delayMs));
continue;
}
if (response.status >= 500 && attempt < maxRetries) {
const delayMs = Math.pow(2, attempt) * 1000;
console.warn(`Server error ${response.status}. Retrying in ${delayMs}ms.`);
await new Promise((r) => setTimeout(r, delayMs));
continue;
}
return response;
} catch (error) {
lastError = error as Error;
if (attempt < maxRetries) {
const delayMs = Math.pow(2, attempt) * 1000;
await new Promise((r) => setTimeout(r, delayMs));
}
}
}
throw lastError ?? new Error(`Failed to fetch ${url} after ${maxRetries} retries`);
}
Data Export Formats
Scraped data is only useful if you can get it out of the extension and into tools like spreadsheets, databases, or analysis scripts.
CSV Export
CSV is the most universally compatible format. Every spreadsheet application, database, and data tool can import it:
function toCSV<T extends Record<string, unknown>>(
data: T[],
columns?: (keyof T)[]
): string {
if (data.length === 0) return '';
const headers = columns ?? (Object.keys(data[0]) as (keyof T)[]);
const escapeField = (value: unknown): string => {
const str = String(value ?? '');
if (str.includes(',') || str.includes('"') || str.includes('\n')) {
return `"${str.replace(/"/g, '""')}"`;
}
return str;
};
const headerLine = headers.map(String).map(escapeField).join(',');
const rows = data.map((row) =>
headers.map((h) => escapeField(row[h])).join(',')
);
return [headerLine, ...rows].join('\n');
}
function downloadCSV(csv: string, filename: string): void {
const blob = new Blob(['\uFEFF' + csv], { type: 'text/csv;charset=utf-8;' });
const url = URL.createObjectURL(blob);
const link = document.createElement('a');
link.href = url;
link.download = filename;
link.click();
URL.revokeObjectURL(url);
}
The \uFEFF BOM prefix ensures Excel opens the CSV with correct UTF-8 encoding.
JSON Export with Metadata
JSON preserves types and nesting. Include metadata so you know where the data came from:
interface ExportEnvelope<T> {
version: string;
exportedAt: string;
source: {
url: string;
title: string;
extensionVersion: string;
};
recordCount: number;
data: T[];
}
function toJSONExport<T>(data: T[], sourceUrl: string): string {
const manifest = chrome.runtime.getManifest();
const envelope: ExportEnvelope<T> = {
version: '1.0',
exportedAt: new Date().toISOString(),
source: {
url: sourceUrl,
title: document.title,
extensionVersion: manifest.version,
},
recordCount: data.length,
data,
};
return JSON.stringify(envelope, null, 2);
}
function downloadJSON(json: string, filename: string): void {
const blob = new Blob([json], { type: 'application/json' });
const url = URL.createObjectURL(blob);
const link = document.createElement('a');
link.href = url;
link.download = filename;
link.click();
URL.revokeObjectURL(url);
}
Streaming Large Datasets to IndexedDB
For datasets that exceed chrome.storage.local limits, use IndexedDB as a staging area:
class ScrapedDataStore {
private dbName = 'scraped_data';
private dbVersion = 1;
private async openDB(): Promise<IDBDatabase> {
return new Promise((resolve, reject) => {
const request = indexedDB.open(this.dbName, this.dbVersion);
request.onupgradeneeded = () => {
const db = request.result;
if (!db.objectStoreNames.contains('records')) {
const store = db.createObjectStore('records', {
keyPath: 'id',
autoIncrement: true,
});
store.createIndex('source', 'sourceUrl', { unique: false });
store.createIndex('timestamp', 'scrapedAt', { unique: false });
}
};
request.onsuccess = () => resolve(request.result);
request.onerror = () => reject(request.error);
});
}
async addRecords<T>(records: T[], sourceUrl: string): Promise<void> {
const db = await this.openDB();
const tx = db.transaction('records', 'readwrite');
const store = tx.objectStore('records');
for (const record of records) {
store.add({
...record,
sourceUrl,
scrapedAt: new Date().toISOString(),
});
}
return new Promise((resolve, reject) => {
tx.oncomplete = () => resolve();
tx.onerror = () => reject(tx.error);
});
}
async exportAll(): Promise<unknown[]> {
const db = await this.openDB();
const tx = db.transaction('records', 'readonly');
const store = tx.objectStore('records');
return new Promise((resolve, reject) => {
const request = store.getAll();
request.onsuccess = () => resolve(request.result);
request.onerror = () => reject(request.error);
});
}
async clear(): Promise<void> {
const db = await this.openDB();
const tx = db.transaction('records', 'readwrite');
tx.objectStore('records').clear();
return new Promise((resolve, reject) => {
tx.oncomplete = () => resolve();
tx.onerror = () => reject(tx.error);
});
}
}
Legal and Ethical Considerations
Web scraping sits at the intersection of technology, law, and ethics. As extension developers, we operate in a space that is especially sensitive because our code runs in users’ browsers and interacts directly with third-party websites.
Legal Landscape
The legality of scraping varies by jurisdiction and circumstance:
-
United States: The Computer Fraud and Abuse Act (CFAA) prohibits unauthorized access to computers. The 2022 hiQ Labs v. LinkedIn Supreme Court ruling clarified that scraping publicly accessible data is generally not a CFAA violation, but scraping behind authentication or in violation of explicit terms may be.
-
European Union: GDPR imposes strict rules on collecting personal data. If your scraper collects names, email addresses, or other PII from EU residents, you need a lawful basis for processing and must handle data subject requests.
-
General principle: Scraping publicly available, non-personal data for personal or research purposes carries the lowest legal risk. Commercial scraping of copyrighted or personal data carries the highest.
Ethical Guidelines for Extension Developers
-
Minimize data collection: Only extract what you actually need. If you need product prices, do not also scrape user reviews, profile images, or metadata you will not use.
-
Respect explicit opt-outs: If a site’s
robots.txtdisallows scraping or their ToS explicitly prohibit it, respect that. Your extension should checkrobots.txtbefore scraping a new domain. -
Rate limit aggressively: A scraper running in the browser shares the user’s connection. Aggressive scraping can slow down the user’s browsing, trigger CAPTCHAs on their session, or get their IP temporarily banned. Conservative rate limits (one request per second or slower) protect both the server and the user.
-
Be transparent with users: Your extension description and privacy policy should clearly state what data is collected, how it is stored, and whether it is transmitted to external servers.
-
Never bypass authentication or access controls: Accessing content behind login walls, paywalls, or other access controls is both ethically wrong and likely illegal.
-
Handle errors gracefully: If a site blocks your scraper, do not attempt to circumvent the block. Back off, notify the user, and suggest manual alternatives.
-
Do not scrape personal data without consent: Collecting email addresses, phone numbers, or other PII from websites for unsolicited contact or data brokering is a violation of privacy laws in most jurisdictions.
Chrome Web Store Compliance
Google’s Chrome Web Store policies add another layer of requirements:
- Extensions must request only the minimum permissions needed
- Data collection must be disclosed in the extension’s privacy practices
- Extensions that scrape personal data may be rejected or removed
- The extension’s stated purpose must be honest about scraping functionality
Responsible Resource Management
Scraping extensions consume browser resources: memory for storing extracted data, CPU for DOM traversal, and network bandwidth for any API calls. Responsible resource management ensures your extension does not degrade the user’s browsing experience.
Memory Management
Scraping can accumulate large amounts of data in memory. Flush extracted data to storage regularly and release references:
class MemoryAwareScraper {
private buffer: unknown[] = [];
private readonly flushThreshold = 100;
async addItem(item: unknown): Promise<void> {
this.buffer.push(item);
if (this.buffer.length >= this.flushThreshold) {
await this.flush();
}
}
async flush(): Promise<void> {
if (this.buffer.length === 0) return;
const batch = this.buffer.splice(0);
await chrome.storage.local.set({
[`batch_${Date.now()}`]: batch,
});
}
async cleanup(): Promise<void> {
await this.flush();
this.buffer.length = 0;
}
}
Tab and CPU Awareness
Extensions like Tab Suspender Pro demonstrate responsible resource management by automatically suspending inactive tabs to reduce memory and CPU usage. Your scraping extension should adopt similar principles:
- Pause scraping when the tab is not visible (
document.hidden) - Yield to the main thread with
requestIdleCallbackduring heavy DOM traversal - Stop all scraping if the system is under memory pressure
function createResponsiveScraper(
extractFn: () => unknown[],
options: { pauseWhenHidden?: boolean } = {}
): { start: () => void; stop: () => void } {
let active = false;
const { pauseWhenHidden = true } = options;
function onVisibilityChange(): void {
if (document.hidden && pauseWhenHidden) {
console.log('Tab hidden - pausing scraper');
} else if (!document.hidden && active) {
console.log('Tab visible - resuming scraper');
}
}
function scheduleExtraction(): void {
if (!active) return;
if (document.hidden && pauseWhenHidden) {
// Wait for tab to become visible again
return;
}
if ('requestIdleCallback' in window) {
requestIdleCallback(
(deadline) => {
if (deadline.timeRemaining() > 10) {
extractFn();
}
if (active) scheduleExtraction();
},
{ timeout: 5000 }
);
} else {
setTimeout(() => {
extractFn();
if (active) scheduleExtraction();
}, 100);
}
}
return {
start() {
active = true;
document.addEventListener('visibilitychange', onVisibilityChange);
scheduleExtraction();
},
stop() {
active = false;
document.removeEventListener('visibilitychange', onVisibilityChange);
},
};
}
Complete Example: Job Listing Aggregator
Let us bring everything together with a practical example: a Chrome extension that ethically scrapes job listings from a careers page, respects rate limits, handles SPA navigation, and exports data in both CSV and JSON formats.
Manifest
{
"manifest_version": 3,
"name": "Job Listing Aggregator",
"version": "1.0.0",
"description": "Extract and export job listings from career pages",
"permissions": ["activeTab", "storage"],
"action": {
"default_popup": "popup.html",
"default_icon": "icon-48.png"
},
"content_scripts": [
{
"matches": ["<all_urls>"],
"js": ["content.js"],
"run_at": "document_idle"
}
],
"background": {
"service_worker": "background.js"
}
}
Content Script
// content.ts
interface JobListing {
title: string;
company: string;
location: string;
salary: string | null;
url: string;
postedDate: string | null;
scrapedAt: string;
}
const rateLimiter = new TokenBucketRateLimiter(3, 0.5, 1000);
async function extractJobListings(): Promise<JobListing[]> {
// Common job listing selectors across major career sites
const selectors = [
'[data-job-id]',
'.job-listing',
'.job-card',
'article[class*="job"]',
'[itemtype="http://schema.org/JobPosting"]',
];
let jobElements: Element[] = [];
for (const selector of selectors) {
const found = document.querySelectorAll(selector);
if (found.length > 0) {
jobElements = Array.from(found);
break;
}
}
if (jobElements.length === 0) {
console.log('No job listings found on this page');
return [];
}
const listings: JobListing[] = [];
for (const el of jobElements) {
await rateLimiter.acquire();
const listing: JobListing = {
title: extractText(el, '[class*="title"], h2, h3, [itemprop="title"]'),
company: extractText(el, '[class*="company"], [itemprop="hiringOrganization"]'),
location: extractText(el, '[class*="location"], [itemprop="jobLocation"]'),
salary: extractText(el, '[class*="salary"], [itemprop="baseSalary"]') || null,
url: extractLink(el) ?? location.href,
postedDate: extractText(el, '[class*="date"], time, [itemprop="datePosted"]') || null,
scrapedAt: new Date().toISOString(),
};
if (listing.title) {
listings.push(listing);
}
}
return listings;
}
function extractText(container: Element, selectors: string): string {
for (const selector of selectors.split(',')) {
const el = container.querySelector(selector.trim());
if (el?.textContent?.trim()) {
return el.textContent.trim();
}
}
return '';
}
function extractLink(container: Element): string | null {
const anchor = container.querySelector('a[href]');
return anchor ? (anchor as HTMLAnchorElement).href : null;
}
// Listen for extraction requests from popup
chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => {
if (message.type === 'EXTRACT_JOBS') {
extractJobListings()
.then((listings) => sendResponse({ success: true, data: listings }))
.catch((error) => sendResponse({ success: false, error: error.message }));
return true;
}
if (message.type === 'OBSERVE_JOBS') {
const cleanup = observeInfiniteScroll(
'main, [role="main"], #content',
'[data-job-id], .job-listing, .job-card',
(newItems) => {
chrome.runtime.sendMessage({
type: 'NEW_JOBS_FOUND',
count: newItems.length,
});
},
{ maxItems: message.maxItems ?? 200 }
);
// Clean up after 5 minutes max
setTimeout(cleanup, 5 * 60 * 1000);
sendResponse({ success: true });
return false;
}
});
Background Service Worker
// background.ts
const dataStore = new ScrapedDataStore();
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
if (message.type === 'SAVE_JOBS') {
dataStore
.addRecords(message.data, message.sourceUrl)
.then(() => sendResponse({ success: true }))
.catch((err) => sendResponse({ success: false, error: err.message }));
return true;
}
if (message.type === 'EXPORT_ALL') {
dataStore
.exportAll()
.then((records) => {
const format = message.format ?? 'json';
if (format === 'csv') {
const csv = toCSV(records as Record<string, unknown>[]);
sendResponse({ success: true, data: csv, format: 'csv' });
} else {
const json = JSON.stringify(records, null, 2);
sendResponse({ success: true, data: json, format: 'json' });
}
})
.catch((err) => sendResponse({ success: false, error: err.message }));
return true;
}
});
This example demonstrates every principle covered in this guide: typed data models, resilient selectors, rate limiting, MutationObserver-based infinite scroll handling, dual export formats, and a background service worker that coordinates storage.
Conclusion
Ethical web scraping with Chrome extensions is about more than just extracting data. It requires a commitment to respecting server resources, user privacy, and legal boundaries. The techniques in this guide – content script injection, DOM traversal, MutationObserver patterns, SPA navigation detection, and rate limiting – give you the tools to build powerful scrapers. The ethical guidelines give you the framework to use those tools responsibly.
When building scraping extensions, always ask: would the website owner consider my scraping reasonable? If the answer is not a clear yes, reconsider your approach.
For related topics, see our Content Script Injection Patterns guide and the Chrome Extension Web Scraping reference for additional provider-specific examples.
Built by Zovo - Open-source tools and guides for extension developers.