Claude Skills Guide

When you need to extract structured data from HTML documents, Cheerio provides a fast, jQuery-like API that pairs perfectly with Claude Code. Whether you’re building a web scraper, monitoring competitor prices, or aggregating content from multiple sources, this guide shows you how to create efficient HTML parsing workflows that use Claude Code’s automation capabilities.

Setting Up Cheerio in Your Project

Before diving into parsing workflows, you need to install and configure Cheerio properly. Create a new Node.js project or use an existing one:

npm install cheerio axios

The axios package helps you fetch HTML content from websites, while Cheerio handles the parsing. Set up a basic parser module that you can reuse across your project:

// src/lib/html-parser.js
import * as cheerio from 'cheerio';
import axios from 'axios';

export async function fetchAndParse(url) {
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyBot/1.0)',
    },
  });
  return cheerio.load(response.data);
}

This foundation lets you focus on data extraction rather than boilerplate setup.

Basic Element Selection and Extraction

Cheerio’s selector API mirrors jQuery, making it intuitive for developers familiar with frontend JavaScript. The key is understanding how to target elements precisely while handling edge cases gracefully.

For simple text extraction, use the .text() method on selected elements:

// Extract all article titles from a blog listing
const $ = await fetchAndParse('https://example.com/blog');

const titles = [];
$('article h2, .post-title, .entry-title').each((i, el) => {
  titles.push($(el).text().trim());
});

console.log(titles);

When dealing with attributes rather than text content, chain .attr() to pull specific values:

// Extract all image URLs from a gallery
const images = [];
$('img.gallery-image').each((i, el) => {
  const src = $(el).attr('src');
  const alt = $(el).attr('alt') || '';
  if (src) {
    images.push({ src, alt });
  }
});

Claude Code can help you iterate on selectors when initial attempts don’t capture the right elements. Describe the structure you’re seeing and ask for alternative selector strategies.

Handling Dynamic Tables and Lists

Tabular data requires different extraction strategies. When scraping product listings, financial data, or comparison tables, you’ll often need to combine multiple selectors and handle variable row structures.

// Extract product data from an e-commerce listing
async function scrapeProducts(html) {
  const $ = cheerio.load(html);
  const products = [];

  $('.product-card').each((i, card) => {
    const product = {
      name: $(card).find('.product-name').text().trim(),
      price: $(card).find('.price .amount').text().trim(),
      currency: $(card).find('.price .currency').text().trim(),
      rating: $(card).find('.rating').attr('data-rating'),
      inStock: $(card).find('.stock-status').hasClass('in-stock'),
      url: $(card).find('a.product-link').attr('href'),
    };
    products.push(product);
  });

  return products;
}

This pattern works well for structured data with consistent markup. For pages with inconsistent layouts, consider using fallback selectors:

// Handle multiple possible selector paths
function extractTitle($) {
  return (
    $('.product-title').text().trim() ||
    $('.entry-title').text().trim() ||
    $('h1').text().trim() ||
    'Unknown Title'
  );
}

Pagination and Multi-Page Scraping

Real-world scraping often requires traversing multiple pages. Build a paginated workflow that handles navigation automatically:

async function scrapeAllPages(baseUrl, maxPages = 10) {
  const allProducts = [];
  
  for (let page = 1; page <= maxPages; page++) {
    const url = `${baseUrl}?page=${page}`;
    console.log(`Scraping page ${page}: ${url}`);
    
    const $ = await fetchAndParse(url);
    const products = await scrapeProducts($.html());
    
    if (products.length === 0) {
      console.log('No more products found, stopping.');
      break;
    }
    
    allProducts.push(...products);
    
    // Respect rate limits between requests
    await new Promise(resolve => setTimeout(resolve, 1000));
  }
  
  return allProducts;
}

Claude Code can help you extend this pattern to handle infinite scroll pages, API-based pagination, or sessions that require authentication.

Handling Malformed HTML

Cheerio is forgiving with malformed HTML, but you’ll encounter edge cases. Use error boundaries and validation to prevent crashes:

async function safeParse(html) {
  try {
    if (!html || html.trim().length === 0) {
      throw new Error('Empty HTML received');
    }
    return cheerio.load(html);
  } catch (error) {
    console.error('Parse error:', error.message);
    return null;
  }
}

async function extractWithFallback(url, selectors) {
  const $ = await fetchAndParse(url);
  if (!$);
  
  const results = {};
  
  for (const [key, selector] of Object.entries(selectors)) {
    const element = $(selector);
    results[key] = element.length > 0 
      ? element.text().trim() 
      : null;
  }
  
  return results;
}

This approach ensures your scraper continues running even when pages change structure or return unexpected content.

Integrating with Claude Code Workflows

The real power emerges when you combine Cheerio parsing with Claude Code’s task orchestration. Create reusable skills that encapsulate your parsing logic:

// claude-skills/cheerio-parser/skill.md
# Cheerio HTML Parser Skill

Use this skill when you need to:
- Extract structured data from HTML documents
- Build web scraping pipelines
- Parse dynamically generated content

Available functions:
- scrapeProductListings(url, options)
- extractTableData(html, tableSelector)
- findElements(html, selector)
- extractMetadata(html)

This lets you delegate HTML parsing tasks to Claude Code while maintaining clean, testable extraction logic in your codebase.

Best Practices for Production Scrapers

When moving from development to production, follow these guidelines:

Respect robots.txt and rate limits. Always check a site’s terms of service before scraping, and implement delays between requests to avoid overwhelming servers.

Handle errors gracefully. Network requests fail regularly. Wrap fetches in try-catch blocks and implement retry logic with exponential backoff:

async function fetchWithRetry(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await axios.get(url);
    } catch (error) {
      if (i === retries - 1) throw error;
      await new Promise(r => setTimeout(r, 1000 * Math.pow(2, i)));
    }
  }
}

Validate extracted data. Don’t assume selectors will always work. Log warnings when expected elements are missing and consider implementing data quality checks:

function validateProduct(product) {
  const required = ['name', 'price', 'url'];
  const missing = required.filter(field => !product[field]);
  
  if (missing.length > 0) {
    console.warn(`Product missing fields: ${missing.join(', ')}`);
    return false;
  }
  
  return true;
}

Store results incrementally. For large scraping jobs, save progress after each batch to avoid losing work if the process crashes.

Conclusion

Cheerio provides a robust foundation for HTML parsing in Node.js, and when combined with Claude Code’s automation capabilities, you can build sophisticated data extraction workflows. Start with basic text and attribute extraction, then progressively add pagination, error handling, and validation as your needs grow. The key is maintaining clean separation between fetching, parsing, and data processing so each component remains testable and maintainable.

Remember to always scrape responsibly—respect server resources, implement appropriate delays, and verify you’re allowed to access the data you need before building production scrapers.

Built by theluckystrike — More at zovo.one