Web Scraper & Data Extractor

Parse HTML with CSS selectors, extract tables to JSON or CSV, test JSONPath expressions, and pull structured data from any markup. Everything runs in your browser. Nothing is sent to a server.

Last updated: March 2026 | Free to use, no signup required

HTML Source

CSS Selector Reference

element          Tag name
.class           Class selector
#id              ID selector
[attr]           Has attribute
[attr=value]     Attribute equals
[attr*=value]    Attribute contains
A > B            Direct child
A B              Descendant
A + B            Adjacent sibling
A ~ B            General sibling
A, B             Multiple selectors
:first-child     First child
:last-child      Last child
:nth-child(n)    Nth child
:not(sel)        Negation
:empty           No children
HTML with Table
JSON Input

JSONPath Reference

$                Root object
$.key            Child property
$..key           Recursive descent
$.arr[0]         Array index
$.arr[-1]        Last element
$.arr[0:3]       Array slice
$.arr[*]         All elements
$.*              All properties

Additional Utilities

Regex Pattern Matcher

URL Parser

All processing happens in your browser. No data is transmitted.

What Is Web Scraping?

Web scraping is the process of extracting data from websites programmatically. Instead of copying text by hand, a scraper reads the HTML structure of a page, locates the elements that contain the data you need, and pulls them into a structured format like JSON, CSV, or a database table. The technique is used across industries for price monitoring, research aggregation, lead generation, content indexing, and competitive analysis.

At its core, a web scraper operates on two principles: fetching a page and parsing its content. Fetching means making an HTTP request to a URL to retrieve the raw HTML response. Parsing means walking through that HTML to find specific elements using patterns like CSS selectors or XPath expressions. This tool handles the parsing side. You paste in HTML source, define what you want to extract, and the tool returns structured output.

Client-side scraping (like this tool) works on HTML you already have. Server-side scraping, by contrast, fetches pages from remote servers, which introduces CORS restrictions, rate limiting, and legal considerations. For learning, prototyping, and testing extraction logic, a client-side parser is the fastest way to iterate.

How Web Scrapers Work

A typical web scraping workflow has four stages:

  1. Request: the scraper sends an HTTP GET (or POST) request to a target URL and receives the HTML response body.
  2. Parse: the HTML string is loaded into a DOM parser that builds a tree structure representing the page.
  3. Select: the scraper queries the DOM tree using CSS selectors, XPath, or regex patterns to locate target elements.
  4. Extract: the matched elements yield their text content, attribute values, or inner HTML, which gets stored as structured data.
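The four stages above can be sketched in plain Python using only the standard library's HTMLParser. Stage 1 (the HTTP request) is omitted so the example runs on static markup; the sample HTML and class names are invented for illustration:

```python
from html.parser import HTMLParser

# Stage 1 (request) is skipped: we start from HTML already in hand.
SAMPLE = '<ul><li class="item">Alpha</li><li class="item">Beta</li></ul>'

class ItemExtractor(HTMLParser):
    """Stages 2-4: parse the markup, select <li class="item">, extract text."""
    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Stage 3 (select): match only <li> elements carrying class="item"
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Stage 4 (extract): collect the text content of matched elements
        if self.in_item:
            self.items.append(data.strip())

parser = ItemExtractor()
parser.feed(SAMPLE)   # Stage 2 (parse)
print(parser.items)   # -> ['Alpha', 'Beta']
```

In practice a library like BeautifulSoup replaces this hand-written state machine, but the parse/select/extract stages are the same.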

Modern scrapers may also handle JavaScript-rendered pages using headless browsers like Puppeteer or Playwright. These tools launch a real browser engine, wait for the page to fully render, then expose the resulting DOM for extraction. This approach is necessary for single-page applications where the content is loaded via JavaScript after the initial HTML response.

Scrapers also deal with pagination (following "next page" links), authentication (logging in before scraping), and throttling (adding delays between requests to avoid overwhelming target servers).
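A pagination-with-throttling loop can be sketched with an in-memory stand-in for HTTP fetches; the page URLs and dictionary structure below are invented for illustration, and the lookup would be an HTTP GET in a real scraper:

```python
import time

# A tiny in-memory "site": each page holds items plus a pointer to the next page.
PAGES = {
    "/page/1": {"items": ["a", "b"], "next": "/page/2"},
    "/page/2": {"items": ["c"], "next": "/page/3"},
    "/page/3": {"items": ["d"], "next": None},
}

def crawl(start, delay=0.01):
    """Follow 'next' links until exhausted, throttling between fetches."""
    url, collected = start, []
    while url is not None:
        page = PAGES[url]          # stand-in for an HTTP GET
        collected += page["items"]
        url = page["next"]
        time.sleep(delay)          # throttle: be polite to the target server
    return collected

print(crawl("/page/1"))  # -> ['a', 'b', 'c', 'd']
```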

CSS Selectors for Web Scraping

CSS selectors are the primary way to target elements within HTML. They were designed for styling, but they work equally well for data extraction. Every major scraping library supports CSS selectors, including BeautifulSoup, Cheerio, Puppeteer, and Playwright.

The most common selectors for scraping are:

  - Class selectors such as .product-name, since data-carrying elements usually share a consistent class
  - Attribute selectors such as a[href] or [attr*=value], which target elements by the data they carry
  - Combinators such as div > p and ul li, which scope a match to a specific container

Pseudo-selectors like :first-child, :nth-child(2), and :not(.hidden) add further precision. Combining selectors with commas lets you match multiple patterns in a single query. The selector tester tab in this tool provides a live environment for experimenting with all of these.
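As a sketch of how these selector types behave in a scraping library, here they are with BeautifulSoup's select; the sample markup, class names, and data-sku attribute are invented for illustration:

```python
from bs4 import BeautifulSoup

HTML = """
<div id="catalog">
  <p class="product" data-sku="A1">Widget</p>
  <p class="product" data-sku="B2">Gadget</p>
  <p class="note">Ships in 3 days</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Class selector: every element with class="product"
products = [p.get_text() for p in soup.select(".product")]

# Attribute-contains selector: the element whose data-sku contains "B"
gadget = soup.select_one('p[data-sku*="B"]').get_text()

# Combinator plus pseudo-class: the first element child of #catalog
first = soup.select_one("#catalog > :first-child").get_text()

print(products, gadget, first)  # -> ['Widget', 'Gadget'] Gadget Widget
```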

JSONPath Expressions Explained

JSONPath is a query language for JSON data, similar to how XPath works for XML. It lets you navigate nested JSON structures and extract specific values without writing custom traversal code.

The syntax starts with $ representing the root object. Dot notation accesses properties: $.store.name retrieves the name property inside store. Bracket notation handles special characters or array indexing: $.store.book[0] gets the first book.

Key operators include:

  - $..key for recursive descent, matching key at any depth
  - $.arr[*] for every element of an array, and $.* for every property of an object
  - $.arr[0:3] for slices and $.arr[-1] for the last element

JSONPath is supported natively by many APIs and data processing tools. It appears in AWS Step Functions, Kubernetes configurations, and various ETL platforms. Knowing JSONPath saves time when working with deeply nested API responses.
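Since JSONPath libraries vary, one way to internalize the operators is to map them to plain Python access on a parsed document; the sample data below is invented for illustration:

```python
import json

DOC = json.loads("""
{"store": {"name": "Books & Co",
           "book": [{"title": "Dune", "price": 9.99},
                    {"title": "Neuromancer", "price": 7.5}]}}
""")

# $.store.name          -> dot notation is plain key access
name = DOC["store"]["name"]

# $.store.book[0].title -> bracket indexing maps directly to list indexing
first_title = DOC["store"]["book"][0]["title"]

# $.store.book[*].price -> a wildcard is a comprehension over the array
prices = [b["price"] for b in DOC["store"]["book"]]

# $.store.book[-1]      -> negative indices work the same way
last = DOC["store"]["book"][-1]["title"]

print(name, first_title, prices, last)
```

Recursive descent ($..key) is the one operator with no one-line Python equivalent, which is where a JSONPath library earns its keep.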

Legal Considerations for Web Scraping

Web scraping occupies a complicated legal space. The legality depends on what data you scrape, how you scrape it, what you do with the data, and the jurisdiction you operate in.

In the United States, the Computer Fraud and Abuse Act (CFAA) has been applied to scraping cases. The Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly available data does not constitute unauthorized access under the CFAA. However, this does not mean all scraping is legal. Terms of service violations, copyright infringement, and privacy regulations like GDPR in Europe or CCPA in California add layers of restriction.

General guidelines that reduce legal risk:

  - Review the target site's terms of service and robots.txt before scraping.
  - Throttle requests so you do not degrade the target server.
  - Avoid collecting personal data without a lawful basis under regulations like GDPR or CCPA.
  - Do not bypass authentication, paywalls, or other technical access controls.
  - Do not republish copyrighted content.

When in doubt, consult legal counsel before running scraping operations at scale.

Popular Web Scraping Languages and Tools

Python dominates the web scraping space due to its extensive library ecosystem. The standard stack includes Requests for HTTP calls and BeautifulSoup for HTML parsing. For more advanced use cases, Scrapy provides a full framework with built-in support for crawling, item pipelines, middleware, and distributed scraping via Scrapy-Redis.

JavaScript scrapers use Cheerio (a server-side jQuery-like library for Node.js) for static pages and Puppeteer or Playwright for JavaScript-rendered content. Playwright is cross-browser and supports Chromium, Firefox, and WebKit.

Other notable tools:

  - Selenium: the long-standing browser automation framework, with bindings for Python, Java, and other languages
  - lxml: a fast Python parser with full XPath support
  - Jsoup: an HTML parser for Java with CSS selector support
  - Colly: a scraping framework for Go

Cloud-based platforms like Apify, ScrapingBee, and Bright Data handle infrastructure, proxy rotation, and CAPTCHA solving for large-scale commercial scraping operations.

Building Your First Web Scraper

A minimal Python scraper that extracts all links from a page takes about ten lines of code:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href]"):
    print(link["href"], link.get_text(strip=True))

This script sends a GET request, parses the HTML into a BeautifulSoup object, then uses the CSS selector a[href] to find all anchor elements with an href attribute. For each match, it prints the URL and link text.
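The same approach extends to the table-to-CSV extraction this page advertises. A minimal sketch, with invented sample markup:

```python
import csv
import io

from bs4 import BeautifulSoup

HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>4.50</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# One list per row; "th, td" matches both header and data cells
rows = [[cell.get_text(strip=True) for cell in tr.select("th, td")]
        for tr in soup.select("table tr")]

# Serialize the rows as CSV
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```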

To handle JavaScript-rendered pages, swap Requests for Playwright:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")
    links = page.query_selector_all("a[href]")
    for link in links:
        print(link.get_attribute("href"), link.inner_text())
    browser.close()

Before scraping any live site, test your selectors on static HTML using the tools on this page. Paste the page source into the HTML Parser tab, try different selectors, and verify the output matches what you expect. This prevents wasted requests and speeds up development.

Frequently Asked Questions

Can this tool scrape live websites?

No. This tool is a client-side HTML parser and data extractor. It processes HTML, JSON, and text that you paste into the input fields. Browser security policies (CORS) prevent JavaScript running on a webpage from fetching content from other domains. To scrape live sites, you need a server-side tool like Python with BeautifulSoup, Node.js with Cheerio, or a headless browser like Puppeteer or Playwright.

How do I get the HTML source of a page to paste here?

In most browsers, right-click on a page and select "View Page Source" or press Ctrl+U (Cmd+Option+U on Mac). This opens the raw HTML in a new tab, which you can copy and paste into this tool. For JavaScript-rendered content, use the browser's DevTools (F12), navigate to the Elements tab, right-click the html element, and choose "Copy > Copy outerHTML" to get the fully rendered DOM.

What CSS selectors work for scraping?

All standard CSS selectors work, including tag names (div, p, a), class selectors (.product-name), ID selectors (#main-content), attribute selectors (a[href], img[src]), combinators (div > p, ul li), and pseudo-classes (:first-child, :nth-child(2), :not(.hidden)). The most useful for scraping are attribute selectors and class selectors because they target specific data-carrying elements. Use the CSS Selector Tester tab to experiment with selectors against your HTML.

Is web scraping legal?

The legality of web scraping varies by jurisdiction and circumstances. Scraping publicly available data is generally permissible in the United States following the hiQ v. LinkedIn ruling. However, violating a site's terms of service, bypassing access controls, scraping personal data without consent, or republishing copyrighted content can create legal liability. Always review the target site's terms and robots.txt, avoid collecting personal information without a lawful basis, and consult a lawyer if you plan to scrape at commercial scale.

What is the difference between CSS selectors and XPath?

Both CSS selectors and XPath are used to locate elements in HTML. CSS selectors use a compact syntax designed for styling (e.g., div.class > p) and are supported natively in browsers via querySelectorAll. XPath uses a path-like syntax (e.g., //div[@class="name"]/p) and can traverse the DOM in directions CSS cannot, such as selecting parent elements or preceding siblings. CSS selectors are simpler for most scraping tasks. XPath provides more power when you need to navigate upward in the DOM tree or use complex conditions.
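As a small illustration, Python's standard library ElementTree supports a limited XPath dialect (full XPath needs a library such as lxml, and ElementTree requires well-formed XML, not loose HTML); the markup below is invented:

```python
import xml.etree.ElementTree as ET

# Well-formed markup only: ElementTree is an XML parser, not an HTML one.
DOC = ET.fromstring("""
<root>
  <div class="name"><p>Alice</p></div>
  <div class="name"><p>Bob</p></div>
  <div class="bio"><p>Engineer</p></div>
</root>
""")

# The XPath //div[@class="name"]/p, in ElementTree's limited dialect
names = [p.text for p in DOC.findall(".//div[@class='name']/p")]
print(names)  # -> ['Alice', 'Bob']
```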

Is my data safe when using this tool?

Yes. This web scraper runs entirely in your browser using JavaScript. No data is transmitted to any server. There are no API calls, no analytics tracking on your input, and nothing is stored after you close the page. You can verify this by opening your browser's developer tools and watching the Network tab while using the tool. It is safe for processing HTML that contains sensitive or proprietary content.

Related Tools

Michael Lip
Developer and tools engineer at Zovo. Building free developer and productivity tools.