How to Find All Pages on a Website: A Developer's Guide
- Feb 17
Finding every single page on a website isn't as simple as it used to be. There’s no magic bullet; the best strategy is to layer several techniques. You'll want to start with the easy wins, like the XML sitemap, then bring in a web crawler to follow all the links, and finally, use a headless browser to deal with any JavaScript-heavy pages.
Why Finding Every Page Is a Modern Data Challenge
Before we get into the nitty-gritty, let's talk about why this is such a headache now. What was once a basic SEO check has morphed into a complex data engineering problem. Websites aren't just digital brochures anymore—they're massive, dynamic ecosystems.
Think about it: a huge e-commerce site could have millions of product pages, and a news outlet creates new URLs every single hour. The real problem is that so many of these pages are essentially invisible to standard tools. They're tucked away behind JavaScript, login screens, or weird navigation funnels.
This "invisibility" is a massive roadblock for anyone who needs a complete map of a site's content. And let's be honest, the reasons for needing that map are more important than ever.
Real SEO Audits: You can't optimize what you don't know exists. A full site inventory is the only way to spot orphaned pages, uncover crawl errors, and tackle duplicate content issues head-on.
Feeding AI and Machine Learning Models: Training large language models or building custom AI tools demands huge amounts of high-quality data. A website's entire content library can be a goldmine, but only if you can actually get to all of it.
Price Monitoring & Competitive Intel: If you're in e-commerce or market intelligence, you have to track every product and price shift. Missing pages lead to bad data and flawed insights.
Content & Asset Management: Getting a handle on your complete website structure is also fundamental for big projects, like building an effective AI Powered Knowledge Base that needs to pull from all of your site's content.
The Scale of the Modern Web
The sheer size of the internet today makes this even more of a challenge. As of early 2026, there are roughly 1.34 billion websites online. Google has indexed somewhere around 50 billion pages, but some older estimates put the total number of pages well over 130 trillion. That number has definitely exploded with the rise of dynamic frameworks like React, Angular, and Vue. You can find more of these eye-opening website statistics over on DiviFlash.com.
This monumental scale means that attempting to find all pages on a website is not just about writing a simple script. It’s about building a resilient data acquisition pipeline capable of navigating a massive, ever-changing, and often defensive web.
Old-school methods of just parsing static HTML files are dead. To get the job done today, you need a game plan that handles modern web architecture, from client-side JavaScript rendering to tricky anti-bot defenses. This guide will give you the tactics you need to get past those hurdles and collect the complete data you're after.
Start with the Obvious: Sitemaps and Search Console
Before you fire up a complex crawler or write a single line of code, the best place to begin is with the map the website owner has already created. An XML sitemap is your first and easiest win. It's a public list of URLs the site owner actively wants search engines to find and index.
Think of it as the official, curated tour of the website. While it's not always 100% complete or perfectly up-to-date, it’s an authoritative list that gives you a direct look at the site’s intended structure.
Finding and Using Sitemaps
Most of the time, finding a sitemap is pretty straightforward. You can usually just tack on a few common paths to the end of the domain name. For a site like `example.com`, you'd check `example.com/sitemap.xml`.
Here are the usual suspects to try first:
`/sitemap_index.xml` (This is common for larger sites that split their sitemaps into smaller, more manageable files.)
`/robots.txt` (This file is gold. It often contains a `Sitemap:` directive that points you right to the correct location.)
Once you've found it, you could manually copy and paste the URLs, but that's not practical for any site with more than a handful of pages. A simple script is a much better way to go. Using Python with a library like `requests` to grab the file and `xml.etree.ElementTree` to parse the XML will get the job done quickly.
A good mental model for choosing your tools is a simple decision tree based on how the website is built: for simple HTML sites, basic methods work just fine. But once you're dealing with a modern site that relies heavily on JavaScript, you'll need more advanced tools that can actually render the content like a real browser.
Conceptually, a script to pull these URLs is simple. You just need to fetch the sitemap, parse the XML, and pull out the text from every `<loc>` tag, which holds the URL. Keep in mind that sitemaps can be nested; a sitemap index file might link to a dozen other sitemap files, so your script might need to be smart enough to follow those links and parse them, too.
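To make that concrete, here's a minimal sketch of the parsing step using only Python's standard library. The XML namespace is the one defined by the sitemap protocol; fetching the file itself (with `requests` or similar) is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return (page_urls, nested_sitemap_urls) from one sitemap file."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(NS + "loc") if el.text]
    if root.tag == NS + "sitemapindex":
        # An index file: every <loc> points at another sitemap to fetch next.
        return [], locs
    return locs, []
```

A complete script would keep a queue of nested sitemap URLs and parse each one the same way until none remain.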
The Ultimate Source of Truth: Google Search Console
Sitemaps are fantastic for any site, but if you have access to the website you're analyzing, then Google Search Console (GSC) is your single most important tool. Nothing gives you a clearer picture of how Google actually sees a website.
The real power of GSC is that it shows you what Google has actually found, not just what you told it to find. This is where you uncover forgotten pages, old subdomains, and URLs with crawling errors.
Dive straight into the "Pages" report inside GSC. This is where the magic happens. It gives you a complete list of every URL Google knows about, conveniently broken down by its indexing status:
Indexed: These are the pages that are live in Google's search results right now.
Not indexed: This is a treasure trove of information, listing pages Google found but didn't index. The reasons vary—maybe they have a `noindex` tag, are duplicates, or have crawl issues.
This report is the definitive record of your site’s health and visibility in the eyes of Google. It’s also the best way to spot "orphan pages"—pages that exist but have no internal links pointing to them, making them impossible for a standard crawler to find.
For any site audit, starting with a full export from Google Search Console is non-negotiable. It provides a solid baseline that will guide and validate every other discovery method you use.
Using Web Crawlers for Deeper Link Discovery
When a sitemap is missing, outdated, or just plain incomplete, your hunt for a site's pages is far from over. This is exactly where web crawlers, often called spiders, become your most powerful ally. A crawler works by starting at a single point—usually the homepage—and systematically following every single link it finds.

It’s the same basic method search engines like Google use to map the internet. By mimicking this process, you uncover the site's actual structure as a user or bot would see it, not just the neat and tidy version presented in a sitemap. You’ll find pages the owner forgot even existed.
The scale of this task can be mind-boggling. With 6.04 billion internet users browsing 1.13 billion websites, the number of pages is astronomical. Google's index tops 50 billion pages, but that's just a fraction of the story. Considering the 26.6 million e-commerce sites out there, the true total is well into the trillions. You can dive deeper into these numbers over at Musemind Agency.
Choosing Your Crawling Tool
The good news is you don't need to be a programmer to start crawling. There's a whole ecosystem of tools out there, catering to different needs and technical comfort levels. Your choice really boils down to the website's complexity and how much control you need.
Here's a quick rundown of some popular options:
For a User-Friendly Interface: Screaming Frog SEO Spider is the go-to desktop tool for many SEO professionals. It gives you a massive amount of data on every URL—response codes, metadata, word count, you name it—all without writing a single line of code. It’s perfect for comprehensive site audits.
For a Command-Line Approach: If you're comfortable in a terminal, `wget` is a surprisingly effective, no-frills utility. A simple command like `wget --mirror --no-parent https://example.com/` will download the entire site, discovering all linked pages as it goes. It’s a workhorse for simpler, static websites.
For Ultimate Control: When you need a custom solution, building your own crawler is the way to go. Frameworks like Scrapy for Python or libraries like Puppeteer for JavaScript provide the building blocks to create a crawler perfectly suited for your project. If that sounds interesting, our guide on how to build a web crawler in JavaScript from scratch is a great starting point.
Comparison of Website Crawling Tools
Choosing the right tool can make all the difference. This table breaks down the most popular options to help you decide which one fits your project and skillset.
| Tool | Best For | Technical Skill | Handles JavaScript | Cost |
|---|---|---|---|---|
| Screaming Frog | SEO Audits, Link Analysis | Low (GUI-based) | Yes (with configuration) | Freemium |
| wget | Simple site mirroring | Medium (Command-line) | No | Free |
| Scrapy (Python) | Large-scale, custom scraping | High (Programming) | No (requires Splash/Playwright) | Free (Open-Source) |
| Puppeteer (JS) | Modern JS-heavy sites (SPAs) | High (Programming) | Yes (natively) | Free (Open-Source) |
Ultimately, the "best" tool is the one that gets the job done efficiently for your specific target. For most marketing and SEO tasks, Screaming Frog is more than enough. For complex, dynamic sites or large-scale data extraction, a custom Scrapy or Puppeteer build is a better investment of your time.
The Core Logic of a Custom Crawler
Building your own crawler isn't as scary as it sounds. The logic behind it is pretty simple and centers on managing a "to-do" list and a "done" list to avoid getting lost.
Here's how a typical crawler thinks:
It starts with a to-do list: This is the URL Queue (or "frontier"), which initially contains just your starting URL.
It keeps a "been there" list: The Visited Set tracks every URL the crawler has already processed. This is absolutely critical to stop it from getting stuck in an infinite loop.
It looks for new paths: On each page, the crawler's Extraction Logic parses the HTML, finds all the `<a>` tags, and pulls out the links (`href` attributes).
It updates the lists: Each new link is checked. If it's on the target domain and isn't in the Visited Set, it gets added to the URL Queue.
This process repeats until the URL Queue is empty. By then, the crawler has followed every available path on the site, giving you a comprehensive list of all internally linked pages.
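The loop above can be sketched in a few dozen lines of standard-library Python. The `fetch` callable is injected so the traversal logic stays separate from the HTTP layer; in practice it would wrap something like `requests.get(url).text`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl. `fetch(url)` must return the page's HTML."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])   # the URL Queue, or "frontier"
    visited = set()              # the Visited Set: the "been there" list
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(fetch(url))
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            # Stay on the target domain and skip anything already seen.
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return visited
```

Swap in a real fetcher and this becomes a working crawler, though you'd still want to add a crawl delay, error handling, and robots.txt checks before pointing it at a live site.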
At its heart, a web crawler is just an automated explorer. It navigates the digital pathways you define, turning a chaotic web of links into an organized list of discoverable pages.
Ethical Crawling Is Non-Negotiable
With great power comes great responsibility. A poorly configured crawler can hit a server so hard and fast that it slows down or even crashes the entire website. This is not only rude but will almost certainly get your IP address banned. Always crawl responsibly.
Follow these golden rules to be a good web citizen:
Respect robots.txt: This is the rulebook. Always check `/robots.txt` first. This file tells bots which parts of the site the owner does not want you to visit. Your crawler must obey these directives.
Set a Crawl Delay: Don't slam the server with requests. Pause for a second or two between each page fetch to minimize your impact. Some robots.txt files will even suggest a `Crawl-delay` value to use.
Identify Your Bot: Use a descriptive User-Agent in your request headers. Something like `MyCrawlerBot/1.0 (admin@example.com)` is transparent and gives the site owner a way to contact you if there's a problem.
Handle Errors Gracefully: If the server pushes back with an error (like a `503 Service Unavailable`), your crawler should back off and try again later, not keep hammering it.
By following these simple guidelines, you can get the data you need without causing headaches for the site owner and ensure you don't get locked out.
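Python ships a robots.txt parser in its standard library, so honoring the first two rules takes only a few lines. This sketch uses `urllib.robotparser`; the robots.txt content and the bot name are made up for illustration.

```python
import urllib.robotparser

def make_rules(robots_txt):
    """Parse robots.txt text into a reusable rule checker."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

# A made-up robots.txt for illustration; normally you'd fetch
# https://example.com/robots.txt before crawling the site.
robots = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = make_rules(robots)
# Fall back to a polite 1-second pause if no Crawl-delay is declared.
delay = rules.crawl_delay("MyCrawlerBot") or 1
allowed = rules.can_fetch("MyCrawlerBot", "https://example.com/public/page")
blocked = rules.can_fetch("MyCrawlerBot", "https://example.com/private/page")
```

In a real crawler you'd call `can_fetch` before every request and `time.sleep(delay)` between them.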
Handling Modern JavaScript-Driven Websites
If you’ve ever tried to crawl a modern website and gotten back a nearly empty page, you've run into the JavaScript problem. Traditional crawlers are brilliant at one thing: parsing static HTML. But the web has moved on.
Today's sites, often built with frameworks like React, Angular, or Vue, operate more like applications. They send your browser a barebones HTML shell, and JavaScript does the heavy lifting, fetching content and building the page you actually see. For a simple crawler, it's like showing up to a party before the host has even put out the food. It sees an empty room, finds no links, and just gives up, completely missing the real website.

This isn't a niche issue. With 1.34 billion websites out there and WordPress alone powering 43.2% of them, dynamic content is the norm. It's driven by the demand for interactive experiences, especially since mobile traffic now exceeds 60%. While Google has managed to index around 50 billion pages, countless more are hidden behind JavaScript that basic tools just can't execute. If you want a deeper dive into these numbers, Exploding Topics has some great data.
The Rise of Headless Browsers
So, how do we see what the user sees? The answer is a headless browser.
Imagine a real web browser like Chrome, but without any of the visible windows or buttons. It runs entirely behind the scenes, controlled by your code. It can do everything a normal browser can—execute JavaScript, handle API calls, and render the complete, final version of a page.
By letting a headless browser do the work first, your crawler gets the fully-rendered HTML, packed with all the links and content that were previously invisible. It’s the key to unlocking modern websites.
Two of the go-to tools for this are:
Puppeteer: A Node.js library from Google that gives you a clean, powerful API to control Chrome or Chromium. It's my personal favorite for its simplicity and reliability.
Selenium: The old guard of browser automation. It’s been around forever, supports practically every browser, and has a huge community with libraries for Python, Java, and C#, among others.
Choosing the right tool is a big decision when you're building a serious scraping project. We put together a detailed comparison of Puppeteer vs Playwright for modern web scraping that can help you decide what fits your needs.
A headless browser isn't just a tool; it's a fundamental shift in how we approach web scraping. It moves from passively parsing static text to actively interacting with a live web application, making it possible to map even the most complex JavaScript-driven sites.
Tackling Infinite Scroll and Pagination
Just having a headless browser isn't quite enough, though. You still need to tell it how to behave like a user. Many sites hide content behind actions like scrolling or clicking "load more" buttons. Your code needs to simulate these interactions.
Infinite Scroll: You see this everywhere, from social media feeds to e-commerce product listings. New content appears as you scroll down the page.
To get all this data, you have to program your script to mimic that scrolling behavior. The logic is pretty straightforward:
Load the initial page.
Run a JavaScript command to scroll to the very bottom, like `window.scrollTo(0, document.body.scrollHeight)`.
Pause for a moment to let new content load. You can wait a fixed time or, for more robust scripts, wait for a specific network request to finish.
Grab the newly loaded HTML and extract any links.
Keep scrolling and scraping in a loop until you scroll and nothing new appears.
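The loop itself is framework-agnostic, so here's a sketch with the two browser actions injected as plain callables. With Puppeteer or Playwright, `scroll_to_bottom` would wrap the scroll-and-wait step and `count_items` would count the item elements on the page; the fake feed in the test below is invented for illustration.

```python
def scroll_until_exhausted(scroll_to_bottom, count_items, max_rounds=50):
    """Keep scrolling until a scroll produces no new items.

    `scroll_to_bottom()` triggers one scroll (and should wait for any new
    content to load); `count_items()` reports how many items are visible.
    Returns the final item count.
    """
    seen = count_items()
    for _ in range(max_rounds):   # hard cap so we never loop forever
        scroll_to_bottom()
        now = count_items()
        if now == seen:           # nothing new appeared: we've hit the end
            return now
        seen = now
    return seen
```

The `max_rounds` cap matters on truly endless feeds (social timelines), where "nothing new" may never happen and you need an explicit stopping point.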
"Load More" Buttons: Other sites use a button to load the next "page" of results without a full page refresh. This is just another interaction you need to automate.
Your script needs to find and "click" that button over and over. Here’s the game plan:
Load the page and scrape the content that's visible initially.
Find the "Load More" button using its CSS selector (like a class or ID).
Tell your headless browser to click it.
Wait for the new items to appear on the page.
Repeat this process until the button either vanishes or becomes disabled, signaling you've reached the end.
Mastering these interaction techniques is what separates a basic crawl from a comprehensive one. By programming your crawler to act like a real user, you can coax a website into revealing all its content and the links hidden inside, ensuring your map of the site is truly complete.
Navigating Anti-Bot Defenses and Rate Limiting
Once your crawler starts picking up speed, it stops being a quiet visitor. It becomes a bot, hammering a server with hundreds or thousands of requests in rapid succession. Modern websites are built to spot this exact behavior and shut it down. Sooner or later, you're going to hit a wall—a CAPTCHA, a 403 Forbidden error, or an outright IP ban.
This is where the real cat-and-mouse game begins. Finding all the pages on a large, protected website isn't just about following links; it's about making your crawler look less like a bot and more like a real person browsing the site. This means moving beyond simple GET requests and adopting smarter, more resilient strategies.
Blending In with Proxies and User Agents
The first things a website's security system checks are your IP address and your User-Agent string. A thousand requests from the same IP in a minute is a dead giveaway.
To fly under the radar, you need to diversify your crawler's identity.
Rotate Your User-Agent: Never, ever use the default User-Agent from your HTTP library (like `python-requests`). Keep a running list of common, real-world browser User-Agents and cycle through them with each request. This simple step makes your traffic look like it’s coming from many different people on different browsers.
Use High-Quality Proxies: A proxy server acts as a middleman, masking your true IP. Datacenter proxies are cheap and easy to spot. For any serious project, residential proxies are the way to go. These are real IP addresses from actual internet service providers, making your requests indistinguishable from legitimate user traffic.
Distributing your requests across a large pool of residential proxies is the single most effective tactic for avoiding IP-based blocks and rate limits. Your crawler suddenly looks like a crowd of individual users instead of one aggressive bot.
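Both rotations can be as simple as cycling through two pools. Here's a minimal sketch in Python; the User-Agent strings and proxy hostnames are placeholders, and the returned dict is shaped to be passed as `requests.get(url, **config)`.

```python
import itertools

# Illustrative pools only: in practice you'd maintain a longer, current
# list of real browser User-Agents and your provider's proxy endpoints.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Gecko/20100101 Firefox/126.0",
]
PROXIES = ["http://proxy-a.example:8000", "http://proxy-b.example:8000"]

ua_pool = itertools.cycle(USER_AGENTS)
proxy_pool = itertools.cycle(PROXIES)

def next_request_config():
    """Return per-request settings with a fresh User-Agent and proxy."""
    proxy = next(proxy_pool)
    return {
        "headers": {"User-Agent": next(ua_pool)},
        "proxies": {"http": proxy, "https": proxy},
    }
```

Round-robin cycling is the simplest policy; serious setups often pick randomly, weight proxies by past success rate, or pin one identity per "session" so a sequence of requests looks like a single consistent visitor.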
The goal isn't just to hide; it's to blend in. A successful crawler is one that the server doesn't even notice, seamlessly mixing its requests with the flow of genuine human traffic.
How to Handle Rate Limits Gracefully
Even with proxies, firing off too many requests too quickly will get you flagged. This is rate limiting, a server's self-defense mechanism. Instead of trying to brute-force your way through it, the smart play is to work with it.
When you get an HTTP `429 Too Many Requests` response, the server is giving you a polite warning to slow down. The worst thing you can do is ignore it. Best practice is to implement an exponential backoff strategy. If a request fails, you wait one second before trying again. If it fails a second time, wait two seconds, then four, then eight, and so on. This automatically dials back your crawl speed when you meet resistance.
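Exponential backoff fits in a short helper. In this sketch the `fetch` and `sleep` callables are injected so the retry logic can be exercised without a network; in a real crawler they'd be `requests.get` and `time.sleep`.

```python
import time

def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0,
                       sleep=time.sleep):
    """Retry fetch(url), doubling the wait after each rate-limit response."""
    response = None
    for attempt in range(max_retries):
        response = fetch(url)
        if response.status_code != 429:    # not rate-limited: hand it back
            return response
        sleep(base_delay * 2 ** attempt)   # wait 1s, 2s, 4s, 8s, ...
    return response                        # still limited after all retries
```

A refinement worth adding: many servers send a `Retry-After` header with the 429, and honoring that exact value is even more polite than guessing with doubled delays.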
This respectful approach not only keeps you from getting banned but also ensures your crawler doesn't negatively impact the website's performance. There are many tactics you can employ, and our guide on how to scrape a website without getting blocked covers these strategies in much greater detail.
Tackling Advanced Challenges
Some sites bring out the big guns. These advanced systems don't just look at your IP or User-Agent; they analyze your crawler's digital "fingerprint."
Here are some of the toughest hurdles and how to approach them:
CAPTCHAs: These are designed specifically to differentiate humans from bots. While CAPTCHA-solving services exist, prevention is always better. Using high-quality residential proxies and mimicking human-like request patterns can often keep you from triggering them in the first place.
Browser Fingerprinting: Modern security can check for dozens of browser attributes—screen resolution, installed fonts, plugins, and even how your machine renders WebGL graphics. A simple HTTP request has none of these. To pass these checks, you'll need a real headless browser like Puppeteer, configured to look like a standard, everyday browser.
Behavioral Analysis: The most sophisticated systems track mouse movements, scrolling behavior, and typing speed. Overcoming this requires highly specialized tools that can simulate these human-like interactions, often using AI-driven browser automation to appear genuine.
Successfully navigating these defenses isn't about finding a single magic bullet. It’s about building a resilient, multi-layered system. By combining rotating residential proxies, realistic browser fingerprints, and intelligent rate-limit handling, you can build a crawler capable of mapping even the most well-defended websites on the internet.
Common Questions About Finding Website Pages
As you get deeper into mapping out a website, you’ll inevitably run into some tricky situations. It happens every time. Navigating these edge cases—from technical quirks to legal gray areas—is what separates a quick scan from a thorough, professional project. Let's tackle some of the most common questions that pop up.
How Can I Find Pages That Are Not Linked Anywhere?
Ah, the classic mystery of "orphaned pages." These are the pages a standard crawler will completely miss because they have zero internal links pointing to them. A spider just follows `<a>` tags, so if there's no path, it never finds the page. You have to think like a detective and look for clues outside the site's visible link structure.
Your best bet is to check sources that list URLs directly:
Dig into the XML Sitemap: This is the website owner’s official roadmap. It’s the first place I always look for pages that were created but never properly linked up from the main site.
Use Google Search Operators: A quick `site:example.com` search can be surprisingly revealing. Google often finds and indexes pages that users (and crawlers) can't, giving you a list of URLs to investigate.
Analyze JavaScript Files: Modern sites often build URLs on the fly. It's worth poking around in the site's JavaScript to see if you can find routing logic or API endpoints that hint at how page URLs are generated.
Finding these hidden pages is often the last step in making sure you truly find all pages on a website for a comprehensive audit.
Is It Legal to Scrape All Pages from Any Website?
This is a big one, and the honest answer is: it’s complicated. Generally, scraping publicly available information isn't illegal, but you're not operating in a vacuum. You have to respect the website's rules and any relevant data privacy laws.
The first thing you should always do is check the website's Terms of Service and `robots.txt` file. While they aren't always legally binding everywhere, ignoring them is bad form and the fastest way to get your IP address blocked. Beyond that, privacy laws like GDPR and CCPA have very strict rules about collecting and handling personal data, even if it's publicly visible.
The real legal danger often isn't the act of scraping itself, but the impact of it. If your crawler hammers a server so hard that it slows down or crashes the site, that could be interpreted as a denial-of-service (DoS) attack. The golden rule is to be respectful, go slow, and if you have any serious doubts, it's time to talk to a lawyer.
What Is the Best Way to Handle Duplicate Content?
Get ready to find a lot of duplicate content. It's a given on almost any large-scale crawl. A single product might be reachable through dozens of URLs, especially on e-commerce sites with filters and tracking parameters (like `?color=blue` or `?utm_source=newsletter`).
The industry-standard solution is to find and respect the canonical URL. This is a little tag in the page's HTML that looks like `<link rel="canonical" href="https://example.com/page/">`. It's the site owner's way of telling search engines, "Hey, out of all these similar-looking pages, this is the one true version."
Your crawler or process needs to be smart about this:
On every page you land on, check the `<head>` for a canonical tag.
If that tag points to a different URL, treat the current page as a duplicate and discard it.
Only add the designated canonical URL to your final master list.
By following canonical tags, you stop yourself from processing the same content over and over, which saves a ton of time and resources. More importantly, it ensures your final list of pages is clean, accurate, and reflects the site's intended structure.
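Those three steps can be sketched with Python's standard-library HTML parser; the sample page in the test is invented for illustration.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Grabs the href of a <link rel="canonical"> tag, if one exists."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            d = dict(attrs)
            if d.get("rel") == "canonical" and d.get("href"):
                self.canonical = d["href"]

def resolve_canonical(url, html):
    """Return the page's canonical URL, falling back to its own URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or url

def dedupe(pages):
    """Collapse (url, html) pairs down to a set of canonical URLs."""
    return {resolve_canonical(url, html) for url, html in pages}
```

Run over a full crawl, `dedupe` collapses every tracking-parameter and filter variant of a page into the single URL the site owner designated.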