How to Find All Existing and Archived URLs on a Website

There are plenty of reasons you might want to find all the URLs on a website, and your exact goal will determine what you're looking for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Gather all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and difficult to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is an invaluable, donation-funded tool for SEO tasks. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the missing export button, use a browser scraping plugin like Dataminer.io. Still, these limits mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org saw it, there's a good chance Google did, too.
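One way around both the 10,000-URL cap and the missing export button is the Wayback Machine's CDX API, which serves archived capture data in bulk. A minimal sketch, with example.com standing in for your domain:

```python
import requests

DOMAIN = "example.com"  # placeholder; swap in your own domain

# Query the Wayback Machine's CDX API for archived URLs on the domain.
resp = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={
        "url": DOMAIN,
        "matchType": "domain",  # include subdomains
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # collapse repeated captures of the same URL
        "output": "text",
        "limit": 50000,
    },
    timeout=120,
)
resp.raise_for_status()

urls = resp.text.splitlines()
print(f"Retrieved {len(urls)} archived URLs for {DOMAIN}")
```

Expect plenty of noise in the output (images, scripts, malformed URLs), so plan to filter it before merging with your other sources.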

Moz Pro
While you'd typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets (see the sketch below).

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
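If you take the API route, the request below shows the general shape of a Moz Links API call. Treat it strictly as a sketch: the endpoint and body fields reflect the v2 Links API as I understand it, so verify them against Moz's current documentation, and the credentials and domain are placeholders.

```python
import requests

# Placeholder credentials; generate real ones in your Moz account.
ACCESS_ID = "mozscape-xxxxxxxxxx"
SECRET_KEY = "your-secret-key"

# Endpoint and body fields per the Moz Links API v2 (verify against the docs).
resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",
    auth=(ACCESS_ID, SECRET_KEY),
    json={
        "target": "example.com/",       # placeholder site
        "target_scope": "root_domain",  # links pointing anywhere on the domain
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()

# The response schema may differ from what I recall; inspect a result to find
# the field holding the linked-to page on your site, then collect those URLs.
for link in resp.json().get("results", []):
    print(link)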

Google Look for Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
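As a sketch of the API route: the Search Console API returns up to 25,000 rows per request, and the startRow parameter lets you page through the rest. This assumes you've already created OAuth credentials and saved an authorized token; the token file, property URL, and date range below are all placeholders.

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Placeholder: an OAuth token you previously authorized for Search Console.
creds = Credentials.from_authorized_user_file("token.json")
service = build("searchconsole", "v1", credentials=creds)

SITE_URL = "https://example.com/"  # placeholder verified property

pages, start_row = set(), 0
while True:
    response = service.searchanalytics().query(
        siteUrl=SITE_URL,
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,      # API maximum per request
            "startRow": start_row,  # page through results beyond 25k
        },
    ).execute()
    rows = response.get("rows", [])
    if not rows:
        break
    pages.update(row["keys"][0] for row in rows)
    start_row += len(rows)

print(f"{len(pages)} pages with search impressions")
```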

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create specific URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a filter to the report

Step 2: Click "Create a new filter."


Step 3: Define the filter with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
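If you'd rather script this than click through the UI, the GA4 Data API exposes the same data and lets you filter on pagePath directly. A minimal sketch, assuming the google-analytics-data package is installed and credentials are available via GOOGLE_APPLICATION_CREDENTIALS; the property ID, date range, and /blog/ pattern are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, RunReportRequest,
)

client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    # Mirror the UI filter: only paths containing /blog/
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)

response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} blog page paths")
```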

Server log documents
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be enormous, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (a quick sketch follows below).
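To give you an idea of the DIY route, here's a minimal sketch that pulls unique URL paths out of an access log. It assumes a common/combined log format and a file named access.log, both placeholders; adjust the regex to whatever your server or CDN actually writes.

```python
import re

# Matches the request line in common/combined log format, e.g.:
# 203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            # Strip query strings so /page?a=1 and /page?a=2 dedupe together.
            paths.add(match.group(1).split("?")[0])

print(f"{len(paths)} unique URL paths")
```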
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
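If you land in a Jupyter Notebook, a few lines of pandas handle the formatting and deduplication. A sketch, assuming each source was saved as a one-column CSV of URLs (all filenames are placeholders):

```python
import pandas as pd

# Placeholder filenames: one-column CSVs exported from each source above.
sources = ["archive_org.csv", "moz_links.csv", "gsc_pages.csv",
           "ga4_paths.csv", "log_paths.csv"]

frames = [pd.read_csv(path, names=["url"], header=None) for path in sources]
urls = pd.concat(frames, ignore_index=True)["url"].astype(str)

# Light normalization so trivial variants dedupe together: trim whitespace,
# drop URL fragments, and remove trailing slashes.
urls = (urls.str.strip()
            .str.replace(r"#.*$", "", regex=True)
            .str.rstrip("/"))

deduped = urls.drop_duplicates().sort_values()
deduped.to_csv("all_urls.csv", index=False, header=False)
print(f"{len(deduped)} unique URLs")
```

Depending on your site, you may want stricter normalization (lowercasing hostnames, resolving relative paths from the log data, stripping tracking parameters) before deduplicating.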

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
