Managing parameterized & duplicate URLs and crawl budget for SEO
Ecommerce websites often generate multiple URLs for the same content due to product variations, category paths, filters, sorting, pagination, internal search and, depending on the ecommerce platform or software used, many other factors.
On ShopWired, products and categories can be available (displayed) at many different URLs depending on the navigation path taken by the user to arrive at the page. In addition, ShopWired makes available variation URLs for each variation combination on each product.
For this reason, it's common to see in Google Search Console (or similar tools provided by other search engines) that thousands, tens of thousands or even hundreds of thousands of URLs have been discovered by the search engine whilst only a fraction of them have actually been indexed and will appear in search results.
To prevent issues, primarily duplicate content issues, ShopWired automatically uses canonical tags to designate the primary URL for each category and product page - and places canonical URLs on all other page types too (even if they cannot be accessed from different URLs). Canonical tags are output on your website by a variable within your theme's global.header code - they are therefore not implemented at the "theme level" but by the ShopWired platform itself, independently of your theme.
Canonical tags and noindex tags
Canonical tags
A canonical tag is a hint to Google indicating the “preferred” or master version of a page. If multiple URLs have the same content, you put <link rel="canonical" href="https://example.com/preferred-page"> on the duplicates, pointing to the primary URL.
Canonical tags consolidate ranking signals and tell Google to index the canonical page, not the duplicate.
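To see what a canonical tag looks like in practice, the short sketch below extracts the canonical URL from a page's HTML using only Python's standard library. The HTML snippet is illustrative (it reuses the example.com URL from above), not real ShopWired output:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of any <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.canonical = attrs.get("href")

# A duplicate URL's <head> points at the primary (canonical) page.
html = """
<html><head>
  <link rel="canonical" href="https://example.com/preferred-page">
</head><body>...</body></html>
"""
finder = CanonicalFinder()
finder.feed(html)
print(finder.canonical)  # https://example.com/preferred-page
```

Fetching any duplicate URL on your own website and running its HTML through a parser like this should always reveal the primary URL.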
Noindex tags
A noindex tag is a directive (e.g. <meta name="robots" content="noindex">) that tells Google not to index the specific page it is placed on. It is not an instruction not to crawl the page, but an instruction that the page should be dropped from search results. This is useful for pages that you never want shown in search (login pages, internal search results, etc.), but it does not merge ranking signals - it simply excludes the page.
Google itself says that pages with noindex may still be crawled and may still consume crawl cycles. Google's own guidance on crawl budget says "Don't use noindex, as Google will still request [the page], but then drop it when it sees the noindex meta tag, wasting crawling time."
Duplicate content
For duplicate content situations, using a canonical tag is the standard and best practice solution.
It avoids indexing duplicates while still allowing Google to crawl them and pass any link equity to the canonical. Adding a noindex on duplicate pages is typically unnecessary and potentially counterproductive. SEO experts typically summarise this as "The canonical tag already tells Google which page to prioritise... Adding a noindex tag might confuse things... Just stick with the canonical tag unless you have a specific reason to completely hide the page."
In other words, if a page is meant to funnel signals to a canonical version, you don’t also need to noindex it – Google will simply not index the duplicate (it will show up in Search Console as “Alternate page with proper canonical tag”).
For duplicate URLs that already have a canonical set, adding a noindex would be asking for the same outcome twice (and doing so in a way that doesn't save crawl effort). Google's John Mueller has advised against combining a noindex with a canonical on the same page. His recommendation is to choose one approach or the other based on your goal.
URL footprints
Parameterised URLs
Query parameters can be appended to the end of URLs to display different content on those pages or preload specific actions on the page for the visitor of a URL. A good example of parameterised URLs is ShopWired's variations URL feature (which is an essential feature, for example when optimising the effectiveness of Google Shopping campaigns).
Other types of parameterised URLs can also occur on ShopWired websites, for example with URL query based pagination, sorting and filters.
Products with many hundreds of individual variation combinations can therefore have several hundreds of variation URLs.
Category and product URLs
Single category and product pages can be available at different URLs depending on category assignments. A subcategory that is assigned to, for example, ten different parent categories will be available at a total of 11 different URLs (one URL for each parent category path plus the single canonical URL).
Combining the two factors
When the effects of parameterised URLs and category URLs are combined, the end result can be explosive. A product assigned to five subcategories, where each of those subcategories is assigned to five parents, and where the product has 200 variation combinations, can have more than one thousand URL variations, all showing the same page content.
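The multiplication above can be sketched with a quick calculation. The figures mirror the example in the text; real counts depend on exactly how products and categories are reachable on a given website:

```python
subcategories = 5             # subcategories the product is assigned to
parents_per_subcategory = 5   # parent categories each subcategory sits under
variations = 200              # variation combinations on the product

# Each subcategory contributes one direct path plus one path per parent.
category_paths = subcategories * (1 + parents_per_subcategory)
base_urls = category_paths + 1  # + the product's own canonical URL

# Every base URL can also carry each ?variation= query parameter.
total_urls = base_urls * (1 + variations)
print(total_urls)  # 6231 URLs, all showing the same page content
```

Even with conservative assumptions, the count comfortably exceeds one thousand, which is why the discovered-URL figure in Google Search Console can dwarf the number of actual pages.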
Crawl budget
Crawl budget refers to the amount of attention or number of URLs Googlebot will crawl on your site in a given period. If Google’s crawler finds hundreds of thousands of URLs on a site, it may not crawl all of them frequently, prioritizing what it deems important. For most small-to-medium sites, crawl budget isn’t a limiting factor – Google will crawl what it needs to. However, when a site has a large number of low-value or duplicative URLs, it can waste Googlebot’s time and resources, potentially causing Google to give up crawling some pages in depth.
Google’s documentation notes that if many URLs on your site are duplicates or unnecessary, “this wastes a lot of Google crawling time on your site.”
A Google representative has recently said of crawl budget "it really is not an issue unless [we’re] talking millions of pages.".
An important metric within Google Search Console is the number of discovered URLs vs the number of pages indexed. Whilst there is no golden rule about what the ratio should be, or stay below, higher ratios indicate that Googlebot is wasting time on duplicate URLs; it may then crawl other pages less often, which can slow down how quickly new or updated pages get indexed. In extreme cases, it can also lead to Google crawling parts of the site very infrequently.
Google’s own guide for large sites says a key best practice is to “manage your URL inventory” so that Google isn’t busy on unnecessary URLs. If Google spends too much time on URLs that “aren’t appropriate for the index”, it might decide not to increase crawl efforts on your site.
Some website owners may become concerned that their website is in this position and that their discovered URLs : pages indexed ratio is too high. However, the exact ratio is not a determining factor, and a very high ratio (e.g. 100 : 1) may cause no problems at all for a small website with 1,000 discovered URLs, while a lower ratio (e.g. 20 : 1) may cause problems for a large website with 1,000,000 URLs.
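The interplay between absolute size and ratio can be sketched as a small heuristic. The thresholds below simply mirror the examples in this article - they are illustrative only, not official Google limits:

```python
def bloat_signal(discovered_urls: int, indexed_pages: int) -> str:
    """Illustrative heuristic only: the exact ratio is not a determining
    factor; absolute site size matters more. Figures mirror the worked
    examples in the text, not any official Google threshold."""
    ratio = discovered_urls / indexed_pages
    if discovered_urls < 10_000:
        return f"ratio {ratio:.0f}:1 - unlikely to matter at this size"
    if discovered_urls >= 1_000_000 and ratio >= 20:
        return f"ratio {ratio:.0f}:1 - review crawl optimisation"
    return f"ratio {ratio:.0f}:1 - monitor in Search Console"

print(bloat_signal(1_000, 10))          # small site, 100:1 is fine
print(bloat_signal(1_000_000, 50_000))  # large site, 20:1 warrants a review
```

The point of the sketch is the shape of the decision, not the exact numbers: a small site can tolerate an extreme ratio, while a very large site should investigate even a modest one.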
If a website owner is concerned that their crawl budget is being used inefficiently, ShopWired advises focusing on crawl optimisation strategies to cut down on crawl waste.
Sitemap
ShopWired automatically creates a sitemap.xml file for your website. The sitemap lists all of the URLs on your website from product pages to blog posts, but importantly it only lists a single "page" (i.e. a data object like a product or a blog post) once, at its canonical URL.
Your website's automatically generated sitemap never includes parameterised URLs such as variation URLs or category sorting URLs, and never includes the same page more than once.
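If you want to verify this for your own sitemap, a short script can confirm that every URL appears only once and that none carry query parameters. The sitemap XML below is a hand-made stand-in for a downloaded sitemap.xml file; real files use the same sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET

# Stand-in for a downloaded sitemap.xml (illustrative URLs).
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product-name</loc></url>
  <url><loc>https://example.com/blog/post</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
locs = [el.text for el in ET.fromstring(SITEMAP).findall("sm:url/sm:loc", NS)]

assert len(locs) == len(set(locs)), "duplicate URL found in sitemap"
assert not any("?" in loc for loc in locs), "parameterised URL in sitemap"
print(f"{len(locs)} URLs, all unique, none parameterised")
```

Running a check like this against your live sitemap is a quick way to confirm that only canonical URLs are being submitted to search engines.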
Impact on indexing & Google Search Console coverage
In Google Search Console's Pages > Not indexed report, websites can have a very large number of URLs labelled "Alternate page with proper canonical tag."
This status means Google found those URLs, saw a canonical tag on them pointing to another page, and thus did not index them (the canonical version was indexed instead). For example, a URL like www.yoursite.com/product?size=large might be listed there, with Google noting that its canonical is www.yoursite.com/product – so only the latter is indexed.
As RankMath’s knowledge base explains: this status “means that there are two versions of a page having the same canonical URL. Google will simply exclude the duplicate version and index the main version of the page.”
In fact, Google recognizing pages as “Alternate with proper canonical” is a good sign – it implies your canonical tags are working correctly, and Google is obeying them. The RankMath article goes on to say “Google recognizes these canonicalized URLs correctly, and there is nothing you need to do on your part.”
Therefore, from an indexing perspective, a collapse from, say, 50,000 discovered URLs to 1,000 indexed pages is not an indication that anything is wrong with the website's structure or URL architecture. This kind of result is the expected outcome of Google filtering out duplicates.
However, such a large ratio of 50 : 1 is an indication of "crawl bloat". Changes to your website's architecture or category structure, the sudden addition of a large number of variation options for certain products, as well as other actions you may take on your website, can all manifest as large changes to the ratio.
Using robots.txt to block unnecessary URL crawling
The robots.txt file is a good way to instruct search engines (and other bots) what URLs should not be crawled on your website at all.
If you are worried about "crawl bloat" on your website or that your ratio of discovered URLs to indexed pages is too high, you can use the robots.txt file to instruct search engines not to crawl particular URLs (or URLs with particular patterns) on your website.
You can use ShopWired's custom robots.txt feature to configure your own rules for your website's robots.txt file.
Disallowing all parameterised URLs
You can add a rule to your robots.txt file:
Disallow: /*?
to block all URLs containing a “?” (query string). Google (and most modern search engines) do support wildcards in robots.txt (the * and $ patterns) so /*? would indeed cover any URL on the website that has a query parameter. This would blanket-block variation URLs, filter URLs, sort URLs, pagination, search queries, tracking parameters (like ?utm=), etc. It’s a very broad stroke.
It will certainly prevent Google from crawling the vast majority of duplicate URLs. However, caution is needed: such a rule would also block any legitimate page that uses ? in the URL. For example, if the site’s pagination is ?page=2, those would be disallowed.
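Google-style robots.txt matching (with `*` wildcards and `$` end anchors) can be approximated with a small regex translation. This sketch is not a full robots.txt parser - just enough to show why `/*?` catches any URL with a query string while leaving clean URLs crawlable:

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Approximate Google-style robots.txt matching: '*' matches any
    run of characters, a trailing '$' anchors the end, and otherwise
    the rule matches any URL that begins with the pattern."""
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    if regex.endswith(re.escape("$")):
        # A trailing '$' is an end anchor, not a literal character.
        regex = regex[: -len(re.escape("$"))] + "$"
    return re.match(regex, url_path) is not None

disallow = "/*?"  # the blanket rule discussed above

print(rule_matches(disallow, "/product?variation=123"))  # True  - blocked
print(rule_matches(disallow, "/category?page=2"))        # True  - blocked
print(rule_matches(disallow, "/product-name"))           # False - crawlable
```

As the last two calls show, the blanket rule blocks pagination URLs along with variation URLs, which is exactly the trade-off described above.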
Disallowing specific parameter URLs
If you prefer not to blanket block all parameterised URLs from robots.txt you can use more refined rules to target specific query parameters on your website.
- `Disallow: /*?variation=` will block all variation URLs
- `Disallow: /*?page` will block all category, brand and search result pagination URLs
- `Disallow: /search` will block your website's search page
Noindex
You can use ShopWired's noindex feature if you want to specifically request that Google does not index a website page. This should not be used to block one of your website pages from being crawled (the appropriate tool for that is the robots.txt file), and it should only be used if you are absolutely sure you want the page (its canonical URL) removed from Google's index.
Monitoring
After making changes, use Google Search Console's coverage and crawl statistics to monitor the performance of the rules or changes that you have implemented.
In Index Coverage, you should start seeing the number of "Alternate page with proper canonical tag" URLs plateau or drop, and ideally see more URLs in the "Blocked by robots.txt" category (for those parameterised URLs now disallowed). Seeing them in "Blocked" is fine - it means Google knows of them (perhaps from previous crawling or links) but isn't crawling them now. Over time, if no new links to them appear, Google will de-prioritise them.
The Crawl Stats report can show the total requests by Googlebot. If the changes are effective, you may notice fewer Googlebot requests to pages with ? parameters. This might free up crawl capacity for other pages. Also watch if Google’s crawl rate on the site improves or if “time spent downloading a page” decreases – blocking heavy filter pages might even reduce server load slightly. The Crawl Stats “By response code” and “By file type” breakdown can confirm Google is mostly hitting HTML pages that matter, not tons of parameter pages.
As a quick fix, for any particularly problematic URLs that are currently indexed and you want them gone ASAP (for example, if ?variation=123 pages are somehow indexed and showing in results), you can use Google Search Console’s Remove URLs tool to temporarily remove them. This is a temporary removal (valid for about 6 months) and you must also address it via meta tags or robots.txt in the meantime.
After making changes to reduce the discovered URLs to indexed pages ratio, Google will likely crawl the site more efficiently, but it may not result in a sudden massive increase in indexed pages – because Google was already indexing the correct pages. The main benefit is stability and future-proofing. With fewer URLs of little use, Google’s crawl budget for the site will be focused on real content. This can help with faster discovery of new products or changes and should also mean the Search Console reports will be cleaner.
Frequently asked questions
Is there a platform-level solution to apply noindex tags to non-canonical variation URLs?
No – and there shouldn’t be. Canonical tags already prevent Google from indexing variation URLs. Adding a noindex tag would be redundant and potentially counterproductive.
Does the ShopWired platform have canonical variation URLs?
Yes – indirectly. All variation URLs (e.g. ?variation=123) contain a <link rel="canonical"> pointing to the main product URL (e.g. /product-name). This is handled by ShopWired at the platform level and not manually within the theme templates.
Should non-canonical URLs also have a `noindex` tag?
No. Best practice is that canonical and noindex should not be combined. Google’s own guidance, echoed by top SEOs, is to use one or the other – not both. Canonical says “index this other version,” while noindex says “don’t index this at all.” Mixed signals reduce clarity and have potential for error.
Will `noindex` prevent Google from crawling those URLs?
No. This is a common misunderstanding. noindex only stops indexing – Google still needs to crawl the page to see the noindex tag. In fact, it uses crawl budget to do so. To prevent crawling entirely, use robots.txt.
Is it true that parameterised URLs are harming crawl budget even if they aren’t indexed?
It depends. Canonical tags stop indexing but not crawling. If there is a very large number of discovered URLs (e.g. more than 50,000 or 100,000) AND a high ratio of discovered URLs to indexed pages, the difference has still consumed crawl budget, which is important on larger websites. Googlebot must access and process each URL to evaluate its canonical status.
Crawl budget isn’t a concern for small sites, but for medium or large catalogues it’s wise to reduce crawl-waste by limiting discoverable duplicate URLs.
Is Google indexing parameterised URLs?
It should not be. Parameterised URLs on your ShopWired website will always contain a canonical link. They should therefore all appear in GSC under “Alternate page with proper canonical tag.”
Are canonical tags configured correctly on ShopWired?
Yes. They are output at the platform level through global.header and are not theme-defined.
Does my website's sitemap include URLs other than canonical URLs?
If you are using ShopWired's automatic sitemap, and have not manually entered your own, it will only contain canonical URLs - this is true no matter the page type.
Should Google Search Console removals be used?
Only if:
- Parameterised URLs were previously indexed (not just crawled)
- They are currently appearing in search results
Otherwise, canonical + robots.txt is the long-term solution. Removals are temporary and won’t help unless underlying causes are fixed.
Does crawl budget affect indexing?
Indirectly. Crawl budget affects how often and how deeply Google crawls a site. If 90% of crawl time is spent on redundant URLs, new or updated content may be crawled less frequently, delaying indexing or updates.