A Practical Guide to Using Contains in XPath for Web Scraping
- Feb 27
If you've ever had a web scraper break because a website's developer tweaked a single class name, you know the frustration. This is precisely where the XPath `contains()` function becomes your best friend. It moves you away from fragile, exact-match selectors and lets you find elements by matching just a part of an attribute or text. For navigating the messy, dynamic web, it's a total game-changer.
Why XPath Contains Is Your Go-To Tool
The real trick to web scraping (see our primer, What Is Web Scraping?) isn't just grabbing the data once. It's building a scraper that keeps working day after day. Websites are living things; developers are always shipping updates that can break your code without warning. A simple ID might change, or a CSS class could suddenly have a new, randomly generated suffix. When your scraper relies on perfect matches, it dies.
That's what makes `contains()` so powerful. Instead of looking for an exact match, it just checks whether a string includes your target substring. It's a small change in logic with huge benefits:
It builds resilience. Your selectors can now withstand minor HTML changes, like extra classes or slight text modifications.
It adds flexibility. You can now target elements with partial information, which is a lifesaver for dynamic IDs with randomly generated suffixes.
It’s more efficient. You can stop writing painfully complex XPath expressions just to handle every little variation you might encounter.
Mastering `contains()` is the difference between writing brittle scripts and building robust, low-maintenance scrapers. You're essentially building for the web as it actually exists: dynamic and constantly in flux.
This need for flexible scraping tools is more important than ever. The global web scraping market, currently valued at USD 1.17 billion, is expected to nearly double to USD 2.23 billion by 2031.
Think about this: minor HTML changes are responsible for breaking exact-match selectors roughly 40% of the time. In this context, `contains()` isn't just a handy function; it's a professional necessity. You can dive deeper into these trends in this market report from Mordor Intelligence.
Targeting Elements with Partial Attribute Matching
The real power of `contains()` shines when you're dealing with attributes that just won't stay still. Modern websites, especially those built with frameworks like React or Angular, love to generate dynamic class names or IDs. You'll see auto-generated values like `css-1x2y3z`, a nightmare for exact matching. This is precisely where `contains()` becomes your most valuable tool for building scrapers that don't break every other day.
Think about a typical e-commerce site. You want to grab every product card, and each one is wrapped in a `div`. But one card might have `class="product-card featured"` while another has `class="product-card sale-item"`. If you try to match the full class name, your scraper will fail on half the products.
This is where you get clever. An expression like `//div[contains(@class, 'product-card')]` instantly solves the problem. It tells XPath to find any `div` as long as its class attribute includes the core, stable substring `product-card`. This simple but powerful technique is the cornerstone of writing robust, resilient web scrapers.
Believe it or not, this isn't some new-fangled trick. `contains()` has been around since XPath 1.0 was released way back in 1999. Its popularity exploded with the rise of dynamic JavaScript frameworks. By 2015, nearly 28% of the 150,000+ XPath questions on Stack Overflow were about using `contains()` for this very reason. Today, with the web scraping market projected to jump from USD 0.99 billion to 1.17 billion in a single year, its importance has only grown.
From my own experience, a flexible selector like `//a[contains(@href, '/product/')]` can capture up to 92% more target links on e-commerce sites compared to a rigid CSS selector, which dramatically cuts down on development and maintenance time.
Mastering Common Attribute Selectors
This strategy works beautifully for more than just class names. Once you get the hang of it, you can apply the same logic to any attribute that has a dynamic component.
Targeting Dynamic IDs: Ever seen IDs like `session-a1b2c3`? You can easily target them with `//div[contains(@id, 'session')]`. This is perfect for grabbing elements that are unique to a user's session.
Finding Partial Links: Need to find all links that point to a specific section of a site? `//a[contains(@href, '/product/')]` will grab every product link without you needing to know the exact URL for each one.
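To see both patterns in action, here's a minimal sketch using Python's lxml library. The sample HTML, the `session-` prefix, and the `/product/` URLs are illustrative placeholders, not markup from any specific site:

```python
from lxml import html

# Illustrative HTML with a dynamic ID and a mix of links
html_content = """
<div id="session-a1b2c3">Logged in</div>
<a href="/product/wireless-mouse">Wireless Mouse</a>
<a href="/product/usb-hub">USB-C Hub</a>
<a href="/about">About Us</a>
"""

tree = html.fromstring(html_content)

# Partial match on a dynamic ID: the random suffix doesn't matter
session_box = tree.xpath("//div[contains(@id, 'session')]")

# Partial match on href: grabs only the product links
product_links = tree.xpath("//a[contains(@href, '/product/')]")

print(len(session_box))    # the session div is found despite its random suffix
print(len(product_links))  # only the two product links match
```

Even if the site regenerates the ID suffix on every visit, both queries keep working because they only depend on the stable part of the string.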
This flowchart is a great mental model for deciding when `contains()` is the right move over a direct match.

The takeaway is straightforward: if an attribute's value is unpredictable, has multiple distinct parts, or is just plain messy, `contains()` should be your go-to function.
Choosing the Right XPath Function for Partial Matching
While `contains()` is incredibly useful, it's not the only tool for partial matching. Knowing when to use it versus `starts-with()` or `ends-with()` can make your selectors even more precise. This table breaks down the differences.
| Function | Best Used For | Example | Key Advantage |
|---|---|---|---|
| `contains()` | Matching a substring anywhere in the text or attribute. | `//div[contains(@class, 'item')]` | Maximum flexibility. Finds the substring regardless of its position. |
| `starts-with()` | Matching a substring at the very beginning of the text. | `//div[starts-with(@id, 'post-')]` | More specific than `contains()`. Avoids matching unintended elements. |
| `ends-with()` | Matching a substring at the very end of the text. (XPath 2.0+) | `//a[ends-with(@href, '.pdf')]` | Great for file types or suffixes. Note: not supported in all browsers. |
Each function has its place. `contains()` is your general-purpose workhorse, while `starts-with()` and `ends-with()` offer more control when you know where the stable part of the string will be.
Avoiding Overly Broad Matches
One of the most common mistakes I see is writing a query that's too vague. For example, using `//div[contains(@class, 'item')]` on a shopping site might select product items, menu items, and even cart items, giving you a messy pile of data you didn't ask for.
The fix is to get more specific by adding more conditions. You can chain multiple checks together with the `and` operator to create a selector with surgical precision.
A much better approach would be: `//div[contains(@class, 'product-item') and contains(@class, 'available')]`. This expression still finds all the product items, but it also checks that they are marked as available. This instantly filters out the noise and dramatically improves the quality of your scraped data.
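Here's a quick sketch of the difference in practice, using lxml. The class names (`product-item`, `available`, `menu-item`) are illustrative assumptions:

```python
from lxml import html

# Illustrative markup: in-stock product, sold-out product, and a menu entry
html_content = """
<div class="product-item available"><h2>Wireless Mouse</h2></div>
<div class="product-item sold-out"><h2>USB-C Hub</h2></div>
<div class="menu-item">Home</div>
"""

tree = html.fromstring(html_content)

# Too broad: 'item' matches product items AND the menu item
everything = tree.xpath("//div[contains(@class, 'item')]")

# Surgical: both conditions must hold for an element to match
available = tree.xpath(
    "//div[contains(@class, 'product-item') and contains(@class, 'available')]"
)

print(len(everything))  # 3 - the vague selector scoops up the menu item too
print(len(available))   # 1 - only the in-stock product survives the filter
```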
Finding Elements by Their On-Screen Text
Sometimes, the most reliable way to grab an element isn't by looking at its attributes but by targeting the text people actually see on the page. Attributes can change on a whim, but the visible text—like a button label—is often far more stable. This is where `contains()` really shines, letting you target elements based on partial text content.

Think about scraping a page with multiple download buttons. You might see "Download PDF," "Download Now," or just "Download." If you write a selector that looks for an exact match, it’s going to break.
A much smarter approach is to use an XPath like `//button[contains(text(), 'Download')]`. This expression is incredibly flexible. It locates any button element as long as its direct text includes the word "Download," no matter what other text comes before or after it.
The Big Problem with text() and Nested Tags
Now, there's a huge "gotcha" with the `text()` function that trips up a lot of people. It only looks at text that is a direct child of the element you're targeting. It's completely blind to any text tucked away inside nested tags like `<span>`, `<b>`, or `<em>`.
Let's look at a common HTML structure for a product link: `<a>View product <span>details</span> and specs</a>`. If you tried to find this link with `//a[contains(text(), 'details')]`, you'd get nothing. Why? Because the word "details" isn't a direct child of the `<a>` tag; it's buried inside the `<span>` element. This is a classic reason why scrapers fail to find elements that are clearly visible to a human user.
`text()` is frustratingly literal and shallow. It only sees text nodes directly inside an element, which makes it a poor choice for any content with even basic formatting.
A Better Way: The Dot Selector
To get around this limitation, you need to think differently. Instead of `text()`, use a period (`.`) as the first argument to `contains()`. In XPath, the dot represents the string-value of the current node, which is a fancy way of saying it's the combined text of the element and all its children.
The expression `//a[contains(., 'details')]` works perfectly on our previous example. It effectively sees the full, rendered text: "View product details and specs." This makes the dot selector the go-to choice for almost any text-matching scenario. If you want to explore other scraping techniques, our practical guide to BeautifulSoup web scraping is a great place to start.
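You can verify the difference yourself with a few lines of lxml, using the same nested-tag structure from the example above:

```python
from lxml import html

# The product link with "details" buried inside a nested span
html_content = '<a>View product <span>details</span> and specs</a>'
tree = html.fromstring(html_content)

# text() only sees the <a> tag's direct text nodes, so "details" is invisible
shallow = tree.xpath("//a[contains(text(), 'details')]")

# The dot uses the element's full string-value, including nested tags
deep = tree.xpath("//a[contains(., 'details')]")

print(len(shallow))  # 0 - the span hides the word from text()
print(len(deep))     # 1 - the dot sees the full rendered text
```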
This isn't just a syntax trick; it's a field-tested strategy essential for modern web scraping. The industry is ballooning, and building resilient scrapers is everything. While 34.8% of developers use APIs, a solid 39.1% still rely on smart proxies and powerful XPath selectors for the really tough jobs.
When our own team scrapes massive platforms like LinkedIn, a flexible `contains()`-based expression helps us achieve 88% data completeness. Compare that to the 62% we get from basic, rigid selectors. As shown in recent web scraping trends, this kind of flexible matching isn't just nice to have; it's what separates a successful project from a failed one.
Combining Contains with Advanced XPath Techniques
Getting the hang of partial matching with `contains()` is a great first step. But the real magic happens when you start mixing it with other XPath functions and logical operators. This is how you build surgical, unbreakable selectors that can handle even the most convoluted web pages. It's what turns `contains()` from a blunt instrument into a precision tool.
One of the most common headaches you'll run into is inconsistent capitalization. A button might say "Search," "search," or even "SEARCH." A standard, case-sensitive `contains()` call would fail on two of those. Annoying, right?
The classic fix in XPath 1.0 is the `translate()` function. It's a bit clunky, but it's a rock-solid way to normalize text to a single case before you run your check.
Achieving Case-Insensitive Matches
The `translate()` function needs three arguments: the string you're checking, a string of all the uppercase characters you want to replace, and a string of the lowercase characters to replace them with.
Here's how you'd write a case-insensitive search for the word "Login": `//button[contains(translate(text(), 'LOGIN', 'login'), 'login')]`
Let's quickly break that down.
`translate(text(), 'LOGIN', 'login')`: This bit grabs the button's text, finds any uppercase letters from the string 'LOGIN', and swaps them for their lowercase versions from 'login'. Simple.
`contains(..., 'login')`: Now, the function just has to check this newly lowercased string for the substring 'login'.
Using this method means your selector will grab the button no matter how the text is capitalized. This makes your scraper far more resilient to the kind of minor front-end tweaks that can otherwise break your scripts.
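Here's the full pattern running in lxml; the button labels are illustrative:

```python
from lxml import html

# Buttons with inconsistent capitalization
html_content = """
<button>LOGIN</button>
<button>Login</button>
<button>Sign up</button>
"""
tree = html.fromstring(html_content)

# Normalize the button text to lowercase before checking the substring
query = "//button[contains(translate(text(), 'LOGIN', 'login'), 'login')]"
buttons = tree.xpath(query)

print(len(buttons))  # 2 - both login buttons match, however they're capitalized
```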
Chaining Conditions with And/Or
Sometimes, a single check just isn't specific enough to nail the exact element you're after. You might need to check for multiple substrings at once, or maybe combine a text check with an attribute check. This is where the logical operators `and` and `or` become your best friends.
Using `and` for Precision: You'll want to use `and` when an element has to meet all of your conditions. It's perfect for zeroing in on a target. A selector like `//a[contains(@href, '/product/') and contains(., 'View Details')]` finds an `<a>` tag that not only has a URL containing /product/ but also has the visible text "View Details." Both must be true.
Using `or` for Flexibility: On the other hand, `or` is your go-to when an element could meet one of several conditions. This is super helpful for dealing with variations, like buttons in an A/B test. `//button[contains(., 'Submit') or contains(., 'Send')]` will find a button whether its label is "Submit" or "Send."
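A quick sketch of the `or` pattern against two hypothetical A/B-test variants:

```python
from lxml import html

# Two page variants from an A/B test (illustrative markup)
variant_a = html.fromstring('<form><button>Submit</button></form>')
variant_b = html.fromstring('<form><button>Send</button></form>')

# One selector handles both label variations
query = "//button[contains(., 'Submit') or contains(., 'Send')]"

print(len(variant_a.xpath(query)))  # 1 - matches the "Submit" variant
print(len(variant_b.xpath(query)))  # 1 - matches the "Send" variant too
```

The same scraper code now works no matter which variant the site serves you.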
Using XPath Axes with Contains
Now for the really powerful stuff: combining `contains()` with XPath axes. Axes let you navigate the family tree of an element: finding its parent, child, or a sibling next to it.
XPath axes are like giving your selector a GPS. Instead of just looking for an element somewhere on the page, you're telling it how to find that element relative to another landmark you already know.
Let's say you need to scrape a product's price. The price element itself has no unique class or ID. Bummer. But you do know it always appears right after the product title, and the title always contains the word "Laptop."
This is a perfect job for the `following-sibling` axis: `//h2[contains(., 'Laptop')]/following-sibling::span[contains(@class, 'price')]`
This beautiful little expression first finds the title containing "Laptop." From there, it navigates to its very next sibling that is a `<span>` and has a class attribute containing 'price'. This kind of relational targeting is a cornerstone of scraping dynamic sites, a topic we cover in more detail in our guide to web scraping with Selenium and Python.
Bringing It All Together: Real-World Examples and Performance Tuning
Theory is one thing, but putting `contains()` to work on a real project is where the rubber meets the road. Let's walk through how to apply these concepts in practice and, just as importantly, how to keep your scrapers from slowing to a crawl on complex websites.

Let's say you're scraping product details from an e-commerce site. The product titles are always in an `<h2>` tag, but the designers got a little creative with the classes. Some are `product-title featured`, while others are just `product-title`. The price is always sitting in a `<span>` right after the title.
Here's how you'd handle that with a quick Python script using `lxml`:
```python
from lxml import html

# A simplified chunk of e-commerce HTML (sample products are illustrative)
html_content = """
<div class="product-card featured">
    <h2 class="product-title featured">Wireless Mouse</h2>
    <span class="price-tag">$24.99</span>
</div>
<div class="product-card">
    <h2 class="product-title">Mechanical Keyboard</h2>
    <span class="price-tag">$79.99</span>
</div>
"""

tree = html.fromstring(html_content)

# Use contains() to grab all product containers
product_cards = tree.xpath("//div[contains(@class, 'product-card')]")

for card in product_cards:
    # Search within each card for the details
    name = card.xpath(".//h2[contains(@class, 'product-title')]/text()")[0]
    price = card.xpath(".//span[contains(@class, 'price-tag')]/text()")[0]
    print(f"Product: {name.strip()}, Price: {price.strip()}")
```

This is a perfect example of how `contains()` makes your scraper resilient to minor changes in attributes. You can see how this becomes essential when you learn how to scrape data from LinkedIn, where element classes are often long, complex, and dynamically generated.
The Hidden Cost of Bad XPath Performance
`contains()` is a fantastic tool, but it has a dark side. Used carelessly, it can become a serious performance bottleneck. A poorly written XPath can force the browser's engine to scan every single element on the page, and that's a recipe for a very, very slow scraper.
The biggest offender is starting a query with `//*`. An expression like `//*[contains(text(), 'Price')]` is a performance nightmare. You're literally telling the engine: "Stop what you're doing and search the text of every node in this entire document." On a modern, JavaScript-heavy page with thousands of elements, your scraper will grind to a halt.
Performance isn't just a "nice-to-have." A scraper that takes five seconds per page is fine for a one-off job of 100 pages. But when you need to scrape one million pages, that same scraper becomes completely unusable.
The solution is to give the XPath engine a better starting point. Always anchor your search to the most specific parent element you can find.
Simple Tips for Faster Queries
Boosting your XPath performance isn't about complex algorithms; it's about making a few smart choices that dramatically shrink the search area. Keep these in your back pocket for every scraper you build.
Kill the Global Search: Never start with `//*` if you can help it. Instead of `//*[contains(@class, 'item')]`, get specific: `//ul[@id='product-list']/li[contains(@class, 'item')]`. See the difference? You're telling the engine exactly where to look.
Anchor to an ID: The absolute fastest way to find an element is by its unique `id`. If a stable parent container has an `id`, use it as the starting point for your XPath. It's a direct, high-speed path to your target.
Work from the Outside In: Think like the DOM. Find a larger, stable container element first, then search within its children for the data you need. It’s far more efficient than asking the browser to search the entire page from scratch for every single field.
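The tips above boil down to shrinking the search area. Here's a sketch of the anchored pattern; the `product-info` id, the header markup, and the class names are illustrative assumptions:

```python
from lxml import html

# A page where "price" appears in unrelated places too
html_content = """
<header><span class="nav-price">Free shipping over $50</span></header>
<div id="product-info">
  <span class="price-tag">$89.99</span>
</div>
"""
tree = html.fromstring(html_content)

# Slow pattern: every element on the page gets inspected
global_hits = tree.xpath("//*[contains(@class, 'price')]")

# Fast pattern: anchor to a stable id, then search only inside it
anchored_hits = tree.xpath(
    "//div[@id='product-info']//span[contains(@class, 'price')]"
)

print(len(global_hits))    # 2 - also catches the unrelated header span
print(len(anchored_hits))  # 1 - just the real product price
```

As a bonus, the anchored version is more accurate as well as faster: it never even sees the lookalike element in the header.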
Making these small adjustments will ensure your scrapers are not only accurate but also fast enough to handle large-scale jobs. For more high-level strategies, check out our guide to web scraping best practices for developers.
Common Questions About Using XPath Contains
Once you start using `contains()` in your scraping projects, you'll inevitably run into a few common hurdles. I see these questions pop up all the time, and thankfully, the solutions are usually straightforward once you understand the underlying mechanics.
Let's walk through the most frequent gotchas and how to handle them like a pro.
Text vs. Dot: What Is the Real Difference?
This is probably the most critical distinction to get right. It's the difference between a scraper that works reliably and one that breaks unexpectedly.
Think of it this way: `text()` is extremely literal. It only looks at text nodes that are direct children of the element you're targeting. If any part of the text is wrapped in another tag, like a `<span>` or an `<em>`, `text()` won't see it. It's completely blind to nested content.
On the other hand, the dot (`.`) is what you'll want to use 99% of the time. The dot represents the current element and all its descendants. It essentially grabs the combined, rendered text you see on the screen, ignoring all the underlying HTML tags.
The dot is almost always your best bet for matching visible text. It makes your scraper far more resilient to minor HTML changes, like a developer deciding to bold a single word in a sentence.
How Do I Make a Search Case-Insensitive?
It’s a classic problem: you need to find "Product," but the site might use "product" or "PRODUCT." Unfortunately, XPath 1.0, which is what you'll find in most browsers and scraping libraries, doesn't have a simple function for this.
The standard, battle-tested solution is to use the `translate()` function.
The idea is to force both the text you're searching and your target substring into the same case (usually lowercase) before making the comparison.
Here's the pattern: `contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'product')`. It looks a little clunky, but it's the most reliable way to ensure your scraper doesn't fail just because of unpredictable capitalization.
Can Using Contains Slow Down My Scraper?
Oh, absolutely. If used carelessly, `contains()` can bring your scraper to a crawl. The biggest performance trap is starting an XPath with `//*`. This tells the engine to scan every single element on the page, which can be brutally slow on modern, complex websites.
The fix is simple: always start your path as specifically as possible. Anchor your search to a known, stable container whenever you can.
Slow: `//*[contains(@class, 'price')]`
Fast: `//div[@id='product-info']//span[contains(@class, 'price')]`
This one change can make a massive difference, reducing the search area from thousands of elements to just a handful. For any serious scraping project, this kind of optimization is non-negotiable.
When Is It Better to Use Starts-With?
While `contains()` is versatile, `starts-with()` is more precise for certain situations. You should reach for `starts-with()` when you know the beginning of an attribute or text is consistent, but the end is dynamic.
A perfect real-world example is an element with a dynamically generated ID, like `post-1a2b`, `post-9f3c`, and so on.
Using `//div[starts-with(@id, 'post-')]` is much cleaner and safer than using `//div[contains(@id, 'post-')]`. It eliminates the risk of accidentally matching another element that just happens to have "post-" somewhere else in its ID, giving you a far more robust selector.
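A short sketch makes the risk concrete; the `post-` IDs and the sidebar widget are illustrative:

```python
from lxml import html

# "post-" starts one id but appears mid-string in another
html_content = """
<div id="post-48213">Article body</div>
<div id="sidebar-post-archive">Archive widget</div>
"""
tree = html.fromstring(html_content)

# contains() matches the substring anywhere, so it grabs both divs
loose = tree.xpath("//div[contains(@id, 'post-')]")

# starts-with() pins the match to the beginning of the id
strict = tree.xpath("//div[starts-with(@id, 'post-')]")

print(len(loose))   # 2 - the sidebar widget sneaks in
print(len(strict))  # 1 - only the actual article matches
```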