How to Scrape Hidden Web Data (2024)

Modern websites store data not only in the visible HTML but in embedded javascript code as well. This is especially common for dynamic page elements that are rendered by javascript on page load or triggered by user interactions.

The most common way to scrape dynamic data is to use a headless browser to force the hidden data to render in the HTML. In this article, however, we'll be taking a look at how we can extract this data directly, without using a web browser at all - which can be a thousand times faster and more efficient.

We'll take a look at what hidden data is, some common examples of it and how we can scrape it using regular expressions and other clever parsing algorithms.

If you'd like to learn about the alternative approach of using headless browsers for this challenge, see our complete introduction article: How to Scrape Dynamic Websites Using Headless Web Browsers.

Dynamic web front-ends often store data in javascript variables and then render it as HTML on demand (like page load or user action). This means the data is not visible on the page directly though it's still there!

For example, a website could do this:

<html>
  <head></head>
  <body>
    <div id="product">
      <!-- There's no product data in the html -->
    </div>
    <script>
      // but we can see data here
      var data = {"product": {"name": "some product", "price": 44.33}};
      // and it's being put into the HTML on page load:
      productName = document.createElement("div");
      productName.setAttribute("id", "product-name");
      productName.innerText = data['product']['name'];
      product = document.getElementById("product");
      product.appendChild(productName);
    </script>
  </body>
</html>

We see that the initial HTML contains just an empty product <div> node while the data itself resides in the javascript variable data. Then, on page load, javascript is used to turn that data into visible HTML nodes. If we inspect the rendered page in a javascript-enabled browser we would see:

<div id="product"> <div id="product-name">some product</div></div>

Modern web developers love this technique as they can ship all of the data with the page and have the front-end present it any way they like.
Unfortunately, web scrapers that do not execute javascript (anything that doesn't run a browser) don't see this data rendered to HTML - meaning they have to find and parse those javascript variables themselves.
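For instance, here's a minimal sketch of what a browserless scraper sees on a page structured like our example above (the URL is a hypothetical placeholder):

import httpx
from parsel import Selector

# fetch the raw HTML - no javascript is ever executed this way
# (https://example.com/product is a placeholder URL for illustration)
html = httpx.get("https://example.com/product").text
selector = Selector(html)

# the product <div> is still empty since no javascript ran:
print(selector.css("#product *").getall())  # []
# but the data is right there in the <script> node:
print(selector.css("script::text").get())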

How to Find Hidden Web Data

We can approach hidden web data in two ways:

Tools like Playwright, Puppeteer and Selenium can be used to control a real, headless web browser to render the pages and return final rendered HTML. Though this is expensive and slow - we need to run a whole web browser and wait for everything to load!

Alternatively, we can parse the HTML for these hidden state/cache variables using HTML parsing tools, regular expressions and common parsing algorithms. We have to get our hands dirty but our process will be significantly faster and we'll have access to the whole dataset which might contain more details than we can see in the visible HTML.

Hidden web data also often contains tokens used by the website's hidden APIs, or details used to obfuscate data or block web scrapers.

Let's take a look at some common ways hidden data is stored and how we can find it.

To confirm whether the website contains hidden web data we can employ a simple test:

  1. Load the page in our web browser and find a unique data identifier (such as product name, id or part of the description).
  2. Disable javascript in our browser and reload the page.
  3. Check the page source (right click on the page and select "View Page Source") and look for our unique identifier (e.g. with ctrl+f). If the identifier is still present with javascript disabled, the data is hidden somewhere in the page source rather than rendered at runtime - we can also replicate this check in code, as sketched below.
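
Here's a minimal code version of the same test, assuming a hypothetical product page and identifier (both are placeholders):

import httpx

# placeholder URL and identifier for illustration:
html = httpx.get("https://example.com/product/1").text
if "some product" in html:
    print("identifier found in raw HTML - the data is hidden in the page source")
else:
    print("identifier missing - the data is likely rendered by javascript at runtime")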

Almost all forms of hidden data are stored in HTML nodes such as <script>, either as a JSON object or a javascript variable. So, the first thing we can do is capture the script text containing this data.

We can do this using common HTML parsing packages like parsel or beautifulsoup:

import json

html = """<html>
  <head></head>
  <body>
    <script id="__NEXT_DATA__" type="application/json">
      {"product": {"id": 1, "name": "first product"}}
    </script>
  </body>
</html>"""

# using parsel:
from parsel import Selector

selector = Selector(html)
data = selector.css("#__NEXT_DATA__::text").get()
data = json.loads(data)
print(data['product'])
# {"id": 1, "name": "first product"}

# using beautifulsoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
data = soup.select_one("#__NEXT_DATA__").text
data = json.loads(data)
print(data['product'])
# {"id": 1, "name": "first product"}

In both cases above we load the HTML and find the text of the <script> node with the specific id attribute. Then we load the found JSON data as a Python dictionary and can parse it as we wish!

This can often be enough to retrieve hidden data when it's stored as type="application/json" as it is in our example. However, that's not always the case and the data in the script can be hidden under a javascript variable instead.

Using Regex

Regular Expressions are perfect for finding structured text data such as JSON. For example, if our hidden data appears like this in the source code:

<script id="__NEXT_DATA__">
  // javascript data:
  var product = {"product": {"id": "1", "name": "first product"}};
  var _meta = ...
</script>

Python's json module is not smart enough to extract this on its own. Instead, we can assist it with regular expressions:

html = """<html> <head> </head> <body> <script id="__NEXT_DATA__"> // javascript data: var product = {"product": {"id": "1", "name": "first product"}}; var _meta = ... </script> </body></html>"""# find script text using parsel:from parsel import Selectorselector = Selector(html)script_text = selector.css("#__NEXT_DATA__::text").get()# find json using regular expressions:import reimport jsondata = re.findall(r"product = ({.*?});", script_text)data = json.loads(data[0])print(data["product"])

In the example above we used a regular expression pattern to select the text between the product = and }; tokens, which is the hidden JSON web data.
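
Note that this lazy pattern is fragile. As a hypothetical example, if the first }; token happens to appear inside a string value, the match gets truncated and fails to parse:

import re
import json

# made-up snippet where "};" appears inside a string value:
script_text = 'var product = {"note": "ends with };", "id": 1};'
match = re.findall(r"product = ({.*?});", script_text)[0]
print(match)  # {"note": "ends with }
try:
    json.loads(match)
except json.JSONDecodeError as error:
    print("truncated match:", error)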

Regular expressions work great but can get quite complicated and break easily. Another approach to extract this data is to use common data parsing algorithms - let's take a look at that next.

Using JSON Finding Algorithms

Python comes with a great JSON data decoder that can be used to find JSON documents in any text!

For example, here's a popular function that can find all valid JSON objects in a text string:

import json

def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1

text = """This text contains some {"json": "objects"} and some json products like
product = {"product": {"id": 1, "name": "first product"}};
console.log("more javascript");
"""
found = list(find_json_objects(text))
print(found)
# [{'json': 'objects'}, {'product': {'id': 1, 'name': 'first product'}}]

The function finds all JSON objects in any text string, which is much more convenient than our regex example. Also, since we know how our product data object looks (e.g. it contains a product key) we can select it exclusively without much extra effort:

product = next(data for data in found if data.get('product'))
print(product)
# {'product': {'id': 1, 'name': 'first product'}}

Finding Javascript Data

JSON is native to javascript, meaning javascript objects can contain javascript code itself - and that's where things get complicated. A valid javascript object is not necessarily a valid JSON data object. Let's take a look at this example:

text = """var product = { // some comment: "element": document.createElement("div"), "url": "http://foo.com", // some trailing comment "price": 44.23,"discount": 22.11, "features": ["warm", "cold"], "product": {"id": 1, "name": "first product"}}"""print(list(find_json_objects(text)))

Both our regex and JSON-finder based solutions would fail to parse this object successfully. That's because it is a valid javascript object but not a valid JSON data object: it contains comments and code expressions that our scraper cannot understand without a web browser.

There are a few ways we could approach this:

  • Remove comments and anything that is not a base data type (string, number, boolean etc.) and then use our JSON finder.
  • Parse javascript code using javascript language parsers and then extract that data.

Depending on your project's size and complexity, either of these approaches could be the better fit. For smaller projects we can hack our JSON finder to remove the garbage data, though for bigger projects we'd probably want to invest more time into a more resilient language-parsing-based approach.
Let's take a look at both!

Removing Javascript from JSON

To convert javascript objects to JSON objects, all we have to do is remove comments and any values that are not primitive data types like strings, booleans or numbers.

To clear out the non-primitive values we can use regular expressions, and for comments we can take advantage of existing packages like pyparsing:

import re
import json
import pyparsing

# comment remover that ignores anything inside quoted strings:
comment_remover = pyparsing.cpp_style_comment.suppress()
comment_remover.ignore(pyparsing.QuotedString('"') | pyparsing.QuotedString("'"))

def remove_objects(text):
    """replaces all `"key": <non-primitive value>` occurrences in text with `"key": {}`"""
    text = comment_remover.transform_string(text)

    def _rm(match: re.Match):
        key, value, trail = match.groups()
        return key + "{}" + trail

    return re.sub(r'("[^"]+?"\s*:\s*)([^"\s[{\d(?:true|false)].+?)(,|$|})', _rm, text)

# let's try it with our text:
text = """var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://foo.com",  // some trailing comment
    "price": 44.23,
    "discount": 22.11,
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first product"}
}"""
clean_text = remove_objects(text)
print(list(find_json_objects(clean_text)))
# will print:
# [{
#     "element": {},
#     "url": "http://foo.com",
#     "price": 44.23,
#     "discount": 22.11,
#     "features": ["warm", "cold"],
#     "product": {"id": 1, "name": "first product"},
# }]

With this quick hack we can easily scrape more complex embedded JSON structures. Though, we are losing all of that javascript data - what if there's something valuable there? Additionally, regular expression patterns, although fast, are complicated and can break easily when the website changes.
Let's take a look at another approach - parsing the javascript code itself.

Parsing Javascript with js2xml

Just like javascript interpreters parse code to understand it, we can also parse it ourselves to extract variable data.

Using js2xml we can convert javascript code (including JSON) to an XML document which we can parse using CSS or XPath selectors. Let's take a look at our example again:

import js2xml
from js2xml.utils.vars import get_vars, make_obj

text = """var product = {
    // some comment:
    "element": document.createElement("div"),
    "url": "http://fo,o.com",  // some trailing comment
    "price": 44.23,
    "discount": document.deleteElement(foo),
    "features": ["warm", "cold"],
    "product": {"id": 1, "name": "first, product"}
}"""

# first convert javascript code to an XML tree (returns an lxml Element)
parsed_tree = js2xml.parse(text)
# we can see the generated XML tree:
print(js2xml.pretty_print(parsed_tree))
"""
<program>
  <var name="product">
    <object>
      <property name="element">
        <functioncall>
          <function>
            <dotaccessor>
              <object>
                <identifier name="document"/>
              </object>
              <property>
                <identifier name="createElement"/>
              </property>
            </dotaccessor>
          </function>
          <arguments>
            <string>div</string>
          </arguments>
        </functioncall>
      </property>
      <property name="url">
        <string>http://fo,o.com</string>
      </property>
      <property name="price">
        <number value="44.23"/>
      </property>
      <property name="discount">
        <functioncall>
          <function>
            <dotaccessor>
              <object>
                <identifier name="document"/>
              </object>
              <property>
                <identifier name="deleteElement"/>
              </property>
            </dotaccessor>
          </function>
          <arguments>
            <identifier name="foo"/>
          </arguments>
        </functioncall>
      </property>
      <property name="features">
        <array>
          <string>warm</string>
          <string>cold</string>
        </array>
      </property>
      <property name="product">
        <object>
          <property name="id">
            <number value="1"/>
          </property>
          <property name="name">
            <string>first, product</string>
          </property>
        </object>
      </property>
    </object>
  </var>
</program>
"""

# we can also extract this tree as a python dictionary:
print(get_vars(parsed_tree))
# {
#     "product": {
#         "element": None,
#         "url": "http://fo,o.com",
#         "price": 44.23,
#         "discount": None,
#         "features": ["warm", "cold"],
#         "product": {"id": 1, "name": "first, product"},
#     }
# }

# or if the json is deep in the code we can find it with xpath and then convert it:
print(make_obj(parsed_tree.xpath('//property[@name="product"]/object')[0]))
# {"id": 1, "name": "first, product"}

In the example above we used js2xml to convert javascript code to XML, which we can then either parse with CSS/XPath selectors or convert straight to Python dictionaries.

Some Real Examples

We encounter hidden web data often in our scrapeguide blog series, which covers tutorials on scraping popular targets.

For example, we use simple regex patterns when scraping https://www.glassdoor.com/index.htm in our How to Scrape Glassdoor (2024 update) article:

import re
import json
import httpx

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    # here we use a regex pattern to find the first json object after the apolloState keyword:
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)

def scrape_overview(company_id: int):
    short_url = f"https://www.glassdoor.com/Overview/-IE_EI{company_id}.htm"
    response = httpx.get(short_url)
    apollo_state = extract_apollo_state(response.text)
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))

# Ebay's glassdoor profile page:
print(json.dumps(scrape_overview("7671"), indent=2))

Some other hidden web data examples we've covered on this blog:

  • How to Scrape Indeed.com (2024 Update)
  • How to Scrape Zoominfo Company Data (2024 Update)
  • How to Scrape Wellfound Company Data and Job Listings

Using Headless Browsers

We covered how scraping hidden web data can be an alternative to using headless browsers to fully render dynamic data. The reverse also works: we can use headless browsers to read javascript variables present on the page, which gives us the fully rendered hidden web datasets.

For example, let's say we have this hidden web data piece:

html = """<html> <head> </head> <body> <script id="__NEXT_DATA__"> var product = { "product": { "id": "1", "name": "first product", "secret": create_secret() }}; var _meta = ... </script> </body></html

Here we can see that the secret field is dynamically generated by a javascript function. If we scraped this as is, we'd just get the function call rather than its value in our data.
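
In fact, our find_json_objects helper from earlier would come up empty here, since create_secret() is not a valid JSON value:

# the script text from the example above, flattened for brevity:
script_text = 'var product = {"product": {"id": "1", "name": "first product", "secret": create_secret()}};'
print(list(find_json_objects(script_text)))
# [] - every candidate object fails to decode at the create_secret() call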

Instead, we can fire up a real, headless web browser through Playwright, Puppeteer or Selenium and evaluate custom javascript to capture this data.

As a real-life example, let's go back to Glassdoor and see how we could do this in Playwright and Python:

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context(viewport={"width": 1920, "height": 1080})
    page = context.new_page()
    # go to the glassdoor url
    page.goto("https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm")
    # extract apolloState data; Employer:7853 contains the company overview
    # data of ebay, which is ID 7853:
    data = page.evaluate("window.appCache.apolloState['Employer:7853']")
    print(data)
# will print:
# {
#     '__typename': 'Employer',
#     'id': 7853,
#     'shortName': 'eBay',
#     'website': 'www.ebayinc.com',
#     'type': 'Company - Public',
#     'revenue': '$10+ billion (USD)',
#     'headquarters': 'San Jose, CA',
#     'size': '10000+ Employees',
#     'stock': 'EBAY',
#     ...
# }

In the example above we fire up a headless instance of a Chrome browser, tell it to go to eBay's profile page on glassdoor.com and extract the hidden web data through the javascript evaluation function.

Hidden data is not overly complex to scrape, but it can quickly become a tough issue when scaling scrapers up. For this, we made Scrapfly!

ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

  • Anti-bot protection bypass - scrape web pages without blocking!
  • Rotating residential proxies - prevent IP address and geographic blocks.
  • JavaScript rendering - scrape dynamic web pages through cloud browsers.
  • Full browser automation - control browsers to scroll, input and click on objects.
  • Format conversion - scrape as HTML, JSON, Text, or Markdown.
  • Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.

For example, we can replicate our Glassdoor example using ScrapFly SDK:

from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="YOUR SCRAPFLY KEY")
result = client.scrape(ScrapeConfig(
    url="https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm",
    # enable headless browser use and evaluate a javascript script:
    render_js=True,
    js="return window.appCache.apolloState['Employer:7853']",
    # we can tell the headless browser to wait 2 seconds for the content to load:
    rendering_wait=2_000,
    # we can set a specific proxy country:
    country="CA",
    # we can also take screenshots to see what our browser is doing:
    screenshots={"fullpage": "fullpage"},
))
# get the javascript result:
print(result.scrape_result['browser_data']['javascript_evaluation_result'])

FAQ

To wrap this article up let's take a look at some frequently asked questions regarding the scraping of hidden web data:

Is it legal to scrape hidden web data?

Yes, hidden web data is the same public data as the visible HTML, just stored in a different format. Note that due to GDPR in the European Union, scraped hidden web data should be cleared of user-identifying information.

Summary

Hidden web data is becoming increasingly popular as websites rely more and more on javascript to generate web content dynamically. So, in this extensive tutorial, we've taken a look at how to find this data, how to parse it and what the common challenges in this area are.

We explored common regular expression patterns, JSON-finding algorithms and tools like js2xml and pyparsing for lexical data parsing - all of which are great tools for finding public hidden datasets on the web.
