Scrapy vs. Beautiful Soup: Which is the best choice for your business? (2024)

When it comes to web scraping, Python offers a large variety of tools to choose from. Selenium, MechanicalSoup, Scrapy, Requests, Beautiful Soup, and lxml are often used within this context. However, these tools are not created equal, as each of them has its own set of use cases in which it shines. Some of them are even complementary, as this article will demonstrate.

In this article, you’ll take a closer look at Scrapy and Beautiful Soup, two popular choices for web scraping.

Beautiful Soup is a parsing library. It enables the navigation of documents using CSS selectors and its own traversal methods, which facilitates the transformation of data from markup languages (such as HTML and XML) into structured data. In contrast, Scrapy is a complete web scraping framework that crawls, downloads, processes, and (optionally) stores documents.

Learn more about web scraping with Beautiful Soup.

In this comparison, you’ll consider the following aspects: crawling usability, scraping usability, speed, multistep execution, proxy rotation, and CAPTCHA solving.

Scrapy vs. Beautiful Soup: Quick Comparison

If you’re in a hurry, here’s a quick comparison between Scrapy and Beautiful Soup for web scraping with Python.

Scrapy is a comprehensive web scraping framework that is perfect for large-scale data extraction projects and offers built-in support for crawling, whereas Beautiful Soup is a parsing library best suited for smaller, more straightforward scraping tasks without built-in crawling capabilities.

Scrapy excels in speed and efficiency for extensive scraping operations, and Beautiful Soup shines in simplicity and ease of use for quick tasks. Choose Scrapy for complex projects or Beautiful Soup for simple, direct parsing needs.

Scrapy

Scrapy is an all-in-one suite for crawling the web, downloading documents, processing them, and storing the resulting data in an accessible format. Installing Scrapy is easily done with pip or conda:

```
pip install scrapy
conda install -c conda-forge scrapy
```

Web Crawling with Scrapy

Scrapy helps you crawl sets of pages and websites to gather URLs to scrape or to discover if a page contains the specific information you’re looking for. Scrapy works with spiders, which are Python classes in which one can define how to navigate a website, how deep it should go in the website structure, which data it should extract, and how it should be stored. To assemble a list of URLs, Scrapy can navigate HTML, XML, and CSV documents and even load sitemaps.

On top of that, Scrapy offers the Scrapy shell, an interactive shell for testing and debugging XPath and CSS expressions on specific pages. Using the shell can save you time when it comes to crawling and scraping since it eliminates the need to restart the spider every time you make changes.
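
For example, launching the shell against a page drops you into an interactive session where selectors can be tried immediately (a minimal sketch; the URL and selectors are placeholders):

```
$ scrapy shell 'https://edition.cnn.com'
>>> response.status                         # inspect the fetched page
>>> response.css('title::text').get()       # try a CSS selector interactively
>>> fetch('https://edition.cnn.com/world')  # load another page in the same session
```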

Web Scraping with Scrapy

When it comes to scraping, you usually need a lot of flexibility. Scrapy offers two ways of selecting items in a document: XPath and CSS expressions. XPath expressions work on both XML and HTML documents, while CSS selectors apply to HTML only.
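
For illustration, the same elements can be selected either way inside a spider’s parse() method (the h2 selector is just an example):

```python
def parse(self, response):
    titles_css = response.css('h2::text').getall()         # CSS selector with Scrapy's ::text extension
    titles_xpath = response.xpath('//h2/text()').getall()  # equivalent XPath expression
    yield {'titles': titles_css or titles_xpath}
```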

A unique Scrapy feature is the ability to define pipelines. When an item is scraped, it can be sent to a pipeline in which a sequence of actions is performed on it: cleaning, validation, hashing, deduplication, and enrichment.
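
As a minimal sketch, a hypothetical pipeline performing cleaning, validation, and deduplication could look like this (the field and project names are assumptions):

```python
from scrapy.exceptions import DropItem

class CleanAndDeduplicatePipeline:
    """Hypothetical pipeline: cleans, validates, and deduplicates scraped items."""

    def open_spider(self, spider):
        self.seen = set()

    def process_item(self, item, spider):
        text = " ".join(item.get("output", [])).strip()  # cleaning: flatten and trim
        if not text:
            raise DropItem("missing output")             # validation
        if text in self.seen:
            raise DropItem("duplicate output")           # deduplication
        self.seen.add(text)
        item["output"] = text
        return item
```

The pipeline is then enabled in the project settings, for example with ITEM_PIPELINES = {"myproject.pipelines.CleanAndDeduplicatePipeline": 300}.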

Speed

Another important aspect of scraping web documents is the time it takes. Assessing Scrapy’s speed is not straightforward because the framework carries significant one-time startup overhead. For this reason, the benchmark below loads the overhead only once, while the crawling and extracting happen ten times.

In the following example, the h2 of a simple (i.e., nondynamic) web page is extracted. All code runs in a Jupyter Notebook.

First, load the required Scrapy libraries:

```python
import scrapy
from scrapy.crawler import CrawlerProcess
```

Second, establish the MySpider class that describes the scraping job:

```python
class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://edition.cnn.com'  # Or repeat this 10 times to calculate marginal time
    ]

    def parse(self, response):
        yield {'output': response.css('h2.container_lead-package__title_url-text::text').extract()}

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "scrapy_output.json": {"format": "json", "overwrite": True}
        }
    }
)
process.crawl(MySpider)
```

Third, run the script and time it:

```python
%%timeit -n 1 -r 1
process.start()
```

The sequence of crawling, scraping, and storing a single web document took approximately 400 milliseconds. However, repeating the same process ten times took only 1,200 milliseconds: the nine additional documents added just 800 milliseconds, or roughly 90 milliseconds each, which is impressive. Given the one-time overhead, Scrapy should be your first choice for intensive jobs.

Multistep Scraping with Scrapy

Many websites, including some of the most popular ones, like X/Twitter, Substack, and LinkedIn, are dynamic. This means that large swaths of information are hidden behind login screens, search queries, pop-ups, scrolls, or mouseovers. Consequently, having your spider simply visit a page is often not enough to extract data from it.

Scrapy offers various approaches for handling these jobs as a stand-alone tool: one could craft the necessary HTTP requests or execute the relevant JavaScript snippets. However, using a headless browser offers the most flexibility. For example, there are Playwright and Selenium integrations for Scrapy that can be used for interfacing with dynamic elements.
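
As a sketch, the scrapy-playwright integration is wired in through Scrapy’s download handlers; the settings below follow the library’s documented pattern, and the URL is a placeholder:

```python
import scrapy

class DynamicSpider(scrapy.Spider):
    name = "dynamic"

    # Route downloads through Playwright (scrapy-playwright's documented wiring)
    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # meta={"playwright": True} renders the page in a headless browser first
        yield scrapy.Request("https://edition.cnn.com", meta={"playwright": True})

    def parse(self, response):
        yield {"output": response.css("h2::text").getall()}
```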

Proxy Rotation and CAPTCHA Prevention with Scrapy

The arrival of large language models has motivated many companies to fine-tune models, but this requires specific (often scraped) data. Additionally, many organizations don’t want bots straining their website’s servers and have no commercial interest in sharing their data. This is why many websites are not only set up as dynamic but also introduce anti-scraping technologies, such as automatic IP blocking and CAPTCHAs.

Scrapy doesn’t offer out-of-the-box tools for rotating proxies (and IP addresses) to prevent getting locked out. However, Scrapy can be extended through the middleware framework, a set of hooks for modifying Scrapy’s request and response processing. To rotate proxies, one can attach a Python module that is specifically made for doing so, such as scrapy-rotating-proxies. Through the same mechanism, one can attach the DeCAPTCHA module.
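
As an example, scrapy-rotating-proxies is enabled entirely through settings (the configuration below follows the package’s documentation; the proxy addresses are placeholders):

```python
# settings.py
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",  # substitute your own proxies here
    "proxy2.example.com:8031",
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```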

Beautiful Soup

Unlike Scrapy, Beautiful Soup does not offer a full-suite solution for extracting and processing data from web documents; it only covers the parsing part. You just need to feed it a downloaded document, and Beautiful Soup can turn it into structured data through CSS selectors and its document-navigation methods.

Installing Beautiful Soup can be done via pip or conda:

```
pip install beautifulsoup4
conda install -c anaconda beautifulsoup4
```

Web Crawling with Beautiful Soup

While Scrapy deploys spiders to navigate a website, Beautiful Soup does not offer such capabilities. However, with some Python creativity, using both Beautiful Soup and the Requests library, one can write a script to navigate a website to a certain depth. Nevertheless, it’s certainly not as easy as with Scrapy.
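
A minimal sketch of such a script, using a breadth-first traversal with a depth limit (the URL and depth are placeholders):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2):
    """Yield (url, html) pairs for pages reachable within max_depth clicks."""
    seen, queue = {start_url}, [(start_url, 0)]
    while queue:
        url, depth = queue.pop(0)
        html = requests.get(url, timeout=10).text
        yield url, html
        if depth == max_depth:
            continue
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])
            if target.startswith(start_url) and target not in seen:  # stay on-site
                seen.add(target)
                queue.append((target, depth + 1))

for url, html in crawl("https://edition.cnn.com", max_depth=1):
    print(url)
```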

Web Scraping with Beautiful Soup

Web scraping is what makes Beautiful Soup 4 tick. Not only does it offer CSS selectors, but it also comes with a multitude of methods to traverse documents. When documents have a complex structure, methods like .parent and .next_sibling can extract elements that are otherwise hard to reach. Additionally, through find_all() and similar methods, you can specify text filters, regular expressions, and even custom functions to find the required elements.
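
A small self-contained example of these traversal methods and filters:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h2 id='lead'>Top story</h2><p>Body</p></div>", "html.parser")

heading = soup.find("h2")
print(heading.parent.name)    # 'div': climb upward via .parent
print(heading.next_sibling)   # the <p> element that follows the heading

# find_all() accepts regular expressions and custom functions as filters
print(soup.find_all(string=re.compile("story")))      # text filter via regex
print(soup.find_all(lambda tag: tag.has_attr("id")))  # custom-function filter
```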

Finally, Beautiful Soup has various output formatters to pretty-print output, encode it, remove Microsoft’s smart quotes, and even parse and validate HTML.
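
For instance, prettify() re-indents a parse tree, and its formatter argument controls how characters are encoded (a minimal sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Caf&eacute; &amp; more</p>", "html.parser")
print(soup.prettify())                  # re-indented tree with decoded Unicode text
print(soup.prettify(formatter="html"))  # same tree, characters re-encoded as HTML entities
```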

Speed

Unlike Scrapy, Requests and Beautiful Soup have practically no startup overhead, so the code can simply run ten times to assess its speed.

First, load the required libraries:

```python
import requests, json
from bs4 import BeautifulSoup
```

Second, time the code by wrapping it in a timeit magic command:

```python
%%timeit -n 10 -r 1
page = requests.get('https://edition.cnn.com')
page_html = BeautifulSoup(page.text, 'html.parser')
page_html = page_html.select_one('h2.container_lead-package__title_url-text').text
json_object = json.dumps({'output': page_html})
with open("bs4_output.json", "w") as output_file:
    output_file.write(json_object)
```

Running it once takes approximately 300 milliseconds, and running it ten times takes around 3,000 milliseconds; with no overhead to amortize, the per-document cost stays flat, making this approach considerably slower than Scrapy for repeated jobs. However, it requires a lot less configuration and relatively little knowledge of a particular framework.

Multistep Scraping with Beautiful Soup

Since Beautiful Soup only parses the documents it is given, it cannot handle dynamic web pages on its own. However, like Scrapy, it works perfectly well together with browser automation tools, such as Playwright, Puppeteer, and Selenium. Pairing automation tools with Beautiful Soup always works the same way: the headless browser handles the dynamic elements, while Beautiful Soup extracts the data rendered in the browser.
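
A minimal sketch with Playwright’s synchronous API (the URL is a placeholder):

```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://edition.cnn.com")  # the browser executes any JavaScript
    html = page.content()                 # fully rendered HTML
    browser.close()

# Beautiful Soup then parses the rendered document as usual
soup = BeautifulSoup(html, "html.parser")
print([h2.get_text(strip=True) for h2 in soup.find_all("h2")])
```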

Proxy Rotation and CAPTCHA Prevention with Beautiful Soup

Since Beautiful Soup is a scraping tool and not a crawling tool, it doesn’t offer any tools to prevent getting blocked by a website’s servers. If you need this, these features should be part of the crawling tool you choose.

Conclusion

This article outlined how Beautiful Soup and Scrapy differ in usability for web crawling and web scraping in terms of speed, handling of dynamic web documents, and circumvention of anti-scraping measures.

As an end-to-end tool, Scrapy is a clear favorite for day-to-day scraping jobs. However, it does require some middleware to scrape dynamic websites and to ensure one does not get blocked.

Although Beautiful Soup (together with the Requests package) is quite slow, it offers a very familiar and simple way to handle ad hoc scraping jobs. Like Scrapy, it requires extra tools for scraping dynamic websites and for blocking prevention.

If you’re looking for a one-stop shop for scraping websites, consider Bright Data. Bright Data offers numerous products, such as proxy services and Web Unlocker, to assist with all your web scraping needs, no matter which option you decide to use.


Interested in learning how to integrate Bright Data proxies? Read our Scrapy proxies integration guide and our BeautifulSoup proxies guide.
