How to Scrape Crunchbase in 2024

In this tutorial, we'll explain how to scrape Crunchbase - the most extensive public resource for financial information on public and private companies and their investments.

Crunchbase contains thousands of company profiles, which include investment data, funding information, leadership positions, mergers, news and industry trends.

To scrape Crunchbase, we'll use the hidden web data scraping approach in Python with an HTTP client library.

Mostly, we'll focus on capturing company data through generic scraping techniques which can be applied to other Crunchbase areas, such as people or acquisition data, with minimal effort. Let's dive in!

Latest Crunchbase.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Legal Disclaimer and Precautions

This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect; here's a good summary of what not to do:

  • Do not scrape at rates that could damage the website.
  • Do not scrape data that's not available publicly.
  • Do not store PII of EU citizens who are protected by GDPR.
  • Do not repurpose entire public datasets, which can be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for more, you should consult a lawyer.

Why Scrape Crunchbase.com?

Crunchbase has an enormous business dataset that can be used in various forms of market analytics and business intelligence research. For example, the company dataset contains the company's summary details (like description, website and address), public financial information (like acquisitions and investments) and data on the technologies the company uses.

Additionally, Crunchbase data contains many data points used in lead generation, like the company's contact details, leadership's social profiles, and event aggregation.

For more on scraping use cases, see our extensive web scraping use case article.

Project Setup

To scrape Crunchbase, we'll be using Python and two major community packages:

  • httpx - HTTP client library which will let us communicate with crunchbase.com's servers
  • parsel - HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly be working with JSON data directly instead.

Optionally, we'll also use loguru - a pretty logging library that'll help us keep track of what's going on through nice colorful logs.

These packages can be easily installed using the pip command:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we'll only need basic HTTP functions, which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
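To illustrate how interchangeable they are, here's a minimal sketch of the same basic GET request in both clients (the robots.txt URL is just a convenient public page to fetch):

import httpx
import requests

# the basic request API is nearly identical between the two libraries:
response = httpx.get("https://www.crunchbase.com/robots.txt")
print(response.status_code, len(response.text))

response = requests.get("https://www.crunchbase.com/robots.txt")
print(response.status_code, len(response.text))

The main difference is that httpx also supports asyncio and HTTP/2, both of which we'll rely on later in this tutorial.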

If you're new to web scraping with Python, we recommend checking out our Hands on Python Web Scraping Tutorial and Example Project - a full introduction to web scraping with Python and common best practices.

Available Crunchbase Targets

Crunchbase contains several data types: acquisitions, people, events, hubs and funding rounds. Our Crunchbase scraper will focus on the company and people data. That being said, the same technical concepts can be applied to other pages on Crunchbase.


You can explore available data types by taking a look at the Crunchbase discovery pages.

Finding Crunchbase Companies and People

To start scraping Crunchbase.com content, we need a way to find all of the company and people URLs. Although Crunchbase offers a search system, it's only available to premium users. So, how do we find these targets?

Crunchbase offers a sitemap directory that contains all of its target URLs to be crawled and indexed by search engines. Let's start by taking a look at the crunchbase.com/robots.txt endpoint:

User-agent: *
Allow: /v4/md/applications/crunchbase
Disallow: /login
<...>
Sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-index.xml

The /robots.txt page indicates crawling suggestions for various web crawlers, such as Google. We can see that there's a sitemap index that contains indexes for various target pages in an XML format:

<?xml version='1.0' encoding='UTF-8'?>
<sitemapindex xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <sitemap>
    <loc>https://www.crunchbase.com/www-sitemaps/sitemap-acquisitions-2.xml.gz</loc>
    <lastmod>2022-07-06T06:05:33.000Z</lastmod>
  </sitemap>
  <...>
  <sitemap>
    <loc>https://www.crunchbase.com/www-sitemaps/sitemap-events-0.xml.gz</loc>
    <lastmod>2022-07-06T06:09:30.000Z</lastmod>
  </sitemap>
  <...>
  <sitemap>
    <loc>https://www.crunchbase.com/www-sitemaps/sitemap-funding_rounds-9.xml.gz</loc>
    <lastmod>2022-07-06T06:10:49.000Z</lastmod>
  </sitemap>
  <...>
  <sitemap>
    <loc>https://www.crunchbase.com/www-sitemaps/sitemap-hubs-1.xml.gz</loc>
    <lastmod>2022-07-06T06:05:10.000Z</lastmod>
  </sitemap>
  <...>
  <sitemap>
    <loc>https://www.crunchbase.com/www-sitemaps/sitemap-organizations-42.xml.gz</loc>
    <lastmod>2022-07-06T06:10:35.000Z</lastmod>
  </sitemap>
  <...>
  <sitemap>
    <loc>https://www.crunchbase.com/www-sitemaps/sitemap-people-29.xml.gz</loc>
    <lastmod>2022-07-06T06:09:25.000Z</lastmod>
  </sitemap>
</sitemapindex>

We can see that this page contains sitemap index pages for acquisitions, events, funding rounds, hubs as well as companies and people.
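As a side note, rather than reading robots.txt by hand, we can pull the advertised sitemap locations out programmatically. A minimal sketch using httpx:

import httpx

# fetch robots.txt and collect every advertised sitemap location
response = httpx.get("https://www.crunchbase.com/robots.txt")
sitemap_urls = [
    line.split(":", 1)[1].strip()
    for line in response.text.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemap_urls)  # should include the sitemap-index.xml url seen above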

Each sitemap can contain a maximum of 50,000 URLs, so through the 40+ organization sitemaps and 30 people sitemaps in this index we can currently find over 2 million companies and almost 1.5 million people!

Further, each entry carries a last update date in its <lastmod> node, so we also know when each page was last modified. Next, let's explore how to scrape this XML data.

Scraping Sitemaps

To scrape sitemaps, we'll download the sitemap indexes using our httpx client and parse the URLs using parsel:

import gzip
from datetime import datetime
from typing import Iterator, List, Literal, Tuple

import httpx
from loguru import logger as log
from parsel import Selector


async def _scrape_sitemap_index(session: httpx.AsyncClient) -> List[str]:
    """scrape Crunchbase Sitemap index for all sitemap urls"""
    log.info("scraping sitemap index for sitemap urls")
    response = await session.get("https://www.crunchbase.com/www-sitemaps/sitemap-index.xml")
    sel = Selector(text=response.text)
    urls = sel.xpath("//sitemap/loc/text()").getall()
    log.info(f"found {len(urls)} sitemaps")
    return urls


def parse_sitemap(response) -> Iterator[Tuple[str, datetime]]:
    """parse sitemap for location urls and their last modification times"""
    sel = Selector(text=gzip.decompress(response.content).decode())
    urls = sel.xpath("//url")
    log.info(f"found {len(urls)} in sitemap {response.url}")
    for url_node in urls:
        url = url_node.xpath("loc/text()").get()
        last_modified = datetime.fromisoformat(url_node.xpath("lastmod/text()").get().strip("Z"))
        yield url, last_modified


async def discover_target(target: Literal["organizations", "people"], session: httpx.AsyncClient, min_last_modified=None):
    """discover urls from a specific sitemap type"""
    sitemap_urls = await _scrape_sitemap_index(session)
    urls = [url for url in sitemap_urls if target in url]
    log.info(f"found {len(urls)} matching sitemap urls (from total of {len(sitemap_urls)})")
    for url in urls:
        log.info(f"scraping sitemap: {url}")
        response = await session.get(url)
        for url, mod_time in parse_sitemap(response):
            if min_last_modified and mod_time < min_last_modified:
                continue  # skip entries modified before the cutoff
            yield url

Above, our code retrieves the central sitemap index and collects all sitemap URLs. Then, we scrape each sitemap URL matching either the people or organizations pattern. Let's run this Crunchbase scraper code and see the data it returns:

Run code and example output
# append this to the previous code snippet to run it:
import asyncio

BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        print("discovering companies:")
        async for url in discover_target("organizations", session):
            print(url)
        print("discovering people:")
        async for url in discover_target("people", session):
            print(url)


if __name__ == "__main__":
    asyncio.run(run())
discovering companies:
INFO | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO | _scrape_sitemap_index - found 89 sitemaps
INFO | discover_target - found 43 matching sitemap urls (from total of 89)
INFO | discover_target - scraping sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-organizations-0.xml.gz
INFO | parse_sitemap - found 50000 in sitemap https://www.crunchbase.com/www-sitemaps/sitemap-organizations-0.xml.gz
https://www.crunchbase.com/organization/tesla
<...>
discovering people:
INFO | _scrape_sitemap_index - scraping sitemap index for sitemap urls
INFO | _scrape_sitemap_index - found 89 sitemaps
INFO | discover_target - found 30 matching sitemap urls (from total of 89)
INFO | discover_target - scraping sitemap: https://www.crunchbase.com/www-sitemaps/sitemap-people-0.xml.gz
INFO | parse_sitemap - found 50000 in sitemap https://www.crunchbase.com/www-sitemaps/sitemap-people-0.xml.gz
https://www.crunchbase.com/person/john-doe
<...>

Cool! By exploring the Crunchbase sitemap, we can successfully discover pages on the website.
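As a side note, since parse_sitemap yields each URL's <lastmod> timestamp, the min_last_modified parameter of discover_target can be used for incremental discovery. Here's a minimal sketch (reusing discover_target and BASE_HEADERS from the snippets above) that only prints company profiles updated in the last week:

import asyncio
from datetime import datetime, timedelta

import httpx

async def discover_recent():
    # only yield organization urls modified within the last 7 days
    async with httpx.AsyncClient(headers=BASE_HEADERS, http2=True) as session:
        one_week_ago = datetime.utcnow() - timedelta(days=7)
        async for url in discover_target("organizations", session, min_last_modified=one_week_ago):
            print(url)

asyncio.run(discover_recent())

Next, let's explore how to scrape Crunchbase for this data using the URLs we've collected.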

Scraping Crunchbase Companies

The Crunchbase company page contains various data spread across multiple pages (tabs).


However, instead of parsing the HTML, we can dig into the page source and see that the same data is also available in the page's app state variable.


We can see that a <script id="client-app-state"> node contains a large JSON document with a lot of the same details we see on the page. Since Crunchbase uses the Angular JavaScript front-end framework, it stores the page's data in a state cache, which we can extract directly instead of parsing the HTML. Let's take a look at how we can apply that:

import asyncio
import json
from typing import Dict, List, TypedDict

import httpx
from parsel import Selector


class CompanyData(TypedDict):
    """Type hint for data returned by Crunchbase company page parser"""
    organization: Dict
    employees: List[Dict]


def _parse_organization_data(data: Dict) -> Dict:
    """example that parses main company details from the whole company dataset"""
    properties = data['properties']
    cards = data['cards']
    parsed = {
        # there's metadata in the properties field:
        "name": properties['title'],
        "id": properties['identifier']['permalink'],
        "logo": "https://res.cloudinary.com/crunchbase-production/image/upload/" + properties['identifier']['image_id'],
        "description": properties['short_description'],
        # but most of the data is in the cards field:
        "semrush_global_rank": cards['semrush_summary']['semrush_global_rank'],
        "semrush_visits_latest_month": cards['semrush_summary']['semrush_visits_latest_month'],
        # etc... There's much more data!
    }
    return parsed


def _parse_employee_data(data: Dict) -> List[Dict]:
    """example that parses employee details from the whole employee dataset"""
    parsed = []
    for person in data['entities']:
        parsed.append({
            "name": person['properties']['name'],
            "linkedin": person['properties'].get('linkedin'),
            "job_levels": person['properties'].get('job_levels'),
            "job_departments": person['properties'].get('job_departments'),
            # etc...
        })
    return parsed


def _unescape_angular(text):
    """Helper function to unescape Angular quoted text"""
    ANGULAR_ESCAPE = {
        "&a;": "&",
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">",
    }
    for from_, to in ANGULAR_ESCAPE.items():
        text = text.replace(from_, to)
    return text


def parse_company(response) -> CompanyData:
    """parse company page for company and employee data"""
    sel = Selector(text=response.text)
    app_state_data = sel.css("script#ng-state::text").get()
    if not app_state_data:
        app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get() or "")
    app_state_data = json.loads(app_state_data)
    # there are multiple caches:
    cache_keys = list(app_state_data["HttpState"])
    # Organization data can be found in this cache:
    data_cache_key = next(key for key in cache_keys if "entities/organizations/" in key)
    # Some employee/contact data can be found in this key:
    people_cache_key = next(key for key in cache_keys if "/data/searches/contacts" in key)
    organization = app_state_data["HttpState"][data_cache_key]["data"]
    employees = app_state_data["HttpState"][people_cache_key]["data"]
    return {
        "organization": _parse_organization_data(organization),
        "employees": _parse_employee_data(employees),
    }


async def scrape_company(company_id: str, session: httpx.AsyncClient) -> CompanyData:
    """scrape crunchbase company page for organization and employee data"""
    # note: we use the /people tab because it contains the most data:
    url = f"https://www.crunchbase.com/organization/{company_id}/people"
    response = await session.get(url)
    return parse_company(response)
Run code and example output
# append this to the previous code snippet to run it:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        data = await scrape_company("tesla-motors", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())
{ "organization": { "name": "Tesla", "id": "tesla-motors", "logo": "https://res.cloudinary.com/crunchbase-production/image/upload/v1459804290/mkxozts4fsvkj73azuls.png", "description": "Tesla Motors specializes in developing a full range of electric vehicles.", "semrush_global_rank": 3462, "semrush_visits_latest_month": 34638116 }, "employees": [ { "name": "Kenneth Rogers", "linkedin": "kenneth-rogers-07a7b149", "job_levels": [ "l_500_exec" ], "job_departments": [ "management" ] }, ... ]}

Above, we define our company scraper which, as you can see, is mostly parsing code. Let's quickly unpack our process here:

  1. We retrieve the organization's "people" tab page, e.g. /organization/tesla-motors/people. We use this page because all of the organization sub-pages (aka tabs) contain the same cache, and the people tab additionally contains some employee data.
  2. We find the cache data in the <script id="client-app-state"> node (or <script id="ng-state"> on newer pages) and unquote it, as it uses special Angular escaping.
  3. We load it as JSON into a Python dictionary and select a few important fields from the dataset. Note that there's a lot of data in the cache - most of what's visible on the page and more - but for this demonstration we stick to a few essential fields; the sketch below this list shows how to explore what's available.
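Since the cache holds much more than we parse here, it's worth exploring before committing to a parser. A small sketch that reuses the _unescape_angular helper from above to print every dataset key available in a page's HttpState cache:

import json
from parsel import Selector

def list_cache_keys(response) -> None:
    """print every dataset key available in the page's Angular HttpState cache"""
    sel = Selector(text=response.text)
    app_state_data = sel.css("script#ng-state::text").get()
    if not app_state_data:
        app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get() or "")
    for key in json.loads(app_state_data)["HttpState"]:
        print(key)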

As you can see, since we're scraping the Angular cache directly instead of parsing HTML, we can pick up the entire dataset in just a few lines of code! Can we apply this to scraping other data types hosted on Crunchbase?

Scraping Other Crunchbase Data Types

Crunchbase contains details not only of companies but also of industry news, investors (people), funding rounds and acquisitions. Because we chose to parse the Angular cache rather than the HTML itself, we can easily adapt our parser to extract the datasets of these other endpoints as well:

import json
from typing import Dict, List, TypedDict

import httpx
from parsel import Selector

# note: reuses the _unescape_angular() helper defined in the company scraper above


class PersonData(TypedDict):
    id: str
    name: str


def parse_person(response) -> PersonData:
    """parse person/investor profile from Crunchbase person's page"""
    sel = Selector(text=response.text)
    app_state_data = sel.css("script#ng-state::text").get()
    if not app_state_data:
        app_state_data = _unescape_angular(sel.css("script#client-app-state::text").get() or "")
    app_state_data = json.loads(app_state_data)
    cache_keys = list(app_state_data["HttpState"])
    dataset_key = next(key for key in cache_keys if "data/entities" in key)
    dataset = app_state_data["HttpState"][dataset_key]["data"]
    parsed = {
        # we can get metadata from the properties field:
        "title": dataset['properties']['title'],
        "description": dataset['properties']['short_description'],
        "type": dataset['properties']['layout_id'],
        # the rest of the data can be found in the cards field:
        "investing_overview": dataset['cards']['investor_overview_headline'],
        "socials": {k: v['value'] for k, v in dataset['cards']['overview_fields2'].items()},
        "positions": [{
            "started": job.get('started_on', {}).get('value'),
            "title": job['title'],
            "org": job['organization_identifier']['value'],
            # etc.
        } for job in dataset['cards']['current_jobs_image_list']],
        # etc... there are many more fields to parse
    }
    return parsed


async def scrape_person(person_id: str, session: httpx.AsyncClient) -> PersonData:
    """scrape Crunchbase.com investor's profile"""
    url = f"https://www.crunchbase.com/person/{person_id}"
    response = await session.get(url)
    return parse_person(response)

The example above applies the same technique we used to scrape company data to investor data. By extracting data from the Angular app state, we can scrape the dataset of any Crunchbase endpoint with just a few lines of code!
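To run this person scraper, we can use the same kind of asyncio wrapper we used for companies. Note that "john-doe" below is just the illustrative person ID from the sitemap output earlier, and BASE_HEADERS is reused from the previous snippets:

import asyncio
import json

import httpx

async def run():
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5), timeout=httpx.Timeout(15.0), headers=BASE_HEADERS, http2=True
    ) as session:
        data = await scrape_person("john-doe", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())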

Bypass Blocking with ScrapFly

We looked at how to scrape Crunchbase.com. However, when scraping at scale we are likely to be blocked or served captchas, which will hinder or completely disable our web scraper.


ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.

  • Anti-bot protection bypass - scrape web pages without blocking!
  • Rotating residential proxies - prevent IP address and geographic blocks.
  • JavaScript rendering - scrape dynamic web pages through cloud browsers.
  • Full browser automation - control browsers to scroll, input and click on objects.
  • Format conversion - scrape as HTML, JSON, Text, or Markdown.
  • Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.

To scrape Crunchbase with scrapfly-sdk, we can start by installing the scrapfly-sdk package using pip:

$ pip install scrapfly-sdk

To take advantage of ScrapFly's API in our Crunchbase scraper, all we need to do is replace our httpx session code with scrapfly-sdk client requests. To avoid blocking while scraping Crunchbase, we'll use the Anti Scraping Protection bypass feature, which can be enabled using the asp=True argument. For example, let's take a look at how we can use ScrapFly to scrape a single company page:

from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key='YOUR_SCRAPFLY_KEY')
result = client.scrape(ScrapeConfig(
    url="https://www.crunchbase.com/organization/tesla-motors/people",
    # we need to enable Anti Scraping Protection bypass with a keyword argument:
    asp=True,
))
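The scraped HTML can then be fed into the parse_company function we wrote earlier. A hedged sketch, assuming the SDK exposes the page HTML through the result's content attribute (parse_company reads response.text, so we wrap the HTML in a minimal stand-in object):

from types import SimpleNamespace

# parse_company() expects an object with a .text attribute (like httpx responses),
# so we wrap the scraped HTML in a minimal stand-in:
page = SimpleNamespace(text=result.content)
data = parse_company(page)
print(data["organization"]["name"])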

FAQ

To wrap this guide up let's take a look at some frequently asked questions about web scraping Crunchbase:

Is it legal to scrape Crunchbase.com ?

Yes. Crunchbase data is publicly available, and we're not extracting anything private. Scraping Crunchbase.com at slow, respectful rates falls under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data such as people's (investor) data. For more, see our Is Web Scraping Legal? article.

Can you crawl Crunchbase.com?

Yes, there are many ways to crawl Crunchbase. However, crawling is unnecessary as Crunchbase has a rich sitemap infrastructure. For more, see the Finding Crunchbase Companies and People section.

Latest Crunchbase.com Scraper Code

https://github.com/scrapfly/scrapfly-scrapers/

Summary

In this tutorial, we built a Crunchbase scraper. We took a look at how to discover company and people pages through Crunchbase's sitemap functionality. Then, we wrote a generic dataset parser for Angular-powered websites like Crunchbase itself and put it to use for scraping company and people data.

For this, we used Python with a few community packages like httpx, and to prevent blocking we used ScrapFly's API, which smartly configures every web scraper connection to avoid being blocked. For more on ScrapFly, see our documentation and try it out for free!
