While official API access is paid, Similarweb offers a browser extension that uses an internal API to retrieve data.
Methodology for Code Reproduction
To update the scraper logic when Similarweb changes their verification, follow these steps:
- Install the extension in a Chromium-based browser.
- Go to chrome://extensions/.
- Toggle on Developer mode (top right).
- Find the Similarweb extension and click on the service worker link (this opens a specific DevTools window for the extension's background processes).
- Go to the Network tab in that window.
- Browse to any website and click the Similarweb extension icon to trigger a data fetch.
- Look for the request to https://data.similarweb.com/api/v1/data?domain=….
- Copy the headers (specifically x-extension-version) and check whether a request to /identity was made beforehand to establish cookies.
- Give that as context, along with the extension's source code, to your favorite LLM and ask it to generate code that works around the bot detection.
Implementation
This is what the LLM generated for me:
Modern web protections (like Cloudflare) used by Similarweb block standard Python requests due to TLS fingerprinting. We use curl_cffi to impersonate a real browser TLS handshake and perform an “Identity Handshake” to establish a valid session.
- main.py

```python
from curl_cffi import requests
from urllib.parse import urlparse
import pandas as pd
import time

# Configuration copied from extension inspection
EXTENSION_VER = "6.12.18"
BASE_URL = "https://data.similarweb.com/api/v1"
HEADERS = {
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "x-extension-version": EXTENSION_VER,
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "none",
}


class SimilarWebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.has_identity = False

    def _ensure_identity(self):
        """Must be called at least once per session to get valid sgID cookies."""
        if not self.has_identity:
            self.session.get(f"{BASE_URL}/identity", headers=HEADERS, impersonate="chrome120")
            self.has_identity = True

    def get_data(self, domain_url):
        self._ensure_identity()
        domain = urlparse(domain_url).netloc or domain_url
        resp = self.session.get(
            f"{BASE_URL}/data?domain={domain}",
            headers=HEADERS,
            impersonate="chrome120",
        )
        if resp.status_code == 200:
            return resp.json()
        elif resp.status_code == 403:
            print(f"Blocked by Cloudflare for {domain}. Check if headers/version need updating.")
        return None


if __name__ == "__main__":
    scraper = SimilarWebScraper()
    domains = [
        'https://github.com/',
        'https://reddit.com/',
        'https://measuretheweb.org',
    ]
    results = []
    for d in domains:
        print(f"Fetching {d}...")
        data = scraper.get_data(d)
        if data:
            results.append(data)
        time.sleep(2)  # Be kind to avoid rate limiting
    if results:
        df = pd.json_normalize(results)
        print(f"Succeeded on {len(results)} out of {len(domains)} domains.")
        df.to_csv('similarweb_data.csv', index=False)
```
Response Structure
The internal API returns a comprehensive JSON object. When flattened using pandas.json_normalize, the following columns are typically available:
| Column Group | Field |
|---|---|
| Basic Info | Version, SiteName, Description, Title, Category, LargeScreenshot |
| Ranking | GlobalRank.Rank, CountryRank.Rank, CategoryRank.Rank, GlobalCategoryRank.Rank |
| Engagement | Engagments.BounceRate, Engagments.Visits, Engagments.TimeOnSite, Engagments.PagePerVisit |
| Traffic Sources | TrafficSources.Direct, TrafficSources.Search, TrafficSources.Social, TrafficSources.Referrals |
| Geography | TopCountryShares, Countries |
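To see how the dotted column names in the table arise, here is pandas.json_normalize applied to a minimal mock of the response. Only the field names come from the table above; the sample values are invented for illustration:

```python
import pandas as pd

# Mock of the API response shape; values are made up, field names match the table.
sample = {
    "SiteName": "github.com",
    "Title": "GitHub",
    "GlobalRank": {"Rank": 60},
    "Engagments": {"BounceRate": "0.30", "Visits": "400000000"},  # the API's own spelling
    "TrafficSources": {"Direct": 0.5, "Search": 0.3},
}

df = pd.json_normalize([sample])
print(df.columns.tolist())
# Nested keys become dotted column names such as "GlobalRank.Rank"
print(df.loc[0, "GlobalRank.Rank"])
```

Each nested dict is flattened into top-level columns joined with a dot, which is why the script above can dump the whole response straight to CSV.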
Note on Verification: If the code stops working, check if the x-extension-version in the extension's manifest.json has been incremented and update the constant in the script accordingly.
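That check can be scripted. A minimal sketch, assuming you have the unpacked extension on disk somewhere (the directory path and the temp-dir demo below are stand-ins, not the real extension location):

```python
import json
import tempfile
from pathlib import Path


def read_extension_version(extension_dir: str) -> str:
    """Read the 'version' field from an unpacked extension's manifest.json."""
    manifest = json.loads(Path(extension_dir, "manifest.json").read_text())
    return manifest["version"]


if __name__ == "__main__":
    # Demo with a throwaway manifest standing in for the real unpacked extension dir.
    with tempfile.TemporaryDirectory() as d:
        Path(d, "manifest.json").write_text(json.dumps({"version": "6.12.18"}))
        # Compare against EXTENSION_VER in the scraper script.
        print(read_extension_version(d))
```

If the value differs from the EXTENSION_VER constant in the scraper, update the constant before assuming the block is a Cloudflare problem.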
