programming:similarweb

While the official API is paid, Similarweb offers a browser extension that uses an internal API to retrieve data.

Methodology for Code Reproduction

To update the scraper logic when Similarweb changes their verification, follow these steps:

  1. Install the extension in a Chromium-based browser.
  2. Go to chrome://extensions/.
  3. Toggle on Developer mode (top right).
  4. Find the Similarweb extension and click on the service worker link (this opens a specific DevTools window for the extension's background processes).
  5. Go to the Network tab in that window.
  6. Browse to any website and click the Similarweb extension icon to trigger a data fetch.
  7. Copy the request headers (in particular x-extension-version) and check whether an earlier request to /identity was made to establish cookies.
  8. Give that as context, together with the source code of the extension, to your favorite LLM and ask it to generate code that works around the bot detection.
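
Step 7 can be partly automated. The following is a minimal sketch, assuming the headers were pasted from the DevTools Network tab as plain "name: value" lines; the function name parse_raw_headers and the sample paste are hypothetical and only illustrate the idea.

```python
def parse_raw_headers(raw: str) -> dict[str, str]:
    """Turn 'name: value' lines (as pasted from DevTools) into a dict with lowercase names."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(":"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


# Made-up paste for illustration; real values come from the Network tab.
SAMPLE = """
:authority: data.similarweb.com
accept: */*
x-extension-version: 6.12.18
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""

headers = parse_raw_headers(SAMPLE)
print(headers.get("x-extension-version"))
```

The resulting dict can be compared against the HEADERS constant in the script to spot headers that the extension has added or changed.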

Implementation

This is what an LLM generated for me:

Modern web protections (like Cloudflare) used by Similarweb block standard Python requests due to TLS fingerprinting. We use curl_cffi to impersonate a real browser TLS handshake and perform an “Identity Handshake” to establish a valid session.

main.py
from curl_cffi import requests
from urllib.parse import urlparse
import pandas as pd
import time
 
# Configuration copied from extension inspection
EXTENSION_VER = "6.12.18" 
BASE_URL = "https://data.similarweb.com/api/v1"
 
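HEADERS = {
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9",
    # The Chrome version here (122) differs from the impersonated TLS profile ("chrome120")
    # used below; aligning the two may help if requests start getting blocked.
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",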
    "x-extension-version": EXTENSION_VER,
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "none",
}
 
class SimilarWebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.has_identity = False
 
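    def _ensure_identity(self):
        """Call /identity once per session so the server sets valid sgID cookies."""
        if not self.has_identity:
            resp = self.session.get(f"{BASE_URL}/identity", headers=HEADERS, impersonate="chrome120")
            # Only mark the session as established if the handshake succeeded,
            # so a failed attempt is retried on the next call.
            self.has_identity = resp.status_code == 200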
 
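    def get_data(self, domain_url):
        self._ensure_identity()
        # Accept a full URL ("https://github.com/") or a bare domain ("github.com").
        domain = urlparse(domain_url).netloc or domain_url
 
        resp = self.session.get(
            f"{BASE_URL}/data?domain={domain}", 
            headers=HEADERS, 
            impersonate="chrome120"
        )
 
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 403:
            print(f"Blocked by Cloudflare for {domain}. Check if headers/version need updating.")
        else:
            # Surface other failures instead of silently returning None.
            print(f"Unexpected HTTP {resp.status_code} for {domain}.")
        return None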
 
if __name__ == "__main__":
    scraper = SimilarWebScraper()
    domains = [
        'https://github.com/',
        'https://reddit.com/',
        'https://measuretheweb.org' 
    ]
 
    results = []
    for d in domains:
        print(f"Fetching {d}...")
        data = scraper.get_data(d)
        if data:
            results.append(data)
        time.sleep(2) # Be kind to avoid rate limiting
 
    if results:
        df = pd.json_normalize(results)
        print(f"Succeeded on {len(results)} out of {len(domains)} domains.")
        df.to_csv('similarweb_data.csv', index=False)

Response Structure

The internal API returns a comprehensive JSON object. When flattened using pandas.json_normalize, the following columns are typically available:

Column Group | Fields
Basic Info | Version, SiteName, Description, Title, Category, LargeScreenshot
Ranking | GlobalRank.Rank, CountryRank.Rank, CategoryRank.Rank, GlobalCategoryRank.Rank
Engagement | Engagments.BounceRate, Engagments.Visits, Engagments.TimeOnSite, Engagments.PagePerVisit
Traffic Sources | TrafficSources.Direct, TrafficSources.Search, TrafficSources.Social, TrafficSources.Referrals
Geography | TopCountryShares, Countries
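
The dotted column names above map directly onto the nested JSON, so individual fields can also be pulled out without pandas. A minimal sketch follows, assuming the response is a plain nested dict; the helper get_path and the sample payload are hypothetical, and the sample values are made up (only the key names, including the "Engagments" spelling, follow the column list above).

```python
from typing import Any


def get_path(data: dict, dotted: str, default: Any = None) -> Any:
    """Follow a dotted column name such as 'GlobalRank.Rank' through nested dicts."""
    current: Any = data
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up payload that only mirrors the field names listed above.
sample = {
    "SiteName": "example.com",
    "GlobalRank": {"Rank": 1234},
    "Engagments": {"BounceRate": 0.45, "Visits": 100000},
}

wanted = ["SiteName", "GlobalRank.Rank", "Engagments.Visits", "CountryRank.Rank"]
row = {column: get_path(sample, column) for column in wanted}
print(row)
```

Fields missing from a response come back as None rather than raising, which is convenient when some domains lack, for example, a country rank.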

Note on Verification: If the code stops working, check whether the extension's version (the version field in its manifest.json, which the extension appears to send as the x-extension-version header) has been incremented, and update the EXTENSION_VER constant in the script accordingly.
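
This version check can be scripted. The sketch below assumes you know the directory of the unpacked extension on disk (it varies by browser and profile) and that the x-extension-version header mirrors the version field of manifest.json, which has not been verified here. The function names are hypothetical, and the demo writes a throwaway manifest to a temporary directory rather than reading a real installation.

```python
import json
import tempfile
from pathlib import Path


def read_manifest_version(extension_dir: str) -> str:
    """Return the 'version' field from manifest.json in an extension directory."""
    manifest_path = Path(extension_dir) / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return manifest["version"]


def needs_update(extension_dir: str, script_version: str) -> bool:
    """True when the installed extension reports a different version than the script."""
    return read_manifest_version(extension_dir) != script_version


# Demo with a throwaway manifest; point extension_dir at the real extension instead.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "manifest.json").write_text(json.dumps({"version": "6.12.19"}), encoding="utf-8")
    print(needs_update(tmp, "6.12.18"))
```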

programming/similarweb.txt · Last modified: by karelkubicek