While official API use is paid, Similarweb offers a [[https://www.similarweb.com/corp/extension/|browser extension]], which uses an internal API to retrieve data.

==== Methodology for Code Reproduction ====

To update the scraper logic when Similarweb changes their verification, follow these steps:

  - Install the extension in a Chromium-based browser.
  - Go to ''chrome://extensions/''.
  - Toggle on **Developer mode** (top right).
  - Find the Similarweb extension and click on the **service worker** link (this opens a dedicated DevTools window for the extension's background processes).
  - Go to the **Network** tab in that window.
  - Browse to any website and click the Similarweb extension icon to trigger a data fetch.
  - Look for the request to ''https://data.similarweb.com/api/v1/data?domain=...''.
  - Copy the headers (specifically ''x-extension-version'') and check whether a request to ''/identity'' was made previously to establish cookies.
  - Give that as context, together with the extension's source code, to your favorite LLM and ask it to generate code that works around the bot detection.

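The captured headers can be turned into reusable configuration programmatically. The helper below (''parse_curl_headers'' is a hypothetical name, not part of any library) sketches how to pull the ''-H'' pairs out of a DevTools "Copy as cURL" export, assuming the single-quoted format Chromium produces:

<code python>
import re

def parse_curl_headers(curl_cmd: str) -> dict:
    """Extract -H 'name: value' pairs from a 'Copy as cURL' command (hypothetical helper)."""
    return {name.lower(): value
            for name, value in re.findall(r"-H '([^:']+): ([^']*)'", curl_cmd)}

# Abbreviated example of what DevTools copies
cmd = """curl 'https://data.similarweb.com/api/v1/data?domain=example.com' \\
  -H 'accept: */*' \\
  -H 'x-extension-version: 6.12.18'"""

print(parse_curl_headers(cmd))
# {'accept': '*/*', 'x-extension-version': '6.12.18'}
</code>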
==== Implementation ====

This is what the LLM generated for me:

Modern web protections (like Cloudflare) used by Similarweb block standard Python ''requests'' due to TLS fingerprinting. We use ''curl_cffi'' to impersonate a real browser's TLS handshake and perform an "identity handshake" to establish a valid session.
  
<file python main.py>
from curl_cffi import requests
from urllib.parse import urlparse
import pandas as pd
import time

# Configuration copied from extension inspection
EXTENSION_VER = "6.12.18"
BASE_URL = "https://data.similarweb.com/api/v1"

HEADERS = {
    "accept": "*/*",
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "x-extension-version": EXTENSION_VER,
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "none",
}

class SimilarWebScraper:
    def __init__(self):
        self.session = requests.Session()
        self.has_identity = False

    def _ensure_identity(self):
        """Must be called at least once per session to get valid sgID cookies."""
        if not self.has_identity:
            self.session.get(f"{BASE_URL}/identity", headers=HEADERS, impersonate="chrome120")
            self.has_identity = True

    def get_data(self, domain_url):
        self._ensure_identity()
        domain = urlparse(domain_url).netloc or domain_url

        resp = self.session.get(
            f"{BASE_URL}/data?domain={domain}",
            headers=HEADERS,
            impersonate="chrome120",
        )

        if resp.status_code == 200:
            return resp.json()
        elif resp.status_code == 403:
            print(f"Blocked by Cloudflare for {domain}. Check if headers/version need updating.")
        return None

if __name__ == "__main__":
    scraper = SimilarWebScraper()
    domains = [
        'https://github.com/',
        'https://reddit.com/',
        'https://measuretheweb.org'
    ]

    results = []
    for d in domains:
        print(f"Fetching {d}...")
        data = scraper.get_data(d)
        if data:
            results.append(data)
        time.sleep(2)  # Be kind to avoid rate limiting

    if results:
        df = pd.json_normalize(results)
        print(f"Succeeded on {len(results)} out of {len(domains)} domains.")
        df.to_csv('similarweb_data.csv', index=False)
</file>
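One detail in ''get_data'' worth highlighting: ''urlparse().netloc'' is empty when the input has no scheme, so the ''or domain_url'' fallback lets both full URLs and bare domains through. A minimal illustration:

<code python>
from urllib.parse import urlparse

def normalize(domain_url: str) -> str:
    # Same logic as in get_data: netloc for full URLs, fallback for bare domains
    return urlparse(domain_url).netloc or domain_url

print(normalize("https://github.com/"))  # github.com
print(normalize("github.com"))           # github.com (netloc is empty, fallback applies)
</code>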
  
==== Response Structure ====

The internal API returns a comprehensive JSON object. When flattened using ''pandas.json_normalize'', the following columns are typically available (''Engagments'' is the field name as returned by the API):

^ Column Group ^ Fields ^
| **Basic Info** | ''Version'', ''SiteName'', ''Description'', ''Title'', ''Category'', ''LargeScreenshot'' |
| **Ranking** | ''GlobalRank.Rank'', ''CountryRank.Rank'', ''CategoryRank.Rank'', ''GlobalCategoryRank.Rank'' |
| **Engagement** | ''Engagments.BounceRate'', ''Engagments.Visits'', ''Engagments.TimeOnSite'', ''Engagments.PagePerVisit'' |
| **Traffic Sources** | ''TrafficSources.Direct'', ''TrafficSources.Search'', ''TrafficSources.Social'', ''TrafficSources.Referrals'' |
| **Geography** | ''TopCountryShares'', ''Countries'' |

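To see how the dotted column names arise, here is ''pandas.json_normalize'' applied to a minimal, made-up response fragment (not real Similarweb data):

<code python>
import pandas as pd

# Made-up fragment shaped like the API's nested JSON
sample = [{
    "SiteName": "example.com",
    "GlobalRank": {"Rank": 1234},
    "Engagments": {"BounceRate": "0.41", "Visits": "1000000"},
}]

df = pd.json_normalize(sample)
print(list(df.columns))
# ['SiteName', 'GlobalRank.Rank', 'Engagments.BounceRate', 'Engagments.Visits']
</code>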
**Note on Verification:** If the code stops working, check if the ''x-extension-version'' in the extension's ''manifest.json'' has been incremented and update the constant in the script accordingly.
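If you have the extension unpacked locally, that version check can be scripted. This sketch assumes only that the extension directory contains the standard ''manifest.json'' with a ''version'' key (the directory path depends on where your browser stores extensions):

<code python>
import json
import tempfile
from pathlib import Path

def manifest_version(extension_dir: str) -> str:
    # Read the "version" field from the extension's manifest.json
    manifest = json.loads(Path(extension_dir, "manifest.json").read_text())
    return manifest["version"]

# Demo against a fake unpacked extension directory
with tempfile.TemporaryDirectory() as d:
    Path(d, "manifest.json").write_text(json.dumps({"version": "6.12.18"}))
    print(manifest_version(d))  # 6.12.18
</code>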
  
~~DISCUSSION~~
programming/similarweb.txt · Last modified: 2026/03/24 by karelkubicek