programming:similarweb

While official API use is paid, Similarweb offers a browser extension that retrieves its data through an internal API. We recommend first trying the extension manually and watching its requests in the Network tab of the browser's developer tools. There you will find a request to a URL such as https://data.similarweb.com/api/v1/data?domain=example.com, together with the headers it sends. Using these headers, you can automate the access along the lines of the following code.

main.py
import requests
from urllib.parse import urlparse

import pandas as pd

API_URL = "https://data.similarweb.com/api/v1/data?domain="
HEADERS = {  # TODO: update the User-Agent with your current agent, else you will get blocked
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0'
}


def similarweb_get(domain):
    # Similarweb expects only the netloc (host name); urlparse() extracts it
    # only when the input includes a scheme such as https://
    domain = urlparse(domain).netloc
    resp = requests.get(API_URL + domain, headers=HEADERS, timeout=30)
    resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return resp.json()


if __name__ == "__main__":
    domains = [
        'https://github.com/',
        'https://measuretheweb.org'  # not in the dataset (yet :)), response is valid but empty
    ]

    responses = []
    success_n = 0
    consecutive_errors_n = 0

    for domain in domains:
        # if you are often getting errors, you might try to add a sleep here
        try:
            resp = similarweb_get(domain)
            responses.append(resp)
            success_n += 1
            consecutive_errors_n = 0
        except requests.HTTPError:
            responses.append({})  # empty record keeps the rows aligned with domains
            consecutive_errors_n += 1
            if consecutive_errors_n >= 5:  # give up, we are probably blocked
                break

    print(f'Succeeded on {success_n} out of {len(domains)} domains.')
    df = pd.json_normalize(responses)
    print(df)
    df.to_csv('similarweb.csv')

Output:

Succeeded on 2 out of 2 domains.
   Version           SiteName                                        Description  ... Notification.Content GlobalCategoryRank.Rank  GlobalCategoryRank.Category
0        1         github.com  github is where people build software. more th...  ...                 None                     NaN                          NaN
1        1  measuretheweb.org                                               None  ...                 None                     NaN                          NaN

[2 rows x 39 columns]
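
The comment in the loop of main.py suggests adding a sleep when errors are frequent. A minimal sketch of that idea follows; it is not part of the original script. The helper name fetch_all, its parameters, and the one-second default pause are assumptions made for illustration, and the pause that actually avoids blocking is unknown. The fetch argument stands in for similarweb_get, so the loop can be tried without network access.

```python
import time


def fetch_all(domains, fetch, delay_s=1.0, max_consecutive_errors=5, sleep=time.sleep):
    """Call fetch(domain) for each domain, pausing between calls.

    Returns (responses, success_n). A failed domain yields an empty dict,
    so responses stays aligned with the domains that were processed.
    """
    responses = []
    success_n = 0
    consecutive_errors_n = 0
    for i, domain in enumerate(domains):
        if i > 0:
            sleep(delay_s)  # fixed pause between requests; the value is a guess
        try:
            responses.append(fetch(domain))
            success_n += 1
            consecutive_errors_n = 0
        except Exception:  # with similarweb_get this could be narrowed to requests.HTTPError
            responses.append({})
            consecutive_errors_n += 1
            if consecutive_errors_n >= max_consecutive_errors:
                break  # too many failures in a row, stop early
    return responses, success_n
```

With the script above, this would be called as fetch_all(domains, similarweb_get), and the returned list could be passed to pd.json_normalize as before.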

Columns:

  • Version:
  • SiteName:
  • Description:
  • TopCountryShares:
  • Title:
  • GlobalCategoryRank:
  • IsSmall:
  • Policy:
  • Category:
  • LargeScreenshot:
  • IsDataFromGa:
  • Countries:
  • TopKeywords:
  • SnapshotDate:
  • Engagments.BounceRate:
  • Engagments.Month:
  • Engagments.Year:
  • Engagments.PagePerVisit:
  • Engagments.Visits:
  • Engagments.TimeOnSite:
  • EstimatedMonthlyVisits.2024-09-01:
  • EstimatedMonthlyVisits.2024-10-01:
  • EstimatedMonthlyVisits.2024-11-01:
  • GlobalRank.Rank:
  • CountryRank.Country:
  • CountryRank.CountryCode:
  • CountryRank.Rank:
  • CategoryRank.Rank:
  • CategoryRank.Category:
  • TrafficSources.Social:
  • TrafficSources.Paid Referrals:
  • TrafficSources.Mail:
  • TrafficSources.Referrals:
  • TrafficSources.Search:
  • TrafficSources.Direct:
  • Competitors.TopSimilarityCompetitors:
  • Notification.Content:
  • GlobalCategoryRank.Rank:
  • GlobalCategoryRank.Category:
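
The EstimatedMonthlyVisits.* columns carry dates in their names, so they presumably change as new snapshots arrive, which makes CSV files from different runs awkward to combine. As a hedged sketch, the raw responses can be turned into one row per site and month before any flattening. The helper name monthly_visits_rows is hypothetical, the sample numbers are made up, and the nested shape of EstimatedMonthlyVisits (a mapping from ISO date to visit count) is inferred from the column names above rather than from any documentation.

```python
from datetime import date


def monthly_visits_rows(responses):
    """Turn raw responses into (site, month, visits) tuples, one per month."""
    rows = []
    for resp in responses:
        if not resp:  # skip placeholders stored for failed requests
            continue
        monthly = resp.get("EstimatedMonthlyVisits") or {}
        # ISO dates sort chronologically as plain strings
        for month, visits in sorted(monthly.items()):
            rows.append((resp.get("SiteName"), date.fromisoformat(month), visits))
    return rows


# Made-up numbers, only to show the shape of the data
sample = [
    {"SiteName": "example.com",
     "EstimatedMonthlyVisits": {"2024-10-01": 120, "2024-09-01": 100}},
    {"SiteName": "example.org", "EstimatedMonthlyVisits": {}},
    {},
]

for row in monthly_visits_rows(sample):
    print(row)
```

The resulting tuples can be written with the csv module or wrapped in pd.DataFrame(rows, columns=['SiteName', 'Month', 'Visits']) if pandas is already in use.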
programming/similarweb.txt · Last modified: 2025/01/03 17:41 by karelkubicek