Overview

Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, and best practices.

This page examines various top website lists, outlines best practices for their usage, and links to detailed documentation for each service, including API access. It overlaps with Website Classification, because popularity is among the most important categories used in research. This page focuses on services whose primary data is a ranking, while Website Classification focuses on categories such as website industry or company data.

A variety of top website lists are commonly utilized in research. Below is an overview of the most widely referenced options:

1. CrUX (Chrome User Experience Report)

CrUX provides rank order magnitude buckets (e.g., 1k, 5k, 10k, …, 5M) of popular websites based on user-initiated page loads observed by Google Chrome users with the “Help improve Chrome's features and performance” setting enabled.

  • Advantages: The most representative rankings according to [1]; derived from real user browsing data; includes rich country- and device-specific insights.
  • Limitations: Rankings are aggregated into broad rank buckets (e.g., Top 1K, 10K).
  • More details, API: CrUX
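
Because CrUX exposes only rank buckets, selecting a sample means filtering by bucket rather than by exact rank. The following is a minimal sketch of that idea. It assumes the elided buckets in "1k, 5k, 10k, …, 5M" follow the same alternating ×5, ×2 pattern, and it uses a hypothetical in-memory mapping from origin to bucket rather than any real CrUX API.

```python
def crux_buckets(max_bucket: int = 5_000_000) -> list[int]:
    """Return bucket upper bounds 1k, 5k, 10k, ... up to max_bucket.

    Assumption: buckets alternate between multiplying by 5 and by 2,
    as suggested by the sequence 1k, 5k, 10k, ..., 5M.
    """
    buckets = []
    bound = 1_000
    multiply_by_five = True
    while bound <= max_bucket:
        buckets.append(bound)
        bound *= 5 if multiply_by_five else 2
        multiply_by_five = not multiply_by_five
    return buckets


def select_top(origin_to_bucket: dict[str, int], bucket: int) -> list[str]:
    """Return all origins whose rank bucket is at most `bucket`, sorted by name."""
    return sorted(o for o, b in origin_to_bucket.items() if b <= bucket)


# Hypothetical sample data: origin -> smallest bucket containing it.
sample = {
    "https://a.example": 1_000,
    "https://b.example": 10_000,
    "https://c.example": 50_000,
}
top_10k = select_top(sample, 10_000)
```

Note that a bucket value of 10,000 only says the origin is somewhere in the top 10k; no ordering within a bucket can be recovered.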

2. Tranco

Tranco is a research-oriented, hardened top sites ranking designed to enhance stability and resilience against manipulation. It combines data from Alexa, Cisco Umbrella, and Majestic over a 30-day period. It was introduced at NDSS 2019 [2].

  • Advantages: Resilient to popularity manipulation; research-oriented with robust versioning; configurable upon registration.
  • Limitations: Dependent on the quality of its constituent lists; its composition changes when constituent services are discontinued (e.g., Alexa).
  • More details, API: Tranco
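
To illustrate how several rankings collected over a period can be merged into one, here is a minimal sketch using reciprocal-rank scoring (a Dowdall-style rule). This page states only that Tranco combines its source lists over 30 days; the scoring rule, the function name, and the sample data below are assumptions for illustration and should not be read as Tranco's exact procedure.

```python
from collections import defaultdict


def aggregate_rankings(rankings: list[list[str]]) -> list[str]:
    """Merge several ranked lists into one ranking.

    Assumption: each domain earns 1/position from every list it appears in
    (Dowdall-style scoring); domains absent from a list earn nothing from it.
    Ties are broken alphabetically so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for position, domain in enumerate(ranking, start=1):
            scores[domain] += 1.0 / position
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical input: one ranked list per provider and day.
snapshots = [
    ["a.example", "b.example", "c.example"],
    ["b.example", "a.example"],
    ["b.example", "c.example", "a.example"],
]
combined = aggregate_rankings(snapshots)
```

A domain has to rank well across many lists and days to score highly, which is why aggregation of this kind makes a single manipulated snapshot less influential.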

3. Cloudflare Radar

Cloudflare Radar provides rankings based on DNS query data and traffic on Cloudflare-operated websites.

4. Cisco Umbrella

Cisco Umbrella ranks domains based on DNS query traffic to its resolvers.

  • Advantages: Captures non-browser-based traffic.
  • Limitations: DNS-based methodology can introduce biases (see Limitations of DNS-Based Lists); may include invalid domains (e.g., typos, internal domains).

5. Majestic Million

Majestic ranks domains based on backlinks.

  • Advantages: Valuable for analyzing network link structures.
  • Limitations: Does not strictly reflect actual user traffic.

6. Other

  • SimilarWeb: Offers high-quality data but is paywalled (some exceptions exist). See SimilarWeb.
  • Quantcast: Primarily focused on US traffic.
  • Alexa: Discontinued as of August 1, 2023; previously widely used in research. Rankings were based on page visits from a user panel and tracking scripts, making them more reliable than DNS-based lists but highly volatile.
  • Farsight: TODO.
  • SecRank: List based on Chinese DNS data, introduced at USENIX Security 2022 [3].

Best Practices

Based on recent studies [1, 2], the following recommendations can enhance the representativeness and reliability of website selection:

  1. Use Aggregated Data: Most research does not require the rank of each individual website, only membership in an aggregated rank range (e.g., top 10k). Working with such ranges mitigates the issues described under Temporal Stability and Manipulations.
  2. Understand Use-Cases and Biases: Use a list whose collection method matches the population you study. For example, if you study mobile websites, use CrUX data collected from mobile users; if you study internet requests in China, use the Chinese-DNS-based SecRank. Be aware of biases inherent in DNS-based lists (e.g., Cisco Umbrella) and traffic-based lists (e.g., SimilarWeb).
  3. Document and Archive: Record the version of the list and the date of access to ensure reproducibility. Publish the exact list in the publication's artefact.
  4. Avoid Single Sources: If possible, report your results according to multiple lists, such as Tranco and CrUX, for better representativeness.
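
The "Document and Archive" recommendation can be supported by storing a small metadata record next to the archived list. The sketch below is one possible shape for such a record; the field names and the version string are hypothetical, and the checksum simply lets readers verify that they hold the same list.

```python
import hashlib
import json
from datetime import date


def describe_list(name: str, version: str, domains: list[str], accessed: date) -> dict:
    """Build a reproducibility record for a downloaded website list.

    The SHA-256 digest is computed over the domains joined by newlines,
    so it changes if any entry or the order of entries changes.
    """
    payload = "\n".join(domains).encode("utf-8")
    return {
        "list": name,
        "version": version,
        "accessed": accessed.isoformat(),
        "entries": len(domains),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


# Hypothetical example values; replace with the real list and its identifier.
record = describe_list(
    name="Tranco",
    version="hypothetical-list-id",
    domains=["a.example", "b.example"],
    accessed=date(2025, 1, 9),
)
record_json = json.dumps(record, indent=2)
```

Publishing `record_json` together with the list file in the artefact gives later readers the list version, the access date, and a way to check integrity.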

Detailed Analysis

Limitations of DNS-Based Lists

DNS-based lists, such as Cisco Umbrella, Farsight, SecRank, and partially Cloudflare Radar and Tranco, rank domains based on DNS query volumes. While this approach provides insights into overall domain popularity, it introduces several limitations:

DNS lists capture queries from devices and applications that automatically resolve domain names without user interaction. This can lead to overrepresentation of domains related to software updates, network configuration, or telemetry, rather than user-driven web browsing. For example:

  • windowsupdate.microsoft.com is frequently queried for updates but does not represent typical user-visited websites.
  • Internal or infrastructure-related domains, such as ec2.internal, may also appear on these lists.
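
Some of these entries can be removed with a simple syntactic filter before crawling. The following sketch drops names that are malformed or end in a non-public suffix. The suffix set is an illustrative assumption, not an authoritative registry, and the check is purely syntactic: a valid but non-user-facing name such as windowsupdate.microsoft.com passes it and would need a separate check (for example, whether the domain serves a web page).

```python
import re

# Assumption: an illustrative, incomplete set of suffixes that do not
# correspond to publicly reachable websites.
NON_PUBLIC_SUFFIXES = {"internal", "local", "localhost", "invalid", "test", "lan"}

# A DNS hostname label: 1-63 letters, digits, or hyphens, with no
# leading or trailing hyphen.
LABEL = re.compile(r"^(?!-)[a-z0-9-]{1,63}(?<!-)$")


def is_plausible_public_domain(name: str) -> bool:
    """Return True if `name` looks like a public, well-formed hostname."""
    labels = name.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return False  # single-label names such as "localhost"
    if labels[-1] in NON_PUBLIC_SUFFIXES:
        return False
    return all(LABEL.match(label) is not None for label in labels)


dns_list = ["example.com", "ec2.internal", "bad_label.example.com", "localhost"]
filtered = [d for d in dns_list if is_plausible_public_domain(d)]
```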

Temporal Stability and Manipulations

Alexa, a historically prominent ranking source, relied on data from users who installed its browser toolbar. This methodology presented two primary issues:

  • Manipulations: Adversaries could artificially inflate a website's popularity by automating visits from browsers with the toolbar installed. Given the economic interest in ranking high, there was a real-world incentive for such manipulation.
  • Stability: Due to the high number of websites, a limited user base of the toolbar, and daily updates, the published lists exhibited significant volatility.

Tranco was designed to reduce the impact of these issues by aggregating diverse ranking sources over a 30-day period. Similarly, CrUX, which organizes rankings into broader buckets, offers greater stability. While CrUX is theoretically susceptible to manipulation, the vast user base of Chrome (billions of users sharing browsing data) and the high cost of such an attack provide significant resistance.
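
One way to quantify the volatility discussed above is to compare the top-k sets of consecutive snapshots of a list. The sketch below uses the Jaccard index for this; the choice of metric, the function names, and the sample snapshots are assumptions for illustration and are not taken from the cited studies.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def daily_stability(snapshots: list[list[str]], k: int) -> list[float]:
    """Overlap of the top-k sets between each pair of consecutive snapshots."""
    tops = [set(snapshot[:k]) for snapshot in snapshots]
    return [jaccard(prev, cur) for prev, cur in zip(tops, tops[1:])]


# Hypothetical daily snapshots of a ranked list.
days = [
    ["a", "b", "c", "d"],
    ["a", "b", "d", "c"],
    ["a", "e", "b", "c"],
]
top2 = daily_stability(days, k=2)
top4 = daily_stability(days, k=4)
```

Values close to 1 indicate that the top-k set barely changed between days. Order changes within the top k (as between the first two snapshots for k=4) do not affect the result, which mirrors why bucketed rankings appear more stable than exact ranks.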

Figure: Possible manipulations of lists and their estimated cost according to Le Pochat et al. [2].

Representativeness

  • CrUX ranks websites based on actual page loads, making it more reflective of real user behavior compared to Majestic, which relies on backlink analysis.

Use in Publications

Several publications have surveyed the usage of various ranking lists in academic research. The figures below illustrate findings from Scheitle et al. [4] and Xie et al. [5]. Note, however, that recent trends are not fully captured because of the lag in the research process. For instance, Alexa, though discontinued in 2023, is still used in 2024 publications because websites are sampled at the start of a study. In addition, the publication survey ends in 2022.

Figure: Popularity of website lists in web measurement publications according to Scheitle et al. [4].

Figure: Popularity of website lists in web measurement publications according to Xie et al. [5].

References

[1] Ruth, Kimberly; Kumar, Deepak; Wang, Brandon; Valenta, Luke; Durumeric, Zakir (2022): "Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists", in: Proceedings of the 22nd ACM Internet Measurement Conference, pp. 374–387. Association for Computing Machinery, New York, NY, USA.
[2] Le Pochat, Victor; Van Goethem, Tom; Tajalizadehkhoob, Samaneh; Korczyński, Maciej; Joosen, Wouter (2019): "Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation", in: Proceedings of the 26th Annual Network and Distributed System Security Symposium.
[3] Xie, Qinge; Tang, Shujun; Zheng, Xiaofeng; Lin, Qingran; Liu, Baojun; Duan, Haixin; Li, Frank (2022): "Building an Open, Robust, and Stable Voting-Based Domain Top List", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 625–642.
[4] Scheitle, Quirin; Hohlfeld, Oliver; Gamba, Julien; Jelten, Jonas; Zimmermann, Torsten; Strowes, Stephen D.; Vallina-Rodriguez, Narseo (2018): "A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists", in: Proceedings of the Internet Measurement Conference 2018, pp. 478–493. Association for Computing Machinery, New York, NY, USA.
[5] Xie, Qinge; Li, Frank (2024): "Crawling to the Top: An Empirical Evaluation of Top List Use", in: International Conference on Passive and Active Network Measurement, pp. 277–306.
design/website_selection.txt · Last modified: 2025/01/09 11:09 by karelkubicek