Overview

Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, and best practices.

This page examines various top website lists, outlines best practices for their use, and links to detailed documentation for each service, including API access. It overlaps with Website Classification, since popularity is among the most important categories used in research. This page concentrates on services whose primary data is a ranking, while Website Classification concentrates on categories such as website industry or company data.

A variety of top website lists are commonly utilized in research. Below is an overview of the most widely referenced options:

1. CrUX (Chrome User Experience Report)

CrUX provides rank order magnitude buckets (e.g., 1k, 5k, 10k, …, 5M) of popular websites based on user-initiated page loads observed by Google Chrome users with the “Help improve Chrome's features and performance” setting enabled.
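
As a rough illustration of what a rank magnitude bucket means, the sketch below maps an exact rank to the smallest bucket that contains it. The bucket boundaries are an assumption that extends the 1k, 5k, 10k pattern up to 5M, and `rank_bucket` is a hypothetical helper, not part of any CrUX API.

```python
from bisect import bisect_left
from typing import Optional

# Assumed bucket upper bounds, extending the 1k, 5k, 10k, ..., 5M pattern.
BUCKETS = [1_000, 5_000, 10_000, 50_000, 100_000, 500_000, 1_000_000, 5_000_000]


def rank_bucket(rank: int) -> Optional[int]:
    """Return the smallest bucket bound >= rank, or None beyond the last bucket."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    i = bisect_left(BUCKETS, rank)
    return BUCKETS[i] if i < len(BUCKETS) else None


for r in (1, 1_000, 1_001, 250_000):
    print(r, "->", rank_bucket(r))
```

A site ranked 1,001 and a site ranked 4,999 land in the same bucket, which is why bucketed data cannot be used to order individual sites within a bucket.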

2. Tranco

Tranco is a research-oriented, hardened top sites ranking designed to enhance stability and resilience against manipulation. It combines data from Alexa, Cisco Umbrella, and Majestic over a 30-day period. It was introduced at NDSS 2019 [2].
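
To make the idea of combining several rankings concrete, here is a minimal sketch of rank aggregation. It assumes a Dowdall-style rule in which each domain receives 1/rank points per input list; this page does not state which rule Tranco applies, so treat the scoring, the function name `dowdall_aggregate`, and the sample domains as illustrative assumptions.

```python
from collections import defaultdict


def dowdall_aggregate(lists):
    """Combine ranked lists: each domain scores 1/rank per list, summed over lists."""
    scores = defaultdict(float)
    for ranking in lists:
        for position, domain in enumerate(ranking, start=1):
            scores[domain] += 1.0 / position
    # Highest total score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical input: one short ranking per provider (or per day).
input_lists = [
    ["a.example", "b.example", "c.example"],
    ["b.example", "a.example", "d.example"],
    ["a.example", "d.example", "b.example"],
]
combined = dowdall_aggregate(input_lists)
print(combined)
```

A domain that appears near the top of only one input list gains little from that single appearance compared with a domain ranked consistently across lists, which is the intuition behind aggregation as a defence against manipulation.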

3. Cloudflare Radar

Cloudflare Radar provides rankings based on DNS query data and traffic on Cloudflare-operated websites.

4. Cisco Umbrella

Cisco Umbrella ranks domains based on DNS query traffic to its resolvers.

5. Majestic Million

Majestic ranks domains based on backlinks.

6. Other

Best Practices

Based on recent studies [1, 2], the following recommendations can enhance the representativeness and reliability of website selection:

  1. Use Aggregated Data: Most research does not require the rank of individual websites, only membership in an aggregated bucket (e.g., top 10k). Working with buckets reduces exposure to the limitations described under Temporal Stability and Manipulations.
  2. Understand Use-Cases and Biases: Use a list whose collection method matches the population you study. For example, if you study mobile websites, use CrUX data collected from mobile users; if you study internet requests in China, use SecRank, which is based on Chinese DNS data. Be aware of biases inherent in DNS-based lists (e.g., Cisco Umbrella) and traffic-based lists (e.g., SimilarWeb).
  3. Document and Archive: Record the version of the list and the date of access to ensure reproducibility. Publish the exact list in the publication's artefact.
  4. Avoid Single Sources: If possible, report your results according to multiple lists, such as Tranco and CrUX, for better representativeness.
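
The "Document and Archive" recommendation can be supported by a small manifest stored next to the archived list. The following is a sketch only: the field names, the `list_manifest` helper, and the sample values are assumptions, not a format prescribed by any list provider.

```python
import hashlib
import json
from datetime import date


def list_manifest(name: str, version: str, content: bytes, accessed: date) -> dict:
    """Describe an archived list snapshot so others can verify they use the same file."""
    return {
        "list": name,
        "version": version,
        "accessed": accessed.isoformat(),
        # The checksum lets readers confirm the archived file is byte-identical.
        "sha256": hashlib.sha256(content).hexdigest(),
        "entries": sum(1 for line in content.splitlines() if line.strip()),
    }


# Hypothetical snapshot content in "rank,domain" form.
sample = b"1,a.example\n2,b.example\n"
manifest = list_manifest("example-list", "hypothetical-id", sample, date(2024, 1, 15))
print(json.dumps(manifest, indent=2))
```

Publishing the manifest together with the list file in the artefact gives readers both the exact data and a way to check that it was not altered.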

Detailed Analysis

Limitations of DNS-Based Lists

DNS-based lists, such as Cisco Umbrella, Farsight, SecRank, and partially Cloudflare Radar and Tranco, rank domains based on DNS query volumes. While this approach provides insights into overall domain popularity, it introduces several limitations:

DNS lists capture queries from devices and applications that automatically resolve domain names without user interaction. This can lead to overrepresentation of domains related to software updates, network configuration, or telemetry, rather than user-driven web browsing. For example:

Temporal Stability and Manipulations

Alexa, a historically prominent ranking source, relied on data from users who installed its browser toolbar. This methodology presented two primary issues:

Tranco was designed to reduce the impact of these issues by aggregating diverse ranking sources over a 30-day period. Similarly, CrUX, which organizes rankings into broader buckets, offers greater stability. While CrUX is theoretically susceptible to manipulation, the vast user base of Chrome (billions of users sharing browsing data) and the high cost of such an attack provide significant resistance.
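
One simple way to quantify stability is the overlap between the top-k sets of two snapshots of the same list. The sketch below uses Jaccard similarity for this; the metric choice, the `jaccard` helper, and the sample snapshots are illustrative assumptions rather than the method used in the cited studies.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity of two collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical snapshots of one list on two consecutive days.
day1 = ["a.example", "b.example", "c.example", "d.example"]
day2 = ["a.example", "c.example", "e.example", "b.example"]

for k in (2, 4):
    print(f"top-{k} overlap: {jaccard(day1[:k], day2[:k]):.2f}")
```

Values close to 1 indicate that the same sites stay in the top-k from one snapshot to the next; low values signal churn that can make results depend on the download date.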

Possible manipulations of lists and their estimated cost according to Le Pochat et al. [2].

Representativeness

Use in Publications

Several publications have surveyed the use of ranking lists in academic research. The figures below illustrate findings from Scheitle et al. [4] and Xie et al. [5]. Recent trends are not fully captured, however, because of the lag in the research process. For instance, Alexa, though discontinued in 2023, still appears in 2024 publications because sampling occurs at the start of a study. In addition, the publication survey ends in 2022.

Popularity of website lists in web measurement publications according to Scheitle et al. [4].

Popularity of website lists in web measurement publications according to Xie et al. [5].

References

[1] Ruth, Kimberly; Kumar, Deepak; Wang, Brandon; Valenta, Luke; Durumeric, Zakir (2022): "Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists", in: Proceedings of the 22nd ACM Internet Measurement Conference, pp. 374–387. Association for Computing Machinery, New York, NY, USA.
[2] Le Pochat, Victor; Van Goethem, Tom; Tajalizadehkhoob, Samaneh; Korczyński, Maciej; Joosen, Wouter (2019): "Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation", in: Proceedings of the 26th Annual Network and Distributed System Security Symposium.
[3] Xie, Qinge; Tang, Shujun; Zheng, Xiaofeng; Lin, Qingran; Liu, Baojun; Duan, Haixin; Li, Frank (2022): "Building an Open, Robust, and Stable Voting-Based Domain Top List", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 625–642.
[4] Scheitle, Quirin; Hohlfeld, Oliver; Gamba, Julien; Jelten, Jonas; Zimmermann, Torsten; Strowes, Stephen D.; Vallina-Rodriguez, Narseo (2018): "A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists", in: Proceedings of the Internet Measurement Conference 2018, pp. 478–493. Association for Computing Machinery, New York, NY, USA.
[5] Xie, Qinge; Li, Frank (2024): "Crawling to the Top: An Empirical Evaluation of Top List Use", in: International Conference on Passive and Active Network Measurement, pp. 277–306.