Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, and best practices.
This page examines various top website lists, outlines best practices for their usage, and links to detailed documentation for each service, including API access. Note that this page overlaps with Website Classification, as popularity is among the most important categories used in research. This page covers in more detail the services that provide rankings as their primary data, while Website Classification focuses on categories such as website industry or company data.
A variety of top website lists are commonly utilized in research. Below is an overview of the most widely referenced options:
CrUX provides rank order magnitude buckets (e.g., 1k, 5k, 10k, …, 5M) of popular websites based on user-initiated page loads observed by Google Chrome users with the “Help improve Chrome's features and performance” setting enabled.
Tranco is a research-oriented, hardened top sites ranking designed to enhance stability and resilience against manipulation. It combines data from Alexa, Cisco Umbrella, and Majestic over a 30-day period. It was introduced at NDSS 2019 [2: Le Pochat, Victor; Van Goethem, Tom; Tajalizadehkhoob, Samaneh; Korczyński, Maciej; Joosen, Wouter (2019): "Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation", in: Proceedings of the 26th Annual Network and Distributed System Security Symposium].
Cloudflare Radar provides rankings based on DNS query data and traffic on Cloudflare-operated websites.
Cisco Umbrella ranks domains based on DNS query traffic to its resolvers.
Majestic ranks domains based on backlinks.
SimilarWeb: Offers high-quality data but is paywalled (some exceptions exist). See SimilarWeb.
Quantcast: Primarily focused on US traffic.
Alexa: Discontinued as of August 1, 2023; previously widely used in research. Rankings were based on page visits from a user panel and tracking scripts, making them more reliable than DNS-based lists but highly volatile.
Farsight: TODO.
SecRank: List based on Chinese DNS data, introduced at USENIX Security 2022 [3: Xie, Qinge; Tang, Shujun; Zheng, Xiaofeng; Lin, Qingran; Liu, Baojun; Duan, Haixin; Li, Frank (2022): "Building an Open, Robust, and Stable Voting-Based Domain Top List", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 625-642].
Based on recent studies by Le Pochat et al. [2] and Ruth et al. [1: Ruth, Kimberly; Kumar, Deepak; Wang, Brandon; Valenta, Luke; Durumeric, Zakir (2022): "Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists", in: Proceedings of the 22nd ACM Internet Measurement Conference, pp. 374–387. Association for Computing Machinery, New York, NY, USA], the following recommendations can enhance the representativeness and reliability of website selection:
DNS-based lists, such as Cisco Umbrella, Farsight, SecRank, and partially Cloudflare Radar and Tranco, rank domains based on DNS query volumes. While this approach provides insights into overall domain popularity, it introduces several limitations:
DNS lists capture queries from devices and applications that automatically resolve domain names without user interaction. This can lead to overrepresentation of domains related to software updates, network configuration, or telemetry, rather than user-driven web browsing. For example:
windowsupdate.microsoft.com is frequently queried for updates but does not represent typical user-visited websites.
Internal domains, such as ec2.internal, may also appear on these lists.
Alexa, a historically prominent ranking source, relied on data from users who installed its browser toolbar. This methodology presented two primary issues:
Tranco was designed to reduce the impact of these issues by aggregating diverse ranking sources over a 30-day period. Similarly, CrUX, which organizes rankings into broader buckets, offers greater stability. While CrUX is theoretically susceptible to manipulation, the vast user base of Chrome (billions of users sharing browsing data) and the high cost of such an attack provide significant resistance.
Possible manipulations of lists and their estimated cost according to Le Pochat et al. [2].
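The aggregation idea behind Tranco can be illustrated with a short sketch. It assumes Dowdall-style scoring, in which a domain earns 1/rank points from every input list it appears in (the rule the Tranco authors report for their default list [2]). The function name `dowdall_aggregate` and the `example.*` toy data are hypothetical, and the real pipeline (30 days of snapshots, domain normalization, filtering) is not reproduced here.

```python
from collections import defaultdict


def dowdall_aggregate(rankings):
    """Combine several ranked lists (best entry first) into one ranking.

    Each domain earns 1/rank points from every list it appears in;
    a domain missing from a list earns nothing from that list.
    """
    scores = defaultdict(float)
    for ranked in rankings:
        for position, domain in enumerate(ranked, start=1):
            scores[domain] += 1.0 / position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy snapshots standing in for three different source lists.
list_a = ["example.com", "example.org", "example.net"]
list_b = ["example.org", "example.com", "example.edu"]
list_c = ["example.org", "example.net", "example.com"]

combined = dowdall_aggregate([list_a, list_b, list_c])
print(combined)
```

In this toy input, `example.org` ranks well in all three sources and therefore ends up ahead of `example.com`, which tops only one of them. A domain that is pushed to the top of a single source gains comparatively little, which is the intuition behind combining sources to resist manipulation.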
Several publications have surveyed the usage of various ranking lists in academic research. The figures below illustrate findings from Scheitle et al. [4: Scheitle, Quirin; Hohlfeld, Oliver; Gamba, Julien; Jelten, Jonas; Zimmermann, Torsten; Strowes, Stephen D.; Vallina-Rodriguez, Narseo (2018): "A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists", in: Proceedings of the Internet Measurement Conference 2018, pp. 478–493. Association for Computing Machinery, New York, NY, USA] and Xie and Li [5: Xie, Qinge; Li, Frank (2024): "Crawling to the Top: An Empirical Evaluation of Top List Use", in: International Conference on Passive and Active Network Measurement, pp. 277-306]. Note, however, that recent trends are not fully captured because of the lag inherent in the research process. For instance, Alexa, though discontinued in 2023, still appears in 2024 publications because sampling occurs at the start of a study. In addition, the publication survey ends in 2022.
Popularity of website lists in web measurement publications according to Scheitle et al. [4].
Popularity of website lists in web measurement publications according to Xie and Li [5].