===== Overview =====
Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, and best practices.
This page examines various top website lists, outlines best practices for their usage, and links to detailed documentation for each service, including API access. Note that this page overlaps with [[Design:Website Classification]], as popularity is among the most important categories used in research. This page covers in more detail the services that provide rankings as their primary data, while [[Design:Website Classification]] focuses on categories such as website industry or company data.
===== Popular Website Lists =====
A variety of top website lists are commonly utilized in research. Below is an overview of the most widely referenced options:
==== 1. CrUX (Chrome User Experience Report) ====
[[https://developer.chrome.com/docs/crux|CrUX]] groups popular websites into rank-magnitude buckets (e.g., 1k, 5k, 10k, ..., 5M) based on user-initiated page loads observed by Google Chrome users who have the "Help improve Chrome's features and performance" setting enabled.
* **Advantages**: The most representative rankings according to {[ruth2022_toppling]}; derived from real user browsing data; includes rich country- and device-specific insights.
* **Limitations**: Rankings are aggregated into broad rank buckets (e.g., Top 1K, 10K), so the exact position of a website within a bucket is not available.
* **More details, API**: [[Programming:CrUX]]
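Because CrUX exposes buckets rather than exact ranks, selecting a sample means taking a set of sites, not an ordered list. The following is a minimal sketch under stated assumptions: the rows and the ''origins_in_top'' helper are hypothetical, and each row is assumed to carry the upper bound of its bucket (so ''1000'' means "within the top 1k").

```python
# Hypothetical sample rows: (origin, rank bucket upper bound).
# These are illustrative values, not real CrUX data.
rows = [
    ("https://a.example", 1000),
    ("https://b.example", 1000),
    ("https://c.example", 5000),
    ("https://d.example", 10000),
]


def origins_in_top(rows, n):
    """Return all origins whose bucket upper bound is at most n.

    The result is sorted alphabetically only to make the output
    deterministic; the order carries no popularity information.
    """
    return sorted(origin for origin, bucket in rows if bucket <= n)


top_5k = origins_in_top(rows, 5000)
print(top_5k)
```

Within a bucket, all sites are treated as equally popular, which is usually sufficient when a study only needs "the top N" as a sampling frame.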
==== 2. Tranco ====
[[https://tranco-list.eu/|Tranco]] is a research-oriented, hardened top sites ranking designed to enhance stability and resilience against manipulation. It combines data from Alexa, Cisco Umbrella, and Majestic over a 30-day period. It was introduced at NDSS 2019 {[LePochat2019_tranco]}.
* **Advantages**: Resilient to popularity manipulation; research-oriented with robust versioning; configurable upon registration.
* **Limitations**: Dependent on the quality of its constituent lists; subject to change due to the discontinuation of older services.
* **More details, API**: [[Programming:Tranco]]
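Once a list has been downloaded, taking the top N entries is a short parsing step. The sketch below assumes the list is a CSV file with ''rank,domain'' rows and no header line; the sample data and the ''top_domains'' helper are hypothetical and only illustrate the idea.

```python
import csv
import io

# Hypothetical sample in the assumed "rank,domain" format (no header).
SAMPLE_CSV = "1,example.com\n2,example.org\n3,example.net\n"


def top_domains(csv_text, n):
    """Return the domains ranked 1..n, assuming rows are sorted by rank."""
    domains = []
    for rank, domain in csv.reader(io.StringIO(csv_text)):
        if int(rank) > n:
            break  # rows are in ascending rank order, so we can stop early
        domains.append(domain)
    return domains


print(top_domains(SAMPLE_CSV, 2))
```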
==== 3. Cloudflare Radar ====
[[https://radar.cloudflare.com/domains|Cloudflare Radar]] provides rankings based on DNS query data and traffic on Cloudflare-operated websites.
* **Advantages**: TODO.
* **Limitations**: TODO.
* **More details, API**: [[Programming:Cloudflare_Radar]]
==== 4. Cisco Umbrella ====
[[https://umbrella-static.s3-us-west-1.amazonaws.com/index.html|Cisco Umbrella]] ranks domains based on DNS query traffic to its resolvers.
* **Advantages**: Captures non-browser-based traffic.
* **Limitations**: DNS-based methodology can introduce biases (see [[#Limitations of DNS-Based Lists]]); may include invalid domains (e.g., typos, internal domains).
==== 5. Majestic Million ====
Majestic ranks domains based on backlinks.
* **Advantages**: Valuable for analyzing network link structures.
* **Limitations**: Does not strictly reflect actual user traffic.
==== 6. Other ====
* ''SimilarWeb'': Offers high-quality data but is paywalled (some exceptions exist). See [[Programming:SimilarWeb]].
* ''Quantcast'': Primarily focused on US traffic.
* ''[[https://en.wikipedia.org/wiki/Alexa_Internet|Alexa]]'': Discontinued as of August 1, 2023; previously widely used in research. Rankings were based on page visits from a user panel and tracking scripts, making them more reliable than DNS-based lists but highly volatile.
* ''[[https://www.domaintools.com/resources/blog/mirror-mirror-on-the-wall-whos-the-fairest-website-of-them-all|Farsight]]'': TODO.
* ''[[https://secrank.cn/|SecRank]]'': List based on Chinese DNS data, introduced in USENIX Security 2022 {[xie2022_building]}.
===== Best Practices =====
Based on recent studies {[LePochat2019_tranco,ruth2022_toppling]}, the following recommendations can enhance the representativeness and reliability of website selection:
- **Use Aggregated Data**: Most research does not require the exact rank of individual websites, only membership in an aggregated rank range (e.g., top 10k). Working with such ranges mitigates the limitations described in [[#Temporal Stability and Manipulations]].
- **Understand Use-Cases and Biases**: Use a list whose collection method matches the population you intend to study. For example, if you study mobile websites, use CrUX data collected from mobile users; if you study internet requests in China, use the Chinese-DNS-based SecRank. Be aware of biases inherent in DNS-based lists (e.g., Cisco Umbrella) and traffic-based lists (e.g., SimilarWeb).
- **Document and Archive**: Record the version of the list and the date of access to ensure reproducibility. Publish the exact list in the publication's [[Research_practices|artefact]].
- **Avoid Single Sources**: If possible, report your results according to multiple lists, such as Tranco and CrUX, for better representativeness.
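The "Document and Archive" recommendation can be supported by storing a small manifest next to the archived list. This is a minimal sketch, not a prescribed format: the field names, the ''list_manifest'' helper, and the placeholder version string are all assumptions made for illustration.

```python
import hashlib
import json


def list_manifest(name, version, accessed, content):
    """Describe an archived list snapshot so that it can be verified later.

    content is the raw bytes of the downloaded list file.
    """
    return {
        "list": name,
        "version": version,    # list ID or release label, as given by the provider
        "accessed": accessed,  # date of download (ISO 8601)
        "sha256": hashlib.sha256(content).hexdigest(),
        "entries": len(content.splitlines()),
    }


# Hypothetical sample data; replace with the bytes of the real list file.
sample = b"1,example.com\n2,example.org\n"
manifest = list_manifest("Tranco", "HYPOTHETICAL-ID", "2024-01-01", sample)
print(json.dumps(manifest, indent=2))
```

The checksum lets readers of the artefact confirm that they are working with exactly the list used in the study.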
===== Detailed Analysis =====
==== Limitations of DNS-Based Lists ====
DNS-based lists, such as **Cisco Umbrella**, **Farsight**, **SecRank**, and partially **Cloudflare Radar** and **Tranco**, rank domains based on DNS query volumes. While this approach provides insights into overall domain popularity, it introduces several limitations:
DNS lists capture queries from devices and applications that automatically resolve domain names without user interaction. This can lead to overrepresentation of domains related to software updates, network configuration, or telemetry, rather than user-driven web browsing. For example:
* ''windowsupdate.microsoft.com'' is frequently queried for updates but does not represent typical user-visited websites.
* Internal or infrastructure-related domains, such as ''ec2.internal'', may also appear on these lists.
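Some of these entries can be removed with a simple syntactic filter before crawling. The sketch below is a heuristic only: the ''NON_PUBLIC_TLDS'' set is an illustrative, incomplete assumption, and ''is_plausible_website'' is a hypothetical helper, not part of any list provider's tooling.

```python
# Illustrative, incomplete set of suffixes that are not publicly resolvable
# (an assumption for this sketch; extend it for real use).
NON_PUBLIC_TLDS = {"internal", "local", "localhost", "test", "invalid", "example"}


def is_plausible_website(domain):
    """Heuristically reject names that cannot be public websites."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2 or not all(labels):
        return False  # single-label names or empty labels (e.g. "a..b")
    return labels[-1] not in NON_PUBLIC_TLDS


sample = [
    "example.com",
    "ec2.internal",
    "printer.local",
    "localhost",
    "windowsupdate.microsoft.com",
]
kept = [d for d in sample if is_plausible_website(d)]
print(kept)
```

Note that ''windowsupdate.microsoft.com'' passes this filter: syntactic checks remove invalid names, but they cannot identify valid domains whose popularity stems from automated traffic.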
==== Temporal Stability and Manipulations ====
**Alexa**, a historically prominent ranking source, relied on data from users who installed its browser toolbar. This methodology presented two primary issues:
* **Manipulations**: Adversaries could artificially inflate a website's popularity by automating visits from browsers with the toolbar installed. Because a high rank carries economic value, there was a real-world incentive for such manipulation.
* **Stability**: Due to the large number of websites, the limited user base of the toolbar, and daily updates, the published lists exhibited significant volatility.
Tranco was designed to reduce the impact of these issues by aggregating diverse ranking sources over a 30-day period. Similarly, CrUX, which organizes rankings into broader buckets, offers greater stability. While CrUX is theoretically susceptible to manipulation, the very large number of Chrome users contributing browsing data makes such an attack costly and provides significant resistance.
{{design:website_ranking_list_manipulations_lepochat2019.png|Possible manipulations of lists and their estimated cost according to Le Pochat et al.}}
Possible manipulations of lists and their estimated cost according to {[LePochat2019_tranco]}.
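The stabilizing effect of aggregation can be illustrated with a Dowdall-style rule, in which each domain receives a score of 1/rank per snapshot and the scores are summed. This is a simplified sketch of the general idea, not Tranco's actual implementation; the ''dowdall_aggregate'' helper and the snapshot data are hypothetical.

```python
from collections import defaultdict


def dowdall_aggregate(rankings):
    """Combine several ranked lists by summing 1/rank per domain.

    Domains missing from a snapshot simply receive no score for it.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for position, domain in enumerate(ranking, start=1):
            scores[domain] += 1.0 / position
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical snapshots, e.g. different days or different providers.
snapshots = [
    ["a.example", "b.example", "c.example"],
    ["b.example", "a.example", "d.example"],
    ["a.example", "c.example", "b.example"],
]
combined = dowdall_aggregate(snapshots)
print(combined)
```

A domain that appears in only one snapshot, such as ''d.example'' here, ends up low in the combined ranking, so a short-lived spike or a manipulation of a single source has limited effect.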
==== Representativeness ====
* CrUX ranks websites based on actual page loads, making it more reflective of real user behavior compared to Majestic, which relies on backlink analysis.
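When results are reported for more than one list, it helps to state how much the samples overlap. The following sketch computes the Jaccard similarity of two top-N sets; the ''jaccard'' helper and the sample domains are hypothetical and serve only as an illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of domains (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty samples are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical top-3 samples drawn from two different lists.
list_one = ["a.example", "b.example", "c.example"]
list_two = ["b.example", "c.example", "d.example"]
print(jaccard(list_one, list_two))
```

A low overlap indicates that conclusions drawn from one list may not carry over to the other.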
===== Use in Publications =====
Several publications have surveyed the usage of various ranking lists in academic research. The figures below illustrate findings from Scheitle et al. {[scheitle2018_long]} and Xie et al. {[xie2024_crawling]}. Note, however, that they do not fully capture recent trends, for two reasons. First, the research process introduces a lag: Alexa, though discontinued in 2023, is still used in 2024 publications because sampling occurs at the start of a study. Second, the publication survey ends in 2022.
{{design:website_ranking_list_popularity_scheitle2018.png|Popularity of website lists in web measurement publications according to Scheitle et al.}}
Popularity of website lists in web measurement publications according to {[scheitle2018_long]}.
{{design:website_ranking_list_popularity_xie2022.png|Popularity of website lists in web measurement publications according to Xie et al.}}
Popularity of website lists in web measurement publications according to {[xie2024_crawling]}.
====== References ======
~~DISCUSSION~~