design:website_selection

Differences

This shows you the differences between two versions of the page.

design:website_selection [2025/01/02 18:12] – created web selection page karelkubicek
design:website_selection [2025/02/19 10:13] (current) – highlighted todos karelkubicek
Line 2: Line 2:
 Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, and best practices.
 
-This page examines various top website lists, outlines best practices for their usage, and provides detailed documentation for each service, including API access, via linked pages. Note that this page overlaps with [[Design:Website Categorization]], as popularity is among the most important categories used in research. This page focuses more in detail on the services that provide ranking as primary data, while [[Design:Website Categorization]] focuses more on categories such as website industry or company data.
+This page examines various top website lists, outlines best practices for their usage, and provides detailed documentation for each service, including API access, via linked pages. Note that this page overlaps with [[Design:Website Classification]], as popularity is among the most important categories used in research. This page focuses more in detail on the services that provide ranking as primary data, while [[Design:Website Classification]] focuses more on categories such as website industry or company data.
 
 ===== Popular Website Lists =====
Line 24: Line 24:
 [[https://radar.cloudflare.com/domains|Cloudflare Radar]] provides rankings based on DNS query data and traffic on Cloudflare-operated websites.
 
-  * **Advantages**: TODO.
-  * **Limitations**: TODO.
+  * **Advantages**: <wrap todo>TODO</wrap>
+  * **Limitations**: <wrap todo>TODO</wrap>
   * **More details, API**: [[Programming:Cloudflare_Radar]]
  
Line 44: Line 44:
   * ''Quantcast'': Primarily focused on US traffic.
   * ''[[https://en.wikipedia.org/wiki/Alexa_Internet|Alexa]]'': Discontinued as of August 1, 2023; previously widely used in research. Rankings were based on page visits from a user panel and tracking scripts, making them more reliable than DNS-based lists but highly volatile.
-  * ''[[https://www.domaintools.com/resources/blog/mirror-mirror-on-the-wall-whos-the-fairest-website-of-them-all|Farsight]]'': TODO.
+  * ''[[https://www.domaintools.com/resources/blog/mirror-mirror-on-the-wall-whos-the-fairest-website-of-them-all|Farsight]]'': <wrap todo>TODO</wrap>
   * ''[[https://secrank.cn/|SecRank]]'': List based on Chinese DNS data, introduced in USENIX Security 2022 {[xie2022_building]}.
 
 ===== Best Practices =====
-Based on recent studies {[LePochat2019_tranco]} {[ruth2022_toppling]}, the following recommendations can enhance the representativeness and reliability of website selection:
+Based on recent studies {[LePochat2019_tranco,ruth2022_toppling]}, the following recommendations can enhance the representativeness and reliability of website selection:
 
   - **Use Aggregated Data**: Majority of research does not require the popularity indices of individual websites, but rather aggregated ranks (e.g., top 10k). This reduces [[#Temporal Stability and Manipulations]] limitation.
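
The aggregation idea above can be illustrated with a minimal Python sketch. It assumes a ranking is available as ''rank,domain'' CSV text (a common export format, but an assumption here), and the helper names ''read_ranking'', ''top_bucket'' and ''jaccard'' are hypothetical, not part of any list provider's API. The toy data is invented for illustration only.

```python
import csv
import io


def read_ranking(csv_text):
    """Parse 'rank,domain' rows into a list of domains ordered by rank."""
    rows = [(int(rank), domain.strip())
            for rank, domain in csv.reader(io.StringIO(csv_text))]
    return [domain for _, domain in sorted(rows)]


def top_bucket(domains, n):
    """Return the top-n domains as an unordered set (individual ranks are discarded)."""
    return set(domains[:n])


def jaccard(a, b):
    """Share of domains common to both buckets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy data (hypothetical): the same list on two days, with the first two ranks swapped.
day1 = "1,example.com\n2,example.org\n3,example.net\n4,example.edu\n"
day2 = "1,example.org\n2,example.com\n3,example.net\n4,example.info\n"

bucket1 = top_bucket(read_ranking(day1), 3)
bucket2 = top_bucket(read_ranking(day2), 3)
overlap = jaccard(bucket1, bucket2)
```

In this toy case the individual ranks differ between the two days, yet the top-3 buckets contain the same domains, which is why bucket membership tends to be less sensitive to day-to-day rank changes than exact rank positions.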
Line 82: Line 82:
 Several publications have surveyed the usage of various ranking lists in academic research. The figures below illustrate findings from Scheitle et al. {[scheitle2018_long]} and Xie et al. {[xie2024_crawling]}, noting, however, that recent trends are not fully captured due to the lag in the research process. For instance, Alexa, though discontinued in 2023, is still used in 2024 publications due to sampling occurring at the start of studies. Also, the publication survey ends in 2022.
 
-<WRAP right 50% box>
+<WRAP center 100% box>
 {{design:website_ranking_list_popularity_scheitle2018.png|Popularity of website lists in web measurement publications according to Scheitle et al.}}
 <div>Popularity of website lists in web measurement publications according to {[scheitle2018_long]}.</div>
 </WRAP>
-<WRAP right 50% box>
+<WRAP center 100% box>
 {{design:website_ranking_list_popularity_xie2022.png|Popularity of website lists in web measurement publications according to Xie et al.}}
 <div>Popularity of website lists in web measurement publications according to {[xie2024_crawling]}.</div>
Line 94: Line 94:
 <bibtex bibliography></bibtex>
  
-====== BibTex ====== 
- 
-<bibtex database> 
-@InProceedings{LePochat2019_tranco, 
-  author        = {{Le Pochat}, Victor and {Van Goethem}, Tom and Tajalizadehkhoob, Samaneh and Korczy\'{n}ski, Maciej and Joosen, Wouter}, 
-  title         = {Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation}, 
-  booktitle     = {Proceedings of the 26th Annual Network and Distributed System Security Symposium}, 
-  year          = {2019}, 
-  series        = {NDSS 2019}, 
-  month         = 2, 
-  doi           = {10.14722/ndss.2019.23386}, 
-} 
- 
-@InProceedings{ruth2022_toppling, 
-  author        = {Ruth, Kimberly and Kumar, Deepak and Wang, Brandon and Valenta, Luke and Durumeric, Zakir}, 
-  title         = {Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists}, 
-  booktitle     = {Proceedings of the 22nd ACM Internet Measurement Conference}, 
-  year          = {2022}, 
-  series        = {IMC '22}, 
-  pages         = {374–387}, 
-  address       = {New York, NY, USA}, 
-  publisher     = {Association for Computing Machinery}, 
-  __markedentry = {[kubicekk:6]}, 
-  doi           = {10.1145/3517745.3561444}, 
-  isbn          = {9781450392594}, 
-  location      = {Nice, France}, 
-  numpages      = {14}, 
-  url           = {https://doi.org/10.1145/3517745.3561444}, 
-} 
- 
-@inproceedings{xie2024_crawling, 
-  title={Crawling to the Top: An Empirical Evaluation of Top List Use}, 
-  author={Xie, Qinge and Li, Frank}, 
-  booktitle={International Conference on Passive and Active Network Measurement}, 
-  pages={277--306}, 
-  year={2024}, 
-  organization={Springer} 
-} 
- 
-@inproceedings{scheitle2018_long, 
-author = {Scheitle, Quirin and Hohlfeld, Oliver and Gamba, Julien and Jelten, Jonas and Zimmermann, Torsten and Strowes, Stephen D. and Vallina-Rodriguez, Narseo}, 
-title = {A Long Way to the Top: Significance, Structure, and Stability of Internet Top Lists}, 
-year = {2018}, 
-isbn = {9781450356190}, 
-publisher = {Association for Computing Machinery}, 
-address = {New York, NY, USA}, 
-url = {https://doi.org/10.1145/3278532.3278574}, 
-doi = {10.1145/3278532.3278574}, 
-booktitle = {Proceedings of the Internet Measurement Conference 2018}, 
-pages = {478–493}, 
-numpages = {16}, 
-location = {Boston, MA, USA}, 
-series = {IMC '18} 
-} 
- 
-@inproceedings{xie2022_building, 
-  title={Building an Open, Robust, and Stable Voting-Based Domain Top List}, 
-  author={Xie, Qinge and Tang, Shujun and Zheng, Xiaofeng and Lin, Qingran and Liu, Baojun and Duan, Haixin and Li, Frank}, 
-  booktitle={31st USENIX Security Symposium (USENIX Security 22)}, 
-  pages={625--642}, 
-  year={2022} 
-} 
-</bibtex> 
  
+~~DISCUSSION~~
design/website_selection.1735841558.txt.gz · Last modified: 2025/01/02 18:12 by karelkubicek