design:website_selection
Next revision | Previous revision | ||
design:website_selection [2025/01/02 18:12] – created web selection page karelkubicek | design:website_selection [2025/02/19 10:13] (current) – highlighted todos karelkubicek | ||
---|---|---|---|
Line 2: | Line 2: | ||
Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, | Selecting a representative sample of websites is a crucial step in web measurement studies. The choice of website lists can significantly affect research outcomes, making it essential to understand their strengths, limitations, | ||
- | This page examines various top website lists, outlines best practices for their usage, and provides detailed documentation for each service, including API access, via linked pages. Note that this page overlaps with [[Design: | + | This page examines various top website lists, outlines best practices for their usage, and provides detailed documentation for each service, including API access, via linked pages. Note that this page overlaps with [[Design: |
===== Popular Website Lists ===== | ===== Popular Website Lists ===== | ||
Line 24: | Line 24: | ||
[[https:// | [[https:// | ||
- | * **Advantages**: | + | * **Advantages**: |
- | * **Limitations**: | + | * **Limitations**: |
* **More details, API**: [[Programming: | * **More details, API**: [[Programming: | ||
Line 44: | Line 44: | ||
* '' | * '' | ||
* '' | * '' | ||
- | * '' | + | * '' |
* '' | * '' | ||
===== Best Practices ===== | ===== Best Practices ===== | ||
- | Based on recent studies {[LePochat2019_tranco]} {[ruth2022_toppling]}, | + | Based on recent studies {[LePochat2019_tranco,ruth2022_toppling]}, |
- **Use Aggregated Data**: Most research does not require the popularity rank of each individual website, only aggregated rank buckets (e.g., top 10k). This mitigates the [[#Temporal Stability and Manipulations]] limitation. | - **Use Aggregated Data**: Most research does not require the popularity rank of each individual website, only aggregated rank buckets (e.g., top 10k). This mitigates the [[#Temporal Stability and Manipulations]] limitation. | ||
Line 82: | Line 82: | ||
Several publications have surveyed the usage of various ranking lists in academic research. The figures below illustrate findings from Scheitle et al. {[scheitle2018_long]} and Xie et al. {[xie2024_crawling]}, | Several publications have surveyed the usage of various ranking lists in academic research. The figures below illustrate findings from Scheitle et al. {[scheitle2018_long]} and Xie et al. {[xie2024_crawling]}, | ||
- | < | + | < |
{{design: | {{design: | ||
< | < | ||
</ | </ | ||
- | < | + | < |
{{design: | {{design: | ||
< | < | ||
Line 94: | Line 94: | ||
<bibtex bibliography></ | <bibtex bibliography></ | ||
- | ====== BibTex ====== | ||
- | |||
- | <bibtex database> | ||
- | @InProceedings{LePochat2019_tranco, | ||
- | author | ||
- | title = {Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation}, | ||
- | booktitle | ||
- | year = {2019}, | ||
- | series | ||
- | month = 2, | ||
- | doi = {10.14722/ | ||
- | } | ||
- | |||
- | @InProceedings{ruth2022_toppling, | ||
- | author | ||
- | title = {Toppling Top Lists: Evaluating the Accuracy of Popular Website Lists}, | ||
- | booktitle | ||
- | year = {2022}, | ||
- | series | ||
- | pages = {374–387}, | ||
- | address | ||
- | publisher | ||
- | __markedentry = {[kubicekk: | ||
- | doi = {10.1145/ | ||
- | isbn = {9781450392594}, | ||
- | location | ||
- | numpages | ||
- | url = {https:// | ||
- | } | ||
- | |||
- | @inproceedings{xie2024_crawling, | ||
- | title={Crawling to the Top: An Empirical Evaluation of Top List Use}, | ||
- | author={Xie, | ||
- | booktitle={International Conference on Passive and Active Network Measurement}, | ||
- | pages={277--306}, | ||
- | year={2024}, | ||
- | organization={Springer} | ||
- | } | ||
- | |||
- | @inproceedings{scheitle2018_long, | ||
- | author = {Scheitle, Quirin and Hohlfeld, Oliver and Gamba, Julien and Jelten, Jonas and Zimmermann, Torsten and Strowes, Stephen D. and Vallina-Rodriguez, | ||
- | title = {A Long Way to the Top: Significance, | ||
- | year = {2018}, | ||
- | isbn = {9781450356190}, | ||
- | publisher = {Association for Computing Machinery}, | ||
- | address = {New York, NY, USA}, | ||
- | url = {https:// | ||
- | doi = {10.1145/ | ||
- | booktitle = {Proceedings of the Internet Measurement Conference 2018}, | ||
- | pages = {478–493}, | ||
- | numpages = {16}, | ||
- | location = {Boston, MA, USA}, | ||
- | series = {IMC '18} | ||
- | } | ||
- | |||
- | @inproceedings{xie2022_building, | ||
- | title={Building an Open, Robust, and Stable Voting-Based Domain Top List}, | ||
- | author={Xie, | ||
- | booktitle={31st USENIX Security Symposium (USENIX Security 22)}, | ||
- | pages={625--642}, | ||
- | year={2022} | ||
- | } | ||
- | </ | ||
+ | ~~DISCUSSION~~ |
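The "Use Aggregated Data" practice listed under Best Practices can be illustrated with a minimal sketch. It assumes a top list is available as headerless CSV text with `rank,domain` rows; the bucket boundaries and the names `BUCKETS`, `bucket_for_rank`, and `load_bucketed` are hypothetical and are not taken from the page or from any list provider's API.

```python
import csv
import io

# Hypothetical bucket boundaries; choose ones that match your study design.
BUCKETS = (1_000, 10_000, 100_000, 1_000_000)


def bucket_for_rank(rank, buckets=BUCKETS):
    """Return the smallest bucket boundary that contains `rank`, or None."""
    for upper in buckets:
        if rank <= upper:
            return upper
    return None


def load_bucketed(csv_text, buckets=BUCKETS):
    """Map each domain in a `rank,domain` CSV to its bucket boundary."""
    result = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        rank, domain = int(row[0]), row[1].strip().lower()
        bucket = bucket_for_rank(rank, buckets)
        if bucket is not None:
            result[domain] = bucket
    return result


# Assumed sample input in `rank,domain` form, using reserved example domains.
sample = "1,example.com\n1500,example.org\n250000,example.net\n"
bucketed = load_bucketed(sample)
```

Reporting results per bucket rather than per exact rank means that small day-to-day rank changes only matter when a site crosses a bucket boundary.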
design/website_selection.1735841558.txt.gz · Last modified: 2025/01/02 18:12 by karelkubicek