Stateful and Stateless Crawling
This page only contains notes
Key message:
- The majority of web measurement studies use stateless crawls, because each observed event is easy to attribute to the single website being visited. Stateless crawls also do not depend on crawling order and are easier to parallelize.
- Stateful crawling is, however, more representative of real users, who rarely clear their browser state.
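The difference can be illustrated with a toy simulation. This is only a minimal sketch: the site names, the tracker names, and the functions `visit`, `stateless_crawl`, and `stateful_crawl` are hypothetical, and a plain dictionary stands in for the browser's cookie store. No real crawler or network access is involved.

```python
from typing import Dict, List

# Hypothetical toy "web": each site embeds some third-party trackers.
SITES: Dict[str, List[str]] = {
    "news.example": ["tracker-a.example"],
    "shop.example": ["tracker-a.example", "tracker-b.example"],
}


def visit(site: str, cookies: Dict[str, str]) -> List[str]:
    """Simulate one page load.

    Returns the trackers that already had a cookie in this browser state,
    i.e. the trackers that could recognise a returning browser.
    """
    recognised = []
    for tracker in SITES[site]:
        if tracker in cookies:
            recognised.append(tracker)
        else:
            # First contact: the tracker stores an identifier.
            cookies[tracker] = f"id-set-on-{site}"
    return recognised


def stateless_crawl(sites: List[str]) -> Dict[str, List[str]]:
    # A fresh, empty cookie store for every visit.
    return {site: visit(site, {}) for site in sites}


def stateful_crawl(sites: List[str]) -> Dict[str, List[str]]:
    # One cookie store shared by all visits, like a long-lived user profile.
    profile: Dict[str, str] = {}
    return {site: visit(site, profile) for site in sites}


if __name__ == "__main__":
    order = ["news.example", "shop.example"]
    print("stateless:", stateless_crawl(order))
    print("stateful: ", stateful_crawl(order))
    print("stateful, reversed order:", stateful_crawl(order[::-1]))
```

In this sketch the stateless result is the same for any visiting order, so the visits could be distributed over parallel workers. The stateful result for a site depends on which sites were visited before it, which is why events are harder to attribute to a single website.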
Relevant Literature
Since the majority of publications use stateless crawling, we list below examples of influential publications that do otherwise. Not all of them specifically address the difference between stateful and stateless crawling.
- The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing [1]
  - Comparison of stateless crawls with the stateful browsing experience of real (Firefox) users
  - Stateless crawls surprisingly result in more third-party requests than stateful crawls (Fig. 6).
- Online Tracking: A 1-million-site Measurement and Analysis [2]
  - The most impactful publication utilizing both stateful and stateless crawls
- Tracing Information Flows Between Ad Exchanges Using Retargeted Ads [3]
  - Created “shopper personas” by visiting websites on specific topics, then collected the advertisements shown to these personas on subsequently visited websites
- The web never forgets: Persistent tracking mechanisms in the wild
- Cookies that give you away: The surveillance implications of web tracking
Studies of Stateful Aspects
The following studies used stateless crawls but examined stateful properties of the web:
- My Cookie is a phoenix: detection, measurement, and lawfulness of cookie respawning with browser fingerprinting [4]
  - Evaluation of cookie respawning
Shallow vs Deep crawling
- Beyond the Front Page: Measuring Third Party Dynamics in the Field [5]
  - Comparison of visiting only front pages vs. crawling sub-pages
  - Visiting sub-pages significantly increases the amount of tracking observed
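One way a deep crawl could choose which sub-pages to visit is to collect links from the front page that stay on the same host. The sketch below is an illustration under stated assumptions, not the method of the cited study: the function `subpages`, the class `LinkCollector`, and the `limit` parameter are hypothetical, and comparing hostnames is a simplification of "same site" (it ignores sub-domains and registrable domains).

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def subpages(front_page_url: str, html: str, limit: int = 5) -> List[str]:
    """Return up to `limit` distinct same-host URLs linked from the front page."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(front_page_url).hostname
    found: List[str] = []
    for href in parser.hrefs:
        # Resolve relative links and drop the fragment part.
        url = urljoin(front_page_url, href).split("#")[0]
        same_host = urlparse(url).hostname == host
        if same_host and url != front_page_url and url not in found:
            found.append(url)
    return found[:limit]


if __name__ == "__main__":
    page = (
        '<a href="/about">About</a>'
        '<a href="https://other.example/x">External</a>'
        '<a href="/about#team">Team</a>'
        '<a href="news/1">News</a>'
    )
    print(subpages("https://site.example/", page))
```

A shallow crawl would visit only `front_page_url`; a deep crawl would additionally visit the URLs returned by `subpages`.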
References
- [1] Zeber, David; Bird, Sarah; Oliveira, Camila; Rudametkin, Walter; Segall, Ilana; Wollsén, Fredrik; Lopatka, Martin (2020): "The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing", in: Proceedings of The Web Conference 2020, pp. 167–178. Association for Computing Machinery, New York, NY, USA.
- [2] Englehardt, Steven; Narayanan, Arvind (2016): "Online Tracking: A 1-million-site Measurement and Analysis", in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1388–1401. Association for Computing Machinery, New York, NY, USA.
- [3] Bashir, Muhammad Ahmad; Arshad, Sajjad; Robertson, William; Wilson, Christo (2016): "Tracing information flows between ad exchanges using retargeted ads", in: 25th USENIX Security Symposium (USENIX Security 16), pp. 481–496.
- [4] Fouad, Imane; Santos, Cristiana; Legout, Arnaud; Bielova, Nataliia (2022): "My Cookie is a phoenix: Detection, measurement, and lawfulness of cookie respawning with browser fingerprinting", in: PETS 2022 - 22nd Privacy Enhancing Technologies Symposium.
- [5] Urban, Tobias; Degeling, Martin; Holz, Thorsten; Pohlmann, Norbert (2020): "Beyond the Front Page: Measuring Third Party Dynamics in the Field", in: Proceedings of The Web Conference 2020, pp. 1275–1286. Association for Computing Machinery, New York, NY, USA.
programming/stateful_stateless.txt · Last modified: 2025/03/19 08:56 by karelkubicek