Stateful and Stateless Crawling
This page only contains notes
Key message:
- The majority of web measurement studies use stateless crawls, because each observed event can easily be attributed to the single website being visited. Stateless crawls also do not depend on the crawling order and are easier to parallelize.
- Stateful crawling is, however, more representative of real users, who rarely clear their browser state.
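The difference between the two modes can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `Profile`, `visit`, and the `tracker.example` domain are hypothetical names invented for illustration, and no real browser or crawler framework is involved.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # source of fresh tracker identifiers (illustrative only)


@dataclass
class Profile:
    """Toy browser profile: maps a cookie domain to the stored identifier."""
    cookies: dict = field(default_factory=dict)


def visit(profile, site, tracker="tracker.example"):
    """Simulate loading `site`, which embeds a hypothetical third-party `tracker`.

    The tracker reuses its cookie if the profile already has one and sets a
    new one otherwise. Returns the identifier the tracker sees on this visit.
    """
    if tracker not in profile.cookies:
        profile.cookies[tracker] = f"uid-{next(_ids)}"
    return profile.cookies[tracker]


def stateless_crawl(sites):
    # Fresh profile per site: visits are independent of each other and of order.
    return {site: visit(Profile(), site) for site in sites}


def stateful_crawl(sites):
    # One profile for the whole crawl: state carries over between sites.
    profile = Profile()
    return {site: visit(profile, site) for site in sites}


sites = ["news.example", "shop.example", "blog.example"]
stateless = stateless_crawl(sites)
stateful = stateful_crawl(sites)
print(len(set(stateless.values())))  # distinct tracker IDs, stateless crawl
print(len(set(stateful.values())))   # distinct tracker IDs, stateful crawl
```

In the stateless case every visit can run in its own process with no shared data, which is why such crawls parallelize easily. In the stateful case the tracker can link all visits to one identifier, which is closer to what a real user's browser exposes.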
Relevant Literature
Since the majority of publications use stateless crawling, below we list examples of influential publications that do otherwise. However, not all of them specifically address the difference between stateful and stateless crawling.
-
- Comparison of stateless crawls with the stateful browsing experience of real (Firefox) users
- Surprisingly, stateless crawls result in more third-party requests than stateful crawls (Fig. 6).
-
- The most impactful publication utilizing both stateful and stateless crawls
-
- Created “shopper personas” by visiting websites on certain topics, then collected the advertisements shown on subsequently visited websites using these personas
- The web never forgets: Persistent tracking mechanisms in the wild
- Cookies that give you away: The surveillance implications of web tracking
Studies of Stateful Aspects
The following studies used stateless crawls but examined some stateful properties of the web:
-
- Evaluation of cookie respawning
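As a rough illustration of what a respawning check can look like, the sketch below flags cookies that come back with the same value after being deleted. It is a generic heuristic, not the method of the study listed above; the function name, the three-visit setup, and the `(host, name) -> value` input format are all assumptions made for this example.

```python
def respawned_cookies(before, after_clearing, other_machine):
    """Flag cookies that reappear with the same value after being deleted.

    Each argument maps (host, name) -> value as observed during one visit:
      before          first visit with a fresh profile
      after_clearing  revisit after deleting all cookies
      other_machine   visit from a different machine, used to drop values
                      that are identical for every visitor
    """
    flagged = {}
    for key, value in before.items():
        reappeared = after_clearing.get(key) == value
        user_specific = other_machine.get(key) != value
        if reappeared and user_specific:
            flagged[key] = value
    return flagged
```

The comparison against a second machine is there to avoid flagging constant values such as a language preference, which reappear after clearing without identifying anyone.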
Shallow vs Deep crawling
-
- Comparison of visiting only front pages vs crawling sub-pages
- Visiting sub-pages significantly increases the amount of tracking observed
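A deep crawl needs some way to pick sub-pages from the front page. The following is a minimal sketch using only the Python standard library; `subpage_candidates` and its `limit` parameter are hypothetical, and matching on the exact host name is a simplification (a real crawler would more likely compare registrable domains).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def subpage_candidates(front_url, html, limit=5):
    """Return up to `limit` distinct same-host links found on the front page."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(front_url).netloc
    seen, result = {front_url}, []
    for href in parser.hrefs:
        url = urljoin(front_url, href).split("#")[0]  # resolve, drop fragment
        if urlparse(url).netloc == host and url not in seen:
            seen.add(url)
            result.append(url)
    return result[:limit]
```

A shallow crawl would visit only `front_url`, while a deep crawl would additionally visit the returned URLs.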
References
programming/stateful_stateless.1742307641.txt.gz · Last modified: 2025/03/18 14:20 by karelkubicek