Stateful and Stateless Crawling

This page only contains notes

Key message:

The majority of web measurement studies use stateless crawls, because it is easy to associate events with the single website visited. Stateless crawls also do not depend on crawling order and are easier to parallelize.
Stateful crawling is, however, more representative of real users, who rarely clear their browser state.
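
The difference can be illustrated with a toy simulation. This is a minimal sketch, not taken from any of the studies listed on this page: the `BrowserProfile` class, the `visit` and `crawl` functions, and all `.example` domains are hypothetical, and a real crawl would drive an actual browser rather than a dictionary of cookies.

```python
from dataclasses import dataclass, field


@dataclass
class BrowserProfile:
    """Toy stand-in for browser state: cookies keyed by the domain that set them."""
    cookies: dict = field(default_factory=dict)


def visit(profile, site, third_parties):
    """Simulate one page visit; return third parties that found an existing cookie."""
    recognized = []
    for tp in third_parties:
        if tp in profile.cookies:
            recognized.append(tp)  # cookie left over from an earlier visit
        else:
            profile.cookies[tp] = f"id-set-on-{site}"
    return recognized


def crawl(sites, stateful):
    """Visit sites in order; map each site to the third parties that recognized us."""
    profile = BrowserProfile()
    results = {}
    for site, third_parties in sites:
        if not stateful:
            profile = BrowserProfile()  # stateless: fresh state for every site
        results[site] = visit(profile, site, third_parties)
    return results


# Hypothetical sites and the third parties embedded on each of them.
SITES = [
    ("news.example", ["tracker.example", "ads.example"]),
    ("shop.example", ["tracker.example"]),
    ("blog.example", ["ads.example", "cdn.example"]),
]

stateless = crawl(SITES, stateful=False)
stateful = crawl(SITES, stateful=True)
print(stateless)
print(stateful)
```

In the stateless run no third party ever sees an existing cookie, so every event belongs to exactly one site and the sites can be visited in any order or in parallel. In the stateful run the result for a site depends on which sites were visited before it, which is closer to a real user but makes the crawl order part of the experiment.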

Relevant Literature

Since the majority of publications use stateless crawling, below we list examples of influential publications that do otherwise. However, not all of them specifically address the difference between stateful and stateless crawling.

The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
- Comparison of stateless crawls with the stateful browsing experience of real (Firefox) users
- Surprisingly, stateless crawls result in more third-party requests than stateful crawls (Fig. 6).
Online Tracking: A 1-million-site Measurement and Analysis
- The most impactful publication utilizing both stateful and stateless crawls
Tracing Information Flows Between Ad Exchanges Using Retargeted Ads
- Created “shopper personas” by visiting websites on certain topics, then collected the advertisements shown to these personas on subsequently visited websites
The web never forgets: Persistent tracking mechanisms in the wild
Cookies that give you away: The surveillance implications of web tracking

Studies of Stateful Aspects

The following studies used stateless crawls but interpreted some stateful properties of the web:

My Cookie is a phoenix: detection, measurement, and lawfulness of cookie respawning with browser fingerprinting
- Evaluation of cookie respawning

Shallow vs Deep crawling

Beyond the Front Page: Measuring Third Party Dynamics in the Field
- Comparison of visiting only front pages vs. crawling sub-pages
- Visiting sub-pages increases the amount of tracking significantly
Web Privacy Census
- https://www.ftc.gov/system/files/documents/public_events/776191/ialtaweelwebpriv_0.pdf
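
A deep crawl needs some way to pick sub-pages after loading the front page. The following is a minimal sketch of one possible selection step, using only the Python standard library. The function name `subpage_candidates`, the `limit` parameter, and the example HTML are hypothetical, and matching on the exact hostname is a simplifying assumption (a real crawler would more likely compare registrable domains).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def subpage_candidates(front_page_url, html, limit=5):
    """Return up to `limit` distinct same-host URLs linked from the front page."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(front_page_url).hostname
    seen, result = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop the fragment part.
        url = urljoin(front_page_url, href).split("#", 1)[0]
        if urlparse(url).hostname != host:
            continue  # external link, not a sub-page of this site
        if url == front_page_url or url in seen:
            continue  # the front page itself or a duplicate
        seen.add(url)
        result.append(url)
        if len(result) == limit:
            break
    return result


FRONT_PAGE = "https://news.example/"
HTML = """
<a href="/politics">Politics</a>
<a href="sports/today.html">Sports</a>
<a href="https://ads.example/click">Ad</a>
<a href="/politics#top">Politics again</a>
<a href="#footer">Footer</a>
"""

candidates = subpage_candidates(FRONT_PAGE, HTML)
print(candidates)
```

A shallow crawl would stop at `FRONT_PAGE`, while a deep crawl would additionally visit the returned candidates and record third-party activity on each of them.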

References
