Classifying Web Requests

A common task in web privacy measurements is to determine which web requests correspond to the benign loading of required web resources and which are used to track users. There are two main methods for such classification: matching requests against crowd-sourced lists (typically used in ad-blocking or tracking protection extensions) or using machine learning (ML) to classify the requests based on their context and request URL.

This page covers the classification of web requests and, in part, of DOM elements on the loaded page. For the classification of other artifacts, such as cookies, JavaScript code, or fingerprinting behavior, see the dedicated pages.

Block Lists

Crowd-Sourced and Outdated

The main principle of block lists is their crowd-sourced nature. For instance, the EasyList repository has almost 300 contributors, over 200k commits, and more than 7k resolved issues. This has several implications:

Blocking Specific Resources

Rules can prevent actions from happening (useful for protecting user privacy), either by blocking entire domains or specific requests based on their paths. Alternatively, rules can be applied after loading a resource (described by a CSS selector) to prevent its rendering, which is more useful for blocking advertisements or annoying elements.
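
The two kinds of rules described above can be illustrated with a minimal sketch. It assumes a heavily simplified subset of the usual filter syntax: `||domain^` for domain blocking, a plain path fragment for request blocking, and `##selector` for element hiding. The example rules, domains, and function names are hypothetical, and real lists contain many more rule types and options than this handles.

```python
from urllib.parse import urlsplit

def parse_rule(line):
    """Split a filter line into a (kind, value) pair, or None if unsupported."""
    line = line.strip()
    if not line or line.startswith("!"):       # blank line or comment
        return None
    if "##" in line:                           # cosmetic rule: [domains]##selector
        _domains, selector = line.split("##", 1)
        return ("cosmetic", selector)
    if line.startswith("||") and line.endswith("^"):
        return ("domain", line[2:-1])          # a domain and all of its subdomains
    if line.startswith("/") and not line.endswith("/"):
        return ("path", line)                  # plain substring match on the URL
    return None                                # regex rules, options, etc. are skipped

def is_blocked(url, rules):
    """Network-level decision: would this request be blocked before loading?"""
    host = urlsplit(url).hostname or ""
    for kind, value in rules:
        if kind == "domain" and (host == value or host.endswith("." + value)):
            return True
        if kind == "path" and value in url:
            return True
    return False

def hidden_selectors(rules):
    """Cosmetic rules: CSS selectors to hide after the page has loaded."""
    return [value for kind, value in rules if kind == "cosmetic"]

# Hypothetical example rules, not taken from any real list.
FILTERS = """
! comment line
||tracker.example^
/ads/banner.
##img#tracker
"""
RULES = [rule for rule in map(parse_rule, FILTERS.splitlines()) if rule]
```

Network rules are evaluated against the request URL before anything is fetched, while the selectors returned by `hidden_selectors` only make sense once the DOM exists.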

Existing Lists

Programming: Using Lists

Block lists use a custom filter syntax (URL patterns with wildcards and anchors, plus optional regular expressions) to decide which resources to block. Consider the following parsing libraries:

The advantage of these libraries is that they can classify which resources would have been blocked in post-processing. You can run a crawl allowing all resources and later use the filters to analyze the data.
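
The post-processing workflow can be sketched as follows. The crawl records, their field names, and the `should_block` stand-in are all assumptions made for illustration; in a real analysis the matcher would be provided by whichever filter-list parsing library you choose, and the records would come from your crawl database.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical crawl output: one record per request, collected with blocking disabled.
CRAWL_LOG = [
    {"page": "https://news.example/", "url": "https://news.example/style.css"},
    {"page": "https://news.example/", "url": "https://cdn.tracker.example/pixel.gif"},
    {"page": "https://shop.example/", "url": "https://tracker.example/collect?id=1"},
]

BLOCKED_DOMAINS = {"tracker.example"}  # stand-in for a parsed filter list

def should_block(url):
    """Stand-in matcher; a real analysis would delegate to a filter-list parser."""
    host = urlsplit(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)

def label_requests(log, matcher):
    """Return a copy of each record with a boolean 'blocked' label added."""
    return [{**record, "blocked": matcher(record["url"])} for record in log]

def blocked_per_page(labelled):
    """Count, per visited page, the requests that the list would have blocked."""
    return Counter(r["page"] for r in labelled if r["blocked"])

labelled = label_requests(CRAWL_LOG, should_block)
summary = blocked_per_page(labelled)
```

Because the matcher is passed in as a plain function, the same crawl data can be re-labelled later with a different list or a newer list version without re-crawling.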

If you only need CSS selector filters during a crawl, your crawling library likely supports them natively. For example, in Selenium, you can use driver.find_element(By.CSS_SELECTOR, "img#tracker").

ML Classification

While ad-blocking lists are used by up to a billion users, machine-learning-based blocking has not been widely adopted.1) There are several reasons for this:

However, these challenges do not limit the application of ML-based detection in research. Several excellent publications have developed robust ML pipelines to detect advertising and privacy-intrusive resources, making them worth considering.

AdGraph

AdGraph: A Graph-Based Approach to Ad and Tracker Blocking [2] uses ML classification trained on labels from EasyList and related lists. It constructs a graph structure of web elements, network requests, and JavaScript execution for feature extraction. Example features include graph size, node degree, request length, domain party, and the presence of advertising keywords in requests. A random forest model achieves performance above 90%, as shown below:

Figure: ML performance of AdGraph on various lists according to Iqbal et al. [2].
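
To make the feature types listed for AdGraph more concrete, here is a minimal sketch of how such features might be computed for a single request node. It is not the authors' implementation: the toy graph, the keyword list, and the function name are hypothetical, and the party check compares full hostnames where a real pipeline would compare registrable domains.

```python
from urllib.parse import urlsplit

AD_KEYWORDS = ("ads", "banner", "track", "pixel")  # hypothetical keyword list

# Toy graph: node -> set of neighbours (page, script, element, and request nodes).
GRAPH = {
    "page": {"script1", "img1"},
    "script1": {"page", "req1"},
    "img1": {"page"},
    "req1": {"script1"},
}

def request_features(node, url, page_url, graph):
    """Build a small feature dictionary for one request node."""
    host = urlsplit(url).hostname or ""
    page_host = urlsplit(page_url).hostname or ""
    return {
        "graph_size": len(graph),               # number of nodes in the page graph
        "node_degree": len(graph[node]),        # edges attached to this request
        "request_length": len(url),             # length of the request URL
        "third_party": host != page_host,       # naive party check on hostnames
        "has_ad_keyword": any(k in url.lower() for k in AD_KEYWORDS),
    }

features = request_features(
    "req1",
    "https://cdn.tracker.example/pixel.gif",
    "https://news.example/",
    GRAPH,
)
```

Feature dictionaries of this kind, one per request, would then be turned into vectors and passed to a classifier such as the random forest mentioned above.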

Repository with instrumented crawler

WebGraph

WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking [3] is a follow-up to AdGraph. It improves feature processing to address adversarial ML methods, removes dependency on modifiable content features, and enhances overall performance.

Repository with trained model and pipeline

Khaleesi

Khaleesi: Breaker of Advertising and Tracking Request Chains [4] also extends AdGraph. The authors provide a repository with the trained model and pipeline.

Additionally, it offers a Firefox extension that blocks advertising chains. Because the current implementation blocks requests, the extension is not directly suitable for crawls; however, you can disable the blocking functionality and collect its logs to classify ads.

TODO add:

PURL: Safe and Effective Sanitization of Link Decoration [5]

TODO

Use in Publications

TODO

References

[1] Snyder, Peter; Vastel, Antoine; Livshits, Ben (2020): "Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking", Proc. ACM Meas. Anal. Comput. Syst. 4(2). (DOI) (Link)
[2] Iqbal, Umar; Snyder, Peter; Zhu, Shitong; Livshits, Benjamin; Qian, Zhiyun; Shafiq, Zubair (2020): "AdGraph: A Graph-Based Approach to Ad and Tracker Blocking", in: 2020 IEEE Symposium on Security and Privacy (SP), pp. 763-776. (DOI)
[3] Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2022): "WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2875-2892. USENIX Association, Boston, MA. (Link)
[4] Iqbal, Umar; Wolfe, Charlie; Nguyen, Charles; Englehardt, Steven; Shafiq, Zubair (2022): "Khaleesi: Breaker of Advertising and Tracking Request Chains", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2911-2928. USENIX Association, Boston, MA. (Link)
[5] Munir, Shaoor; Lee, Patrick; Iqbal, Umar; Shafiq, Zubair; Siby, Sandra (2024): "PURL: Safe and Effective Sanitization of Link Decoration", in: 33rd USENIX Security Symposium (USENIX Security 24), pp. 4103-4120. USENIX Association, Philadelphia, PA. (Link)
1) A notable ML-based browser extension was Privacy Badger, which used “local learning” to detect tracking. This feature was later disabled by default to prevent fingerprinting.