Classifying Web Requests

A common task in web privacy measurements is to determine which web requests correspond to the benign loading of required web resources and which are used to track users. There are two main methods for such classification: matching requests against crowd-sourced lists (typically used in ad-blocking or tracking protection extensions) or using machine learning (ML) to classify the requests based on their context and request URL.

This page covers the classification of web requests and, in part, of DOM elements on the loaded page. For the classification of other resources, such as cookies, JavaScript code, or fingerprinting, see the dedicated pages.

Block Lists

Crowd-Sourced and Outdated

The main principle of block lists is their crowd-sourced nature. For instance, the EasyList repository has almost 300 contributors, over 200k commits, and more than 7k resolved issues. This has several implications:

Widespread advertisers and trackers have well-defined and up-to-date rules, while the long tail of tracking companies might not be covered well.
Adding rules is much more common than deleting them. Up to 90% of the resource-blocking rules in EasyList provide no benefit to users in common browsing scenarios [1: Snyder, Peter; Vastel, Antoine; Livshits, Ben (2020): "Who Filters the Filters: Understanding the Growth, Usefulness and Efficiency of Crowdsourced Ad Blocking", Proc. ACM Meas. Anal. Comput. Syst. 4(2)].
There is a cat-and-mouse game between list maintainers and the advertising industry. Some rules (e.g., those written during YouTube's war on ad blocking) are short-lived, so it is necessary to use up-to-date lists.

Blocking Specific Resources

Rules can prevent network requests from being sent (useful for protecting user privacy), either by blocking entire domains or by blocking specific requests based on their paths. Alternatively, rules can be applied after a resource has loaded: they hide the page elements described by a CSS selector, which is more useful for blocking advertisements or annoying elements.
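The two kinds of rules can be told apart by their syntax. The following is a minimal sketch, assuming AdBlock-style filter lines; the function name `classify_rule`, the category labels, and the example rules are illustrative and not part of any list or library. Extended selector syntaxes are not handled.

```python
def classify_rule(line: str) -> str:
    """Return a coarse category for one AdBlock-syntax filter line."""
    rule = line.strip()
    # Lines starting with "!" are comments; "[...]" is the list header.
    if not rule or rule.startswith("!") or rule.startswith("["):
        return "comment"
    # "#@#" disables an element-hiding rule on the given domains.
    if "#@#" in rule:
        return "cosmetic-exception"
    # "##" separates optional domains from a CSS selector to hide.
    if "##" in rule:
        return "cosmetic"
    # "@@" marks an exception (allow) rule for network requests.
    if rule.startswith("@@"):
        return "network-exception"
    # "||" anchors the pattern at a domain or any of its subdomains.
    if rule.startswith("||"):
        return "network-domain-anchored"
    # Everything else is treated as a URL pattern (e.g., a path fragment).
    return "network-pattern"


# Hypothetical example rules, one per category.
EXAMPLE_RULES = [
    "! Title: demo list",
    "||tracker.example^",
    "/pixel.gif?uid=",
    "example.com##.ad-banner",
    "@@||cdn.example^$script",
]

if __name__ == "__main__":
    for example in EXAMPLE_RULES:
        print(f"{classify_rule(example):24} {example}")
```

Network rules matter when measuring tracking requests, while cosmetic rules are mainly relevant when analyzing DOM elements such as ad containers.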

Existing Lists

DNS-blocking lists
AdBlock-syntax filters. The most established are EasyList filters:
- EasyList: Blocks advertisements except for self-promotion (first-party ads such as featured articles). See the full policy.
- EasyPrivacy: Aims to block tracking and improve user privacy. The policy is documented at the bottom of this page. It supports four categories:
  - Generic blocks: Common URL/tracking filter patterns used by 1st- and 3rd-parties.
  - 1st-party tracking: Self-hosted trackers and CNAME trackers.
  - 3rd-party tracking: Tracking scripts hosted by a third-party provider that is not itself a tracking company.
  - Tracking servers: Servers with the sole purpose of tracking/analyzing users are blocked at the URL level.
Country-specific lists: Mostly block advertising.
Technology-specific lists:
- Cookie notices: To detect cookie notices, consider the following options:
  - EasyList Cookie List: A CSS selector list that is the most generic and up-to-date.
  - I (Still) Don't Care About Cookies: In addition to removing DOM elements described by EasyList Cookie List, this extension performs simple actions like clicking “accept all” or “reject all” when available. It is useful for passing consent notices during crawling when the specific consent action is unimportant. The rules are available here.
  - Consent-O-Matic: If you care about consent actions (accepting or rejecting consent), consider the rules of the Consent-O-Matic extension. Forks by DuckDuckGo (seems to be the most active) and Mozilla are also available. These work with specific consent providers (Consent Management Platforms, CMPs).

Programming: Using Lists

Block lists use a custom pattern syntax (wildcards, anchors, and optionally full regular expressions) to decide which resources to block. Consider the following parsing libraries:

abp-blocklist-parser:
- Can decide whether to block requests, images, etc.
adblockparser:
- An older parser with some limitations.

The advantage of these libraries is that they can determine in post-processing which resources would have been blocked. You can run a crawl that allows all resources and apply the filters later when analyzing the data.
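The post-processing idea can be sketched with the standard library alone. This is a simplified illustration and not a replacement for the parsers above: it assumes only basic blocking rules (`||` domain anchors, `^` separators, `*` wildcards) and skips comments, exceptions, cosmetic rules, and rules with `$` options. The names `rule_to_regex`, `load_rules`, and `classify_requests`, as well as the example URLs, are hypothetical.

```python
import re

# "^" in a filter matches a separator: anything except a letter, digit,
# or one of "_ - . %", or alternatively the end of the URL.
SEPARATOR = r"(?:[^\w.%-]|$)"


def rule_to_regex(rule):
    """Translate one basic blocking rule into a compiled regex."""
    domain_anchored = rule.startswith("||")
    body = rule[2:] if domain_anchored else rule
    pattern = re.escape(body).replace(r"\*", ".*").replace(r"\^", SEPARATOR)
    if domain_anchored:
        # Match the scheme plus an optional chain of subdomains.
        pattern = r"^[a-z][a-z0-9+.-]*://(?:[^/?#]*\.)?" + pattern
    return re.compile(pattern, re.IGNORECASE)


def load_rules(lines):
    """Keep only the rule subset this sketch understands."""
    compiled = []
    for line in lines:
        rule = line.strip()
        if not rule or rule[0] in "![" or rule.startswith("@@"):
            continue  # comments, list header, exception rules
        if "#" in rule or "$" in rule:
            continue  # cosmetic rules and rules with options
        compiled.append((rule, rule_to_regex(rule)))
    return compiled


def classify_requests(urls, compiled_rules):
    """Map each logged URL to the first matching rule, or None."""
    result = {}
    for url in urls:
        result[url] = next(
            (rule for rule, regex in compiled_rules if regex.search(url)),
            None,
        )
    return result


if __name__ == "__main__":
    rules = load_rules(["! demo", "||tracker.example^", "/pixel.gif?uid="])
    crawl_log = [
        "https://cdn.tracker.example/lib.js",
        "https://shop.example/pixel.gif?uid=42",
        "https://shop.example/style.css",
    ]
    for url, rule in classify_requests(crawl_log, rules).items():
        print(url, "->", rule or "not blocked")
```

For real measurements, a full parser is preferable because rule options (such as resource type or third-party restrictions) and exception rules change the outcome.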

If you only need CSS selector filters during a crawl, your crawling library likely supports them natively. For example, in Selenium, you can use driver.find_element(By.CSS_SELECTOR, "img#tracker").

ML Classification

While ad-blocking lists are used by up to a billion users, machine-learning-based blocking has not been widely adopted.¹⁾ There are several reasons for this:

Using ML to dynamically decide whether to block a resource might make the browser fingerprintable.
Adversarial machine-learning methods suggest that if ML blocking becomes popular, automated methods to evade detection will emerge.
Blocking content based on ML increases the chances of breaking websites, especially in non-reproducible ways.

However, these challenges do not limit the application of ML-based detection in research. Several excellent publications have developed robust ML pipelines to detect advertising and privacy-intrusive resources, making them worth considering.

AdGraph

AdGraph: A Graph-Based Approach to Ad and Tracker Blocking [2: Iqbal, Umar; Snyder, Peter; Zhu, Shitong; Livshits, Benjamin; Qian, Zhiyun; Shafiq, Zubair (2020): "AdGraph: A Graph-Based Approach to Ad and Tracker Blocking", in: 2020 IEEE Symposium on Security and Privacy (SP), pp. 763-776] trains an ML classifier with labels based on EasyList lists. It constructs a graph structure of web elements, network requests, and JavaScript execution for feature extraction. Example features include graph size, node degree, request length, domain party, and the presence of advertising keywords in requests. A random forest model achieves performance above 90%, as shown below:

ML performance of AdGraph on various lists according to [2: Iqbal, Umar; Snyder, Peter; Zhu, Shitong; Livshits, Benjamin; Qian, Zhiyun; Shafiq, Zubair (2020): "AdGraph: A Graph-Based Approach to Ad and Tracker Blocking", in: 2020 IEEE Symposium on Security and Privacy (SP), pp. 763-776].

Repository with instrumented crawler
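To make the request-level features mentioned above more concrete, here is a minimal sketch of how a few of them could be computed from a request URL. It is not AdGraph's feature extraction: the graph features (graph size, node degree) require an instrumented browser and are omitted. The keyword list, the function names, and the naive "last two labels" approximation of the registrable domain are all assumptions made for illustration; a real pipeline would use the Public Suffix List.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical keyword list, chosen only for this example.
AD_KEYWORDS = ("advert", "banner", "track", "pixel")


def registrable_domain(host):
    """Naive approximation: keep the last two labels of the hostname."""
    return ".".join(host.lower().split(".")[-2:])


def request_features(request_url, page_url):
    """Compute a few URL-level features for one request on one page."""
    request = urlsplit(request_url)
    page = urlsplit(page_url)
    lowered = request_url.lower()
    return {
        # Length of the full request URL in characters.
        "request_length": len(request_url),
        # Domain party: does the request leave the page's own domain?
        "is_third_party": registrable_domain(request.hostname or "")
        != registrable_domain(page.hostname or ""),
        # Presence of an advertising-related keyword anywhere in the URL.
        "has_ad_keyword": any(keyword in lowered for keyword in AD_KEYWORDS),
        # Number of query-string parameters.
        "num_query_params": len(parse_qsl(request.query)),
    }


if __name__ == "__main__":
    features = request_features(
        "https://ads.tracker.example/pixel.gif?uid=42&ref=home",
        "https://news.site.example/article",
    )
    print(features)
```

Feature dictionaries of this shape can be turned into rows of a feature matrix and passed to a classifier such as a random forest.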

WebGraph

WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking [3: Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2022): "WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2875-2892. USENIX Association, Boston, MA] is a follow-up to AdGraph. It improves feature processing to address adversarial ML methods, removes dependency on modifiable content features, and enhances overall performance.

Repository with trained model and pipeline

Khaleesi

Khaleesi: Breaker of Advertising and Tracking Request Chains [4: Iqbal, Umar; Wolfe, Charlie; Nguyen, Charles; Englehardt, Steven; Shafiq, Zubair (2022): "Khaleesi: Breaker of Advertising and Tracking Request Chains", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2911-2928. USENIX Association, Boston, MA] also extends AdGraph. A repository with the trained model and pipeline is available.

Additionally, it offers a Firefox extension that blocks advertising chains. While not directly suitable for crawls (the current implementation blocks requests), you can disable the functionality here and collect logs to classify ads.

Classification of Link Decorators

TODO add:

PURL: Safe and Effective Sanitization of Link Decoration [5: Munir, Shaoor; Lee, Patrick; Iqbal, Umar; Shafiq, Zubair; Siby, Sandra (2024): "PURL: Safe and Effective Sanitization of Link Decoration", in: 33rd USENIX Security Symposium (USENIX Security 24), pp. 4103-4120. USENIX Association, Philadelphia, PA]

Classification of Cookie Notices and Their Interactive Elements

TODO

Use in Publications