====== Classifying Web Requests ======

A common task in web privacy measurements is to determine which web requests correspond to the benign loading of required web resources and which are used to track users. There are two main methods for such classification: matching requests against crowd-sourced lists (typically used in ad-blocking or tracking-protection extensions) or using machine learning (**ML**) to classify the requests based on their context and request URL.

This page is dedicated to the classification of web requests and, in part, DOM elements on the loaded page. For the classification of other resources, such as [[Privacy:Cookies|cookies]], [[Privacy:JavaScript|JavaScript code]], or [[Privacy:Fingerprinting|fingerprinting]], navigate to the specific pages.

===== Block Lists =====

==== Crowd-Sourced and Outdated ====

The main principle of block lists is their crowd-sourced nature. For instance, the [[https://github.com/easylist/easylist|EasyList repository]] has almost 300 contributors, over 200k commits, and more than 7k resolved issues. This has several implications:

  * Widespread advertisers and trackers have well-defined and up-to-date rules, while the long tail of tracking companies might not be covered well.
  * Adding rules is much more common than deleting them. Up to 90% of the resource-blocking rules in EasyList provide no benefit to users in common browsing scenarios {[snyder2020_who]}.
  * There is a cat-and-mouse game between list maintainers and the advertising industry. Some rules (e.g., [[https://www.engadget.com/inside-the-arms-race-between-youtube-and-ad-blockers-140031824.html|YouTube's war on ad blocking]]) are short-lived, making it necessary to use up-to-date lists.

==== Blocking Specific Resources ====

Rules can prevent actions from happening (useful for protecting user privacy), either by blocking entire domains or specific requests based on their paths. Alternatively, rules can be applied after a resource has loaded (described by a CSS selector) to prevent its rendering, which is more useful for blocking advertisements or annoying elements. The sketch below illustrates both rule types.
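To make the two rule types concrete, here is a small excerpt in AdBlock filter syntax. The domains and selectors are hypothetical placeholders; network rules match requests before they are sent, while element-hiding rules (containing ''##'') apply CSS selectors to the loaded page.

<code>
! Network rules (prevent requests from being sent):
! Block any request to the domain and its subdomains.
||ads.example.com^
! Block only third-party script requests to this domain.
||tracker.example.net^$third-party,script

! Element-hiding rules (hide already-loaded elements via CSS selectors):
! Hide elements with this class on all sites.
##.ad-banner
! Hide the element with this id, but only on news.example.org.
news.example.org###sponsored-box
</code>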
==== Existing Lists ====

  * DNS-blocking lists
  * [[https://web.archive.org/web/20241231165520/https://help.adblockplus.org/hc/en-us/articles/360062733293-How-to-write-filters|AdBlock-syntax]] filters. The most established are [[https://easylist.to/index.html|EasyList filters]]:
    * **EasyList**: Blocks advertisements except for self-promotion (first-party ads such as featured articles). See the full [[https://easylist.to/pages/policy.html|policy]].
    * **EasyPrivacy**: Aims to block tracking and improve user privacy. The policy is documented at the bottom of [[https://easylist.to/pages/policy.html|this page]]. It supports four categories:
      * Generic blocks: Common URL/tracking filter patterns used by 1st- and 3rd-parties.
      * 1st-party tracking: Self-hosted trackers and CNAME trackers.
      * 3rd-party tracking: Tracking scripts hosted by a third-party provider that is not itself a tracking company.
      * Tracking servers: Servers whose sole purpose is tracking/analyzing users; these are blocked at the URL level.
  * [[https://easylist.to/pages/other-supplementary-filter-lists-and-easylist-variants.html|Country-specific lists]]: Mostly block advertising.
  * Technology-specific lists:
    * Cookie notices: To detect cookie notices, consider the following options:
      * **EasyList Cookie List**: A CSS selector list that is the most generic and up-to-date.
      * **I (Still) Don't Care About Cookies**: In addition to removing DOM elements described by the EasyList Cookie List, this extension performs simple actions like clicking "accept all" or "reject all" when available. It is useful for passing consent notices during crawling when the specific consent action is unimportant. The rules are available [[https://github.com/OhMyGuus/I-Still-Dont-Care-About-Cookies/blob/master/src/data/js/5_clickHandler.js|here]].
      * **Consent-O-Matic**: If you care about consent actions (accepting or rejecting consent), consider the rules of the [[https://github.com/cavi-au/Consent-O-Matic|Consent-O-Matic]] extension. Forks by [[https://github.com/duckduckgo/autoconsent|DuckDuckGo]] (seemingly the most active) and [[https://github.com/mozilla/cookie-banner-rules-list|Mozilla]] are also available. These work with specific consent providers (Consent Management Platforms, **CMPs**).

==== Programming: Using Lists ====

Block lists use regular expressions with custom syntax to decide which resources to block. Consider the following parsing libraries:

  * [[https://github.com/englehardt/abp-blocklist-parser|abp-blocklist-parser]]:
    * Can decide whether to block requests, images, etc.
  * [[https://github.com/scrapinghub/adblockparser|adblockparser]]:
    * An older parser with some [[https://github.com/scrapinghub/adblockparser#limitations|limitations]].

The advantage of these libraries is that they can classify which resources would have been blocked in post-processing: you can run a crawl allowing all resources and later use the filters to analyze the data, as in the sketch below. If you only need CSS selector filters during a crawl, your crawling library likely supports them natively. For example, in Selenium, you can use ''driver.find_element(By.CSS_SELECTOR, "img#tracker")''.
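A minimal post-processing sketch, assuming Python with the ''adblockparser'' package (the file path, logged URLs, and request options are illustrative):

<code python>
# pip install adblockparser
from adblockparser import AdblockRules

# Load a previously downloaded EasyList file (path is an assumption).
with open("easylist.txt", encoding="utf-8") as f:
    raw_rules = [line.strip() for line in f
                 if line.strip() and not line.startswith("!")]

rules = AdblockRules(raw_rules)

# Request URLs logged during a crawl, with the context the rules may need.
logged_requests = [
    ("https://ads.example.com/banner.js", {"script": True, "third-party": True}),
    ("https://cdn.example.org/style.css", {"stylesheet": True, "third-party": False}),
]

for url, options in logged_requests:
    verdict = "blocked" if rules.should_block(url, options) else "allowed"
    print(url, "->", verdict)
</code>

Note that matching the full EasyList with pure-Python regular expressions is slow; the ''adblockparser'' README recommends installing ''pyre2'' to speed this up.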
===== ML Classification =====

While ad-blocking lists are used by up to a billion users, machine-learning-based blocking has not been widely adopted.((A notable ML-based browser extension was Privacy Badger, which used "local learning" to detect tracking. [[https://www.eff.org/deeplinks/2020/10/privacy-badger-changing-protect-you-better|This feature was later disabled to prevent fingerprinting]].)) There are several reasons for this:

  * Using ML to dynamically decide whether to block a resource might make the browser fingerprintable.
  * Adversarial machine-learning methods suggest that if ML blocking becomes popular, automated methods to evade detection will emerge.
  * Blocking content based on ML increases the chances of breaking websites, especially in non-reproducible ways.

However, these challenges do not limit the application of ML-based detection in research. Several excellent publications have developed robust ML pipelines to detect advertising and privacy-intrusive resources, making them worth considering.

==== AdGraph ====

**AdGraph: A Graph-Based Approach to Ad and Tracker Blocking** {[iqbal2020_adgraph]} uses ML classification with labels derived from EasyList lists. It constructs a graph of web elements, network requests, and JavaScript execution for feature extraction. Example features include graph size, node degree, request length, domain party, and the presence of advertising keywords in requests. A random forest model achieves performance above 90%, as shown below:

{{privacy:privacy_adgraph_performance_lists_iqbal2020_adgraph.png|ML performance of AdGraph on various lists according to Iqbal et al.}}

ML performance of AdGraph on various lists according to {[iqbal2020_adgraph]}.

[[https://github.com/uiowa-irl/AdGraph|Repository with instrumented crawler]]
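A simplified illustration of this style of pipeline, not the paper's actual code: the graph construction, feature set, and labels below are toy assumptions, using ''networkx'' for the page graph and ''scikit-learn'' for the random forest.

<code python>
# pip install networkx scikit-learn
import networkx as nx
from sklearn.ensemble import RandomForestClassifier

AD_KEYWORDS = ("ad", "banner", "track", "pixel")  # simplified keyword list

def request_features(graph, node):
    """Structural and URL features for one request node (simplified)."""
    url = graph.nodes[node].get("url", "").lower()
    return [
        graph.number_of_nodes(),                             # graph size
        graph.degree(node),                                  # node degree
        len(url),                                            # request URL length
        float(graph.nodes[node].get("third_party", False)),  # domain party
        float(any(k in url for k in AD_KEYWORDS)),           # ad keywords in URL
    ]

# Toy page graph: a third-party script that triggers a tracking request,
# plus a benign first-party image.
g = nx.DiGraph()
g.add_node("script", url="https://cdn.example.com/lib.js", third_party=True)
g.add_node("pixel", url="https://track.example.net/pixel?uid=1", third_party=True)
g.add_node("logo", url="https://example.org/logo.png", third_party=False)
g.add_edge("script", "pixel")  # the script issued the pixel request

X = [request_features(g, n) for n in ("pixel", "logo")]
y = [1, 0]  # ground-truth labels derived from filter lists

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X))  # expected: [1 0]
</code>

In the actual pipeline, the graph comes from an instrumented browser and the labels come from filter lists such as EasyList, as described in the paper.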
==== WebGraph ====

**WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking** {[siby2022_webgraph]} is a follow-up to AdGraph. It improves feature processing to address adversarial ML methods, removes the dependency on modifiable content features, and enhances overall performance.

[[https://github.com/spring-epfl/WebGraph|Repository with trained model and pipeline]]

==== Khaleesi ====

**Khaleesi: Breaker of Advertising and Tracking Request Chains** {[iqbal2022_khaleesi]} also extends AdGraph. Here is a [[https://github.com/uiowa-irl/Khaleesi|repository with trained model and pipeline]]. Additionally, it offers a [[https://github.com/uiowa-irl/Khaleesi?tab=readme-ov-file#browser-extension|Firefox extension]] that blocks advertising chains. While not directly suitable for crawls (the current implementation blocks requests), you can disable the blocking functionality [[https://github.com/uiowa-irl/Khaleesi/blob/main/browser_extension/background/background.js#L52|here]] and collect logs to classify ads.

==== Classification of Link Decorators ====

TODO add: **PURL: Safe and Effective Sanitization of Link Decoration** {[shaoor2024purl]}

==== Classification of Cookie Notices and Their Interactive Elements ====

TODO

===== Use in Publications =====

TODO

====== References ======

/* To insert citations, you have to follow 3 steps:
  - Check whether the BibTeX entry is in https://measuretheweb.org/literature/bibliography; if not, add it to the end. For example, ''LePochat2019_tranco'' is already present.
  - In the text, where you want to use the citation, use {[LePochat2019_tranco]} - it will be rendered as [1] or the number given by order in the page.
  - Keep this section as it is to print the bibliography.
If any of these steps fails, you will see a purple warning in the appropriate page. */

~~DISCUSSION~~