Next revision | Previous revision |
privacy:cookies [2025/01/07 17:49] – created cookie classification page karelkubicek | privacy:cookies [2025/03/04 13:53] (current) – [First- and Third-Party Cookies] highlighting common misconception karelkubicek |
---|
====== Classifying Cookies ====== | ====== Classifying Cookies ====== |
| |
Browser cookies are still the most commonly used method for tracking the session state of websites and the identity of visitors. According to prior studies, between 80% in 2012 {[roesner2012_detecting]} and 90% in 2019 {[solomos2019_clash,sanchezrola2019can]} of websites use cookies for user tracking, often without users' knowledge. While other stateless tracking technologies, such as [[Privacy:Fingerprinting]] or [[Privacy:Requests#Link_Decorators|Link decorators]], exist, cookies remain the primary choice, with stateless methods typically used in combination with cookies. This trend persists despite the discontinuation of third-party cookies. | Browser cookies are still the most commonly used method for tracking the session state of websites and the identity of visitors. According to prior studies, between 80% in 2012 {[roesner2012_detecting]} and 90% in 2019 {[solomos2019_clash,sanchezrola2019can]} of websites use cookies for user tracking, often without users' knowledge. While other stateless tracking technologies, such as [[Privacy:Fingerprinting]] or [[Privacy:Requests#classification_of_link_decorators|Link decorators]], exist, cookies remain the primary choice, with stateless methods typically used in combination with cookies. This trend persists despite the discontinuation of third-party cookies. |
| |
Below, we discuss two main classification methods: using datasets of labeled cookies or machine learning (ML) to classify cookies based on their context and request URL. But first some preliminaries. | Below, we discuss two main classification methods: using datasets of labeled cookies or machine learning (ML) to classify cookies based on their context and request URL. But first some preliminaries. |
===== First- and Third-Party Cookies ===== | ===== First- and Third-Party Cookies ===== |
| |
First-party cookies are set by the domain the user is directly visiting, while all other cookies are from third parties. A common misconception is that first-party cookies are always benign and third-party cookies are always intrusive. However, first-party cookies can also track users or even be set by third parties using CNAME cloaking. First-party cookies are restricted to the website's context, while third-party cookies can track users across multiple websites. | <WRAP important>A common misconception in research is that first-party cookies are always benign and third-party cookies are always intrusive!</WRAP> |
| |
| First-party cookies are set by the domain the user is directly visiting, while all other cookies are considered third-party cookies. Although third-party cookies are used for tracking significantly more often than first-party cookies, it is wrong to claim that first-party cookies are always benign and third-party cookies are always intrusive. First-party cookies can also track users or even be set by third parties using [[https://arxiv.org/abs/2102.09301|CNAME cloaking]], and many third-party cookies serve necessary functionality such as single sign-on (SSO). |
| |
| The only difference is in how the browser handles them. First-party cookies are accessible only from the first-party website's context, while third-party cookies are accessible across all websites that embed the same third party. Even this behavior depends on the browser: [[https://webkit.org/blog/8943/privacy-preserving-ad-click-attribution-for-the-web/|Safari]] and [[https://blog.mozilla.org/en/products/firefox/firefox-rolls-out-total-cookie-protection-by-default-to-all-users-worldwide/|Firefox]] partition third-party storage separately for each website. |
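As a rough illustration of the first-/third-party distinction, the following sketch compares the cookie's domain with the host of the visited page. The helper names ''naive_site'' and ''is_third_party'' are hypothetical, and the "last two labels" rule is only an assumption standing in for a proper Public Suffix List lookup. It also cannot detect CNAME cloaking or ghostwritten cookies, which look first-party by construction.

```python
def naive_site(host: str) -> str:
    """Approximate the registrable domain by keeping the last two labels.

    Assumption: this is wrong for suffixes such as co.uk; a real crawler
    should consult the Public Suffix List instead.
    """
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def is_third_party(cookie_domain: str, page_host: str) -> bool:
    """Return True if the cookie domain and the visited page differ in site."""
    return naive_site(cookie_domain) != naive_site(page_host)


if __name__ == "__main__":
    # A cookie scoped to .example.com on shop.example.com is first-party.
    print(is_third_party(".example.com", "shop.example.com"))
    # A cookie scoped to tracker.net on the same page is third-party.
    print(is_third_party("tracker.net", "shop.example.com"))
```

The label says nothing about the cookie's purpose: a cookie that this check marks as first-party may still be a tracking cookie.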
| |
Munir et al. {[shaoor2023cookiegraph]} observed that 89.86% of the top-million websites use first-party tracking cookies. Of these, 96.61% are ghostwritten by third-party scripts embedded in the first-party context, and some are set by fingerprinting scripts. | Munir et al. {[shaoor2023cookiegraph]} observed that 89.86% of the top-million websites use first-party tracking cookies. Of these, 96.61% are ghostwritten by third-party scripts embedded in the first-party context, and some are set by fingerprinting scripts. |
Law((ePrivacy Directive, Article 5.3)) recognizes only two categories of cookies: those strictly necessary for the service and others. The industry, however, has developed various categorization schemes: | Law((ePrivacy Directive, Article 5.3)) recognizes only two categories of cookies: those strictly necessary for the service and others. The industry, however, has developed various categorization schemes: |
| |
* [[https://web.archive.org/web/20240619102040/https://www.cookielaw.org/wp-content/uploads/2019/12/icc_uk_cookiesguide_revnov.pdf|Cookie guide by UK’s International Chamber of Commerce (***ICC UK***) from 2012]] is the most widely adopted scheme, with the following four categories:((Categories description is adapted from {[bollinger2022automating]}. Refer to the guide for the original descriptions, which were found to be unclear to users {[lin2024it,jiwani2024crumbling]}.)) | * [[https://web.archive.org/web/20240619102040/https://www.cookielaw.org/wp-content/uploads/2019/12/icc_uk_cookiesguide_revnov.pdf|Cookie guide by UK’s International Chamber of Commerce (ICC UK) from 2012]] is the most widely adopted scheme, with the following four categories:((Categories description is adapted from {[bollinger2022automating]}. Refer to the guide for the original descriptions, which were found to be unclear to users {[lin2024it,jiwani2024crumbling]}.)) |
* Strictly-necessary cookies: | * Strictly-necessary cookies: |
* Required to enable essential functions of the website, such as registration or shopping carts. They are always enabled to allow for a smooth and problem-free browsing experience. | * Required to enable essential functions of the website, such as registration or shopping carts. They are always enabled to allow for a smooth and problem-free browsing experience. |
Using datasets of cookies or online classification services has significant disadvantages: they cannot classify unseen data or assign one cookie multiple classes based on dynamic content. However, they offer advantages over ML methods, such as post-crawl classification of detected cookies. | Using datasets of cookies or online classification services has significant disadvantages: they cannot classify unseen data or assign one cookie multiple classes based on dynamic content. However, they offer advantages over ML methods, such as post-crawl classification of detected cookies. |
| |
We discuss issues with dynamic cookie names, publicly released datasets, and two main online classification services: Cookiepedia and Cookiedatabase. | <WRAP info> |
| |
==== Dynamic Cookie Names ==== | |
Some websites deviate from the typical key-value (cookie name and cookie value) scheme by storing data directly in the cookie name. There are several cases, illustrated by the following examples: | Some websites deviate from the typical key-value (cookie name and cookie value) scheme by storing data directly in the cookie name. There are several cases, illustrated by the following examples: |
| |
* ''_gat_UA-<ID>'' and ''_ga_<ID>'' (Google Analytics cookies), where the ID is unique to the Google Analytics configuration but not dynamic per user. | * ''_gat_UA-<ID>'' and ''_ga_<ID>'' (Google Analytics cookies), where the ID is unique to the Google Analytics configuration but not dynamic per user. |
* ''AMCV_<ID>@<host>'' (Adobe Experience Cloud Identity Service cookie), where the ID is unique per user. Such cookie names cannot be found in databases due to their dynamic nature. | * ''AMCV_<ID>@<host>'' (Adobe Experience Cloud Identity Service cookie), where the ID is unique per user. Such cookie names cannot be found in databases due to their dynamic nature, unless the database stores patterns. |
| </WRAP> |
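One way to handle the dynamic names above before a dataset lookup is to map them to a stable template. The sketch below is only an illustration: the pattern table and the helper ''normalize_cookie_name'' are hypothetical, the regular expressions are guesses derived from the two examples, and accepting ''%40'' as an encoded ''@'' is an additional assumption.

```python
import re

# Hypothetical pattern table: (compiled regex, stable template name).
# The expressions are assumptions based on the examples in the text.
PATTERNS = [
    (re.compile(r"^_gat_UA-.+$"), "_gat_UA-<ID>"),
    (re.compile(r"^_ga_.+$"), "_ga_<ID>"),
    # Assumption: the '@' may also appear URL-encoded as '%40'.
    (re.compile(r"^AMCV_.+(?:@|%40).+$"), "AMCV_<ID>@<host>"),
]


def normalize_cookie_name(name: str) -> str:
    """Return the template for a dynamic cookie name, or the name unchanged."""
    for pattern, template in PATTERNS:
        if pattern.match(name):
            return template
    return name


if __name__ == "__main__":
    for raw in ["_ga_ABC123", "_gat_UA-12345-1", "AMCV_1234@AdobeOrg", "sessionid"]:
        print(raw, "->", normalize_cookie_name(raw))
```

The normalized name can then be used as the lookup key in a labeled dataset, so that all variants of one cookie share a single entry.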
| |
| We discuss publicly released datasets and two main online classification services: Cookiepedia and Cookiedatabase. |
| |
==== OneTrust and CookieBot Dataset ==== | ==== OneTrust and CookieBot Dataset ==== |
- Preferences Cookies (in some jurisdictions known as functionality) | - Preferences Cookies (in some jurisdictions known as functionality) |
| |
| ==== Cookiesearch ==== |
| |
| https://cookiesearch.org/ (no experience) |
===== Machine-Learning Classification ===== | ===== Machine-Learning Classification ===== |
| |
Using machine learning to classify cookies, rather than relying on static datasets, addresses the limitation of classifying unseen data. Research indicates that ML methods may even outperform human classification. However, practical deployment of ML-based approaches faces challenges similar to those in ML-based advertising blocking: they are prone to adversarial attacks, may disrupt website functionality, and can potentially be used for fingerprinting. These limitations, however, do not hinder the application of ML-based detection in research. | Using machine learning to classify cookies, rather than relying on static datasets, addresses the limitation of classifying unseen data. Research indicates that ML methods may even outperform human classification. However, practical deployment of ML-based approaches faces challenges similar to those in ML-based advertising blocking: they are prone to adversarial attacks, may disrupt website functionality, and can potentially be used for fingerprinting. These limitations, however, do not hinder the application of ML-based detection in research. |
| |
==== CookieBlock ==== | ==== CookieBlock Model ==== |
| |
In **Automating Cookie Consent and GDPR Violation Detection** {[bollinger2022automating]}, researchers developed an ML model to classify cookies according to the four ICC UK purposes. They scraped data from 30k websites using CMPs like OneTrust and CookieBot, collecting over 2 million cookies labeled by website operators. | In **Automating Cookie Consent and GDPR Violation Detection** {[bollinger2022automating]}, researchers developed an ML model to classify cookies according to the four ICC UK purposes. They scraped data from 30k websites using CMPs like OneTrust and CookieBot, collecting over 2 million cookies labeled by website operators. |
- Install the CookieBlock extension in the same browser used by your crawler ([[https://chrome.google.com/webstore/detail/cookieblock/fbhiolckidkciamgcobkokpelckgnnol|Chrome]], [[https://addons.mozilla.org/en-US/firefox/addon/cookieblock/|Firefox]], or other browsers listed on the [[https://karelkubicek.github.io/post/cookieblock|CookieBlock page]]). | - Install the CookieBlock extension in the same browser used by your crawler ([[https://chrome.google.com/webstore/detail/cookieblock/fbhiolckidkciamgcobkokpelckgnnol|Chrome]], [[https://addons.mozilla.org/en-US/firefox/addon/cookieblock/|Firefox]], or other browsers listed on the [[https://karelkubicek.github.io/post/cookieblock|CookieBlock page]]). |
- During installation, ensure all cookie categories and "Keep Track of Cookie History" are enabled. To be safe, open the cookie popup and click "Pause Cookie Removal." | - During installation, ensure all cookie categories and "Keep Track of Cookie History" are enabled. To be safe, open the cookie popup and click "Pause Cookie Removal." |
- Identify the extension ID by navigating to CookieBlock's settings and checking the URL: %%chrome-extension://<ID>/options/cookieblock_options.html%%. | - Identify the extension ID by navigating to CookieBlock's settings and checking the URL: ''%%chrome-extension://<ID>/options/cookieblock_options.html%%''. |
- Export the profile and load it into your crawler. | - Export the profile and load it into your crawler. |
- After the crawl, collect all data from CookieBlock's database. | - After the crawl, collect all data from CookieBlock's database. |
* ''timestamp'': change timestamp | * ''timestamp'': change timestamp |
| |
==== CookieGraph ==== | ==== CookieGraph Model ==== |
| |
**CookieGraph: Understanding and Detecting First-Party Tracking Cookies** {[shaoor2023cookiegraph]} extends CookieBlock to resist adversarial modifications by avoiding easily mutable features (e.g., name) and leveraging network graph features to capture cookie usage patterns. This approach requires even further instrumentation, available only in their custom crawler. | **CookieGraph: Understanding and Detecting First-Party Tracking Cookies** {[shaoor2023cookiegraph]} extends CookieBlock to resist adversarial modifications by avoiding easily mutable features (e.g., name) and leveraging network graph features to capture cookie usage patterns. This approach requires even further instrumentation, available only in their custom crawler. |