Classifying Cookies

Browser cookies are still the most commonly used method for tracking the session state of websites and the identity of visitors. According to prior studies, between 80% of websites in 2012 [1] and 90% in 2019 [5, 6] use cookies for user tracking, often without users' knowledge. Stateless tracking technologies, such as fingerprinting or link decoration, also exist, but cookies remain the primary choice, and stateless methods are typically used in combination with cookies. This trend persists despite browsers phasing out third-party cookies.

Below, we discuss two main classification methods: looking cookies up in datasets of labeled cookies, and using machine learning (ML) to classify cookies based on their context and request URL. But first, some preliminaries.

First- and Third-Party Cookies

First-party cookies are set by the domain the user is directly visiting, while all other cookies are from third parties. A common misconception is that first-party cookies are always benign and third-party cookies are always intrusive. However, first-party cookies can also track users or even be set by third parties using CNAME cloaking. First-party cookies are restricted to the website's context, while third-party cookies can track users across multiple websites.
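The distinction above can be sketched as a simple domain comparison. This is a minimal sketch, not a definitive implementation: the function name is_first_party is hypothetical, and the check is a naive suffix match. A robust crawler would compare registrable domains (eTLD+1) using the Public Suffix List, and no domain comparison of this kind detects CNAME cloaking or cookies set by third-party scripts running in the first-party context.

```python
def is_first_party(site_host: str, cookie_domain: str) -> bool:
    """Naive check: does the cookie's domain cover the visited host?

    Assumption: a cookie counts as first-party when the visited host equals
    the cookie domain or is a subdomain of it. Public-suffix handling is
    deliberately left out of this sketch.
    """
    host = site_host.lower().strip(".")
    domain = cookie_domain.lower().strip(".")  # ".example.com" -> "example.com"
    if not host or not domain:
        return False
    return host == domain or host.endswith("." + domain)


# Usage with made-up hosts
print(is_first_party("www.example.com", ".example.com"))      # cookie of the visited site
print(is_first_party("www.example.com", ".doubleclick.net"))  # cookie of another party
```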

Munir et al. [2] observed that 89.86% of the top-million websites use first-party tracking cookies. Of these, 96.61% are ghostwritten by third-party scripts embedded in the first-party context, and some are set by fingerprinting scripts.

Datasets and Classification Services

Using datasets of cookies or online classification services has significant disadvantages: they cannot classify unseen data or assign one cookie multiple classes based on dynamic content. However, they offer advantages over ML methods, such as post-crawl classification of detected cookies.

We discuss issues with dynamic cookie names, publicly released datasets, and two main online classification services: Cookiepedia and Cookiedatabase.

Dynamic Cookie Names

Some websites deviate from the typical key-value (cookie name and cookie value) scheme by storing data directly in the cookie name. The following examples illustrate two cases:

_gat_UA-<ID> and _ga_<ID> (Google Analytics cookies), where the ID is unique to the Google Analytics configuration but not dynamic per user.
AMCV_<ID>@<host> (Adobe Experience Cloud Identity Service cookie), where the ID is unique per user. Such cookie names cannot be found in databases due to their dynamic nature.
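One way to make such names searchable in a database is to normalize them to a template before the lookup. The sketch below is an illustration under assumptions: the regular expressions are guesses derived only from the two examples above, and the rule list and function name are hypothetical rather than taken from any of the tools discussed here.

```python
import re

# Hypothetical normalization rules derived from the examples above.
# Each rule maps a regular expression to the template name used for lookup.
DYNAMIC_NAME_RULES = [
    (re.compile(r"^_gat_UA-[\d-]+$"), "_gat_UA-<ID>"),
    (re.compile(r"^_ga_[A-Za-z0-9]+$"), "_ga_<ID>"),
    (re.compile(r"^AMCV_[^@]+@.+$"), "AMCV_<ID>@<host>"),
]


def normalize_cookie_name(name: str) -> str:
    """Replace the dynamic part of a known cookie name with a placeholder."""
    for pattern, template in DYNAMIC_NAME_RULES:
        if pattern.match(name):
            return template
    return name  # unknown or static names are returned unchanged


print(normalize_cookie_name("_ga_KHZNC1Q6K0"))
print(normalize_cookie_name("_ga"))
```

Names that match no rule pass through unchanged, so the function can be applied to every cookie before a dataset lookup.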

OneTrust and CookieBot Dataset

Many websites label cookies in their consent notices or privacy policies. Some CMPs unify the display of cookie lists, which enables large-scale scraping of these labels. Bollinger and Kubicek et al. [3] crawled 30k websites using CMPs, collecting a dataset of 2.2 million declared cookies, with over 80% being third-party cookies.

The dataset is available here as part of this publication artifact. Scraped cookie categories are in /04_Cookie_Databases/tranco_05May_20210510_201615.sqlite, in the table consent_data, with the following columns:

name and domain: Identify cookies by name and domain.
cat_name: Category defined by website operators via the CMP, often aligning with the 12 IAB TCF purposes.
cat_id: Numeric representation of ICC UK categories: 0 = Strictly-necessary, 1 = Functionality, 2 = Analytics, 3 = Advertising/tracking.

Note that 80% of the cookies are third-party cookies, and the majority of these have multiple entries for a given name and domain. These entries may carry contradictory labels: 7.2% of cookies have labels that do not match the majority label. You should therefore aggregate the labels for each cookie name and domain pair and pick the most frequent category.
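The aggregation described above can be sketched as a majority vote over the consent_data table. The example uses an in-memory SQLite database with made-up rows so that it runs on its own; the function name majority_labels is hypothetical, and only the columns listed above (name, domain, cat_name, cat_id) are assumed to exist. To run it on the real dataset, open the published .sqlite file with sqlite3.connect instead.

```python
import sqlite3
from collections import Counter, defaultdict


def majority_labels(conn: sqlite3.Connection) -> dict:
    """Return the most frequent cat_id for every (name, domain) pair."""
    votes = defaultdict(Counter)
    query = "SELECT name, domain, cat_id FROM consent_data"
    for name, domain, cat_id in conn.execute(query):
        votes[(name, domain)][cat_id] += 1
    # Ties are resolved in favor of the label that was seen first.
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


# Toy data (made up) in the same schema as the published table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE consent_data (name TEXT, domain TEXT, cat_name TEXT, cat_id INTEGER)"
)
conn.executemany(
    "INSERT INTO consent_data VALUES (?, ?, ?, ?)",
    [
        ("_ga", ".example.com", "Statistics", 2),
        ("_ga", ".example.com", "Performance", 2),
        ("_ga", ".example.com", "Necessary", 0),  # contradicting label
        ("lang", ".example.com", "Preferences", 1),
    ],
)
labels = majority_labels(conn)
print(labels)
```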

A new release based on a December 2024 crawl is planned; contact Karel Kubicek if you would like access to the data.

Open Cookie Database

The Open Cookie Database is a crowdsourced effort to describe and categorize major cookies. As of January 2025, the published CSV file contains 2203 cookies classified into categories similar to the ICC UK guide:

Functional (also known as technical, essential or strictly necessary)
Personalization (also known as preferences)
Analytics (also known as performance or statistics)
Marketing (also known as tracking or social media)
Security (custom category, used only by a few tens of cookies, which are mostly strictly necessary according to ICC UK's guide)

Cookiepedia

Cookiepedia is a commercial website by OneTrust, containing about 42M cookies. Its categories align with ICC UK's four categories: Strictly Necessary, Functionality, Performance (Analytics), and Targeting/Advertising (read more about the labeling process). OneTrust uses Cookiepedia to simplify cookie category assignments for its users (website operators). This, however, creates an incentive to label cookies as strictly necessary to avoid website breakage, as observed by Bollinger and Kubicek et al. [3] in Section 4.5.

To use Cookiepedia directly on the website, you can search either by website or by cookie name. In the first case, Cookiepedia gives an overview of all first- and third-party cookies on the website; in the second, it shows the purpose aggregated across all websites that use a cookie with this name. This has limitations for first-party cookies, as different websites may use cookies with the same name for different purposes. For instance, the user_id cookie is classified as Strictly Necessary even though many websites use it to track users.

You can scrape Cookiepedia or download a dataset of almost 1M cookies collected by Munir et al. [2] as a CSV here.

Cookiedatabase

Cookiedatabase is an open alternative to Cookiepedia with 15.5k cookies, operated by the Complianz.io CMP. It uses the following categories:

Statistics-Anonymous
Statistics/analytics (also known as performance)
Marketing/Tracking (also known as Ad-storage, or social media)
Functional (also known as technical, essential or strictly necessary)
Preferences Cookies (in some jurisdictions known as functionality)
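Because each source uses its own category names, combining them requires mapping the names to a common scheme, for example the ICC UK category IDs used elsewhere on this page. The mapping below is a sketch under assumptions: the exact label strings in each database may differ, and the entries for "security" and "statistics-anonymous" are judgment calls marked as such in the comments.

```python
# ICC UK category IDs as used in the consent_data table and by CookieBlock
ICC_UK = {0: "Strictly necessary", 1: "Functionality", 2: "Analytics", 3: "Advertising/tracking"}

# Hypothetical mapping from (lowercased) source labels to ICC UK IDs
CATEGORY_TO_ICC = {
    # Open Cookie Database
    "functional": 0,
    "personalization": 1,
    "analytics": 2,
    "marketing": 3,
    "security": 0,  # assumption: described above as mostly strictly necessary
    # Cookiedatabase
    "statistics-anonymous": 2,  # assumption: treated as analytics here
    "statistics": 2,
    "marketing/tracking": 3,
    "preferences": 1,
}


def to_icc_id(category: str):
    """Map a source category label to an ICC UK ID, or None if unknown."""
    return CATEGORY_TO_ICC.get(category.strip().lower())


print(to_icc_id("Marketing"), ICC_UK[to_icc_id("Marketing")])
```

Returning None for unknown labels keeps unmapped categories visible instead of silently assigning them a class.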

Machine-Learning Classification

Using machine learning to classify cookies, rather than relying on static datasets, addresses the limitation of classifying unseen data. Research indicates that ML methods may even outperform human classification. However, practical deployment of ML-based approaches faces challenges similar to those in ML-based advertising blocking: they are prone to adversarial attacks, may disrupt website functionality, and can potentially be used for fingerprinting. These limitations, however, do not hinder the application of ML-based detection in research.

CookieBlock Model

In Automating Cookie Consent and GDPR Violation Detection [3], researchers developed an ML model to classify cookies according to the four ICC UK purposes. They scraped data from 30k websites using CMPs like OneTrust and CookieBot, collecting over 2 million cookies labeled by website operators.

The researchers extracted statistically rich, domain-specific features from the collected cookies. These features are derived from multiple attributes, including the cookie name, domain, value, expiry, and flags such as “HttpOnly.” For example, the entropy of a cookie's value is a key feature for identifying unique identifiers used for tracking, while the presence of a language locale often signifies cookies necessary for language settings.
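To illustrate the entropy feature mentioned above, the following sketch computes the Shannon entropy (in bits per character) of a cookie value. It is a simplified illustration, not CookieBlock's actual feature extraction; the function name and the example values are made up.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    n = len(value)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A locale string has few distinct characters; an identifier-like value has many.
locale_value = "en-US"
identifier_value = "GA1.2.1234567890.1736264250"
print(shannon_entropy(locale_value), shannon_entropy(identifier_value))
```

Identifier-like values tend to score higher than short settings such as locales, which is why entropy is useful as one signal among several rather than as a classifier on its own.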

The authors trained an XGBoost model, achieving an 87.2% accuracy (84.4% balanced accuracy). This performance is competitive with the Cookiepedia dataset when considering website operators' labels as the ground truth. The confusion matrices below provide a detailed comparison.

Performance comparison of Cookiepedia to the automated XGBoost model, showing that the CookieBlock model is competitive with human expertise according to [3].
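The difference between accuracy and balanced accuracy reported above follows from how each is computed from a confusion matrix: accuracy is the share of correct predictions overall, while balanced accuracy is the mean of the per-class recalls. The sketch below uses a made-up two-class matrix, not the numbers from [3], to show how class imbalance separates the two metrics.

```python
def accuracy(cm):
    """Share of correct predictions; rows are true classes, columns predictions."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def balanced_accuracy(cm):
    """Mean recall over all classes that have at least one sample."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced example: 100 samples of class 0, 10 samples of class 1
toy_cm = [
    [90, 10],
    [5, 5],
]
print(round(accuracy(toy_cm), 3), round(balanced_accuracy(toy_cm), 3))
```

In this toy matrix the majority class dominates accuracy, while balanced accuracy weights both classes equally and is therefore lower.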

Since the CookieBlock model requires instrumentation of cookie events, such as cookie updates, it must be applied during the crawling process. You can use OpenWPM's cookie instrumentation or install the CookieBlock browser extension to classify cookies. Follow these steps to set up CookieBlock in your crawler:

1. Install the CookieBlock extension in the same browser used by your crawler (Chrome, Firefox, or other browsers listed on the CookieBlock page).
2. During installation, ensure all cookie categories and “Keep Track of Cookie History” are enabled. To be safe, open the cookie popup and click “Pause Cookie Removal.”
3. Identify the extension ID by navigating to CookieBlock's settings and checking the URL: chrome-extension://<ID>/options/cookieblock_options.html.
4. Export the profile and load it into your crawler.
5. After the crawl, collect all data from CookieBlock's database.

The following Python code illustrates these steps for a Selenium-operated Chrome browser. Note that while this step is manual, it is necessary to perform it only once.

create_profile.py

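from selenium import webdriver  # requires the Selenium package

# Initialize the ChromeDriver with options to set the profile path
options = webdriver.ChromeOptions()
options.add_argument("--user-data-dir=./profile_dir")
driver = webdriver.Chrome(options=options)

# Manual step: install the extension in the opened browser and retrieve its ID.
# The script waits here so that the browser stays open while you do so.
input("Install CookieBlock, note the extension ID, then press Enter to close the browser...")
driver.quit()  # close the browser so that the profile is written to disk

# Back up the ./profile_dir directory.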

Edit your crawler to load the profile before starting the crawl and to extract the cookie category data afterward:

crawl.py

import json
from selenium import webdriver
import pandas as pd
 
# Before crawl: Initialize the ChromeDriver with the profile
options = webdriver.ChromeOptions()
options.add_argument("--user-data-dir=./profile_dir")
driver = webdriver.Chrome(options=options)
 
# Here is your crawler's logic
driver.get('https://google.com')  # example
 
# After crawl: 
# Extract the crawled data
EXTENSION_ID = 'fbhiolckidkciamgcobkokpelckgnnol'  # replace with your ID identified in step 3.
 
# Load extension settings
driver.get(f'chrome-extension://{EXTENSION_ID}/options/cookieblock_options.html')
 
# Extraction script
indexeddb_script = """
function getCookieBlockHistory() {
    return new Promise((resolve, reject) => {
        var request = window.indexedDB.open("CookieBlockHistory", 1);
 
        request.onerror = function(event) {
            reject("Error opening IndexedDB: " + event.target.errorCode);
        };
 
        request.onsuccess = function(event) {
            var db = event.target.result;
            var transaction = db.transaction(["cookies"], "readonly");
            var objectStore = transaction.objectStore("cookies");
            var data = [];
            objectStore.openCursor().onsuccess = function(event) {
                var cursor = event.target.result;
                if (cursor) {
                    data.push(cursor.value);
                    cursor.continue();
                }
            };
 
            transaction.oncomplete = function() {
                resolve(JSON.stringify(data));
            };
 
            transaction.onerror = function(event) {
                reject("Transaction error: " + event.target.errorCode);
            };
        };
    });
}
 
return getCookieBlockHistory().then(data => {
    return data;
}).catch(error => {
    return error;
});
"""
indexeddb_data = driver.execute_script(indexeddb_script)
 
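try:
    cookies = json.loads(indexeddb_data)
except (TypeError, ValueError) as e:
    # TypeError: the script returned None.
    # ValueError (json.JSONDecodeError): the script returned an error message instead of JSON.
    print("Error:", e, "- script returned:", indexeddb_data)
    cookies = []

df = pd.DataFrame(cookies)
df.to_csv('./cookies.csv', index=False)
driver.quit()  # close the browser once the data is saved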

Example output:

                 name                      domain path  current_label       label_ts storeId                                      variable_data
0                 AEC                 .google.com    /              0  1736265280414       0  [{'host_only': False, 'http_only': True, 'secu...
1               EUULE              www.google.com    /              0  1736265428337       0  [{'host_only': True, 'http_only': False, 'secu...
2                 NID                 .google.com    /              0  1736264250782       0  [{'host_only': False, 'http_only': True, 'secu...
3                 OTZ   chromewebstore.google.com    /              0  1736264250788       0  [{'host_only': True, 'http_only': False, 'secu...
4  TESTCOOKIESENABLED              www.google.com    /              0  1736265280594       0  [{'host_only': True, 'http_only': False, 'secu...
5       __Secure-ENID                 .google.com    /              0  1736265280415       0  [{'host_only': False, 'http_only': True, 'secu...
6                 _ga  .chromewebstore.google.com    /              2  1736264250780       0  [{'host_only': False, 'http_only': False, 'sec...
7      _ga_KHZNC1Q6K0  .chromewebstore.google.com    /              2  1736264250792       0  [{'host_only': False, 'http_only': False, 'sec...

Fields:

name, domain, and path: identify the cookie
current_label: ICC UK's four categories (0 = Strictly necessary, 1 = Functionality, 2 = Analytics, 3 = Advertising/tracking cookies)
label_ts: timestamp of cookie creation
variable_data: list of cookie changes, each containing:
- http_only, secure, session, same_site: binary flags
- expirationDate: timestamp
- value: value of the cookie set in this change
- timestamp: change timestamp
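Since variable_data is a nested list, analyzing individual cookie changes usually requires flattening it into one row per change. The sketch below assumes the field names listed above; the function name flatten_updates and the sample record are made up. It also assumes that, after a round trip through the CSV file, variable_data arrives as the string form of a Python list, which ast.literal_eval can parse.

```python
import ast


def flatten_updates(records):
    """Turn cookie records into one row per cookie change."""
    rows = []
    for rec in records:
        updates = rec.get("variable_data", [])
        if isinstance(updates, str):  # value read back from the CSV file
            updates = ast.literal_eval(updates)
        for upd in updates:
            rows.append({
                "name": rec["name"],
                "domain": rec["domain"],
                "path": rec["path"],
                "label": rec["current_label"],
                "value": upd.get("value"),
                "timestamp": upd.get("timestamp"),
            })
    return rows


# Made-up record in the shape described above
sample = [{
    "name": "_ga", "domain": ".example.com", "path": "/", "current_label": 2,
    "variable_data": "[{'http_only': False, 'value': 'GA1.1', 'timestamp': 1}, "
                     "{'http_only': False, 'value': 'GA1.2', 'timestamp': 2}]",
}]
rows = flatten_updates(sample)
print(rows)
```

The resulting list of dictionaries can be passed directly to pd.DataFrame for further analysis.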

CookieGraph Model

CookieGraph: Understanding and Detecting First-Party Tracking Cookies [2] extends CookieBlock to resist adversarial modifications by avoiding easily mutable features (e.g., name) and leveraging network graph features to capture cookie usage patterns. This approach requires even more instrumentation, which is available only in the authors' custom crawler.

Find the CookieGraph artifact at https://github.com/cookiegraph/CookieGraph/.

Authentication Cookies

A Supervised Learning Approach to Protect Client Authentication on the Web [4] investigates classifying cookies used for authentication. However, the code and model have not been published.

References

[1]: Roesner, Franziska; Kohno, Tadayoshi; Wetherall, David (2012): "Detecting and Defending Against Third-Party Tracking on the Web", in: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 155-168. USENIX Association, San Jose, CA. (Link)
[2]: Munir, Shaoor; Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2023): "CookieGraph: Understanding and Detecting First-Party Tracking Cookies", pp. 3490–3504. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)
[3]: Bollinger, Dino; Kubicek, Karel; Cotrini, Carlos; Basin, David (2022): "Automating Cookie Consent and GDPR Violation Detection", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2893-2910. USENIX Association, Boston, MA. (Link)
[4]: Calzavara, Stefano; Tolomei, Gabriele; Casini, Andrea; Bugliesi, Michele; Orlando, Salvatore (2015): "A Supervised Learning Approach to Protect Client Authentication on the Web", ACM Trans. Web 9(3). (DOI) (Link)
[5]: Solomos, Konstantinos; Ilia, Panagiotis; Ioannidis, Sotiris; Kourtellis, Nicolas (2020): "Clash of the trackers: Measuring the evolution of the online tracking ecosystem".
[6]: Sanchez-Rola, Iskander; Dell'Amico, Matteo; Kotzias, Platon; Balzarotti, Davide; Bilge, Leyla; Vervier, Pierre-Antoine; Santos, Igor (2019): "Can I Opt Out Yet? GDPR and the Global Illusion of Cookie Control", pp. 340–351. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)
[7]: Kyi, Lin; Mhaidli, Abraham; Santos, Cristiana Teixeira; Roesner, Franziska; Biega, Asia J. (2024): "“It doesn’t tell me anything about how my data is used”: User Perceptions of Data Collection Purposes", in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)
[8]: Jiwani, Soha; Sasheendran, Rachna; Abhyankar, Adhishree; Bouma-Sims, Elijah; Cranor, Lorrie (2024): "Crumbling Cookie Categories: Deconstructing Common Cookie Categories to Create Categories that People Understand", Proceedings on Privacy Enhancing Technologies. (DOI)

¹⁾ ePrivacy Directive, Article 5.3

²⁾ Category descriptions are adapted from [3]. Refer to the guide for the original descriptions, which were found to be unclear to users [7, 8].
