Browser cookies are still the most commonly used method for tracking the session state of websites and the identity of visitors. According to prior studies, between 80% in 2012 [1Roesner, Franziska; Kohno, Tadayoshi; Wetherall, David (2012): "Detecting and Defending Against Third-Party Tracking on the Web", in: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 155-168. USENIX Association, San Jose, CA. (Link)] and 90% in 2019 [5Solomos, Konstantinos; Ilia, Panagiotis; Ioannidis, Sotiris; Kourtellis, Nicolas (2020): "Clash of the trackers: Measuring the evolution of the online tracking ecosystem"., 6Sanchez-Rola, Iskander; Dell'Amico, Matteo; Kotzias, Platon; Balzarotti, Davide; Bilge, Leyla; Vervier, Pierre-Antoine; Santos, Igor (2019): "Can I Opt Out Yet? GDPR and the Global Illusion of Cookie Control", pp. 340–351. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)] of websites use cookies for user tracking, often without users' knowledge. While other stateless tracking technologies, such as Fingerprinting or Link decorators, exist, cookies remain the primary choice, with stateless methods typically used in combination with cookies. This trend persists despite the discontinuation of third-party cookies.
Below, we discuss two main classification methods: looking cookies up in datasets of labeled cookies, and using machine learning (ML) to classify cookies based on their context and request URL. But first, some preliminaries.
First-party cookies are set by the domain the user is directly visiting, while all other cookies are from third parties. A common misconception is that first-party cookies are always benign and third-party cookies are always intrusive. However, first-party cookies can also track users or even be set by third parties using CNAME cloaking. First-party cookies are restricted to the website's context, while third-party cookies can track users across multiple websites.
Munir et al. [2Munir, Shaoor; Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2023): "CookieGraph: Understanding and Detecting First-Party Tracking Cookies", pp. 3490–3504. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)] observed that 89.86% of the top-million websites use first-party tracking cookies. Of these, 96.61% are ghostwritten by third-party scripts embedded in the first-party context, and some are set by fingerprinting scripts.
The law 1) recognizes only two categories of cookies: those strictly necessary for the service, and all others. The industry, however, has developed various categorization schemes:
Using datasets of cookies or online classification services has significant disadvantages: they cannot classify unseen data or assign one cookie multiple classes based on dynamic content. However, they offer advantages over ML methods, such as post-crawl classification of detected cookies.
We discuss issues with dynamic cookie names, publicly released datasets, and two main online classification services: Cookiepedia and Cookiedatabase.
Some websites deviate from the typical key-value (cookie name and cookie value) scheme by storing data directly in the cookie name. The following examples illustrate the main cases:
- _gat_UA-<ID> and _ga_<ID> (Google Analytics cookies), where the ID is unique to the Google Analytics configuration but not dynamic per user.
- AMCV_<ID>@<host> (Adobe Experience Cloud Identity Service cookie), where the ID is unique per user. Such cookie names cannot be found in databases due to their dynamic nature.
Many websites label cookies in their consent notices or privacy policies. Some CMPs unify the display of cookie lists, which enables large-scale scraping of these labels. Bollinger and Kubicek et al. [3Bollinger, Dino; Kubicek, Karel; Cotrini, Carlos; Basin, David (2022): "Automating Cookie Consent and GDPR Violation Detection", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2893-2910. USENIX Association, Boston, MA. (Link)] crawled 30k websites using CMPs, collecting a dataset of 2.2 million declared cookies, with over 80% being third-party cookies.
The dataset is available here as part of this publication artifact. Scraped cookie categories are in /04_Cookie_Databases/tranco_05May_20210510_201615.sqlite, in the table consent_data, with the following columns:
- name and domain: Identify cookies by name and domain.
- cat_name: Category defined by website operators via the CMP, often aligning with the 12 IAB TCF purposes.
- cat_id: Numeric representation of ICC UK categories: 0 = Strictly-necessary, 1 = Functionality, 2 = Analytics, 3 = Advertising/tracking.
Note that 80% of the cookies are third-party cookies, and the majority of these have multiple entries for a given name and domain. These entries might assign contradicting labels; in fact, 7.2% of cookies have labels that do not match the majority label. You should therefore aggregate the labels for each cookie name and domain pair and pick the most common category.
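The majority-label aggregation recommended above can be sketched as follows. This is a minimal illustration using only the Python standard library, not code from the publication artifact: the helper name majority_labels is hypothetical, the column types are assumed, and the rows are toy data inserted into an in-memory database so that the example runs on its own.

```python
import sqlite3
from collections import Counter, defaultdict


def majority_labels(conn):
    """Map each (name, domain) pair to its most frequent cat_id.

    Assumes a table `consent_data` with columns `name`, `domain`, `cat_id`.
    Ties are resolved in favor of the label that was seen first.
    """
    votes = defaultdict(Counter)
    query = "SELECT name, domain, cat_id FROM consent_data"
    for name, domain, cat_id in conn.execute(query):
        votes[(name, domain)][cat_id] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


# Toy data standing in for the real SQLite file (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE consent_data (name TEXT, domain TEXT, cat_name TEXT, cat_id INTEGER)"
)
conn.executemany(
    "INSERT INTO consent_data VALUES (?, ?, ?, ?)",
    [
        ("_ga", ".example.com", "Statistics", 2),
        ("_ga", ".example.com", "Analytics", 2),
        ("_ga", ".example.com", "Necessary", 0),  # contradicting label
        ("session", "shop.example", "Necessary", 0),
    ],
)

labels = majority_labels(conn)
print(labels[("_ga", ".example.com")])
```

To run this on the real data, replace the in-memory connection with sqlite3.connect() pointing at your local copy of the SQLite file.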
A new release based on a December 2024 crawl will be available soon; contact Karel Kubicek if you are reading this and want the data.
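Before looking a cookie up in any of these datasets, the dynamic cookie names described earlier can be mapped back to a static pattern. The following is a minimal sketch and not part of any of the cited tools: the function name normalize_cookie_name and the placeholder spellings are hypothetical, and the regular expressions cover only the examples mentioned in this text.

```python
import re

# (pattern, placeholder) pairs derived from the examples in the text.
# The placeholder spelling is an assumption about how a lookup table
# might key such cookies; adapt it to the dataset you use.
DYNAMIC_NAME_PATTERNS = [
    (re.compile(r"^_gat_UA-.+$"), "_gat_UA-<ID>"),
    (re.compile(r"^_ga_.+$"), "_ga_<ID>"),
    (re.compile(r"^AMCV_.+@.+$"), "AMCV_<ID>@<host>"),
]


def normalize_cookie_name(name: str) -> str:
    """Return the static pattern for a dynamic cookie name, else the name itself."""
    for pattern, placeholder in DYNAMIC_NAME_PATTERNS:
        if pattern.match(name):
            return placeholder
    return name


for raw in ["_ga", "_ga_KHZNC1Q6K0", "_gat_UA-12345-1", "AMCV_ABC123@AdobeOrg"]:
    print(raw, "->", normalize_cookie_name(raw))
```

Names that match none of the patterns, such as the plain _ga cookie, are returned unchanged.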
The Open Cookie Database is a crowdsourced effort to describe and categorize major cookies. As of January 2025, the published CSV file contains 2203 cookies classified into categories similar to the ICC UK guide:
Cookiepedia is a commercial website by OneTrust, containing about 42M cookies. Its categories align with ICC UK's four categories: Strictly Necessary, Functionality, Performance (equivalent to Analytics), and Targeting/Advertising (read more about the labeling process). OneTrust uses Cookiepedia to simplify cookie category assignments for its users (website operators), which, however, incentivizes labeling cookies as strictly necessary to avoid website breakage. This was observed by Bollinger and Kubicek et al. [3Bollinger, Dino; Kubicek, Karel; Cotrini, Carlos; Basin, David (2022): "Automating Cookie Consent and GDPR Violation Detection", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2893-2910. USENIX Association, Boston, MA. (Link)] in Section 4.5.
To use Cookiepedia directly on the website, you can search either by website or by cookie name. In the first case, Cookiepedia gives an overview of all the first- and third-party cookies of the website, and in the latter, it shows the aggregated purpose across all websites with this cookie. This has limitations for first-party cookies, as different websites may use cookies with the same name for different purposes. For instance, the user_id cookie is classified as Strictly Necessary even though many websites use it to track users.
You can scrape Cookiepedia or download a dataset of almost 1M cookies collected by Munir et al. [2Munir, Shaoor; Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2023): "CookieGraph: Understanding and Detecting First-Party Tracking Cookies", pp. 3490–3504. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)] as a CSV here.
Cookiedatabase is an open alternative to Cookiepedia with 15.5k cookies operated by Complianz.io CMP. They use the following categories:
Using machine learning to classify cookies, rather than relying on static datasets, addresses the limitation of classifying unseen data. Research indicates that ML methods may even outperform human classification. However, practical deployment of ML-based approaches faces challenges similar to those in ML-based ad blocking: they are prone to adversarial attacks, may disrupt website functionality, and can potentially be used for fingerprinting. These limitations, however, do not hinder the application of ML-based detection in research.
In Automating Cookie Consent and GDPR Violation Detection [3Bollinger, Dino; Kubicek, Karel; Cotrini, Carlos; Basin, David (2022): "Automating Cookie Consent and GDPR Violation Detection", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2893-2910. USENIX Association, Boston, MA. (Link)], researchers developed an ML model to classify cookies according to the four ICC UK purposes. They scraped data from 30k websites using CMPs like OneTrust and CookieBot, collecting over 2 million cookies labeled by website operators.
The researchers extracted statistically rich, domain-specific features from the collected cookies. These features are derived from multiple attributes, including the cookie name, domain, value, expiry, and flags such as “HttpOnly.” For example, the entropy of a cookie's value is a key feature for identifying unique identifiers used for tracking, while the presence of a language locale often signifies cookies necessary for language settings.
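To make the two example features more concrete, the sketch below computes the Shannon entropy of a cookie value and checks whether a value looks like a language locale. It is only an illustration of the idea, not the authors' feature extractor: the function names and the locale regular expression are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Assumed locale shape such as "en-US" or "de_CH"; real feature extraction
# would likely use a more complete list of locale formats.
LOCALE_RE = re.compile(r"^[a-z]{2}[-_][A-Z]{2}$")


def shannon_entropy(value: str) -> float:
    """Shannon entropy (bits per character) of the characters in value."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_like_locale(value: str) -> bool:
    """True if the whole value has the assumed locale shape."""
    return LOCALE_RE.match(value) is not None


for value in ["en-US", "aaaaaaaa", "GA1.2.1839471923.1736264250"]:
    print(value, round(shannon_entropy(value), 2), looks_like_locale(value))
```

Long random-looking identifiers tend to have higher entropy than short constant values, which is why a classifier can use this signal as a hint for tracking identifiers.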
The authors trained an XGBoost model, achieving an 87.2% accuracy (84.4% balanced accuracy). This performance is competitive with the Cookiepedia dataset when considering website operators' labels as the ground truth. The confusion matrices below provide a detailed comparison.
Performance comparison of Cookiepedia to the automated XGBoost model, showing that CookieBlock model is competitive with human expertise according to [3Bollinger, Dino; Kubicek, Karel; Cotrini, Carlos; Basin, David (2022): "Automating Cookie Consent and GDPR Violation Detection", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2893-2910. USENIX Association, Boston, MA. (Link)].
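The difference between accuracy and balanced accuracy reported above can be illustrated with a small sketch. The confusion matrix below is made-up toy data, not the published results, and the row/column convention (rows are true classes, columns are predicted classes) is an assumption of this example.

```python
def accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def balanced_accuracy(cm):
    """Mean per-class recall; every class counts equally regardless of size."""
    recalls = [row[i] / sum(row) for i, row in enumerate(cm) if sum(row) > 0]
    return sum(recalls) / len(recalls)


# Toy two-class matrix with a large and a small class (hypothetical numbers).
toy_cm = [
    [90, 10],  # true class 0: 90 correct, 10 misclassified
    [5, 5],    # true class 1: 5 misclassified, 5 correct
]

print(f"accuracy:          {accuracy(toy_cm):.3f}")
print(f"balanced accuracy: {balanced_accuracy(toy_cm):.3f}")
```

When classes are imbalanced, as in this toy matrix, plain accuracy is dominated by the largest class, while balanced accuracy exposes weaker performance on the smaller class.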
Since the CookieBlock model requires instrumentation of cookie events, such as cookie updates, it must be applied during the crawling process. You can use OpenWPM's cookie instrumentation or install the CookieBlock browser extension to classify cookies. Follow these steps to set up CookieBlock in your crawler:
chrome-extension://<ID>/options/cookieblock_options.html
The following Python code illustrates these steps for a Selenium-operated Chrome browser. Note that while this step is manual, it is necessary to perform it only once.
    from selenium import webdriver  # requires the Selenium package

    # Initialize the ChromeDriver with options to set the profile path
    options = webdriver.ChromeOptions()
    options.add_argument("--user-data-dir=./profile_dir")
    driver = webdriver.Chrome(options=options)

    # Install the extension in the browser, retrieve its ID, and then close the browser.
    # Backup ./profile_dir directory.
Edit your crawler to load the profile before starting the crawl and to extract the cookie category data afterward:
    import json

    import pandas as pd
    from selenium import webdriver

    # Before crawl: Initialize the ChromeDriver with the profile
    options = webdriver.ChromeOptions()
    options.add_argument("--user-data-dir=./profile_dir")
    driver = webdriver.Chrome(options=options)

    # Here is your crawler's logic
    driver.get('https://google.com')  # example

    # After crawl:
    # Extract the crawled data
    EXTENSION_ID = 'fbhiolckidkciamgcobkokpelckgnnol'  # replace with your ID identified in step 3.

    # Load extension settings
    driver.get(f'chrome-extension://{EXTENSION_ID}/options/cookieblock_options.html')

    # Extraction script: reads all records from the "cookies" object store
    # of the extension's IndexedDB and returns them as a JSON string
    indexeddb_script = """
    function getCookieBlockHistory() {
      return new Promise((resolve, reject) => {
        var request = window.indexedDB.open("CookieBlockHistory", 1);
        request.onerror = function(event) {
          reject("Error opening IndexedDB: " + event.target.errorCode);
        };
        request.onsuccess = function(event) {
          var db = event.target.result;
          var transaction = db.transaction(["cookies"], "readonly");
          var objectStore = transaction.objectStore("cookies");
          var data = [];
          objectStore.openCursor().onsuccess = function(event) {
            var cursor = event.target.result;
            if (cursor) {
              data.push(cursor.value);
              cursor.continue();
            }
          };
          transaction.oncomplete = function() {
            resolve(JSON.stringify(data));
          };
          transaction.onerror = function(event) {
            reject("Transaction error: " + event.target.errorCode);
          };
        };
      });
    }
    return getCookieBlockHistory().then(data => {
      return data;
    }).catch(error => {
      return error;
    });
    """

    indexeddb_data = driver.execute_script(indexeddb_script)
    try:
        cookies = json.loads(indexeddb_data)
    except (TypeError, json.JSONDecodeError) as e:
        # The script returned nothing or an error message instead of JSON
        print("Error:", e)
        cookies = []

    df = pd.DataFrame(cookies)
    df.to_csv('./cookies.csv', index=False)
Example output:
       name                domain                      path  current_label  label_ts       storeId  variable_data
    0  AEC                 .google.com                 /     0              1736265280414  0        [{'host_only': False, 'http_only': True, 'secu...
    1  EUULE               www.google.com              /     0              1736265428337  0        [{'host_only': True, 'http_only': False, 'secu...
    2  NID                 .google.com                 /     0              1736264250782  0        [{'host_only': False, 'http_only': True, 'secu...
    3  OTZ                 chromewebstore.google.com   /     0              1736264250788  0        [{'host_only': True, 'http_only': False, 'secu...
    4  TESTCOOKIESENABLED  www.google.com              /     0              1736265280594  0        [{'host_only': True, 'http_only': False, 'secu...
    5  __Secure-ENID       .google.com                 /     0              1736265280415  0        [{'host_only': False, 'http_only': True, 'secu...
    6  _ga                 .chromewebstore.google.com  /     2              1736264250780  0        [{'host_only': False, 'http_only': False, 'sec...
    7  _ga_KHZNC1Q6K0      .chromewebstore.google.com  /     2              1736264250792  0        [{'host_only': False, 'http_only': False, 'sec...
Fields:
- name, domain, and path: identify the cookie
- current_label: ICC UK's four categories (0 = Strictly necessary, 1 = Functionality, 2 = Analytics, 3 = Advertising/tracking cookies)
- label_ts: timestamp of cookie creation
- variable_data: list of cookie changes, each containing:
  - http_only, secure, session, same_site: binary flags
  - expirationDate: timestamp
  - value: value of the cookie set in this change
  - timestamp: change timestamp
CookieGraph: Understanding and Detecting First-Party Tracking Cookies [2Munir, Shaoor; Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2023): "CookieGraph: Understanding and Detecting First-Party Tracking Cookies", pp. 3490–3504. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)] extends CookieBlock to resist adversarial modifications by avoiding easily mutable features (e.g., name) and leveraging network graph features to capture cookie usage patterns. This approach requires even further instrumentation, available only in their custom crawler.
Find the CookieGraph artifact at https://github.com/cookiegraph/CookieGraph/.
A Supervised Learning Approach to Protect Client Authentication on the Web [4Calzavara, Stefano; Tolomei, Gabriele; Casini, Andrea; Bugliesi, Michele; Orlando, Salvatore (2015): "A Supervised Learning Approach to Protect Client Authentication on the Web", ACM Trans. Web 9(3). (DOI) (Link)] investigates classifying cookies used for authentication. However, the code and model have not been published.