Classifying Cookies
Browser cookies are still the most commonly used method for tracking the session state of websites and the identity of visitors. According to prior studies, between 80% (in 2012) [1] and 90% (in 2019) [5, 6] of websites use cookies for user tracking, often without users' knowledge. While stateless tracking technologies, such as fingerprinting or link decorators, exist, cookies remain the primary choice, with stateless methods typically used in combination with cookies. This trend persists despite the discontinuation of third-party cookies.
Below, we discuss two main classification methods: looking cookies up in datasets of labeled cookies, and using machine learning (ML) to classify cookies based on their context and request URL. But first, some preliminaries.
First- and Third-Party Cookies
First-party cookies are set by the domain the user is directly visiting, while all other cookies are from third parties. A common misconception is that first-party cookies are always benign and third-party cookies are always intrusive. However, first-party cookies can also track users or even be set by third parties using CNAME cloaking. First-party cookies are restricted to the website's context, while third-party cookies can track users across multiple websites.
Munir et al. [2] observed that 89.86% of the top-million websites use first-party tracking cookies. Of these, 96.61% are ghostwritten by third-party scripts embedded in the first-party context, and some are set by fingerprinting scripts.
Categories
The law recognizes only two categories of cookies: those strictly necessary for the service, and all others. The industry, however, has developed various categorization schemes:
- The cookie guide by the UK’s International Chamber of Commerce (ICC UK) from 2012 is the most widely adopted scheme, with the following four categories:
  - Strictly-necessary cookies: Required to enable essential functions of the website, such as registration or shopping carts. They are always enabled to allow for a smooth and problem-free browsing experience.
  - Functionality cookies: Enable non-essential convenience or usability features of a website, such as saving style preferences between visits. However, this category may also include a small number of tracking cookies. We recommend enabling these cookies to improve the user experience.
  - Analytics cookies: Used to collect statistical information about how visitors use a website. This information is often used to measure the performance and guide the development of a website, but it can include sensitive data and allow user tracking. Disable these if you are concerned about potential trackers from analytics services.
  - Advertising/tracking cookies: Used to tailor advertisements to the viewer, help track the user, and collect sensitive data about the user's browsing behavior, usually across multiple websites. This data is also often sold to third parties. We recommend rejecting these types of cookies to protect your privacy.
- The 12 purposes by IAB, commonly used in CMPs implementing the TCF.
Datasets and Classification Services
Using datasets of cookies or online classification services has significant disadvantages: they cannot classify unseen data or assign one cookie multiple classes based on dynamic content. However, they offer advantages over ML methods, such as post-crawl classification of detected cookies.
We discuss issues with dynamic cookie names, publicly released datasets, and two main online classification services: Cookiepedia and Cookiedatabase.
Dynamic Cookie Names
Some websites deviate from the typical key-value (cookie name and cookie value) scheme by storing data directly in the cookie name. There are several cases, illustrated by the following examples:
- _gat_UA-<ID> and _ga_<ID> (Google Analytics cookies), where the ID is unique to the Google Analytics configuration but not dynamic per user.
- AMCV_<ID>@<host> (Adobe Experience Cloud Identity Service cookie), where the ID is unique per user. Such cookie names cannot be found in databases due to their dynamic nature.
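One way to work around dynamic names before a database lookup is to map them to a canonical pattern first. The following is a minimal sketch, not part of any of the tools discussed here: the rule list `DYNAMIC_NAME_RULES` and the helper `normalize_cookie_name` are hypothetical, and the regular expressions cover only the three examples above (the `%40` alternative assumes the `@` may appear URL-encoded).

```python
import re

# Hypothetical normalization rules: (pattern, canonical name).
# They cover only the examples from the text; extend them as needed.
DYNAMIC_NAME_RULES = [
    (re.compile(r"^_gat_UA-[\w-]+$"), "_gat_UA-<ID>"),
    (re.compile(r"^_ga_\w+$"), "_ga_<ID>"),
    # Assumption: the "@" separator may also appear URL-encoded as "%40".
    (re.compile(r"^AMCV_.+?(?:@|%40).+$"), "AMCV_<ID>@<host>"),
]


def normalize_cookie_name(name: str) -> str:
    """Return the canonical name for a known dynamic pattern, else the name itself."""
    for pattern, canonical in DYNAMIC_NAME_RULES:
        if pattern.match(name):
            return canonical
    return name


if __name__ == "__main__":
    for raw in ["_ga", "_ga_KHZNC1Q6K0", "_gat_UA-12345-1", "AMCV_1234ABCD%40AdobeOrg"]:
        print(raw, "->", normalize_cookie_name(raw))
```

The canonical name can then serve as the lookup key in a labeled dataset, while the raw name is kept alongside it for later analysis.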
OneTrust and CookieBot Dataset
Many websites label cookies in their consent notices or privacy policies. Some CMPs unify the display of cookie lists, which enables large-scale scraping of these labels. Bollinger and Kubicek et al. [3] crawled 30k websites using CMPs, collecting a dataset of 2.2 million declared cookies, with over 80% being third-party cookies.
The dataset is available here as part of this publication artifact. Scraped cookie categories are in /04_Cookie_Databases/tranco_05May_20210510_201615.sqlite, in the table consent_data, with the following columns:
- name and domain: Identify cookies by name and domain.
- cat_name: Category defined by website operators via the CMP, often aligning with the 12 IAB TCF purposes.
- cat_id: Numeric representation of ICC UK categories: 0 = Strictly-necessary, 1 = Functionality, 2 = Analytics, 3 = Advertising/tracking.
Note that 80% of the cookies are third-party cookies, and the majority of them have multiple entries for a given name and domain. These entries might assign contradicting labels: 7.2% of cookies have labels that do not match the majority label. You should therefore aggregate the labels for each cookie name and domain pair and then pick the most frequent category.
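The majority vote described above could be sketched as follows. This is an illustration under assumptions: it relies only on the columns name, domain, and cat_id of the consent_data table named above, the helper `majority_labels` is hypothetical, and the rows in the demo are made up so that the snippet runs without the real database file.

```python
import sqlite3
from collections import Counter, defaultdict


def majority_labels(conn: sqlite3.Connection) -> dict:
    """Map each (name, domain) pair to its most frequent cat_id."""
    votes = defaultdict(Counter)
    query = "SELECT name, domain, cat_id FROM consent_data"
    for name, domain, cat_id in conn.execute(query):
        votes[(name, domain)][cat_id] += 1
    # On a tie, most_common keeps the category that was counted first.
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


# Demo with an in-memory table and made-up rows; for the real dataset,
# connect to the .sqlite file instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE consent_data (name TEXT, domain TEXT, cat_name TEXT, cat_id INTEGER)"
)
conn.executemany(
    "INSERT INTO consent_data VALUES (?, ?, ?, ?)",
    [
        ("_ga", ".example.com", "Statistics", 2),
        ("_ga", ".example.com", "Statistics", 2),
        ("_ga", ".example.com", "Necessary", 0),  # minority label, outvoted
        ("sid", "shop.example", "Necessary", 0),
    ],
)
labels = majority_labels(conn)
print(labels)
```

Streaming the rows through a Counter keeps memory use proportional to the number of distinct cookies rather than the number of declarations.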
A new release based on a December 2024 crawl is coming soon; contact Karel Kubicek if you are reading this text and would like the data.
Open Cookie Database
The Open Cookie Database is a crowdsourced effort to describe and categorize major cookies. As of January 2025, the published CSV file contains 2203 cookies classified into categories similar to the ICC UK guide:
- Functional (also known as technical, essential or strictly necessary)
- Personalization (also known as preferences)
- Analytics (also known as performance or statistics)
- Marketing (also known as tracking or social media)
- Security (custom category, used only by a few tens of cookies, which are mostly strictly necessary according to ICC UK's guide)
Cookiepedia
Cookiepedia is a commercial website by OneTrust, containing about 42M cookies. Its categories align with ICC UK's four categories: Strictly Necessary, Functionality, Performance (i.e., Analytics), and Targeting/Advertising (read more about the labeling process). OneTrust uses Cookiepedia to simplify cookie category assignments for its users (website operators), which, however, incentivizes labeling cookies as strictly necessary to avoid website breakage. This was observed by Bollinger and Kubicek et al. [3] in Section 4.5.
To use Cookiepedia directly on the website, you can search either by website or by cookie name. In the first case, Cookiepedia gives an overview of all the first- and third-party cookies of the website; in the second, it shows the aggregated purpose across all websites with this cookie. This has limitations for first-party cookies, as different websites may use cookies with the same name for different purposes. For instance, the user_id cookie is classified as Strictly Necessary even though many websites use it to track users.
You can scrape Cookiepedia or download a dataset of almost 1M cookies collected by [2] as a CSV here.
Cookiedatabase
Cookiedatabase is an open alternative to Cookiepedia with 15.5k cookies, operated by the Complianz.io CMP. It uses the following categories:
- Statistics-Anonymous
- Statistics/analytics (also known as performance)
- Marketing/Tracking (also known as Ad-storage, or social media)
- Functional (also known as technical, essential or strictly necessary)
- Preferences Cookies (in some jurisdictions known as functionality)
Machine-Learning Classification
Using machine learning to classify cookies, rather than relying on static datasets, addresses the limitation of classifying unseen data. Research indicates that ML methods may even outperform human classification. However, practical deployment of ML-based approaches faces challenges similar to those in ML-based ad blocking: they are prone to adversarial attacks, may disrupt website functionality, and can potentially be used for fingerprinting. These limitations, however, do not hinder the application of ML-based detection in research.
CookieBlock Model
In Automating Cookie Consent and GDPR Violation Detection [3], researchers developed an ML model to classify cookies according to the four ICC UK purposes. They scraped data from 30k websites using CMPs like OneTrust and CookieBot, collecting over 2 million cookies labeled by website operators.
The researchers extracted statistically rich, domain-specific features from the collected cookies. These features are derived from multiple attributes, including the cookie name, domain, value, expiry, and flags such as “HttpOnly.” For example, the entropy of a cookie's value is a key feature for identifying unique identifiers used for tracking, while the presence of a language locale often signifies cookies necessary for language settings.
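To illustrate the entropy feature mentioned above, here is a minimal sketch of character-level Shannon entropy. It is not the CookieBlock feature extractor; the function name `shannon_entropy` and the example values are assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Character-level Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up cookie values: a locale string and a random-looking identifier.
locale_value = "en-US"
identifier_value = "f3a9c1e07b5d2468"
print(shannon_entropy(locale_value), shannon_entropy(identifier_value))
```

A random-looking identifier spreads its characters more evenly than a short setting such as a locale, so it scores higher. Entropy alone is a weak signal, which is why it is combined with other features.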
The authors trained an XGBoost model, achieving 87.2% accuracy (84.4% balanced accuracy). This performance is competitive with the Cookiepedia dataset when considering website operators' labels as the ground truth. The confusion matrices below provide a detailed comparison.
Performance comparison of Cookiepedia to the automated XGBoost model, showing that the CookieBlock model is competitive with human expertise according to [3].
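The difference between accuracy and balanced accuracy matters here because cookie categories are imbalanced. The sketch below computes both on made-up toy labels (not the paper's data); the function names are hypothetical, and balanced accuracy is taken as the mean of per-class recall.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: eight cookies of class 0 and two of class 3.
y_true = [0] * 8 + [3] * 2
y_pred = [0] * 8 + [0, 3]  # one class-3 cookie is misclassified as class 0
print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

A classifier that neglects a rare class can still reach high plain accuracy, which is why both numbers are reported.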
Since the CookieBlock model requires instrumentation of cookie events, such as cookie updates, it must be applied during the crawling process. You can use OpenWPM's cookie instrumentation or install the CookieBlock browser extension to classify cookies. Follow these steps to set up CookieBlock in your crawler:
1. Install the CookieBlock extension in the same browser used by your crawler (Chrome, Firefox, or other browsers listed on the CookieBlock page).
2. During installation, ensure all cookie categories and “Keep Track of Cookie History” are enabled. To be safe, open the cookie popup and click “Pause Cookie Removal.”
3. Identify the extension ID by navigating to CookieBlock's settings and checking the URL: chrome-extension://<ID>/options/cookieblock_options.html.
4. Export the profile and load it into your crawler.
5. After the crawl, collect all data from CookieBlock's database.
The following Python code illustrates these steps for a Selenium-operated Chrome browser. Note that while this step is manual, it is necessary to perform it only once.
- create_profile.py
    from selenium import webdriver  # requires the Selenium package

    # Initialize the ChromeDriver with options to set the profile path
    options = webdriver.ChromeOptions()
    options.add_argument("--user-data-dir=./profile_dir")
    driver = webdriver.Chrome(options=options)

    # Install the extension in the browser and retrieve its ID while the
    # browser is open. The prompt keeps the script from exiting too early.
    input("Press Enter after installing the extension and noting its ID...")
    driver.quit()

    # Back up the ./profile_dir directory.
Edit your crawler to load the profile before starting the crawl and to extract the cookie category data afterward:
- create_profile.py
    import json

    import pandas as pd
    from selenium import webdriver

    # Before crawl: initialize the ChromeDriver with the prepared profile
    options = webdriver.ChromeOptions()
    options.add_argument("--user-data-dir=./profile_dir")
    driver = webdriver.Chrome(options=options)

    # Here is your crawler's logic
    driver.get('https://google.com')  # example

    # After crawl: extract the crawled data
    EXTENSION_ID = 'fbhiolckidkciamgcobkokpelckgnnol'  # replace with your ID identified in step 3

    # Load the extension settings page so the script runs in the extension's origin
    driver.get(f'chrome-extension://{EXTENSION_ID}/options/cookieblock_options.html')

    # Extraction script: reads every record of the "cookies" object store
    indexeddb_script = """
    function getCookieBlockHistory() {
      return new Promise((resolve, reject) => {
        var request = window.indexedDB.open("CookieBlockHistory", 1);
        request.onerror = function(event) {
          reject("Error opening IndexedDB: " + event.target.errorCode);
        };
        request.onsuccess = function(event) {
          var db = event.target.result;
          var transaction = db.transaction(["cookies"], "readonly");
          var objectStore = transaction.objectStore("cookies");
          var data = [];
          objectStore.openCursor().onsuccess = function(event) {
            var cursor = event.target.result;
            if (cursor) {
              data.push(cursor.value);
              cursor.continue();
            }
          };
          transaction.oncomplete = function() {
            resolve(JSON.stringify(data));
          };
          transaction.onerror = function(event) {
            reject("Transaction error: " + event.target.errorCode);
          };
        };
      });
    }
    return getCookieBlockHistory().then(data => { return data; }).catch(error => { return error; });
    """

    indexeddb_data = driver.execute_script(indexeddb_script)
    try:
        cookies = json.loads(indexeddb_data)
    except (TypeError, ValueError) as e:
        # TypeError: the script returned nothing.
        # ValueError: it returned an error message instead of JSON.
        print("Error:", e)
        cookies = []
    driver.quit()

    df = pd.DataFrame(cookies)
    df.to_csv('./cookies.csv', index=False)
Example output:
       name                domain                      path  current_label  label_ts       storeId  variable_data
    0  AEC                 .google.com                 /     0              1736265280414  0        [{'host_only': False, 'http_only': True, 'secu...
    1  EUULE               www.google.com              /     0              1736265428337  0        [{'host_only': True, 'http_only': False, 'secu...
    2  NID                 .google.com                 /     0              1736264250782  0        [{'host_only': False, 'http_only': True, 'secu...
    3  OTZ                 chromewebstore.google.com   /     0              1736264250788  0        [{'host_only': True, 'http_only': False, 'secu...
    4  TESTCOOKIESENABLED  www.google.com              /     0              1736265280594  0        [{'host_only': True, 'http_only': False, 'secu...
    5  __Secure-ENID       .google.com                 /     0              1736265280415  0        [{'host_only': False, 'http_only': True, 'secu...
    6  _ga                 .chromewebstore.google.com  /     2              1736264250780  0        [{'host_only': False, 'http_only': False, 'sec...
    7  _ga_KHZNC1Q6K0      .chromewebstore.google.com  /     2              1736264250792  0        [{'host_only': False, 'http_only': False, 'sec...
Fields:
- name, domain, and path: identify the cookie
- current_label: ICC UK's four categories (0 = Strictly necessary, 1 = Functionality, 2 = Analytics, 3 = Advertising/tracking cookies)
- label_ts: timestamp of cookie creation
- variable_data: list of cookie changes, each containing:
  - http_only, secure, session, same_site: binary flags
  - expirationDate: timestamp
  - value: value of the cookie set in this change
  - timestamp: change timestamp
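For post-processing, the exported CSV can be read back with the standard library alone. This is a sketch under assumptions: it presumes that pandas wrote the variable_data column as the Python repr of a list of dicts (as in the example output above), the helper `read_cookieblock_csv` and the `LABELS` mapping are hypothetical names, and the demo row is made up.

```python
import ast
import csv
import io

# Label mapping taken from the field description above.
LABELS = {
    0: "Strictly necessary",
    1: "Functionality",
    2: "Analytics",
    3: "Advertising/tracking",
}


def read_cookieblock_csv(handle):
    """Parse exported rows into dicts with a readable category and a change list."""
    records = []
    for row in csv.DictReader(handle):
        raw_changes = row["variable_data"]
        # Assumption: the column holds a Python literal such as "[{'value': ...}]".
        changes = ast.literal_eval(raw_changes) if raw_changes else []
        records.append({
            "name": row["name"],
            "domain": row["domain"],
            "path": row["path"],
            "category": LABELS.get(int(row["current_label"]), "Unknown"),
            "changes": changes,
        })
    return records


# Demo with a made-up row; for real data use open("cookies.csv", newline="").
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["name", "domain", "path", "current_label",
                 "label_ts", "storeId", "variable_data"])
writer.writerow(["_ga", ".example.com", "/", 2, 1736264250780, 0,
                 str([{"http_only": False, "secure": False,
                       "value": "GA1.1.1", "timestamp": 1736264250780}])])
sample.seek(0)
records = read_cookieblock_csv(sample)
print(records[0]["category"], len(records[0]["changes"]))
```

`ast.literal_eval` is used rather than `json.loads` because the Python repr uses single quotes and `True`/`False`, which are not valid JSON.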
CookieGraph Model
CookieGraph: Understanding and Detecting First-Party Tracking Cookies [2] extends CookieBlock to resist adversarial modifications by avoiding easily mutable features (e.g., name) and leveraging network graph features to capture cookie usage patterns. This approach requires even further instrumentation, available only in their custom crawler.
Find the CookieGraph artifact at https://github.com/cookiegraph/CookieGraph/.
Authentication Cookies
A Supervised Learning Approach to Protect Client Authentication on the Web [4] investigates classifying cookies used for authentication. However, the code and model have not been published.
References
- [1] Roesner, Franziska; Kohno, Tadayoshi; Wetherall, David (2012): "Detecting and Defending Against Third-Party Tracking on the Web", in: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 155-168. USENIX Association, San Jose, CA.
- [2] Munir, Shaoor; Siby, Sandra; Iqbal, Umar; Englehardt, Steven; Shafiq, Zubair; Troncoso, Carmela (2023): "CookieGraph: Understanding and Detecting First-Party Tracking Cookies", pp. 3490-3504. Association for Computing Machinery, New York, NY, USA.
- [3] Bollinger, Dino; Kubicek, Karel; Cotrini, Carlos; Basin, David (2022): "Automating Cookie Consent and GDPR Violation Detection", in: 31st USENIX Security Symposium (USENIX Security 22), pp. 2893-2910. USENIX Association, Boston, MA.
- [4] Calzavara, Stefano; Tolomei, Gabriele; Casini, Andrea; Bugliesi, Michele; Orlando, Salvatore (2015): "A Supervised Learning Approach to Protect Client Authentication on the Web", ACM Trans. Web 9(3).
- [5] Solomos, Konstantinos; Ilia, Panagiotis; Ioannidis, Sotiris; Kourtellis, Nicolas (2020): "Clash of the trackers: Measuring the evolution of the online tracking ecosystem".
- [6] Sanchez-Rola, Iskander; Dell'Amico, Matteo; Kotzias, Platon; Balzarotti, Davide; Bilge, Leyla; Vervier, Pierre-Antoine; Santos, Igor (2019): "Can I Opt Out Yet? GDPR and the Global Illusion of Cookie Control", pp. 340-351. Association for Computing Machinery, New York, NY, USA.
- [7] Kyi, Lin; Mhaidli, Abraham; Santos, Cristiana Teixeira; Roesner, Franziska; Biega, Asia J. (2024): "“It doesn’t tell me anything about how my data is used”: User Perceptions of Data Collection Purposes", in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA.
- [8] Jiwani, Soha; Sasheendran, Rachna; Abhyankar, Adhishree; Bouma-Sims, Elijah; Cranor, Lorrie (2024): "Crumbling Cookie Categories: Deconstructing Common Cookie Categories to Create Categories that People Understand", Proceedings on Privacy Enhancing Technologies.