Use of web archives for research

Various researchers have utilized web archives for longitudinal analyses of privacy [1, 2, 3, 4] and security features [5, 6, 7]. Using archives for research has several advantages. It helps with the reproducibility of research, since a static snapshot is used for each page visit, and it eliminates the need to write project-specific crawlers. However, there are a number of pitfalls associated with the use of archives. Additionally, numerous archives exist, and there are several things to consider before selecting one.

Limitations

There are some general limitations related to the use of web archives, which have been studied by prior work [8, 9, 1]. For instance, Hantke et al. [8] find that replayed web pages can introduce errors because of missing HTTP headers. The reason for this type of inconsistency is that the behavior of a web page at the time of archiving might differ from the behavior of the same page at replay time.

A number of things can go wrong during the archiving process because of the dynamic nature of the web. For example, URLs might not be properly archived, which leads to missing resources when replaying the web page [1]. This problem then cascades to any additional resources that depend on the missing resource. Typically, a modern browser is used to replay an archived web page. This choice influences the results, because the page would have behaved differently in the older browser versions that were current when it was archived.

Additionally, at the time of archiving, crawlers typically read and store web resources but do not execute them. This means that dynamically loaded resources might not be archived, making it difficult to interact with archived web pages (e.g., to fill in a form or interact with a cookie banner).

All of these limitations impact results and should be kept in mind when opting to use archives instead of crawling live websites. In the future, the community might evolve towards better standards for archiving websites, which could mitigate at least some of the mentioned limitations [9].

What to consider when selecting an archive

  • Crawl location: Both the Wayback Machine and HTTP Archive run their crawls from US IP addresses. This has an impact on results if you want to examine properties that are tied to a certain region (e.g., legislation in the EU).
  • Completeness of the data: Some archives, such as the Wayback Machine, crawl popular websites regularly and are better suited for large-scale measurements [8].
  • Availability: Depending on which websites you want to visit and how granular/coarse you need your analysis to be, you might need to check whether any snapshots are available around a specific date or time period.

So which archive should I choose?

The most widely known and used archives for research are the Internet Archive's Wayback Machine and HTTP Archive. Hantke et al. [8] performed a comparative study of seven archives and their suitability for web security measurement studies. One of their conclusions is that the Wayback Machine is the best option in terms of completeness and freshness of the data for popular websites. There are also smaller archives that might be less complete. For instance, the Portuguese web archive (Arquivo.pt) includes archives for a number of European websites. Also consider combining multiple archives for better completeness and/or availability.

HTTP Archive

HTTP Archive is an open-source project that tracks how the web is built. The dataset includes monthly crawls of millions of websites for both mobile and desktop (Chrome) clients. Since 1 July 2018, the list of crawled websites is based on the Chrome User Experience Report (CrUX). Older data going back to 2010 is also available; however, because the full CrUX list was not crawled back then, significantly fewer websites were crawled between 2010 and July 2018, which makes it hard to compare data points from before July 2018 with later ones. For example, the February 2025 crawl includes almost 13M desktop websites. An overview of the sample size per date is available at https://httparchive.org/reports/state-of-the-web#numUrls.

WebPageTest is used to crawl the web pages and records all requests and responses. The result of each website visit is saved as a HAR file, and these HAR files are then loaded into BigQuery tables.

In order to access the data:

  1. Go to https://developer.chrome.com/docs/crux and log in with a Google account
  2. Click “Select a project” and then “New Project”
  3. Give your project a name and click the “Create” button
  4. Navigate to the BigQuery console https://console.cloud.google.com/bigquery
  5. In order to add the HTTP Archive tables to your project, click the “+ Add” button at the top of the Explorer sidebar and choose the “Star a project by name” option from the side menu.
  6. Type in “httparchive” and click “STAR”.
  7. You can now browse the dataset; it is publicly accessible

A full tutorial is available at https://har.fyi/guides/getting-started/.
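
If you prefer to query the dataset from code rather than the BigQuery console, the official google-cloud-bigquery Python client can run the same SQL. The following is a minimal sketch, assuming you have installed the client (pip install google-cloud-bigquery), authenticated locally (e.g., gcloud auth application-default login), and use the project created above; the project ID is a placeholder.

from google.cloud import bigquery

# Uses your application-default credentials; replace with your own project ID.
client = bigquery.Client(project="your-project-id")

# Count the desktop landing pages in the February 2025 crawl.
sql = """
    SELECT COUNT(*) AS num_pages
    FROM `httparchive.crawl.pages`
    WHERE date = '2025-02-01'
      AND client = 'desktop'
      AND is_root_page
"""

for row in client.query(sql).result():
    print(f"Desktop landing pages in the 2025-02-01 crawl: {row.num_pages}")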

The 'crawl' dataset contains all of the HTTP requests and responses. It is subdivided into two main tables:

  • pages: contains information about the crawled webpages
  • requests: contains headers and payload for requests and responses

The tables are partitioned by date. As an example, the following query will return all requests on desktop pages for February 2025:

SELECT
  *
FROM
  `httparchive.crawl.requests`
WHERE
  DATE = '2025-02-01'
  AND client = 'desktop'

This query processes roughly 200 TB of data, so the free BigQuery credits will likely not be sufficient when you want to query the actual data. For exploratory purposes, there is also a 'sample_data' dataset; alternatively, select only the columns you need instead of * and/or use the LIMIT keyword while exploring.
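
To avoid accidentally exhausting your quota, you can ask BigQuery to estimate how much data a query would scan before actually running it. Below is a minimal sketch using the google-cloud-bigquery Python client (the project ID is a placeholder); a dry run validates the query and reports the bytes it would process without executing it, and therefore without cost.

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

sql = """
    SELECT *
    FROM `httparchive.crawl.requests`
    WHERE date = '2025-02-01'
      AND client = 'desktop'
"""

# Dry run: validate the query and report the bytes it would scan.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed / 1e12:.1f} TB")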

One advantage of using HTTP Archive is that it crawls the root page of each website as well as a number of subpages. Subpages can be distinguished in the dataset with the 'is_root_page' attribute. For example, we can rewrite the query above to return only the landing pages of each website:

SELECT
  *
FROM
  `httparchive.crawl.requests`
WHERE
  DATE = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page

Another advantage is that CrUX includes a ranking of website popularity in buckets (e.g., top 1,000, top 10,000), which is available in HTTP Archive via the 'rank' attribute.

For instance, let's take the websites in the top-1,000 popularity bucket for February 2025 with the following SQL query (the 'rank' column stores the bucket value, so rank = 1000 matches all sites ranked within the top 1,000):

SELECT
  *
FROM
  `httparchive.crawl.pages`
WHERE
  DATE = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND rank = 1000

If you only need a specific subset of websites, e.g., for a longitudinal analysis on a fixed set of websites, then another option is to download the HAR files and process them locally. There is no cost associated with downloading files from Google storage. You need to use the WebPageTest ID (wptid in the pages table) in order to search for specific URLs. A sketch of the first step, collecting the wptid values, follows below.
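
The sketch below collects the wptid values for a fixed set of websites using the google-cloud-bigquery Python client. The project ID and example URLs are placeholders, and the exact URL format stored in the page column (e.g., trailing slash) should be checked against a few sample rows first.

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# Hypothetical fixed set of websites; the page column stores the full page URL.
sites = ["https://example.com/", "https://example.org/"]

sql = """
    SELECT page, wptid
    FROM `httparchive.crawl.pages`
    WHERE date = '2025-02-01'
      AND client = 'desktop'
      AND is_root_page
      AND page IN UNNEST(@sites)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ArrayQueryParameter("sites", "STRING", sites)]
)

# Map each page URL to its WebPageTest ID for later HAR retrieval.
wptids = {row.page: row.wptid for row in client.query(sql, job_config=job_config).result()}
print(wptids)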

Wayback Machine

The Internet Archive's Wayback Machine is an open, non-profit project that aims to build a digital library of the web. The archive includes more than 800 billion archived web pages in WARC format. Anyone can store a snapshot of a web page in the Wayback Machine. In addition, the organization crawls the web itself, but unlike HTTP Archive it does not schedule monthly crawls, so whether a website has been captured at a specific time depends heavily on the website. Therefore, if you are doing research on less popular websites, you cannot be certain about their availability in the archive. You can verify whether a snapshot is available for a web page or a resource with the Wayback Availability API.
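
For example, the Availability API returns the snapshot closest to a given timestamp, if any exists. The following is a minimal sketch using Python's requests library; the URL and timestamp are placeholders.

import requests

# Ask for the snapshot of example.com closest to 1 February 2025.
resp = requests.get(
    "https://archive.org/wayback/available",
    params={"url": "example.com", "timestamp": "20250201"},
    timeout=30,
)
closest = resp.json().get("archived_snapshots", {}).get("closest")
if closest and closest.get("available"):
    print("Closest snapshot:", closest["url"], "captured at", closest["timestamp"])
else:
    print("No snapshot available around that date.")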

Another limitation is that the Wayback Machine rate-limits its API to no more than 15 requests per minute. For this reason, consider reducing the traffic you generate, for example by excluding image- and font-type resources.
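
A simple way to stay under that limit is to throttle your own requests, e.g., by sleeping at least four seconds between calls. A short sketch; the snapshot URLs are placeholders.

import time
import requests

# Hypothetical list of archived snapshot URLs to download.
snapshot_urls = [
    "https://web.archive.org/web/20250201000000/https://example.com/",
    "https://web.archive.org/web/20250201000000/https://example.org/",
]

for url in snapshot_urls:
    resp = requests.get(url, timeout=60)
    print(url, resp.status_code, len(resp.content), "bytes")
    # 15 requests per minute -> at most one request every 4 seconds.
    time.sleep(4)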

On the other hand, it is more straightforward to work with than HTTP Archive, since no Google account, SQL querying, etc. is needed. Additionally, it is possible to request a snapshot of a specific resource, e.g., a fingerprint.js script. However, depending on the resource, there might be fewer snapshots available for the resource than for the whole web page.
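
The Wayback Machine's CDX API lists all captures of a given URL, which is useful to check how many snapshots exist for a specific resource. A sketch follows; the script URL is a hypothetical placeholder.

import requests

# Hypothetical resource URL; replace with the script you are interested in.
resource = "https://example.com/js/fingerprint.js"

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": resource, "output": "json", "limit": "10"},
    timeout=30,
)
rows = resp.json() if resp.text.strip() else []
# The first row is the header (urlkey, timestamp, original, mimetype, statuscode, digest, length).
for row in rows[1:]:
    timestamp, original = row[1], row[2]
    print(f"https://web.archive.org/web/{timestamp}/{original}")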

Other archives

Wikipedia lists web archiving initiatives, which might be worth considering for localized studies. Many of them use the same crawling and replay methodology as the Wayback Machine.

Adding more information about alternative archives here would be helpful.

References

[1]
Lerner, Ada; Simpson, Anna Kornfeld; Kohno, Tadayoshi; Roesner, Franziska (2016): "Internet Jones and the Raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016", in: 25th USENIX Security Symposium (USENIX Security 16). (Link)
[2]
Jha, Nikhil; Trevisan, Martino; Mellia, Marco; Fernandez, Daniel; Irarrazaval, Rodrigo (2024): "Privacy Policies and Consent Management Platforms: Growth and Users' Interactions over Time", arXiv preprint arXiv:2402.18321.
[3]
Amos, Ryan; Acar, Gunes; Lucherini, Eli; Kshirsagar, Mihir; Narayanan, Arvind; Mayer, Jonathan (2021): "Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset", in: Proceedings of the Web Conference 2021, pp. 2165–2176. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)
[4]
Dimova, Yana; Acar, Gunes; Olejnik, Lukasz; Joosen, Wouter; Van Goethem, Tom (2021): "The CNAME of the game: Large-scale analysis of DNS-based tracking evasion", Proceedings on Privacy Enhancing Technologies 2021:394–412. (DOI) (Link)
[5]
Pletinckx, Stijn; Borgolte, Kevin; Fiebig, Tobias (2021): "Out of sight, out of mind: Detecting orphaned web pages at internet-scale", in: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pp. 21–35. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)
[6]
Roth, Sebastian; Barron, Timothy; Calzavara, Stefano; Nikiforakis, Nick; Stock, Ben (2020): "Complex security policy? A longitudinal analysis of deployed content security policies", in: Proceedings of the 27th Network and Distributed System Security Symposium (NDSS).
[7]
Stock, Ben; Johns, Martin; Steffens, Marius; Backes, Michael (2017): "How the Web Tangled Itself: Uncovering the History of Client-Side Web (In) Security", in: 26th USENIX Security Symposium (USENIX Security 17), pp. 971-987. USENIX Association, Vancouver, BC. (Link)
[8]
Hantke, Florian; Calzavara, Stefano; Wilhelm, Moritz; Rabitti, Alvise; Stock, Ben (2023): "You Call This Archaeology? Evaluating Web Archives for Reproducible Web Security Measurements", in: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 3168–3182. Association for Computing Machinery, New York, NY, USA. (DOI) (Link)
[9]
Hantke, Florian; Snyder, Peter; Haddadi, Hamed; Stock, Ben (2025): "Web Execution Bundles: Reproducible, Accurate, and Archivable Web Measurements", arXiv preprint arXiv:2501.15911.