  
  
Various researchers have utilized web archives for longitudinal analyses of privacy {[lerner2016internet,jha2024privacy,amos2021privacy,dimova2021cname]} and security features {[pletinckx2021out,roth2020complex,stock2017web]}.
Using archives for research has several advantages. It helps address the reproducibility of research, since a static snapshot is used for each page visit, and it eliminates the need to write project-specific crawlers. However, there are a number of pitfalls associated with the use of archives, and with many archives to choose from, there are several things to consider before selecting one.
  
  
===== Limitations =====

There are some general limitations related to the use of web archives, studied by prior work {[hantke2023you,hantke2025web,lerner2016internet]}. For instance, Hantke et al. {[hantke2023you]} find that replayed web pages can introduce errors because of missing HTTP headers. The reason for this type of inconsistency is that the behavior of the web page at the time of archiving might differ from its behavior at replay time.

A number of things can go wrong during the archiving process because of the dynamic nature of the web. For example, URLs might not be properly archived, which leads to missing resources when replaying the web page {[lerner2016internet]}. This problem then cascades to additional resources that depend on the missing resource.
Typically, a modern browser is used to replay an archived web page. However, this choice influences the results, as the page may behave differently than it did in the browser versions that were current at archiving time.

Additionally, at the time of archiving, crawlers typically read and store web resources but do not execute them. This means that dynamic resources might not be archived, making it difficult to interact with archived web pages (e.g., to fill in a form or interact with a cookie banner).

All of these limitations impact measurement results and should be kept in mind when opting to use archives instead of crawling live websites. In the future, the community might evolve towards better standards for archiving websites, which can mitigate at least some of the mentioned limitations {[hantke2025web]}.
  
  
===== What to consider when selecting an archive =====

  * Crawl location: Both the Wayback Machine and HTTPArchive run their crawls with an IP address from the US. This has an impact on results if you want to examine properties that are tied to a certain region (e.g., legislation in the EU).
  * Completeness of the data: Some archives, such as the Wayback Machine, crawl popular websites regularly and are better suited for large-scale measurements {[hantke2023you]}.
  * Availability: Depending on which websites you want to visit and how granular your analysis needs to be, you might need to check whether any snapshots are available around a specific date or time period.
  
  
===== So which archive should I choose? =====

The most widely known and used archives for research are the Internet Archive's Wayback Machine and HTTP Archive. Hantke et al. {[hantke2023you]} performed a comparative study of 7 archives and their suitability for web security measurement studies. One of their conclusions is that the Wayback Machine is the best option in terms of completeness and freshness of the data for popular websites. There are smaller archives that might be less complete; for instance, the Portuguese web archive [[https://arquivo.pt/|Arquivo.pt]] includes archives for a number of websites in the EU. Also consider combining multiple archives for better completeness and/or availability.
  
  
==== HTTPArchive ====

[[https://httparchive.org/|HTTPArchive]] is an open-source project that keeps track of how the web is built. The dataset includes monthly crawls of millions of websites for both desktop and mobile (Chrome) clients. Since the 1st of July 2018, the crawls are based on the [[programming:crux|Chrome User Experience Report (CrUX)]].
Older data is also available starting from 2010. However, given that the crawls were not based on CrUX back then, significantly fewer websites were crawled between 2010 and July 2018, which makes it hard to compare data points before July 2018 with later ones.
For example, in February 2025, almost 13M desktop websites were included in the crawl. An overview of the sample size per date is available at https://httparchive.org/reports/state-of-the-web#numUrls. \\
  
WebPageTest is used to crawl the web pages and records all requests and responses. A HAR file is saved for each website visit, and these HAR files are then stored in BigQuery tables.
  
In order to access the data:
  - Go to https://developer.chrome.com/docs/crux and log in with a Google account
  - Click "Select a project" and then "New Project"
  - Give your project a name and click the "Create" button
  - Navigate to the BigQuery console: https://console.cloud.google.com/bigquery
  - To add the HTTP Archive tables to your project, click the "+ Add" button at the top of the Explorer sidebar and choose the "Star a project by name" option from the side menu
  - Type in "httparchive" and click "STAR"
  - You can now browse the dataset; it is publicly accessible
  
A full tutorial is available at https://har.fyi/guides/getting-started/.
  
  
The ''crawl'' dataset contains all of the crawled data, including the HTTP requests and responses.

It is subdivided into two main tables:
  * ''pages'': contains information about the crawled web pages
  * ''requests'': contains headers and payloads for requests and responses
 + 
The tables are partitioned by date. As an example, the following query will return all requests on desktop pages for February 2025:
 + 
 +<code sql
 +SELECT 
 +  * 
 +FROM 
 +  `httparchive.crawl.requests` 
 +WHERE 
 +  date = '2025-02-01' 
 +  AND client = 'desktop' 
 +</code> 
 + 
<WRAP important>
This query processes around 200 TB of data. It is therefore likely that the free credits will not be sufficient when you want to query the actual data. There is also a ''sample_data'' dataset that can be used for exploratory purposes, or consider using the ''LIMIT'' keyword.
</WRAP>
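
If you query BigQuery programmatically, you can check how much data a query would scan before spending any credits by using a dry run. Below is a minimal sketch using the ''google-cloud-bigquery'' Python client; the project name is a placeholder and this is not part of the official HTTPArchive documentation, just one way to estimate cost up front.

<code python>
# Estimate how much data a query would scan, without actually running it.
from google.cloud import bigquery

client = bigquery.Client(project="my-httparchive-project")  # placeholder project name

query = """
SELECT *
FROM `httparchive.crawl.requests`
WHERE date = '2025-02-01'
  AND client = 'desktop'
"""

# A dry run validates the query and reports the bytes it would process,
# without executing it and without incurring any cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)
print(f"This query would process {job.total_bytes_processed / 1e12:.1f} TB")
</code>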
  
One advantage of using HTTP Archive is that they crawl the root page of each website as well as a number of subpages. Subpages can be distinguished in the dataset with the ''is_root_page'' attribute. For example, we can rewrite the query above to only return the landing pages of each website:
  
<code sql>
SELECT
  *
FROM
  `httparchive.crawl.requests`
WHERE
  date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
</code>
  
Another advantage is that CrUX includes a ranking of website popularity in buckets (e.g., top 1000, top 10000), which is available in HTTPArchive.

For instance, let's take the 1000 most popular websites of February 2025 with the following SQL query:
  
<code sql>
SELECT
  *
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND rank = 1000
</code>
 + 
If you only need a specific subset of websites, e.g., you want to do a longitudinal analysis on the same set of websites, then another option is to [[https://discuss.httparchive.org/t/how-to-download-the-http-archive-data/679|download the HAR files and process them locally]]. There is no cost associated with downloading files from Google storage. You need to use the WebPageTest ID (''wptid'' from the ''pages'' table) in order to search for specific URLs.
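
As an illustration, the ''wptid'' values for a fixed set of websites could be retrieved with a query like the one below. This is a sketch: it assumes the ''pages'' table stores the page URL in a column named ''page'' (check the table schema), and the project name and URLs are placeholders.

<code python>
# Look up WebPageTest IDs (wptid) for a fixed set of root pages; the wptid can
# then be used to locate the corresponding HAR files for local processing.
from google.cloud import bigquery

client = bigquery.Client(project="my-httparchive-project")  # placeholder project name

query = """
SELECT page, wptid
FROM `httparchive.crawl.pages`
WHERE date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND page IN ('https://example.com/', 'https://example.org/')
"""

for row in client.query(query).result():
    print(row.page, row.wptid)
</code>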
 + 
 + 
==== Wayback Machine ====
The Internet Archive's Wayback Machine is also an open-source project that aims to build a digital library. The archive includes 800+ billion archived web pages in WARC format. Anyone can store a snapshot of a web page in the Wayback Machine. Along with that, the organization also crawls the web sporadically.
Unlike HTTP Archive, they do not schedule monthly crawls, so whether a website has been captured at a specific time depends strongly on the website. Therefore, if you are doing research on less popular websites, you cannot be certain about their availability in the archive. You can verify whether a snapshot is available for a web page or a resource with the [[https://archive.org/help/wayback_api.php|Wayback Availability API]].
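
A minimal sketch of such a check in Python is shown below (the target URL and timestamp are placeholders; the response structure follows the API documentation linked above):

<code python>
# Ask the Wayback Availability API for the snapshot closest to a given date.
# This works for full pages as well as individual resources (e.g., a script URL).
import requests

params = {"url": "example.com", "timestamp": "20250201"}  # placeholder target
resp = requests.get("https://archive.org/wayback/available", params=params)
closest = resp.json().get("archived_snapshots", {}).get("closest")

if closest and closest.get("available"):
    print("Closest snapshot:", closest["url"], "captured at", closest["timestamp"])
else:
    print("No snapshot found around that date")
</code>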
 + 
Another limitation is that the Wayback Machine applies [[https://archive.org/details/toomanyrequests_20191110|rate limiting]] to their API of no more than 15 requests per minute. For this reason, consider reducing the traffic, for example by excluding image- and font-type resources.
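
When fetching many snapshots, requests can be paced to stay below that limit, for example as in the following sketch (the list of snapshot URLs and the skipped file extensions are placeholders):

<code python>
# Pace requests to at most 15 per minute and skip image/font resources.
import time
import requests

SKIP_EXTENSIONS = (".png", ".jpg", ".gif", ".svg", ".woff", ".woff2", ".ttf")
snapshot_urls = []  # fill with snapshot URLs, e.g., from the availability API above

for url in snapshot_urls:
    if url.lower().endswith(SKIP_EXTENSIONS):
        continue  # reduce traffic by excluding image- and font-type resources
    response = requests.get(url)
    # ... process the archived response here ...
    time.sleep(60 / 15)  # stay below 15 requests per minute
</code>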
 + 
On the other hand, it is more straightforward to work with than HTTP Archive, since no Google account, SQL querying, etc. is needed. Additionally, it is possible to request a snapshot for a specific resource, e.g., for a fingerprint.js script. However, depending on the resource, there might be fewer snapshots available for the resource than for the whole web page.
 + 
==== Other archives ====
 + 
[[https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives|Wikipedia]] lists web archiving initiatives that might be worth considering for localized studies. Many of them are compatible with the Wayback Machine's crawling methodology.
 + 
<wrap todo>Adding more information about alternative archives would be helpful.</wrap>
 + 
====== References ======
  
<bibtex bibliography></bibtex>
 +
  
  
/* This enables discussion under this article. */
~~DISCUSSION~~