Typically, a modern browser is used to replay an archived web page. However, this decision influences the results, as the web page would behave differently in the older browser versions that were current when it was archived.
  
Additionally, at the time of archiving, crawlers typically read and store web resources but do not execute them. This means that any dynamic resources might not be archived, making it difficult to interact with archived web pages (e.g., to fill in a form, interact with a cookie banner, etc.).
  
All of these limitations impact our results and should be kept in mind when opting to use archives instead of crawling live websites. In the future, the community might evolve towards better standards for archiving websites, which could mitigate at least some of the mentioned limitations {[hantke2025web]}.
===== What to consider when selecting an archive =====
  
  * Crawl location: Both the Wayback Machine and HTTPArchive run their crawls with an IP address from the US. This has an impact on results if you want to examine properties that are related to a certain region or area (e.g., legislation in the EU).
  * The completeness of the data: Some archives crawl popular websites regularly and are more suited for large-scale measurements, such as the Wayback Machine {[hantke2023you]}.
  * Availability: Depending on which websites you want to visit and how granular/coarse you need your analysis to be, you might need to check whether any snapshots are available around a specific date or time period.
  
===== So which archive should I choose? =====
  
The most widely known and used archives for research are the Internet Archive's Wayback Machine and HTTP Archive. Hantke et al. {[hantke2023you]} performed a comparative study on 7 archives and their suitability for web security measurement studies. One of their conclusions is that the Wayback Machine is the best option in terms of completeness and freshness of the data regarding popular websites. There are other, smaller archives that might be less complete. For instance, Portugal's public library service ([[https://arquivo.pt/|Arquivo.pt]]) includes archives for a number of websites in the EU. Also consider combining multiple archives for better completeness and/or availability.
  
  
==== HTTPArchive ====
  
[[https://httparchive.org/|HTTPArchive]] is an open-source project that keeps track of how the web is built. The dataset includes monthly crawls of millions of websites for both desktop and mobile (Chrome) clients. Since the 1st of July 2018, the crawls are based on the [[programming:crux|Chrome User Experience Report (CrUX)]].
Older data is also available starting from 2010. However, given that they did not crawl the full CrUX back then, significantly fewer websites were crawled between 2010 and July 2018, which makes it hard to compare across different data points before July 2018.
For example, in February 2025, almost 13M desktop websites are included in the crawl. An overview of the sample size per date is available at https://httparchive.org/reports/state-of-the-web#numUrls. \\
  
  
  
The ''crawl'' dataset contains all of the HTTP requests and responses.
It is subdivided into two main tables:
  * ''pages'': contains information about the crawled webpages
  * ''requests'': contains headers and payloads for requests and responses
The tables are partitioned by date. As an example, the following query will return all requests on desktop pages for February 2025:
  
<code sql>
-- All requests on desktop pages for February 2025.
-- Table and column names follow the httparchive.crawl BigQuery dataset.
SELECT *
FROM `httparchive.crawl.requests`
WHERE date = '2025-02-01'
  AND client = 'desktop'
</code>
  
<WRAP important>
This query processes about 200 TB of data, so the free BigQuery credits will likely not be sufficient when you want to query the actual data. There is also a dataset that can be used for exploratory purposes: ''sample_data''. Alternatively, consider using the ''LIMIT'' keyword.
</WRAP>
  
One advantage of using HTTP Archive is that they crawl the root page of each website as well as a number of subpages. Subpages can be distinguished in the dataset with the ''is_root_page'' attribute. For example, we can rewrite the query above to only give us the landing pages of each website:
  
<code sql>
-- Same query, restricted to landing (root) pages only.
SELECT *
FROM `httparchive.crawl.requests`
WHERE date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
</code>
  
Another advantage is that CrUX includes a ranking of website popularity in buckets (e.g., top 1000, top 10000), which is available in HTTPArchive.
For instance, let's take the 1000 most popular websites of February 2025 with the following SQL query:
  
<code sql>
-- Websites in the top-1000 CrUX popularity bucket for February 2025.
-- The rank column stores the bucket magnitude (1000, 10000, ...).
SELECT page
FROM `httparchive.crawl.pages`
WHERE date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND rank = 1000
</code>
  
If you only need a specific subset of websites (e.g., you want to run a longitudinal analysis on the same set of websites), then another option is to [[https://discuss.httparchive.org/t/how-to-download-the-http-archive-data/679|save the HAR files and process them locally]]. There is no cost associated with downloading files from Google storage. You need to use the WebpageTest id (''wptid'' from the ''pages'' table) in order to search for specific URLs.
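For example, a query along the following lines could collect the ''wptid'' values for a chosen set of websites before fetching their HAR files; the two domains are placeholders, and the table naming follows the ''crawl'' dataset used above:

<code sql>
-- Look up WebpageTest ids for a (hypothetical) fixed set of websites.
SELECT page, wptid
FROM `httparchive.crawl.pages`
WHERE date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND page IN ('https://example.com/', 'https://example.org/')
</code>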
  
  
==== Wayback Machine ====
The Internet Archive's Wayback Machine is also an open-source project that aims to build a digital library. The archive includes 800+ billion archived web pages in WARC format. Anyone can store a snapshot of a webpage on the Wayback Machine. Along with that, the organization also crawls the web sporadically.
Unlike HTTP Archive, they don't schedule monthly crawls; whether a website has been captured at a specific time depends strongly on the website itself. Therefore, if you are doing research on less popular websites, you cannot be certain about their availability in the archive. You can verify if a snapshot is available for a web page or a resource with the [[https://archive.org/help/wayback_api.php|Wayback Availability API]].
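For instance, checking whether a snapshot exists close to a given date is a single request; the timestamp and the response below are illustrative:

<code>
https://archive.org/wayback/available?url=example.com&timestamp=20250201

{"url": "example.com",
 "archived_snapshots": {"closest": {
   "available": true, "status": "200",
   "timestamp": "20250201...",
   "url": "http://web.archive.org/web/20250201.../http://example.com/"}}}
</code>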
  
Another limitation is that the Wayback Machine applies [[https://archive.org/details/toomanyrequests_20191110|rate limiting]] to its API, allowing no more than 15 requests per minute. For this reason, consider reducing the traffic, for example by excluding image- and font-type resources.
  
On the other hand, the Wayback Machine is more straightforward to work with than HTTP Archive, since no Google account, SQL querying, etc. is needed. Additionally, it is possible to request a snapshot for a specific resource, e.g., for a fingerprint.js script. However, depending on the resource, there might be fewer snapshots available for the resource than for the whole webpage.
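For example, a snapshot of a specific script can be requested directly via the replay URL scheme; appending the ''id_'' modifier to the timestamp should return the original bytes without the archive's rewriting (the timestamp and URL are illustrative):

<code>
https://web.archive.org/web/20250201000000/https://example.com/fingerprint.js
https://web.archive.org/web/20250201000000id_/https://example.com/fingerprint.js
</code>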

==== Other archives ====

[[https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives|Wikipedia]] lists web archiving initiatives, which might be worth considering for localized studies. Many of them are compatible with the Wayback Machine's crawling methodology.

<wrap todo>Adding more information about alternative archives would be helpful.</wrap>
  
====== References ======