design:archives
Differences
This shows you the differences between two versions of the page.
design:archives [2025/03/18 14:30] – formatting karelkubicek | design:archives [2025/03/18 14:49] (current) – polishing karelkubicek | ||
---|---|---|---|
Line 13: | Line 13: | ||
Typically, a modern browser is used to replay an archived web page. However, this choice influences the results, as the web page behaves differently in an older version of the browser. | Typically, a modern browser is used to replay an archived web page. However, this choice influences the results, as the web page behaves differently in an older version of the browser. | ||
- | Additionally, | + | Additionally, |
All of these limitations impact our results and should be kept in mind when opting to use archives instead of crawling live websites. In the future, the community might evolve towards better standards for archiving websites that mitigate at least some of these limitations {[hantke2025web]}. | All of these limitations impact our results and should be kept in mind when opting to use archives instead of crawling live websites. In the future, the community might evolve towards better standards for archiving websites that mitigate at least some of these limitations {[hantke2025web]}. | ||
Line 20: | Line 20: | ||
===== What to consider when selecting an archive ===== | ===== What to consider when selecting an archive ===== | ||
- | * Crawl location: Both the Wayback Machine and HTTPArchive run their crawls with an IP address from the US. This has an impact on results if you want to examine properties that are related to a certain region or area (e.g. legislation in the EU). | + | * Crawl location: Both the Wayback Machine and HTTPArchive run their crawls from US IP addresses. This affects the results if you want to examine properties tied to a specific region (e.g., legislation in the EU). |
- | * The completeness of the data: Some archives crawl popular websites regularly and are more suited for large-scale measurements, | + | * The completeness of the data: Some archives crawl popular websites regularly and are more suited for large-scale measurements, |
* Availability: | * Availability: | ||
Line 28: | Line 28: | ||
===== So which archive should I choose? ===== | ===== So which archive should I choose? ===== | ||
- | The most widely known and used archives for research are The Web Archive' | + | The most widely known and used archives for research are The Internet |
==== HTTPArchive ==== | ==== HTTPArchive ==== | ||
- | [[https:// | + | [[https:// |
- | Older data is also available starting from 2010. However, given that they did not crawl the full Crux back then, significantly less websites were crawled between 2010 and July 2018, which makes it hard to compare across different data points before July 2018. | + | Older data is also available starting from 2010. However, given that they did not crawl the full CrUX back then, significantly fewer websites were crawled between 2010 and July 2018, which makes it hard to compare data points from before July 2018. |
For example, in February 2025, almost 13M desktop websites are included in the crawl. An overview of the sample size per date is available here https:// | For example, in February 2025, almost 13M desktop websites are included in the crawl. An overview of the sample size per date is available here https:// | ||
Line 51: | Line 51: | ||
- | The ' | + | The ' |
It is subdivided into two main tables:
- | * pages - contains information about the crawled webpages | + | |
- | * requests | + | * '' |
The tables are partitioned by date. As an example, the following query will return all requests on desktop pages for February 2025: | The tables are partitioned by date. As an example, the following query will return all requests on desktop pages for February 2025: | ||
Line 67: | Line 69: | ||
</ | </ | ||
- | This query costs 200TB in credits. It is therefore likely that the free credits will not be sufficient when you want to query the actual data. | + | <WRAP important> |
- | There is also a dataset that can be used for exploratory purposes: ' | + | This query costs 200TB in credits. It is therefore likely that the free credits will not be sufficient when you want to query the actual data. There is also a dataset that can be used for exploratory purposes: |
+ | </ | ||
- | One advantage to using HTTP Archive is that they crawl the root page of each website as well as a number of subpages. Subpages can be distinguished in the dataset with the ' | + | One advantage to using HTTP Archive is that they crawl the root page of each website as well as a number of subpages. Subpages can be distinguished in the dataset with the '' |
<code sql> | <code sql> | ||
Line 83: | Line 86: | ||
</ | </ | ||
- | Another advantage is that CRUX includes ranking of website popularity per buckets (e.g. top 1000, top 10000...) which is available in HTTPArchive. | + | Another advantage is that CrUX ranks website popularity in buckets (e.g., top 1000, top 10000), and this ranking is available in HTTPArchive. |
For instance, let's take the 1000 most popular websites of February 2025 with the following SQL query: | For instance, let's take the 1000 most popular websites of February 2025 with the following SQL query: | ||
Line 98: | Line 102: | ||
</ | </ | ||
- | If you only need a specific subset of websites e.g. you want to a longitudinal analysis on the same set of websites, than another option is to [[https:// | + | If you only need a specific subset of websites, e.g., you want to do a longitudinal analysis on the same set of websites, then another option is to [[https:// |
==== Wayback Machine ==== | ==== Wayback Machine ==== | ||
- | The Internet Archive' | + | The Internet Archive' |
Unlike HTTP Archive, they don't schedule monthly crawls; whether a website has been captured at a specific time depends heavily on the website. Therefore, if you are doing research on less popular websites, you cannot be certain about their availability in the archive. You can verify if a snapshot is available for a web page or a resource with the [[https:// | Unlike HTTP Archive, they don't schedule monthly crawls; whether a website has been captured at a specific time depends heavily on the website. Therefore, if you are doing research on less popular websites, you cannot be certain about their availability in the archive. You can verify if a snapshot is available for a web page or a resource with the [[https:// | ||
Another limitation is that Wayback Machine applies [[https:// | Another limitation is that Wayback Machine applies [[https:// | ||
- | On the other hand, it is more straightforward to work with than HTTP Archive since no Google account, SQL querying etc. is needed. Additionally, | + | On the other hand, it is more straightforward to work with than HTTP Archive since no Google account, SQL querying etc. is needed. Additionally, |
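The snapshot check described in this section can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: it uses the publicly documented Wayback Machine availability endpoint (''https://archive.org/wayback/available''), which may or may not be the exact resource linked above, and the helper names ''build_availability_url'', ''closest_snapshot'' and ''fetch_closest_snapshot'' are hypothetical names chosen for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(page_url, timestamp=None):
    """Build the query URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": page_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot record from a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest
    return None


def fetch_closest_snapshot(page_url, timestamp=None, timeout=30):
    """Query the endpoint over the network and parse the JSON answer."""
    query_url = build_availability_url(page_url, timestamp)
    with urlopen(query_url, timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Calling ''fetch_closest_snapshot("example.com", "20250201")'' would return the record of the capture closest to 1 February 2025, or ''None'' when the archive reports no capture. Note that the closest capture can be far from the requested date, so the returned timestamp should still be compared against the time window of your study.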
+ | |||
+ | ==== Other archives ==== | ||
+ | |||
+ | [[https:// | ||
+ | |||
+ | <wrap todo> | ||
====== References ====== | ====== References ====== |
design/archives.txt · Last modified: 2025/03/18 14:49 by karelkubicek