====== Archives ======
Various researchers have used web archives for longitudinal analyses of privacy {[lerner2016internet]}.

Using archives for research has several advantages. It helps with the reproducibility of research, since a static snapshot is used for each page visit, and it eliminates the need to write project-specific crawlers. However, there are a number of pitfalls associated with the use of archives, which are discussed below.
===== Limitations =====
There are some general limitations related to the use of web archives, which have been studied by prior work {[hantke2023you]}.
A number of things can go wrong during the archiving process because of the dynamic nature of the web. For example, URLs might not be properly archived, which leads to missing resources when replaying the web page {[lerner2016internet]}. This problem then cascades to any additional resources that depend on the missing one.
Typically, a modern browser is used to replay an archived web page. However, this choice influences the results, as the page may behave differently in a modern browser than it did in the browsers that were current when it was archived.
All of these limitations can affect your results and should be kept in mind when opting to use archives instead of crawling live websites. In the future, the community might converge on better standards for archiving websites, which could mitigate at least some of the mentioned limitations {[hantke2025web]}.
===== What to consider when selecting an archive =====
  * Crawl location: Both the Wayback Machine and HTTPArchive run their crawls from US IP addresses. This has an impact on results if you want to examine properties that are tied to a certain region (e.g., legislation in the EU).
  * Completeness of the data: Some archives crawl popular websites regularly and are therefore better suited for large-scale measurements.
  * Availability: Whether the websites and time points you need are actually captured by the archive.
===== So which archive should I choose? =====
The most widely known and used archives for research are the Internet Archive's Wayback Machine and HTTPArchive.
==== HTTPArchive ====
[[https://httparchive.org/|HTTPArchive]] crawls the websites from the Chrome User Experience Report (CrUX) on a monthly basis.
Older data is also available, starting from 2010. However, given that the full CrUX list was not crawled back then, significantly fewer websites were crawled between 2010 and July 2018, which makes it hard to compare data points from before July 2018 with later ones.
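Before comparing measurements across years, it can help to first check how many pages each crawl actually contains. The following query is a minimal sketch of such a check (using the ''pages'' table described below); it assumes the ''crawl'' dataset reaches back to the dates you are interested in, and it only scans a few small columns, so it is comparatively cheap:

<code sql>
-- Count how many root pages each monthly crawl contains,
-- to see how coverage changed over time.
SELECT
  date,
  COUNT(*) AS num_pages
FROM
  `httparchive.crawl.pages`
WHERE
  date >= '2010-01-01'   -- partition filter; adjust to the period of interest
  AND client = 'desktop'
  AND is_root_page
GROUP BY
  date
ORDER BY
  date
</code>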
HTTPArchive uses WebPageTest to crawl the webpages, recording all requests and responses. A HAR file is saved for each website visit, and these HAR files are then stored in BigQuery tables.
To access the data:
  - Go to the Google Cloud console at https://console.cloud.google.com/.
  - Click "Create Project".
  - Give your project a name and click the "Create" button.
  - Navigate to the BigQuery console at https://console.cloud.google.com/bigquery.
  - To add the HTTP Archive tables to your project, click the "+ Add" button at the top of the Explorer sidebar and choose the "Star a project by name" option from the side menu.
  - Type in "httparchive" and star the project.
  - You can now browse the dataset; it is publicly accessible.
A full tutorial is available at https://har.fyi/guides/.
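Once the project is set up, a cheap sanity check is a metadata-only query that lists the tables of the ''crawl'' dataset without scanning any crawl data, for example:

<code sql>
-- List the tables available in the httparchive.crawl dataset
-- (metadata only, so it is essentially free to run).
SELECT
  table_name
FROM
  `httparchive.crawl`.INFORMATION_SCHEMA.TABLES
ORDER BY
  table_name
</code>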
The crawl data is stored in the ''httparchive.crawl'' dataset.
It is subdivided into two main tables:
  * ''pages'': one row per page visit, with page-level data and metadata
  * ''requests'': one row per request/response recorded during a visit
The tables are partitioned by date. As an example, the following query selects all requests from the February 2025 desktop crawl:

<code sql>
SELECT
  *
FROM
  `httparchive.crawl.requests`
WHERE
  date = '2025-02-01'      -- crawl partition date (first day of the month)
  AND client = 'desktop'   -- 'desktop' or 'mobile'
</code>
<WRAP important>
This query processes roughly 200 TB of data, so the free BigQuery credits will likely not be sufficient when you want to query the actual data. There is also a smaller dataset that can be used for exploratory purposes: ''httparchive.sample_data''.
</WRAP>
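As a sketch of how such an exploratory query could look (the table name below is an assumption; the sample dataset should mirror the ''crawl'' tables, but check the Explorer sidebar for the exact names):

<code sql>
-- Exploratory query against the small sample dataset instead of the
-- full crawl. NOTE: `pages_10k` is an assumed table name; verify it in
-- the BigQuery Explorer before running.
SELECT
  date,
  page,
  rank
FROM
  `httparchive.sample_data.pages_10k`
WHERE
  is_root_page
LIMIT 20
</code>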
One advantage of using HTTP Archive is that they crawl the root page of each website as well as a number of subpages. Subpages can be distinguished in the dataset using the ''is_root_page'' column:
<code sql>
SELECT
  *
FROM
  `httparchive.crawl.requests`
WHERE
  date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page   -- TRUE for root pages, FALSE for subpages
</code>
Another advantage is that CrUX includes a ranking of website popularity in buckets (e.g., top 1,000, top 10,000), which is exposed in HTTPArchive through the ''rank'' column.

For instance, the following query selects the 1,000 most popular websites of February 2025:

<code sql>
SELECT
  *
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND rank = 1000   -- the top-1,000 popularity bucket
</code>
If you only need a specific subset of websites, e.g., you want to run a longitudinal analysis on the same set of websites, then another option is to restrict your queries to just those pages, which also keeps the amount of data processed (and therefore the cost) down, as sketched below.
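A minimal sketch of such a restricted query; the URLs are placeholders, and it assumes that the ''page'' column of the ''pages'' table holds the page URL:

<code sql>
-- Longitudinal query over a fixed set of websites across several crawls.
-- The URL list is a hypothetical placeholder; replace it with your own set.
SELECT
  date,
  page,
  rank
FROM
  `httparchive.crawl.pages`
WHERE
  date BETWEEN '2024-01-01' AND '2025-02-01'
  AND client = 'desktop'
  AND is_root_page
  AND page IN ('https://www.example.com/', 'https://www.example.org/')
</code>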
==== Wayback Machine ====
The Internet Archive's [[https://web.archive.org/|Wayback Machine]] stores snapshots of web pages captured at different points in time.
Unlike HTTP Archive, they don't run scheduled monthly crawls; whether a website has been captured at a specific point in time depends strongly on the website. Therefore, if you are doing research on less popular websites, you cannot be certain about their availability in the archive. You can verify whether a snapshot is available for a web page or a resource with the [[https://archive.org/help/wayback_api.php|Wayback Machine availability API]].
Another limitation is that the Wayback Machine modifies archived pages when replaying them (e.g., rewriting URLs and injecting its own code), which has to be taken into account when analyzing replayed pages.
On the other hand, it is more straightforward to work with than HTTP Archive, since no Google account, SQL querying, etc. is needed.
==== Other archives ====

<wrap todo>TODO: describe other archives.</wrap>
====== References ======
<bibtex bibliography></bibtex>
/* This enables discussion under this article. */
~~DISCUSSION~~