====== Multilingual Support ======
Much web research focuses on English websites, which matches the audience of English-language publications. However, this focus causes our research field to neglect other target audiences, especially those where security and privacy issues may have larger implications than in the Western world. This limitation in research scope was also identified by Mhaidli et al. in Sec. 4.4.2 of {[mhaidli2023researchers]}.
Below, we cover three aspects of multilingual crawling: language detection, machine translation, and full-pipeline multilingual support. Additionally, you should consider sampling websites from an [[Design:Website selection|appropriate list with local data]] and using other data sources that better fit the hypothetical user profiles of websites.
===== Language Detection =====
Several libraries can automatically detect the language of input text, such as [[https://github.com/shuyo/language-detection|langdetect]], Google's [[https://github.com/google/cld3|CLD3]], and Facebook's [[https://pypi.org/project/fasttext/|fasttext]]. A useful comparison of these tools, along with code examples, can be found at https://modelpredict.com/language-identification-survey.
The key takeaway from that comparison is that fasttext is the clear winner:
* It is the most accurate, followed only by the considerably slower langdetect.
* pycld2 is about twice as fast, but it loses about 11 percentage points of accuracy (87% vs. 98% for fasttext).
* If a small memory footprint is required (e.g., for parallel crawls), you can use fasttext-compressed, though it has slightly worse accuracy and is marginally slower.((If the memory footprint over multiple parallel crawlers is an issue, consider moving the language detection to a single separate service serving all crawlers.))
This assessment might be skewed by the choice of evaluation datasets, as one of the benchmarks is part of fasttext's training data. However, fasttext also performs best on the other benchmarks. In contrast, Hosseini et al. {[hosseini2021unifying]} reported the best performance on long text with the langdetect library. You might therefore adopt Hosseini et al.'s approach: run an ensemble of multiple methods and select the top result. Note that an ensemble has to run every member, so its runtime and memory usage exceed those of even the slowest single method.
Avoid relying on the language that websites declare in the ''lang'' attribute of the ''html'' tag. Most non-English websites incorrectly indicate English (EN) because of CMS or template defaults.
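The ensemble idea can be sketched as a simple majority vote. This is only an illustration: the source does not specify how Hosseini et al. combine the results, so the voting rule, the ''Detector'' type, and both function names are assumptions. Each detector is expected to be a thin wrapper around one library (fasttext, langdetect, ...) that returns a language code.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A detector maps text to a language code such as "en", or None if it cannot decide.
# In practice each detector would wrap one library; here the type is an assumption.
Detector = Callable[[str], Optional[str]]


def normalize_label(label: str) -> str:
    """Map library-specific labels to a bare lowercase code.

    fasttext language-ID models return labels such as "__label__de".
    """
    return label.removeprefix("__label__").lower()


def ensemble_detect(text: str, detectors: Iterable[Detector]) -> Optional[str]:
    """Return the language most detectors agree on; ties go to the earliest detector."""
    votes = []
    for detect in detectors:
        try:
            label = detect(text)
        except Exception:
            # A failing detector (e.g. on empty input) simply casts no vote.
            continue
        if label:
            votes.append(normalize_label(label))
    if not votes:
        return None
    counts = Counter(votes)
    best = max(counts.values())
    # Iterate in voting order so ties are resolved by detector order.
    return next(vote for vote in votes if counts[vote] == best)
```

Ordering the detectors by trust (for example, fasttext first) makes the tie-breaking rule pick the result of the detector you trust most.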
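If you want to quantify how unreliable the declared language is in your own crawl, you can compare it with the detected language. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and ''detected'' is assumed to come from one of the detection libraries above.

```python
from html.parser import HTMLParser
from typing import Optional


class _LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.lang: Optional[str] = None
        self._seen_html = False

    def handle_starttag(self, tag, attrs):
        if tag == "html" and not self._seen_html:
            self._seen_html = True
            self.lang = dict(attrs).get("lang")


def declared_language(html: str) -> Optional[str]:
    """Return the primary subtag of the declared language, e.g. "en" for "en-US"."""
    parser = _LangAttrParser()
    parser.feed(html)
    if not parser.lang:
        return None
    return parser.lang.split("-")[0].strip().lower() or None


def language_mismatch(html: str, detected: str) -> bool:
    """True if the page declares a language that differs from the detected one."""
    declared = declared_language(html)
    return declared is not None and declared != detected.lower()
```

Counting mismatches over a sample crawl gives a rough estimate of how misleading the declaration is for your website selection.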
===== Machine Translation =====
A straightforward way to support multiple languages in your study is to use machine translation. While many paid translation APIs exist (e.g., Google Translate, DeepL), they quickly become prohibitively expensive at crawl scale. Although Google Translate offers a free API, its Terms of Service prohibit automated processing, and its reliability varies due to bot detection. Therefore, this section focuses on open-source, self-hosted translation methods.
Note that these methods are often slow.((In our experience, LibreTranslate utilizes 20-50% of computational resources during crawls. --Karel Kubicek)) Using LLMs is even slower, so you might want to explore the Multilingual Pipeline section below on keeping the whole process multilingual, which is significantly more efficient.
==== LibreTranslate (Argos) ====
[[https://github.com/LibreTranslate/LibreTranslate|LibreTranslate]] is an interface for the [[https://github.com/argosopentech/argos-translate/|Argos Translate]] project, simplifying the handling of multiple language models in parallel. We recommend running it with [[Programming:Docker]], for example using the following Docker Compose file:
<code yaml>
services:
  libretranslate:
    image: libretranslate/libretranslate:latest  # Use libretranslate/libretranslate:latest-cuda for CUDA support
    restart: unless-stopped
    ports:
      - 5000:5000
    healthcheck:
      test: ['CMD-SHELL', './venv/bin/python scripts/healthcheck.py']
    environment:
      LT_THREADS: 8  # Set this to the number of your CPUs
      LT_FRONTEND_TIMEOUT: 180  # 3 minutes timeout
      # More options here: https://github.com/LibreTranslate/LibreTranslate?tab=readme-ov-file#settings--flags
    volumes:
      - lt-local:/home/libretranslate/.local
    # For CUDA support, uncomment the following lines:
    #deploy:
    #  resources:
    #    reservations:
    #      devices:
    #        - driver: nvidia
    #          count: 1
    #          capabilities: [gpu]
volumes:
  lt-local:
    name: lt-local
    external: true
</code>
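Save this file as ''docker-compose.yml'' in a new folder. Because the ''lt-local'' volume is declared as ''external'', create it once with ''docker volume create lt-local''; otherwise, Compose refuses to start the service. Then run ''docker compose up --detach''. The initial model download may take a while. You can check the status using ''docker ps''. Once the ''libretranslate'' service is listed as ''healthy'', it is operational.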
For translating text, use the documented API examples on [[https://github.com/LibreTranslate/LibreTranslate?tab=readme-ov-file#api-examples|LibreTranslate GitHub]] or one of the [[https://github.com/LibreTranslate/LibreTranslate?tab=readme-ov-file#language-bindings|API libraries]]. LibreTranslate also supports translation with automated language detection, currently using langdetect for this purpose.
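As a starting point, a translation request can be sent with the Python standard library alone. This is a minimal sketch, assuming the service listens on ''localhost:5000'' as in the compose file above; the function names and the 180-second timeout are our own choices, not part of LibreTranslate.

```python
import json
import urllib.request

# Assumption: LibreTranslate listens on the port published in the compose file above.
BASE_URL = "http://localhost:5000"


def build_translate_request(text, source="auto", target="en", base_url=BASE_URL):
    """Build a POST request for LibreTranslate's /translate endpoint.

    source="auto" asks LibreTranslate to detect the input language itself.
    """
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source="auto", target="en", base_url=BASE_URL, timeout=180):
    """Send the request and return the translated string."""
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["translatedText"]
```

For example, ''translate("Dobrý den", target="en")'' returns the English translation once the container is healthy. If you already ran your own language detection, pass its result as ''source'' to skip the built-in detection.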
==== LLMs ====
Most large language models (LLMs) are multilingual and outperform traditional translation methods. To run LLMs locally, consider [[https://github.com/ollama/ollama?tab=readme-ov-file|Ollama]] and its API bindings for your programming language. Prompts can be as simple as ''Translate from SOURCE language to TARGET language the following text: TEXT'' or ''Translate the following to TARGET: TEXT'', where the uppercase words are placeholders for the language names and the input text.
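A translation call against a local Ollama server could look as follows. This is a sketch under assumptions: the server runs on Ollama's default port, the model name is supplied by the caller (any model you have pulled), and the helper names and timeout are hypothetical. Instead of the raw HTTP call, you can equally use one of the API bindings.

```python
import json
import urllib.request

# Assumption: an Ollama server runs locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(text, target, source=None):
    """Fill in a simple translation prompt; source is optional."""
    if source:
        return f"Translate from {source} to {target} the following text: {text}"
    return f"Translate the following to {target}: {text}"


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def llm_translate(model, text, target, source=None, timeout=600):
    """Return the model's answer; it may contain more than the bare translation."""
    request = build_generate_request(model, build_prompt(text, target, source))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"].strip()
```

Models sometimes wrap the translation in extra commentary, so you may need to tighten the prompt (e.g., ask for the translation only) for your chosen model.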
Expect throughput to be 50-1000 times slower than LibreTranslate, depending on the model. For shorter texts, LLMs often provide much higher accuracy. However, for longer texts, attention mechanisms may fail, resulting in issues like untranslated text, repeated words, shortened translations, or loss of meaning.
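One possible mitigation for the long-text failures, which is our suggestion rather than an established recipe, is to translate paragraph-sized chunks and flag implausible outputs for review. In the sketch below, the chunk size and the length-ratio thresholds are arbitrary assumptions, and ''translate'' is any callable that maps a string to its translation. Length ratios differ considerably between scripts (e.g., Latin vs. CJK), so tune the thresholds per language pair.

```python
def split_into_chunks(text, max_chars=1000):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def looks_suspicious(source, translation, min_ratio=0.3, max_ratio=3.0):
    """Flag translations that are empty, unchanged, or implausibly short or long."""
    src, out = source.strip(), translation.strip()
    if not out or out == src:
        return True
    ratio = len(out) / max(len(src), 1)
    return ratio < min_ratio or ratio > max_ratio


def translate_long_text(text, translate, max_chars=1000):
    """Translate chunk by chunk; return the joined text and the indices of flagged chunks."""
    results, flagged = [], []
    for index, chunk in enumerate(split_into_chunks(text, max_chars)):
        translated = translate(chunk)
        if looks_suspicious(chunk, translated):
            flagged.append(index)
        results.append(translated)
    return "\n\n".join(results), flagged
```

These checks catch untranslated or truncated output, but not loss of meaning; for that, spot-check a sample manually.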
===== Multilingual Pipeline =====
To make your crawl multilingual, address the following tasks:
- Language detection (see above).
- Switching websites to supported languages by changing browser locale, modifying URL locale strings, or clicking links/buttons labeled with the target language name.
- Interacting with websites based on detection of specific keywords in links or forms.
- Classifying multilingual content.
- Processing multilingual data.
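The language-switching task from the list above can be partly automated without interacting with the page. The sketch below rewrites a leading locale segment in the URL path and builds an ''Accept-Language'' header value; both helper names and the ''q=0.5'' weight are assumptions for illustration. Most browser automation frameworks additionally offer a locale setting that you can combine with this.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches a leading path segment such as "en", "en-US", or "pt_BR".
# Caveat: two-letter non-locale paths (e.g. "/tv/") match as well.
_LOCALE_SEGMENT = re.compile(r"^[a-z]{2}(?:[-_][A-Za-z]{2})?$")


def switch_url_locale(url, target):
    """Replace a leading locale path segment with target; return None if there is none."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # For "/en/products", segments is ["", "en", "products"].
    if len(segments) > 1 and _LOCALE_SEGMENT.match(segments[1]):
        segments[1] = target
        return urlunsplit(parts._replace(path="/".join(segments)))
    return None


def accept_language_header(preferred, fallback="en"):
    """Build an Accept-Language value that prefers one language and falls back to another."""
    if preferred == fallback:
        return preferred
    return f"{preferred},{fallback};q=0.5"
```

A rewritten URL may not exist, so verify the response status and re-run language detection on the loaded page before trusting the switch.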
=== Interaction ===
FIXME: This section is subjective. Can we find references?
Most crawlers interact with websites by detecting specific keywords on pages (e.g., in links or form fields). To construct keyword lists, navigate target-language websites and collect keywords with the help of translation tools or native speakers (e.g., via Amazon MTurk). Alternatively, translate an existing keyword list: simple translation tools may produce unsuitable synonyms, but LLMs given task-specific context can perform reasonably well.
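Keyword matching across languages benefits from Unicode-aware normalization before comparison. The following sketch shows one way to do this; the keyword lists are hypothetical examples, not vetted lists, and the function names are our own.

```python
import unicodedata

# Hypothetical keyword lists; real lists should come from native speakers
# or from carefully reviewed translations.
LOGIN_KEYWORDS = {
    "en": ["log in", "sign in"],
    "de": ["anmelden", "einloggen"],
    "cs": ["přihlásit"],
}


def normalize(text):
    """Apply NFKC, casefold, and collapse whitespace so visually equal strings compare equal."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def matches_keyword(element_text, language, keywords=LOGIN_KEYWORDS):
    """True if any keyword for the language occurs in the element text."""
    haystack = normalize(element_text)
    return any(normalize(keyword) in haystack for keyword in keywords.get(language, []))
```

Substring matching is deliberately permissive and can misfire (for instance, "sign in" occurs inside "design inspiration"), so consider word-boundary matching for languages that separate words with spaces.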
=== Classification ===
Many NLP models now support multiple languages (e.g., multilingual BERT). These models perform better when fine-tuned on multilingual training data. You can collect such data from a sample crawl and use translation for data labeling. Alternatively, use LLMs to generate multilingual training data from a single-language annotated dataset and train a multilingual model. This approach balances performance with computational costs, as BERT-sized models are much faster at inference than generative LLMs.
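Generating multilingual training data from a single-language annotated dataset can be sketched as follows. The function name and the row format are assumptions; ''translate'' stands for any translation backend (LibreTranslate, an LLM, ...) wrapped as a callable taking the text and the target language. Labels are copied unchanged, which assumes that translation preserves the labeled property.

```python
def augment_with_translations(examples, target_languages, translate, source_language="en"):
    """Expand (text, label) pairs into (text, label, language) rows.

    The original example is kept, followed by one translated copy per target language.
    """
    rows = []
    for text, label in examples:
        rows.append((text, label, source_language))
        for language in target_languages:
            if language == source_language:
                continue  # Avoid duplicating the original example.
            rows.append((translate(text, language), label, language))
    return rows
```

Keeping the language in each row lets you evaluate the fine-tuned model per language and spot languages where the translated data is of poor quality.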
=== Postprocessing ===
Watch for potential pitfalls, such as:
* Proper encoding support for non-Latin alphabets.
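For the encoding pitfall, a defensive decoding step might look like the sketch below. The fallback order and the function names are assumptions; adapt the fallbacks to the regions you crawl. Note that a wrong but permissive charset can decode without error and silently produce garbled text, so this only guards against hard failures.

```python
import unicodedata


def decode_body(raw, declared=None, fallbacks=("utf-8", "windows-1252")):
    """Decode response bytes, trying the declared charset first, then the fallbacks.

    Returns (text, encoding_used). The last resort replaces undecodable bytes
    instead of failing.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            # LookupError covers unknown charset names sent by misconfigured servers.
            continue
    return raw.decode("utf-8", errors="replace"), "utf-8"


def normalize_text(text):
    """Apply NFC so composed and decomposed forms of the same character compare equal."""
    return unicodedata.normalize("NFC", text)
```

Applying ''normalize_text'' before storing or comparing strings avoids duplicates that differ only in how accented characters are encoded.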
====== References ======
/*
To insert citations, follow these steps:
- Verify the BibTeX entry exists in https://measuretheweb.org/literature/bibliography. If not, add it there.
- Use {[CitationKey]} where needed in the text; it will render as a numbered reference.
- Keep this section unchanged to display the bibliography.
If any step fails, a purple warning will appear on the page.
*/
/* This enables discussion under this article. */
~~DISCUSSION~~