Multilingual Support
Much of web research focuses on English-language websites, which matches the audience of the English-language venues where it is published. However, this focus leads our research field to neglect other target audiences, especially those for whom security and privacy issues may have larger implications than in the Western world. Mhaidli et al. also identified this limitation in research scope in Sec. 4.4.2 of [1].
Below, we cover three aspects of multilingual crawling: language detection, machine translation, and full-pipeline multilingual support. Additionally, you should consider sampling websites from a list that reflects local usage and drawing on other data sources that better fit the hypothetical user profiles of the websites you study.
Language Detection
Several libraries can automatically detect the language of input text, such as langdetect, Google's CLD2/CLD3 (available as pycld2 and pycld3), and Facebook's fasttext. A useful comparison of these tools, along with code examples, can be found at https://modelpredict.com/language-identification-survey.
The key takeaway of that comparison is that fasttext is the clear winner:
- It is the most accurate, followed only by the considerably slower langdetect.
- While pycld2 is about twice as fast, it loses roughly ten percentage points of accuracy (98% vs. 87%).
- If a small memory footprint is required (e.g., for parallel crawls), you can use the compressed fasttext model, though it is slightly less accurate and marginally slower.
This performance assessment might be skewed by the choice of evaluation datasets, as fasttext uses one of the benchmarks as training data; however, fasttext also performs best on other benchmarks. Hosseini et al. [2] reported the best performance on long text with the langdetect library. Thus, you might want to adopt their approach: run an ensemble of multiple methods and select the top result, as sketched below. Note that this increases runtime and memory usage beyond those of the slowest individual method.
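The following minimal sketch illustrates the ensemble idea in Python. The majority-vote rule, the model path (lid.176.ftz), and the choice of the three detectors are our own assumptions, not the exact setup of Hosseini et al.

```python
# Minimal sketch of an ensemble language detector in the spirit of
# Hosseini et al. [2]: query several detectors and take a majority vote.
# Assumes `pip install fasttext langdetect pycld2` and a downloaded
# fasttext model; the path and the voting rule are our own choices.
from collections import Counter

import fasttext
import langdetect
import pycld2

ft_model = fasttext.load_model("lid.176.ftz")  # pretrained model; path is an assumption

def detect_language(text: str) -> str:
    """Majority vote over three detectors; ties fall back to fasttext (first vote)."""
    votes = []
    # fasttext: labels look like "__label__en"; predict() rejects newlines
    labels, _ = ft_model.predict(text.replace("\n", " "))
    votes.append(labels[0].removeprefix("__label__"))
    # langdetect: returns ISO 639-1 codes such as "en"
    try:
        votes.append(langdetect.detect(text))
    except langdetect.LangDetectException:
        pass  # e.g., input without detectable features
    # pycld2: returns (isReliable, bytesFound, details)
    reliable, _, details = pycld2.detect(text)
    if reliable:
        votes.append(details[0][1])  # language code of the best match
    return Counter(votes).most_common(1)[0][0]

print(detect_language("Dies ist ein deutscher Beispieltext."))  # -> "de"
```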
Avoid relying on the language declared in a website's `<html lang>` attribute: many non-English websites incorrectly declare English (en) because of CMS or template defaults.
Machine Translation
A straightforward way to support multiple languages in your study is machine translation. While many paid translation APIs exist (e.g., Google Translate, DeepL), they quickly become prohibitively expensive at crawl scale. Although Google Translate offers a free API, its Terms of Service prohibit automated processing, and its reliability varies due to bot detection. Therefore, this section focuses on open-source, self-hosted translation methods.
Note that these methods are often slow. Using LLMs is even slower, so you might want to explore the next section on keeping the whole pipeline multilingual, which is significantly more efficient.
LibreTranslate (Argos)
LibreTranslate is an interface for the Argos Translate project, simplifying the handling of multiple language models in parallel. We recommend starting with it using Docker.
docker-compose.yaml:

```yaml
services:
  libretranslate:
    image: libretranslate/libretranslate:latest  # Use libretranslate/libretranslate:latest-cuda for CUDA support
    restart: unless-stopped
    ports:
      - 5000:5000
    healthcheck:
      test: ['CMD-SHELL', './venv/bin/python scripts/healthcheck.py']
    environment:
      LT_THREADS: 8  # Set this to the number of your CPUs
      LT_FRONTEND_TIMEOUT: 180  # 3 minutes timeout
      # More options here: https://github.com/LibreTranslate/LibreTranslate?tab=readme-ov-file#settings--flags
    volumes:
      - lt-local:/home/libretranslate/.local
    # For CUDA support, uncomment the following lines:
    #deploy:
    #  resources:
    #    reservations:
    #      devices:
    #        - driver: nvidia
    #          count: 1
    #          capabilities: [gpu]

volumes:
  lt-local:
    name: lt-local
    external: true
```
Save this file to a new folder and run `docker compose up --detach`. The initial model download may take a while. You can check the status using `docker ps`. Once the `libretranslate` service is listed as `healthy`, it is operational.
For translating text, use the API examples documented in the LibreTranslate GitHub repository or one of its API client libraries. LibreTranslate also supports translation with automatic language detection, currently using langdetect for this purpose.
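As a minimal sketch, you can also call the documented /translate endpoint directly; this assumes the compose service above is reachable on localhost:5000 and that no API key is configured.

```python
# Minimal sketch of a call to LibreTranslate's /translate endpoint,
# assuming the Docker service above listens on localhost:5000
# without an API key.
import requests

def translate(text: str, target: str = "en", source: str = "auto") -> str:
    """Translate `text` into `target`; source="auto" uses built-in language detection."""
    resp = requests.post(
        "http://localhost:5000/translate",
        json={"q": text, "source": source, "target": target, "format": "text"},
        timeout=180,  # matches LT_FRONTEND_TIMEOUT in the compose file
    )
    resp.raise_for_status()
    return resp.json()["translatedText"]

print(translate("Dies ist ein Beispieltext."))
```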
LLMs
Most large language models (LLMs) are multilingual and outperform traditional translation methods. To run LLMs locally, consider the Ollama library and its API bindings for various languages. Prompts can be as simple as `Translate from language <lang-A> to language <lang-B> the following text: <text>` or `Translate the following to <lang>: <text>`.
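A sketch of this setup with the Ollama Python bindings follows; the model name (llama3.1) is only an example, and the prompt wording is ours.

```python
# Sketch of LLM-based translation through a local Ollama server.
# Assumes `ollama serve` is running and the model has been pulled
# (e.g., `ollama pull llama3.1`); the model choice is an example.
import ollama

def llm_translate(text: str, target: str = "English") -> str:
    prompt = f"Translate the following to {target}: {text}"
    response = ollama.generate(model="llama3.1", prompt=prompt)
    return response["response"]

print(llm_translate("Dies ist ein Beispieltext."))
```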
Expect throughput to be 50 to 1,000 times lower than LibreTranslate's, depending on the model. For shorter texts, LLMs often provide much higher accuracy. For longer texts, however, the attention mechanism may fail, resulting in issues like untranslated passages, repeated words, shortened translations, or loss of meaning.
Multilingual Pipeline
To make your crawl multilingual, address the following tasks:
- Language detection (see above).
- Switching websites to supported languages by changing the browser locale, modifying locale strings in URLs, or clicking links/buttons labeled with the target language name (see the sketch after this list).
- Interacting with websites based on detection of specific keywords in links or forms.
- Classifying multilingual content.
- Processing multilingual data.
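For the language-switching step, a minimal sketch using Playwright might look as follows; the target site, the locale, and the fallback link text are placeholders.

```python
# Sketch: request a target language via the browser locale, with a
# fallback click on a language-switcher link. Site URL, locale, and
# the link label "Deutsch" are illustrative placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    # Sets the Accept-Language header and navigator.language
    context = browser.new_context(locale="de-DE")
    page = context.new_page()
    page.goto("https://example.org")
    # Fallback: click a link labeled with the target language's name
    switcher = page.get_by_role("link", name="Deutsch")
    if switcher.count() > 0:
        switcher.first.click()
    browser.close()
```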
Interaction
Most crawlers interact with websites by detecting specific keywords on pages (e.g., in links or form fields). To construct keyword lists, navigate websites in the target language and collect keywords with the help of translation tools or native speakers (e.g., recruited via Amazon MTurk). Alternatively, translate an existing keyword list: simple translation tools may produce unsuitable synonyms, but LLMs given task-specific context can perform reasonably well.
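As an illustration, such keyword lists can be kept per language and matched against visible element text; the terms below are illustrative examples, not a vetted list.

```python
# Hypothetical per-language keyword lists for locating consent buttons;
# the terms are illustrative examples, not a curated resource.
ACCEPT_KEYWORDS = {
    "en": ["accept", "agree", "allow all"],
    "de": ["akzeptieren", "zustimmen", "alle erlauben"],
    "fr": ["accepter", "tout autoriser"],
}

def find_accept_button(element_texts: list[str], lang: str) -> int | None:
    """Return the index of the first element whose text contains a keyword."""
    keywords = ACCEPT_KEYWORDS.get(lang, ACCEPT_KEYWORDS["en"])
    for i, text in enumerate(element_texts):
        if any(kw in text.lower() for kw in keywords):
            return i
    return None

print(find_accept_button(["Mehr erfahren", "Alle akzeptieren"], "de"))  # -> 1
```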
Classification
Many NLP models now support multiple languages (e.g., multilingual BERT). These models perform better when fine-tuned on multilingual training data, which you can collect from a sample crawl and label with the help of machine translation. Alternatively, use LLMs to generate multilingual training data from a single-language annotated dataset and train a multilingual model on it. This approach balances performance with computational cost, as BERT-sized models are considerably faster at inference than LLMs.
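A minimal inference sketch with Hugging Face Transformers is shown below; bert-base-multilingual-cased is only a base checkpoint, so its classification head is randomly initialized, and you would substitute your own fine-tuned model.

```python
# Sketch of multilingual text classification with Transformers.
# "bert-base-multilingual-cased" is a base checkpoint whose
# classification head is untrained; swap in your fine-tuned model.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bert-base-multilingual-cased",  # placeholder for a fine-tuned checkpoint
)
# The same model and tokenizer handle many languages
print(classifier("Diese Website verwendet Cookies."))
print(classifier("Ce site utilise des cookies."))
```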
Postprocessing
Watch for potential pitfalls, such as:
- Proper encoding support (e.g., consistent UTF-8 handling) for non-Latin alphabets.
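For instance, when storing crawled text, decoding explicitly instead of trusting HTTP defaults avoids mojibake; the sketch below uses the requests library's charset sniffing, and the URL is a placeholder.

```python
# Sketch: decode crawled responses explicitly instead of trusting the
# declared charset; the URL is a placeholder.
import requests

resp = requests.get("https://example.jp")
resp.encoding = resp.apparent_encoding  # sniffed charset (e.g., Shift_JIS)
text = resp.text  # decoded with the sniffed encoding
data = text.encode("utf-8")  # normalize to UTF-8 before storage
```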
References
[1] Mhaidli, Abraham; Fidan, Selin; Doan, An; Herakovic, Gina; Srinath, Mukund; Matheson, Lee; Wilson, Shomir; Schaub, Florian (2023): "Researchers’ Experiences in Analyzing Privacy Policies: Challenges and Opportunities", Proceedings on Privacy Enhancing Technologies 2023:287–305.
[2] Hosseini, Henry; Degeling, Martin; Utz, Christine; Hupperich, Thomas (2021): "Unifying privacy policy detection", Proceedings on Privacy Enhancing Technologies 2021:480–499.