====== Overview ======
As with [[Design:Website Selection]], classifying websites by topic or by properties of the companies behind them (industry type, employee count) is often an important step when processing the results of web measurement studies. Awareness of the available services, their capabilities, and their limitations is essential for high-quality research.

This page examines existing website classification services and datasets that work by either domain name or company name. It outlines best practices for their use based on existing literature and provides detailed documentation for each service, including API access, via linked pages. This page overlaps with [[Design:Website Selection]]; refer there for classification of websites by popularity.

Note that several of these services are not intended for the large-scale data mining that we use in our research, so such use might violate their terms and conditions. MeasureTheWeb authors take no responsibility for your use of these services. Check with your IRB and legal department to confirm that you may use them.

====== Existing Services: Data Types, Advantages, and Disadvantages ======
A variety of services and datasets about websites and their companies are used in research. Domain classification services provide different types of data for different purposes and can be grouped by their data output and intended applications.

===== Content Filtering and Cybersecurity =====

==== McAfee ====
  * **Advantages**:
    - High coverage across both popular and non-popular domains (>90%).
    - Real-time classification capabilities for enhanced accuracy.
    - Provides threat assessments, suitable for cybersecurity-focused applications.
    - Automated approaches supported by manual oversight for balanced precision.
  * **Disadvantages**:
    - Limited granularity in labels, which may not suit detailed marketing or behavioral analysis.
    - Documentation includes deprecated categories, leading to potential misinterpretations.
  TODO: URL, API

==== FortiGuard ====
  * **Advantages**:
    - High coverage across both popular and non-popular domains (>90%).
    - Real-time classification capabilities for enhanced accuracy.
    - Specifically geared towards security, excelling in detecting malicious or high-risk domains.
  * **Disadvantages**:
    - Lower label granularity may restrict its applicability outside security domains.
    - Limited documentation transparency in certain sensitive categories.
  TODO: URL, API

==== Symantec ====
  * **Advantages**:
    - High accuracy and comprehensiveness in top-tier domains.
    - Effective for both threat assessment and content filtering.
  * **Disadvantages**:
    - Taxonomy is less diverse compared to marketing-oriented services.
    - Limited coverage for obscure or long-tail domains.
  TODO: URL, API

==== Trend Micro ====
  * **Advantages**:
    - High coverage for high-popularity domains, with rapid updates.
    - Labels aligned with threat intelligence, enhancing usability in cybersecurity contexts.
  * **Disadvantages**:
    - TODO
  TODO: URL, API

==== Forcepoint ====
  * **Advantages**:
    - Excels in threat assessment with quick and scalable label updates.
    - Suitable for security applications requiring dynamic updates.
  * **Disadvantages**:
    - Limited multi-labeling capabilities restrict nuanced classification.
    - Challenges in documenting clear and concise taxonomies.
  TODO: URL, API

==== Dr.Web ====
  * **Advantages**:
    - Minimalistic labeling approach ensures straightforward and consistent outputs.
    - Good fit for basic security-focused use cases.
  * **Disadvantages**:
    - Very low coverage.
    - Lack of nuanced or detailed labeling reduces utility in research or marketing.
  TODO: URL, API

===== Marketing and Content Discovery =====

==== SimilarWeb ====
  * **Advantages**:
    - High-quality data, accessible via unofficial API.
    - Industry, region- and origin-based popularity.
  * **Disadvantages**:
    - Official API is paid; use of the unofficial one likely violates SimilarWeb's terms and conditions.
  See [[Programming:SimilarWeb]] for documentation of the unofficial API, code and output example.

==== Webshrinker ====
  * **Advantages**:
    - Offers a marketing-oriented taxonomy aligned with advertising industry standards (IAB).
    - Automated real-time updates enhance coverage and relevance.
  * **Disadvantages**:
    - Precision and granularity can vary, sometimes complicating results.
    - Documentation and taxonomy definitions require improvement for research usability.
  TODO: URL, API

===== General Classification with Human Contributions =====

==== OpenDNS ====
  * **Advantages**:
    - Human-involved process can ensure high-quality and nuanced labels for popular domains.
    - Open voting and moderation improve transparency in label assignments.
  * **Disadvantages**:
    - Scalability issues due to reliance on human volunteers.
    - Low coverage and subjective biases in labeling.
  TODO: URL, API

==== DMOZ (Curlie) ====
  * **Advantages**:
    - Human-curated labels provide depth and relevance, ideal for content discovery.
    - Detailed hierarchical taxonomy supports diverse classification needs.
  * **Disadvantages**:
    - Extremely limited scalability due to a small number of editors.
    - Labels may be outdated due to infrequent updates for many categories.
  TODO: URL, API

===== Aggregated Services =====

==== VirusTotal ====
  * **Advantages**:
    - Aggregates data from multiple providers, offering a combined perspective.
    - Widely used in academic and industrial research due to ease of access.
  * **Disadvantages**:
    - Inconsistencies due to integration of outdated or non-standardized data.
    - Lack of direct control over taxonomies used by aggregated providers.
  TODO: URL, API

===== Company Datasets =====
Unlike the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the company's website, but this matching may be incomplete and can cause the following issues:

  * If a company owns multiple websites:
    * Likely only the main website is listed.
    * This is especially pronounced for international versions of the website.
  * Likewise, the dataset may contain multiple companies for a given website:
    * Sister companies within one corporate group may share a website.
    * Many small businesses list a social media profile as their website. Sometimes this link does not include the full path, so a single-person company might give facebook.com as its domain.
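
The matching issues above can be screened for before joining a company dataset with website measurements. The following is a minimal sketch, assuming the dataset has already been reduced to ''(company, website)'' pairs. The ''SHARED_PLATFORMS'' set and both function names are illustrative and not part of any dataset listed here.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical and incomplete: platforms that small businesses list as their "website".
SHARED_PLATFORMS = {"facebook.com", "instagram.com", "linkedin.com"}


def normalize_host(website: str) -> str:
    """Reduce a website field to a lowercase hostname without a leading 'www.'."""
    value = website.strip().lower()
    if "://" not in value:
        value = "//" + value  # lets urlsplit parse a bare domain as a network location
    host = urlsplit(value).hostname or ""
    return host[4:] if host.startswith("www.") else host


def ambiguous_hosts(records):
    """Return hosts claimed by several companies or belonging to a shared platform."""
    companies_by_host = defaultdict(set)
    for company, website in records:
        host = normalize_host(website)
        if host:
            companies_by_host[host].add(company)
    return {
        host: sorted(companies)
        for host, companies in companies_by_host.items()
        if len(companies) > 1 or host in SHARED_PLATFORMS
    }
```

Hosts returned by ''ambiguous_hosts'' are candidates for manual review or exclusion. The sketch does not detect the opposite problem, i.e., companies whose additional websites are missing from the dataset.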

==== PeopleDataLabs ====
  * **Advantages**:
    - Freely available from the [[https://docs.peopledatalabs.com/docs/free-company-dataset|PeopleDataLabs website]], which also includes good documentation. No registration is needed.
    - Contains fields for website, geographical location of headquarters, industry, founding year, and firm size.
  * **Disadvantages**:
    - Based on self-reported LinkedIn profiles, so it is prone to adversarial data.
    - Only a subset of PeopleDataLabs' full dataset (22M/70M rows, 10/78 columns).
  TODO: cite ''Machine Learning Compliance Analysis for Email Regulation'' when it is public.

==== Crunchbase ====
Although Crunchbase is a proprietary dataset, it offers an [[https://about.crunchbase.com/partners/academic-research-access/|academic license upon request]]. In addition, the [[https://data.crunchbase.com/v4-legacy/docs/legacy-csv-export#sample-csv-export|legacy API offered a free sample download]]; this feature is still supported but might disappear at any moment (please remove this note if you observe this change). You can also view individual entries (e.g., [[https://www.crunchbase.com/organization/google|Google]]) on the Crunchbase website. This is useful for a quick check of the content, and these pages could be exploited by scraping.

  * **Advantages**:
    - Contains very rich data, e.g., website, rank, region, industries, financial data.
    - Offers free academic license upon request.
  * **Disadvantages**:
    - URLs are extremely noisy, as they are not a priority of the dataset (source: Karel Kubicek's experience).
    - Focuses mostly on variables useful for investments and market competitiveness.
  TODO: cite ''Machine Learning Compliance Analysis for Email Regulation'' when it is public.

==== Orbis ====
Orbis is in many ways similar to Crunchbase. It is proprietary but offers academic licensing upon request. It is also primarily focused on companies but contains their URLs. Orbis covers a longer tail of private companies, which means that its data is even noisier.

  * **Advantages**:
    - Longer tail data than Crunchbase.
    - Offers free academic license upon request.
  * **Disadvantages**:
    - Extremely noisy.
    - Fewer fields than Crunchbase.

===== Scraping Services =====
There are many scraping services for sources like LinkedIn, Glassdoor, Yahoo Finance business information, Yelp businesses, Indeed, etc. Their use might violate the terms and conditions of the primary data sources, but the whole industry is built on scraping each other's data, so you might be causing very limited harm. Nevertheless, check with your IRB and, where applicable, your legal department before using these services.

  * [[https://brightdata.com/products/datasets|BrightData]]: They sometimes offer academic licensing and provide a free demo upon signup.
  * [[https://docs.coresignal.com|Coresignal]]
  * Amazon AWS Marketplace offers various datasets through external services.
  * [[https://www.kaggle.com/|Kaggle]]: This website often contains smaller samples of various company datasets.

===== Discontinued Services =====

==== Alexa ====
  * **Advantages**:
    - Highly granular taxonomy, offering fine-tuned marketing and content analysis.
    - Useful for top-tier domain classification in behavioral research.
  * **Disadvantages**:
    - Discontinued.
    - Low coverage; issues with the taxonomy hierarchy may lead to inconsistent results.

===== Topic-Specific Datasets =====

==== Adult Websites, Security, and Privacy Protection ====
Multiple lists exist, mostly maintained for child protection in routers and similar devices. They are simple to access, and you can directly download a large list of URLs.

  * [[https://github.com/Bon-Appetit/porn-domains]]: Daily updated list of porn websites.
  * [[https://github.com/hagezi/dns-blocklists]]: DNS blocklists for adult content, piracy, gambling, security (scam, malware), and privacy (ads, tracking).
  * [[https://github.com/search?q=pihole%20blocklist&type=repositories]]: Search for pi-hole blocklists for specific uses.
  * [[https://oisd.nl/]], [[https://firebog.net/]]
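
Once such a list is downloaded, checking measured hostnames against it takes only a few lines. The following is a minimal sketch, assuming the list uses either one domain per line or the hosts-file style (''0.0.0.0 example.com'') with ''#'' comments. Formats differ between repositories, so check the file before parsing it. The function names are illustrative.

```python
def parse_blocklist(lines):
    """Collect domains from plain-domain or hosts-style ("0.0.0.0 example.com") lines."""
    domains = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip().lower()  # drop comments and whitespace
        if not entry:
            continue
        # In hosts-style lines the domain is the last whitespace-separated token.
        domains.add(entry.split()[-1].rstrip("."))
    return domains


def is_listed(hostname, blocklist):
    """Check the hostname and each parent domain, except the bare TLD, against the list."""
    labels = hostname.lower().rstrip(".").split(".")
    return any(
        ".".join(labels[i:]) in blocklist
        for i in range(max(1, len(labels) - 1))
    )
```

Matching parent domains means that ''cdn.ads.example.com'' is flagged when ''ads.example.com'' is on the list. Whether this is desirable depends on how the list maintainers intend their entries to be interpreted.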

Visit individual privacy-oriented pages for more details regarding classification of [[Privacy:Requests]], [[Privacy:Cookies]], [[Privacy:Fingerprinting]], and [[Privacy:JavaScript]].

==== Marketing Industry ====
  * Martech provides data on 17k companies in the advertising industry. The data is easy to download after registering on [[https://martechmap.com/]]: search for ''martech_data_N.json'' files in the network tab of the browser dev tools.
  * IAB Europe lists advertising and tracking third parties. On [[https://iabeurope.eu/vendor-list-tcf/]], open the page source and find the large table ''<table id="tablepress-72" class="tablepress tablepress-id-72">''. Copy that table to the [[https://www.convertcsv.com/html-table-to-csv.htm|Converter]] to get a CSV.
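
The table conversion can also be done locally instead of with an online converter. The following is a minimal sketch using only the Python standard library, assuming the page source has been saved and that the table still carries the id ''tablepress-72'' mentioned above (the page structure may change). The class and function names are illustrative.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell texts of the table with the given id, row by row."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0  # >0 while inside the target table (also counts nested tables)
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif self.depth and tag == "tr":
            self._row = []
        elif self.depth and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            # Join text fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_csv(html, table_id):
    """Return the rows of the identified HTML table as CSV text."""
    parser = TableExtractor(table_id)
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

The sketch expects explicitly closed ''<td>'' and ''<tr>'' tags. For HTML that omits them, a full HTML parsing library is the safer choice.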

====== Best Practices ======
Based on a recent study {[vallina2020_misshapes]}, the following recommendations can enhance the quality of research using website classification:

  * **Representativeness of Services**:
    - Coverage Analysis: McAfee and FortiGuard provide comprehensive coverage (>90% for popular domains), suitable for large-scale applications. See Table 2 below.
    - Labeling Diversity: Alexa and Webshrinker offer extensive marketing-oriented categories, better for specialized research.
    - Taxonomy Documentation: Researchers should verify documented categories against observed labels to mitigate discrepancies.

<WRAP right box>
{{design:website_classification_coverage_vallina2020.png?nolink&500|Website classification coverage by website popularity, according to Vallina et al.}}
<div>Website classification coverage by website popularity, according to {[vallina2020_misshapes]}.</div>
</WRAP>

  * **From the Discussion Section**:
    - Avoid sole reliance on a single service for critical decisions. Combining outputs, despite challenges, may improve representativeness.
    - For human-driven services like OpenDNS, users should account for labeling biases and update lags due to manual moderation.

  * **Recommendations**:
    - Use automated systems for scalability, but validate a portion of the data manually to evaluate its quality.
    - Use the most recent datasets to account for dynamic web content and updated labels.
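
Two of these recommendations, checking coverage and validating a portion of the labels manually, are easy to operationalize. The following is a minimal sketch, assuming that the labels of each service were already collected into a dictionary mapping domains to labels (missing or empty values mean "no label"). The service names, data layout, and function names are illustrative.

```python
import random


def coverage(labels_by_service, domains):
    """Fraction of `domains` for which each service returned a non-empty label."""
    if not domains:
        raise ValueError("domains must not be empty")
    return {
        service: sum(1 for d in domains if labels.get(d)) / len(domains)
        for service, labels in labels_by_service.items()
    }


def validation_sample(labels, size, seed=0):
    """Draw a reproducible random sample of labeled domains for manual review."""
    labeled = sorted(d for d, label in labels.items() if label)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced later
    return rng.sample(labeled, min(size, len(labeled)))
```

Computing coverage on your own domain list matters because the coverage of a service on popular domains says little about its coverage on a long-tail sample.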

====== Use in Publications ======
Vallina et al. {[vallina2020_misshapes]} surveyed the use of classification services in academic research. The figures below illustrate their findings. Note that the most recent trends are not covered due to the lag in the research process.

<WRAP right box>
{{design:website_classification_popularity_vallina2020.png?nolink&500|Popularity of website classification in web measurement publications according to Vallina et al.}}
<div>Popularity of website classification in web measurement publications according to {[vallina2020_misshapes]}.</div>
</WRAP>

  * **Commonly Used Services**:
    - VirusTotal: Used in 46% of the analyzed research papers thanks to its aggregation, though inconsistencies require cautious interpretation.
    - Alexa: Leveraged for domain datasets by 27% of the surveyed studies.

  * **Impact of Choice**:
    - The selection of domain classification services significantly influences outcomes. Disparate coverage rates (ranging from <1% to >90%) mean some domains are entirely excluded depending on the service used.
    - Misalignment in label semantics and granularity further complicates research validity.

  * **Examples of Research Areas**:
    - Security-focused research benefits from services like McAfee and FortiGuard due to their robust threat assessments.
    - Marketing and web behavior analysis rely on detailed taxonomies from Alexa and Webshrinker.
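
The impact of the service choice can be estimated on your own data by comparing two services on the domains that both of them label. The following is a minimal sketch, assuming that you have manually mapped the labels of each service to a common set of categories. Such mappings are not provided by the services, and the function name and data layout are illustrative.

```python
def agreement_rate(labels_a, labels_b, mapping_a, mapping_b):
    """Share of domains labeled by both services whose mapped labels coincide.

    Returns (rate, n_compared); rate is None when no domain can be compared.
    Domains with a label missing from the mappings are left out.
    """
    shared = [
        d for d in labels_a.keys() & labels_b.keys()
        if labels_a[d] in mapping_a and labels_b[d] in mapping_b
    ]
    if not shared:
        return None, 0
    agree = sum(mapping_a[labels_a[d]] == mapping_b[labels_b[d]] for d in shared)
    return agree / len(shared), len(shared)
```

Report the number of compared domains together with the rate, since low coverage of either service can leave only a small and unrepresentative intersection.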

====== References ======
<bibtex bibliography></bibtex>


~~DISCUSSION~~