Overview
As with Website Selection, classifying websites by their topic or by properties of the companies behind them (industry type, employee count) is often an important step in processing the results of web measurement studies. Awareness of the available services, their capabilities, and their limitations is essential for high-quality research.
This page examines existing website classification services and datasets that work by domain name or by company name. It outlines best practices for their usage based on existing literature, and provides detailed documentation for each service, including API access, via linked pages. Note that this page overlaps with Website Selection, so refer there for classification of websites based on their popularity.
Note that several of these services are not meant for the large-scale data mining that we perform in our research, so such use might violate their terms and conditions. MeasureTheWeb authors take no responsibility for your use of these services. Check with your IRB and legal department to confirm that you may use them.
Existing Services: Data Types, Advantages, and Disadvantages
A variety of services and datasets about websites and their companies are utilized in research. Domain classification services provide varying types of data for different purposes. These can be grouped based on their data output and intended applications.
Content Filtering and Cybersecurity
McAfee
Advantages:
High coverage across both popular and non-popular domains (>90%).
Real-time classification capabilities for enhanced accuracy.
Provides threat assessments, suitable for cybersecurity-focused applications.
Automated approaches supported by manual oversight for balanced precision.
Disadvantages:
Limited granularity in labels, which may not suit detailed marketing or behavioral analysis.
Documentation includes deprecated categories, leading to potential misinterpretations.
TODO: URL, API
FortiGuard
Advantages:
High coverage across both popular and non-popular domains (>90%).
Real-time classification capabilities for enhanced accuracy.
Specifically geared towards security, excelling in detecting malicious or high-risk domains.
Disadvantages:
Lower label granularity may restrict its applicability outside security domains.
Limited documentation transparency in certain sensitive categories.
TODO: URL, API
Symantec
Advantages:
High accuracy and comprehensiveness in top-tier domains.
Effective for both threat assessment and content filtering.
Disadvantages:
Taxonomy is less diverse compared to marketing-oriented services.
Limited coverage for obscure or long-tail domains.
TODO: URL, API
Trend Micro
Advantages:
High coverage for high-popularity domains, with rapid updates.
Labels aligned with threat intelligence, enhancing usability in cybersecurity contexts.
Disadvantages:
-
Forcepoint
Advantages:
Excels in threat assessment with quick and scalable label updates.
Suitable for security applications requiring dynamic updates.
Disadvantages:
Limited multi-labeling capabilities restrict nuanced classification.
Challenges in documenting clear and concise taxonomies.
TODO: URL, API
Dr.Web
Advantages:
Minimalistic labeling approach ensures straightforward and consistent outputs.
Good fit for basic security-focused use cases.
Disadvantages:
Very low coverage.
Lack of nuanced or detailed labeling reduces utility in research or marketing.
TODO: URL, API
Marketing and Content Discovery
SimilarWeb
Advantages:
High-quality data, accessible via unofficial API.
Industry, region- and origin-based popularity.
Disadvantages:
Official API is paid; use of the unofficial one likely violates SimilarWeb's terms and conditions.
See SimilarWeb for documentation of the unofficial API, code and output example.
Webshrinker
Advantages:
Offers a marketing-oriented taxonomy aligned with advertising industry standards (IAB).
Automated real-time updates enhance coverage and relevance.
Disadvantages:
Precision and granularity can vary, sometimes complicating results.
Documentation and taxonomy definitions require improvement for research usability.
TODO: URL, API
General Classification with Human Contributions
OpenDNS
Advantages:
Human-involved process can ensure high-quality and nuanced labels for popular domains.
Open voting and moderation improve transparency in label assignments.
Disadvantages:
Scalability issues due to reliance on human volunteers.
Low coverage and subjective biases in labeling.
TODO: URL, API
DMOZ (Curlie)
Advantages:
Human-curated labels provide depth and relevance, ideal for content discovery.
Detailed hierarchical taxonomy supports diverse classification needs.
Disadvantages:
Extremely limited scalability due to a small number of editors.
Labels may be outdated due to infrequent updates for many categories.
TODO: URL, API
Aggregated Services
VirusTotal
Advantages:
Aggregates data from multiple providers, offering a combined perspective.
Widely used in academic and industrial research due to ease of access.
Disadvantages:
Inconsistencies due to integration of outdated or non-standardized data.
Lack of direct control over taxonomies used by aggregated providers.
TODO: URL, API
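Because an aggregator returns labels from several providers with different taxonomies, it can help to quantify how often the providers agree before relying on any single label. The following is a minimal sketch, assuming the aggregated output for one domain has already been parsed into a mapping from provider name to label. The function name and the provider names in the example are hypothetical, and the normalization only handles case and whitespace, not differences between taxonomies.

```python
from collections import Counter


def summarize_provider_labels(provider_labels):
    """Count case-normalized labels across providers and report whether they agree."""
    # Providers that returned no label (None or empty string) are ignored.
    normalized = [label.strip().lower() for label in provider_labels.values() if label]
    counts = Counter(normalized)
    return {
        "counts": dict(counts),
        "unanimous": len(counts) == 1,
        "providers_with_label": len(normalized),
    }


# Hypothetical labels for a single domain.
summary = summarize_provider_labels({
    "provider_a": "News",
    "provider_b": "news ",
    "provider_c": "Information Technology",
    "provider_d": "",
})
```

Domains where the providers disagree are natural candidates for manual inspection or for mapping the labels onto a common taxonomy first.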
Company Datasets
Unlike the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the website of each company, but this matching might be incomplete and can cause the following issues:
A company may own multiple websites, while the dataset may list only some of them.
Likewise, the dataset may contain multiple companies for a given website:
Because of sister companies in a corporate group.
Many small businesses list a social media profile as their website. Sometimes this link does not include the full path, so a single-person company might indicate facebook.com as its domain.
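The matching issues above can be detected before a company dataset is joined with a website list. The following is a minimal sketch, assuming the dataset has been loaded as (company, domain) pairs. The function name, the SHARED_PLATFORMS set, and the sample records are hypothetical and do not reflect the schema of any particular dataset.

```python
from collections import defaultdict

# Hypothetical, incomplete set of platforms that small businesses list as their website.
SHARED_PLATFORMS = {"facebook.com", "instagram.com", "linkedin.com"}


def find_ambiguous_matches(records):
    """Group (company, domain) pairs and report one-to-many matches in both directions."""
    companies_by_domain = defaultdict(set)
    domains_by_company = defaultdict(set)
    for company, domain in records:
        # Normalize case and a leading "www." so equivalent domains are grouped together.
        domain = domain.strip().lower()
        if domain.startswith("www."):
            domain = domain[4:]
        companies_by_domain[domain].add(company)
        domains_by_company[company].add(domain)
    return {
        "shared_platform": sorted(d for d in companies_by_domain if d in SHARED_PLATFORMS),
        "multi_company_domains": sorted(
            d for d, companies in companies_by_domain.items()
            if len(companies) > 1 and d not in SHARED_PLATFORMS
        ),
        "multi_domain_companies": sorted(
            c for c, domains in domains_by_company.items() if len(domains) > 1
        ),
    }


# Hypothetical records illustrating each issue.
records = [
    ("Acme", "acme.com"),
    ("Acme", "acme.org"),
    ("Acme Holding", "www.acme.com"),
    ("Bob's Bakery", "facebook.com"),
    ("Jane Doe Consulting", "Facebook.com"),
]
result = find_ambiguous_matches(records)
```

Entries flagged in any of the three groups should be resolved or excluded explicitly, since silently keeping them attributes one company's properties to another company's website.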
PeopleDataLabs
Advantages:
Freely available from the PeopleDataLabs website, which also includes good documentation. No registration is needed.
Contains fields for the website, geographical location of headquarters, industry, founding year, and firm size.
Disadvantages:
Based on self-reported LinkedIn profiles, and therefore prone to adversarial data.
Only a subset of PeopleDataLabs' full dataset (22M/70M rows, 10/78 columns).
TODO: cite Machine Learning Compliance Analysis for Email Regulation when it is public.
Crunchbase
Although Crunchbase is a proprietary dataset, it offers an academic license upon request. The legacy API also offered a free sample download; this feature is still supported but might disappear at any moment (please remove this note if you observe this change). You can also view individual entries (e.g., Google) on the Crunchbase website in visual form. This is useful for a quick check of the content, and these pages might also be exploited by scraping.
Advantages:
Contains very rich data, e.g., website, rank, region, industries, financial data.
Offers free academic license upon request.
Disadvantages:
URLs are extremely noisy, as they are not a priority of the dataset (source: Karel Kubicek's experience).
Focuses mostly on variables useful for investments and market competitiveness.
TODO: cite Machine Learning Compliance Analysis for Email Regulation when it is public.
Orbis
Orbis is in many ways similar to Crunchbase. It is proprietary but offers academic licensing upon request. It also focuses primarily on companies but contains their URLs. Orbis covers the longer tail of private companies, which makes its data even noisier.
Advantages:
Longer tail data than Crunchbase.
Offers free academic license upon request.
Disadvantages:
Extremely noisy.
Fewer fields than Crunchbase.
Scraping Services
Many scraping services exist for sources such as LinkedIn, Glassdoor, Yahoo Finance business information, Yelp businesses, and Indeed. Their use might violate the terms and conditions of the primary data sources, but the whole industry is built on scraping each other's data, so the harm you cause is likely very limited. Nevertheless, check with your IRB and possibly your legal department before using these services.
Discontinued Services
Alexa
Advantages:
Highly granular taxonomy, offering fine-tuned marketing and content analysis.
Useful for top-tier domain classification in behavioral research.
Disadvantages:
Discontinued.
Low coverage; issues with the taxonomy hierarchy may lead to inconsistent results.
Topic-Specific Datasets
Adult Websites, Security, and Privacy Protection
Multiple lists exist, mostly maintained for child protection in routers and similar devices. They are simple to access: you can directly download a large list of URLs.
Visit individual privacy-oriented pages for more details regarding classification of Requests, Cookies, Fingerprinting, and JavaScript.
Marketing Industry
Martech provides data on 17k companies in the advertising industry. The data is easy to download: after registering on https://martechmap.com/, search for martech_data_N.json files in the network tab of the browser's developer tools.
Best Practices
Based on a recent study [1], the following recommendations can enhance the quality of research using website classification:
Figure: Website classification coverage by website popularity, according to [1].
Recommendations:
Use automated systems for scalability but validate portions of data manually to evaluate its quality.
Use the most recent datasets to account for dynamic web content and updated labels.
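The first recommendation can be put into practice by drawing a reproducible random sample of labeled domains, checking it by hand, and reporting the share of labels judged correct. The following is a minimal sketch under those assumptions. The function names, the fixed seed, and the example labels are hypothetical, and the sample size needed for a meaningful estimate depends on the study.

```python
import random


def sample_for_manual_validation(labels, sample_size, seed=0):
    """Return a reproducible random sample of (domain, label) pairs to check by hand."""
    # Skip unlabeled domains and sort so the sample does not depend on dict order.
    labeled = sorted((d, label) for d, label in labels.items() if label is not None)
    rng = random.Random(seed)
    return rng.sample(labeled, min(sample_size, len(labeled)))


def estimated_accuracy(manual_verdicts):
    """Share of sampled labels that a human annotator judged correct."""
    if not manual_verdicts:
        raise ValueError("no manual verdicts provided")
    return sum(manual_verdicts.values()) / len(manual_verdicts)


# Hypothetical service output; None marks a domain without a label.
labels = {"a.com": "News", "b.com": None, "c.com": "Shopping", "d.com": "Games"}
sample = sample_for_manual_validation(labels, sample_size=2, seed=1)
accuracy = estimated_accuracy({"a.com": True, "c.com": False})
```

Fixing the seed lets co-authors and reviewers reconstruct exactly which domains were validated.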
Use in Publications
Vallina et al. [1] have surveyed the usage of classification services in academic research. The figures below illustrate their findings. Note that the most recent trends are not covered due to the lag in the research process.
Figure: Popularity of website classification in web measurement publications, according to [1].
Commonly Used Services:
VirusTotal: Used in 46% of analyzed research papers due to its aggregation, though inconsistencies require cautious interpretation.
Alexa: Leveraged for domain datasets by 27% of the surveyed studies.
Impact of Choice:
The selection of domain classification services significantly influences outcomes. Disparate coverage rates (ranging from <1% to >90%) mean some domains are entirely excluded depending on the service used.
Misalignment in label semantics and granularity further complicates research validity.
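To see how the choice of service affects a particular study, coverage can be measured on the study's own domain list before committing to a service. The following is a minimal sketch, assuming each service's output has been collected into a mapping from domain to label. The function name, service names, and example data are hypothetical.

```python
def coverage_report(domains, labels_by_service):
    """Per service: fraction of study domains with a label, and the domains left out."""
    unique_domains = set(domains)
    report = {}
    for service, labels in labels_by_service.items():
        # A domain counts as covered only if the service returned a non-empty label.
        covered = {d for d in unique_domains if labels.get(d)}
        report[service] = {
            "coverage": len(covered) / len(unique_domains) if unique_domains else 0.0,
            "excluded": sorted(unique_domains - covered),
        }
    return report


# Hypothetical study domains and service outputs.
domains = ["a.com", "b.com", "c.com", "d.com"]
labels_by_service = {
    "service_a": {"a.com": "News", "b.com": "Shopping", "c.com": "Games"},
    "service_b": {"a.com": "news"},
}
report = coverage_report(domains, labels_by_service)
```

Reporting the excluded domains alongside the coverage rate makes it visible which part of the dataset the results do not describe.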
References
- [1] Vallina, Pelayo; Le Pochat, Victor; Feal, Álvaro; Paraschiv, Marius; Gamba, Julien; Burke, Tim; Hohlfeld, Oliver; Tapiador, Juan; Vallina-Rodriguez, Narseo (2020): "Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services", in: Proceedings of the ACM Internet Measurement Conference, pp. 598–618. Association for Computing Machinery, New York, NY, USA.