Overview

As with Website Selection, classifying websites by topic or by properties of the companies behind them (industry type, employee count) is often an important step when processing the results of web measurement studies. Awareness of the available services, their capabilities, and their limitations is essential for high-quality research.

This page examines existing website classification services and datasets that can be queried by domain name or by company name. It outlines best practices for their usage based on existing literature, and provides detailed documentation for each service, including API access, via linked pages. Note that this page overlaps with Website Selection, so refer there for classification of websites based on their popularity.

Note that several of these services are not intended for the large-scale data mining used in our research, and such use might violate their terms and conditions. MeasureTheWeb authors take no responsibility for your use of these services. If in doubt, check with your IRB and legal department before using them.

Existing Services: Data Types, Advantages, and Disadvantages

A variety of services and datasets about websites and their companies are used in research. Domain classification services provide varying types of data for different purposes. They can be grouped by their data output and intended applications.

Content Filtering and Cybersecurity

McAfee

TODO: URL, API

FortiGuard

TODO: URL, API

Symantec

TODO: URL, API

Trend Micro

Forcepoint

TODO: URL, API

Dr.Web

TODO: URL, API

Marketing and Content Discovery

SimilarWeb

See SimilarWeb for documentation of the unofficial API, code and output example.

Webshrinker

TODO: URL, API

General Classification with Human Contributions

OpenDNS

TODO: URL, API

DMOZ (Curlie)

TODO: URL, API

Aggregated Services

VirusTotal

TODO: URL, API

Company Datasets

Unlike the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the company's website, but this matching might be incomplete and might cause the following issues:

PeopleDataLabs

TODO: cite Machine Learning Compliance Analysis for Email Regulation when it is public.

Crunchbase

Although Crunchbase is a proprietary dataset, it offers an academic license upon request. The legacy API also offered a free sample download; this feature is still supported but might disappear at any moment (please remove this note if you observe this change). You can also check individual entries (e.g., Google) on their website in visual form. This is useful for a quick check of the content, but it might be exploited by scraping.

TODO: cite Machine Learning Compliance Analysis for Email Regulation when it is public.

Orbis

Orbis is in many ways similar to Crunchbase. It is proprietary, but it offers academic licensing upon request. It is also primarily focused on companies, but it contains their URLs. Orbis contains data on the longer tail of private companies, but this means that the data is even noisier.
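Because company datasets such as Crunchbase and Orbis store a website URL per company, joining them with a list of measured domains usually requires normalizing both sides first. The following is a minimal sketch of such a join, not the procedure used by any of these datasets. The field name `website`, the helper names, and the choice to strip only a leading `www.` are assumptions made for illustration; real exports use their own column names, and a more careful join would reduce hostnames to registrable domains using the Public Suffix List.

```python
from urllib.parse import urlsplit


def hostname_key(url_or_domain: str) -> str:
    """Reduce a URL or bare domain to a lowercase hostname without a leading 'www.'."""
    text = url_or_domain.strip().lower()
    if "://" not in text:
        # Prefix '//' so urlsplit treats a bare domain as a network location.
        text = "//" + text
    host = urlsplit(text).hostname or ""
    return host[4:] if host.startswith("www.") else host


def match_domains(domains, company_rows):
    """Return (matched, unmatched): domain -> company rows, and domains with no row.

    `company_rows` is an iterable of dicts with a hypothetical 'website' field.
    """
    index = {}
    for row in company_rows:
        key = hostname_key(row.get("website") or "")
        if key:
            index.setdefault(key, []).append(row)

    matched, unmatched = {}, []
    for domain in domains:
        rows = index.get(hostname_key(domain))
        if rows:
            matched[domain] = rows
        else:
            unmatched.append(domain)
    return matched, unmatched
```

Keeping every matching row, instead of only the first, makes ambiguous matches (several companies sharing one website) visible, and the `unmatched` list gives a direct measure of how incomplete the matching is for a given sample.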

Scraping Services

There are many scraping services for datasets like LinkedIn, Glassdoor, Yahoo Finance business information, Yelp businesses, Indeed, etc. Their use might violate the terms and conditions of the primary data sources, but the whole industry is built on scraping each other's data, so you might be causing very limited harm. Nevertheless, check with your IRB and, if needed, your legal department before using these services.

Discontinued Services

Alexa

Topic-Specific Datasets

Adult Websites, Security, and Privacy Protection

Multiple lists exist, mostly maintained for child protection features in routers and similar devices. They are simple to access, and you can directly download a large list of URLs.
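Once such a list is downloaded, checking a measured hostname against it is straightforward. The sketch below assumes the list contains one domain per line, either bare or in hosts-file form (`0.0.0.0 domain`), with `#` comments; actual lists differ in format, and lists of full URLs would need additional parsing. The function names are hypothetical.

```python
def parse_blocklist(lines):
    """Parse plain-domain or hosts-file style lines into a set of lowercase domains."""
    domains = set()
    for line in lines:
        # Drop comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip().lower()
        if not entry:
            continue
        fields = entry.split()
        # Hosts-file format is "<ip> <domain>"; plain format is "<domain>".
        domain = fields[1] if len(fields) >= 2 else fields[0]
        domains.add(domain.rstrip("."))
    return domains


def is_listed(hostname, blocklist):
    """True if the hostname or any parent domain with at least two labels is listed."""
    labels = hostname.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels) - 1))
```

Matching parent domains means that a listed `ads.example.com` also covers `cdn.ads.example.com`, which is usually the intended behavior for filtering lists, but it should be a deliberate choice when the list is used for classification.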

Visit individual privacy-oriented pages for more details regarding classification of Requests, Cookies, Fingerprinting, and JavaScript.

Marketing Industry

Best Practices

Based on a recent study [1], the following recommendations can enhance the quality of research using website classification:

Figure: Website classification coverage according to their popularity, according to Vallina et al. [1].
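Coverage in this sense can also be measured on your own sample before relying on a service. The following is a minimal sketch, not the methodology of [1]: it assumes a list of domains ordered by popularity rank and a dictionary of labels already obtained from a classification service, and reports the fraction of labeled domains within each top-N cutoff. The function name, parameters, and default cutoffs are illustrative assumptions.

```python
def coverage_by_rank(ranked_domains, labels, cutoffs=(1000, 10000, 100000)):
    """Fraction of the top-N domains that have a non-empty label, per cutoff N.

    ranked_domains: domains ordered by popularity, most popular first.
    labels: dict mapping domain -> category; missing, None, or '' counts as uncovered.
    """
    result = {}
    for cutoff in cutoffs:
        top = ranked_domains[:cutoff]
        if not top:
            continue
        covered = sum(1 for domain in top if labels.get(domain))
        # If the list is shorter than the cutoff, the fraction covers what is available.
        result[cutoff] = covered / len(top)
    return result
```

For example, with four ranked domains of which only the two most popular are labeled, `coverage_by_rank(ranked, labels, cutoffs=(2, 4))` returns `{2: 1.0, 4: 0.5}`.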

Use in Publications

Vallina et al. [1] have surveyed the usage of classification services in academic research. The figures below illustrate their findings. Note that the most recent trends are not covered due to the lag in the research process.

Figure: Popularity of website classification in web measurement publications, according to Vallina et al. [1].

References

[1]
Vallina, Pelayo; Le Pochat, Victor; Feal, Álvaro; Paraschiv, Marius; Gamba, Julien; Burke, Tim; Hohlfeld, Oliver; Tapiador, Juan; Vallina-Rodriguez, Narseo (2020): "Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services", in: Proceedings of the ACM Internet Measurement Conference, pp. 598–618. Association for Computing Machinery, New York, NY, USA.