design:website_classification
Differences
This shows you the differences between two versions of the page.
design:website_classification [2025/01/02 17:22] – created website classification page karelkubicek | design:website_classification [2025/02/19 10:11] (current) – highlighted todos karelkubicek | ||
---|---|---|---|
Line 20: | Line 20: | ||
- Limited granularity in labels, which may not suit detailed marketing or behavioral analysis. | - Limited granularity in labels, which may not suit detailed marketing or behavioral analysis. | ||
- Documentation includes deprecated categories, leading to potential misinterpretations. | - Documentation includes deprecated categories, leading to potential misinterpretations. | ||
- | TODO: URL, API | + | |
==== FortiGuard ==== | ==== FortiGuard ==== | ||
Line 30: | Line 30: | ||
- Lower label granularity may restrict its applicability outside security domains. | - Lower label granularity may restrict its applicability outside security domains. | ||
- Limited documentation transparency in certain sensitive categories. | - Limited documentation transparency in certain sensitive categories. | ||
- | TODO: URL, API | + | |
==== Symantec ==== | ==== Symantec ==== | ||
Line 39: | Line 39: | ||
- Taxonomy is less diverse compared to marketing-oriented services. | - Taxonomy is less diverse compared to marketing-oriented services. | ||
- Limited coverage for obscure or long-tail domains. | - Limited coverage for obscure or long-tail domains. | ||
- | TODO: URL, API | + | |
==== Trend Micro ==== | ==== Trend Micro ==== | ||
Line 46: | Line 46: | ||
- Labels aligned with threat intelligence, | - Labels aligned with threat intelligence, | ||
* **Disadvantages**: | * **Disadvantages**: | ||
- | - (TODO: URL, API) | + | - <wrap todo> |
+ | <wrap todo>TODO: URL, API</ | ||
==== Forcepoint ==== | ==== Forcepoint ==== | ||
Line 55: | Line 56: | ||
- Limited multi-labeling capabilities restrict nuanced classification. | - Limited multi-labeling capabilities restrict nuanced classification. | ||
- Challenges in documenting clear and concise taxonomies. | - Challenges in documenting clear and concise taxonomies. | ||
- | TODO: URL, API | + | |
==== Dr.Web ==== | ==== Dr.Web ==== | ||
Line 64: | Line 65: | ||
- Very low coverage. | - Very low coverage. | ||
- Lack of nuanced or detailed labeling reduces utility in research or marketing. | - Lack of nuanced or detailed labeling reduces utility in research or marketing. | ||
- | TODO: URL, API | + | |
===== Marketing and Content Discovery ===== | ===== Marketing and Content Discovery ===== | ||
Line 83: | Line 84: | ||
- Precision and granularity can vary, sometimes complicating results. | - Precision and granularity can vary, sometimes complicating results. | ||
- Documentation and taxonomy definitions require improvement for research usability. | - Documentation and taxonomy definitions require improvement for research usability. | ||
- | TODO: URL, API | + | |
===== General Classification with Human Contributions ===== | ===== General Classification with Human Contributions ===== | ||
Line 94: | Line 95: | ||
- Scalability issues due to reliance on human volunteers. | - Scalability issues due to reliance on human volunteers. | ||
- Low coverage and subjective biases in labeling. | - Low coverage and subjective biases in labeling. | ||
- | TODO: URL, API | + | |
==== DMOZ (Curlie) ==== | ==== DMOZ (Curlie) ==== | ||
Line 103: | Line 104: | ||
- Extremely limited scalability due to a small number of editors. | - Extremely limited scalability due to a small number of editors. | ||
- Labels may be outdated due to infrequent updates for many categories. | - Labels may be outdated due to infrequent updates for many categories. | ||
- | TODO: URL, API | + | |
===== Aggregated Services ===== | ===== Aggregated Services ===== | ||
Line 114: | Line 115: | ||
- Inconsistencies due to integration of outdated or non-standardized data. | - Inconsistencies due to integration of outdated or non-standardized data. | ||
- Lack of direct control over taxonomies used by aggregated providers. | - Lack of direct control over taxonomies used by aggregated providers. | ||
- | TODO: URL, API | + | |
===== Company Datasets ===== | ===== Company Datasets ===== | ||
Unlike the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the company's website, but this matching may be incomplete and can cause the following issues: | Unlike the services listed above, the following datasets are company-oriented rather than website-oriented. Some of them include the company's website, but this matching may be incomplete and can cause the following issues: | ||
- | * If a company owns multiple websites: | + | |
- | ** Likely only the main website will be listed. | + | |
- | ** This is especially pronounced with international versions of the website. | + | * Likely only the main website will be listed. |
- | * Likewise, the dataset may contain multiple companies for a given website: | + | * This is especially pronounced with international versions of the website. |
- | ** Because of sister companies in a corporate group. | + | * Likewise, the dataset may contain multiple companies for a given website: |
- | ** Many small businesses list a social media page as their website. Sometimes this link does not include the full path, so a single-person company might indicate facebook.com as its domain. | + | * Because of sister companies in a corporate group. |
+ | * Many small businesses list a social media page as their website. Sometimes this link does not include the full path, so a single-person company might indicate facebook.com as its domain. | ||
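The matching issues listed above can be checked mechanically before a company dataset is joined against a list of websites. The following is a minimal sketch, assuming the records are available as (company, URL) pairs. The record layout, the ''SHARED_PLATFORMS'' set, and the function names are hypothetical and do not come from any of the datasets below. The hostname handling is deliberately naive; a real pipeline would likely need public-suffix-aware domain extraction.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical list of platforms that small businesses may list instead of their own site.
SHARED_PLATFORMS = {"facebook.com", "instagram.com", "linkedin.com"}


def hostname(url: str) -> str:
    """Return the lower-cased hostname of a URL, without a leading 'www.'."""
    # urlparse only fills in the network location when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def audit_matching(records):
    """Group (company, url) pairs by hostname and flag ambiguous matches.

    Returns two dicts mapping hostname -> set of company names:
    entries that point at a shared platform, and other hostnames
    claimed by more than one company (e.g. sister companies).
    """
    companies_by_host = defaultdict(set)
    for company, url in records:
        companies_by_host[hostname(url)].add(company)

    shared_platform = {
        host: names
        for host, names in companies_by_host.items()
        if host in SHARED_PLATFORMS
    }
    multi_company = {
        host: names
        for host, names in companies_by_host.items()
        if len(names) > 1 and host not in SHARED_PLATFORMS
    }
    return shared_platform, multi_company
```

This sketch only covers the second group of issues (several companies behind one domain, and bare social media domains). Websites that a company owns but that are missing from the dataset cannot be detected from the dataset alone.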
==== PeopleDataLabs ==== | ==== PeopleDataLabs ==== | ||
Line 132: | Line 134: | ||
- Based on LinkedIn profiles that are self-reported - prone to adversarial data. | - Based on LinkedIn profiles that are self-reported - prone to adversarial data. | ||
- Only a subset of PeopleDataLabs' | - Only a subset of PeopleDataLabs' | ||
- | TODO: cite '' | + | |
==== Crunchbase ==== | ==== Crunchbase ==== | ||
Line 143: | Line 145: | ||
- URLs are extremely noisy (they are not the priority) (Source: Karel Kubicek' | - URLs are extremely noisy (they are not the priority) (Source: Karel Kubicek' | ||
- Focuses mostly on variables useful for investments and market competitiveness. | - Focuses mostly on variables useful for investments and market competitiveness. | ||
- | TODO: cite '' | + | |
==== Orbis ==== | ==== Orbis ==== | ||
Line 183: | Line 185: | ||
* [[https:// | * [[https:// | ||
- | Visit individual privacy-oriented pages for more details regarding classification of [[Privacy: | + | Visit individual privacy-oriented pages for more details regarding classification of [[Privacy: |
==== Marketing Industry ==== | ==== Marketing Industry ==== | ||
Line 197: | Line 199: | ||
- Taxonomy Documentation: | - Taxonomy Documentation: | ||
+ | <WRAP right box> | ||
{{design: | {{design: | ||
+ | < | ||
+ | </ | ||
* **From the Discussion Section**: | * **From the Discussion Section**: | ||
Line 210: | Line 215: | ||
Vallina et al. {[vallina2020_misshapes]} surveyed the usage of classification services in academic research. The figures below illustrate their findings. Note that the most recent trends are not covered because of the lag inherent in the research process. | Vallina et al. {[vallina2020_misshapes]} surveyed the usage of classification services in academic research. The figures below illustrate their findings. Note that the most recent trends are not covered because of the lag inherent in the research process. | ||
+ | <WRAP right box> | ||
{{design: | {{design: | ||
+ | < | ||
+ | </ | ||
* **Commonly Used Services**: | * **Commonly Used Services**: | ||
Line 227: | Line 235: | ||
<bibtex bibliography></ | <bibtex bibliography></ | ||
- | ====== BibTex ====== | ||
- | <bibtex database> | ||
- | @inproceedings{vallina2020_misshapes, | ||
- | author = {Vallina, Pelayo and Le Pochat, Victor and Feal, \' | ||
- | title = {Mis-shapes, | ||
- | year = {2020}, | ||
- | isbn = {9781450381383}, | ||
- | publisher = {Association for Computing Machinery}, | ||
- | address = {New York, NY, USA}, | ||
- | url = {https:// | ||
- | doi = {10.1145/ | ||
- | abstract = {Domain classification services have applications in multiple areas, including cybersecurity, | ||
- | booktitle = {Proceedings of the ACM Internet Measurement Conference}, | ||
- | pages = {598–618}, | ||
- | numpages = {21}, | ||
- | location = {Virtual Event, USA}, | ||
- | series = {IMC '20} | ||
- | } | ||
- | </ | ||
+ | ~~DISCUSSION~~ |
design/website_classification.1735838521.txt.gz · Last modified: 2025/01/02 17:22 by karelkubicek