--- design:website_classification [2025/01/03 17:17] – [Company Datasets] fixed formatting karelkubicek
+++ design:website_classification [2025/06/23 08:50] (current) – [Adult Websites, Security, and Privacy Protection] karelkubicek
@@ Line 20: / Line 20: @@
     - Limited granularity in labels, which may not suit detailed marketing or behavioral analysis.
     - Documentation includes deprecated categories, leading to potential misinterpretations.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ==== FortiGuard ====
@@ Line 30: / Line 30: @@
     - Lower label granularity may restrict its applicability outside security domains.
     - Limited documentation transparency in certain sensitive categories.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ==== Symantec ====
@@ Line 39: / Line 39: @@
     - Taxonomy is less diverse compared to marketing-oriented services.
     - Limited coverage for obscure or long-tail domains.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ==== Trend Micro ====
@@ Line 46: / Line 46: @@
     - Labels aligned with threat intelligence, enhancing usability in cybersecurity contexts.
   * **Disadvantages**:
-    - (TODO: URL, API)
+    - <wrap todo>TODO: are there any?</wrap>
+  <wrap todo>TODO: URL, API</wrap>
 ==== Forcepoint ====
@@ Line 55: / Line 56: @@
     - Limited multi-labeling capabilities restrict nuanced classification.
     - Challenges in documenting clear and concise taxonomies.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ==== Dr.Web ====
@@ Line 64: / Line 65: @@
     - Very low coverage.
     - Lack of nuanced or detailed labeling reduces utility in research or marketing.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ===== Marketing and Content Discovery =====
@@ Line 83: / Line 84: @@
     - Precision and granularity can vary, sometimes complicating results.
     - Documentation and taxonomy definitions require improvement for research usability.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ===== General Classification with Human Contributions =====
@@ Line 94: / Line 95: @@
     - Scalability issues due to reliance on human volunteers.
     - Low coverage and subjective biases in labeling.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ==== DMOZ (Curlie) ====
@@ Line 103: / Line 104: @@
     - Extremely limited scalability due to a small number of editors.
     - Labels may be outdated due to infrequent updates for many categories.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ===== Aggregated Services =====
@@ Line 114: / Line 115: @@
     - Inconsistencies due to integration of outdated or non-standardized data.
     - Lack of direct control over taxonomies used by aggregated providers.
-  TODO: URL, API
+  <wrap todo>TODO: URL, API</wrap>
 ===== Company Datasets =====
@@ Line 133: / Line 134: @@
     - Based on LinkedIn profiles that are self-reported - prone to adversarial data.
     - Only a subset of PeopleDataLabs' full dataset (22M/70M rows, 10/78 columns).
-  TODO: cite ''Machine Learning Compliance Analysis for Email Regulation'' when it is public.
+  <wrap todo>TODO: cite ''Machine Learning Compliance Analysis for Email Regulation'' when it is public.</wrap>
 ==== Crunchbase ====
@@ Line 144: / Line 145: @@
     - URLs are extremely noisy (they are not the priority) (Source: Karel Kubicek's experience).
     - Focuses mostly on variables useful for investments and market competitiveness.
-  TODO: cite ''Machine Learning Compliance Analysis for Email Regulation'' when it is public.
+  <wrap todo>TODO: cite ''Machine Learning Compliance Analysis for Email Regulation'' when it is public.</wrap>
 ==== Orbis ====
@@ Line 183: / Line 184: @@
   * [[https://github.com/search?q=pihole%20blocklist&type=repositories]]: Search for pi-hole blocklists for specific uses.
   * [[https://oisd.nl/]], [[https://firebog.net/]]
+  * [[https://dsi.ut-capitole.fr/blacklists/index_en.php]]: Blocklists designed for filtering university network traffic. Contains many lists, from pornography to malicious or misinformation content.
-Visit individual privacy-oriented pages for more details regarding classification of [[Privacy:Requests]], [[Privacy:Cookies]], and [[Privacy:Fingerprinting]].
+Visit individual privacy-oriented pages for more details regarding classification of [[Privacy:Requests]], [[Privacy:Cookies]], [[Privacy:Fingerprinting]], and [[Privacy:JavaScript]].
 ==== Marketing Industry ====
@@ Line 234: / Line 236: @@
 <bibtex bibliography></bibtex>
-====== BibTex ======
-<bibtex database>
-@inproceedings{vallina2020_misshapes,
-    author = {Vallina, Pelayo and Le Pochat, Victor and Feal, \'{A}lvaro and Paraschiv, Marius and Gamba, Julien and Burke, Tim and Hohlfeld, Oliver and Tapiador, Juan and Vallina-Rodriguez, Narseo},
-    title = {Mis-shapes, Mistakes, Misfits: An Analysis of Domain Classification Services},
-    year = {2020},
-    isbn = {9781450381383},
-    publisher = {Association for Computing Machinery},
-    address = {New York, NY, USA},
-    url = {https://doi.org/10.1145/3419394.3423660},
-    doi = {10.1145/3419394.3423660},
-    abstract = {Domain classification services have applications in multiple areas, including cybersecurity, content blocking, and targeted advertising. Yet, these services are often a black box in terms of their methodology to classifying domains, which makes it difficult to assess their strengths, aptness for specific applications, and limitations. In this work, we perform a large-scale analysis of 13 popular domain classification services on more than 4.4M hostnames. Our study empirically explores their methodologies, scalability limitations, label constellations, and their suitability to academic research as well as other practical applications such as content filtering. We find that the coverage varies enormously across providers, ranging from over 90\% to below 1\%. All services deviate from their documented taxonomy, hampering sound usage for research. Further, labels are highly inconsistent across providers, who show little agreement over domains, making it difficult to compare or combine these services. We also show how the dynamics of crowd-sourced efforts may be obstructed by scalability and coverage aspects as well as subjective disagreements among human labelers. Finally, through case studies, we showcase that most services are not fit for detecting specialized content for research or content-blocking purposes. We conclude with actionable recommendations on their usage based on our empirical insights and experience. Particularly, we focus on how users should handle the significant disparities observed across services both in technical solutions and in research.},
-    booktitle = {Proceedings of the ACM Internet Measurement Conference},
-    pages = {598–618},
-    numpages = {21},
-    location = {Virtual Event, USA},
-    series = {IMC '20}
-}
-</bibtex>
+~~DISCUSSION~~