Difference between revisions of "WebXray domain ownership list"
(6 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | Drawn from the webXray project, this list provides a hierarchical accounting of what entities own commonly found third-party domains on the web. | + | Drawn from the [https://www.webxray.org webXray project], this list provides a hierarchical accounting of what entities own commonly found third-party domains on the web. |
The database is hosted at [https://github.com/timlib/webXray_Domain_Owner_List Github]. | The database is hosted at [https://github.com/timlib/webXray_Domain_Owner_List Github]. | ||
==What the database contains== | ==What the database contains== | ||
+ | |||
+ | '''Is the database versioned? How?''' {{P|45}} | ||
+ | |||
The list is a JSON file. Each entry in the list has the following fields: | The list is a JSON file. Each entry in the list has the following fields: | ||
− | id: a numeric identifier (integer) for the entry, this will change whenever the list is expanded and reindexed, do not count on it remaining stable | + | * '''id ({{P|124}})''': a numeric identifier (integer) for the entry, this will change whenever the list is expanded and reindexed, do not count on it remaining stable. It will be stored as a string under {{P|124}}, and each statement here will need to be qualified with a version number. |
− | parent_id: if the entity has a parent owner, the id of the parent | + | * '''parent_id ({{P|123}})''': if the entity has a parent owner, the id of the parent |
− | owner_name: a string which is the name of the service (eg. 'Google Analytics') or the company ('Google') which owns the domain | + | * '''owner_name''': a string which is the name of the service (eg. 'Google Analytics') or the company ('Google') which owns the domain (This should become the label on here, but also added in the {{P|27}} property) |
− | aliases: an array of strings representing possible alternate spellings of the owner_name (eg. 'YouTube' and 'You Tube') | + | * '''aliases''': an array of strings representing possible alternate spellings of the owner_name (eg. 'YouTube' and 'You Tube') (These should become aliases under {{P|27}}, note that we can have many entries for each) |
− | homepage_url: a string which is the url of the homepage of the service or company | + | * '''homepage_url''': a string which is the url of the homepage of the service or company ({{P|33}}, maybe {{P|15}}) |
− | privacy_policy_url: a string which is the url of the privacy policy of the service or company | + | * '''privacy_policy_url''': a string which is the url of the privacy policy of the service or company. Best is I think (({{P|32}} - {{Q|29}}, and then a qualifier with {{P|33}}). Because then we can add {{P|32}}-{{Q|412}}, with links to the Privacy Shield commitment, for instance. It's a better pattern that repeats. |
− | notes: a string which has pertinent information as to why a domain was assigned to a given owner | + | * '''notes''': a string which has pertinent information as to why a domain was assigned to a given owner ({{P|126}}) |
− | country: the ccTLD for the country in which the service or company is based | + | * '''country''': the ccTLD for the country in which the service or company is based. We have {{P|55}}, so that should be used, but I think the utility of such a thing is fairly limited because the companies are so complex. Isn't what you want a property "ccTLD" ? Isn't this associated to the domains rather than the owner??? What are your assumptions there? |
− | uses: what a first-party uses the service for, note that first-party use may be different than the ultimate third-party use. For example, a site may use audience measurement tools from a third-party to gain insights into traffic, but the third-party may use this data for marketing. | + | * '''uses''': what a first-party uses the service for, note that first-party use may be different than the ultimate third-party use. For example, a site may use audience measurement tools from a third-party to gain insights into traffic, but the third-party may use this data for marketing. I suggest {{P|127}} and not {{P|10}}, it's more precise. |
− | platforms: where the domain has been observed, so far 'web', 'mobile', and 'email' | + | * '''platforms''': where the domain has been observed, so far 'web', 'mobile', and 'email'. We will want to complexify those (mobile --> {Android, iOS}; or an industry) so it makes sense to make the target an item. I don't like the link "platforms". We want a more generic word, but not too generic either. prevalence/presence/scope/???. |
− | domains: an array of domian names (strings) which are owned by the given service or company | + | * '''domains''': an array of domian names (strings) which are owned by the given service or company {{P|118}} |
==Why it is important== | ==Why it is important== | ||
− | * | + | * The domain ownership list needs to be populated to offer an index of simply where personal data is being sent, and where the legal ownership of the domains rests. |
− | * third party tracking is | + | * Without these fields, the filing of SARs is far more time consuming, and mass inspections of websites are barely intelligible. |
+ | |||
+ | The third party tracking ecosystem is currently free to operate and monetise individuals personal data without transparency. The domain ownership list is a step to addressing this. | ||
+ | |||
==Why it is important to build it collaboratively== | ==Why it is important to build it collaboratively== | ||
* can do SARs, and build the whole pipeline of support around that | * can do SARs, and build the whole pipeline of support around that | ||
Line 41: | Line 47: | ||
==Wishlist WebXRay side== | ==Wishlist WebXRay side== | ||
* Framework for contributions and easy format for submission. | * Framework for contributions and easy format for submission. | ||
− | + | * contributions! |
Latest revision as of 21:12, 22 October 2019
Drawn from the webXray project, this list provides a hierarchical accounting of what entities own commonly found third-party domains on the web.
The database is hosted at Github.
What the database contains
Is the database versioned? How? version (P45)
The list is a JSON file. Each entry in the list has the following fields:
- id (WebXray ID (P124)): a numeric identifier (integer) for the entry. It changes whenever the list is expanded and reindexed, so do not count on it remaining stable. It will be stored as a string under WebXray ID (P124), and each statement here will need to be qualified with a version number.
- parent_id (parent (P123)): if the entity has a parent owner, the id of the parent
- owner_name: a string which is the name of the service (e.g. 'Google Analytics') or the company ('Google') which owns the domain. (This should become the label here, and should also be added in the named as (P27) property.)
- aliases: an array of strings representing possible alternate spellings of the owner_name (e.g. 'YouTube' and 'You Tube'). (These should become aliases under named as (P27); note that we can have many entries for each.)
- homepage_url: a string which is the url of the homepage of the service or company (reference URL (P33), maybe official website (P15))
- privacy_policy_url: a string which is the URL of the privacy policy of the service or company. I think the best approach is states compliance (P32) with the value privacy policy (Q29), plus a qualifier with reference URL (P33). That way we can also add states compliance (P32) with the value Privacy Shield arrangement (Q412), with links to the Privacy Shield commitment, for instance. It's a better pattern because it repeats.
- notes: a string which has pertinent information as to why a domain was assigned to a given owner (comment (P126))
- country: the ccTLD for the country in which the service or company is based. We have country (P55), so that should be used, but I think its utility is fairly limited because the companies are so complex. Isn't what you want a "ccTLD" property? Isn't this associated with the domains rather than the owner? What are your assumptions there?
- uses: what a first party uses the service for. Note that first-party use may differ from the ultimate third-party use: for example, a site may use audience measurement tools from a third party to gain insights into traffic, but the third party may use this data for marketing. I suggest purpose (P127) rather than collects (P10), as it is more precise.
- platforms: where the domain has been observed, so far 'web', 'mobile', and 'email'. We will want to refine those (mobile --> {Android, iOS}; or an industry), so it makes sense to make the target an item. I don't like the name "platforms" for the link. We want a more generic word, but not too generic either: prevalence/presence/scope/???
- domains: an array of domain names (strings) which are owned by the given service or company (web domain name (P118))
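As a rough illustration of how the fields above fit together, the sketch below resolves a domain to its owner and then follows parent_id upwards to list the full ownership chain. The field names come from the list above; the sample entries, the function names build_indexes and ownership_chain, and the assumption that the JSON file is a flat array of entries are all hypothetical, so treat this as a sketch rather than the webXray project's own tooling.

```python
import json

# Hypothetical sample entries using the field names listed above;
# the values are illustrative and not taken from the real list.
SAMPLE = json.loads("""
[
  {"id": 1, "parent_id": null, "owner_name": "Example Corp",
   "aliases": [], "domains": ["example.com"]},
  {"id": 2, "parent_id": 1, "owner_name": "Example Analytics",
   "aliases": ["ExampleAnalytics"],
   "domains": ["analytics.example", "stats.example"]}
]
""")


def build_indexes(entries):
    """Return (entries keyed by id, owner id keyed by lowercased domain)."""
    by_id = {entry["id"]: entry for entry in entries}
    by_domain = {}
    for entry in entries:
        for domain in entry.get("domains", []):
            by_domain[domain.lower()] = entry["id"]
    return by_id, by_domain


def ownership_chain(domain, by_id, by_domain):
    """List owner names from the direct owner up to the top-level parent."""
    chain = []
    seen = set()  # guards against accidental parent_id cycles
    current = by_domain.get(domain.lower())
    while current is not None and current not in seen:
        seen.add(current)
        entry = by_id.get(current)
        if entry is None:  # dangling parent_id: stop rather than fail
            break
        chain.append(entry["owner_name"])
        current = entry.get("parent_id")
    return chain


by_id, by_domain = build_indexes(SAMPLE)
print(ownership_chain("stats.example", by_id, by_domain))
```

Because id values change whenever the list is reindexed, indexes like these would need to be rebuilt for each version of the list rather than cached across versions.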
Why it is important
- The domain ownership list needs to be populated to offer a simple index of where personal data is being sent, and where the legal ownership of the domains rests.
- Without these fields, the filing of SARs is far more time-consuming, and mass inspections of websites are barely intelligible.
The third-party tracking ecosystem is currently free to operate and monetise individuals' personal data without transparency. The domain ownership list is a step towards addressing this.
Why it is important to build it collaboratively
- can do SARs, and build the whole pipeline of support around that
- can do visualisations, maybe not for the general public but at least for super users of PDIO to better understand progress
What needs to be done for PDIO
- define format for the adtech entries
- extend the SAR tool so that it can handle third-party situations
Wishlist WebXRay side
- Framework for contributions and easy format for submission.
- contributions!