WebXray domain ownership list

Drawn from the webXray project, this list provides a hierarchical accounting of which entities own the third-party domains commonly found on the web.

The database is hosted on GitHub.

What the database contains

Open question: is the database versioned, and if so, how? Version information would be recorded with version (P45).

The list is a JSON file. Each entry in the list has the following fields:

  • id (WebXray ID (P124)): a numeric identifier (integer) for the entry. It changes whenever the list is expanded and reindexed, so do not count on it remaining stable. It will be stored as a string under WebXray ID (P124), and each such statement will need to be qualified with a version number.
  • parent_id (parent (P123)): the id of the parent entry, if the entity has a parent owner
  • owner_name: a string giving the name of the service (e.g. 'Google Analytics') or the company (e.g. 'Google') that owns the domain. (This should become the item's label here, and should also be added under the named as (P27) property.)
  • aliases: an array of strings representing possible alternate spellings of the owner_name (e.g. 'YouTube' and 'You Tube'). (These should become aliases under named as (P27); note that an item can have many such entries.)
  • notes (comment (P126)): a string with pertinent information on why a domain was assigned to a given owner
  • country: the ccTLD for the country in which the service or company is based. We have country (P55), so that should be used, although its utility is probably fairly limited because company structures are so complex. Open questions: would a dedicated "ccTLD" property be more appropriate? Isn't a ccTLD associated with the domains rather than with the owner? What are the assumptions there?
  • uses: what a first party uses the service for. Note that the first-party use may differ from the ultimate third-party use: for example, a site may use a third party's audience measurement tools to gain insights into its traffic, while the third party may use the same data for marketing. Suggested property: purpose (P127) rather than collects (P10), since it is more precise.
  • platforms: where the domain has been observed; so far 'web', 'mobile', and 'email'. We will want to refine these (e.g. mobile --> {Android, iOS}, or an industry), so it makes sense to make the target an item. The name "platforms" is not ideal for the link: we want a more generic word, but not too generic either (prevalence? presence? scope?).
  • domains (web domain name (P118)): an array of domain names (strings) owned by the given service or company
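
The fields above are enough to answer "who ultimately owns this domain?" by following parent_id links upwards. The following is a minimal sketch of that lookup, not the webXray project's own code. The sample entries, their values and the helper names (index_by_domain, ownership_chain) are hypothetical, and it assumes that a top-level entry has a null parent_id and that uses and platforms are arrays of strings.

```python
import json

# Hypothetical sample entries using the fields described above; the values
# are made up for illustration and are not taken from the real list.
SAMPLE_JSON = """
[
  {"id": 1, "parent_id": null, "owner_name": "Example Holdings",
   "aliases": [], "notes": "", "country": "us", "uses": [],
   "platforms": ["web"], "domains": ["example-holdings.test"]},
  {"id": 2, "parent_id": 1, "owner_name": "Example Analytics",
   "aliases": ["ExampleAnalytics"], "notes": "", "country": "us",
   "uses": ["analytics"], "platforms": ["web", "mobile"],
   "domains": ["example-analytics.test", "ea-cdn.test"]}
]
"""


def index_by_domain(entries):
    """Map each domain name to the entry that lists it."""
    return {domain: entry for entry in entries for domain in entry["domains"]}


def ownership_chain(entry, entries_by_id):
    """Return owner names from the direct owner up to the top-level parent."""
    chain = []
    seen = set()
    while entry is not None and entry["id"] not in seen:
        seen.add(entry["id"])  # guard against accidental cycles in parent_id
        chain.append(entry["owner_name"])
        # Assumption: entries without a parent carry a null parent_id.
        entry = entries_by_id.get(entry.get("parent_id"))
    return chain


entries = json.loads(SAMPLE_JSON)
entries_by_id = {entry["id"]: entry for entry in entries}
domain_index = index_by_domain(entries)

chain = ownership_chain(domain_index["ea-cdn.test"], entries_by_id)
print(" > ".join(chain))  # direct owner first, top-level parent last
```

Because id values change whenever the list is reindexed, an index like entries_by_id is only valid for the version of the file it was built from.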

Why it is important

  • The domain ownership list needs to be populated to offer a simple index of where personal data is being sent and where legal ownership of the domains rests.
  • Without these fields, filing subject access requests (SARs) is far more time-consuming, and mass inspections of websites are barely intelligible.

The third-party tracking ecosystem is currently free to operate and monetise individuals' personal data without transparency. The domain ownership list is a step towards addressing this.

Why it is important to build it collaboratively

  • we can file SARs, and build the whole pipeline of support around that
  • we can build visualisations, maybe not for the general public but at least for super users of PDIO, to better understand progress

What needs to be done for PDIO

  • define a format for the adtech entries
  • extend the SAR tool so that it can handle third-party situations

Wishlist on the webXray side

  • Framework for contributions and easy format for submission.
  • contributions!