Splitting the work

From Wikibase Personal data

Overview

There are many options for storing data and code in the Wikibase system:

  • as data, following the item/property paradigm;
  • as templates in the Template: namespace;
  • as modules in the Md: namespace, i.e. using Lua scripts;
  • as widgets in the Widget: namespace;
  • as JavaScript in the user script/common.js paradigm (User:/MediaWiki:).

The widget namespace probably only makes sense for the operation of the website itself.

A lot of components can be and should be reused from Wikidata. At least for templates, modules, widgets, and scripts, those that make sense to reuse probably have to do with the operation of the website itself as well. For items and properties, the reuse might be broader: a database of companies is useful, for instance, and certainly Wikidata contains a start.

Otherwise, there are factors to take into account to distribute the work:

  • access control, or the lack thereof, can be decided at namespace granularity (bearing in mind that specific items can also be protected). This matters for enabling outside contributions in the form of data, which quickly become very useful, and possibly for letting users design their own new interfaces;
  • some data will remain user-side, and needs to be semantically aligned. Therefore it makes sense to maintain some of the information within the database rather than the code;
  • some data will be written by technical contributors, who will not write code but data.

A concrete example can illustrate the last two points: the system might want to record in the user's personal data store that the "_ga" cookie on the website "example.com" had a given value. Alongside this, the system should also store that the email address of the user is "joe@schmo.com". Both are attributes of the individual user, so they should be stored in the same way. The best way to achieve this is to have one item for "email address" and one for "_ga cookie on example.com". We call this principle "internet architecture as semantic data": whatever belongs to the architecture of the ecosystem is not personal data, and should follow semantic principles. In this way, individual-level data can be clearly identified and manipulated in a privacy-preserving way.
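The example above can be sketched in a few lines of JavaScript. This is only an illustration of the principle, not the storage format actually used: the identifiers Q_EMAIL and Q_GA_EXAMPLE, the shape of the records, and the cookie value are all invented for the sketch.

```javascript
// Ecosystem-level concepts: shared, non-personal, described once.
// Q_EMAIL and Q_GA_EXAMPLE are hypothetical placeholder identifiers.
const concepts = {
  Q_EMAIL: { label: "email address" },
  Q_GA_EXAMPLE: { label: "_ga cookie on example.com" },
};

// Personal data store: every attribute of the user has the same shape,
// a reference to a concept plus the user's own value.
// The cookie value below is made up for illustration.
const personalStore = [
  { concept: "Q_EMAIL", value: "joe@schmo.com" },
  { concept: "Q_GA_EXAMPLE", value: "GA1.2.123456789.1600000000" },
];

// Because all attributes share one shape, one function handles them all.
function describe(store, conceptTable) {
  return store.map((a) => `${conceptTable[a.concept].label}: ${a.value}`);
}

// Dropping the personal values leaves the architecture description intact.
function redact(store) {
  return store.map((a) => ({ concept: a.concept, value: null }));
}

console.log(describe(personalStore, concepts));
console.log(redact(personalStore));
```

The point of the split is that the concepts table could be public and collaboratively edited, while the personal store never needs to leave the user's side.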

In general, the code base, distributed over many different languages and modes of operation, should be seen as a toolkit enabling others to craft their own experiments.

Lessons from "telephone number"

We implemented a quick system for storing the telephone number interface button (Q488) locally and for interfacing with the server.

The perspective/goal has been refined: smooth distribution of data and computation across server and client.

Computing

The server computing environment could be:

  • gadgets;
  • user scripts;
  • other?

The client computing environment could be:

The client itself can distribute its data however it pleases, as well as the computations it needs to perform. The server serves code that has to be run.

Static data

For static data, the closer we are to storing RDF graphs, the better. RDF graphs (i.e. collections of triples) have the advantage of being directly interoperable at a technical level: in a line-based serialization, merging two graphs amounts to appending one file to the other. So wherever we store data, including in the browser, we should stay as close to that as possible.
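A minimal sketch of the "append the two files" idea, assuming the graphs are serialized as N-Triples (one triple per line) and contain no blank nodes, whose labels could clash between files. The example.org IRIs, the placeholder phone number, and the helper name mergeNTriples are assumptions made for the illustration.

```javascript
// Two graphs serialized as N-Triples: one triple per line.
// The example.org IRIs and the phone number are placeholders.
const serverGraph = [
  '<https://example.org/user/1> <https://schema.org/telephone> "+41 00 000 00 00" .',
].join("\n");

const clientGraph = [
  '<https://example.org/user/1> <https://schema.org/email> "joe@schmo.com" .',
  '<https://example.org/user/1> <https://schema.org/telephone> "+41 00 000 00 00" .',
].join("\n");

// Merging is concatenation. A graph is a set of triples, so textually
// identical lines are collapsed; comment lines and empty lines are dropped.
function mergeNTriples(...docs) {
  const lines = docs
    .flatMap((doc) => doc.split("\n"))
    .map((line) => line.trim())
    .filter((line) => line !== "" && !line.startsWith("#"));
  return [...new Set(lines)].join("\n");
}

const merged = mergeNTriples(serverGraph, clientGraph);
console.log(merged);
```

The deduplication here is purely textual, so it only catches triples that are written identically in both files; a real RDF library would compare parsed terms instead.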

Semantics

Beyond mere storage of data, the semantics matter as well. For interoperability reasons (eventually there will be multiple servers), it makes sense to get as far away as possible from the intricacies of Wikibase, and to refer to concepts using external tags, possibly coming from well-known ontologies. This is true on the server side as well as on the client side (when data is stored in browser storage, for instance).
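One way to picture this is a small translation step between instance-specific identifiers and external terms. In the sketch below, the property IDs P12 and P34 are invented, the choice of schema.org terms is an assumption rather than something this wiki has settled on, and toPortable is a hypothetical helper.

```javascript
// Hypothetical mapping from local Wikibase property IDs to external terms.
// P12 and P34 are invented IDs; the schema.org IRIs are existing terms.
const externalTerm = new Map([
  ["P12", "https://schema.org/telephone"],
  ["P34", "https://schema.org/email"],
]);

// Rewrite a locally keyed record so that only external IRIs are stored.
function toPortable(record) {
  const out = {};
  for (const [localId, value] of Object.entries(record)) {
    const iri = externalTerm.get(localId);
    if (iri === undefined) {
      throw new Error(`No external term known for ${localId}`);
    }
    out[iri] = value;
  }
  return out;
}

const portable = toPortable({ P12: "+41 00 000 00 00", P34: "joe@schmo.com" });
// In a browser, this string is what could be written to browser storage;
// here it is only printed.
console.log(JSON.stringify(portable));
```

Data stored this way no longer depends on the numbering of properties in one particular Wikibase instance, which is what makes it readable by a second server.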

Formatting

There is also the question of formatting. What constitutes a properly stored phone number? We have not touched that yet, but presumably the answer could also be stored by an additional server, possibly separate from the server holding the ontology. (Actually, we did touch it, to deal with the mess of Wikibase properties and qualifiers, but this is "one level higher" in the abstraction: it does not look at the values of user attributes, but at how properties and items are indexed in our instance.)
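To make the question concrete, here is one possible answer, purely as an assumption: treat a phone number as properly stored when it has the E.164 shape, i.e. a "+" followed by at most 15 digits, the first of which is not zero. Nothing in this project has decided on that rule, and normalizePhone is a hypothetical helper.

```javascript
// Assumption: "properly stored" means the E.164 shape,
// a "+" followed by 2 to 15 digits, not starting with 0.
const E164 = /^\+[1-9]\d{1,14}$/;

// Strip common visual separators (spaces, parentheses, dots, hyphens),
// then check the shape. Returns the compact form, or null if it does not fit.
function normalizePhone(input) {
  const compact = input.replace(/[\s().-]/g, "");
  return E164.test(compact) ? compact : null;
}

console.log(normalizePhone("+41 00 000 00 00"));
console.log(normalizePhone("0123"));
```

A rule like this is exactly the kind of thing that could live on a separate formatting server, as data (the pattern) rather than as code.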

Conclusions

For static data, we need to stay as close as possible to the RDF format, everywhere. rdflib.js and the concept of "named graphs" can help.
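The idea behind named graphs can be shown without any library: each triple carries a fourth element naming the graph it belongs to, which lets server-side and browser-side data coexist in one store. The sketch below deliberately uses plain objects rather than rdflib.js, and all IRIs and helper names in it are hypothetical.

```javascript
// A quad is a triple (s, p, o) plus the name g of the graph it belongs to.
// All example.org IRIs are hypothetical placeholders.
const USER = "https://example.org/user/1";
const SERVER_GRAPH = "https://example.org/graph/server";
const BROWSER_GRAPH = "https://example.org/graph/browser";

const quads = [
  { s: USER, p: "https://schema.org/email", o: "joe@schmo.com", g: SERVER_GRAPH },
  { s: USER, p: "https://schema.org/telephone", o: "+41 00 000 00 00", g: BROWSER_GRAPH },
];

// Select the triples of one named graph.
function graph(allQuads, name) {
  return allQuads.filter((q) => q.g === name).map(({ s, p, o }) => ({ s, p, o }));
}

// Forget where the triples came from: the union of all named graphs.
function union(allQuads) {
  return allQuads.map(({ s, p, o }) => ({ s, p, o }));
}

console.log(graph(quads, BROWSER_GRAPH));
console.log(union(quads).length);
```

Keeping the graph name on every statement makes it possible to decide later which statements may be sent to the server and which must stay in the browser.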

For computation (code), we need to decouple, as much as possible, prototype definitions, the instantiation of objects, and the definitions of static storage of properties and methods. In JavaScript, 'this' is our friend, as it allows us to fluidly attach code to data. But it will only remain our friend if we are deliberate in what we are doing.
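A minimal sketch of that decoupling, under the assumption that stored data comes back as plain JSON. The names attributeProto, stored, and attribute are invented for the example.

```javascript
// 1. Prototype definition: behaviour only, no data.
const attributeProto = {
  describe() {
    // `this` is whatever object the method is called on.
    return `${this.label}: ${this.value}`;
  },
};

// 2. Static storage: plain data, as it would come back from JSON.parse.
const stored = JSON.parse('{"label":"telephone number","value":"+41 00 000 00 00"}');

// 3. Instantiation: attach the behaviour to the data at the last moment.
const attribute = Object.assign(Object.create(attributeProto), stored);
console.log(attribute.describe());

// Only own properties are serialized, so the data round-trips
// without dragging the methods along.
console.log(JSON.stringify(attribute));

// Being deliberate: a method pulled off its object loses its `this`,
// so it has to be bound explicitly before being passed around.
const detached = attribute.describe;
const rebound = detached.bind(attribute);
console.log(rebound());
```

Because the methods live on the prototype and the data lives in own properties, the same stored record can be given different behaviour by different clients without changing what is stored.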