Monsense 01: January 2014
In the past, we have sometimes developed solutions that are now extremely useful, but had we been in a position to ask a broader expert community how to proceed, we might have chosen more sustainable solutions. One of the purposes of this blog is to ‘float’ ideas more broadly than within my internal group before embarking on their realisation (or ditching them).
A nanopublication-based, community-compliant core concept reference & unification service: C4RUS (pronounced SeeForUS, or Syphorus, the presbyter who was elected after the death of Thomas: first seeing, then believing)
One of the key elements of a nanopublication is that each and every concept used in the assertional graph as well as in the provenance graphs is unambiguously mapped to a unique and stable identifier. I have always argued for the UUID in addition to multiple unique URI’s, for reasons that have been discussed in depth before, but in brief: URI’s usually contain semantically meaningful strings and are therefore inherently unstable (if you want the whole story, write a mail to Geoff Bilder). Now that we increasingly see RDF as a static interoperability language rather than as an active reasoning environment, performance is no longer a major issue and we might even contemplate introducing URI’s that contain a UUID as part of the string. Even if the ‘variable part’ of such a URI were to change, the UUID would still point uniquely to a concept reference store. Such an independent, community-owned concept reference store has been attempted in the context of (among others) the Open PHACTS project of IMI, where it was called ‘Concept Wiki’ for historical reasons.
It has been a growing concern that the Concept Wiki in its current form will be very difficult to sustain, while everyone using the non-ideal version today is convinced that ‘something like this’ is needed in a semantic data publishing environment. The Netherlands Bioinformatics Center has been primarily responsible, in close collaboration with SIB and LUMC, for the current version of the Concept Wiki. Separate from the technical, community and update-related difficulties inherent to the current Concept Wiki implementation, the responsibility for keeping such a core service in the air cannot rest with a single institute or even a small group.
Therefore, here I argue for a radically different approach, which has already been discussed with a few key people in the community, but is obviously meant to provoke discussion, as is any Monsense post.
Let’s look first at what is ‘really needed’
First of all, we learned that as soon as the community realizes the benefit of an interoperability standard in anything, the first pitfall is that we create many standards. How do we avoid this already emerging mess in the case of many controlled vocabularies describing individual concepts that are then referred to with a growing number of different URI’s, which in turn have the inherent feature of being unstable over time? There is no ‘undisputed’ global authority, trusted by everyone in the community, that can serve this role. It would also be a role that is difficult to finance sustainably, as it is a very ‘invisible’ service, although it sits at the heart of the semantic bridge between humans and computers.
If we want to make computers ‘understand’ what we ‘mean’ by a symbol used in computer readable language, we have to be absolutely sure that in any circumstance, there will be a reliable and stable service to resolve the ‘meaning’ of that symbol. That symbol is not always referring to what we classically call an ‘Entity’. In Subject-Predicate-Object triples (also the building blocks of nanopublications) many concepts are not ‘entities’ but still ‘units of thought’. Hence I talk about ‘concepts’ for any ‘unit of human thought’ that needs a unique reference so that ‘computers can follow us’.
So, assuming that we all agree that the service is needed and that it needs to be reliable and stable, but that it cannot be entrusted to, or put on the shoulders of, one dedicated, professional and trusted global party, we need to find a way to entrust it to the most stable (although dynamic) group in the world: the research community. (As I already said during the CWA founding meeting: ‘even that community can disappear, but in that case we do not need the service anymore’.)
How can we do this?
With a slight bias to the basic idea of nanopublication, here is my proposal:
The ‘Concept Wiki’ >>> C4RUS has to serve at least the following needs:
(a) A unique and stable reference to each concept used in research
(b) That means: A UUID for each concept
(c) The ability to add all symbols (human language terms as well as URI’s/URL’s etc.) referring to the concept represented by the UUID
Important facts to deal with:
(e) Some symbols are ‘generally recognized’ and some are not (only used by one organisation, for instance).
(f) Some symbols are uniquely referring to one particular concept, some are ambiguous (referring to more than one concept), the latter being mainly true for human language symbols (homonymy)
In my mind the ‘ideal successor’ of the Concept Wiki should therefore have the following minimal features to be both useful and sustainable.
- For each concept used in research, the service has (or creates a) UUID
- Each UUID (representing the concept at the top of the semantic triangle) is ‘referred to’ by minimally one symbol in human language (usually a term or an ‘identifier’) and minimally one in computer language (usually a type of URI).
- The system is RDF-with-Provenance (=nanopublication) based and is in fact an Open Nanopublication Store (ONS) of (distributed and) fully open RDF, accessible to everybody and with the most liberal User License.
- The ‘principal nanopublications’ in this ONS describe ‘how the community refers to a concept’ (which inherently supports the unavoidable situation that people will continue to use different symbols to refer to the same concept)
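The first two features in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ConceptStore` class, its method names and the "human"/"computer" labels are hypothetical, and an in-memory dictionary stands in for what would really be a distributed, open store.

```python
import uuid
from collections import defaultdict


class ConceptStore:
    """Hypothetical in-memory stand-in for a C4RUS-style reference store."""

    def __init__(self):
        # symbol -> set of concept UUIDs (one symbol may be ambiguous)
        self._concepts_by_symbol = defaultdict(set)
        # concept UUID -> set of (kind, symbol); kind is "human" or "computer"
        self._symbols_by_concept = defaultdict(set)

    def add_symbol(self, concept_id, symbol, kind):
        """Record that `symbol` refers to the concept `concept_id`."""
        if kind not in ("human", "computer"):
            raise ValueError("kind must be 'human' or 'computer'")
        self._concepts_by_symbol[symbol].add(concept_id)
        self._symbols_by_concept[concept_id].add((kind, symbol))

    def get_or_create(self, symbol, kind):
        """Return the UUID a known symbol points to, or mint a new UUID."""
        known = self._concepts_by_symbol.get(symbol, set())
        if len(known) > 1:
            raise LookupError(f"{symbol!r} is ambiguous; a manual check is needed")
        concept_id = next(iter(known)) if known else str(uuid.uuid4())
        self.add_symbol(concept_id, symbol, kind)
        return concept_id

    def is_complete(self, concept_id):
        """True once the concept has a human AND a computer symbol."""
        symbols = self._symbols_by_concept.get(concept_id, set())
        kinds = {kind for kind, _ in symbols}
        return {"human", "computer"} <= kinds


store = ConceptStore()
cancer = store.get_or_create("cancer", "human")
store.add_symbol(cancer, "http://purl.bioontology.org/ontology/MSH/D009369", "computer")
print(cancer, store.is_complete(cancer))
```

Raising an error on an ambiguous symbol, rather than silently picking one UUID, is a design choice of this sketch: homonyms are exactly the cases where a manual check seems unavoidable.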
Nanopublications in the [C4RUS] have a very simple basic scheme.
- The ‘Subject’ is referred to by a URI that contains the UUID, here denoted as [UUID].
- The ‘Object’ is either a [term] or a [URI] or any other symbol that refers to a concept.
- The Predicates needed to make the reference service operational are very limited and include minimally:
- [ARTA] (ARTA needs a UUID as it is a well defined predicate), which stands for (proposed definition):
[Also referred to As] A predicate that links the symbol used to refer to a given concept in the semantic triangle to that concept.
In the context of the [C4RUS] the ‘subject’ is represented by the UUID representing that concept in the C4RUS.
- The typical SKOS predicates needed to reconstruct hierarchical thesaurus relationships between broader and narrower concepts
- The provenance: Who published this nanopublication and when.
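As a rough illustration of this scheme, the sketch below models one ARTA nanopublication as a plain Python record. It is not an RDF serialisation and does not follow any official nanopublication schema: the `Nanopublication` class, the `arta_nanopub` helper and the author string are hypothetical, `"ARTA"` is a placeholder for the predicate (which would get its own UUID), and the base URI is simply the Concept Wiki URL pattern mentioned elsewhere in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumption: the Concept Wiki URL pattern; any URI containing the UUID would do.
BASE_URI = "http://www.conceptwiki.org/concept/index/"
ARTA = "ARTA"  # placeholder for the 'Also Referred To As' predicate


@dataclass(frozen=True)
class Nanopublication:
    subject: str         # URI that contains the concept UUID
    predicate: str       # here always ARTA
    obj: str             # a term, a URI or any other symbol
    author: str          # provenance: who published it
    published: datetime  # provenance: when it was published


def arta_nanopub(concept_uuid, symbol, author, published=None):
    """Build one '[UUID] [ARTA] [symbol]' assertion with minimal provenance."""
    if published is None:
        published = datetime.now(timezone.utc)
    return Nanopublication(BASE_URI + concept_uuid, ARTA, symbol, author, published)


# One human-language and one computer-language symbol for the same concept.
CANCER = "9dadb18a-e1a9-49f9-9062-296419555fd5"
pubs = [
    arta_nanopub(CANCER, "cancer", "hypothetical-author"),
    arta_nanopub(CANCER, "http://purl.bioontology.org/ontology/MSH/D009369",
                 "hypothetical-author"),
]
for pub in pubs:
    print(pub.subject, pub.predicate, pub.obj)
```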
In my mind, this serves all purposes for which we conceived the current Concept Wiki.
- Everyone can now ‘nano-publish’ a vocabulary or thesaurus, with one effort to map each symbol in the ‘local’ terminology system (LTS) to the corresponding UUID in the C4RUS. (mostly synonym-mapping based and a few manual checks)
- For concepts in the LTS that cannot be mapped to any of the existing UUIDs in the C4RUS, the LTS can freely create novel UUID’s.
- The first ARTA-type nanopublications resulting from this effort will be minimally two (but in many cases more than two due to already known synonyms) of the type:
- [UUID] [ARTA] [symbol01 in human language]
- [UUID] [ARTA] [symbol01 in computer language]
- From all nanopublications starting with the same UUID (subject), anyone can now reconstruct an ‘ARTA table’ with all symbols referring to the same concept.
- Conversely, tables with ambiguous symbols (objects) will be automatically created as identical ‘symbols’ may have multiple UUID’s associated with them via the ARTA predicate.
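The two tables described in the last two points can be derived by grouping the same set of assertions in opposite directions. The sketch below assumes the nanopublications have already been reduced to (UUID, symbol) pairs; the function names are hypothetical, and the second UUID is made up purely to show a homonym.

```python
from collections import defaultdict

CANCER = "9dadb18a-e1a9-49f9-9062-296419555fd5"         # UUID used elsewhere in this post
CONSTELLATION = "00000000-0000-4000-8000-000000000001"  # made-up UUID for a homonym

# (subject UUID, object symbol) pairs taken from ARTA nanopublications
ARTA_PAIRS = [
    (CANCER, "cancer"),
    (CANCER, "De Gevreesde Ziekte"),
    (CANCER, "http://purl.bioontology.org/ontology/MSH/D009369"),
    (CONSTELLATION, "cancer"),
]


def arta_table(pairs):
    """Group by subject: concept UUID -> all symbols referring to it."""
    table = defaultdict(set)
    for concept_id, symbol in pairs:
        table[concept_id].add(symbol)
    return {concept_id: sorted(symbols) for concept_id, symbols in table.items()}


def ambiguous_symbols(pairs):
    """Group by object: keep only symbols that point to more than one UUID."""
    by_symbol = defaultdict(set)
    for concept_id, symbol in pairs:
        by_symbol[symbol].add(concept_id)
    return {symbol: sorted(ids) for symbol, ids in by_symbol.items() if len(ids) > 1}


print(arta_table(ARTA_PAIRS)[CANCER])
print(ambiguous_symbols(ARTA_PAIRS))
```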
The hot topic of ‘Authority’ (does SwissProt determine the preferred URI’s or terms for proteins and Chemspider for Chemicals?) can now also be very easily solved via the provenance graph in each nanopublication. Also, this approach does NOT compete with UMLS, OBOfoundry, BioPortal or Chemspider (to name a few). These ‘authorities’ can review nanopublications pertaining to one of the concepts in their system and ‘decide’ to publish an authoritative nanopublication either confirming the community one or contesting it!
- If ‘any Dutchman’ would publish a nanopublication stating that [9dadb18a-e1a9-49f9-9062-296419555fd5] [ARTA] [De Gevreesde Ziekte], where the UUID is referring to ‘cancer’ in the current CW implementation, a number of things will happen if we follow a correct nanopublication scheme.
- It will be automatically known that this nanopublication refers to concept [C0006826] in UMLS and to concept [http://purl.bioontology.org/ontology/MSH/D009369] in the bioontology service.
- The provenance will contain the ‘registration ID’ (ORCID?) of the submitter as well as a time/date stamp. In the case of ‘any Dutchman’, this will not lead to a change in the ‘authoritative’ ARTA table of any organization that uses the C4RUS for local purposes and has decided that, ‘for them’, UMLS is the only authority for disease terms in languages other than English. However, the nanopublication asserting that ‘any Dutchman’ has published that [9dadb18a-e1a9-49f9-9062-296419555fd5] [ARTA] [De Gevreesde Ziekte] is accessible and usable to anyone and will automatically enrich the ‘ARTA’ table of [9dadb18a-e1a9-49f9-9062-296419555fd5] in any application that ‘allows all nanopublications to show up’.
- If NLM would decide to accept the term [De Gevreesde Ziekte] for the concept [C0006826] in Dutch language, they can just publish that again in a nanopublication. The only difference between the non-authoritative nanopublication and the authoritative one is in the provenance graph (author=NLM and time/date stamp)
- Assuming that the time/date stamp of the authoritative nanopublication is at a later point in time, ‘Any Dutchman’ will still get the credit for the original nanopublication.
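The example above boils down to two operations on the provenance: filtering on author and comparing time/date stamps. A minimal sketch, assuming each nanopublication is reduced to four fields; the record type, function names, author labels and dates are all hypothetical and chosen only to mirror the ‘any Dutchman’ scenario.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class ArtaNanopub(NamedTuple):
    concept_id: str
    symbol: str
    author: str          # e.g. an ORCID or an organisation name
    published: datetime


def visible(nanopubs, accepted_authors=None):
    """Apply an author filter; None means 'allow all nanopublications'."""
    if accepted_authors is None:
        return list(nanopubs)
    return [pub for pub in nanopubs if pub.author in accepted_authors]


def first_publisher(nanopubs, concept_id, symbol):
    """Author of the earliest nanopublication making this exact assertion."""
    matches = [pub for pub in nanopubs
               if (pub.concept_id, pub.symbol) == (concept_id, symbol)]
    return min(matches, key=lambda pub: pub.published).author if matches else None


CANCER = "9dadb18a-e1a9-49f9-9062-296419555fd5"
TERM = "De Gevreesde Ziekte"
store = [
    # Illustrative dates only: the community assertion comes first.
    ArtaNanopub(CANCER, TERM, "any-dutchman", datetime(2014, 1, 10, tzinfo=timezone.utc)),
    ArtaNanopub(CANCER, TERM, "NLM", datetime(2014, 6, 1, tzinfo=timezone.utc)),
]

print(len(visible(store)))                      # every nanopublication shows up
print([p.author for p in visible(store, {"NLM"})])
print(first_publisher(store, CANCER, TERM))     # credit stays with the original author
```

Because the filter is applied at read time, the same store serves both the organisation that trusts only one authority and the application that shows everything.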
How does this all influence sustainability?
- The discussion on ‘who do we accept as authority’ is mitigated, as each organization can decide that for themselves, and only nanopublications published by their accepted authority will show up and function in their local system if they switch on their ‘authority (= author) filter’.
- Anyone can now ‘publish an LTS’ in nanopublication format, and as long as the www.nanopub.org guidelines are followed, anyone can use these nanopublications to enrich any system.
- Many different organizations can ‘host’ nanopublication stores of ‘C4RUS’ type nanopublications and it is their own responsibility to keep these up and running and up to date. Obviously, interested organisations can ‘sponsor’ certain datasets if they are deemed to be of general and sustained use and the asset at the originating organization is not self-sustainable.
- Organisations such as ELIXIR, NCBI, NLM, RDA, Force11, CWA etc. can ‘endorse’ authorities, and nanopublication stores as ‘preferred’ reference standards. However, even if people or organizations decide to allow and use other reference stores, everything will always be referencable back to the UUID in the C4RUS.
Remains the question: Who assumes responsibility for the UUID assignment in the C4RUS?
The answer is obviously: ‘Ideally the community’. However, the community, although always there, is changing and elusive when it comes to ‘institutionalizing’.
The beauty of the UUID assignment principle, though, is that by its very nature each UUID is ‘universally unique’ and can thus only be given out once. Now, whoever assumes (part of) the responsibility to host a UUID-based C4RUS will obviously have a website, and therefore each UUID descriptor page will need a URL. I argue here that this does NOT pose a problem. Right now, for the concept [9dadb18a-e1a9-49f9-9062-296419555fd5], there is obviously a page in the CW, namely: http://www.conceptwiki.org/concept/index/9dadb18a-e1a9-49f9-9062-296419555fd5. However, if we decide to change to http://www.cocors/9dadb18a-e1a9-49f9-9062-296419555fd5 in the future, the simple fact that the URI contains the UUID [9dadb18a-e1a9-49f9-9062-296419555fd5] will enable any system to find the reference to the concept ‘cancer’ (or ‘De gevreesde Ziekte’ for that matter).
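To make the point concrete, here is a small sketch of how a system could recover the UUID from either URL, whatever the ‘variable part’ looks like. The regular expression and the `extract_uuid` helper are my own illustration, not part of any existing Concept Wiki API.

```python
import re
import uuid

# Canonical 8-4-4-4-12 hexadecimal form of a UUID, anywhere in a string.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def extract_uuid(uri):
    """Return the first UUID embedded in `uri`, or None if there is none."""
    match = UUID_RE.search(uri)
    return uuid.UUID(match.group(0)) if match else None


current = "http://www.conceptwiki.org/concept/index/9dadb18a-e1a9-49f9-9062-296419555fd5"
future = "http://www.cocors/9dadb18a-e1a9-49f9-9062-296419555fd5"

# Both URLs resolve to the same concept key, although their hosts and paths differ.
print(extract_uuid(current) == extract_uuid(future))
```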
Just try the clean search term 9dadb18a-e1a9-49f9-9062-296419555fd5 in Google right now.
As the newest CW implementation has not been indexed yet, today there are 0 hits on Google…
But before that, do not forget to jump and react to this Monsense proposal.