Last Update April 6, 2011

Welcome

Welcome to the Nanopub, a friendly place serving up nanopublications to critique, customize and share. This site was established to host the very first collection of nanopublications.

Here at the Nanopub you can sample nanopublications and learn more about initiatives to create an infrastructure for this new form of professional communication.

A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author. Individual nanopublications can be cited by others and monitored for their impact on the community.

Nanopublications are a natural response to the explosion of high-quality contextual information that overwhelms the capacity of conventional research articles in scholarly communication. With nanopublications, it is possible to disseminate individual data as independent publications with or without an accompanying research article [1,2]. Furthermore, because nanopublications can be attributed and cited, they provide incentives for researchers to make their data available in standard formats that drive data accessibility and interoperability [3].

Nanopublications are intended to complement, not displace, existing modes of peer-reviewed publication. Myles Axton, Editor at Nature Genetics, assures us that “NO DATABASES WERE HARMED IN THE MAKING OF THIS REVOLUTION!”

This website provides example nanopublications to guide early adopters who would like to publish their own data as nanopublications. Enjoy!

The Anatomy of a Nanopublication

A nanopublication has three basic elements [4]:

  1. An assertion whereby two concepts (called the Subject and the Object) are associated (using a third concept called the Predicate).
  2. Metadata regarding Conditions under which the assertion holds.
  3. Metadata regarding the Provenance of the assertion, such as its author, a time-stamp marking when it was created, links to DOIs, URLs, etc.
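
The three elements above can be sketched as a small data structure. This is only an illustration, not a format defined by this site: the class and field names are assumptions, and the values are borrowed from the sample nanopublication discussed later on this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """Subject and Object associated through a Predicate."""
    subject: str
    predicate: str
    obj: str  # named "obj" because "object" is a Python builtin


@dataclass(frozen=True)
class Nanopublication:
    """Hypothetical container for the three basic elements."""
    assertion: Assertion
    conditions: tuple = ()  # conditions under which the assertion holds
    provenance: dict = field(default_factory=dict)  # author, time-stamp, DOIs, ...


example = Nanopublication(
    assertion=Assertion("NG_000007.3:g.70628G>A", "has variant frequency", "0.25%"),
    conditions=("Sardinian",),
    provenance={"dateofcreation": "March 24, 2011", "evidenceType": "empirical"},
)
print(example.assertion.predicate, example.conditions)
```

Keeping the assertion separate from its conditions and provenance mirrors the anatomy above: the same assertion could in principle be paired with different metadata.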

Nanopublications are based on open standards, and we anticipate the community-driven evolution of nanopublication formats to fit the changing needs of authors and publishers. Nanopublications can be serialized in many different ways, for example using XML and RDF. Standards allow nanopublications to be machine readable, opening the door to many new communication possibilities. Nanopublications allow internet-based search and retrieval to target specific data rather than documents (which may or may not contain those data) or databases (which often have idiosyncratic data structures). Machine readability of nanopublications aims to enable universal interoperability and the automated discovery of new associations that would otherwise be beyond the capacity of human reasoning.

Some Principles of Nanopublishing

We propose the following principles when creating new nanopublications:

The Principle of Added Value:

The Assertion arises from a well-documented procedure or observation. For example, the Predicate establishing an association between the Subject and the Object could arise from a mathematical model, co-occurrence in text, a new experimental dataset, manually curated relations established by experts or from the exposure of an existing database.

The Principle of Transparency:

The Provenance and Condition refer to the who, what, where and when of the Predicate, allowing the quality of the nanopublication to be assessed by others.

The Principle of Ambiguity Avoidance:

The arguments of the nanopublication (all concepts in the assertion, the condition and the provenance) can be unambiguously resolved to unique concepts.

The Principle of Global Reference:

Where the authority, namespace, accession and version of a nanopublication argument have already been established on the Web, the Uniform Resource Identifier (URI) of the concept should be used. Where no URI exists, a Universally Unique Identifier (UUID) can be generated using the ConceptWiki.
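
The choice described by this principle can be sketched in a few lines. How the ConceptWiki actually issues its identifiers is not described here, so Python's standard `uuid.uuid4()` is used purely as a stand-in, and the function name `choose_identifier` is hypothetical.

```python
import uuid


def choose_identifier(existing_uri=None):
    """Prefer an established URI; otherwise fall back to a fresh UUID.

    Assumption: uuid4() stands in for whatever the ConceptWiki does
    when it issues a UUID for a new concept page.
    """
    if existing_uri:
        return existing_uri
    return str(uuid.uuid4())


# A concept that already has a URI on the Web keeps it.
print(choose_identifier("http://www.ncbi.nlm.nih.gov/gene/1756"))
# A concept without one receives a newly generated identifier.
print(choose_identifier())
```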

A Sample Nanopublication

With this in mind, consider the following example of a nanopublication.

A recent review article by Giardine et al. [3] presented a large amount of high-quality data as Supplementary Information (4 separate Tables, downloadable as an Excel spreadsheet). DNA variants of the hemoglobin gene are described in the first column of the Variant Submission Information Table, while their observed frequencies in human populations are listed in the third column of the Variant Frequency Information Table. These data form assertions that can be exposed as nanopublications.

For example, the DNA variant NG_000007.3:g.70628G>A (Subject) has a frequency (Predicate) of 0.25% (Object). The assertion holds specifically for the Sardinian population (the second column of the Variant Frequency Information Table) and is unlikely to hold for other populations. Hence, Sardinian becomes a Condition of the nanopublication. The Provenance includes the authors of the article (Giardine et al.), the date when the nanopublication was created, and other information. Schematically, this particular nanopublication looks like this:

As XML, this nanopublication can be written as:

<nanopublication id="0">
 <assertion>
  <subject>NG_000007.3:g.70628G>A</subject>
  <predicate>has variant frequency</predicate>
  <object>0.25%</object>
 </assertion>
 <condition>Sardinian</condition>
 <provenance>
  <dateofcreation>March 24, 2011</dateofcreation>
  <lastedit>March 24, 2011</lastedit>
  <evidenceType>empirical</evidenceType>
  <authorID>Giardine et. al.</authorID>
  <curatorID>unresolved</curatorID>
  <registrantID>Mons et. al.</registrantID>
  <PMID>6695908</PMID>
  <PMID>1428944</PMID>
  <PMID>1610915</PMID>
  <DOI>http://dx.doi.org/10.1038/ng.785</DOI>
  <linkout>http://globin.bx.psu.edu/cgi-bin/hbvar/query_vars3?mode=output&amp;display_format=page&amp;i=239</linkout>
  <linkout>http://phencode.bx.psu.edu/cgi-bin/phencode/phencode?build=hg18&amp;id=HbVar.239</linkout>
 </provenance>
</nanopublication>
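
As a rough illustration of what machine-readability buys, the following sketch reads an abbreviated copy of this XML with Python's standard `xml.etree.ElementTree` module. The provenance is trimmed for brevity, and the `>` in the variant name is written as `&gt;`. A strict parser accepts a bare `>` in element text, but requires any `&` (as in the linkout URLs) to be written as `&amp;`.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the sample nanopublication (provenance trimmed).
SAMPLE = """<nanopublication id="0">
 <assertion>
  <subject>NG_000007.3:g.70628G&gt;A</subject>
  <predicate>has variant frequency</predicate>
  <object>0.25%</object>
 </assertion>
 <condition>Sardinian</condition>
 <provenance>
  <evidenceType>empirical</evidenceType>
  <PMID>6695908</PMID>
  <PMID>1428944</PMID>
  <PMID>1610915</PMID>
 </provenance>
</nanopublication>"""

root = ET.fromstring(SAMPLE)

# Pull out the Subject, Predicate and Object of the assertion.
triple = tuple(
    root.findtext(f"assertion/{tag}") for tag in ("subject", "predicate", "object")
)
condition = root.findtext("condition")
pmids = [el.text for el in root.findall("provenance/PMID")]

print(root.get("id"), triple, condition, pmids)
```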

By data-mining the Giardine et al. Supplementary Information, we exposed 637 such nanopublications (available as a single text file in the Human Hemoglobin Genetic Variation link of the Nanopublication Downloads section). Note that the XML provides a header and footer delimiting the nanopublication and structures the information of the nanopublication for machine-readability. The header also contains an identifier unique for each nanopublication (id="0").

In the Provenance, we are given information to help us evaluate the origin and quality of the nanopublication. First, we see when the nanopublication was created and the last time it was edited. The "evidenceType" specifies whether the nanopublication is derived from observation or measurement (i.e., is "empirical") or derived as a prediction based on a model or theory (i.e., is "hypothetical"). The nanopublication then distinguishes between the "author" (who published the narrative description of the assertion), the "curator" (who entered the data of the assertion into a database), and the "registrant" (who produced the nanopublication itself). By distinguishing between these contributions, nanopublications permit a more fine-grained attribution and citation of data, and the development of bibliometrics that allow the impact of these contributions to be monitored. In turn, this will create incentives for the often thankless tasks of data curation and for the tedious creation of nanopublications from legacy information. Lastly, included in this example nanopublication are PubMed identification numbers from the Supplementary Information, a DOI for the article itself, and links to the relevant records in the HbVar and PhenCode databases.

Although the nanopublication is machine-readable, in such a simple representation, the meaning of arguments in the nanopublication may not always be clear independent of the article describing the database from which the nanopublication was made. For example, without reference to the article [3], it will not always be possible to understand the meaning of “NG_000007.3:g.70628G>A” or “has variant frequency” or “0.25%”.

One way to reduce this kind of ambiguity in nanopublications is to specify the arguments for the Subject, Predicate, Object, Condition and Provenance as internet resources or URIs. The idea is that explicit and unambiguous definitions of these data exist somewhere on the web (in databases or ontologies) and their internet addresses can be used to identify them. For example, the term “dystrophin” by itself is somewhat ambiguous, but the gene "DMD dystrophin [Homo sapiens]" is listed in the Entrez Gene Database with the identification number 1756, and can be unambiguously linked to at http://www.ncbi.nlm.nih.gov/gene/1756.

RDF provides a language for using URIs to identify concepts by linking to them, reducing or obviating ambiguity. Furthermore, in keeping with the principles of linked data, URIs allow other nanopublications to make use of the same concepts, forming a semantic network of inter-linked data that could be used in knowledge discovery [5]. To preserve priority for authors and to incentivize participation and convergence of participating semantic resources we advocate, whenever possible, the use of existing URIs.

However, the manifold issues of URI management can sometimes lead to problems with persistence and consistency of meaning over time. URIs are liable to change as directories are reconfigured and domain names are transferred or decommissioned. This means there will be broken links. Furthermore, terms are often ambiguous and URIs could mistakenly point to the wrong concepts.

In an attempt to mitigate these issues, the ConceptWiki was created as a redundant layer of disambiguation and indexing to guarantee persistence of concept identifiers and semantics. In the ConceptWiki, any concept that could appear in a nanopublication can potentially be described as a unique wiki page, and each wiki page is automatically issued a UUID. Each ConceptWiki page also keeps a list of the synonyms and the URIs for that concept (the so-called Also Referred To As Table, or ARTA Table). ConceptWiki users can edit and modify ARTA Tables as the need arises. Together, the UUIDs and the ARTA Tables can resolve ambiguity between terms, URIs and concepts and preserve links despite broken URIs. We therefore encourage nanopublications that use UUIDs issued by the ConceptWiki.
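
To make the role of an ARTA Table concrete, here is a minimal sketch of a lookup that maps either a synonym or a URI back to one concept identifier. The table layout, the function name `resolve` and the placeholder UUID are all assumptions for illustration; they do not reflect how the ConceptWiki stores its data.

```python
# Hypothetical ARTA Table: one row per concept, keyed by a placeholder UUID.
ARTA = {
    "00000000-0000-0000-0000-000000000001": {  # placeholder, not a real ConceptWiki UUID
        "synonyms": {"dystrophin", "DMD"},
        "uris": {"http://www.ncbi.nlm.nih.gov/gene/1756"},
    },
}


def resolve(term_or_uri, arta=ARTA):
    """Return the concept identifier whose row lists the given term or URI."""
    for concept_id, row in arta.items():
        if term_or_uri in row["synonyms"] or term_or_uri in row["uris"]:
            return concept_id
    return None  # unknown term: no concept page lists it


# A synonym and a URI for the same concept resolve to the same identifier.
print(resolve("DMD") == resolve("http://www.ncbi.nlm.nih.gov/gene/1756"))
```

Because the identifier is independent of any single URI, a nanopublication that cites it keeps its meaning even if one of the listed URIs later breaks.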

To transform our 637 XML nanopublications into RDF, we first had to locate or create URIs for all of the nanopublication data. In some cases, such as the DOI and PubMed IDs, suitable URIs already existed. We found that other concepts, such as authors, curators and ethnic populations, had ConceptWiki descriptions and corresponding UUIDs that we could use. However, many of the concepts, such as DNA variants, had neither URIs nor ConceptWiki UUIDs. In these cases, we used a combination of manual and automated approaches to generate new ConceptWiki pages and UUIDs.

For "authorID" we could have listed ConceptWiki UUIDs for each of the 47 authors of the Giardine et al. article [3]. For clarity, we include here the UUID of only the first author (Belinda Giardine). Likewise, for "registrantID" we could have listed ConceptWiki UUIDs for each of the 16 members of the team that exposed the Giardine et al. Supplementary Information [2], but for clarity we list the UUID of only the first author (Barend Mons).

The "curatorID" posed a special challenge. We found reference to curators in the Comments column of the Variant Submission Information Table, but of the 637 nanopublications, only 25 could be unambiguously resolved to a URI or ConceptWiki UUID. In the remaining 612 nanopublications we link to a generic ConceptWiki page called "Unresolved". These links can later be edited to point to the UUIDs of the actual curators, either by experts who know the curators' identities or by the curators themselves.
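
The fallback to "Unresolved" can be sketched as follows. The matching rule (a known curator name appearing in the comment text, with exactly one match counting as unambiguous) is an assumption made for illustration, as are the function name and the sample names and identifiers; the actual resolution procedure is not described here.

```python
def curator_id(comment, known_curators, unresolved="Unresolved"):
    """Return a curator identifier only when the comment names exactly one.

    known_curators maps curator names to identifiers (hypothetical data).
    Zero matches or several matches both count as ambiguous.
    """
    matches = [uid for name, uid in known_curators.items() if name in comment]
    return matches[0] if len(matches) == 1 else unresolved


# Hypothetical names and identifiers, for illustration only.
known = {"A. Curator": "uuid-a", "B. Curator": "uuid-b"}

print(curator_id("Entered by A. Curator", known))        # single match
print(curator_id("No curator mentioned", known))         # falls back
print(curator_id("A. Curator and B. Curator", known))    # ambiguous, falls back
```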

Finally, using these URIs and UUIDs, the nanopublication depicted above in XML looks like this in RDF:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix nanopub: <http://www.nanopub.org/nschema#>.
@prefix nphbvar: <http://www.nanopub.org/nanopubs/hbvar#>.
@prefix conceptwiki: <http://www.conceptwiki.org/index.php/Concept:>.
@prefix dcterms: <http://purl.org/dc/terms/>.
{
nphbvar:n0 nanopub:hasAssertion nphbvar:n0assertion.
nphbvar:n0 nanopub:hasProvenance nphbvar:n0provenance.
nphbvar:n0 nanopub:hasCondition nphbvar:n0condition.
nphbvar:n0 rdf:type nanopub:Nanopublication.
}
nphbvar:n0assertion {
conceptwiki:e9598771-0fa2-40bb-814e-2a2e846fd166 conceptwiki:f79290f7-f41f-4418-9611-6b3e0ff4b8bb "0.25%"^^xsd:float.
}
nphbvar:n0condition {
nphbvar:n0 conceptwiki:ad66b4b5-9850-4273-9c2d-44ccf170e3eb conceptwiki:a510399c-ae15-4ab3-974b-fcc5552bd417.
}
nphbvar:n0provenance {
nphbvar:n0assertion dcterms:created "2011-03-24"^^xsd:date.
nphbvar:n0assertion dcterms:modified "2011-03-24"^^xsd:date.
nphbvar:n0assertion nanopub:authorID conceptwiki:e909b3e0-0dd4-4caf-ac7a-4d200890e183.
nphbvar:n0assertion nanopub:curatorID conceptwiki:d8880e2c-c351-47a1-84dd-828e3f05be3a.
nphbvar:n0assertion nanopub:registrantID conceptwiki:ddea13fc-78be-11df-9387-001517ac506c.
nphbvar:n0assertion nanopub:fromDOI <http://dx.doi.org/10.1038/ng.785>.
nphbvar:n0assertion nanopub:fromPubMedId <http://www.ncbi.nlm.nih.gov/pubmed/6695908>.
nphbvar:n0assertion nanopub:fromPubMedId <http://www.ncbi.nlm.nih.gov/pubmed/1428944>.
nphbvar:n0assertion nanopub:fromPubMedId <http://www.ncbi.nlm.nih.gov/pubmed/1610915>.
nphbvar:n0assertion nanopub:evidenceType conceptwiki:fff20cb4-5f2f-11df-b0cb-001517ac506c.
nphbvar:n0assertion nanopub:linkout <http://globin.bx.psu.edu/cgi-bin/hbvar/query_vars3?mode=output&display_format=page&i=239>.
nphbvar:n0assertion nanopub:linkout <http://phencode.bx.psu.edu/cgi-bin/phencode/phencode?build=hg18&id=HbVar.239>.
}

The "@prefix" lines in the header declare the namespace abbreviations used in the RDF statements that follow. The URIs and UUIDs are shown in red (and link to their appropriate web resources). Like the XML, these 637 RDF-based nanopublications can be downloaded as a single text file in the Nanopublication Downloads section. However, the real value of RDF is that these data are now accessible to queries from anywhere on the web, without having to parse the Supplementary Information spreadsheet or even knowing that the data exist.
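
To suggest what such a query involves, the sketch below holds three statements from the RDF sample as (graph, subject, predicate, object) tuples and matches them against a pattern in which `None` acts as a wildcard. This is plain Python standing in for a real query language such as SPARQL running against a triple store; no query endpoint is described on this page, and the `match` helper is hypothetical.

```python
CW = "http://www.conceptwiki.org/index.php/Concept:"
NP = "http://www.nanopub.org/nschema#"
HB = "http://www.nanopub.org/nanopubs/hbvar#"

# Three statements copied from the RDF sample: (graph, subject, predicate, object).
QUADS = [
    (HB + "n0assertion",
     CW + "e9598771-0fa2-40bb-814e-2a2e846fd166",
     CW + "f79290f7-f41f-4418-9611-6b3e0ff4b8bb",
     "0.25%"),
    (HB + "n0condition",
     HB + "n0",
     CW + "ad66b4b5-9850-4273-9c2d-44ccf170e3eb",
     CW + "a510399c-ae15-4ab3-974b-fcc5552bd417"),
    (HB + "n0provenance",
     HB + "n0assertion",
     NP + "fromDOI",
     "http://dx.doi.org/10.1038/ng.785"),
]


def match(quads, graph=None, s=None, p=None, o=None):
    """Return every quad that agrees with the pattern; None matches anything."""
    pattern = (graph, s, p, o)
    return [
        q for q in quads
        if all(want is None or want == got for want, got in zip(pattern, q))
    ]


# Which article does the assertion come from?
for _, subject, _, doi in match(QUADS, p=NP + "fromDOI"):
    print(subject, "->", doi)
```

Because every term is a full URI, the same pattern would match statements from any other collection that reuses these identifiers.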

Note that the RDF encoding is for machine-readability and is not intended for human consumption. Although some aspects of the URIs depicted here have implementation dependencies (e.g., "index.php" or "cgi-bin") that may jeopardize their long-term persistence, these problems can be handled in various ways, for example by using PURLs.

This sample nanopublication illustrates some of the many challenges involved in exposing data from existing databases and other legacy information. Our intention here is not to advocate specific solutions but to frame the issues and engender discussion. However, in looking to the future we imagine a world where authors publish all data directly as nanopublications following an existing ontology and RDF specifications (that is, we hope the nanopublication "authors" and "registrants" become the same people). Not only would this eliminate an enormous amount of ambiguity, but it would give rise to an ecosystem of linked data that is universally accessible, interoperable, searchable and 'reasonable' via automated knowledge discovery systems (such as the Large Knowledge Collider). Of course, nanopublications do not preclude the assembly of data into tables, spreadsheets, and traditional databases. But by generating the nanopublication first, any resulting database will tend to be more explicit about its contents.

References

Here are some conventional publications about nanopublications:

  1. Axton M (2011) Crowdsourcing human mutations. Nature Genetics 43: 279.
  2. Mons B et al (2011) The value of data. Commentary, Nature Genetics 43: 281-283.
  3. Giardine B et al (2011) Systematic documentation and analysis of human genetic variation using the microattribution approach. Nature Genetics 43: 295-301.
  4. Groth P, Gibson A & Velterop J (2010) The Anatomy of a Nanopublication. Information Services & Use 30: 51-56.
  5. An overview of semantic web & linked data principles and An example application in the pharmacological space