Predicted gene-disease associations from text mining


The goal of this project was to find novel associations between human genes and disease phenotypes by exposing networks of implicit information in PubMed abstracts. Our technology uses concept profiles, which allow us to infer gene-disease associations even when the relationship has never been explicitly stated. After mining 18 million abstracts dating from 1980 to 2011, running concept profile matching workflows and performing a statistical analysis, we found nearly 2 million significant gene-disease associations (p < 0.05). After various filtering steps, we decided to publish the top-ranking 281,154 gene-disease associations (1%) as nanopublications. This work has been submitted for publication and is currently in review [1]. These nanopublications contain 21 triples each and are stored locally in an Allegro-Franz database at LUMC.
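
To make the idea of concept profile matching more concrete, here is a minimal sketch. It is not the workflow used in this project. It assumes that a concept profile is a mapping from concept names to weights and that two profiles are compared by an inner product over the concepts they share. The function name, the toy profiles and all weights are hypothetical.

```python
from typing import Dict

# Assumed representation: a concept profile maps a concept name to a weight.
Profile = Dict[str, float]


def match_score(a: Profile, b: Profile) -> float:
    """Inner product over the concepts shared by both profiles (an assumed scoring rule)."""
    shared = a.keys() & b.keys()
    return sum((a[c] * b[c] for c in shared), 0.0)


# Toy profiles with made-up concepts and weights, for illustration only.
gene_profile = {"centrosome": 0.5, "microcephaly": 0.25, "cell cycle": 0.25}
disease_profile = {"microcephaly": 0.5, "dwarfism": 0.25, "centrosome": 0.5}
unrelated_profile = {"insulin": 1.0}

score = match_score(gene_profile, disease_profile)
```

In this sketch a gene and a disease can score above zero even if no abstract mentions both, because the score depends only on the concepts the two profiles have in common.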


Our Assertion takes the form of an “association” between a gene and a genetic disorder, and the strength of this association is given by a statistical p-value. When searching for an ontology that would allow us to serialize this Assertion using RDF, we were forced to be more explicit about whether the Assertion is a mathematical/statistical claim or a biomedical claim. As our intention in this project was to offer gene-disease associations to the biomedical research community as testable hypotheses that prompt laboratory or clinical research, our first approach was to use the class GeneDiseaseInteraction defined in the pharmacogenomics-complex ontology. As the name suggests, this class is a broadly defined biological process that involves a gene and a disease. Because GeneDiseaseInteraction is a subclass of BFO:Process, which is a subclass of BFO:Occurrent, the interaction in this case is a physical one with a beginning and an end. However, our research method establishes an association between a gene and a disease (based on the overlap in their respective concept profiles) that may or may not be physical, so we decided to look for a different data model.

Dr. Michel Dumontier, the author of the pharmacogenomics-complex ontology, pointed out that the Semanticscience Integrated Ontology (SIO), a simple, integrated ontology (types, relations) for diverse knowledge representation across physical, processual and informational entities [2], is better suited for bridging the gap between statistical/mathematical claims and their physical or biological interpretation. Moreover, SIO already provides information entities accommodating probability values and predicates that are well aligned with our concept profile technology. However, it did not have a class that defined a statistical relationship. Michel again helped us by creating a new SIO class called “statistical association” [3]. This collaboration worked out very well, especially as we expect to re-use “statistical association” extensively when we apply our concept profile technology to find novel associations between many other semantic pairs (e.g., gene-drug associations).

Publication Information

In modeling the Publication Information, we decided to use a combination of the dcterms and pav ontologies. Most of the predicates used here are self-explanatory. However, we wanted to make a clear distinction between the authors of the content (pav:authoredBy) and the creator of the nanopublication (pav:createdBy). The author of a nanopublication is the person(s) providing the content of a nanopublication, whereas the creator is the person(s) realizing the content according to the nanopublication format. In our case, Herman van Haagen and Erik Schultes did the work to produce the scientific claim, namely the statistical association between gene X and disease Y; thus they are listed as authors. Zuotian Tatum took that scientific claim, translated it into RDF statements with proper ontologies and URIs, and then organized the content according to the OPS Nanopublication Guidelines. Thus, she is the creator of the nanopublication. In compliance with the OPS Guidelines, the authors and creator are listed using Researcher IDs. We could also have used ConceptWiki IDs, but we hope that Researcher IDs and ConceptWiki IDs will be mapped to each other in the ORCID project in the near future.
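
The author/creator distinction can be illustrated with a short sketch. It assumes triples are represented as plain tuples of prefixed names. The function name and the `ex:` identifiers are hypothetical placeholders, not the Researcher ID URIs used in the actual nanopublications.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def publication_info(nanopub: str, authors: List[str], creators: List[str]) -> List[Triple]:
    """Return pav:authoredBy triples for content authors and pav:createdBy triples for creators."""
    triples = [(nanopub, "pav:authoredBy", a) for a in authors]
    triples += [(nanopub, "pav:createdBy", c) for c in creators]
    return triples


# Placeholder identifiers (hypothetical); two content authors and one creator,
# mirroring the roles described above.
triples = publication_info(
    "ex:NanoPub_1",
    authors=["ex:author_1", "ex:author_2"],
    creators=["ex:creator_1"],
)
```

Keeping the two roles in separate predicates means a query for the people behind the scientific claim never returns the person who only serialized it, and vice versa.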


The main ontologies we used to describe our text-mining and concept profile methods are OPM (Open Provenance Model) and OntoDM (Ontology for Data Mining).
By the way, BioPortal is a great place for searching biomedical ontologies.

Example Nanopublication in RDF

(converted to the latest version of the nanopublication guidelines)

@prefix nanopub: <> .
@prefix dcterms: <> .
@prefix opm: <> .
@prefix pav: <> .
@prefix rdfs: <> .
@prefix sio: <> .
@prefix xsd: <> .
@prefix : <> .

# Head graph: links the nanopublication to its three parts.
:NanoPub_1_Head {
  : a nanopub:Nanopublication ;
    nanopub:hasAssertion :NanoPub_1_Assertion ;
    nanopub:hasProvenance :NanoPub_1_Provenance ;
    nanopub:hasPublicationInfo :NanoPub_1_Pubinfo .
}

# Assertion graph: the statistical association and its p-value.
:NanoPub_1_Assertion {
  :Association_1 a sio:statistical-association ;
    sio:has-measurement-value :Association_1_p_value ;
    sio:refers-to <>, <> ;
    rdfs:comment "This association has p-value of 0.000066, has attribute gene CENPJ (Entrez gene id 55835) and attribute disease Seckel Syndrome (OMIM 210600)."@en .

  :Association_1_p_value a sio:probability-value ;
    sio:has-value "0.0000656211037469712"^^xsd:float .
}

# Provenance graph: how the assertion was derived.
:NanoPub_1_Provenance {
  :NanoPub_1_Assertion opm:wasDerivedFrom <> ;
    opm:wasGeneratedBy <> .
}

# Publication information graph: authors, creator, date and rights.
:NanoPub_1_Pubinfo {
  : pav:authoredBy <> ,
        <> .
  : pav:createdBy <> ;
    dcterms:created "2012-03-28T11:32:30.758274Z"^^xsd:dateTime ;
    dcterms:rights <> ;
    dcterms:rightsHolder <> .
}
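
The example above splits one nanopublication into four named graphs, with the head graph pointing to the other three. As a rough illustration of that structure, the following sketch checks that every graph referenced from the head is actually defined. The dictionary representation and the function name are assumptions made for this illustration and are not part of any nanopublication tooling.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Predicates that link the head graph to the other three graphs.
REQUIRED_LINKS = (
    "nanopub:hasAssertion",
    "nanopub:hasProvenance",
    "nanopub:hasPublicationInfo",
)


def structure_problems(graphs: Dict[str, List[Triple]], head: str) -> List[str]:
    """List required links that are absent from the head graph or point to an undefined graph."""
    problems: List[str] = []
    head_triples = graphs.get(head, [])
    for predicate in REQUIRED_LINKS:
        targets = [o for (_, p, o) in head_triples if p == predicate]
        if not targets:
            problems.append(f"missing link {predicate}")
        problems.extend(f"undefined graph {t}" for t in targets if t not in graphs)
    return problems


# Graph names follow the example; the contents of the other graphs are left empty for brevity.
example = {
    ":NanoPub_1_Head": [
        (":", "nanopub:hasAssertion", ":NanoPub_1_Assertion"),
        (":", "nanopub:hasProvenance", ":NanoPub_1_Provenance"),
        (":", "nanopub:hasPublicationInfo", ":NanoPub_1_Pubinfo"),
    ],
    ":NanoPub_1_Assertion": [],
    ":NanoPub_1_Provenance": [],
    ":NanoPub_1_Pubinfo": [],
}

problems = structure_problems(example, ":NanoPub_1_Head")
```
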

Details for this example
Creator: Zuotian Tatum
Author: Erik Schultes, Herman van Haagen
Profile status: Final
Acknowledgements: Michel Dumontier
Access point to data: here
[1] van Haagen et al (2012) Confronting the complexity of polygenic diseases by exposing networks of implicit information [2] [3]
Related to these domains:

Name : Genetic Disorders

Name : Genetics

Name : implicitome
Available under these licenses:

Name : Creative Commons Attribution 3.0 Unported