Predicted Protein-Protein Associations from Text Mining


The goal of this project was to find novel associations between human proteins by text mining PubMed abstracts. Our method finds associations that are broader than physical protein-protein interactions (PPIs). For example, an association could also reflect shared cellular localization, co-occurrence in the same metabolic pathway, or correlated gene expression profiles.

After mining 18 million abstracts dating from 1980 to 2008, running concept profile matching algorithms, and performing a statistical analysis, we reported 44,000 significant gene-disease associations, which we wanted to expose as nanopublications. We originally published this work in 2009.
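As a rough illustration of what matching two concept profiles can look like, here is a minimal sketch. It assumes each protein's profile is a sparse mapping from concept to weight and scores a pair by cosine similarity. The function name, the concept names, the weights, and the choice of cosine similarity are all assumptions made for illustration; they are not taken from the published method.

```python
import math


def match_score(profile_a, profile_b):
    """Cosine similarity between two sparse concept profiles.

    Each profile maps a concept name to a non-negative weight.
    """
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[c] * profile_b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical profiles with made-up concepts and weights.
protein_a = {"concept_1": 0.8, "concept_2": 0.6}
protein_b = {"concept_1": 0.6, "concept_3": 0.8}

score = match_score(protein_a, protein_b)
print(round(score, 2))
```

Only concepts present in both profiles contribute to the score, so two proteins that are never mentioned together can still match if their literature shares context.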


Assertion

The assertion takes the form of an SIO “association” between two proteins, identified using EntrezGene IDs.
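The shape of such an assertion can be sketched as a small helper that emits one triple per line. This is only an illustration: the `gene:` prefix is a hypothetical stand-in for whatever IRIs identify the Entrez Gene records, and the helper name is an assumption. The SIO terms and the identifiers are the ones used in the example nanopublication on this page.

```python
def assertion_triples(assoc_id, gene_a, gene_b, p_value):
    """Return (subject, predicate, object) triples for one association."""
    assoc = f":Association_{assoc_id}"
    pval = f"{assoc}_p_value"
    return [
        (assoc, "a", "sio:statistical-association"),
        (assoc, "sio:has-measurement-value", pval),
        # 'gene:' is a hypothetical prefix for the Entrez Gene IRIs.
        (assoc, "sio:refers-to", f"gene:{gene_a}"),
        (assoc, "sio:refers-to", f"gene:{gene_b}"),
        (pval, "a", "sio:probability-value"),
        (pval, "sio:has-value", f'"{p_value}"^^xsd:float'),
    ]


# The p-value is passed as a string so no digits are lost on the way out.
triples = assertion_triples(1, 825, 29780, "0.0000656211037469712")
lines = [f"{s} {p} {o} ." for s, p, o in triples]
print("\n".join(lines))
```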

Publication Information

Following the Open PHACTS Guidelines, authors are listed using their ResearcherID. We use a combination of the dcterms and pav vocabularies to assign attribution. We also list the nanopublication version, rights information, and a DOI link to our published paper.


Provenance

The main ontologies we used to describe our text-mining and concept profile methods are OPM (Open Provenance Model) and OntoDM (Ontology for Data Mining).

Example Nanopublication in RDF

(converted to the latest version of the nanopublication guidelines)

@prefix : <> .
@prefix dcterms: <> .
@prefix nanopub: <> .
@prefix opm: <> .
@prefix pav: <> .
@prefix rdfs: <> .
@prefix sio: <> .
@prefix xsd: <> .
@base <> .

<> {
  :NanoPub_1 a nanopub:Nanopublication ;
    nanopub:hasAssertion :NanoPub_1_Assertion ;
    nanopub:hasProvenance :NanoPub_1_Provenance ;
    nanopub:hasPublicationInfo :NanoPub_1_Pubinfo .
}

:NanoPub_1_Assertion {
  :Association_1 a sio:statistical-association ;
    sio:has-measurement-value :Association_1_p_value ;
    sio:refers-to <>, <> ;
    rdfs:comment """This association has a p-value of 0.000066 and refers to the genes CAPN3 (Entrez Gene ID 825)
        and PARVB (Entrez Gene ID 29780)."""@en .

  :Association_1_p_value a sio:probability-value ;
    sio:has-value "0.0000656211037469712"^^xsd:float .
}

:NanoPub_1_Provenance {
  :NanoPub_1_Assertion opm:wasDerivedFrom <> ;
    opm:wasGeneratedBy <> .
}

:NanoPub_1_Pubinfo {
  :NanoPub_1 dcterms:created "2012-03-28T11:32:30.758274Z"^^xsd:dateTime ;
    pav:authoredBy <> ,
        <> ;
    pav:createdBy <> ;
    pav:versionNumber "1.0" ;
    dcterms:rights <> ;
    dcterms:rightsHolder <> ;
    dcterms:DOI <> .
}

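A quick sanity check on a serialized nanopublication might look like the sketch below. It is deliberately naive and is not a TriG parser: it only looks for the three linking properties by substring and counts braces, so braces inside string literals would confuse it. The function names and the sample text are assumptions made for illustration.

```python
REQUIRED_LINKS = (
    "nanopub:hasAssertion",
    "nanopub:hasProvenance",
    "nanopub:hasPublicationInfo",
)


def missing_links(trig_text):
    """List the linking properties that do not appear in the text."""
    return [link for link in REQUIRED_LINKS if link not in trig_text]


def braces_balanced(trig_text):
    """Check that every '{' that opens a graph has a matching '}'."""
    depth = 0
    for ch in trig_text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # a close with no open before it
    return depth == 0


# Hypothetical head graph, shortened for the example.
sample = """
:Head {
  :NanoPub_1 nanopub:hasAssertion :NanoPub_1_Assertion ;
    nanopub:hasProvenance :NanoPub_1_Provenance ;
    nanopub:hasPublicationInfo :NanoPub_1_Pubinfo .
}
"""

print(missing_links(sample), braces_balanced(sample))
```
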
Details for this example
Creator: Zuotian Tatum
Author: Herman van Haagen, Erik Schultes
Profile status: Draft
Acknowledgements: -
Access point to data: here
van Haagen HHHBM, 't Hoen PAC, Botelho Bovo A, de Morrée A, van Mulligen EM, et al. (2009) Novel Protein-Protein Interactions Inferred from Literature Context. PLoS ONE 4(11): e7894. doi
