SemKernel

Introduction

The SemKernel is a prototype combining semantic similarity measures on ontologies with support vector machines.
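
To illustrate what a semantic similarity measure on an ontology can look like, here is a minimal sketch. It is not the measure used by SemKernel, which this page does not specify. It assumes a Jaccard similarity over ancestor sets, and the toy hierarchy and all class names are made up.

```python
# Toy is-a hierarchy (child -> direct parents); the class names are hypothetical.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}


def ancestors(cls):
    """Return cls together with all of its direct and indirect superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def jaccard_similarity(a, b):
    """Number of shared ancestors divided by the number of ancestors of either class."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


print(jaccard_similarity("dna_binding", "rna_binding"))  # 0.5
print(jaccard_similarity("dna_binding", "catalysis"))    # 0.25
```

A function of this kind, evaluated for every pair of instances, could serve as the kernel of a support vector machine, provided the resulting similarity matrix is a valid kernel.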

We set up a demo workflow that predicts phenotypes of mouse single-gene knockouts based on the GO functions associated with the deleted proteins. This example uses GO annotations of mouse proteins from the Mouse Genome Informatics (MGI) database as well as phenotype annotations for non-conditional mouse mutants from MGI. Treating the phenotype annotations as annotations of the protein, we obtain a map from each protein to a set of phenotypes. With the SemKernel prototype, training data can be generated for each class of the Mammalian Phenotype (MP) ontology, treating associated proteins as positive instances and selecting a random subset of proteins without declared associations as negative instances.
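
The selection of positive and negative instances described above can be sketched as follows. This is a simplified illustration, not the prototype's implementation. The function name, the identifiers in the example data, and the reading of the ratio as "positives divided by negatives" are assumptions. The sketch also ignores subclasses of the MP class.

```python
import random


def build_training_set(protein_to_phenotypes, mp_class, pos_neg_ratio=1.0, seed=0):
    """Return (protein, label) pairs for one MP class.

    Positives (+1): proteins annotated with mp_class.
    Negatives (-1): a random sample of the remaining proteins.
    """
    positives = sorted(
        p for p, phenotypes in protein_to_phenotypes.items() if mp_class in phenotypes
    )
    candidates = sorted(
        p for p, phenotypes in protein_to_phenotypes.items() if mp_class not in phenotypes
    )
    # Assumption: the ratio is positives / negatives, so 1.0 gives a balanced set.
    n_negatives = min(len(candidates), round(len(positives) / pos_neg_ratio))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    negatives = rng.sample(candidates, n_negatives)
    return [(p, 1) for p in positives] + [(p, -1) for p in negatives]


# Hypothetical protein -> phenotype map; the identifiers are made up.
example_map = {
    "P1": {"MP:A"},
    "P2": {"MP:A", "MP:B"},
    "P3": {"MP:B"},
    "P4": set(),
    "P5": {"MP:C"},
}
training_set = build_training_set(example_map, "MP:A")
```

With this example map, P1 and P2 become positive instances for "MP:A", and two of the remaining three proteins are drawn as negatives.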

The main workflow performed by this SemKernel example comprises the following steps:

  • training
    1. read URIs to train
    2. read the underlying MP knowledge base
    3. write the prepared training data (in SVM light format) to the training data directory
    4. read the GO knowledge base
    5. do the training run on the GO knowledge base and write out the training model to the model directory
  • prediction
    1. read URIs to predict
    2. read the underlying knowledge base (if not done already)
    3. write out the prepared prediction data to the prediction data directory
    4. read the GO knowledge base
    5. do the prediction based on the GO knowledge base and write out the prediction results to the result directory
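
The training and prediction data mentioned in the steps above are written in SVM light format, where each line has the form `<target> <feature>:<value> ...`, with feature indices in increasing order and an optional trailing `# comment`. The sketch below formats one such line. How SemKernel actually encodes GO classes as features is not documented here, so the integer feature indices and the identifier in the comment are purely illustrative.

```python
def to_svmlight_line(label, features, comment=None):
    """Format one example as '<target> <index>:<value> ... # <comment>'.

    features maps integer feature indices to values; SVM light expects
    the indices in increasing order, so they are sorted before writing.
    """
    parts = [str(label)]
    for index in sorted(features):
        parts.append(f"{index}:{features[index]}")
    line = " ".join(parts)
    if comment:
        line += f" # {comment}"
    return line


# Hypothetical example: a positive instance with features 1 and 3 set.
print(to_svmlight_line(1, {3: 1, 1: 1}, "MGI:0000001"))
# prints: 1 1:1 3:1 # MGI:0000001
```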

Download

The SemKernel prototype can be downloaded at http://dl-learner.org/Images/dllearner-semkernel.tar.gz. We also set up a documented example configuration which is available at http://dl-learner.org/Images/semkernel.conf.

Configuration

An overview of the available configuration options is given below:

| option | description | default |
| type | should be “Mammalian Phenotype SemKernel Workflow” for this example | |
| trainURIsFilePath | file containing MP URIs to train (one URI per line) | |
| mpKBFilePath | ontology file containing the MP ontology to derive all subclasses of the MP class to train from | |
| trainingInputDirectoryPath | directory where the prepared training input data in SVM light format should be written to (one file per MP class to train) | |
| goKBFilePath | ontology file containing the GO ontology used to derive a semantic similarity between GO classes which is used by the SemKernel | |
| trainingOutputDirectoryPath | directory where the model data should be written to (one file per MP class to train) | |
| predictionURIsFilePath | file containing MP URIs to calculate a prediction for (one URI per line) | |
| predictionInputDirectoryPath | directory where the prepared prediction input data in SVM light format should be written to (one file per MP class to calculate a prediction for) | |
| predictionOutputDirectoryPath | directory where the prediction output should be written to (one file per MP class the prediction was made for) | |
| mgi2mpMappingsFilePath | file containing the MGI ID to MP class mappings | |
| mgi2goMappingsFilePath | file containing the MGI ID to GO class association mappings | |
| crossValidationFolds | number of cross validation folds, e.g. 10 for 10-fold cross validation | 10 |
| cost | the parameter C of C-SVC, epsilon-SVR, and nu-SVR | 5.0 |
| predictProbability | whether probability prediction should be performed | true |
| posNegExampleRatio | ratio of positive vs negative examples | 1.0 |
| doTraining | determines whether the training run will be performed | true |
| doPrediction | determines whether the prediction run will be performed | true |
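
To make the role of the defaults explicit, the following sketch overlays user-supplied settings on the documented default values and reports which options without a default are still unset. It only illustrates the option list above and is not part of the SemKernel code. The function name and the example paths are hypothetical, and whether every option without a default is strictly required is an assumption.

```python
# Defaults as listed in the options overview.
DEFAULTS = {
    "crossValidationFolds": 10,
    "cost": 5.0,
    "predictProbability": True,
    "posNegExampleRatio": 1.0,
    "doTraining": True,
    "doPrediction": True,
}

# Options for which the overview lists no default value.
NO_DEFAULT = [
    "type",
    "trainURIsFilePath",
    "mpKBFilePath",
    "trainingInputDirectoryPath",
    "goKBFilePath",
    "trainingOutputDirectoryPath",
    "predictionURIsFilePath",
    "predictionInputDirectoryPath",
    "predictionOutputDirectoryPath",
    "mgi2mpMappingsFilePath",
    "mgi2goMappingsFilePath",
]


def resolve_options(user_options):
    """Overlay user settings on the defaults and list options still unset."""
    resolved = {**DEFAULTS, **user_options}
    unset = [name for name in NO_DEFAULT if name not in resolved]
    return resolved, unset


# Hypothetical user settings; the file paths are placeholders.
resolved, unset = resolve_options({
    "type": "Mammalian Phenotype SemKernel Workflow",
    "goKBFilePath": "/path/to/go.owl",
    "cost": 1.0,
})
```

In this example `resolved["cost"]` is 1.0 (overridden), `resolved["crossValidationFolds"]` stays at 10, and `unset` lists the nine path options that were not provided.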

Execution

After downloading the dllearner-semkernel.tar.gz file, extract it to a directory of your choice. The extracted directory contains an executable called semkernel under bin, which runs the SemKernel. An example call, together with the first lines of output, could look like this:

user@host:/tmp/dllearner-1.0 $ bin/semkernel semkernel.conf
SemKernel command line interface
Refreshing org.springframework.cont...: startup date [Sun Mar 29 16:53:52 CEST 2015]; root of context hierarchy
Loading XML bean definitions from URL [jar:file:/tmp/dllearner-1.0/lib/interfaces-1.1-SNAPSHOT.jar!/org/dllearner/configuration/spring/bean-post-processor-configuration.xml]
Pre-instantiating singletons in org.springframework.bean...: defining beans [org.springframework.cont...]; root of factory hierarchy
Inializing workflow...

When the whole SemKernel run has finished, the corresponding directories contain one file per MP class that was trained or predicted.

Results

Our evaluation results can be found here.