Introduction

The SemKernel is a prototype combining semantic similarity measures on ontologies with support vector machines.

We set up a demo workflow that predicts the phenotypes of mouse single-gene knockouts from the GO functions associated with the deleted proteins. The example uses GO annotations of mouse proteins from the Mouse Genome Informatics (MGI) database, together with MGI phenotype annotations for non-conditional mouse mutants. Treating the phenotype annotations as annotations of the protein, we obtain a map from each protein to a set of phenotypes. With the SemKernel prototype, training data can be generated for each class of the Mammalian Phenotype (MP) ontology: proteins associated with the class serve as positive instances, and a random subset of proteins without a declared association serves as negative instances.
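
The construction of training data for a single MP class can be sketched as follows. This is a minimal illustration, not the SemKernel implementation: the function name `build_training_set`, the dictionary layout, and all protein and class identifiers are hypothetical. The sketch also ignores MP subclasses and assumes that the `posNegExampleRatio` option means the number of positives divided by the number of negatives.

```python
import random


def build_training_set(mp_class, protein_to_phenotypes, pos_neg_ratio=1.0, seed=0):
    """Return (positives, negatives) for one MP class.

    protein_to_phenotypes maps a protein ID to the set of MP classes
    annotated to it (hypothetical layout, chosen for this sketch).
    """
    # Proteins annotated with the class are positive instances.
    positives = sorted(
        p for p, phenos in protein_to_phenotypes.items() if mp_class in phenos
    )
    # Proteins without a declared association are candidate negatives.
    candidates = sorted(
        p for p, phenos in protein_to_phenotypes.items() if mp_class not in phenos
    )
    # Assumption: ratio = positives / negatives, so negatives = positives / ratio.
    n_neg = min(len(candidates), round(len(positives) / pos_neg_ratio))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    negatives = rng.sample(candidates, n_neg)
    return positives, negatives


# Hypothetical annotations, for illustration only.
annotations = {
    "MGI:0000001": {"MP:0000001"},
    "MGI:0000002": {"MP:0000002"},
    "MGI:0000003": set(),
    "MGI:0000004": {"MP:0000001", "MP:0000002"},
    "MGI:0000005": {"MP:0000002"},
}
pos, neg = build_training_set("MP:0000001", annotations)
print("positives:", pos)
print("negatives:", neg)
```

With a ratio of 1.0 the sketch draws as many negatives as there are positives, capped by the number of available candidates.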

The main workflow that is performed by this SemKernel example comprises the following steps:

training
1. read URIs to train
2. read the underlying MP knowledge base
3. write the prepared training data (in SVM light format) to the training data directory
4. read the GO knowledge base
5. do the training run on the GO knowledge base and write out the training model to the model directory
prediction
1. read URIs to predict
2. read the underlying MP knowledge base (if not done already)
3. write out the prepared prediction data to the prediction data directory
4. read the GO knowledge base
5. do the prediction based on the GO knowledge base and write out the prediction results to the result directory
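
Step 3 of each phase writes the prepared data to one file per MP class. For orientation, a line in the generic SVM light layout consists of a target value followed by `index:value` pairs in increasing index order, optionally ending with a `#` comment. The sketch below formats such lines. How SemKernel actually encodes proteins and GO classes as features is not described here, so the helper `to_svmlight_line`, the feature indices, and the identifier in the comment are purely illustrative assumptions.

```python
def to_svmlight_line(label, features, comment=None):
    """Format one example as '<target> <index>:<value> ... # comment'.

    label    -- +1 for a positive instance, -1 for a negative one
    features -- dict mapping a positive integer feature index to a value
    """
    parts = [f"{label:+d}"]
    # SVM light expects feature indices in increasing order.
    for index in sorted(features):
        parts.append(f"{index}:{features[index]:g}")
    line = " ".join(parts)
    if comment:
        line += f" # {comment}"
    return line


# Hypothetical example: feature indices 1 and 3 stand in for two GO classes.
print(to_svmlight_line(+1, {3: 1.0, 1: 1.0}, comment="MGI:0000001"))
# prints: +1 1:1 3:1 # MGI:0000001
print(to_svmlight_line(-1, {2: 1.0}))
# prints: -1 2:1
```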

Download

The SemKernel prototype can be downloaded from http://dl-learner.org/Images/dllearner-semkernel.tar.gz. A documented example configuration is available at http://dl-learner.org/Images/semkernel.conf.

Configuration

The available configuration options are listed below:

option	description	default
type	should be “Mammalian Phenotype SemKernel Workflow” for this example
trainURIsFilePath	file containing MP URIs to train (one URI per line)
mpKBFilePath	ontology file containing the MP ontology to derive all subclasses of the MP class to train from
trainingInputDirectoryPath	directory where the prepared training input data in SVM light format should be written to (one file per MP class to train)
goKBFilePath	ontology file containing the GO ontology used to derive a semantic similarity between GO classes which is used by the SemKernel
trainingOutputDirectoryPath	directory where the model data should be written to (one file per MP class to train)
predictionURIsFilePath	file containing MP URIs to calculate a prediction for (one URI per line)
predictionInputDirectoryPath	directory where the prepared prediction input data in SVM light format should be written to (one file per MP class to calculate a prediction for)
predictionOutputDirectoryPath	directory where the prediction output should be written to (one file per MP class the prediction was made for)
mgi2mpMappingsFilePath	file containing the MGI ID to MP class mappings
mgi2goMappingsFilePath	file containing the MGI ID to GO class association mappings
crossValidationFolds	number of cross validation folds, e.g. 10 for 10-fold cross validation	10
cost	the parameter C of C-SVC, epsilon-SVR, and nu-SVR	5.0
predictProbability	whether probability prediction should be performed	true
posNegExampleRatio	ratio of positive vs negative examples	1.0
doTraining	determines whether the training run will be performed	true
doPrediction	determines whether the prediction run will be performed	true

Execution

After downloading dllearner-semkernel.tar.gz, extract it to a directory of your choice. The extracted directory contains an executable called semkernel under bin. An example call, together with the first lines of output, could look like this:
    user@host:/tmp/dllearner-1.0 $ bin/semkernel semkernel.conf
    SemKernel command line interface
    Refreshing org.springframework.cont...: startup date [Sun Mar 29 16:53:52 CEST 2015]; root of context hierarchy
    Loading XML bean definitions from URL [jar:file:/tmp/dllearner-1.0/lib/interfaces-1.1-SNAPSHOT.jar!/org/dllearner/configuration/spring/bean-post-processor-configuration.xml]
    Pre-instantiating singletons in org.springframework.bean...: defining beans [org.springframework.cont...]; root of factory hierarchy
    Inializing workflow...

When the whole SemKernel run has finished, the corresponding directories contain one file per MP class that was trained or predicted.

Results

Our evaluation results can be found here.