Introduction
SemKernel is a prototype that combines semantic similarity measures on ontologies with support vector machines (SVMs).
We set up a demo workflow that predicts the phenotypes of mouse single-gene knockouts from the GO functions associated with the deleted proteins. The example uses GO annotations of mouse proteins from the Mouse Genome Informatics (MGI) database together with MGI phenotype annotations for non-conditional mouse mutants. Treating the phenotype annotations as annotations of the protein, we obtain a map from each protein to a set of phenotypes.
With the SemKernel prototype, training data can be generated for each class of the Mammalian Phenotype (MP) ontology: proteins associated with the class are treated as positive instances, and a random subset of proteins without a declared association is selected as negative instances.
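The selection of positive and negative instances for one MP class can be sketched roughly as follows. This is only an illustration, not the SemKernel implementation: the function name, the in-memory data layout, the identifiers, and the interpretation of the ratio as "positives divided by negatives" are all assumptions made for this example.

```python
import random


def build_training_examples(protein_to_phenotypes, target_classes,
                            pos_neg_ratio=1.0, seed=0):
    """Split proteins into positive and negative instances for one MP class.

    protein_to_phenotypes: dict mapping a protein ID to a set of MP class IDs
    target_classes: the MP class to train plus all of its subclasses
    pos_neg_ratio: assumed to mean (#positives / #negatives)
    """
    if pos_neg_ratio <= 0:
        raise ValueError("pos_neg_ratio must be positive")

    # Positives: proteins annotated with the target class or a subclass.
    positives = sorted(p for p, phens in protein_to_phenotypes.items()
                       if phens & target_classes)

    # Candidates for negatives: every protein without such an annotation.
    positive_set = set(positives)
    candidates = sorted(p for p in protein_to_phenotypes
                        if p not in positive_set)

    # Draw a random subset whose size follows the requested ratio.
    n_negatives = min(len(candidates), round(len(positives) / pos_neg_ratio))
    negatives = random.Random(seed).sample(candidates, n_negatives)
    return positives, negatives


# Hypothetical toy data; real MGI and MP identifiers look different.
protein_to_phenotypes = {
    "MGI:1": {"MP:A"},
    "MGI:2": {"MP:A1"},
    "MGI:3": {"MP:B"},
    "MGI:4": set(),
    "MGI:5": {"MP:B"},
}
positives, negatives = build_training_examples(
    protein_to_phenotypes, target_classes={"MP:A", "MP:A1"})
print(positives, negatives)
```

Fixing the random seed keeps the negative sample reproducible between runs, which makes results for different parameter settings easier to compare.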
The main workflow that is performed by this SemKernel example comprises the following steps:
- training
  - read URIs to train
  - read the underlying MP knowledge base
  - write the prepared training data (in SVM light format) to the training data directory
  - read the GO knowledge base
  - do the training run on the GO knowledge base and write out the training model to the model directory
- prediction
  - read URIs to predict
  - read the underlying knowledge base (if not done already)
  - write out the prepared prediction data to the prediction data directory
  - read the GO knowledge base
  - do the prediction based on the GO knowledge base and write out the prediction results to the result directory
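To give an idea of what data in SVM light format looks like, here is a minimal sketch. In that format each line holds a target value followed by `index:value` pairs with increasing indices, optionally followed by a `#` comment. How SemKernel actually encodes GO annotations is not described here, so the mapping of each GO class to a binary feature, the helper names, and the identifiers below are assumptions made purely for illustration.

```python
def build_feature_index(annotations):
    """Assign a 1-based feature index to every GO class (assumed encoding)."""
    all_classes = sorted({c for classes in annotations.values() for c in classes})
    return {c: i for i, c in enumerate(all_classes, start=1)}


def to_svmlight_line(label, feature_ids, comment=""):
    """Format one instance as an SVM light line with binary features."""
    # SVM light requires feature indices in increasing order.
    parts = [f"{label:+d}"] + [f"{i}:1" for i in sorted(set(feature_ids))]
    line = " ".join(parts)
    return f"{line} # {comment}" if comment else line


# Hypothetical protein-to-GO annotations and class labels (+1 / -1).
annotations = {"MGI:1": {"GO:a", "GO:c"}, "MGI:3": {"GO:b"}}
labels = {"MGI:1": 1, "MGI:3": -1}

index = build_feature_index(annotations)
lines = [
    to_svmlight_line(labels[p], [index[c] for c in annotations[p]], p)
    for p in sorted(annotations)
]
print("\n".join(lines))
```

In the workflow above, one such file would be written per MP class, with the protein ID kept in the trailing comment so that predictions can be traced back to their instances.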
Download
The SemKernel prototype can be downloaded from http://dl-learner.org/Images/dllearner-semkernel.tar.gz. A documented example configuration is available at http://dl-learner.org/Images/semkernel.conf.
Configuration
An overview of the available configuration options is given below:
option | description | default |
---|---|---|
type | should be “Mammalian Phenotype SemKernel Workflow” for this example | |
trainURIsFilePath | file containing MP URIs to train (one URI per line) | |
mpKBFilePath | ontology file containing the MP ontology to derive all subclasses of the MP class to train from | |
trainingInputDirectoryPath | directory where the prepared training input data in SVM light format should be written to (one file per MP class to train) | |
goKBFilePath | ontology file containing the GO ontology used to derive a semantic similarity between GO classes which is used by the SemKernel | |
trainingOutputDirectoryPath | directory where the model data should be written to (one file per MP class to train) | |
predictionURIsFilePath | file containing MP URIs to calculate a prediction for (one URI per line) | |
predictionInputDirectoryPath | directory where the prepared prediction input data in SVM light format should be written to (one file per MP class to calculate a prediction for) | |
predictionOutputDirectoryPath | directory where the prediction output should be written to (one file per MP class the prediction was made for) | |
mgi2mpMappingsFilePath | file containing the MGI ID to MP class mappings | |
mgi2goMappingsFilePath | file containing the MGI ID to GO class association mappings | |
crossValidationFolds | number of cross validation folds, e.g. 10 for 10-fold cross validation | 10 |
cost | the parameter C of C-SVC, epsilon-SVR, and nu-SVR | 5.0 |
predictProbability | whether probability prediction should be performed | true |
posNegExampleRatio | ratio of positive vs negative examples | 1.0 |
doTraining | determines whether the training run will be performed | true |
doPrediction | determines whether the prediction run will be performed | true |
Execution
After downloading the dllearner-semkernel.tar.gz file, extract it to a directory of your choice. The executable for running SemKernel is called semkernel and is located under bin. An example call with the first output lines could look like this:
user@host:/tmp/dllearner-1.0 $ bin/semkernel semkernel.conf
SemKernel command line interface
Refreshing org.springframework.cont...: startup date [Sun Mar 29 16:53:52 CEST 2015]; root of context hierarchy
Loading XML bean definitions from URL [jar:file:/tmp/dllearner-1.0/lib/interfaces-1.1-SNAPSHOT.jar!/org/dllearner/configuration/spring/bean-post-processor-configuration.xml]
Pre-instantiating singletons in org.springframework.bean...: defining beans [org.springframework.cont...]; root of factory hierarchy
Inializing workflow...
When the SemKernel run has finished, the corresponding directories contain one file per MP class that was trained or predicted.
Results
Our evaluation results can be found here.