Stats2RDF

Representing multi-dimensional statistical data as RDF using the RDF Data Cube Vocabulary


Motivation

Biomedical statistical data is often published as Excel sheets. These are easy for humans to read, but they cannot be queried efficiently and are difficult to integrate with other datasets, which may be in different formats. Our approach is to convert the data into a single data model, RDF. In these biomedical datasets, however, a single statistical value is described along several dimensions, so a simple row-based transformation is not possible. We therefore used the RDF Data Cube Vocabulary for the conversion, as it is designed specifically to represent multidimensional statistical data in RDF.

Methodology


Transforming CSV to RDF in a fully automated way is not feasible, because some dimensions may be encoded in the heading or label of a sheet rather than in its cells. We therefore introduce a semi-automated approach, implemented as a plug-in for OntoWiki. More information here. Using this plug-in, a CSV file can be converted to RDF using the Data Cube Vocabulary. We used the WHO's Global Health Observatory (GHO) dataset as a first use case. It is primarily available as Excel sheets, which we converted to CSV files and then transformed into RDF.
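To illustrate why the mapping needs human input, here is a minimal Python sketch. It is not the OntoWiki plug-in's code. It assumes a hypothetical wide CSV layout in which the disease dimension is encoded in the column headers and one further dimension comes from the sheet label. The `unpivot` helper, the dimension names passed to it, and all cell values other than the Afghanistan/Tuberculosis figure are invented for illustration.

```python
import csv
import io

# Hypothetical wide-format sheet: the disease dimension is hidden in the
# column headers, so a one-record-per-row conversion would lose it.
# Only the Afghanistan/Tuberculosis value (1098) appears on this page;
# the remaining cells are placeholders.
SHEET = """Country,Tuberculosis,Malaria
Afghanistan,1098,25
Albania,3,0
"""


def unpivot(csv_text, row_dimension, header_dimension, measure, fixed=None):
    """Turn every data cell into one observation (dict of name -> value).

    row_dimension:    name for the values in the first column
    header_dimension: name for the dimension encoded in the column headers
    measure:          name of the quantity stored in the cells
    fixed:            dimensions taken from the sheet label, shared by all cells
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    observations = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        for column_label, cell in zip(header[1:], row[1:]):
            obs = dict(fixed or {})
            obs[row_dimension] = row[0]
            obs[header_dimension] = column_label
            obs[measure] = int(cell)
            observations.append(obs)
    return observations


# The user supplies what the software cannot infer: the meaning of the
# first column, of the header row, and of the sheet label.
observations = unpivot(
    SHEET,
    row_dimension="refArea",
    header_dimension="disease",
    measure="death",
    fixed={"sheet": "Mortality and burden of disease"},
)
for obs in observations:
    print(obs)
```

The three named arguments stand in for the choices a user makes during the semi-automated import. Each resulting dictionary corresponds to one cell and could then be written out as a single Data Cube observation.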


The following screenshot displays a GHO table configured for conversion into RDF when imported into the OntoWiki CSV import plug-in:


Results

After converting the GHO data on mortality and burden of disease, classified by country and region, we obtained an RDF dataset containing 3 million triples.
The following example shows a single statistical item, the death value of 1098, from the GHO dataset represented using the Data Cube vocabulary:


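# The eg: namespace below is a placeholder; string values are shown as plain literals.
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix eg: <http://example.org/gho/> .

eg:o1     a              qb:Observation ;
          qb:dataSet     eg:dataset-in1 ;
          eg:refArea     "Afghanistan" ;
          eg:contact     3010 ;
          eg:stat_pop    11605 ;
          eg:disease     "Tuberculosis" ;
          eg:gbdcode     "W0003" ;
          eg:death       1098 .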
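As a rough illustration of how such an observation can be produced from one record, the following Python sketch serializes a dictionary of dimension and measure values as Turtle. It is an assumption-laden example, not the plug-in's implementation. The `eg:` namespace URI and the helper names `turtle_literal` and `observation_to_turtle` are hypothetical, and string values are written as plain literals although a real conversion might use resources instead.

```python
def turtle_literal(value):
    """Format a Python value as a Turtle literal (integers bare, others quoted)."""
    if isinstance(value, int) and not isinstance(value, bool):
        return str(value)
    text = (str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))
    return f'"{text}"'


def observation_to_turtle(obs_id, dataset_id, values):
    """Serialize one observation as a Turtle block using the eg: prefix."""
    pairs = [("a", "qb:Observation"), ("qb:dataSet", f"eg:{dataset_id}")]
    pairs += [(f"eg:{name}", turtle_literal(value))
              for name, value in values.items()]
    body = " ;\n    ".join(f"{predicate} {obj}" for predicate, obj in pairs)
    return f"eg:{obs_id} {body} .\n"


# qb: is the RDF Data Cube namespace; the eg: URI is a placeholder.
PREFIXES = (
    "@prefix qb: <http://purl.org/linked-data/cube#> .\n"
    "@prefix eg: <http://example.org/gho/> .\n"
)

# The values are those of the example observation above.
record = {
    "refArea": "Afghanistan",
    "contact": 3010,
    "stat_pop": 11605,
    "disease": "Tuberculosis",
    "gbdcode": "W0003",
    "death": 1098,
}

turtle = PREFIXES + "\n" + observation_to_turtle("o1", "dataset-in1", record)
print(turtle)
```

Keeping the literal formatting in its own function makes it easy to swap in typed literals or resource URIs later without touching the serialization loop.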


Further Information

  • This is a short presentation describing the conversion of CSV files to RDF using SCOVO (Statistical Core Vocabulary) in OntoWiki. SCOVO is a predecessor of the Data Cube Vocabulary, and the conversion process is similar for both.
  • This is a position paper that was accepted for presentation at the Ontologies in Biomedicine and Life Sciences workshop, held in Mannheim, Germany, on September 9 to 10, 2010.
  • This dataset is also part of the LODD datasets. Click here for more details.

Download

To use the plug-in, install OntoWiki from GitHub, then add the extension with
«cd ontowiki/extensions» and «git clone git://github.com/AKSW/csvimport.ontowiki.git csvimport», and enable it in OntoWiki.
The RDFized GHO files can be downloaded from here. The original data is available here.

Contact

Amrapali J. Zaveri
Johannisgasse 26, Zimmer 5-10
04103 Leipzig

Tel.: +49 341 97-32273
E-Mail, Workpage, Research Group


 

Information

Last Modification: 2012-05-13 17:06:14 by Amrapali Zaveri