Sernadela P, Pereira A, Rossetti R
Proceedings of the 2015 10th Iberian Conference on Information Systems and Technologies (CISTI 2015), June 2015
The continuous growth in the quantity and diversity of life sciences data poses several bioinformatics challenges in integrating and selecting the desired information for later study. Most of these data are scattered across independent systems that disregard interoperability, which makes data integration a non-trivial task. Consequently, several ETL (Extract, Transform, Load) frameworks have been developed to make data integration tasks suitable for later exploration studies, providing better solutions for data heterogeneity, diversity and distribution. However, current advanced data integration tasks depend on large and heterogeneous data sources that must be modelled according to the source specifications and network conditions. Furthermore, these automated tasks depend heavily on sequential processes that dramatically increase the overall request and processing time. Without an estimate of the task completion time, the whole research workflow becomes even more challenging. This paper presents DISim, an ontology for data integration simulation, which estimates the completion time of large and heterogeneous data integration jobs in order to provide valuable outputs that support decision-making.