Abstract : A data lake (DL) is known as a Big Data analysis solution. It stores not only data but also the processes that were carried out on these data. It is commonly agreed that data preparation and transformation take most of a data analyst's time. To improve the efficiency of data processing in a DL, we propose a framework that includes a metadata model and algebraic transformation operations. The metadata model ensures the findability, accessibility, interoperability and reusability of data processes, as well as the data lineage of these processes. Moreover, each process is described through a set of coarse-grained data transformation operations that can be applied to different types of datasets. We illustrate and validate our proposal with an implementation of a real medical use case.
https://hal.archives-ouvertes.fr/hal-03141202
Contributor : Yan Zhao
Submitted on : Monday, February 22, 2021 - 9:14:27 AM
Last modification on : Tuesday, February 23, 2021 - 3:24:04 AM
File : Restricted access. To satisfy the distribution rights of the publisher, the document is embargoed until 2021-07-11.
Imen Megdiche, Franck Ravat, Yan Zhao. Metadata Management on Data Processing in Data Lakes. 47th International Conference on Current Trends in Theory and Practice of Informatics (SOFSEM 2021), Jan 2021, Bozen-Bolzano, Italy. pp.553-562, ⟨10.1007/978-3-030-67731-2_40⟩. ⟨hal-03141202⟩