Surfing Heterogeneous Data Lakes – TopOL project (Top of the Lake)

Querying and Exploring Large Heterogeneous Data Lakes

Data can only be used once useful datasets have been identified, which requires finding out what a dataset contains and what it is about. The large number of datasets, together with their complexity and variety, complicates the task of users seeking to discover information in a data lake. The project's goal is to devise methods that help users, in particular those without technical (computer science) skills, identify useful information within very large, heterogeneous datasets.

Our approach relies on richly structured graphs at the conceptual and logical levels, on information extraction based on language models, and on scalable data storage and indexing systems, to help non-technical users leverage heterogeneous graph data lakes.

The TopOL project is funded by the French National Research Agency (ANR) under grant number ANR-25-CE23-3959-01.