Data Lakes: Consolidation and Schema Design through Formal Concept Analysis

Data lakes have emerged as a popular solution for storing and analyzing large and diverse datasets for advanced analytics. However, the unstructured nature of the data in these repositories makes it difficult to leverage the information effectively and extract valuable insights. To address this issue, the paper proposes a new approach rooted in Formal Concept Analysis (FCA).

The authors conducted their research at Infologic, where they explored the data structures stored in the company's data lake, including measurements in InfluxDB and indexes in Elasticsearch. The goal was to establish conventions for a more accessible and unified data model.
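
Building such an inventory typically starts from the metadata the storage engines already expose, such as Elasticsearch index mappings or InfluxDB field keys. As a minimal sketch (the mapping and field names below are hypothetical, not taken from Infologic's data lake), the field names of one index can be collected like this:

```python
def collect_fields(properties, prefix=""):
    """Recursively gather dotted field names from an Elasticsearch-style mapping."""
    fields = set()
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:          # nested object: recurse into its sub-fields
            fields |= collect_fields(spec["properties"], prefix=f"{path}.")
        else:
            fields.add(path)
    return fields

# Illustrative mapping only; real ones would come from GET <index>/_mapping.
mapping = {
    "timestamp": {"type": "date"},
    "type":      {"type": "keyword"},
    "resource":  {"properties": {"usedRatio": {"type": "float"}}},
}
print(sorted(collect_fields(mapping)))   # ['resource.usedRatio', 'timestamp', 'type']
```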

Applying FCA, the researchers represented the data structures as objects, with their field names as attributes, and analyzed the resulting concept lattice. This allowed them to identify concepts shared across the data, such as common fields like timestamp, type, and usedRatio, and on that basis to establish a common schema for the data lake.
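
As a rough illustration of that representation (the structure names and fields below are made up, and the brute-force enumeration is only a sketch, not the authors' implementation), a formal context can be built with data structures as objects and field names as attributes, and its formal concepts listed directly:

```python
from itertools import combinations

# Hypothetical formal context: objects are data structures from the lake,
# attributes are the field names each structure carries.
context = {
    "influxdb_cpu_measurement":  {"timestamp", "type", "usedRatio", "host"},
    "influxdb_disk_measurement": {"timestamp", "type", "usedRatio", "device"},
    "elasticsearch_log_index":   {"timestamp", "type", "message", "host"},
}
objects = sorted(context)
attributes = set().union(*context.values())

def intent(objs):
    """Attributes shared by every object in objs (the derivation A -> A')."""
    return set(attributes) if not objs else set.intersection(*(context[o] for o in objs))

def extent(attrs):
    """Objects carrying every attribute in attrs (the derivation B -> B')."""
    return {o for o in objects if attrs <= context[o]}

# A pair (A, B) is a formal concept when A' = B and B' = A; brute force over
# object subsets is fine for a context this small.
for r in range(len(objects) + 1):
    for combo in combinations(objects, r):
        A = set(combo)
        B = intent(A)
        if extent(B) == A:
            print(sorted(A), "->", sorted(B))
```

The intents that sit high in the lattice (here timestamp and type, shared by every structure, and usedRatio, shared by both measurements) are precisely the shared fields a common schema can standardize on.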

The results of the study were significant. The number of distinct field names across the data structures was reduced by 54 percent, from 190 to 88. Moreover, complete coverage of 80 percent of the data structures was achieved with only 34 distinct field names, down from the 121 field names initially required for the same coverage.

This research provides valuable insight into the Infologic ecosystem and offers a comprehensive methodology for consolidating data lakes and designing a unified schema. It presents both qualitative and quantitative results that demonstrate the effectiveness of the FCA approach. By applying this methodology, organizations can streamline their data lakes, making them more accessible and enabling more efficient analysis and extraction of insights.

Expert Commentary

The use of data lakes has grown rapidly in recent years due to their ability to store vast amounts of data in its raw form, enabling organizations to perform advanced analytics and gain valuable insights. However, the unstructured nature of data within data lakes poses challenges in terms of organization, accessibility, and analysis.

What makes this paper particularly interesting is the use of Formal Concept Analysis (FCA) as a methodology to address these challenges. FCA is a mathematical framework for deriving a hierarchy of concepts, the concept lattice, from a collection of objects and the attributes they share. By applying FCA to the data structures in the data lake, the authors were able to identify common concepts and establish a unified schema.
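
For readers unfamiliar with FCA, the standard definitions (in the sense of Ganter and Wille) are brief; in the paper's setting the objects would be the data structures and the attributes their field names:

```latex
% Standard FCA definitions, stated for reference.
A \emph{formal context} is a triple $(G, M, I)$, where $G$ is a set of objects,
$M$ a set of attributes, and $I \subseteq G \times M$ an incidence relation.
For $A \subseteq G$ and $B \subseteq M$, the derivation operators are
\[
  A' = \{\, m \in M \mid \forall g \in A : (g, m) \in I \,\}, \qquad
  B' = \{\, g \in G \mid \forall m \in B : (g, m) \in I \,\}.
\]
A \emph{formal concept} is a pair $(A, B)$ with $A' = B$ and $B' = A$; ordered by
inclusion of extents, the set of all concepts forms the \emph{concept lattice}.
```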

The reduction in the number of distinct field names is particularly noteworthy. By consolidating and unifying the data structures, the authors were able to significantly reduce the complexity and redundancy within the data lake. This not only makes it easier to navigate and understand the data, but also improves the efficiency of analysis by reducing the number of variables that need to be considered.

Furthermore, the authors' methodology achieved complete coverage of 80 percent of the data structures with only 34 distinct field names. In other words, the majority of the data lake can be described and analyzed with a relatively small set of fields, indicating that the unified schema derived from FCA is both compact and comprehensive.
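
The complete-coverage figure has a simple operational reading. The sketch below (with made-up structures and field names, not the paper's data) counts the data structures whose fields are all contained in a candidate unified field set:

```python
def complete_coverage(structures, field_set):
    """Fraction of structures whose fields are *all* contained in field_set."""
    covered = sum(1 for fields in structures.values() if fields <= field_set)
    return covered / len(structures)

# Illustrative only: with a unified vocabulary, a small field set can fully
# describe most of the structures in the lake.
structures = {
    "cpu":  {"timestamp", "type", "usedRatio"},
    "disk": {"timestamp", "type", "usedRatio"},
    "logs": {"timestamp", "type", "message", "stackTrace"},
}
unified = {"timestamp", "type", "usedRatio"}
print(f"{complete_coverage(structures, unified):.0%}")  # 67% here; the paper reports 80%
```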

This research has significant implications for organizations that rely on data lakes for their analytics and decision-making processes. By adopting the FCA approach outlined in this paper, organizations can improve the accessibility and understandability of their data lakes, making it easier for data scientists and analysts to extract meaningful insights.

In conclusion, this paper presents a practical and effective approach for consolidating data lakes and deriving a common schema. The application of Formal Concept Analysis provides a systematic methodology for organizing and designing data structures within data lakes, resulting in improved accessibility and streamlined analysis. This work could meaningfully change how data lakes are organized and used, helping organizations unlock more of the value in their data.
