Apache Spark MLlib platform capabilities for query processing in data lakes
Abstract
In the era of big data, data lakes have become indispensable for storing large-scale structured and unstructured
data. Unlike traditional databases, data lakes store raw data in its native format, providing
flexibility for future analysis. However, the volume and variety of data in data lakes present challenges for
efficient query processing. This paper presents the role and importance of data lakes as centralized repositories for
large-scale data storage and analysis, along with the difficulties they pose for data processing. To address
these difficulties, the paper examines Apache Spark as a flexible analytics engine for large-scale data
processing and describes the advantages of Apache Spark MLlib for overcoming query processing problems in
data lakes. Platforms and services that use Apache Spark MLlib for industry applications are presented.
Real-world applications and future research trends and directions are also discussed.