Apache Spark MLlib platform capabilities for query processing in data lakes

Giorgi Muradovi

Abstract

In the era of big data, data lakes have become indispensable for storing large-scale structured and
unstructured data. Unlike traditional databases, data lakes store raw data in its native format, providing
flexibility for future analysis. However, the volume and variety of data in data lakes present challenges for
efficient query processing. This paper presents the role and importance of data lakes as centralized repositories
for large-scale data storage and analysis, along with the difficulties they pose for processing. To address these
difficulties, it presents Apache Spark as a flexible analytical engine for large-scale data processing and
describes the advantages of Apache Spark MLlib for overcoming query processing problems in data lakes.
Platforms and services that use Apache Spark MLlib in industry applications are presented, and current
real-world applications and future research directions are discussed.

Keywords:
Query Processing, Data Lakes, Apache Spark, MLlib, Big Data Analytics, Machine Learning.

Published: Dec 7, 2024

Article Details

Section
Information and communication technologies (ICTs)