Whenever we study a tool that handles data, we should ask how much data it can process and why the tool came into use in the first place. There are many reasons behind the development of Apache Spark, and we will study each of them here. We will also learn about the components and architecture of Apache Spark and how it complements the Hadoop framework. Let’s get started.
Why Apache Spark?
There are several reasons Apache Spark has gained so much popularity as part of the Hadoop ecosystem. Let’s look at why Apache Spark is considered an excellent tool for processing data.
Today, data is created at an enormous speed. It is generated everywhere around us by machines, sensors and servers. A good example is website tracking: while users browse a website, the tracking system records where each user clicked to reach another page. Imagine thousands of users on the same website at the same time, with the tracking system collecting all of this data. But what is the use of this data unless we store and process it? Data that is this large in volume and arrives at this speed can be termed Big Data.
Big Data yields useful results once insightful queries and processing have been run on it. But because this data is highly unstructured, conventional tools cannot perform these analytics. Hadoop is a framework capable of analyzing data of this type and size. It is an excellent framework for processing Big Data, but it has some limitations, which we will study in the coming sections. Apache Spark was developed to address these limitations.
Data Processing Speed
Hadoop is an excellent way of analyzing data, but one thing it cannot do is deliver insights from the collected Big Data in a short time. Hadoop MapReduce jobs can take a long time to execute, and the resulting business insights might no longer be relevant by the time they reach the stakeholders. These results only help the business when they arrive within a short timeframe. This is another problem that Apache Spark solves.
What is Apache Spark?
Apache Spark is an open source data processing framework that performs analytic operations on Big Data in a distributed environment. It began as an academic project, started by Matei Zaharia at UC Berkeley’s AMPLab in 2009. Spark was initially built on top of a cluster management tool known as Mesos, and was later modified and upgraded so that it can work in a cluster-based environment with distributed processing.
Spark answers Hadoop’s limitations in Big Data processing speed by keeping the intermediate results of its processing in memory. As a result, analytics operations can run up to 100 times faster than standard MapReduce jobs. A further advantage of managing data in memory is that Apache Spark starts writing data to disk when the in-memory data reaches its threshold.
Spark follows the concept of Lazy Evaluation on its RDDs (Resilient Distributed Datasets). This means that the transformations we apply to an RDD are not executed until an action actually needs their result, which avoids computations that would consume memory unnecessarily. This way, Spark avoids executing tasks immediately and instead maintains metadata about the operations it needs to perform in a DAG (Directed Acyclic Graph).
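To make the idea of Lazy Evaluation concrete, here is a minimal sketch in plain Python. It is a toy model and not the Spark API: the `LazyDataset` class and its `map`, `filter` and `collect` methods are hypothetical names chosen for illustration. Transformations only record a step in a plan (a linear chain, the simplest possible DAG), and nothing runs until the `collect` action is called.

```python
# Toy model of lazy evaluation; this is NOT the Spark API.
class LazyDataset:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, source, plan=()):
        self._source = source      # the underlying data (any re-iterable)
        self._plan = tuple(plan)   # recorded steps, in order

    def map(self, fn):
        # Transformation: nothing is computed, the step is only recorded.
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._plan + (("filter", pred),))

    def collect(self):
        # Action: the recorded plan is executed now, element by element.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []  # records every element the map function actually touches


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
pipeline = numbers.map(square).filter(lambda x: x % 2 == 1)

calls_before_action = len(calls)   # still 0: only the plan exists so far
result = pipeline.collect()        # the action triggers the whole plan
```

Building `pipeline` costs nothing, because `square` is first invoked inside `collect`. Real Spark goes further and uses the recorded plan to optimise and distribute the work, which this sketch does not attempt.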
Some features of Apache Spark are:
- Spark supports many more operations than the map and reduce functions of the MapReduce framework
- Spark is written in the Scala programming language and runs on the JVM
- Spark APIs are available in several programming languages, including Scala, Java, Python and R, which makes building applications with Spark easy and flexible
- Spark also offers an interactive shell for quick operations and task execution; it is available for Scala (spark-shell) and Python (pyspark), and an R shell (sparkR) is provided as well
- Because Spark can run on top of a Hadoop cluster, it can easily process data stored in HBase, and so it acts like an extension to your current application environment
- It is an excellent processing framework for iterative tasks used in Machine Learning Algorithms
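The benefit for iterative tasks can be sketched in a few lines of plain Python. This is a toy illustration and not Spark code: `load_points` is a hypothetical stand-in for an expensive read from disk, and `iterate_mean` is a made-up iterative algorithm. The point is only to count how often the input is read when every iteration reloads it, compared with reading it once and keeping it in memory.

```python
# Toy illustration only; none of these names are Spark APIs.
loads = {"count": 0}   # how many times the "expensive" read happened


def load_points():
    """Hypothetical stand-in for an expensive read from disk or HDFS."""
    loads["count"] += 1
    return [1.0, 2.0, 3.0, 4.0]


def iterate_mean(get_points, steps):
    """Move an estimate halfway towards the data mean on every pass."""
    estimate = 0.0
    for _ in range(steps):
        points = get_points()              # one full pass over the data
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate


# Reload-per-iteration style: every pass reads the input again.
loads["count"] = 0
uncached_result = iterate_mean(load_points, steps=10)
uncached_loads = loads["count"]

# In-memory style: read once, keep the data around, reuse it on every pass.
loads["count"] = 0
cached_points = load_points()
cached_result = iterate_mean(lambda: cached_points, steps=10)
cached_loads = loads["count"]
```

Both runs compute the same estimate, but the first reads the input ten times and the second only once. Machine Learning algorithms that make many passes over the same dataset benefit from in-memory processing in the same way.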
Components of Apache Spark
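Before we discuss the components of Apache Spark one by one, let’s look at how they fit together to form an ecosystem. The Core APIs form the base, and Spark SQL, Spark Streaming, MLlib and GraphX are built on top of them.
Figure: Apache Spark Ecosystem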
Apache Spark Core APIs
The Apache Spark Core APIs contain the execution engine of the Spark platform, which provides in-memory computing and the ability to reference datasets stored in external storage systems. The core is also responsible for task scheduling, dispatching and basic I/O functionality. In our programs, we can use the Core APIs from the Python, Scala, Java and R programming languages.
Spark SQL
Spark SQL provides users with SQL-based APIs for running SQL queries that perform computations on these Big Data datasets. This way, even a Business Analyst can run Spark jobs by writing simple SQL queries and dive deep into the available data.
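The appeal of a SQL interface is that the analyst states what result is wanted and leaves the how to the engine. The sketch below illustrates that style of query using Python's built-in `sqlite3` module as a stand-in; it is not Spark SQL, and the `clicks` table and its columns are invented for the example.

```python
import sqlite3

# Hypothetical click data: (user_name, page)
clicks = [
    ("alice", "home"),
    ("bob", "home"),
    ("alice", "cart"),
    ("carol", "home"),
]

# sqlite3 is used here only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_name TEXT, page TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)

# A declarative aggregate: visits per page, most visited first.
rows = conn.execute(
    "SELECT page, COUNT(*) AS visits "
    "FROM clicks GROUP BY page ORDER BY visits DESC"
).fetchall()
conn.close()
```

A query of this shape is the kind of thing Spark SQL lets an analyst run over much larger, distributed datasets.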
Spark Streaming
Spark Streaming is an extremely useful API that enables high-throughput, fault-tolerant stream processing from various data sources.
This makes Spark an excellent tool for processing real-time streaming data. The fundamental stream unit in the API is called a DStream (discretized stream), which represents a continuous stream of data as a sequence of RDDs.
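The idea behind a discretized stream can be sketched in plain Python: a continuous sequence of events is cut into small batches by time interval, and each batch is then processed like an ordinary dataset. This is a toy model and not the Spark Streaming API; `micro_batches` and the sample events are hypothetical, and the events are assumed to arrive ordered by timestamp.

```python
from collections import Counter
from itertools import groupby


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive time-based batches.

    Assumes events are ordered by timestamp; empty intervals yield no batch.
    """
    def batch_index(event):
        return int(event[0] // batch_interval)

    for _, group in groupby(events, key=batch_index):
        yield [value for _, value in group]


# Hypothetical page-view events: (seconds since start, page)
events = [
    (0.2, "home"),
    (0.7, "cart"),
    (1.1, "home"),
    (1.9, "home"),
    (2.5, "cart"),
]

batches = list(micro_batches(events, batch_interval=1.0))

# Each batch is processed independently, like a small static dataset.
counts_per_batch = [Counter(batch) for batch in batches]
```

With a one-second interval, the five events fall into three batches, and the page counts are computed per batch rather than over the whole stream.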
MLlib (Machine Learning Library)
MLlib is a collection of Machine Learning algorithms which can be used to perform tasks like data cleaning, classification, regression, feature extraction, dimensionality reduction etc. It also includes optimization primitives like SGD and L-BFGS.
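To show what an optimization primitive like SGD (stochastic gradient descent) does, here is a minimal single-machine sketch in plain Python that fits a straight line to a few points. It is not MLlib's implementation, which is distributed; the function name, learning rate and epoch count are assumptions chosen for this small example.

```python
import random


def sgd_linear_regression(data, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by updating w and b after every single sample."""
    rng = random.Random(seed)      # fixed seed keeps the run repeatable
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)       # visit the samples in a random order
        for x, y in samples:
            error = (w * x + b) - y          # prediction error for this sample
            w -= learning_rate * error * x   # gradient step for the slope
            b -= learning_rate * error       # gradient step for the intercept
    return w, b


# Noise-free points on the line y = 2x + 1
points = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = sgd_linear_regression(points)
```

On this data the fitted `w` and `b` should end up close to 2 and 1. Each update looks at a single sample, which is what makes the method practical for datasets too large to process in one go.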
GraphX
GraphX is the Spark API for graph-related computations. It extends the Spark RDD by introducing the Resilient Distributed Property Graph, a directed graph with properties attached to each vertex and edge.
Just like the Spark Streaming and Spark SQL APIs, GraphX builds on the Spark RDD APIs. It also contains many operators for manipulating graphs, along with a set of graph algorithms.
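A property graph can be pictured with ordinary Python data structures: vertices and edges that each carry their own properties, plus operators that work on them. The sketch below is a toy model and not the GraphX API; the sample data and the `in_degrees` and `subgraph` helpers are hypothetical, loosely modelled on the kind of operators a graph library offers.

```python
from collections import defaultdict

# Vertices: id -> properties. Edges: (source, destination, properties).
vertices = {
    1: {"name": "alice"},
    2: {"name": "bob"},
    3: {"name": "carol"},
}
edges = [
    (1, 2, {"relation": "follows"}),
    (3, 2, {"relation": "follows"}),
    (2, 1, {"relation": "blocks"}),
]


def in_degrees(edges):
    """Count the incoming edges of every vertex that has at least one."""
    counts = defaultdict(int)
    for _, dst, _ in edges:
        counts[dst] += 1
    return dict(counts)


def subgraph(vertices, edges, edge_pred):
    """Keep the edges whose properties satisfy edge_pred, plus their endpoints."""
    kept = [edge for edge in edges if edge_pred(edge[2])]
    ids = {v for src, dst, _ in kept for v in (src, dst)}
    return {i: vertices[i] for i in ids}, kept


degrees = in_degrees(edges)
follow_vertices, follow_edges = subgraph(
    vertices, edges, lambda props: props["relation"] == "follows"
)
```

Here `degrees` shows that vertex 2 has two incoming edges, and the subgraph keeps only the "follows" relationships. A distributed graph library applies the same kind of operators to graphs partitioned across a cluster.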
In this lesson, we studied Apache Spark, which consists of many components and is an excellent way to get faster results in comparison to Hadoop MapReduce. Read more Big Data posts to gain deeper knowledge of the available Big Data tools and processing frameworks.