Introduction to Spark

  • 2020-06-23 02:26:26
  • OfStack

SPARK

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the AMP Lab at UC Berkeley (University of California, Berkeley). Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so there is no need to repeatedly read and write HDFS. Spark is therefore better suited to algorithms that iterate over the same data, such as those used in data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but there are some differences between the two, and these differences make Spark superior for certain workloads. In other words, Spark enables in-memory distributed datasets, which, in addition to providing interactive queries, can also optimize iterative workloads.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, so Scala can manipulate distributed datasets as easily as local collection objects.
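As a minimal sketch of that collections-like feel (names here are illustrative; `local[*]` runs Spark in-process for testing):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalLikeRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LocalLikeRdd").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A plain Scala collection...
    val local = Seq(1, 2, 3, 4, 5)
    // ...and its distributed counterpart, driven by the same operators.
    val distributed = sc.parallelize(local)

    val squaresOfEvens = distributed.filter(_ % 2 == 0).map(n => n * n)
    println(squaresOfEvens.collect().mkString(", ")) // prints: 4, 16

    sc.stop()
  }
}
```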

Although Spark was created to support iterative jobs over distributed datasets, it is actually a complement to Hadoop and can run in parallel within the Hadoop file system, a behavior supported through a third-party cluster framework called Mesos. Spark was developed by the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, to build large-scale, low-latency data analysis applications.

Spark now has a rapidly developing and widely used ecosystem.

Spark has three main features:

First, its high-level APIs take the focus off the cluster itself, allowing Spark application developers to concentrate on the computation the application performs.

Second, Spark is fast and supports interactive computing and complex algorithms.

Finally, Spark is a general-purpose engine that can be used to perform a variety of operations, including SQL queries, text processing, and machine learning. Before Spark, one generally had to learn and operate a variety of separate engines to handle these requirements.

Performance characteristics

Speed

For computations held in memory, Spark can run up to 100 times faster than Hadoop MapReduce.
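That figure refers to workloads whose working set fits in memory. A minimal sketch of the pattern behind it, reusing the `sc` from the sketch above (the HDFS path and the update rule are placeholders):

```scala
// Parse the input once and pin it in executor memory; every
// iteration then scans RAM instead of re-reading HDFS.
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache()

var weight = 0.0
for (_ <- 1 to 10) {
  // Each pass is an in-memory scan of the cached RDD.
  val gradient = points.map(p => p(0) * weight - p(1)).sum()
  weight -= 0.01 * gradient
}
```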

Ease of use

Spark provides more than 80 high-level operators.
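For example, the classic word count needs only a handful of those operators (`sc` as above; the input path is a placeholder):

```scala
// Word count with nothing but high-level operators -- no explicit
// MapReduce plumbing.
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)

counts.take(10).foreach(println)
```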

Generality

Spark provides a large number of libraries, including SQL, DataFrames, MLlib, GraphX, and Spark Streaming. Developers can use these libraries seamlessly in the same application.
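A minimal sketch of that seamlessness, with DataFrames and SQL in one program (names and data are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("MixedLibraries")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// DataFrames and SQL interoperate inside one application:
val logs = Seq(("alice", 3), ("bob", 7), ("alice", 2)).toDF("user", "clicks")
logs.createOrReplaceTempView("logs")

val totals = spark.sql(
  "SELECT user, SUM(clicks) AS total FROM logs GROUP BY user")

// The SQL result is an ordinary DataFrame again, so it can flow
// straight into MLlib, GraphX, or Streaming code in the same job.
totals.show()
```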

Support for multiple resource managers

Spark supports Hadoop YARN, Apache Mesos, and its own standalone cluster manager.
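In code this is just the master URL on the SparkConf; the exact URL forms vary slightly across Spark versions, and the host names below are placeholders:

```scala
import org.apache.spark.SparkConf

// The same application can target different cluster managers
// simply by changing the master URL:
val conf = new SparkConf()
  .setAppName("AnyClusterManager")
  .setMaster("spark://master-host:7077")   // standalone cluster manager
  // .setMaster("yarn")                    // Hadoop YARN
  // .setMaster("mesos://mesos-host:5050") // Apache Mesos
  // .setMaster("local[*]")                // local testing
```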

Spark ecosystem

Shark: Shark basically provides an HiveQL command interface like Hive's on top of the Spark framework. To maintain maximum compatibility with Hive, Shark uses Hive's API for query parsing and logical plan generation, while Spark replaces Hadoop MapReduce in the final physical plan execution stage. By configuring Shark parameters, Shark can automatically cache specific RDDs in memory for data reuse, speeding up the retrieval of specific datasets. At the same time, Shark implements specific data analysis and learning algorithms through user-defined functions (UDFs), so that SQL data queries and computational analysis can be combined in one program, maximizing the reuse of RDDs.
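Shark itself has since been retired in favor of Spark SQL (see below), so here is a sketch of that SQL-plus-UDF pattern in Spark SQL rather than Shark's own API, reusing the `spark` session (and its implicits) from the earlier example; names and data are made up:

```scala
// Register a Scala function as a SQL UDF, then mix analysis
// logic directly into a SQL query -- the pattern Shark pioneered.
spark.udf.register("isLongSession", (seconds: Int) => seconds > 600)

val sessions = Seq(("alice", 700), ("bob", 120)).toDF("user", "seconds")
sessions.createOrReplaceTempView("sessions")

spark.sql("SELECT user FROM sessions WHERE isLongSession(seconds)").show()
```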

SparkR: SparkR is an R package that provides a lightweight Spark front end for R. SparkR provides a distributed data frame data structure, removing the limitation that a data frame in R can only be used on a single machine. Like R's data frame, it supports many operations, such as select, filter, and aggregate, which nicely resolves R's bottleneck at big-data scale. SparkR also supports distributed machine learning algorithms, for example through the MLlib machine learning library. SparkR brought the vitality of the R language community to Spark, attracting a large number of data scientists to begin their data analysis journey directly on the Spark platform.

Basic principles

Spark Streaming: Built on Spark to process stream data, its basic principle is to divide the stream into small time slices (a few seconds) and process each slice in a batch-like fashion. Building on Spark pays off in two ways. On one hand, Spark's low-latency execution engine (around 100 ms or more) means it can be used for real-time computation, even if it is not quite as specialized as dedicated stream processing software. On the other hand, compared with record-at-a-time frameworks (such as Storm), RDDs with narrow dependencies allow fault tolerance by recomputing lost partitions from the source data. In addition, micro-batching makes it compatible with both batch and real-time data processing logic and algorithms, which is convenient for applications that require joint analysis of historical and real-time data.
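A minimal DStream sketch of that micro-batch model, reusing `sc` and assuming a text socket source (for example one fed by `nc -lk 9999`):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batch streaming: the stream is chopped into 2-second batches,
// and each batch is processed with the usual RDD-style operators.
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)

val streamCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

streamCounts.print()
ssc.start()
ssc.awaitTermination()
```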

Computation methods

Bagel: Pregel on Spark. Bagel lets you use Spark for graph computations, and it is a very useful small project. Bagel ships with an example implementing Google's PageRank algorithm.
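Bagel's own vertex/message API is not shown here; as an illustration of the kind of iterative graph computation it targets, here is a minimal PageRank sketch written directly on RDDs, reusing `sc` (the link data is made up; 0.85 is the usual damping factor):

```scala
// links: (page, outgoing neighbours) -- a tiny made-up graph.
val links = sc.parallelize(Seq(
  ("a", Seq("b", "c")), ("b", Seq("c")), ("c", Seq("a"))
)).cache()

var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  // Each page sends rank / outDegree to its neighbours; ranks are
  // then recombined with the damping factor.
  val contribs = links.join(ranks).values.flatMap {
    case (neighbours, rank) => neighbours.map((_, rank / neighbours.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(r => 0.15 + 0.85 * r)
}

ranks.collect().foreach(println)
```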

At present, Spark is no longer limited to real-time computation; it is aiming to be a general-purpose big data processing platform, and the termination of Shark and the launch of Spark SQL are perhaps early signs of this.

In recent years, parallelizing machine learning and data mining algorithms for big data has become an important research hotspot in the field. In earlier years, researchers and industry at home and abroad paid more attention to designing parallel algorithms on the Hadoop platform. However, Hadoop MapReduce has difficulty efficiently implementing machine learning algorithms that require heavy iterative computation, because of its high network and disk read/write costs. With the emergence and gradual maturation of Spark, the new-generation big data platform launched by UC Berkeley, attention at home and abroad has turned in recent years to how to design parallel machine learning and data mining algorithms on the Spark platform. To make it convenient to analyze data on Spark using the familiar R language, Spark provides a programming interface called SparkR, enabling ordinary data analysts to use Spark's parallel programming interface and powerful computing capacity from within the R environment.
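As one example of such an iterative algorithm expressed on Spark, here is a minimal k-means sketch using MLlib's RDD-based API, reusing `sc` (the input path, k, and iteration count are placeholders):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated feature vectors and cache them,
// since k-means rescans the data on every iteration.
val vectors = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Train with k = 2 clusters and at most 20 iterations.
val model = KMeans.train(vectors, 2, 20)
model.clusterCenters.foreach(println)
```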
