Spark Architecture (System Architecture)

星夢妙者

Published: 2025-06-27 13:00:25

Source: php中文网 (original article)

Let's look at the Apache Spark architecture, starting with a high-level overview and then discussing some of the key software components in detail.

High-Level Overview

A Spark application is made up of several components that cooperate to process data in a distributed environment. Understanding them is essential to understanding how Spark works. The key components are:

Driver Program
Master Node
Worker Node
Executor
Tasks
SparkContext
SQL Context
Spark Session

Here's an overview of how these components integrate within the overall architecture:

[Figure: Apache Spark application architecture - Standalone mode]

Driver Program

The Driver Program is the main program of a Spark application. The machine that hosts the application process, which initializes the SparkContext and Spark Session, is called the Driver Node, and the running process is called the Driver Process. The driver communicates with the Cluster Manager to acquire executors and then distributes tasks to them.

Cluster Manager

As the name suggests, a Cluster Manager manages the resources of a cluster. Spark supports several cluster managers, including YARN, Mesos, and its own Standalone cluster manager. In a standalone setup there are two kinds of long-running daemons: a master daemon on the master node and a worker daemon on each worker node. Cluster managers and deployment models are covered in more detail in Chapter 8, Operating in Clustered Mode.

Worker

If you are familiar with Hadoop, a Worker Node is akin to a slave node. Worker nodes are the machines where the actual work happens, inside Spark executors. Each worker reports its available resources to the master node. Typically every node in a Spark cluster except the master runs a worker process, and one Spark worker daemon is started per worker node. That daemon then launches and monitors executors for the applications.

Executors

The master allocates resources and uses the workers across the cluster to create Executors for the driver, which then uses these executors to run its tasks. Executors are launched on a worker node only when job execution starts. Each application has its own executor processes, which can stay up for the lifetime of the application and run tasks in multiple threads. This isolates applications from one another, and it also means that data cannot be shared directly between different applications. Executors are responsible for running tasks and for keeping data in memory or on disk.
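As a rough illustration of how an application asks for its own executors, the sketch below sets a few standard Spark configuration properties on a SparkConf. The object name, host name, application name, and resource sizes are placeholders invented for this example, and it assumes spark-core is on the classpath.

```scala
import org.apache.spark.SparkConf

object ExecutorConfigSketch {
  // Builds a configuration that requests executors from a standalone master.
  // The host name, app name, and sizes below are illustrative assumptions.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("executor-config-sketch")
      .setMaster("spark://master-host:7077") // standalone master URL (hypothetical host)
      .set("spark.executor.memory", "2g")    // heap size of each executor process
      .set("spark.executor.cores", "2")      // cores, and so concurrent task threads, per executor
      .set("spark.cores.max", "8")           // total cores the application may take (standalone/Mesos)

  def main(args: Array[String]): Unit = {
    // Only prints the settings; no connection to a cluster is made here.
    println(buildConf().toDebugString)
  }
}
```

Because every application carries its own settings like these, two applications on the same cluster end up with separate executor processes rather than sharing them.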

Tasks

A task is a unit of work sent to one executor. Specifically, it is a command sent from the Driver Program to an executor by serializing a Function object. The executor deserializes the command (its code is part of the JAR you have already loaded) and executes it on one partition of the data.
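The serialize, ship, deserialize, and run cycle described above can be imitated in plain Scala without Spark. This is only a simplified model: PartitionTask and ScaledSum are hypothetical names made up for the example, the "network" is just a byte array, and Spark's real task classes and scheduler are considerably more involved.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object TaskSerializationSketch {
  // Hypothetical stand-in for a task: serializable logic applied to one partition.
  trait PartitionTask extends Serializable {
    def run(partition: Seq[Int]): Int
  }

  // Example task: multiply every element by `factor` and sum the results.
  final case class ScaledSum(factor: Int) extends PartitionTask {
    def run(partition: Seq[Int]): Int = partition.map(_ * factor).sum
  }

  // "Driver" side: turn the task into bytes that could be sent to an executor.
  def serialize(task: PartitionTask): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(task) finally out.close()
    buffer.toByteArray
  }

  // "Executor" side: rebuild the task from bytes and run it on a local partition.
  def deserializeAndRun(bytes: Array[Byte], partition: Seq[Int]): Int = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[PartitionTask].run(partition)
    finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val bytes = serialize(ScaledSum(10))
    println(deserializeAndRun(bytes, Seq(1, 2, 3))) // 10 + 20 + 30 = 60
  }
}
```

Deserialization only works because the receiving side already has the ScaledSum class available, which mirrors the point that the task's code must be part of a JAR the executor has loaded.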

A partition is a logical chunk of data distributed across a Spark cluster. Spark typically reads data from a distributed storage system and partitions it so that it can be processed in parallel across the cluster. For example, when reading from HDFS, Spark by default creates one partition per HDFS block. Partitions matter because Spark runs one task per partition, so the number of partitions determines how much parallelism is available. Spark sets the number of partitions automatically unless you specify it yourself, e.g., sc.parallelize(data, numPartitions).
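To make the one-task-per-partition relationship concrete, here is a minimal sketch that runs Spark in local mode. The object name PartitionSketch, the helper partitionSizes, the application name, and the local[2] master are assumptions made for the example; parallelize, getNumPartitions, and mapPartitions are standard RDD calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSketch {
  // Returns how many elements each partition holds, in partition order.
  def partitionSizes(sc: SparkContext, data: Seq[Int], numPartitions: Int): Array[Int] =
    sc.parallelize(data, numPartitions)
      .mapPartitions(iter => Iterator(iter.size)) // emits one value per partition, i.e. per task
      .collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("partition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100, 4) // explicitly request 4 partitions
      println(rdd.getNumPartitions)         // 4
      println(partitionSizes(sc, 1 to 100, 4).mkString(", "))
    } finally sc.stop()
  }
}
```

With four partitions and two local threads, Spark schedules four tasks and runs at most two of them at a time, which is why the partition count is worth choosing deliberately.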

SparkContext

SparkContext is the main entry point for Spark functionality. It represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables on that cluster. Only one SparkContext should be active per JVM, so you must call stop() on the active SparkContext before creating a new one. When you start the Python or Scala shell, for example in local mode, a SparkContext is created automatically and bound to the variable sc, so you can create RDDs from text files without initializing it yourself. The textFile and hadoopFile methods of SparkContext do that work:

/** 
 * Read a text file from HDFS, a local file system (available on all nodes), or any 
 * Hadoop-supported file system URI, and return it as an RDD of Strings.
 * The text files must be encoded as UTF-8.
 * 
 * @param path path to the text file on a supported file system
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of lines of the text file
 */
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

/** Get an RDD for a Hadoop file with an arbitrary InputFormat
 *
 * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
 * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
 * operation will create many references to the same object.
 * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
 * copy them using a `map` function.
 * @param path directory to the input data files, the path can be comma separated paths
 * as a list of inputs
 * @param inputFormatClass storage format of the data to be read
 * @param keyClass `Class` of the key associated with the `inputFormatClass` parameter
 * @param valueClass `Class` of the value associated with the `inputFormatClass` parameter
 * @param minPartitions suggested minimum number of partitions for the resulting RDD
 * @return RDD of tuples of key and corresponding value
 */
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
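From the caller's side, the SparkContext features described in this section might be used as in the following sketch. It assumes spark-core on the classpath and a local[2] master; the object name TextFileSketch, the helper countLongLines, and the sample file contents are invented for the example. It reads a text file with textFile, shares a threshold through a broadcast variable, counts blank lines with an accumulator, and stops the context at the end.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  // Returns (number of lines longer than minLength, number of blank lines).
  def countLongLines(sc: SparkContext, path: String, minLength: Int): (Long, Long) = {
    val threshold = sc.broadcast(minLength)           // read-only value shipped to executors
    val blankLines = sc.longAccumulator("blankLines") // updated by tasks, read on the driver
    val lines = sc.textFile(path, minPartitions = 2)
    val longLines = lines.filter { line =>
      if (line.trim.isEmpty) blankLines.add(1L)
      line.length > threshold.value
    }.count() // the action that actually triggers the tasks
    (longLines, blankLines.sum)
  }

  def main(args: Array[String]): Unit = {
    // Write a small UTF-8 sample file so the example does not depend on HDFS.
    val file = Files.createTempFile("spark-textfile-sketch", ".txt")
    Files.write(file, "alpha\n\na much longer line\n".getBytes(StandardCharsets.UTF_8))

    val sc = new SparkContext(new SparkConf().setAppName("textfile-sketch").setMaster("local[2]"))
    try {
      val (longLines, blankLines) = countLongLines(sc, file.toUri.toString, minLength = 5)
      println(s"long lines: $longLines, blank lines: $blankLines")
    } finally {
      sc.stop() // must happen before another SparkContext is created in this JVM
      Files.deleteIfExists(file)
    }
  }
}
```

The accumulator is read only after count() has run, because transformations such as filter are lazy and do nothing until an action is called.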

Spark Session

The Spark Session (class SparkSession) is the entry point for programming Spark with the Dataset and DataFrame API.
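A minimal sketch of that entry point, assuming spark-sql is on the classpath and a local[2] master: the object name SparkSessionSketch, the Person case class, the helper adultNames, and the sample data are hypothetical and exist only for this example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object SparkSessionSketch {
  // Hypothetical record type used only for this example.
  final case class Person(name: String, age: Int)

  // Returns the names of people aged minAge or older, sorted alphabetically.
  def adultNames(spark: SparkSession, people: Seq[Person], minAge: Int): Seq[String] = {
    import spark.implicits._ // brings in encoders and the $"column" syntax
    val ds: Dataset[Person] = people.toDS()
    ds.filter($"age" >= minAge).select($"name").as[String].collect().sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-session-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val people = Seq(Person("Ana", 34), Person("Bo", 12), Person("Cy", 51))
      println(adultNames(spark, people, 18).mkString(", ")) // Ana, Cy
      // The underlying SparkContext is still reachable from the session.
      println(spark.sparkContext.appName)
    } finally spark.stop()
  }
}
```

The session wraps a SparkContext rather than replacing it, so RDD-level code can still obtain the context through spark.sparkContext.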

For more in-depth information, you can refer to the following resource: Apache Spark Architecture.
