Spark Memory Model
Spark is a distributed data processing framework. It works best with huge amounts of data (what we commonly call Big Data). Since we are dealing with enormous amounts of data, it is important to understand the memory model of the framework, which gives you better flexibility when processing data. Here we will consider Spark running on top of the YARN resource manager.
Spark has mainly two components where we are concerned about memory:
- Driver
- Executor
Driver
The driver is where all local computation happens. Sometimes it is required to collect a large amount of data from the executors to the driver, or to perform heavy local computation. Such memory-intensive activities can slow down or break your application.
The driver has two memory divisions, i.e. driver overhead and driver memory. Driver overhead is an amount of off-heap memory (in megabytes) that accounts for JVM overheads, interned strings, and other native overheads. Driver memory is the amount of heap memory supplied to Spark during spark-submit.
Overall memory consumed by driver = Driver overhead + Driver memory
Driver overhead = max(MEMORY_OVERHEAD_FACTOR * requested driver memory, MEMORY_OVERHEAD_MINIMUM)
MEMORY_OVERHEAD_FACTOR = 0.10
MEMORY_OVERHEAD_MINIMUM = 384 MB
The parameters controlling driver overhead and driver memory are "spark.yarn.driver.memoryOverhead" and "spark.driver.memory" respectively. We can also limit the amount of data collected from executors to the driver by setting the parameter "spark.driver.maxResultSize".
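As a rough sketch (the application name and all memory values here are hypothetical, not taken from this post), this is how those parameters can be supplied. Note that "spark.driver.memory" and "spark.yarn.driver.memoryOverhead" size the driver JVM itself, so in practice they should be passed through spark-submit or spark-defaults.conf before the driver starts; they appear below only to show the parameter names.

import org.apache.spark.sql.SparkSession

// Hypothetical configuration values, for illustration only.
val spark = SparkSession.builder()
  .appName("driver-memory-demo")
  .config("spark.driver.memory", "4g")                // driver heap (normally set via spark-submit)
  .config("spark.yarn.driver.memoryOverhead", "512")  // off-heap overhead in MB (normally set via spark-submit)
  .config("spark.driver.maxResultSize", "1g")         // cap on data collected back to the driver
  .getOrCreate()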
Executor
The executor is where distributed computation happens, so understanding the executor memory model is very important. Most memory-related issues are seen in executors.
Broadly, an executor has four memory divisions:
- Executor overhead
- Reserved Memory
- User Memory
- Spark memory
Executor overhead: This is the amount of off-heap memory (in megabytes) that accounts for JVM overheads, interned strings, and other native overheads. It is calculated as below:
Executor overhead = max(MEMORY_OVERHEAD_FACTOR * requested executor memory, MEMORY_OVERHEAD_MINIMUM)
MEMORY_OVERHEAD_FACTOR = 0.10
MEMORY_OVERHEAD_MINIMUM = 384 MB
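A minimal sketch of this formula (the object and method names are mine, and the 4 GB / 2 GB executor sizes are example values, not from this post):

// Hypothetical helper that mirrors the overhead formula described above.
object OverheadCalc {
  val MemoryOverheadFactor = 0.10
  val MemoryOverheadMinimumMb = 384L

  def executorOverheadMb(requestedExecutorMemoryMb: Long): Long =
    math.max((MemoryOverheadFactor * requestedExecutorMemoryMb).toLong, MemoryOverheadMinimumMb)

  def main(args: Array[String]): Unit = {
    // A 4 GB executor: max(0.10 * 4096, 384) = 409 MB of overhead
    println(executorOverheadMb(4096)) // 409
    // A small 2 GB executor falls back to the 384 MB minimum
    println(executorOverheadMb(2048)) // 384
  }
}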
Reserved memory: This is an amount of heap memory reserved for Spark's internal objects. It is a fixed amount of 300 MB, which can be seen in the Spark 2.2 source code linked below. Spark executor memory should be at least 1.5 times the reserved memory, i.e. 1.5 * 300 = 450 MB. If the Spark executor memory parameter is set lower than this, an exception is thrown.
Source code link : Spark Source Code
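A paraphrased sketch of that check, based on UnifiedMemoryManager in the Spark 2.x source (not a verbatim copy; the exact error message in Spark differs):

// Paraphrased from Spark's UnifiedMemoryManager; illustrative only.
val reservedMemory: Long = 300L * 1024 * 1024                   // 300 MB reserved for internal objects
val minSystemMemory: Long = (reservedMemory * 1.5).ceil.toLong  // 450 MB minimum heap

def checkExecutorMemory(systemMemoryBytes: Long): Unit = {
  if (systemMemoryBytes < minSystemMemory) {
    throw new IllegalArgumentException(
      s"Executor memory $systemMemoryBytes must be at least $minSystemMemory bytes. " +
      "Please increase executor memory using spark.executor.memory.")
  }
}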
User memory: This is the amount of memory left over from the Spark executor heap after the allocation of Spark memory and reserved memory. It is used to store user objects (like user-defined data structures). The calculation below determines user memory:
User memory = (Spark executor memory - 300 MB) * (1 - "spark.memory.fraction")
Spark memory: This is a fraction of (executor heap - 300 MB) that is used for Spark's data computation and for in-memory storage of partitions. We can provide Spark a configuration, "spark.memory.fraction", which is used to calculate Spark memory. Below is the calculation:
Spark memory = (Spark executor memory - 300 MB) * "spark.memory.fraction"
Further "Spark memory" has been divided into two parts i.e Execution memory and Storage memory .Boundary between these two parts is defined by "spark.memory.storaageFraction" . So below calculation will determine the two areas.
Execution memory= Spark memory* (1-spark.memory.storaageFraction)
Storage memory = Spark memory* spark.memory.storaageFraction
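As a worked example of the whole executor breakdown (the 4 GB executor size is an assumption for illustration; 0.6 and 0.5 are the Spark 2.x defaults for spark.memory.fraction and spark.memory.storageFraction):

// Worked example of the executor memory breakdown described above.
object ExecutorMemoryBreakdown extends App {
  val executorMemoryMb = 4096L   // spark.executor.memory = 4g (assumed)
  val reservedMemoryMb = 300L    // fixed reserved memory
  val memoryFraction   = 0.6     // spark.memory.fraction (default)
  val storageFraction  = 0.5     // spark.memory.storageFraction (default)

  val usableMb    = executorMemoryMb - reservedMemoryMb   // 3796
  val sparkMemMb  = (usableMb * memoryFraction).toLong    // 2277
  val userMemMb   = usableMb - sparkMemMb                 // 1519
  val storageMb   = (sparkMemMb * storageFraction).toLong // 1138
  val executionMb = sparkMemMb - storageMb                // 1139

  println(s"Spark memory: $sparkMemMb MB, user memory: $userMemMb MB")
  println(s"  storage: $storageMb MB, execution: $executionMb MB")
}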
The above calculation is valid for the initial memory allocation. However, if execution requires more memory and the storage area has free space, that free space can be used for execution. In the same way, if storage needs more space and free space is available in the execution area, that free space can be used for storage. This kind of memory management is called unified memory management. There is one asymmetry: if the storage area has no free space and the execution area needs to grow, storage data can be evicted according to an LRU mechanism, but the reverse is not possible, i.e. execution memory can never be evicted.