Apache Spark Join Strategies Part III

Saurabh Bhedokar
2 min read · Jan 14, 2024


The “Shuffle Hash Join” is a join algorithm Apache Spark uses to combine data from two DataFrames or Datasets. It executes joins efficiently through the strategic use of partitioning and hashing.

When the smaller table is still too large to broadcast, a broadcast join can cause memory issues on both the driver and the executors; in that case, the Shuffle Hash Join is the right choice.

It is an expensive join, as it involves both shuffling and hashing, and it requires memory and computation to build and maintain a hash table.

Spark shuffles the data across the cluster so that records with the same join key end up in the same partition, regardless of which worker node they started on. This colocation step is what makes the Shuffle Hash Join possible.
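To make the colocation concrete, here is a minimal sketch of deciding which partition a key belongs to. It mirrors the modulo-of-hash idea behind Spark's HashPartitioner, but the helper itself is illustrative, not Spark's actual API:

```scala
// Minimal sketch of how hash partitioning colocates join keys.
// This standalone helper is illustrative, not Spark's API.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw // keep the index non-negative
}

// Rows from *both* tables run through the same function, so equal keys
// always land in the same shuffle partition, on whichever node holds it:
partitionFor("user_42", 200) // same result for either table, on every node
```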

Here’s an overview of how a Shuffle Hash Join works in Spark and the relevant configurations:

How Shuffle Hash Join Works

Step 1: Shuffling: Both join tables are partitioned on the join key, and the data is shuffled across the cluster so that records sharing the same join key are assigned to the same partition.

Step 2: Hash Join: A classic single-node hash join is then performed independently on the data within each partition: Spark builds an in-memory hash table from the smaller side and probes it with rows from the other side.
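As an illustration of the per-partition work, here is a toy build-and-probe hash join in Scala. The case classes, field names, and build-side choice are assumptions made for the example, not Spark internals:

```scala
// Toy single-partition hash join (inner join on `id`).
// In Spark, this logic runs independently inside every shuffle partition.
case class Order(id: Int, amount: Double) // larger (probe/stream) side
case class User(id: Int, name: String)    // smaller (build) side

def hashJoinPartition(orders: Seq[Order], users: Seq[User]): Seq[(Order, User)] = {
  // Build phase: an in-memory hash table over the smaller side.
  val buildTable: Map[Int, Seq[User]] = users.groupBy(_.id)
  // Probe phase: stream the larger side, looking up matches per row.
  for {
    order <- orders
    user  <- buildTable.getOrElse(order.id, Seq.empty)
  } yield (order, user)
}
```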

NOTE: For Spark to consider a Shuffle Hash Join, spark.sql.join.preferSortMergeJoin must be set to false (it defaults to true).
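A minimal sketch of steering Spark toward this strategy. The table names are placeholders, and whether Spark ultimately picks a Shuffle Hash Join still depends on its size estimates for the build side:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("shuffle-hash-join-demo")
  // Do not let the planner default to Sort Merge Join.
  .config("spark.sql.join.preferSortMergeJoin", "false")
  // Disable auto-broadcast so Spark does not broadcast the small side instead.
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()

// Placeholder table names; substitute your own sources.
val orders = spark.table("orders")
val users  = spark.table("users")

val joined = orders.join(users, Seq("user_id"), "inner")
joined.explain() // a ShuffledHashJoin node means the strategy was chosen
```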

Important considerations for the “Shuffle Hash Join” in Apache Spark:

1. Equality Joins Only:
The algorithm supports only ‘=’ (equality) join conditions; non-equi joins cannot use it.

2. Unsorted Join Keys:
Unlike some other join algorithms, the join keys do not require sorting beforehand, simplifying the preparation process.

3. Join Type Limitation:
While applicable to most join types, it does not support full outer joins on older Spark versions; Spark 3.1 and later added full outer support for shuffled hash joins.

4. Resource Intensiveness:
It is considered an expensive join because it involves both shuffling and hashing. Building and maintaining the hash table demands memory, and the hashing itself, as explained earlier, adds computational overhead.
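On Spark 3.0 and later, a join strategy hint can also request this algorithm for a single join instead of flipping session-wide configs. A short sketch, reusing the placeholder DataFrames from the earlier configuration example:

```scala
// Spark 3.0+ join strategy hint: request a shuffled hash join for this
// join only. `orders` and `users` are the placeholder DataFrames above.
val hinted = orders.join(users.hint("shuffle_hash"), Seq("user_id"))
hinted.explain() // physical plan should show a ShuffledHashJoin node

// Equivalent SQL form:
// SELECT /*+ SHUFFLE_HASH(u) */ *
// FROM orders o JOIN users u ON o.user_id = u.user_id
```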

In summary, the Shuffle Hash Join in Apache Spark is well suited for equality joins on unsorted keys when the tables are too large to broadcast, and it handles most join types (with full outer joins as the caveat noted above). However, its resource-intensive nature, involving both shuffling and hashing, makes it relatively costly in terms of memory and computation.
