Spark Tuning for Feature Retrieval in Tecton
This guide provides foundational steps to optimize Spark configurations for feature retrieval jobs in Tecton. Effective tuning ensures cost-effective, scalable, and performant operations, even as feature sets and data sizes grow.
1. Understanding the Context of Feature Retrieval
Feature retrieval jobs typically involve:
- Querying raw or preprocessed data.
- Joining the data to a spine dataset (e.g., primary key + timestamp).
- Materializing computed features for offline or online use.
Key factors influencing performance:
- Dataset Size: Impacts shuffle operations and join performance.
- Feature Definitions: Complexity and number of transformations or aggregations.
- Data Sources: Location (e.g., S3, Delta Lake, Snowflake) and access patterns.
- Cluster Environment: Whether running on DBR (Databricks Runtime) or EMR (Elastic MapReduce).
2. How Spark Tuning Varies Between DBR and EMR
Although Spark core concepts apply to both platforms, tuning strategies may differ:
Aspect | DBR-Specific Tuning | EMR-Specific Tuning |
---|---|---|
Cluster Management | Auto-scaling enabled by default. | Requires manual configuration for instance counts. |
Delta Lake Optimization | Use OPTIMIZE and Z-order clustering for Delta. | Ensure S3-based Delta tables are well-partitioned. |
Resource Allocation | Leverage pre-configured instance types. | Choose appropriate EC2 types (e.g., r5, c5). |
Dynamic Allocation | Enabled and works efficiently out-of-the-box. | May require custom tuning or be disabled entirely. |
Shuffle Handling | Built-in optimizations for handling shuffle. | Ensure sufficient disk space for shuffle storage. |
Monitoring | Leverage Databricks' built-in monitoring tools. | Use Spark UI on EMR or integrate with CloudWatch. |
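For the Delta Lake row above, compaction and Z-order clustering on Databricks can be run as a SQL statement. A minimal sketch, assuming an active SparkSession `spark`; the table and column names are placeholders:

```python
# Compact small files and cluster data by a frequently-joined column.
# Table and column names are placeholders; runs on Databricks with Delta Lake.
spark.sql("OPTIMIZE feature_store.user_features ZORDER BY (user_id)")
```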
3. Instance Sizes and Types
Choose instance types based on workload requirements:
- Compute-Intensive Jobs: For heavy transformations or aggregations, use compute-optimized instances (e.g., c5.xlarge).
- Memory-Intensive Jobs: For large joins or in-memory caching, use memory-optimized instances (e.g., r5.2xlarge).
Instance Availability Considerations:
For cost optimization, consider using a mix of On-Demand and Spot Instances:
- Spot Instances: Spot Instances are cost-effective but can be reclaimed by AWS at any time. To avoid interruptions in your Spark job, configure your cluster to use a combination of Spot and On-Demand Instances.
- Best Practice: Set the first_on_demand parameter to 1 or 2. This keeps the driver (and, with a value of 2, one worker) as On-Demand Instances, maintaining job stability even if Spot Instances are reclaimed; see the sketch below.
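As an illustration, the Spot/On-Demand mix can be expressed in the cluster configuration attached to a feature view. A minimal sketch, assuming Tecton's `DatabricksClusterConfig` with these parameter names (verify them against your SDK version):

```python
from tecton import DatabricksClusterConfig

# Sketch: mixed Spot/On-Demand cluster. Parameter names assume Tecton's
# DatabricksClusterConfig; confirm them for your SDK version.
batch_compute = DatabricksClusterConfig(
    instance_type="r5.2xlarge",                  # memory-optimized for large joins
    number_of_workers=8,
    first_on_demand=2,                           # driver + 1 worker stay On-Demand
    instance_availability="spot_with_fallback",  # Spot for the remaining workers
)
```

The same idea applies on EMR via the analogous `EMRClusterConfig`.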
When configuring memory settings, be aware that some instance types have off-heap memory enabled by default. This can affect how you set spark.executor.memory, as the off-heap memory allocation will reduce the amount of memory available for on-heap operations. For more details, refer to this Databricks Knowledge Base article.
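If off-heap memory is enabled, the on-heap allocation (`spark.executor.memory`) and the off-heap allocation together must fit within the instance's available memory. A hedged sketch of the relevant standard Spark settings, with illustrative values:

```python
# Illustrative values only; size these against the chosen instance type.
spark_config = {
    "spark.executor.memory": "8g",            # on-heap allocation
    "spark.memory.offHeap.enabled": "true",   # on by default for some instance types
    "spark.memory.offHeap.size": "2g",        # also counts toward container memory
}
```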
Instance Type | Use Case |
---|---|
General Purpose | Balanced workloads. |
Compute-Optimized | Heavy transformations. |
Memory-Optimized | Large-scale joins or caching. |
4. Worker Count and Cluster Sizing
Cluster size and worker count impact parallelism and completion time:
- Match worker count to data partitioning.
- Ensure each partition is ~128MBβ256MB in size.
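As a back-of-the-envelope example with illustrative numbers, sizing a cluster for a 512 GB input:

```python
# Rough sizing arithmetic; all numbers are illustrative.
dataset_mb = 512 * 1024          # 512 GB input
target_partition_mb = 256        # keep partitions in the 128 MB-256 MB range

num_partitions = dataset_mb // target_partition_mb   # -> 2048 partitions
cores_per_executor = 4
task_waves = 8                   # each core processes ~8 tasks over the job
num_executors = num_partitions // (cores_per_executor * task_waves)  # -> 64
```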
Cluster Size | Use Case |
---|---|
Small | Prototyping and testing. |
Medium | Mid-sized feature retrieval jobs. |
Large | Production-scale feature pipelines. |
5. Dataset Size and Partitioning
Partitioning the dataset effectively is key to scalable jobs:
- Use balanced partitions. A rule of thumb is to divide your dataset size by the partition size to determine the appropriate number of partitions.
- Partitioning during processing: Be cautious when increasing the number of partitions during processing, as this can incur high data-transfer costs and network latency. Instead of calling repartition, consider Spark SQL partitioning hints, which let the Spark optimizer plan the repartitioning more efficiently. For more information, refer to the Spark SQL Partitioning Hints documentation, and see the sketch after this list.
- Reducing partitions: When reducing partitions, using coalesce is generally more efficient than repartition as it avoids shuffling data across the network.
- Monitor dataset size: Regularly check dataset size and adjust configurations as your data grows.
- For EMR: Always set at least 50 GB for root volumes. This can be done using the root_volume_size_in_gb parameter.
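A minimal sketch of the coalesce and hint approaches, assuming an active SparkSession `spark`, an existing DataFrame `df`, and a placeholder join key `user_id`:

```python
# Reduce partition count without a full shuffle, e.g., before writing output.
df_small = df.coalesce(64)

# Let the optimizer plan repartitioning via a SQL hint instead of repartition().
df.createOrReplaceTempView("features")
df_hinted = spark.sql("SELECT /*+ REPARTITION(256, user_id) */ * FROM features")
```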
6. Key Spark Configurations
Here are some Spark configurations to consider for feature retrieval jobs:
Configuration | Description | Suggested Value |
---|---|---|
spark.executor.memory | Memory allocated per executor. | Start with 4β8 GB. |
spark.executor.cores | Number of cores per executor. | 2β4 cores. |
spark.sql.shuffle.partitions | Number of partitions for shuffle operations. | Dataset size ÷ target partition size (128–256 MB). |
spark.memory.fraction | Fraction of JVM heap allocated to execution and storage. | Default is 0.6; adjust if needed. |
spark.memory.storageFraction | Fraction of unified memory (spark.memory.fraction) reserved for storage (e.g., caching). | Default is 0.5; tweak if caching heavily. |
spark.sql.broadcastTimeout | Timeout for broadcast joins. | Default is 300 sec; increase for large broadcasts. |
spark.network.timeout | Timeout for network connections. | Default is 120 sec; increase if necessary. |
spark.driver.memory | Memory allocated to the driver. | Start with 4β8 GB for medium jobs. |
spark.driver.maxResultSize | Maximum total size of serialized results returned to the driver; jobs exceeding it are aborted, protecting the driver from out-of-memory failures. | Start with 4 GB. |
spark.executor.instances | Total number of executors. | Depends on cluster size. |
spark.sql.autoBroadcastJoinThreshold | Size threshold for broadcasting join tables. | Default is 10MB; tweak for large joins. |
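These settings are typically passed through the cluster configuration's Spark properties. A sketch assuming Tecton's `spark_config` parameter (the values are starting points, not tuned answers):

```python
from tecton import DatabricksClusterConfig  # EMRClusterConfig is analogous

# Sketch: passing standard Spark settings through the cluster config.
batch_compute = DatabricksClusterConfig(
    instance_type="r5.2xlarge",
    number_of_workers=8,
    spark_config={
        "spark.executor.memory": "8g",
        "spark.executor.cores": "4",
        "spark.sql.shuffle.partitions": "2048",
        "spark.driver.maxResultSize": "4g",
    },
)
```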
7. Debugging and Optimization
- Monitor Job Execution:
- Use Spark UI to identify long-running stages and shuffles.
- Check executor logs for out-of-memory errors.
- Monitor data shuffling for possible improvements.
- Optimize Queries:
- Push down filters to reduce data read volume.
- Minimize wide transformations like groupBy or join by reordering operations.
- Leverage Profiling Tools:
- Inspect SQL query plans using Databricks' query profiling tools (e.g., the SQL tab in the Spark UI).
- Enable Spark event logging for detailed profiling on EMR.
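For example, event logging on EMR uses standard Spark properties and can be pointed at S3 so runs can be replayed in the history server (the bucket path is a placeholder):

```python
spark_config = {
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "s3://your-bucket/spark-event-logs/",
}
```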
8. Iteration and Support
Spark tuning is an iterative process. Start with smaller datasets and test configurations incrementally before scaling to production workloads.
If you encounter challenges, reach out to Tecton Support for assistance.