Skip to main content
Version: 0.9

Spark Tuning for Feature Retrieval in Tecton

This guide provides foundational steps to optimize Spark configurations for feature retrieval jobs in Tecton. Effective tuning ensures cost-effective, scalable, and performant operations, even as feature sets and data sizes grow.

1. Understanding the Context of Feature Retrieval​

Feature retrieval jobs typically involve:

  • Querying raw or preprocessed data.
  • Joining the data to a dataset (e.g., primary key + timestamp).
  • Materializing computed features for offline or online use.

Key factors influencing performance:

  • Dataset Size: Impacts shuffle operations and join performance.
  • Feature Definitions: Complexity and number of transformations or aggregations.
  • Data Sources: Location (e.g., S3, Delta Lake, Snowflake) and access patterns.
  • Cluster Environment: Whether running on DBR (Databricks Runtime) or EMR (Elastic MapReduce).

2. How Spark Tuning Varies Between DBR and EMR​

Although Spark core concepts apply to both platforms, tuning strategies may differ:

AspectDBR-Specific TuningEMR-Specific Tuning
Cluster ManagementAuto-scaling enabled by default.Requires manual configuration for instance counts.
Delta Lake OptimizationUse OPTIMIZE and Z-order clustering for Delta.Ensure S3-based Delta tables are well-partitioned.
Resource AllocationLeverage pre-configured instance types.Choose appropriate EC2 types (e.g., r5, c5).
Dynamic AllocationEnabled and works efficiently out-of-the-box.May require custom tuning or be disabled entirely.
Shuffle HandlingBuilt-in optimizations for handling shuffle.Ensure sufficient disk space for shuffle storage.
MonitoringLeverage Databricks’ built-in monitoring tools.Use Spark UI on EMR or integrate with CloudWatch.

3. Instance Sizes and Types​

Choose instance types based on workload requirements:

  • Compute-Intensive Jobs: For heavy transformations or aggregations, use compute-optimized instances (e.g., c5.xlarge).
  • Memory-Intensive Jobs: For large joins or in-memory caching, use memory-optimized instances (e.g., r5.2xlarge).

Instance Availability Considerations:
For cost optimization, consider using a mix of On-Demand and Spot Instances:

  • Spot Instances: Spot Instances are cost-effective but can be reclaimed by AWS at any time. To avoid interruptions in your Spark job, configure your cluster to use a combination of Spot and On-Demand Instances.

  • Best Practice: Set the first_on_demand parameter to 1 or 2. This ensures that your cluster retains the driver and at least one worker as On-Demand Instances, maintaining job stability even if Spot Instances are reclaimed.

for Databricks Users

When configuring memory settings, be aware that some instance types have off-heap memory enabled by default. This can affect how you set spark.executor.memory, as the off-heap memory allocation will reduce the amount of memory available for on-heap operations. For more details, refer to this Databricks Knowledge Base article.

Instance TypeUse Case
General PurposeBalanced workloads.
Compute-OptimizedHeavy transformations.
Memory-OptimizedLarge-scale joins or caching.

4. Worker Count and Cluster Sizing​

Cluster size and worker count impact parallelism and completion time:

  • Match worker count to data partitioning.
  • Ensure each partition is ~128MB–256MB in size.
Cluster SizeUse Case
SmallPrototyping and testing.
MediumMid-sized feature retrieval jobs.
LargeProduction-scale feature pipelines.

5. Dataset Size and Partitioning​

Partitioning the dataset effectively is key to scalable jobs:

  • Use balanced partitions. A rule of thumb is to divide your dataset size by the partition size to determine the appropriate number of partitions.
  • Partitioning during processing: Be cautious when increasing the number of partitions during processing, as this can lead to high data transit costs and increased network latency. Instead of using repartition, consider using Spark SQL partitioning hints, which leverage the Spark optimizer to make partitioning more efficient. For more information, refer to the Spark SQL Partitioning Hints documentation.
  • Reducing partitions: When reducing partitions, using coalesce is generally more efficient than repartition as it avoids shuffling data across the network.
  • Monitor dataset size. Regularly monitor the dataset size and adjust configurations as your data grows.
  • For EMR: Always set at least 50 GB for root volumes. This can be done using the root_volume_size_in_gb parameter.

6. Key Spark Configurations​

Here are some Spark configurations to consider for your Spark configuration:

ConfigurationDescriptionSuggested Value
spark.executor.memoryMemory allocated per executor.Start with 4–8 GB.
spark.executor.coresNumber of cores per executor.2–4 cores.
spark.sql.shuffle.partitionsNumber of partitions for shuffle operations.Dataset size / partition size = Number of partitions
spark.memory.fractionFraction of JVM heap allocated to execution and storage.Default is 0.6; adjust if needed.
spark.storage.memoryFractionFraction of heap memory reserved for storage (e.g., caching).Default is 0.5; tweak if caching heavily.
spark.sql.broadcastTimeoutTimeout for broadcast joins.Increase for large datasets (e.g., 300 sec).
spark.network.timeoutTimeout for network connections.Default is 120 sec; increase if necessary.
spark.driver.memoryMemory allocated to the driver.Start with 4–8 GB for medium jobs.
spark.driver.maxResultSizeIf estimated size of the data is larger than maxResultSize given job will be aborted. The goal here is to protect your application from driver lossStart with 4 G
spark.executor.instancesTotal number of executors.Depends on cluster size.
spark.sql.autoBroadcastJoinThresholdSize threshold for broadcasting join tables.Default is 10MB; tweak for large joins.

7. Debugging and Optimization​

  • Monitor Job Execution:
    • Use Spark UI to identify long-running stages and shuffles.
    • Check executor logs for out-of-memory errors.
    • Monitor data shuffling for possible improvements.
  • Optimize Queries:
    • Push down filters to reduce data read volume.
    • Minimize wide transformations like groupBy or join by reordering operations.
  • Leverage Profiling Tools:
    • Use DBR's query optimizer for SQL plans.
    • Enable Spark event logging for detailed profiling on EMR.

8. Iteration and Support​

Spark tuning is an iterative process. Start with smaller datasets and test configurations incrementally before scaling to production workloads.

If you encounter challenges, reach out to Tecton Support for assistance.

Was this page helpful?