Version: 1.0

Integrating with Flink

Tecton integrates seamlessly with Apache Flink and is commonly used alongside it in modern streaming ML stacks.

While Tecton handles feature engineering (transforming and aggregating events into features for real-time inference and model training), Flink is often the right tool for upstream stateful event stream preparation.

Requirements and Limitations

Key points about this integration:

  • Two integration patterns available: Flink can integrate with Tecton through two supported patterns:
    • Using the Stream Ingest API directly
    • Publishing to Kafka/Kinesis and using Tecton's Spark Streaming integration
  • No Rift batch compute required: The Stream Ingest API pattern requires that the API be enabled, but it does not require using Rift for batch compute.

Integration Patterns

Tecton supports two architectural patterns for integrating with Flink:

In the first pattern (Stream Ingest API), Flink publishes cleaned events directly to Tecton's Stream Ingest API, where they are processed by the Rift compute engine.

This integration pattern is currently not available for VPC deployments, as the Stream Ingest API does not support VPC environments.
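As a sketch of the Stream Ingest API pattern, the snippet below assembles a request body that a Flink sink could POST to the API. The endpoint URL, field names, workspace, and push source names are illustrative assumptions, not a definitive schema; verify them against your deployment's API reference.

```python
import json
from datetime import datetime, timezone

def build_ingest_payload(workspace, push_source, records):
    """Assemble a Stream Ingest API request body.

    The field names (workspace_name, records) follow the general shape of
    Tecton's ingest payloads, but treat them as illustrative placeholders.
    """
    return {
        "workspace_name": workspace,
        "records": {
            push_source: [{"record": r} for r in records],
        },
    }

# Example: a cleaned event emitted by an upstream Flink job.
event = {
    "user_id": "user_123",
    "amount": 42.5,
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc).isoformat(),
}
payload = build_ingest_payload("prod", "transactions_push_source", [event])

# In a real Flink sink you would POST this body over HTTPS, e.g.:
#   requests.post("https://<cluster>.tecton.ai/ingest", json=payload,
#                 headers={"Authorization": "Tecton-key <API_KEY>"})
body = json.dumps(payload)
```

Because the ingest call is a plain HTTP request, it fits naturally into a Flink sink function that batches records and retries on transient failures.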

In the second pattern (Kafka/Kinesis), Flink publishes to Kafka or Kinesis, and Tecton's Spark Streaming integration consumes from these message queues.

Choosing the Right Pattern

| Consideration | Stream Ingest API | Kafka/Kinesis + Spark |
| --- | --- | --- |
| Throughput | Best for < 1k records/second | Better for > 1k records/second |
| Latency | Millisecond-level freshness | Second-level freshness |
| Infrastructure | No additional message queue needed | Requires Kafka/Kinesis infrastructure |
| Existing stack | Standalone solution | Better if already using Spark |
| Compute engine | Rift (Python-native) | Spark Structured Streaming |
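The trade-offs above can be condensed into a rough rule of thumb. The thresholds below simply restate the comparison and are not hard limits:

```python
def recommend_pattern(records_per_second, needs_millisecond_freshness=False):
    """Rough pattern recommendation based on the comparison above.

    These cutoffs mirror the guidance in the table; real deployments
    should also weigh existing infrastructure and team expertise.
    """
    if needs_millisecond_freshness:
        # Only the Rift-backed Stream Ingest API offers ms-level freshness.
        return "stream-ingest-api"
    if records_per_second < 1_000:
        return "stream-ingest-api"
    # High throughput (or an existing Spark stack) favors the queue-based path.
    return "kafka-kinesis-spark"
```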

If your upstream events are on Kafka or Kinesis, we recommend the following responsibility split.

Flink is responsible for:

  • Deduplicating events
  • Stream enrichment (e.g., joining with metadata)
  • Filtering malformed or irrelevant events
  • Ensuring at-least-once or exactly-once delivery

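In production, these steps would run as Flink operators (for example, a keyed process function holding seen IDs in state with a TTL). The plain-Python sketch below just illustrates the dedup-and-filter logic, using a hypothetical `event_id` field:

```python
def clean_stream(events, required_fields=("event_id", "user_id", "timestamp")):
    """Drop malformed events and duplicates, preserving arrival order.

    A real Flink job would keep the seen-ID set in keyed state with a TTL
    rather than in process memory, but the logic is the same.
    """
    seen = set()
    for event in events:
        # Filter malformed or irrelevant events.
        if not all(field in event for field in required_fields):
            continue
        # Deduplicate on a hypothetical event_id key.
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        yield event

raw = [
    {"event_id": "a", "user_id": "u1", "timestamp": 1},
    {"event_id": "a", "user_id": "u1", "timestamp": 1},  # duplicate
    {"user_id": "u2", "timestamp": 2},                   # malformed: no event_id
    {"event_id": "b", "user_id": "u2", "timestamp": 2},
]
cleaned = list(clean_stream(raw))  # only events "a" and "b" survive
```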
Flink transforms raw bronze data into clean silver streams. It's typically used by Data Engineers.

Further Reading: Please also see Confluent's page on the Shift Left paradigm, which explains in more detail how Apache Flink is used upstream to clean, enrich, and govern event data before it's consumed by downstream systems like Tecton.

Tecton is responsible for:

  • Applying row-level transformations
  • Applying time window aggregations
  • Leveraging Python packages or ML models to transform events
  • Joining in other precomputed features
  • Applying real-time transformations at feature request time
  • Serving features online or generating training data

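To make the time-window aggregation bullet concrete, here is a brute-force plain-Python sketch of the kind of value Tecton computes. In Tecton you would declare the aggregation in a feature view rather than code it by hand; the field names below are illustrative:

```python
from collections import defaultdict

def count_in_window(events, window_seconds, as_of):
    """Count events per user over a trailing window ending at `as_of`.

    Equivalent in spirit to a declared count aggregation over a time
    window, computed here by brute force for illustration only.
    """
    start = as_of - window_seconds
    counts = defaultdict(int)
    for event in events:
        if start < event["timestamp"] <= as_of:
            counts[event["user_id"]] += 1
    return dict(counts)

events = [
    {"user_id": "u1", "timestamp": 100},
    {"user_id": "u1", "timestamp": 150},
    {"user_id": "u2", "timestamp": 40},   # outside a 100s window ending at 160
]
features = count_in_window(events, window_seconds=100, as_of=160)
# -> {"u1": 2}
```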
Tecton transforms silver events into gold ML-ready features. It's most commonly used by Data Scientists and MLEs.

Data Flow

Pattern 1: Stream Ingest API Integration

  1. Flink publishes cleaned events to Tecton's Stream Ingest API, where they are processed in real time by Rift and written to the online store.

  2. Flink also writes the same events to a data warehouse or data lake, which Tecton uses to backfill features and generate training datasets.

Pattern 2: Kafka/Kinesis Integration

  1. Flink publishes cleaned events to Kafka or Kinesis message queues.

  2. Tecton's Spark Streaming jobs consume from these queues, process the events, and write features to both online and offline stores.

  3. Historical data in the data warehouse/lake is used for feature backfills and training dataset generation.

Both patterns ensure online/offline consistency through dual-write or coordinated processing strategies.
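One way to reason about the consistency claim above: if the same transformation function is applied on both the streaming path and the batch/backfill path, online and offline feature values agree by construction. A minimal illustration, with hypothetical field names:

```python
def transform(record):
    """A single row-level transformation shared by both paths."""
    return {
        "user_id": record["user_id"],
        "amount_usd_cents": round(record["amount"] * 100),
    }

# Streaming path: one event arriving in real time.
online_row = transform({"user_id": "u1", "amount": 12.34})

# Batch path: the same event replayed from the data warehouse for backfill.
offline_row = transform({"user_id": "u1", "amount": 12.34})

assert online_row == offline_row  # identical logic -> identical features
```

This is exactly the property Tecton enforces by running one feature definition against both the stream and the historical data.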

Batch World Analogy

Think of Flink like dbt for streams, and Tecton like the feature layer on top.

In the batch world:

  • You might use dbt to turn bronze logs into silver tables (event cleaning, enrichment, normalization).
  • Then, you'd define features on top of those silver tables using Tecton.

This same pattern applies in streaming, only now it's real-time.
