Frequently Asked Questions
We are always looking to make the use of Tecton as transparent and effective as possible for your needs. Below, you can find answers to common questions we've received to date.
If your question isn't answered here, please reach out to email@example.com and we will get back to you quickly.
How do Feature Repos, Feature Packages, and Feature Services work together?
FeaturePackages contain all of the metadata, transformation code, and orchestration information for expressing one or more features. FeatureServices define collections of FeaturePackages to serve to a model. The Feature Repository refers to the collection of all FeaturePackage, FeatureService, and data source definitions.
Are there any recommended design patterns for feature management?
Yes. First, pick a granularity of features produced by each FeaturePackage that maximizes flexibility for reuse across the organization. Second, if a FeaturePackage contains a subset of features that are useful to another team, consider splitting it into two FeaturePackages to improve reusability. The correct design patterns are often specific to your problems and use cases. If you partner with Tecton for a PoC, we'll work together to set up best practices including partitioning of data, feature naming conventions, logical grouping of features, and more.
Is Tecton focused on machine learning development workflows or deployment to production?
Tecton is the bridge for your ML features between development and production. You define your feature pipelines once - Tecton then takes care of the scheduling, storing, and serving of those features for you. This saves you from having to reimplement those pipelines for production. The development and production environments have different storage infrastructure based on the general requirements of those environments.
Deployment & Infrastructure
Does Tecton support cross-region and cross-datacenter availability?
Tecton's default implementation is for all components to be highly available in one region. If your use case requires more than that, Tecton does support feature serving in satellite regions. We mirror the feature store to other regions in which you want to serve features. This helps support low latency, local serving capabilities and can also be used to support global failover from one region to another. Tecton has an additional fee for every cluster maintained in additional regions, to offset the operational overhead.
What data processing engines does Tecton support today?
Today, all data processing pipelines are managed by Spark. We have plans in 2021 to add a lighter-weight data processing engine as an alternative to Spark.
What deployment options does Tecton have available?
Tecton's deployment mode is called "Enterprise SaaS." Your Tecton cluster is split between an AWS account managed by Tecton and an AWS account managed by your company. Tecton's core services and metadata live in Tecton's account, but all the data processing and feature data at rest are stored in your AWS account. This option keeps your data within the bounds of your AWS account while reducing the work required from your IT organization since Tecton's core systems live in Tecton's account.
In certain circumstances, we are able to deploy Tecton in a VPC. In this situation, the entire cluster runs in an AWS sub-account owned by your company. In this case, all of the data processing and storage stays within this account. You grant Tecton administrative access to an AWS sub-account that you own. Tecton accesses this sub-account to manage the provisioning of the right infrastructure components (VPCs, instances, etc.). Software updates are taken care of by Tecton.
For more information, see Deployment Models.
Does Tecton allow for CI/CD integration?
Yes - we recommend that customers integrate Tecton's CLI with their CI/CD pipeline so that applying to production workspaces is performed automatically rather than manually. This also provides the flexibility to write unit tests against transformations that can be run automatically and on every pull request to your repo.
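As a rough illustration of the kind of test a CI job could run on every pull request, the sketch below unit-tests a transformation written as a plain Python function. The function `is_high_value_transaction` and its threshold are hypothetical and are not part of Tecton's SDK. The tests follow pytest's naming convention, and the Tecton CLI step that would follow in the pipeline is not shown.

```python
# Hypothetical transformation logic, kept as a plain function so it can be
# tested without a Spark cluster or a Tecton workspace.
def is_high_value_transaction(amount: float, threshold: float = 100.0) -> int:
    """Return 1 when the amount is strictly above the threshold, else 0."""
    return int(amount > threshold)


# pytest-style tests: a CI job can collect and run these on each pull request.
def test_flags_amounts_above_threshold():
    assert is_high_value_transaction(150.0) == 1


def test_does_not_flag_amounts_at_or_below_threshold():
    assert is_high_value_transaction(100.0) == 0
    assert is_high_value_transaction(20.0) == 0
```

Keeping the row-level logic in a plain function is a design choice for the example: it lets the tests run quickly in CI without any cluster access.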
Where is feature data stored?
Historical data is stored on S3 and serving data is stored in DynamoDB. Since Tecton orchestrates your feature pipelines, the data lives in a Tecton-managed VPC. Depending on the deployment mode you choose, this can be within your AWS environment or an environment owned by Tecton.
What infrastructure does Tecton integrate with?
Tecton uses Spark on EMR or Databricks to compute features and stores them in S3/Dynamo. Features stored in Dynamo also pass through MSK. Offline features are served from S3 directly through the SDK. Online features are served by our FeatureServer, which runs on EKS and serves values from Dynamo. There is additional metadata stored in S3 and RDS (feature definitions and configs).
How do you determine how to partition and bucket historical feature data?
For offline storage (historical data), the partitioning depends on the cadence of your materialization. For example, there will be different partitioning for features that are computed hourly vs. features computed daily.
Can separate teams use the same infrastructure and share features across different models?
Tecton can deploy separate clusters within an organization, but also provides Workspaces for multi-tenancy. Multiple teams can, and do, use a single Tecton cluster today. Tecton lets you organize the features, as well as the data sources, that are connected to it. You might have different teams using different families of features which are shared across multiple teams.
What functionality does Tecton provide for data discovery and management?
When developing features in Tecton, you will register the underlying data sources they are built from. With these data sources registered, you are able to query them via our SDK. You can also inspect, discover and browse these data sources using the Tecton Web UI - for example, you can review the schema and access code snippets for working with the data interactively in a notebook.
Are there any best practices for managing the data associated with real-time transformations?
Yes, we recommend storing data in the exact same shape and form at online request-time as in training. This is important because you want to execute the same feature transformation in both cases in order to guarantee train/serve consistency. If the data isn't in the same shape and form in both contexts, you can't run the same transformation against them in both contexts.
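The point about running the same transformation in both contexts can be sketched as follows. This is a minimal illustration, not Tecton's API: the function names and the record fields (`amount`, `monthly_income`) are assumptions made for the example.

```python
from typing import Dict, List


def amount_to_income_ratio(record: Dict[str, float]) -> float:
    """Hypothetical transformation shared by training and serving."""
    income = record["monthly_income"]
    return record["amount"] / income if income else 0.0


def training_features(historical_rows: List[Dict[str, float]]) -> List[float]:
    # Offline: apply the transformation to every stored row.
    return [amount_to_income_ratio(row) for row in historical_rows]


def online_feature(request_payload: Dict[str, float]) -> float:
    # Online: apply the very same function to the request-time payload.
    return amount_to_income_ratio(request_payload)
```

Because both paths call one function, they only agree if the request payload carries the same fields, in the same units, as the stored rows.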
When registering Hive data sources, do you have any recommendations or best practices?
We recommend registering your Hive data sources using AWS Glue. Glue converts all schema column names to lowercase, so all transformations must assume all inputs are lowercase. Having capitalization in the column names can lead to difficult-to-catch bugs - we would recommend using lowercase schema column names for raw data sources and lowercase references to column names in transformations.
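One way to catch capitalization problems early is to check column names before registering a source. The helper below is a hypothetical sketch in plain Python (it is not part of Tecton or AWS Glue): it maps each column to its lowercase form and refuses names that would collide once lowercased.

```python
from typing import Dict, List


def lowercase_column_mapping(columns: List[str]) -> Dict[str, str]:
    """Map each original column name to its lowercase form.

    Raises ValueError if two columns collide once lowercased
    (for example "UserId" and "userid").
    """
    mapping: Dict[str, str] = {}
    seen: Dict[str, str] = {}
    for name in columns:
        lowered = name.lower()
        if lowered in seen:
            raise ValueError(
                f"columns {seen[lowered]!r} and {name!r} collide when lowercased"
            )
        seen[lowered] = name
        mapping[name] = lowered
    return mapping
```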
Does any work need to be done on our end to enable streaming raw data for consumption in Tecton?
You must provide a historical record of your stream's output - this allows you to backfill your features. Without it, streaming feature collection begins only once the stream is set up with Tecton. The stream's historical output needs to be collected at the same level of granularity as your features will support going forward (eg, if features are processed in 15 minute intervals, the historical log needs to be stored in 15 minute intervals, at minimum). Tecton support can work with you to help set this infrastructure up, if necessary.
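To illustrate the granularity requirement, the sketch below assigns a historical stream event to the 15 minute interval it belongs to. How the historical log is actually stored depends on your infrastructure; the function name and the use of naive datetimes are assumptions made for the example.

```python
from datetime import datetime, timedelta


def floor_to_interval(ts: datetime, interval: timedelta) -> datetime:
    """Round a naive timestamp down to the start of its interval."""
    elapsed = ts - datetime(1970, 1, 1)
    return ts - (elapsed % interval)


# An event at 10:37 falls in the interval that starts at 10:30.
bucket = floor_to_interval(datetime(2021, 1, 1, 10, 37), timedelta(minutes=15))
```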
What infrastructure does Tecton use for streaming data sources?
Tecton plugs into Kafka or Kinesis as a streaming data source. For processing against those streams, Tecton then uses Spark structured streaming.
Which file formats does Tecton support?
Today, Tecton reads raw data with Spark, and supports all data formats that Spark natively supports, including CSV, JSON, Parquet, and Avro. The .tfrecords format, which Spark does not natively support, is not currently supported by Tecton.
What data sources does Tecton support for ingestion?
Tecton supports ingestion from Kafka, Kinesis, AWS Glue, Redshift, and raw Parquet files on S3.
How does Tecton use batch and stream data sources together?
Features in Tecton are built on top of a batch data source (eg, Hive, Redshift) or a streaming data source (eg, Kafka, Kinesis). For each kind of data source, you will provide the scheduling cadence for the feature (eg, weekly, daily, hourly) - for streaming features, the processing is done against the stream using Spark structured streaming.
Ingesting Data & Data Quality
Tecton uses Spark streaming, which is considered near real-time. What support do you have for lower latency real-time use cases?
Support for ultra-real-time use cases is coming to Tecton in Q2 2020.
When I execute tecton plan, I am receiving a large number of notifications and am seeing that the change will affect many features. Is it safe to execute tecton apply now?
If you see many updates when calling tecton plan, it is usually associated with an update to a Tecton object with many downstream objects (e.g. a data source being used by many Feature Packages). Destructively updating upstream objects will modify downstream objects as well.
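Why one upstream change can touch many objects is easiest to see as a graph traversal. The sketch below is a generic illustration, not how tecton plan is implemented: the dependency map and the object names in it are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_objects(dependents: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every object reachable from `changed` via dependency edges."""
    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical repo: one data source feeds two feature packages,
# and both packages feed one feature service.
dependents = {
    "transactions_source": ["user_spend_fp", "merchant_volume_fp"],
    "user_spend_fp": ["fraud_service"],
    "merchant_volume_fp": ["fraud_service"],
}
affected = downstream_objects(dependents, "transactions_source")
```

Here a change to the single data source affects all three downstream objects, while a change to `user_spend_fp` alone would affect only `fraud_service`.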
What guarantees does Tecton provide for ingestion latency?
Streaming data can be ingested, processed, and loaded on the order of seconds. This has been sufficient for our customers to date. If you require further optimization, this is possible, but would need to be discussed on a case by case basis.
Do you support alerting and monitoring via AWS?
If you are using a tool like CloudWatch, we have the ability to add tags to the Dynamo tables to facilitate your requirements. Contact support for more details.
What monitoring and alerting capabilities do you have?
In the Tecton Web UI, you can find basic metrics per feature set, including total query rate, error rate, and latency distributions. Please also note that Tecton is a managed service - if there's an issue with performance or errors in your services, our rotation of on-call engineers will be alerted as well.
What kind of permissions does Tecton need?
When choosing a deployment model, you may opt for a design that gives Tecton access to a dedicated VPC within your account. This is done to give Tecton permissions to deploy/maintain the system in your account, while ensuring that your data never leaves your environment. The sub-account allows us to have these privileges in a way that does not affect any of your other accounts. To see the specific requirements per deployment model, please refer to the deployment options section of our documentation.
Will my data have to be stored outside the cloud infrastructure that I already own?
We have a variety of deployment models. The most common is our Enterprise SaaS model. We also provide a VPC deployment model. Data never leaves your cloud.
The Enterprise SaaS Deployment model is a hybrid between SaaS and VPC Deployment, in which your Tecton cluster is split between an AWS account managed by Tecton and an AWS account managed by your company to provide complete data isolation within your AWS account. All data processing and feature data at rest, including materialized views, will live and stay in your AWS account. Only Tecton's metadata and core services live in Tecton's account. We recommend customers maintain a dedicated VPC for Tecton-managed data.
With VPC Deployment, your Tecton cluster runs in an AWS sub-account owned by you. All of the data processing and storage stays within this account. You grant Tecton administrative access to an AWS sub-account that you own. Tecton accesses this sub-account to manage the provisioning of the right infrastructure components (VPCs, instances, etc.). Software upgrades are taken care of by Tecton.
What does Tecton do for data lineage? Does it support the entire data flow?
For data lineage, we consider both how features are created and how they are consumed. For feature creation, we show you the entire data flow - from your raw data sources, to the different transformations being run, to where the data is being stored. For feature consumption, we have the concept of a FeatureService, which maps to the features for a model that is running. For any feature, you can see which services are using it and, likewise, for any service, which features it contains - there is bidirectional tracking.
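Bidirectional tracking amounts to keeping both directions of one mapping. As a minimal sketch (the data structure and names are assumptions for illustration, not Tecton's internal representation), a service-to-features mapping can be inverted into a feature-to-services index:

```python
from collections import defaultdict
from typing import Dict, List, Set


def services_by_feature(features_by_service: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert a service -> features mapping into feature -> services."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for service, features in features_by_service.items():
        for feature in features:
            index[feature].add(service)
    return dict(index)


# Hypothetical services and features.
features_by_service = {
    "fraud_service": ["user_spend_7d", "merchant_volume_1d"],
    "credit_service": ["user_spend_7d"],
}
feature_index = services_by_feature(features_by_service)
```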
Integrating with ETL
Does Tecton have an Airflow or Prefect integration?
Tecton does not currently have a first-class Airflow or Prefect integration. However, since Tecton's SDK is a plain Python package, it can be run from any environment with access to Python.
Currently, reading features from Tecton requires access to a Spark cluster. Ensure your Airflow environment has access to a Spark cluster when using Airflow with Tecton.
Developing with CLI
What is the access control mechanism for Tecton CLI and the web interface?
The access pattern for both is via Okta.
How are API tokens granted for service accounts?
It is possible to create a bot account or manually issue a token. Users with admin access are able to create/delete tokens via the CLI.
Developing with SDK
When a data source is registered, is any data being copied?
Tecton does not create any duplicates of the source data. Instead, it reads from the underlying data source. Tecton does manage the storage of your features, online for serving and offline for training.
Does Tecton support Jupyter Notebooks?
Tecton provides a Python SDK that can be used in any Spark notebook. As a result, we integrate with Databricks and EMR notebooks.
Creating Feature Packages
Does Tecton provide the functionality to replay and fix a backfill if the underlying data source is updated?
Yes, it is possible to kick off an "overwrite backfill" for a particular time range. Tecton will replay all transformations. This functionality is currently in private preview. To run an overwrite backfill, contact firstname.lastname@example.org.
When defining a feature pipeline, are users required to write custom code or use a DSL?
For each feature in Tecton, you will create a Python-based feature definition file that includes all of the metadata and logic you want Tecton to manage for you. The transformation logic for creating a feature will be in either PySpark, SQL or Python. Tecton requires no DSL, so it should feel familiar given how your data scientists work today.
How can users ensure there are no duplicate features ingested?
The Tecton Feature Store manages feature dependencies via the names of the objects that are configured for Tecton (eg, data sources, feature packages, and services). It is possible to have users submit similar features with different names; we would recommend users first look to reuse features that exist in the feature store.
What programming languages do you support for generating features?
Tecton's transformation logic is managed in Spark, using PySpark or SQL transformations. If your model requires real-time / request-payload transformations, those are managed in Python.
What happens when the definition of a feature changes?
If a feature's definition changes, Tecton automatically detects all the dependencies on that feature, surfaces them for you, and asks you to confirm whether you want to go forward with the changes. If you would like to roll back the changes or see the feature lineage, these definitions are backed by git, so you can always track the state of your feature store.
Can users inspect features?
Tecton provides a variety of ways for your data scientists to inspect features. They can review the actual code of the feature, see summary statistics for all features, and query the feature's data using the Tecton SDK.
Can users register and discover features in Tecton?
Yes, with Tecton, you register the entire transformation logic, plus metadata around families, owners, custom tags, and more. The Tecton Web UI then allows users to access, browse and discover different families of features if you break them down by different use cases or filter down to different metadata tags that you can add on to these features.
How far back does Tecton support time-travel?
You set your features' backfill start date in Tecton. Time-travel can be performed as far back as feature data exists.
What support do you provide for time travel?
Tecton performs time travel on a row-level basis - our granularity of time travel can be quite specific. If you have event-driven ML models where you're regularly making predictions and you need to go back to every single specific point in time and get the feature values as of that point in time, Tecton will handle that full time-travel query, as opposed to only being able to get all feature values at a single point in time.
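The idea of a row-level, point-in-time lookup can be illustrated in plain Python. This is a conceptual sketch under simplifying assumptions (integer timestamps, one feature, history already sorted); it is not how Tecton executes the query, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def value_as_of(history: List[Tuple[int, float]], ts: int) -> Optional[float]:
    """Return the latest value recorded at or before `ts`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


def point_in_time_values(
    history: List[Tuple[int, float]], event_times: List[int]
) -> List[Optional[float]]:
    # Each prediction event gets the feature value as of its own timestamp,
    # rather than one snapshot shared by all events.
    return [value_as_of(history, ts) for ts in event_times]
```

An event that precedes the first recorded value gets `None`, which avoids leaking a later feature value into an earlier training row.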
Does Tecton support temporal features?
Yes, in the feature definition, users can specify a serving_ttl for features - the length of time the feature values are valid for serving.
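Conceptually, a serving TTL is a freshness check at request time. The sketch below illustrates that idea only; the function name is hypothetical, and treating a value as valid up to and including the TTL boundary is an assumption, not a statement about how Tecton evaluates serving_ttl.

```python
from datetime import datetime, timedelta


def is_valid_for_serving(
    feature_time: datetime, request_time: datetime, serving_ttl: timedelta
) -> bool:
    """True if the feature value is no older than serving_ttl at request time."""
    age = request_time - feature_time
    return timedelta(0) <= age <= serving_ttl
```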
Are features versioned? How can users manage updating a "v1" feature to a "v2" feature?
For this, you can use Tecton's variants. Users have the ability to specify a new feature that is a variant of another feature (bug fix, data element name change, a new experiment, etc.). Feature variants can then be marked as deprecated to notify users to migrate over.
When scheduling materializations, does Tecton only materialize new data? Or does Tecton re-materialize all data?
Generally speaking, Tecton only reads and computes new data. There may be instances in which more historical data is required (eg, computing a one month average at materialization time requires knowing the full window of information).
What data types are supported for feature values?
Tecton supports the following Spark data types:
StringType elements may only be used in a
Consuming Feature Services
What are the feature serving limits?
Tecton is built on DynamoDB. Dynamo's default ingestion rate is 40 writes per second and becomes more expensive above this. Dynamo's read SLO is a P99 of 100ms. This is subject to a limit on the amount of data that is being aggregated for that request: when there are aggregations at request time, at most 2MB of data can be aggregated while continuing to meet the 100ms threshold.
For serving aggregate features, what happens if the feature requires more than the 2MB limit?
You can aggregate more than 2MB. However, beyond that point we can't guarantee latency, because it depends on how much data you're reading per query. When we measure our compliance with our SLO, we exclude requests that require aggregating more than 2MB of data. It's a soft limit in the sense that the query will work if it goes over 2MB; it just may be slower.
What are your availability guarantees for feature serving?
For a single region, we have an SLO of 99.99% uptime (about 52 minutes per year downtime). If your system requires multi-region support, we provide an SLO of five 9's (about 5 minutes per year downtime).
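The downtime budget implied by an availability target follows from simple arithmetic. The short sketch below computes it for a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum yearly downtime, in minutes, allowed by an availability SLO."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


print(round(downtime_minutes_per_year(99.99), 2))   # 52.56
print(round(downtime_minutes_per_year(99.999), 2))  # 5.26
```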
When serving features, what limits have you tested for throughput?
Our highest production workloads currently support ~4,000 requests per second. That being said, our system is capable of scaling higher than this if your system requires it.
How granular is user access? Can we restrict which users can add vs delete features?
ACLs are on a per-workspace level with read, write, and delete rights.
Does Tecton provide a set of “shared” features that others can pull from for their models?
Yes. Multiple teams can be included in a feature repository and share features within it. These features can be further organized to delineate shared vs. use case-specific features. We strongly encourage the sharing and reuse of features where possible and have seen organizations experience significant productivity and model gains when doing so.
How can I improve the request latency when calling FeatureService.query_features()?
If your specified online serving index returns a large amount of data, you may have longer response times. The time it takes for Tecton to return is a function of the number of rows matched by your query, the number of features in your feature service, and the amount of aggregation needed for your TemporalAggregateFeaturePackages.
If you need to retrieve a large number of rows, you can try:
* Reducing the data aggregated at request time by increasing the
aggregation_slide_period or shortening the longest
* Removing unimportant features from your model.
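As a back-of-the-envelope model of the first suggestion (an assumption for illustration, not a description of Tecton's internals), suppose a windowed aggregate is assembled at request time from one partial aggregate per slide period. The number of partial aggregates read per lookup then shrinks as the slide period grows or the window shortens. The function name below is hypothetical.

```python
import math
from datetime import timedelta


def partial_aggregates_per_lookup(window: timedelta, slide_period: timedelta) -> int:
    """Number of slide-period buckets needed to cover one aggregation window."""
    return math.ceil(window / slide_period)


# Under this model, a 7-day window needs 168 hourly buckets but only 7 daily ones.
hourly = partial_aggregates_per_lookup(timedelta(days=7), timedelta(hours=1))
daily = partial_aggregates_per_lookup(timedelta(days=7), timedelta(days=1))
```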
Using Tecton across the organization
How can I ensure my features are usable across my company?
When registering a feature with Tecton, the owner of the feature can define feature descriptions, tags, and the transformation logic, among other things. These are all searchable and surfaced for other users when exploring features.
What SSO support do you provide (eg, Microsoft Office 365)?
Tecton integrates with Okta and Okta can integrate with any SSO provider.