Delete Keys from a Feature View
Tecton makes it simple to delete individual keys from a materialized Feature View. This capability can be helpful for cleaning up erroneous data or handling user data deletion requests.
Requirements and Limitations
Feature View Requirements
To be eligible for key deletion, a Feature View must meet the following requirements:
- It must materialize data with either online=True or offline=True. Otherwise there is no data to delete. Note that OnDemandFeatureView does not materialize data to Tecton, so there is similarly no data to delete.
- The offline store must be configured to use Delta format. To do so, set offline_store=DeltaConfig(). If offline_store is not specified, Delta format is the default starting with Tecton SDK 0.8.
- It cannot have online_serving_index configured.
Deletion Request Limitations
When constructing your dataframe of IDs to delete:
- A maximum of 500,000 keys can be deleted per request.
- If a Feature View has multiple entities, the full set of join keys must be specified. For example, if the Feature View has entities [user_id, merchant_id], then both IDs must be present for each row in the deletion request.
Finally, note that Tecton does not prevent materializing data for these IDs in the future, including late-arriving data or concurrently running materialization jobs.
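The request limits above can be checked before calling delete_keys(). The following is a minimal sketch in pandas, not part of the Tecton SDK: the helper name split_deletion_requests and its parameters are hypothetical. It verifies that every join key column is present and populated, drops duplicate rows, and splits the keys into chunks of at most 500,000 rows.

```python
import pandas as pd

MAX_KEYS_PER_REQUEST = 500_000  # per-request limit stated in this guide


def split_deletion_requests(keys_df, join_keys, max_keys=MAX_KEYS_PER_REQUEST):
    """Validate a deletion dataframe and split it into request-sized chunks.

    Hypothetical helper, not part of the Tecton SDK.
    """
    # Every join key of the Feature View must be present as a column.
    missing = [k for k in join_keys if k not in keys_df.columns]
    if missing:
        raise ValueError(f"Missing join key columns: {missing}")

    # Every row must carry a value for each join key.
    keys = keys_df[list(join_keys)]
    if keys.isna().any().any():
        raise ValueError("Every row must specify the full set of join keys")

    # Duplicates would only waste part of the per-request budget.
    keys = keys.drop_duplicates().reset_index(drop=True)
    return [keys.iloc[i:i + max_keys] for i in range(0, len(keys), max_keys)]
```

Each returned chunk could then be passed to delete_keys() as its own request.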
Using the delete_keys method
The delete_keys() SDK method is available for BatchFeatureView, StreamFeatureView, and FeatureTable.
See the SDK reference for the full method signature.
First, construct a Spark or pandas DataFrame containing the set of keys to delete. For example:
import pandas

# Join keys identifying the rows to delete
join_keys_df = pandas.DataFrame({
    'user_id': ['A100000000', 'C200000000']
})
Then call the delete_keys() method on the Feature View or Feature Table:
import tecton

ws = tecton.get_workspace('my_workspace')
fv = ws.get_feature_view('my_feature_view')

# delete_keys() returns the IDs of the asynchronous deletion jobs
job_ids = fv.delete_keys(join_keys_df)
for job_id in job_ids:
    print(fv.get_materialization_job(job_id).state)
This method triggers asynchronous jobs and returns a list of job IDs. To view the status of these jobs, pass the job IDs to get_materialization_job(). These jobs typically take 10 to 45 minutes to complete, depending on the size of your data.
You can also view status under the materialization tab for this feature view in your Tecton web console. Note that the deletion jobs will be at the end, after all the materialization jobs.
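Because the deletion jobs are asynchronous, a script may need to wait for them to finish. The sketch below polls a status callback until every job reaches a terminal state. It is illustrative only: wait_for_jobs, the get_state callback, and the terminal_states argument are assumptions, not part of the Tecton SDK, and this guide does not list the state values, so check what get_materialization_job() reports in your SDK version.

```python
import time


def wait_for_jobs(job_ids, get_state, terminal_states,
                  poll_seconds=60.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_state(job_id) until every job is in a terminal state.

    Hypothetical helper. Returns {job_id: last observed state} and raises
    TimeoutError if any job is still pending after timeout_seconds.
    """
    states = {}
    pending = list(job_ids)
    waited = 0.0
    while True:
        # Refresh only the jobs that have not finished yet.
        for job_id in pending:
            states[job_id] = get_state(job_id)
        pending = [j for j in pending if states[j] not in terminal_states]
        if not pending:
            return states
        if waited >= timeout_seconds:
            raise TimeoutError(f"Jobs still running: {pending}")
        sleep(poll_seconds)
        waited += poll_seconds
```

With a Feature View object fv, the callback could be lambda job_id: fv.get_materialization_job(job_id).state, with terminal_states set to the state names your SDK reports. The default timeout of one hour is chosen to sit above the typical 10 to 45 minute job duration.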
Finally, you may want to verify that the data was deleted as intended.
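from datetime import datetime

import pandas

# Online store: look up one of the deleted keys
keys = {'user_id': 'A100000000'}
fv.get_online_features(join_keys=keys).to_dict()

# Offline store: query historical features for the same key
keys_df = pandas.DataFrame({
    'user_id': ['A100000000']
})
start_time = datetime(<Your Start Time>)
end_time = datetime(<Your End Time>)
fv.get_features_in_range(start_time, end_time, entities=keys_df).to_pandas()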
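To make the offline check less manual, the returned dataframe can be compared against the deleted keys. This is a minimal pandas sketch, not a Tecton API: remaining_rows is a hypothetical helper, and features_df stands for the dataframe returned by the to_pandas() call, assumed to contain the join key columns.

```python
import pandas as pd


def remaining_rows(features_df, deleted_keys_df, join_keys):
    """Return rows of features_df whose join keys appear in deleted_keys_df.

    Hypothetical helper. An empty result means no rows for the deleted keys
    were found in features_df.
    """
    deleted = deleted_keys_df[list(join_keys)].drop_duplicates()
    # An inner merge keeps only the feature rows that match a deleted key.
    return features_df.merge(deleted, on=list(join_keys), how="inner")
```

This guide does not specify how deleted keys appear in query output (absent rows versus null feature values), so treat a non-empty result as a prompt to inspect the rows, not as proof that deletion failed.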
How it works
When you run FeatureView.delete_keys(join_keys_dataframe), Tecton initiates jobs to delete entries from both the online and offline stores that match the specified join keys.
Deletion requires the offline store to use Delta format. Tecton uses Delta table APIs to delete all historical values for the specified join keys. Because deletion with Delta only removes data from the latest version of the table, Tecton additionally runs a vacuum command as part of its Delta maintenance tasks, which fully removes any data deleted at least 7 days ago. As a result, all data should be fully deleted within 7 to 14 days.
Additionally, Tecton deletes the associated feature values from the online store. Note that there will be costs associated with these operations if you are using DynamoDB as your online store.
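For the offline store, the full-removal window follows from the 7-day vacuum threshold. The arithmetic can be sketched as follows. purge_window is a hypothetical helper, and the 14-day upper bound is simply the figure quoted in this guide, not a value read from Tecton.

```python
from datetime import datetime, timedelta

VACUUM_RETENTION = timedelta(days=7)   # vacuum only removes data deleted at least 7 days ago
MAX_PURGE_DELAY = timedelta(days=14)   # upper bound quoted in this guide


def purge_window(deleted_at):
    """Return the (earliest, latest) times by which deleted offline data
    should be fully removed, given when the deletion job ran."""
    return deleted_at + VACUUM_RETENTION, deleted_at + MAX_PURGE_DELAY
```

For a compliance request, the later of the two timestamps is the safer one to report as the completion date.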