Version: Beta 🚧

TectonDataFrame

Summary

A thin wrapper around Pandas and Spark DataFrames.

Attributes

| Name | Data Type | Description |
|---|---|---|
| `columns` | `Sequence[str]` | The columns of the DataFrame |
| `schema` | `Schema` | The schema of the DataFrame |

Methods

| Name | Description |
|---|---|
| `__init__(...)` | Method generated by attrs for class TectonDataFrame. |
| `explain(...)` | Prints the query tree. Should only be used when this TectonDataFrame is backed by a query tree. |
| `get_sql_node(...)` | Returns the first node in the TectonDataFrame's query tree from which SQL can be generated. |
| `start_dataset_job(...)` | Starts a job to materialize a dataset from this TectonDataFrame. |
| `subtree(...)` | Creates a TectonDataFrame from a subtree of the prior query tree, labeled by a node id in `.explain()`. |
| `to_arrow()` | Gets the results as an Arrow Table. |
| `to_pandas(...)` | Converts the TectonDataFrame to a Pandas DataFrame. |
| `to_spark()` | Returns the data as a Spark DataFrame. |

__init__(...)

Method generated by attrs for class TectonDataFrame.

Parameters

Returns

None

explain(...)

Prints the query tree. Should only be used when this TectonDataFrame is backed by a query tree.

Parameters

  • node_id: bool = True If True, the unique id associated with each node will be rendered.
  • name: bool = True If True, the class names of the nodes will be rendered.
  • description: bool = True If True, the actions of the nodes will be rendered.
  • columns: bool = False If True, the columns of each node will be rendered as an appendix after the tree itself.
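
For example, a minimal sketch (assuming `df` is a TectonDataFrame backed by a query tree, such as the result of a feature-retrieval call in a live Tecton workspace):

```python
# Sketch only: assumes `df` is a TectonDataFrame backed by a query tree,
# obtained elsewhere from a feature-retrieval call.
df.explain()                   # node ids, class names, and descriptions
df.explain(columns=True)       # also append each node's columns after the tree
df.explain(name=False,
           description=False)  # render only the node ids
```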

get_sql_node(...)

Returns the first node in the TectonDataFrame's query tree from which SQL can be generated.

Parameters

  • tree: NodeRef The subtree for which to generate SQL.

start_dataset_job(...)

Start a job to materialize a dataset from this TectonDataFrame.

Parameters

  • dataset_name: str The Dataset object will be created with this name. The dataset can later be retrieved by this name, so it must be unique within the workspace.
  • cluster_config: Union[_DefaultClusterConfig, DatabricksClusterConfig, EMRClusterConfig, DatabricksJsonClusterConfig, EMRJsonClusterConfig, RiftBatchConfig, NoneType] = None Configuration for the Spark/Rift cluster.
  • tecton_materialization_runtime: Optional[str] = None Version of the tecton package used by the job cluster.
  • environment: Optional[str] = None The custom environment in which jobs will be run.
  • extra_config: Optional[Dict[str, Any]] = None Additional parameters (the list may vary depending on the Tecton runtime) that may be used to tune remote execution heuristics (e.g., what number to use when chunking the events dataframe).
  • compute_mode: Union[tecton_core.compute_mode.ComputeMode, str, NoneType] = None Overrides the compute mode used in the get_features call.
  • job_retry_times: Optional[int] = None Maximum number of retries for the job. If not specified, the default Remote Dataset Job retry count is used.

Returns

DatasetJob: A DatasetJob object.
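
A hedged sketch of a typical call, assuming `df` came from a prior feature-retrieval call; the dataset name below is illustrative:

```python
# Sketch only: `df` is assumed to be a TectonDataFrame from a prior
# feature-retrieval call; the dataset name is a hypothetical example.
job = df.start_dataset_job(dataset_name="training_events_v1")
# `job` is a DatasetJob; the materialized dataset can later be retrieved
# by the (workspace-unique) name passed above.
```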

subtree(...)

Creates a TectonDataFrame from a subtree of the prior query tree, labeled by a node id shown in .explain().

Parameters

  • node_id: int The identifier of a node from .explain().

Returns

TectonDataFrame
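
A sketch of the intended workflow (the node id below is hypothetical; take a real one from the `.explain()` output):

```python
# Sketch only: `df` is a TectonDataFrame backed by a query tree.
df.explain()                  # prints node ids (node_id=True by default)
sub = df.subtree(node_id=3)   # hypothetical id copied from the output above
sub.to_pandas()               # the subtree is itself a TectonDataFrame
```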

to_pandas(...)

Converts the TectonDataFrame to a Pandas DataFrame.

Parameters

  • pretty_sql: bool = False Not applicable when using Spark. For Snowflake and Athena, to_pandas() will generate a SQL string, execute it, and return the resulting data in a Pandas DataFrame. If True, the SQL will be reformatted and executed as a more readable, multiline string. If False, the SQL will be executed as a one-line string. Use pretty_sql=False for better performance.

Returns

DataFrame: A Pandas DataFrame.

to_spark()

Returns data as a Spark DataFrame.

Returns

DataFrame: A Spark DataFrame.
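
Taken together, the conversion methods can be sketched as follows (assuming `df` is an existing TectonDataFrame; `to_spark()` presumably requires a Spark-backed environment):

```python
# Sketch only: `df` is an existing TectonDataFrame obtained elsewhere.
pdf = df.to_pandas()   # Pandas DataFrame
tbl = df.to_arrow()    # Arrow Table
sdf = df.to_spark()    # Spark DataFrame (assumes Spark compute is available)
```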
