Transform is a process available in the Synthesizer, Composer, and Orchestrator apps that helps build user-defined analytics datasets. Some of the high-level capabilities available with Transform are:
- Joining multiple datasets
- Transforming fields
- Aggregating data
- Filtering records
Example use cases
- Building customer-level Synthesizer datasets by joining clickstream and CRM data and aggregating by a UUID.
- Joining labels and predictions in an Outcome Propensity app.
- Joining multiple BigQuery clickstream datasets in Orchestrator, aggregating by customer ID, and filtering to include only valid IDs, with the output connected to a Publish process that sends derived attributes to a DMP.
Process Configuration
The Transform process offers several methods for data preparation. The following sections explain each screen and field.
Join
This section provides the information Syntasa needs if you are joining more than one set of data. Here are the steps to create a new join:
- Go to the App and navigate to Development >> Workflow.
- Click on the 'Transform' process.
- Select the 'Join' tab.
- Click the Plus (+) icon shown on the screen.
Following is the explanation of the configurable fields; a sketch of the join they produce appears after the list:
- Join Type: Options include inner, left, right, and full outer.
- Source: This will display the list of data stores linked to the 'Transform' process. The selected data store will act as the input source for the data.
- Dataset Selector: Choose the dataset that will be joined with the first dataset.
- Alias: Type a table alias if a different name is desired or required.
- Left Value: Choose the field from the first dataset that will link with the joined dataset (e.g., customer ID if joining a CRM dataset).
- Operator: Select how the left value should be compared with the right value. For joins, this will typically be an equals sign (=).
- Right Value: Select the value from the joining dataset that is being compared with the left value.
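For illustration, a join configured with these fields corresponds to SQL along the following lines. This is a minimal sketch; the table names (events, crm_profiles), the alias, and the join fields are hypothetical.

```sql
-- Hypothetical inner join of a clickstream dataset to a CRM dataset.
-- "events" is the first dataset; "crm_profiles" is the joined dataset
-- with alias "c". Left Value = e.customer_id, Operator = '=',
-- Right Value = c.customer_id.
SELECT
  e.*,
  c.segment
FROM events e
INNER JOIN crm_profiles c
  ON e.customer_id = c.customer_id
```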
Mapping
The Mapping screen is where event data fields are mapped to Syntasa fields, desired functions are applied, and user-friendly labels are created.
This section defines and labels the input data according to the Syntasa schema. Syntasa offers a growing set of custom functions that can be applied alongside any Hive functions for data transformation. It is recommended to consult Syntasa professional services with any questions before applying functions other than the defaults.
Here is the explanation of different fields:
- Type: Simple or Aggregation
- Simple: This is the default setting, with "Is Partitioned" set to true.
- Aggregation: When aggregation is selected, a "Group By" column becomes available.
- Is Partitioned: Specifies whether the dataset should be partitioned.
- A partition column will appear to select the field(s) that should partition the table.
- If a partition is not selected, Syntasa will use the event partition.
- Windowing: This feature provides the ability to look back a specified number of days, ending on a specified date, and write the data to one partition.
- Number of Days: Enter the number of days to look back and include in the partition.
- Ending On: Select the date the look-back period should end on. Typically, the default of "Processing Date" will be used, but there may be occasions where a different selection is relevant.
- Offset Dates: When the toggle is disabled, the start and end dates are set from the configured values of 'Number of Days' and 'Ending On'. When the toggle is enabled, the "Move end date by" field becomes active, allowing you to enter a specific number of days to move the start and end dates forward or backward.
Reminder: Syntasa processes partitioned data on a partition-by-partition basis, so "Processing Date" means the event_partition being processed for the period.
Example: Suppose you run the Transform process for 10 days, with a 30-day window ending on Processing Date and data partitioned by day. For each day in the 10-day period, the process looks back 30 days and includes all of those days in processing. A sketch of the equivalent logic follows.
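Conceptually, the windowing configuration in the example above behaves like the following Hive-style query. This is a sketch, assuming a daily event_partition column in 'yyyy-MM-dd' format; the table, fields, and dates are hypothetical, and the actual SQL is generated by the process.

```sql
-- Hypothetical 30-day window ending on the processing date ('2024-06-30').
-- All 30 days (the end date plus the 29 days before it) are read, and the
-- aggregated result is written to the single partition for that date.
SELECT
  customer_id,
  COUNT(*) AS visits_30d
FROM events
WHERE event_partition BETWEEN DATE_SUB('2024-06-30', 29) AND '2024-06-30'
GROUP BY customer_id
```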
For Transform, there are six actions available:
- Add - Add is used to select specific fields from the input table.
- Add All - Add All will select all fields from the input table.
- Clear - Clear will clear all selected fields from the mapping canvas.
- Function - Function opens the function editor, where custom fields can be created (see the sketch after this list).
- Import - Import is used if the client has JSON data available to provide the custom mappings.
Note: Wait 60 seconds to ensure the process of pulling in mappings and labels is complete.
- Use the scroll, order, and search options to locate the cust_fields and cust_metrics fields to ensure all the report suite custom fields have been mapped.
- Export - Export is used to export the existing mapping schema in .csv format to assist in editing or manipulating the schema. The updated file can then be used to load a revised schema into the dataset.
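As an illustration, a custom field built in the function editor is an expression that can combine Hive functions with input fields. A minimal sketch, assuming a hypothetical input field page_url:

```sql
-- Hypothetical custom field: lowercase page path extracted from a raw URL,
-- defaulting to 'unknown' when the URL is null or has no path.
COALESCE(LOWER(parse_url(page_url, 'PATH')), 'unknown')
```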
Filters
Filters allow the user to apply a "Where" clause to the dataset if needed; a sketch of a resulting clause appears after the steps below.
Steps to create a filter:
- Click the "Apply Where Clause" button to enable filter editing.
- The filter editor screen will appear.
- Select the appropriate Left Value from the drop-down list or click "--Function Editor--" to create and apply a custom function.
- Select the appropriate Operator from the drop-down list.
- Select the desired Right Value for the filter from the drop-down list or click "--Function Editor--" to create and apply a custom function.
- Multiple filters can be applied.
- Ensure the proper (AND/OR) logic is applied when adding additional filters if required.
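Taken together, the configured filters amount to a WHERE clause such as the following sketch (hypothetical field names, with two conditions combined using AND logic):

```sql
-- Hypothetical filter: keep only rows with a valid customer ID in the US
WHERE customer_id IS NOT NULL
  AND country_code = 'US'
```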
Output
The Outputs tab allows you to:
- Name the table and set its display name on the graph canvas.
- Select the destination for data loading:
- Load to BigQuery (BQ) if using Google Cloud Platform (GCP).
- Load to Redshift or RDS if using Amazon Web Services (AWS).
- Write to HDFS if using an on-premise Hadoop setup.
Expected Output
The expected outputs of the Transform process include the following tables within the environment where the data is processed (e.g., AWS, GCP, on-premise Hadoop):
- output_table (with a configurable table name): This table uses Syntasa-defined column names and can be queried directly using an enterprise-provided query engine.
Additionally, this table can serve as the foundation for building other datasets, such as Syntasa custom-built datasets.
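For example, once the process has run, the output table can be queried directly with the enterprise query engine. A minimal sketch, assuming a hypothetical database, table name, and columns:

```sql
-- Hypothetical query against the Transform output table for one partition
SELECT customer_id, visits_30d
FROM analytics.output_table
WHERE event_partition = '2024-06-30'
LIMIT 10
```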