Description
The Lookback process builds a history dataset; the propensity scoring app is used here as an example. This dataset contains all the features that will be used to train the machine learning model. The process goes back a specified number of days from the processing date and creates a table with user-defined columns and filters.
For example, to train a model to predict house purchases, we would use features such as the square footage of the house, the number of bedrooms, and the number of bathrooms.
Process Configuration
The Lookback Process Configuration screen has four tabs: Input, Mapping, Filters, and Output. Click on the Lookback node to access the editor.
Below are details of each screen and descriptions of each of the fields.
Input
The input screen defines the dataset(s) to use as the input.
- Primary Source - The first dataset connected on the graph appears by default. Click the down arrow to select a different dataset.
- Alias - type a table alias if a different name is desired or required.
Joins
To create a join, click the green plus button.
- Join Type - left or inner join
- Source - choose the dataset that will be joined with the first dataset
- Alias - type a table alias if a different name is desired or required
- Left Value - choose the field from the first dataset that links it to the joined dataset (e.g. customer ID if joining a CRM dataset)
- Operator - select how the left value should be compared with the right value; for joins this will typically be an = sign
- Right Value - select the joining dataset value that is being compared with the left value
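The join fields above correspond closely to an ordinary SQL join. The following is a minimal sketch of that correspondence using Python's built-in sqlite3 module; it is not the SQL that the Lookback process actually generates, and the table names (events, crm), aliases (e, c), and columns are hypothetical.

```python
import sqlite3

# Hypothetical sample data: an event dataset and a CRM dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (visitor_id TEXT, customer_id TEXT, page TEXT);
    CREATE TABLE crm (customer_id TEXT, segment TEXT);
    INSERT INTO events VALUES
        ('v1', 'c1', 'home'),
        ('v2', 'c2', 'listing'),
        ('v3', NULL, 'home');
    INSERT INTO crm VALUES ('c1', 'gold');
""")

def run_join(join_type: str) -> list[tuple]:
    """Join events (alias e) to crm (alias c) on customer_id."""
    # join_type mirrors the Join Type field: LEFT or INNER.
    if join_type not in ("LEFT", "INNER"):
        raise ValueError(f"unsupported join type: {join_type}")
    query = f"""
        SELECT e.visitor_id, c.segment
        FROM events AS e                      -- Primary Source and Alias
        {join_type} JOIN crm AS c             -- Join Type, Source, Alias
          ON e.customer_id = c.customer_id    -- Left Value, Operator, Right Value
        ORDER BY e.visitor_id
    """
    return conn.execute(query).fetchall()

left_rows = run_join("LEFT")    # keeps every row of the first dataset
inner_rows = run_join("INNER")  # keeps only rows with a match in crm
print(left_rows)
print(inner_rows)
```

In this sketch the left join keeps all three visitors, with an empty segment where there is no CRM match, while the inner join keeps only the matched visitor.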
Mapping
- Lookback Window Length - specifies the number of days before the processing date that the process should include.
- For example, with a Lookback Window Length of 3 days and processing dates of January 10 and 11, the run for January 10 includes January 7-9 and the run for January 11 includes January 8-10.
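The window arithmetic in the example above can be sketched as follows. This is only an illustration that assumes the window ends on the day before the processing date, as the example implies; the function name lookback_window and the year 2024 are hypothetical.

```python
from datetime import date, timedelta

def lookback_window(processing_date: date, window_length_days: int) -> list[date]:
    """Return the days covered by the lookback window, oldest first.

    Assumes the window ends the day before processing_date.
    """
    if window_length_days < 1:
        raise ValueError("window_length_days must be at least 1")
    return [
        processing_date - timedelta(days=offset)
        for offset in range(window_length_days, 0, -1)
    ]

# Processing January 10 with a 3-day window covers January 7-9.
window = lookback_window(date(2024, 1, 10), 3)
print([d.isoformat() for d in window])
```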
Actions
Fields to be included in the Mapping section depend on the use case. There is typically a field defined as an Identifier (e.g. visitor_id), a field defined as a Partition (e.g. the input source event_partition), and one or more fields that will be used to create features in a subsequent process.
For Lookback there are six options available: Add, Add All, Clear, Function, Import and Export.
- Add - used to select specific fields from the input table.
- Add All - will select all fields from the input table.
- Clear - clear all selected fields from the mapping canvas.
- Function - used to access the function editor to create custom fields.
- Import - selected if the client has JSON data available to provide the custom mappings.
- Export - utilized to export the existing mapping schema in a .csv format that can be used to assist in the editing or manipulation of the schema. This updated file could then be used to input an updated schema into the dataset.
Mapping Output
- Order - column ordering
- Name - specified name of the column
- Function - map a field directly or apply a function such as max(), sum(), or a case statement, to name just a few
- Identifier - field to aggregate on
- Partition - column to partition the data on
- To switch a field to Identifier or Partition, click in the corresponding cell and select the checkbox
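As a rough illustration of how an Identifier and mapped functions might combine, the sketch below aggregates sample rows per identifier using Python's built-in sqlite3 module. It assumes that marking a field as Identifier behaves like a SQL GROUP BY; the table, column names, and output names are hypothetical, and the Partition column is left out of the aggregation for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (
        visitor_id TEXT, event_partition TEXT,
        page_views INTEGER, purchase INTEGER
    );
    INSERT INTO events VALUES
        ('v1', '2024-01-07', 3, 0),
        ('v1', '2024-01-08', 5, 1),
        ('v2', '2024-01-08', 2, 0);
""")

# One output row per Identifier value, with one column per mapped function.
rows = conn.execute("""
    SELECT
        visitor_id,                                    -- Identifier
        SUM(page_views) AS total_page_views,           -- sum()
        MAX(purchase) AS has_purchased,                -- max()
        SUM(CASE WHEN page_views >= 3 THEN 1 ELSE 0 END)
            AS active_days                             -- case statement
    FROM events
    GROUP BY visitor_id
    ORDER BY visitor_id
""").fetchall()
print(rows)
```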
Filters
Filters provide the ability to restrict the dataset to only certain data (i.e. apply a Where and/or Having clause).
To create a filter, click the green plus button and the filter editor screen will appear. Multiple filters can be applied; ensure the proper (AND/OR) logic is used to combine them.
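The difference between the two clause types matters when choosing a filter: a Where clause removes input rows before aggregation, while a Having clause removes aggregated rows afterwards. The following is a minimal sketch using Python's built-in sqlite3 module, with a hypothetical table and hypothetical filter values; it does not represent the filter editor itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (visitor_id TEXT, country TEXT, page_views INTEGER);
    INSERT INTO events VALUES
        ('v1', 'US', 3), ('v1', 'US', 5),
        ('v2', 'US', 1),
        ('v3', 'GB', 9);
""")

rows = conn.execute("""
    SELECT visitor_id, SUM(page_views) AS total_page_views
    FROM events
    WHERE country = 'US'            -- row-level filter, applied first
    GROUP BY visitor_id
    HAVING SUM(page_views) >= 5     -- filter on the aggregated value
    ORDER BY visitor_id
""").fetchall()
print(rows)
```

Here v3 is excluded by the Where clause and v2 by the Having clause, leaving only v1.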
Output
The Output tab is used to name the table and set its display name on the graph canvas. It also controls where the output is loaded: BigQuery (BQ) if on the Google Cloud Platform (GCP), Redshift or RDS if on Amazon Web Services (AWS), or simply HDFS if using on-premise Hadoop.
Expected Output
The expected output of the Lookback process is the table below, created in the environment where the data is processed (e.g. AWS, GCP, on-premise Hadoop):
- Table Name <tb_visitor_history> default value - table using Syntasa defined column names
- Display Name <Visitor History> default value - display name of node on canvas
This table can be queried directly using an enterprise provided query engine.
Additionally, the table can serve as the foundation for building other processes within the Syntasa Composer environment.
Test
After the process is configured, it is highly recommended to test the configured process.
- Click the down arrow to close the process configuration screen
- Save and Lock the canvas
- Shift-click on the process
- Click the Test button
- Run for one day using Overwrite mode (ensure the day being run exists in the input dataset)
- Click on Operations screen to track the job progress
- After a successful test, move on to the next process