The Lookback process is used to build a historical dataset for applications like the propensity scoring app. This dataset includes all the features needed to train a machine learning model by looking back a specified number of days from the processing time and creating a table with user-defined columns and filters.
Example: To train a model for predicting house purchases, the dataset would include features such as square footage of the house, number of bedrooms, number of bathrooms, etc.
Process Configuration
Let's cover each tab shown when you click on the LookBack node to access the editor.
Join
The top of the Join screen defines the dataset(s) to use as input through the following two fields:
- Primary Source: The first dataset connected to the graph will appear by default. Click the down arrow to select a different dataset.
- Alias: Type a table alias if a different name is desired or required.
This section provides the information Syntasa needs if you are joining more than one set of data. Here are the steps to create a new join:
- Go to the App and navigate to Development >> Workflow.
- Click on the 'LookBack' process.
- Select the 'Join' tab.
- Click the Plus (+) icon shown on the screen.
The configurable fields are explained below:
- Join Type: Choose between a left or inner join.
- Source: Select the dataset that will be joined with the first dataset.
- Alias: Type a table alias if a different name is desired or required.
- Left Value: Choose the field from the first dataset that will link with the joined dataset (e.g., customer ID if joining a CRM dataset).
- Operator: Select how the left value should be compared with the right value; for joins, this will typically be an equals sign (=).
- Right Value: Select the value from the joining dataset that is being compared with the left value.
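To make the join fields concrete, here is a minimal sketch that uses Python's built-in sqlite3 module to compare a left join and an inner join on the same inputs. The table names (events, crm), the aliases (e, c), and the customer_id column are hypothetical and only stand in for the Source, Alias, Left Value, Operator, and Right Value settings; this is not the SQL that Syntasa generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (customer_id TEXT, page TEXT);
    CREATE TABLE crm (customer_id TEXT, segment TEXT);
    INSERT INTO events VALUES ('c1', 'home'), ('c2', 'pricing'), ('c3', 'home');
    INSERT INTO crm VALUES ('c1', 'gold'), ('c2', 'silver');
""")

def run_join(join_type):
    """Join the primary source (alias e) to the joined source (alias c)."""
    if join_type not in ("LEFT", "INNER"):
        raise ValueError("join_type must be 'LEFT' or 'INNER'")
    query = f"""
        SELECT e.customer_id, e.page, c.segment
        FROM events AS e
        {join_type} JOIN crm AS c
          ON e.customer_id = c.customer_id  -- Left Value, Operator (=), Right Value
        ORDER BY e.customer_id
    """
    return conn.execute(query).fetchall()

left_rows = run_join("LEFT")    # keeps c3 even though it has no CRM match
inner_rows = run_join("INNER")  # drops c3 because it has no CRM match
print(left_rows)
print(inner_rows)
```

In this sketch the left join returns three rows, with a missing segment for c3, while the inner join returns only the two matched rows.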
Mapping
The Mapping screen defines the fields, allows the application of functions, and sets the identifiers and partitions.
This screen shows two configurable fields:
- Lookback Window Length: This field determines the number of days of historical data to include in the analysis. In the above screenshot, the value is set to 7, which means the model will look at the previous 7 days of data.
- Lookback Lag: This field defines the number of most recent days to exclude from the lookback window, that is, the delay between the processing date and the most recent data point included in the analysis.
Let's consider the example of the above screenshot:
- Today's date: July 11th, 2024
- Lookback Window Length: 7 days (duration: July 4th - July 10th)
- Lookback Lag: 2 days
With a Lookback Lag of 2, the model will exclude the most recent 2 days of data (July 9th and 10th) from the analysis. This means the model will use data from:
- Start date: July 4th, 2024 (7 days before the processing date; the Lookback Lag does not move the start date)
- End date: July 8th, 2024 (2 days before the original end date of July 10th)
Therefore, the Lookback Lag effectively shortens the lookback window by the specified number of days (from 7 days to 5 days in this example), focusing the analysis on data points further back in time.
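The date arithmetic in this example can be sketched in a few lines of Python. The function name lookback_range and its parameters are hypothetical, and the sketch assumes, as in the example above, that the window starts window_length days before the processing date and that without a lag it ends on the day before the processing date.

```python
from datetime import date, timedelta

def lookback_range(processing_date, window_length, lag=0):
    """Return the inclusive (start, end) dates covered by the lookback.

    Assumption: the lag trims the most recent days and leaves the start date
    unchanged, matching the worked example in this section.
    """
    if window_length <= 0:
        raise ValueError("window_length must be positive")
    if lag < 0 or lag >= window_length:
        raise ValueError("lag must be between 0 and window_length - 1")
    start = processing_date - timedelta(days=window_length)
    end = processing_date - timedelta(days=1 + lag)
    return start, end

start, end = lookback_range(date(2024, 7, 11), window_length=7, lag=2)
days_included = (end - start).days + 1
print(start, end, days_included)  # 2024-07-04 2024-07-08 5
```

With a lag of 0, the same call returns July 4th through July 10th, which is the full 7-day window.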
Fields to be included in the Mapping section depend on the use case. Typically, these fields include:
- Identifier: A field defined as an identifier (e.g., visitor_id).
- Partition: A field defined as a partition (e.g., input_source_event_partition).
- Feature Fields: One or more fields that will be used to create features in a subsequent process.
For LookBack, there are six actions available on the 'Mapping' screen:
- Add - Add is used to select specific fields from the input table.
- Add All - Add All will select all fields from the input table.
- Clear - Clear will clear all selected fields from the mapping canvas.
- Function - Function opens the function editor, which is used to create custom fields.
- Import - Import is used if the client has JSON data available to provide the custom mappings. (Note: Wait 60 seconds to ensure the process of pulling in mappings and labels is complete.)
- Export - Export is used to export the existing mapping schema in .csv format, which can help when editing or manipulating the schema. The updated file can then be used to load an updated schema into the dataset.
Mapping Output
Here is the list of columns shown as mapping output on the screen:
- Order: Specifies the sequence or position of the column in the output.
- Name: The custom or specified name given to the column.
- Function: Defines how the data in the column is manipulated or processed, such as applying aggregation functions like max() and sum(), or performing conditional operations using case statements.
- Identifier: Marks the column as a key field for aggregation purposes.
- Partition: Indicates the column used to partition the data.
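As a rough illustration of how these columns relate to one another, the sketch below turns a small mapping into an aggregate query and runs it with Python's built-in sqlite3 module. It assumes that identifier columns behave like grouping keys, which is one reading of "key field for aggregation" and is not confirmed Syntasa behavior. The mapping structure, the build_query helper, and the visits table are all hypothetical, and the Partition column is left out for brevity.

```python
import sqlite3

# Hypothetical mapping rows mirroring the Order, Name, Function, and Identifier columns.
mapping = [
    {"order": 2, "name": "total_revenue", "function": "sum(revenue)", "identifier": False},
    {"order": 1, "name": "visitor_id", "function": "visitor_id", "identifier": True},
    {"order": 3, "name": "max_pages", "function": "max(pages_viewed)", "identifier": False},
]

def build_query(mapping, table):
    """Build a SELECT whose column order follows Order and whose keys are the identifiers."""
    columns = sorted(mapping, key=lambda m: m["order"])
    select_list = ", ".join(f"{m['function']} AS {m['name']}" for m in columns)
    keys = ", ".join(m["function"] for m in columns if m["identifier"])
    return f"SELECT {select_list} FROM {table} GROUP BY {keys} ORDER BY {keys}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visitor_id TEXT, revenue REAL, pages_viewed INTEGER);
    INSERT INTO visits VALUES ('v1', 10.0, 3), ('v1', 5.0, 7), ('v2', 2.5, 1);
""")

query = build_query(mapping, "visits")
rows = conn.execute(query).fetchall()
print(query)
print(rows)
```

Each output row holds one visitor with the summed revenue and the maximum page count, in the order given by the Order column.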
Filters
Filters provide the ability to filter the dataset (i.e., apply a WHERE and/or HAVING clause) to include only certain data.
Steps to create a filter:
- Toggle on the "Apply Where Filter" or "Apply Having Filter" to enable filter editing.
- The filter editor screen will appear.
- Select the appropriate Left Value from the drop-down list or click "--Function Editor--" to create and apply a custom function.
- Select the appropriate Operator from the drop-down list.
- Select the desired Right Value for the filter from the drop-down list or click "--Function Editor--" to create and apply a custom function.
- Multiple filters can be applied.
- Ensure the proper (AND/OR) logic is applied when adding additional filters if required.
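The difference between the two filter types can be seen in a small sketch, again using Python's built-in sqlite3 module. The visits table, its columns, and the threshold values are hypothetical. The WHERE filter removes individual rows before aggregation, while the HAVING filter removes groups after aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (visitor_id TEXT, country TEXT, revenue REAL);
    INSERT INTO visits VALUES
        ('v1', 'US', 10.0), ('v1', 'US', 5.0),
        ('v2', 'US', 2.5), ('v3', 'DE', 50.0);
""")

filtered = conn.execute("""
    SELECT visitor_id, sum(revenue) AS total_revenue
    FROM visits
    WHERE country = 'US' AND revenue > 0  -- row-level filters combined with AND
    GROUP BY visitor_id
    HAVING sum(revenue) > 5               -- group-level filter on the aggregate
    ORDER BY visitor_id
""").fetchall()
print(filtered)
```

Here v3 is dropped by the WHERE filter because of its country, and v2 is dropped by the HAVING filter because its total revenue is not above 5, so only v1 remains.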
Output
The Outputs tab offers the following capabilities:
- Naming the table and setting its displayed name on the graph canvas.
- Selecting the destination for data loading:
  - Loading to BigQuery (BQ) if using Google Cloud Platform (GCP).
  - Loading to Redshift or RDS if using Amazon Web Services (AWS).
  - Writing to HDFS if using an on-premise Hadoop environment.
Expected Output
The expected output of the Lookback process is the following table within the environment where the data is processed (e.g., AWS, GCP, on-premise Hadoop):
- Table Name: <tb_visitor_history> (default value) - the output table, which uses Syntasa-defined column names.
- Display Name: <Visitor History> (default value) - the name displayed for the node on the canvas.
This table can be queried directly using an enterprise-provided query engine.
Additionally, the table can serve as the foundation for building other processes within the Syntasa Composer environment.