Random Forest – SYNTASA™

Description

The Random Forest process type builds a Random Forest model. You define the desired settings and the fields to model, which lets you experiment faster and then productionize the model, with Syntasa managing the job scheduling and running.

Process Configuration

Algorithm Details

Checkpoint Interval sets how often the node ID cache, stored as resilient distributed datasets (RDDs), is checkpointed. Setting this too low causes extra overhead from writing to HDFS; setting it too high can cause problems if executors fail and the RDD needs to be recomputed. Set the interval to a value >= 1, or to -1 to disable checkpointing.
Impurity is the measure used to choose the best split condition at each node. There are two options:
- Gini is a measure of how frequently a randomly chosen element from a set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
- Entropy selects splits by information gain, the amount of information acquired about one random variable (the label) from observing a different random variable (the feature being split on).
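
As a rough illustration of the two impurity options, the following Python sketch computes Gini impurity, entropy, and the information gain of a candidate split for a small list of labels. The function names and the sample labels are hypothetical and chosen only for illustration; this is not Syntasa's implementation.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: probability that a random element is mislabeled when
    labels are assigned at random according to the set's label distribution."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((count / n) * math.log2(n / count)
               for count in Counter(labels).values())


def information_gain(parent, left, right, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical sample: two "yes" and two "no" labels, split perfectly.
labels = ["yes", "yes", "no", "no"]
print(gini(labels))                                            # 0.5
print(entropy(labels))                                         # 1.0
print(information_gain(labels, ["yes", "yes"], ["no", "no"]))  # 1.0
```

A split that separates the labels perfectly recovers all of the parent's entropy as information gain. A split whose gain falls below the Min Info Gain setting would not be made.
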
Max Bins sets the maximum number of bins used to discretize continuous features when evaluating splits on large distributed datasets.
Max Depth sets the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data.
Min Info Gain specifies the minimum information gain (for example, the decrease in entropy) a split must achieve to be made.
Min Instances Per Node sets the minimum number of instances each child node must have after a split.
Sub Sampling Rate specifies the fraction of the training data sampled to train each tree.
Num Trees defines the number of trees to use in the model.
Feature Subset Strategy defines how many features are considered for a split at each tree node; use it to choose a strategy other than the default 'auto'.
Cross Validate, when toggled on, has the model cross validated.
Folds sets the number of subsamples the original sample is partitioned into for cross validation.
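
To make the Folds setting concrete, the following Python sketch partitions row indices into folds and then forms one train/validation split per fold. It only illustrates the general k-fold idea; the function names, the seed, and the row count are hypothetical, and how Syntasa assigns rows to folds internally is an assumption not covered by this page.

```python
import random


def k_fold_indices(n_rows, folds, seed=0):
    """Partition row indices 0..n_rows-1 into `folds` disjoint, near-equal parts."""
    if folds < 2 or folds > n_rows:
        raise ValueError("folds must be between 2 and n_rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle so folds are not order dependent
    return [indices[i::folds] for i in range(folds)]


def train_validation_splits(n_rows, folds, seed=0):
    """Yield (train, validation) index lists, holding out one fold at a time."""
    parts = k_fold_indices(n_rows, folds, seed)
    for i, validation in enumerate(parts):
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, validation


# Hypothetical example: 10 rows and Folds = 5.
splits = list(train_validation_splits(n_rows=10, folds=5))
for train, validation in splits:
    print(len(train), len(validation))  # 8 2 for every fold
```

Each row is used for validation exactly once and for training in the remaining folds, so a larger Folds value means more model fits and a longer-running job.
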

Mapping

The Mapping screen defines which fields are included in the model and whether each field is a Label, Feature, Identifier, and/or Partitioned field. Fields can be re-ordered and removed from this screen. The following actions are available through the Actions menu.

Actions

Add - add a new field
Add All - add all fields from the input source
Clear - remove all fields
Import - ingest from a file
Export - export to a file

Output

The Output screen is where the table name, display name, and model name are defined, along with the option to "Load to BQ" when using Google Cloud Platform or "Load to Redshift" when using Amazon Web Services. This process type has the three outputs listed below; make sure the table names are unique so that data is not overwritten.

learning_metrics
feature_importance
model

Expected Output

The expected outputs of this process type are the model, stored in the "Base Path", and the learning_metrics and feature_importance outputs, stored in the "Location". Both paths are found on the Output screen. Loading to BQ or Redshift makes the learning metrics and feature importance easier to query.