Description
The Decision Tree process creates a model that predicts the value of a target variable from several input variables. You define the settings and the fields to model, which makes it faster to experiment and to productionize the model, with Syntasa managing job scheduling and execution.
Process Configuration
Algorithm Details
- Checkpoint Interval sets how often the cached node IDs, stored as resilient distributed datasets (RDDs), are checkpointed. Setting it too low adds overhead from writing to HDFS; setting it too high can cause problems if executors fail and the RDD must be recomputed. Set the interval to a value >= 1, or to -1 to disable checkpointing.
- Impurity is the measure used to choose the optimal split condition at each node. There are two options:
  - Gini is a measure of how frequently a randomly chosen element from a set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset
  - Entropy measures the uncertainty of the label distribution in a set; information gain is the reduction in entropy achieved by a split
- Max Bins sets the maximum number of bins, formed from the ordered candidate splits, used when discretizing continuous features of large distributed datasets
- Max Depth sets the maximum depth of the tree. A deeper tree has more splits and captures more information about the data, but is also more prone to overfitting
- Min Info Gain specifies the minimum information gain (decrease in impurity) a split must achieve to be considered
- Min Instances Per Node sets the minimum number of instances each child node must contain after a split
- Cross Validate is a toggle that enables cross-validation of the model
- Folds sets the number of subsamples into which the original sample is partitioned for cross-validation
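To make the Impurity, Min Info Gain, and Min Instances Per Node settings more concrete, the following is a minimal Python sketch of how the two impurity measures and the information gain of a candidate split can be computed. It is an illustration only and not Syntasa's implementation; the function names (gini, entropy, evaluate_split) and their parameters are hypothetical, and it assumes a split is accepted when its gain is at least the configured minimum.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared label proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the label proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def evaluate_split(parent, left, right, impurity=gini,
                   min_info_gain=0.0, min_instances_per_node=1):
    """Return (accepted, gain) for a candidate split of parent into left/right.

    Hypothetical helper: the split is rejected when either child is smaller
    than min_instances_per_node or when the gain is below min_info_gain.
    """
    if min(len(left), len(right)) < min_instances_per_node:
        return False, 0.0
    n = len(parent)
    # Impurity of the children, weighted by the share of instances in each.
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    gain = impurity(parent) - weighted
    return gain >= min_info_gain, gain


if __name__ == "__main__":
    parent = ["yes", "yes", "no", "no"]
    left, right = ["yes", "yes"], ["no", "no"]
    print(gini(parent), entropy(parent))
    print(evaluate_split(parent, left, right, impurity=entropy, min_info_gain=0.1))
```

In this sketch, raising min_info_gain or min_instances_per_node rejects more candidate splits, which tends to produce a smaller tree.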
Mapping
The mapping screen is where you define the fields to include in the model and mark each one as Label, Feature, Identifier, and/or Partitioned. Fields can be re-ordered and removed from this screen. The following actions are available through the Actions menu.
Actions
- Add - add a new field
- Add All - add all fields from the input source
- Clear - remove all fields
- Import - ingest from a file
- Export - export to a file
Output
The Output screen is where the table name, display name, and model name are defined, along with the option to "Load to BQ" when using Google Cloud Platform or "Load to Redshift" when using Amazon Web Services. This process type has the following three outputs; make sure the table names are unique so that data is not overwritten.
- learning_metrics
- feature_importance
- model
Expected Output
The expected outputs of this process type are the model, stored in the "Base Path", and the learning_metrics and feature_importance outputs, stored in the "Location"; both paths are found on the Output screen. Loading to BQ or Redshift makes the learning metrics and feature importance easier to query.