Generalized Linear Regression – SYNTASA™

Description

The Generalized Linear Regression process type fits a generalized linear model (GLM), a flexible generalization of ordinary linear regression that allows the response variable to have an error distribution other than the normal distribution. The GLM generalizes linear regression by relating the linear model to the response variable through a link function and by allowing the variance of each measurement to be a function of its predicted value. This process type lets the user define the settings and fields to model, so that the user can experiment faster and productionize the model while Syntasa manages job scheduling and execution.
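As a rough illustration of the two ideas in the description (a link function that connects the linear predictor to the mean, and a variance that depends on the predicted value), the following Python sketch uses only the standard library. The function name, coefficients, and feature values are hypothetical and are not part of the Syntasa product; the variance functions are the standard textbook forms, stated up to a constant dispersion factor.

```python
import math

# Inverse link functions: map the linear predictor (eta) back to the mean (mu).
INVERSE_LINKS = {
    "identity": lambda eta: eta,
    "log": lambda eta: math.exp(eta),
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
    "inverse": lambda eta: 1.0 / eta,
    "sqrt": lambda eta: eta ** 2,
}

# Variance functions V(mu) per family, up to a constant dispersion factor.
VARIANCE = {
    "gaussian": lambda mu: 1.0,
    "poisson": lambda mu: mu,
    "binomial": lambda mu: mu * (1.0 - mu),
    "gamma": lambda mu: mu ** 2,
}


def predict_mean(coefficients, intercept, features, link):
    """Return the predicted mean for one row of feature values."""
    eta = intercept + sum(c * x for c, x in zip(coefficients, features))
    return INVERSE_LINKS[link](eta)


# Hypothetical coefficients and feature values, for illustration only.
mu = predict_mean([0.5, -0.25], 1.0, [2.0, 4.0], link="log")
print(mu)                       # predicted mean under a log link
print(VARIANCE["poisson"](mu))  # Poisson variance equals the predicted mean
```

With the identity link and the gaussian variance function, the same sketch reduces to the prediction step of ordinary linear regression.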


Process Configuration

Algorithm Details

Family - select one of four distribution families for the response variable:
- binomial - models the number of successes in a sample of size n drawn with replacement from a population of size N; typically used for binary (yes/no) or proportion responses
- gamma - a two-parameter family of continuous probability distributions; typically used for positive, right-skewed continuous responses
- gaussian - the normal distribution; typically used for continuous responses, and with the identity link it corresponds to ordinary linear regression
- poisson - discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space when the events occur at a known constant rate and independently of the time since the last event; typically used for count responses
Link - select a link function from the drop-down; the options available depend on the family selected:
- binomial: cloglog, logit, probit
- gamma and gaussian: identity, inverse, log
- poisson: identity, log, sqrt
Iterations - number of iterations the fitting process should run
Regularizations - numeric value from 0 to 1 that sets how strongly the model is penalized in order to prevent overfitting
Cross Validate - toggle on to have the model cross-validated
- Folds - number of folds (k) to use for k-fold cross-validation
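The settings above can be sketched as a small validation routine. This is only an illustration of how the family and link options constrain each other, not Syntasa's implementation: the function names, the dictionary keys, the minimum of two folds, and the round-robin fold assignment are all assumptions made for the example.

```python
# Link functions listed above for each family.
ALLOWED_LINKS = {
    "binomial": {"cloglog", "logit", "probit"},
    "gamma": {"identity", "inverse", "log"},
    "gaussian": {"identity", "inverse", "log"},
    "poisson": {"identity", "log", "sqrt"},
}


def validate_settings(family, link, iterations, regularization,
                      cross_validate=False, folds=None):
    """Return a settings dict, or raise ValueError if a value is invalid."""
    if family not in ALLOWED_LINKS:
        raise ValueError(f"unknown family: {family!r}")
    if link not in ALLOWED_LINKS[family]:
        raise ValueError(f"link {link!r} is not available for family {family!r}")
    if not isinstance(iterations, int) or iterations < 1:
        raise ValueError("iterations must be a positive integer")
    if not 0.0 <= regularization <= 1.0:
        raise ValueError("regularization must be between 0 and 1")
    # Assumption: k-fold validation needs at least two folds.
    if cross_validate and (not isinstance(folds, int) or folds < 2):
        raise ValueError("folds must be an integer >= 2 when cross validating")
    return {
        "family": family,
        "link": link,
        "iterations": iterations,
        "regularization": regularization,
        "cross_validate": cross_validate,
        "folds": folds if cross_validate else None,
    }


def k_fold_indices(n_rows, folds):
    """Assign row indices 0..n_rows-1 to `folds` validation sets, round-robin."""
    return [list(range(i, n_rows, folds)) for i in range(folds)]


# Hypothetical settings, for illustration only.
settings = validate_settings("poisson", "log", 25, 0.1,
                             cross_validate=True, folds=3)
print(settings["family"], settings["link"])  # poisson log
print(k_fold_indices(7, 3))                  # [[0, 3, 6], [1, 4], [2, 5]]
```

In k-fold cross-validation, each of the k sets is held out once for validation while the model is fit on the remaining rows.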

Mapping

The mapping screen is used to define the fields that should be included in the model and to mark each field as Label, Feature, Identifier and/or Partitioned. Fields can also be re-ordered and removed from this screen. The following actions are available through the Actions menu.

Actions

Add - add a new field
Add All - add all fields from the input source
Clear - remove all fields
Import - ingest from a file
Export - export to a file
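The field roles defined on the mapping screen can be pictured as a simple data structure. The sketch below is a hypothetical representation, not the format Syntasa imports or exports: the field names are invented, and the rule that a model needs exactly one Label and at least one Feature is an assumption made for the example.

```python
ROLES = {"Label", "Feature", "Identifier", "Partitioned"}

# Hypothetical field names; each field may carry one or more roles.
mapping = [
    {"field": "customer_id", "roles": {"Identifier"}},
    {"field": "event_date", "roles": {"Partitioned"}},
    {"field": "page_views", "roles": {"Feature"}},
    {"field": "visits", "roles": {"Feature"}},
    {"field": "purchases", "roles": {"Label"}},
]


def fields_with_role(mapping, role):
    """Return the names of fields carrying the role, in mapping order."""
    return [m["field"] for m in mapping if role in m["roles"]]


def check_mapping(mapping):
    """Raise ValueError if the mapping breaks the assumed rules."""
    for m in mapping:
        unknown = m["roles"] - ROLES
        if unknown:
            raise ValueError(f"{m['field']}: unknown roles {sorted(unknown)}")
    # Assumption: a regression needs exactly one label and at least one feature.
    if len(fields_with_role(mapping, "Label")) != 1:
        raise ValueError("exactly one Label field is required")
    if not fields_with_role(mapping, "Feature"):
        raise ValueError("at least one Feature field is required")


check_mapping(mapping)
print(fields_with_role(mapping, "Feature"))  # ['page_views', 'visits']
print(fields_with_role(mapping, "Label"))    # ['purchases']
```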


Output

The Output screen is where the table name, display name, and model name are defined, along with the option to "Load to BQ" when using Google Cloud Platform or "Load to Redshift" when using Amazon Web Services. This process type has the following three outputs; make sure the table names are unique to avoid data being overwritten.

learning_metrics
feature_importance
model
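Because reusing a table name can cause data to be overwritten, it may help to check the chosen names before saving the process. The following sketch shows one way to do that outside the product; the table names are hypothetical, and comparing names case-insensitively is an assumption made for the example.

```python
from collections import Counter


def duplicate_table_names(outputs):
    """Return table names used by more than one output, sorted."""
    # Assumption: names that differ only by case or spacing count as the same.
    counts = Counter(name.strip().lower() for name in outputs.values())
    return sorted(name for name, n in counts.items() if n > 1)


# Hypothetical table names for the three outputs.
outputs = {
    "learning_metrics": "glr_learning_metrics",
    "feature_importance": "glr_feature_importance",
    "model": "glr_model",
}
print(duplicate_table_names(outputs))  # []
```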


Expected Output

The expected output of this process type is the model, which is stored in the "Base Path", and the learning_metrics and feature_importance outputs, which are stored in the "Location". Both paths are shown on the Output screen. Loading to BQ or Redshift makes it easier to query the learning metrics and feature importance.