Description
The Generalized Linear Regression model is a flexible generalization of ordinary linear regression that allows the response variable to have an error distribution other than the normal distribution. The generalized linear model (GLM) relates the linear model to the response variable through a link function and allows the variance of each measurement to be a function of its predicted value. This process type lets you define the settings and fields to model so that you can experiment faster and productionize the model, with Syntasa managing job scheduling and execution.
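The relationship described above can be written compactly. The following is the standard textbook formulation of a GLM, not something specific to Syntasa; the symbols (feature vector x_i, coefficients β, mean μ_i, link g, variance function V, dispersion φ) are notation introduced here for illustration.

```latex
\[
\begin{aligned}
\eta_i &= \mathbf{x}_i^{\top}\boldsymbol{\beta}
  && \text{(linear predictor)} \\
g(\mu_i) &= \eta_i, \qquad \mu_i = \mathbb{E}[y_i]
  && \text{(link function } g \text{ ties the mean to the linear predictor)} \\
\operatorname{Var}(y_i) &= \phi \, V(\mu_i)
  && \text{(variance is a function of the predicted mean)}
\end{aligned}
\]
% Example: poisson family with the log link
\[
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad V(\mu_i) = \mu_i, \qquad \phi = 1
\]
% Example: gaussian family with the identity link (ordinary linear regression)
\[
\mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad V(\mu_i) = 1, \qquad \phi = \sigma^2
\]
```

In the poisson example the variance grows with the predicted mean, while in the gaussian example it stays constant, which is the case ordinary linear regression assumes.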
Process Configuration
Algorithm Details
- Family - select one of four distribution families:
  - binomial - models the number of successes in n independent trials, for example a sample of size n drawn with replacement from a population of size N
  - gamma - a two-parameter family of continuous probability distributions, used to model positive, continuous values that are typically right-skewed
  - gaussian - the normal distribution, used to model continuous values with constant variance; with the identity link this is ordinary linear regression
  - poisson - a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, if these events occur at a known constant rate and independently of the time since the last event
- Link - select a link function from the drop-down; the options available depend on the family selected:
  - binomial: cloglog, logit, probit
  - gamma, gaussian: identity, inverse, log
  - poisson: identity, log, sqrt
- Iterations - provide the number of times the process should iterate
- Regularizations - numeric value from 0 to 1 that controls how strongly the model is penalized for complexity in order to prevent overfitting
- Cross Validate - toggle on to have the model cross-validated
  - Folds - specify the number of folds (k) to use for k-fold cross validation
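As a rough illustration of how the settings above fit together, the sketch below checks a family/link combination against the lists in this section and shows one simple way to partition rows into k folds. The function names `validate_settings` and `k_fold_indices` are hypothetical and are not part of any Syntasa API; the fold assignment (every k-th row) is an assumption chosen for simplicity, and the product may split data differently.

```python
# Link functions offered for each family, as listed in this section.
ALLOWED_LINKS = {
    "binomial": ("cloglog", "logit", "probit"),
    "gamma": ("identity", "inverse", "log"),
    "gaussian": ("identity", "inverse", "log"),
    "poisson": ("identity", "log", "sqrt"),
}


def validate_settings(family, link, iterations, regularization,
                      cross_validate=False, folds=0):
    """Hypothetical check of the Algorithm Details settings; raises ValueError."""
    if family not in ALLOWED_LINKS:
        raise ValueError(f"unknown family: {family!r}")
    if link not in ALLOWED_LINKS[family]:
        raise ValueError(f"link {link!r} is not offered for family {family!r}")
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    if not 0.0 <= regularization <= 1.0:
        raise ValueError("regularization must be between 0 and 1")
    if cross_validate and folds < 2:
        raise ValueError("cross validation needs at least 2 folds")


def k_fold_indices(n_rows, k):
    """Yield (train, validation) lists of row indices, one pair per fold."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    for fold in range(k):
        # Assumed scheme: rows fold, fold + k, fold + 2k, ... are held out.
        validation = indices[fold::k]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


# Example: a poisson model with the log link and 3-fold cross validation.
validate_settings("poisson", "log", iterations=25, regularization=0.1,
                  cross_validate=True, folds=3)
for train_rows, validation_rows in k_fold_indices(6, 3):
    print("train:", train_rows, "validate:", validation_rows)
```

Each row lands in the validation set exactly once, so every fold trains on the remaining rows and is scored on data it has not seen.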
Mapping
The Mapping screen provides the ability to define the fields that should be included in the model and to designate which fields are Label, Feature, Identifier, and/or Partitioned. Fields can be re-ordered and removed from this screen. The following actions are available through the Actions menu.
Actions
- Add - add a new field
- Add All - add all fields from the input source
- Clear - remove all fields
- Import - ingest from a file
- Export - export to a file
Output
The Output screen is where the table name, display name, and model name can be defined, along with the option to "Load to BQ" when using Google Cloud Platform or "Load to Redshift" when using Amazon Web Services. This process type has the following three outputs; ensure that the table names are unique so that data is not overwritten.
- learning_metrics
- feature_importance
- model
Expected Output
The expected outputs of this process type are the model, which is stored in the "Base Path", and the learning_metrics and feature_importance outputs, which are stored in the "Location"; both settings are found on the Output screen. Loading to BQ or Redshift makes it easier to query the learning metrics and feature importance.