Description
The Split process divides your dataset into two parts: a Training dataset and a Testing dataset.
Process Configuration
Two screens need to be configured for this process type:
- Input
- Outputs
Input
- Train Ratio
  - The value can be any number between 0 and 1.
  - Example: a Train Ratio of 0.8 means that 80% of the data goes to Training and the remaining 20% goes to Testing.
- Set Seed
  - The seed value is used for randomization.
  - The value must be an integer (minimum 1, maximum 9999).
  - By default, Syntasa sets the seed value to 1000.
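To illustrate how the two Input settings interact, here is a minimal Python sketch of a seeded train/test split. It is not Syntasa's implementation: the function name split_rows and its parameters are hypothetical, and the sketch simply assumes that rows are shuffled using the seed and then cut at the train ratio.

```python
import random


def split_rows(rows, train_ratio, seed=1000):
    """Hypothetical helper: split rows into (training, testing) lists."""
    # Mirror the limits described above (assumed inclusive for this sketch).
    if not 0 <= train_ratio <= 1:
        raise ValueError("train_ratio must be between 0 and 1")
    if not isinstance(seed, int) or not 1 <= seed <= 9999:
        raise ValueError("seed must be an integer from 1 to 9999")

    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order

    cut = round(len(shuffled) * train_ratio)  # number of training rows
    return shuffled[:cut], shuffled[cut:]


training, testing = split_rows(range(100), train_ratio=0.8)
print(len(training), len(testing))  # 80 20
```

In this sketch, running the split again with the same seed returns the same two groups, while a different seed generally shuffles the rows into a different order before the cut.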
Outputs
The Outputs screen is where the table name can be defined, along with the option to "Load to BQ" when using Google Cloud Platform or "Load to Redshift" when using Amazon Web Services.
Expected Output
The expected output of this process type is two dataset tables, Training and Testing, which are split according to the Train Ratio set on the Input screen. These output datasets are written to tables in the environment where the code was run (i.e., BigQuery or Redshift).