The From File process defines the input file naming pattern and schema, and loads the data into a big data environment. A source connection must first be defined on the Connections page; that Connection then needs to be dragged onto the canvas, selected, and connected to the From File process.
Process Configuration
Configuration of this process includes three screens: Input, Schema, and Output.
Input
This section provides the information Syntasa needs to locate the files within the source connection and describes the files to be ingested.
- Source Path - folder within the source connection where the files reside. Do not include the bucket name or any directory already specified in the Connection.
- Source File Pattern - file name pattern used to pull specific raw files from the connection. Keep in mind that the files may be .tar or .zip archives containing multiple files, such as the raw data and supporting enrichment data; this pattern makes it possible to select specific files when a Connection holds multiple source files.
- Event File Pattern - file name of the separate raw events file, if one exists; if not, use the same pattern as the source file. For example, Adobe provides a .tar file named with the report suite and date, and inside that .tar file is the hit_data.tsv file. In this example, the user would enter hit_data.tsv in this field because it is the event file within the source file (see the first sketch after this list).
- File Type - Tar, Textfile, or Zip
- Compression Type - specify whether the file is compressed and, if so, the compression type
- Incremental Load - provides the option to keep previous files or overwrite them, as in the case of a one-time lookup load
- File Name Has Date - specify whether the filenames contain a date; needed if the data source is to be partitioned by date
- Date Pattern - pattern of the date in the filename (e.g. yyyy-MM-dd)
- Date Extraction Type - Regex or Index
- Regex Pattern - regex used to extract the date from the filename
- Group Number - capture group within the regex pattern that contains the date
- Date Manipulation - shifts the partition date by a positive or negative number of days relative to the file date (e.g. the file date is 2018-08-01, but the contents of the file are for 2018-07-31); see the second sketch after this list
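To make the relationship between Source File Pattern and Event File Pattern concrete, here is a minimal Python sketch of the Adobe example above; it is an illustration, not Syntasa's implementation, and the pattern values and function name are assumptions. A regex stands in for the source file pattern that selects the daily .tar archive, and the event file name selects the hit_data.tsv member inside it.

```python
import re
import tarfile

# Illustrative values only; real patterns depend on your source files.
SOURCE_FILE_PATTERN = r".*_\d{4}-\d{2}-\d{2}\.tar$"  # daily archive, e.g. myreportsuite_2018-08-01.tar
EVENT_FILE_PATTERN = "hit_data.tsv"                  # raw events file inside the archive

def read_event_file(archive_path: str) -> bytes:
    """Return the bytes of the event file inside an archive matched by the source pattern."""
    if not re.match(SOURCE_FILE_PATTERN, archive_path):
        raise ValueError(f"{archive_path!r} does not match the source file pattern")
    with tarfile.open(archive_path) as tar:
        for member in tar.getmembers():
            if member.name.endswith(EVENT_FILE_PATTERN):
                return tar.extractfile(member).read()
    raise FileNotFoundError(f"{EVENT_FILE_PATTERN} not found in {archive_path!r}")
```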
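The date-related fields combine in the same way. The sketch below, again an illustration rather than Syntasa code, extracts a date from a filename with a regex and group number, then applies a -1 day shift as in the Date Manipulation example; Python's %Y-%m-%d format stands in for the yyyy-MM-dd style Date Pattern.

```python
import re
from datetime import datetime, timedelta

def extract_partition_date(filename: str, regex_pattern: str,
                           group_number: int, shift_days: int = 0) -> datetime:
    """Extract a yyyy-MM-dd date from a filename and shift it by shift_days."""
    match = re.search(regex_pattern, filename)
    if match is None:
        raise ValueError(f"no date found in {filename!r}")
    file_date = datetime.strptime(match.group(group_number), "%Y-%m-%d")
    return file_date + timedelta(days=shift_days)  # Date Manipulation

# A file dated 2018-08-01 actually holds data for 2018-07-31, so shift by -1.
print(extract_partition_date("myreportsuite_2018-08-01.tar",
                             regex_pattern=r"_(\d{4}-\d{2}-\d{2})\.tar$",
                             group_number=1,
                             shift_days=-1))  # -> 2018-07-31 00:00:00
```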
Schema
This section provides the ability to define the column names of the input file and the time column.
- File Data Format - drop-down menu offering Avro, Delimited, Apache Log, or Hive Text
Delimited requires a Field Delimiter (e.g. \t for tab-delimited)
Apache Log requires an Apache Log Format and a Regex Pattern
Hive Text requires a SerDe and SerDe Properties
- Log Format - comma-separated list of input column headers
- Time Source - column containing the timestamp; required for incremental loading
- Time Format Type - specify how the Time Source field is formatted (see the sketch after this list)
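As a rough illustration of how the Schema settings fit together, the following Python sketch parses one row of a tab-delimited events file; the column names, delimiter, and time format shown are hypothetical assumptions, not Syntasa defaults.

```python
import csv
import io
from datetime import datetime

# Hypothetical Schema settings for a tab-delimited file.
LOG_FORMAT = ["hit_time_gmt", "visitor_id", "page_url"]  # Log Format (column headers)
FIELD_DELIMITER = "\t"                                   # Field Delimiter
TIME_SOURCE = "hit_time_gmt"                             # Time Source column
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"                        # Time Format Type

raw_row = "2018-07-31 23:59:58\tv123\t/home\n"

reader = csv.DictReader(io.StringIO(raw_row),
                        fieldnames=LOG_FORMAT,
                        delimiter=FIELD_DELIMITER)
for row in reader:
    # Parsing the Time Source column is what lets incremental loads
    # order and partition records by time.
    event_time = datetime.strptime(row[TIME_SOURCE], TIME_FORMAT)
    print(event_time, row["visitor_id"], row["page_url"])
```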
Output
The Output tab provides the ability to name the output table and the name displayed on the graph canvas, along with selecting whether to load to BigQuery (BQ) if in the Google Cloud Platform (GCP), load to Redshift or RDS if in Amazon Web Services (AWS), or simply write to HDFS if using on-premise Hadoop.
Expected Output
Data from the source connection files specified in the Input tab of this process is loaded to Hive. This also makes Syntasa aware of the state and schema for downstream usage, and can optionally write to other environment-specific query engines.