The Input screen provides Syntasa with crucial details needed to understand the source connection path of the files and specific information about the files scheduled for ingestion.
The following four fields are essential for configuring the From File processor, enabling precise identification and handling of files from the source connection:
- Source Path: Specifies the folder within the source connection containing the files, excluding the bucket or directory name defined in the connection.
- Source File Pattern: Defines a file name pattern for selecting specific files from a connection that holds multiple files. This is especially useful when dealing with compressed files like .tar or .zip, which may contain several files, including raw data and supporting enrichment data.
- File Type: Specifies the type of file to process, such as Tar, Textfile, Zip, Parquet, or Avro.
- Compression Type: Indicates whether the file is compressed. Options include Gzip, Lz4, Bzip2, Snappy, or "none" for uncompressed files.
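For illustration, selecting files with a source file pattern can be sketched using Python's fnmatch module. The file names and the pattern below are hypothetical, and Syntasa's actual pattern-matching rules may differ:

```python
import fnmatch

# Hypothetical listing of the Source Path inside a connection bucket
files = [
    "clicks_2025-01-07.tsv.gz",
    "clicks_2025-01-08.tsv.gz",
    "lookup_countries.csv",
]

# A Source File Pattern that selects only the raw click files
pattern = "clicks_*.tsv.gz"

selected = [f for f in files if fnmatch.fnmatch(f, pattern)]
print(selected)  # ['clicks_2025-01-07.tsv.gz', 'clicks_2025-01-08.tsv.gz']
```

Only the two click files match the pattern; the enrichment lookup file is skipped.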
Auto Configure
After specifying the source file path and pattern, the Auto Configure feature can be used to automatically populate the remaining fields. It identifies the file based on the provided path and pattern and completes the relevant fields accordingly.
Validate
After specifying the source file path and pattern, click the Validate link to confirm whether the system can detect the file(s) based on the provided configuration. If successful, the detected file names are displayed without errors. Otherwise, any issues are highlighted in the Validation section.
Events
The Events section captures details about the event data, such as its format, quote character, and escape character.
- Event Data Format: Defines the format of the event data, selected from a dropdown menu:
- Delimited: Requires a Field Delimiter (e.g., \t for tab-delimited data).
- Apache Log: Requires an Apache Log Format and a Regex Pattern.
- Hive Text: Requires a Serde and Serde Properties.
- Magic Pen: The magic pen icon enables users to preview event data by specifying the event data format and field delimiter values. The preview provides a raw data table view, helping users determine the appropriate event data format parameters from the previewed data.
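As an illustration of the Delimited format, a tab-delimited event row can be parsed with Python's csv module. The sample record and its fields are invented for this sketch:

```python
import csv
import io

# One hypothetical tab-delimited event record
raw = "2025-01-07\tuser_42\tpage_view\t/home\n"

# Field Delimiter = \t, as it would be entered on the Input screen
reader = csv.reader(io.StringIO(raw), delimiter="\t")
row = next(reader)
print(row)  # ['2025-01-07', 'user_42', 'page_view', '/home']
```

Choosing the wrong delimiter would yield a single unsplit column, which is exactly the kind of mistake the magic pen preview helps catch.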
Contains Header (Toggle)
Depending on the file type, a Contains Header toggle is displayed to indicate whether the first row contains column headers.
- For Parquet or Avro files, where column names are always embedded in the file, this toggle is not displayed.
- For file types like CSV, the toggle is required to specify whether the first row contains headers, and it is displayed accordingly.
Additionally, when the Contains Header toggle is displayed and enabled, the Auto-Fill option appears in the Schema screen to automatically detect and populate column names. If the toggle is disabled, the Auto-Fill option is not shown.
Incremental Load (Toggle)
When working with data ingestion, the choice to enable or disable the Incremental Load toggle depends on how the data is organized and delivered:
- Non-Partitioned Data: If the data is not partitioned and you need to ingest a single, static file, you can disable the 'Incremental Load' toggle. This ensures the ingestion process targets only that specific file.
- Partitioned Data: If the data is partitioned based on daily or hourly intervals, enabling the 'Incremental Load' toggle is recommended. This feature allows the system to automatically detect and ingest new partitioned files as they are delivered within these intervals. By doing so, the system processes only the most recent data without reprocessing existing files.
Enabling the toggle displays additional fields that identify the date pattern, so that only files matching the job execution dates are picked up for ingestion. The details of the populated fields are as follows:
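Conceptually, the incremental filter keeps only the files whose embedded date matches the job execution date. A minimal sketch, with hypothetical file names and a yyyyMMdd-style stamp:

```python
from datetime import date

# Hypothetical files delivered to the source path
files = [
    "data_20250106.txt",
    "data_20250107.txt",
    "data_20250108.txt",
]

# Date stamp for the job execution date, e.g. 2025-01-07
job_date = date(2025, 1, 7)
stamp = job_date.strftime("%Y%m%d")

# Only files carrying the execution date are picked up
picked = [f for f in files if stamp in f]
print(picked)  # ['data_20250107.txt']
```

Files from other days remain untouched, so reruns do not reprocess existing data.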
Load Type
This dropdown allows you to specify how files should be selected based on their timestamp. It provides two options:
- Daily: Select this option if you want files to be filtered based only on the date.
- Hourly: Select this option if you want files to be filtered based on both the date and the hour.
Note: The date or hour must be present in the file name for the system to perform the filtering. To extract the date and hour from file names, the system uses either Index or Regex methods, as explained in the subsequent sections.
Date Parsing
This section provides fields for parsing the date from file names. The fields are as follows:
- Date Pattern: Specifies the format of the date in the file name (e.g., yyyy-mm-dd). This defines how the system interprets the date or time format in the file name.
- Date Extraction Type: Determines how the date is extracted from the file name. It can be one of the following:
  - Index: This method uses character positions (indices) to extract the date. It requires the following fields:
    - Start Index: The starting position of the date in the file name (indexing starts at 0).
    - End Index: The ending position of the date in the file name.
    If the load type is Hourly, additional fields are provided:
    - Hours Start Index: The starting index of the hour component in the file name.
    - Hours End Index: The ending index of the hour component.
    Example - For a file named data_20250107_12.txt, the date (20250107) starts at index 5 and ends at index 12, so you would specify Start Index = 5 and End Index = 12. The hour (12) starts at index 14 and ends at index 15, so you would specify Hours Start Index = 14 and Hours End Index = 15.
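The index-based extraction above can be sketched in Python. This sketch assumes the UI's End Index is inclusive, so Python's exclusive slicing needs end + 1:

```python
# Index-based extraction for data_20250107_12.txt (indexing starts at 0)
name = "data_20250107_12.txt"

start_index, end_index = 5, 12    # date component (inclusive bounds assumed)
hours_start, hours_end = 14, 15   # hour component

# Python slices exclude the end position, hence the + 1 adjustment
date_part = name[start_index:end_index + 1]
hour_part = name[hours_start:hours_end + 1]
print(date_part, hour_part)  # 20250107 12
```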
  - Regex: This method uses a regular expression (regex) pattern to extract the date. It requires the following fields:
    - Regex Pattern: Defines the regex string for identifying the date in the file name.
    - Group Number: Specifies the regex group number that contains the date.
    If the load type is Hourly, an additional field is available:
    - Hour Group Number: Identifies the regex group number for extracting the hour.
Example - For a file named data_2025-01-07_12.txt, you might use a regex like data_(\d{4}-\d{2}-\d{2})_(\d{2})\.txt, where group 1 captures the date (2025-01-07) and group 2 captures the hour (12).
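The regex extraction above can be sketched with Python's re module, using the same pattern and file name as the example:

```python
import re

name = "data_2025-01-07_12.txt"
pattern = r"data_(\d{4}-\d{2}-\d{2})_(\d{2})\.txt"

# Group Number = 1 captures the date; Hour Group Number = 2 captures the hour
match = re.match(pattern, name)
print(match.group(1), match.group(2))  # 2025-01-07 12
```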
Date Manipulation
This feature adjusts the partition date by a specified number of days, either forward or backward, relative to the file date. For example, if the file date is 2018-08-01 but the content of the file pertains to 2018-07-31, this functionality allows the two to be aligned. For more details on using this feature, see the article Understanding Data Manipulation in From File Process.
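The adjustment in the example above can be sketched with Python's datetime module; the offset value here is a hypothetical setting for this sketch:

```python
from datetime import date, timedelta

# File is dated 2018-08-01, but its content pertains to the previous day
file_date = date(2018, 8, 1)
offset_days = -1  # negative shifts the partition date backward, positive forward

partition_date = file_date + timedelta(days=offset_days)
print(partition_date)  # 2018-07-31
```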