Ingesting a Zip file in Syntasa follows much the same process as ingesting any other text file, with the additional step of specifying which files inside the Zip need to be picked up for processing. The system first detects the Zip file, and then identifies the data files within it based on the file pattern provided.
If Incremental Load is enabled, the system will look for the Zip file that matches the date provided for the job execution. The date parsing logic ensures that the correct file is selected based on the scheduled execution time. This allows for processing new data files over time, whether on a daily, hourly, or other scheduled basis.
The sections below walk through ingesting a Zip file both without and with Incremental Load.
Ingesting Zip File Without Incremental Load
When you're working with a specific Zip file and Incremental Load is not enabled, you provide the exact file name in the Source File Pattern field and the pattern for the data files inside the Zip in the Event File Pattern field.
Here's how to configure the process:
- Select Zip as File Type: Start by selecting Zip as the file type. This will enable the Event File Pattern field, where you define the pattern for the data files within the Zip file.
- Source File Pattern: In the Source File Pattern field, enter the exact name or pattern for the Zip file you want to ingest. For example: ProductDataset.zip
- Event File Pattern: In the Event File Pattern field, specify the pattern for the data files inside the Zip file. For example, if the Zip file contains multiple CSV files named ProductFileV001.csv, ProductFileV002.csv, etc., use the following pattern: ProductFileV(.*).csv
- Job Execution: When the job is executed:
  - The system first looks for the ProductDataset.zip file.
  - It then extracts and processes all the files inside the Zip that match the ProductFileV(.*).csv pattern.
This configuration is ideal for processing a specific Zip file that contains data files you wish to extract.
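As a rough illustration of how an Event File Pattern selects files inside an archive, the sketch below uses Python's standard zipfile and re modules. It is not Syntasa's implementation: the function names matching_members and build_sample_zip are hypothetical, and the sketch assumes the pattern is treated as a regular expression that must match the whole file name.

```python
import io
import re
import zipfile


def matching_members(zip_bytes: bytes, event_pattern: str) -> list[str]:
    """Return the archive members whose full name matches event_pattern.

    Assumption: the pattern is a regular expression applied to the whole name.
    """
    regex = re.compile(event_pattern)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(name for name in archive.namelist() if regex.fullmatch(name))


def build_sample_zip() -> bytes:
    """Build a small in-memory Zip standing in for ProductDataset.zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ProductFileV001.csv", "id,name\n1,apple\n")
        archive.writestr("ProductFileV002.csv", "id,name\n2,pear\n")
        archive.writestr("README.txt", "not a data file\n")
    return buffer.getvalue()


if __name__ == "__main__":
    # Prints ['ProductFileV001.csv', 'ProductFileV002.csv']
    print(matching_members(build_sample_zip(), r"ProductFileV(.*).csv"))
```

In regular-expression syntax an unescaped . matches any character, so in this sketch the pattern would also accept a name such as ProductFileV001xcsv. Whether Syntasa expects the dot to be escaped is not covered here.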
Ingesting Zip File With Incremental Load
For scenarios where you need to process multiple Zip files over time (e.g., daily or hourly), you can enable Incremental Load. The system then automatically detects and processes new Zip files that match the specified date pattern.
Here's how to configure the process:
- Select Zip as File Type: As with the single file ingestion process, start by selecting Zip as the file type. This will enable the Event File Pattern field.
- Source File Pattern (Dynamic): In the Source File Pattern field, use a dynamic pattern to capture multiple Zip files. For example, if your Zip files are named ProductDatasetV20250101.zip, ProductDatasetV20250102.zip, etc., use this pattern: ProductDatasetV(.*).zip
- Event File Pattern (Dynamic): In the Event File Pattern field, define the pattern for the files inside each Zip archive. If each Zip file contains CSV files such as ProductFileV001.csv, ProductFileV002.csv, etc., use this pattern: ProductFileV(.*).csv
- Job Execution: When the job runs, it will:
  - Use the Incremental Load logic to detect the correct Zip file that matches the date of the job execution. For instance, if the job is scheduled for January 1, 2025, it will pick up ProductDatasetV20250101.zip.
  - Extract and process all the data files within the Zip archive that match the ProductFileV(.*).csv pattern. The combined data from all the matching files will be used for data ingestion corresponding to January 1, 2025.
  - On subsequent job executions (e.g., for January 2, 2025), automatically detect and process the corresponding Zip file, such as ProductDatasetV20250102.zip.
This configuration is useful for scheduling daily or hourly jobs to process new Zip files as they become available, allowing for automated, incremental data ingestion.
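The date-based selection described above can be pictured with a short Python sketch. This is an illustration only, not Syntasa's date parsing logic: the function select_zip_for_date is hypothetical, and it assumes that the text captured by (.*) in the Source File Pattern is the execution date written as yyyyMMdd.

```python
import re
from datetime import date


def select_zip_for_date(
    file_names: list[str],
    source_pattern: str,
    run_date: date,
    date_format: str = "%Y%m%d",  # assumed layout of the date in the file name
) -> list[str]:
    """Return the files whose captured date equals the job execution date."""
    regex = re.compile(source_pattern)
    wanted = run_date.strftime(date_format)
    selected = []
    for name in file_names:
        match = regex.fullmatch(name)
        # Group 1 is the part matched by (.*) in the Source File Pattern.
        if match and match.group(1) == wanted:
            selected.append(name)
    return selected


if __name__ == "__main__":
    available = [
        "ProductDatasetV20250101.zip",
        "ProductDatasetV20250102.zip",
        "Unrelated.zip",
    ]
    pattern = r"ProductDatasetV(.*).zip"
    # Prints ['ProductDatasetV20250101.zip']
    print(select_zip_for_date(available, pattern, date(2025, 1, 1)))
    # Prints ['ProductDatasetV20250102.zip']
    print(select_zip_for_date(available, pattern, date(2025, 1, 2)))
```

Each scheduled run passes its own execution date, so only the archive for that date is selected, and the files inside it are then filtered by the Event File Pattern.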
Handling Tar Files
The process for Tar files is similar to that for Zip files. Select Tar as the file type and follow the same configuration steps: specify the Source File Pattern and the Event File Pattern, and the system will extract and process the matching files accordingly.
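For comparison, the same kind of member matching can be sketched for a Tar archive with Python's standard tarfile module. As before, this is an assumption-laden illustration and not Syntasa's implementation; matching_tar_members and build_sample_tar are hypothetical names, and the pattern is assumed to be a regular expression applied to the whole member name.

```python
import io
import re
import tarfile


def matching_tar_members(tar_bytes: bytes, event_pattern: str) -> list[str]:
    """Return the regular files in the Tar whose full name matches event_pattern."""
    regex = re.compile(event_pattern)
    # "r:*" lets tarfile detect plain or compressed archives on its own.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and regex.fullmatch(member.name)
        )


def build_sample_tar() -> bytes:
    """Build a small in-memory Tar with two data files and one unrelated file."""
    files = {
        "ProductFileV001.csv": "id,name\n1,apple\n",
        "ProductFileV002.csv": "id,name\n2,pear\n",
        "notes.txt": "not a data file\n",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    # Prints ['ProductFileV001.csv', 'ProductFileV002.csv']
    print(matching_tar_members(build_sample_tar(), r"ProductFileV(.*).csv"))
```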