The Add New & Replace Modified process mode in Syntasa is designed to efficiently handle data updates by incorporating new records while ensuring that existing records are updated if they have changed. This mode ensures that your dataset remains up to date by checking for both newly arriving data and modifications to previously processed records.
Unlike a full replacement process, which deletes and recreates an entire table, this mode operates incrementally, modifying only the necessary date partitions. It is a preferred option for maintaining historical data integrity while ensuring that updated records are correctly reflected in the output table.
Understanding Partition Metadata
Modern data processing systems, such as BigQuery and Hive tables in Syntasa’s event store, support partitioned tables that allow data to be stored and updated at the partition level. When changes are made to a specific date partition, only the metadata, such as the modified timestamp, is updated for that partition. This metadata is essential for incremental processing workflows, as it helps identify which partitions contain new or updated records. By leveraging this information, Syntasa's "Add New & Replace Modified" process mode efficiently determines whether a partition should be replaced with fresher data or left unchanged.
How does "Add New & Replace Modified" Work?
This process mode performs two primary functions:
-
Adding New Data: It identifies date partitions in the incoming data that do not exist in the output table and appends them as new entries.
-
Replacing Modified Data: For a date partition that exist in the output table as well as in the input source, it compares the modified timestamps of date partition of both the sources. If the date partition of input source has a newer modified timestamp than the existing date partition, the existing data in the output table for that date partition is overwritten with the updated version.
Example:
Assume Source B (the output table) already has the following records:
Record | Partition Date | Modified Timestamp |
James | 1st Jan 2025 | 3rd Jan 2025 19:22:22 |
Alex | 2nd Jan 2025 | 3rd Jan 2025 19:22:22 |
Jay | 3rd Jan 2025 | 3rd Jan 2025 19:22:22 |
Assume Source A (the new incoming data) has the following records:
Record | Partition Date | Modified Timestamp |
James | 1st Jan 2025 | 3rd Jan 2025 19:22:22 |
Siri | 2nd Jan 2025 | 3rd Jan 2025 20:00:22 |
Joe | 3rd Jan 2025 | 4th Jan 2025 19:00:00 |
Rick | 4th Jan 2025 | 4th Jan 2025 10:10:10 |
Job Execution:
On January 5th, 2025, at 17:15:00, you run a job that fetches new data from Source A for the dates 1st Jan 2025 to 4th Jan 2025. After the job completion, the output table is updated as below:
Record | Partition Date | Modified Timestamp |
James | 1st Jan 2025 | 3rd Jan 2025 19:22:22 |
Siri | 2nd Jan 2025 | 5th Jan 2025 17:15:00 |
Joe | 3rd Jan 2025 | 5th Jan 2025 17:15:00 |
Rick | 4th Jan 2025 | 5th Jan 2025 17:15:00 |
Explanation of Changes:
- 1st Jan- Unchanged Partitions: The data partition in output table for 1st Jan 2025 (James) remains unchanged since the modified time matches in both sources.
-
2nd Jan and 3rd Jan - Replacing Modified Data: The system compares the modified timestamps of both the date partition. Since the input source has a more recent timestamp for these two partitions, the system replaces the corresponding partitions in the output table with the updated data.
-
4th Jan - Adding New Data: Since 4th Jan 2025 does not exist in the output table but is present in the input source, it will be added as a new partition in output table.
When Should You Use the 'Add New & Replace Modified' Process Mode?
This process mode is ideal for environments where data is frequently updated or new records are continuously ingested. This process mode is recommended for most scheduled production jobs. Here's why:
-
Handling Ongoing Data Updates: In production environments, data is continuously updated or newly arriving. This mode efficiently manages such scenarios by adding new data and ensuring modifications to existing records are reflected in the processed output.
-
Maintaining Data Accuracy: By replacing modified records, this mode helps keep processed data aligned with the latest information from the source.
-
Efficiency for Incremental Updates: Instead of reprocessing the entire dataset, this approach focuses only on new and modified data, making it more efficient for scheduled production jobs.
-
Preferred Over "Drop & Replace" in Production: Unlike "Drop & Replace," which deletes the entire target table, this mode preserves valuable historical data, making it the safer choice for production workflows.