The Add New & Replace Modified process mode in Syntasa is designed to efficiently handle data updates by incorporating new date partitions while ensuring that existing date partitions are updated if they have changed. This mode ensures that your dataset remains up to date by checking for both newly arriving data and modifications to previously processed records.
Unlike a full replacement process, which deletes and recreates an entire table, this mode operates incrementally, modifying only the necessary date partitions. It is a preferred option for maintaining historical data integrity while ensuring that updated records are correctly reflected in the output table.
Understanding Partition Metadata
Modern data processing systems, such as BigQuery and Hive tables in Syntasa’s event store, support partitioned tables that allow data to be stored and updated at the partition level. When changes are made to a specific date partition, only the metadata, such as the modified timestamp, is updated for that partition. This metadata is essential for incremental processing workflows, as it helps identify which partitions contain new or updated records. By leveraging this information, Add New & Replace Modified process mode efficiently determines whether a partition should be replaced with fresher data or left unchanged.
How does "Add New & Replace Modified" Work?
This process mode performs two primary functions:
-
Adding New Data: It identifies date partitions in the incoming data that do not exist in the output table and appends them as new entries.
-
Replacing Modified Data: For a date partition that exist in the output table as well as in the input source, it compares the modified timestamps of date partition of both the sources. If the date partition of input source has a newer modified timestamp than the existing date partition, the existing data in the output table for that date partition is overwritten with the updated version.
Note: Even when the input source holds identical data for a given date partition, the system will still overwrite the partition if the input’s modified timestamp is more recent than that of the output. This is because the system bases its decision solely on the modified timestamp, not on the actual content of the records.
Example
In the example below, we explain the scenarios in which data under a date partition in the output table is preserved, overwritten, or newly added. Refer to the data below to understand this example:
- Existing Output Table (A) currently includes partitions for: 1st Jan – 3rd Jan 2025
- New Input source (B) contains data for the date range: 3rd-4th Jan 2025
- Job Execution Date Range is defined as: 1st Jan – 4th Jan 2025
- Job Execution Timestamp: January 5th, 2025 17:15:00
Assume Source A (the output table) already has the following records:
| Record | Partition Date | Modified Timestamp |
| James | 1st Jan 2025 | 3rd Jan 2025 19:22:22 |
| Alex | 2nd Jan 2025 | 3rd Jan 2025 19:22:22 |
| Jay | 3rd Jan 2025 | 3rd Jan 2025 19:22:22 |
Assume Source B (the input source) has the following records:
| Record | Partition Date | Modified Timestamp |
| James | 1st Jan 2025 | 3rd Jan 2025 19:22:22 |
| Siri | 2nd Jan 2025 | 3rd Jan 2025 20:00:22 |
| Jay | 3rd Jan 2025 | 3rd Jan 2025 19:25:00 |
| Rick | 4th Jan 2025 | 4th Jan 2025 10:10:10 |
Job Execution:
After the job completion, the output table is updated as below:
| Record | Partition Date | Modified Timestamp |
| James | 1st Jan 2025 | 3rd Jan 2025 19:22:22 |
| Siri | 2nd Jan 2025 | 5th Jan 2025 17:15:00 |
| Jay | 3rd Jan 2025 | 5th Jan 2025 17:15:00 |
| Rick | 4th Jan 2025 | 5th Jan 2025 17:15:00 |
Explanation of Changes:
- 1st Jan- Unchanged Partitions: The data partition in output table for 1st Jan 2025 (James) remains unchanged since the modified time matches in both sources.
-
2nd Jan - Replacing Modified Data with Updated data: The system compares the modified timestamps of both the date partition. Since the input source has a more recent timestamp for this partition, the system replaces the corresponding partitions in the output table with the updated data.
-
3rd Jan - Replacing Modified Data with Identical data - The input source holds identical data for 3rd Jan 2025 (Jay) but with new modified timestamp. So, the system will still overwrite this date partition as the system bases its decision solely on the modified timestamp, not on the actual content of the records.
-
4th Jan - Adding New Data: Since 4th Jan 2025 does not exist in the output table but is present in the input source, it will be added as a new partition in output table.
Since the partitions for January 2nd, 3rd, and 4th were added or overwritten, their modified timestamps in the output table are updated with the execution timestamp. The modified timestamp for January 1st remains unchanged, as that partition was not overwritten.
When Should You Use the 'Add New & Replace Modified' Process Mode?
This process mode is ideal for environments where data is frequently updated or new records are continuously ingested. This process mode is recommended for most scheduled production jobs. Here's why:
-
Handling Ongoing Data Updates: In production environments, data is continuously updated or newly arriving. This mode efficiently manages such scenarios by adding new data and ensuring modifications to existing records are reflected in the processed output.
-
Maintaining Data Accuracy: By replacing modified records, this mode helps keep processed data aligned with the latest information from the source.
-
Efficiency for Incremental Updates: Instead of reprocessing the entire dataset, this approach focuses only on new and modified data, making it more efficient for scheduled production jobs.
-
Preferred Over "Drop & Replace" in Production: Unlike "Drop & Replace," which deletes the entire target table, this mode preserves valuable historical data, making it the safer choice for production workflows.