Data processing pipelines in Syntasa offer multiple process modes to manage data ingestion and updates efficiently. Each mode handles a different use case, determining whether data is added, updated, or entirely replaced. Choosing the right process mode depends on whether the dataset is partitioned, whether historical data should be retained, and how modifications should be managed.
Syntasa offers four data processing modes:
- Drop & Replace
  This mode removes all existing data from the output table and replaces it with newly processed data. It is ideal when the entire dataset needs to be refreshed, so that only the most recently processed data remains. It works for both partitioned and non-partitioned datasets, making it the most universally applicable mode.
- Replace Date Range
  This mode deletes and replaces data only for a specified date range, leaving the rest of the table intact. It is useful when only certain partitions need to be refreshed, such as when reprocessing historical data or correcting data for a specific period without affecting other partitions. This mode is best suited for partitioned datasets.
- Add New Only
  This mode adds only partitions that do not already exist in the output table. If a partition exists in both the output table and the incoming data, the existing partition remains unchanged. Only new data is processed, which makes this mode efficient for incremental updates to partitioned tables.
- Add New & Replace Modified
  This mode combines two operations: adding new partitions, and updating existing partitions whose modified timestamps indicate a change. New data is appended and changed partitions are refreshed, which maintains data accuracy without unnecessary reprocessing. This mode is ideal for scheduled production workflows that require both historical preservation and up-to-date data.
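The difference between the four modes is easiest to see on a small example. The following Python sketch is purely illustrative and is not Syntasa's API: it models a table as a dictionary that maps a partition date to a `(modified_stamp, rows)` pair, and every function name is hypothetical. It also assumes that the date range is inclusive at both ends and that "modified" means the incoming stamp is newer than the existing one.

```python
from datetime import date

# Hypothetical model, not Syntasa code: a table is a dict mapping
# partition date -> (modified_stamp, rows).

def drop_and_replace(output, incoming):
    """Discard the whole output table and keep only the incoming data."""
    return dict(incoming)

def replace_date_range(output, incoming, start, end):
    """Replace only partitions inside [start, end] (inclusive, by assumption)."""
    result = {d: p for d, p in output.items() if not (start <= d <= end)}
    result.update({d: p for d, p in incoming.items() if start <= d <= end})
    return result

def add_new_only(output, incoming):
    """Add partitions missing from the output; never touch existing ones."""
    result = dict(output)
    for d, p in incoming.items():
        if d not in result:
            result[d] = p
    return result

def add_new_and_replace_modified(output, incoming):
    """Add missing partitions and replace those with a newer modified stamp."""
    result = dict(output)
    for d, (stamp, rows) in incoming.items():
        # Assumption: a larger stamp means the partition was modified later.
        if d not in result or stamp > result[d][0]:
            result[d] = (stamp, rows)
    return result

d1, d2, d3 = date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)
output = {d1: (1, ["a"]), d2: (1, ["b"])}
incoming = {d2: (2, ["b-fixed"]), d3: (1, ["c"])}

print(drop_and_replace(output, incoming))
print(replace_date_range(output, incoming, d2, d2))
print(add_new_only(output, incoming))
print(add_new_and_replace_modified(output, incoming))
```

In this toy data, the partition for January 2 exists on both sides with a newer incoming stamp. Add New Only leaves it as it was, while Add New & Replace Modified refreshes it; both modes add the January 3 partition.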
Note on Process Mode Availability
While all four process modes are available for partitioned data, only Drop & Replace and Replace Date Range are shown for non-partitioned data, for the following reasons:
- Drop & Replace is applicable in all scenarios because it removes and replaces the entire table, making it suitable for both partitioned and non-partitioned datasets.
- Replace Date Range is designed primarily for partitioned data, but when users want to retain the data in existing date partitions and add data for new date partitions, it serves as an alternative to a full table drop.
For datasets without partitions, or when the partitioning feature is turned off, the other two process modes (Add New Only and Add New & Replace Modified) are not applicable, because they rely on partition-level metadata to determine new and modified records.
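The availability rule in this note can be summarized as a small lookup. This is a minimal sketch with a hypothetical function name, not how Syntasa implements its interface; it only restates which modes are offered for partitioned and non-partitioned data.

```python
from typing import List

# Mode names as listed in this article.
ALL_MODES = [
    "Drop & Replace",
    "Replace Date Range",
    "Add New Only",
    "Add New & Replace Modified",
]

# Modes that do not depend on partition-level metadata.
NON_PARTITIONED_MODES = ["Drop & Replace", "Replace Date Range"]

def available_process_modes(is_partitioned: bool) -> List[str]:
    """Return the process modes offered for a dataset (hypothetical helper)."""
    return list(ALL_MODES) if is_partitioned else list(NON_PARTITIONED_MODES)

print(available_process_modes(is_partitioned=True))
print(available_process_modes(is_partitioned=False))
```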