When executing a job in Syntasa, the application provides different processing modes to handle data ingestion and updates efficiently. These modes determine how new data interacts with existing records in the target dataset. Choosing the right mode ensures data consistency and optimizes storage and processing.
Syntasa offers four data processing modes:
- Drop and Replace
- Replace Date Range
- Add New Only
- Add New and Replace Modified
Each mode serves a specific purpose depending on how data is structured and how frequently updates occur. In the following sections, we’ll explore each mode in detail, starting with Drop and Replace.
Drop and Replace
The Drop & Replace mode is designed to completely overwrite an existing dataset with new incoming data during a job execution. This mode is useful when you want to ensure that the latest dataset reflects only the new data and does not retain any previous records.
How Does Drop & Replace Work?
1. Extract Data from Source:
   - The system extracts data from Source A, which could be file-based storage (e.g., S3, GCS) or a database (e.g., MySQL, BigQuery).
   - The extracted data is then processed in an App Process, such as transformation, filtering, or enrichment.
2. Replace Data in Destination (Source B):
   - The processed data is written to Source B (the output table or dataset).
   - Before inserting the new data, the system drops all existing records in the output table, regardless of the date range.
   - Once the old data is deleted, the new dataset fully replaces it (see the sketch after this list).
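To make these two steps concrete, here is a minimal sketch of the equivalent logic in PySpark. This is illustrative only, not Syntasa's internal implementation; the table names source_a_raw and source_b_output, the event_date column, and the date values are hypothetical stand-ins for Source A and Source B.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("drop_and_replace_sketch").getOrCreate()

# Step 1: extract from Source A (a hypothetical raw table here) and apply
# an App Process-style transformation (a simple filter plus enrichment).
incoming = (
    spark.table("source_a_raw")
    .filter(F.col("event_date").between("2025-02-01", "2025-02-03"))
    .withColumn("loaded_at", F.current_timestamp())
)

# Step 2: write to Source B. mode("overwrite") drops every existing row in
# the output table (regardless of date) before inserting the new records.
incoming.write.mode("overwrite").saveAsTable("source_b_output")
```

The key line is the final write: mode("overwrite") clears the entire target table before loading, which is the Drop & Replace behavior described above.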
Example Scenario
Before Job execution:
Assume Source B (the output table) already has the following records:
| Date | Records |
| --- | --- |
| 1st Jan 2025 | Record 1 |
| 2nd Jan 2025 | Record 2 |
| 3rd Jan 2025 | Record 3 |
| 4th Jan 2025 | Record 4 |
| 5th Jan 2025 | Record 5 |
Assume Source A (the incoming data) has the following records:
| Date | Records |
| --- | --- |
| 1st Feb 2025 | New Record 1 |
| 2nd Feb 2025 | New Record 2 |
| 3rd Feb 2025 | New Record 3 |
Job Execution:
- You run a job that fetches new data from Source A for the dates 1st Feb to 3rd Feb 2025.
- The system first drops all existing data in Source B.
- Then, it inserts the new records.
After Job execution:
Now, Source B contains only the newly ingested data, and all old records are lost. For example:
| Date | Records |
| --- | --- |
| 1st Feb 2025 | New Record 1 |
| 2nd Feb 2025 | New Record 2 |
| 3rd Feb 2025 | New Record 3 |
Even if the job were executed for a narrower date range (e.g., 1st Jan to 3rd Jan 2025), the system would still drop all records in Source B, including the 4th and 5th Jan records that fall outside that range, and replace them entirely with the new dataset.
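In the same illustrative PySpark terms as above, the date range scopes only the read, never the drop:

```python
# Continuing the sketch above: the filter narrows what is read from
# Source A, but mode("overwrite") still clears the entire output table,
# including the 4th and 5th Jan rows that fall outside the range.
subset = spark.table("source_a_raw").filter(
    F.col("event_date").between("2025-01-01", "2025-01-03")
)
subset.write.mode("overwrite").saveAsTable("source_b_output")
# source_b_output now holds only the 1st-3rd Jan subset; every other date is gone.
```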
Important Consideration for Drop & Replace Mode
A crucial detail about the Drop & Replace mode is that the existing data in Source B is only dropped if incoming data is available during job execution.
- If the job runs and incoming data exists, the system will drop all existing data in Source B and replace it with the incoming dataset—even if the incoming data is identical to the existing data.
- However, if no incoming data is available (meaning the job does not receive any records to process), then the existing data in Source B remains untouched.
In simple terms:
- ✔ If there is incoming data → Drop existing data and replace it.
- ❌ If there is no incoming data → No dropping, and existing data stays unchanged.
This ensures that data is only removed when a new dataset is provided, preventing unnecessary deletions when no data is available for processing.
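Continuing the illustrative sketch (same SparkSession and imports as above), this guard behaves roughly as follows:

```python
incoming = spark.table("source_a_raw").filter(
    F.col("event_date").between("2025-02-01", "2025-02-03")
)

# Guard clause: overwrite only when the job actually received records.
if incoming.limit(1).count() > 0:
    # Incoming data exists: drop everything in Source B and replace it.
    incoming.write.mode("overwrite").saveAsTable("source_b_output")
else:
    # No incoming data: Source B is left exactly as it was.
    print("No incoming data; existing records in source_b_output are untouched.")
```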
Note: The Drop & Replace mode is generally not recommended for Production Environments as this mode drops all existing data in the target table, which can lead to significant data loss and disruption if used unintentionally.
When Drop & Replace is selected in a production environment, the system displays a confirmation prompt outlining the implications before proceeding. To continue, the user must manually type "DROP AND REPLACE" to confirm the action.
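The prompt itself is platform behavior, but the safeguard pattern is simple. A hypothetical command-line version of the same check might look like this:

```python
def confirm_drop_and_replace() -> bool:
    """Require the exact confirmation phrase before a destructive run."""
    typed = input('Type "DROP AND REPLACE" to confirm: ')
    return typed.strip() == "DROP AND REPLACE"

if confirm_drop_and_replace():
    print("Confirmed: proceeding with Drop & Replace.")
else:
    print("Confirmation failed: job aborted, no data dropped.")
```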
When to Use the "Drop & Replace" Process Mode?
The Drop & Replace mode is best suited for scenarios where a complete refresh of the data is required. It can be used in the following cases:
- Development & Testing: When constructing new schemas, testing concepts, or running jobs multiple times, Drop & Replace ensures that tables always reflect the latest structure. This is especially useful in interactive development workflows, where frequent changes require a clean dataset with each execution.
- Production Reprocessing: Use this mode when a full reprocess is necessary, such as when there’s a fundamental change in the data source or processing logic that requires recalculating and rewriting the entire dataset.
- Schema Updates: If the schema of a table changes, executing a job with Drop & Replace removes the outdated data and loads new data that conforms to the updated schema.
- Non-Persistent or Easily Reprocessed Data: When the existing data does not need to be retained or can be quickly regenerated, running the pipeline once with Drop & Replace can be an effective approach.
Key Takeaways
If Drop & Replace must be used, the following points should be considered:
- Permanent Data Deletion: Once dropped, the data cannot be recovered unless a backup or archival mechanism is in place. Be absolutely certain that you want to permanently delete the existing data in the target table.
- Prefer Development Environments: This mode is best suited for development where data loss is less critical, and datasets can be easily regenerated.
- Use in Production Only When Absolutely Necessary: If a full reprocess is required in production, ensure all stakeholders are aware of the impact and approve the action.
- Consider "Replace Date Range" Instead: If only a specific partition of data needs to be refreshed, using "Replace Date Range" might be a more appropriate and less disruptive alternative.