The Drop & Replace mode is designed to completely overwrite an existing dataset with new incoming data during a job execution. This mode is useful when you want to ensure that the latest dataset reflects only the new data and does not retain any previous records.
How Does Drop & Replace Work?
- Extract data from the input source (B):
  - The system extracts data from the input source (B) (e.g., S3, GCS, MySQL, BigQuery).
  - This data is then processed through application logic such as transformation, filtering, or enrichment.
- Drop and re-create the output table (A):
  - Before writing the output, the system drops the existing output table (A) entirely.
  - It then re-creates the table (with an updated schema if needed) and inserts the newly processed data.
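The two steps above can be sketched against a relational store. The following is a minimal illustration using Python's built-in sqlite3; the table name, column names, and helper function are hypothetical and not part of the product:

```python
import sqlite3

def drop_and_replace(conn, table, rows):
    """Drop the output table, re-create it, and insert the new rows."""
    cur = conn.cursor()
    # Step 1 output is assumed to be in `rows` (extracted and processed data).
    # Step 2a: drop the existing output table (A) entirely.
    cur.execute(f"DROP TABLE IF EXISTS {table}")
    # Step 2b: re-create the table (schema changes would be applied here).
    cur.execute(f"CREATE TABLE {table} (date TEXT, data TEXT)")
    # Insert the newly processed data extracted from the input source (B).
    cur.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE output_a (date TEXT, data TEXT)")
conn.execute("INSERT INTO output_a VALUES ('2025-01-04', 'opera')")
drop_and_replace(conn, "output_a", [("2025-01-01", "opera")])
print(conn.execute("SELECT * FROM output_a").fetchall())
# -> [('2025-01-01', 'opera')]  (the pre-existing row is gone)
```

Note that because the table is dropped rather than truncated, any rows not present in the incoming data are lost entirely.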
Example
In the example below, we illustrate how data is replaced during ingestion with Drop and Replace mode.
Refer to the data below to understand this example:
- Existing Output Table (A) currently includes partitions for: 1st Jan – 5th Jan 2025
- Input source (B) contains data for the date range: 1st Jan to 3rd Jan 2025
- Job Execution Date Range is defined as: 1st Jan – 5th Jan 2025
- Job Execution Timestamp: January 5th, 2025 20:00:00
Before Job Execution:
Assume the output table (A) already has the following records:
| Date | Data | Modified Timestamp |
| --- | --- | --- |
| 1st Jan 2025 | firefox | 5th Jan 2025 19:22:22 |
| 2nd Jan 2025 | opera | 5th Jan 2025 19:22:22 |
| 3rd Jan 2025 | chrome | 5th Jan 2025 19:22:22 |
| 4th Jan 2025 | opera | 5th Jan 2025 19:22:22 |
| 5th Jan 2025 | opera | 5th Jan 2025 19:22:22 |
Assume the input source (B) has the following records:
| Date | Data |
| --- | --- |
| 1st Jan 2025 | opera |
| 2nd Jan 2025 | chrome |
| 3rd Jan 2025 | opera |
After Job Execution:
After executing the job in Drop and Replace Mode, the output table will be updated as follows:
| Date | Data | Modified Timestamp |
| --- | --- | --- |
| 1st Jan 2025 | opera | 5th Jan 2025 20:00:00 |
| 2nd Jan 2025 | chrome | 5th Jan 2025 20:00:00 |
| 3rd Jan 2025 | opera | 5th Jan 2025 20:00:00 |
Explanation of Changes:
- Before ingesting new records, the existing output table (A) was dropped, removing all existing data.
- Records were then inserted for the date partitions available in the input source (1st Jan – 3rd Jan 2025) that fell within the execution date range (1st Jan – 5th Jan 2025).
- The modified timestamp for all partitions was updated to the execution timestamp, since all data was replaced after the table drop.
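The before/after state in this example can be reproduced end to end. Below is a sketch using Python's sqlite3; the table name, column names, and ISO date formatting are illustrative assumptions, not the product's actual storage layer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Output table (A) before the job: partitions for 1st-5th Jan 2025.
cur.execute("CREATE TABLE output_a (date TEXT, data TEXT, modified_ts TEXT)")
before = [("2025-01-01", "firefox"), ("2025-01-02", "opera"),
          ("2025-01-03", "chrome"), ("2025-01-04", "opera"),
          ("2025-01-05", "opera")]
cur.executemany("INSERT INTO output_a VALUES (?, ?, '2025-01-05 19:22:22')",
                before)

# Input source (B): records for 1st-3rd Jan 2025 only.
incoming = [("2025-01-01", "opera"), ("2025-01-02", "chrome"),
            ("2025-01-03", "opera")]
execution_ts = "2025-01-05 20:00:00"

# Drop & Replace: drop the output table, re-create it, insert incoming rows,
# stamping every row with the execution timestamp.
cur.execute("DROP TABLE output_a")
cur.execute("CREATE TABLE output_a (date TEXT, data TEXT, modified_ts TEXT)")
cur.executemany("INSERT INTO output_a VALUES (?, ?, ?)",
                [(d, v, execution_ts) for d, v in incoming])
conn.commit()

for row in cur.execute("SELECT * FROM output_a ORDER BY date"):
    print(row)
# Only 1st-3rd Jan remain, all stamped with the execution timestamp;
# the 4th and 5th Jan partitions were lost with the drop.
```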
Important Consideration for 'Drop & Replace' Process Mode
A crucial detail about the Drop & Replace mode is that the existing data in the output table (A) is only dropped if incoming data is available during job execution.
- If the job runs and incoming data exists, the system will drop all existing data in the output table (A) and replace it with the incoming dataset—even if the incoming data is identical to the existing data.
- However, if no incoming data is available (meaning the job does not receive any records to process), the existing data in the output table (A) remains untouched.
In simple terms:
- ✔ If there is incoming data → Drop existing data and replace it.
- ❌ If there is no incoming data → No dropping, and existing data stays unchanged.
This ensures that data is only removed when a new dataset is provided, preventing unnecessary deletions when no data is available for processing.
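This conditional behaviour amounts to a simple guard: the drop only happens when the job actually receives records. A minimal sketch in Python with sqlite3 (the function, table, and column names are illustrative assumptions):

```python
import sqlite3

def drop_and_replace_if_data(conn, table, incoming_rows):
    """Drop and replace the output table only when incoming data exists."""
    if not incoming_rows:
        # No incoming data: no dropping, existing data stays unchanged.
        return False
    cur = conn.cursor()
    cur.execute(f"DROP TABLE IF EXISTS {table}")
    cur.execute(f"CREATE TABLE {table} (date TEXT, data TEXT)")
    cur.executemany(f"INSERT INTO {table} VALUES (?, ?)", incoming_rows)
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE output_a (date TEXT, data TEXT)")
conn.execute("INSERT INTO output_a VALUES ('2025-01-01', 'firefox')")
conn.commit()

# Empty input: nothing is dropped, the existing row survives.
print(drop_and_replace_if_data(conn, "output_a", []))
# -> False

# Non-empty input: the table is dropped and replaced.
print(drop_and_replace_if_data(conn, "output_a", [("2025-01-01", "opera")]))
# -> True
```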
Note: The Drop & Replace mode is generally not recommended for Production Environments as this mode drops all existing data in the target table, which can lead to significant data loss and disruption if used unintentionally.
The system displays a confirmation prompt when selecting Drop & Replace in a production environment, outlining its implications before proceeding. To continue, the user must manually enter "DROP AND REPLACE" to confirm the action.
When to Use the "Drop & Replace" Process Mode?
The drop-and-replace mode is best suited for scenarios where a complete refresh of data is required. It can be used in the following cases:
- Development & Testing: When constructing new schemas, testing concepts, or running jobs multiple times, Drop & Replace ensures that tables always reflect the latest structure. This is especially useful in interactive development workflows, where frequent changes require a clean dataset with each execution.
- Production Reprocessing: Use this mode when a full reprocess is necessary, such as when there’s a fundamental change in the data source or processing logic that requires recalculating and rewriting the entire dataset.
- Schema Updates: If the schema of a table changes, executing a job with Drop & Replace removes the outdated data and loads new data that conforms to the updated schema.
- Non-Persistent or Easily Reprocessed Data: When the existing data does not need to be retained or can be quickly regenerated, running the pipeline once with Drop & Replace can be an effective approach.
Key Takeaways
If Drop & Replace must be used, the following points should be considered:
- Permanent Data Deletion: Once dropped, the data cannot be recovered unless a backup or archival mechanism is in place. Be absolutely certain that you want to permanently delete the existing data in the target table.
- Prefer Development Environments: This mode is best suited for development where data loss is less critical, and datasets can be easily regenerated.
- Use in Production Only When Absolutely Necessary: If a full reprocess is required in production, ensure all stakeholders are aware of the impact and approve the action.
- Consider "Replace Date Range" Instead: If only a specific partition of data needs to be refreshed, using "Replace Date Range" might be a more appropriate and less disruptive alternative.