Once you have created a data pipeline of processes, the next step is to create a job to process the data. In this article, we will guide you through creating a job with single and multiple processes, setting the date range for execution, copying jobs, and more. Here is a breakdown of the topics covered in this article:
- How to create a new job?
- Understanding process modes
- Understanding Execution Date Range
- How to copy existing jobs?
Creating a New Job
Before going ahead, it is assumed that you have created an app and built the data pipeline with the required processes on the development Workflow canvas. Before deploying to production, it is recommended to run the job and verify that the data pipeline works as expected. Any issues encountered during job execution can be fixed in development mode; once the app is deployed to production mode, changes can no longer be made there.
Here are the steps to create a new job:
- Open the application
- Go to the 'Workflow' option under the development section from the left side menu.
- Click the 'Job' option shown in the top right corner. (Highlighted in the screenshot below)
- Select the 'Create New' option.
- After clicking the 'Create New' option, you will be shown the 'Create Job' pop-up screen as shown below:
- Let's explore the general fields displayed on this screen:
- Name: This field allows you to assign a name to the job.
- Tags: You can associate any number of tags with the job. For more details on tags, please refer to the 'Tags' section.
- Description: This field provides a brief description of the job's purpose, visible to users.
- Copy Job: This option enables you to copy an existing job from the same application. The job-copying feature is covered separately in the 'Copying the Job' section of this article. Please click here to learn more about this feature.
- The next options on this screen are related to the tasks used for processing the data. When you click the 'Add New' link, it will populate three fields as shown below:
- Process Name (Dropdown): The values in this field are auto-populated and include all the processes available on the workflow canvas. These processes accept input and process incoming data according to the specified configuration.
- Runtime Template (Dropdown): The values in this field are auto-populated and display all the runtime templates available to the logged-in user. The configuration defined in the selected runtime template will be applied when creating the cluster.
- Process Mode (Dropdown): Process modes provide the app with instructions on how to handle the processing of incoming data, such as dropping any data or only processing new data. You can find more information on Process Modes here.
- Once you have selected values for all three fields, you can click 'Apply' to add the step. You can add multiple steps within a job.
- Once the first step is added, you will find a new section called 'Execution Date Range' with date fields. This date range specifies how much data you want to process. More information on 'Execution Date Range' can be found here.
After selecting the date range for processing data, you have two options:
- Save: This option saves the job with the values provided. You can find the saved job by clicking the 'Jobs' option.
- Save + Execute: This option saves the job and also initiates its execution immediately.
Process Modes
Process modes provide the app with instructions on how to handle the processing of incoming data, such as dropping any data or only processing new data. This field includes four values - Drop & Replace, Replace Date Range, Add New & Replace Modified, and Add New Only. Here is the description of each option:
| Process Mode | Description | Recommendation |
|---|---|---|
| Drop & Replace | Permanently deletes the target table (if it exists) and creates a new table. | Use only in Development workflows where new schemas are being constructed and concepts tested, or in Production when it is clear that a full re-process is necessary. |
| Replace Date Range | Permanently deletes only the partitions within the date range and re-creates them. | Use in Development workflows where the schema structure does not change and data outside the processing date range should remain, and for manual Production runs where data must be completely replaced for the selected processing period. |
| Add New & Replace Modified | Processes source data (e.g. raw files, input datasets) not yet processed by the app pipeline for the given date range, and also replaces data that was processed previously but whose modified date has changed since the last run for that date range. | Use for most scheduled Production jobs. |
| Add New Only | Processes only source data (e.g. raw files, input datasets) not yet processed by the app pipeline for the given date range. | Use for scheduled Production jobs where it is guaranteed that previously processed source data will not change. |
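To make the differences concrete, here is a minimal conceptual sketch of how the four modes treat a date-partitioned target table. This is illustrative only, not Syntasa's implementation: the function name, the dict-of-partitions model, and the `modified` parameter are all assumptions made for the example.

```python
def apply_process_mode(mode, target, incoming, date_range, modified=()):
    """Illustrative sketch of the four process modes (not Syntasa code).

    target    : dict mapping partition date -> rows already in the table
    incoming  : dict mapping partition date -> newly processed rows
    date_range: dates selected for this job run
    modified  : previously processed dates whose source data has changed
    """
    # Only partitions inside the selected execution date range are considered.
    in_range = {d: rows for d, rows in incoming.items() if d in date_range}

    if mode == "drop_and_replace":
        target.clear()                      # whole table dropped and rebuilt
        target.update(in_range)
    elif mode == "replace_date_range":
        for d in date_range:                # only in-range partitions recreated
            target.pop(d, None)
        target.update(in_range)
    elif mode == "add_new_and_replace_modified":
        for d, rows in in_range.items():    # new dates, plus modified sources
            if d not in target or d in modified:
                target[d] = rows
    elif mode == "add_new_only":
        for d, rows in in_range.items():    # existing partitions never touched
            if d not in target:
                target[d] = rows
    return target

existing = {"2024-01-01": ["a"], "2024-01-02": ["b"]}
result = apply_process_mode(
    "add_new_only", dict(existing),
    {"2024-01-02": ["B"], "2024-01-03": ["c"]},
    date_range={"2024-01-02", "2024-01-03"},
)
print(result)  # the existing 2024-01-02 partition is kept; only 2024-01-03 is added
```

Running the same call with `"add_new_and_replace_modified"` and `modified={"2024-01-02"}` would additionally overwrite the `2024-01-02` partition, which is why that mode suits most scheduled Production jobs.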
Execution Date Range
When running a job in Syntasa, you can specify the amount of data you want it to process. This is done by setting the date range. Syntasa offers several options for you to choose from:
- Custom Date: Select a specific date range by entering a start date and an end date.
- N Days: Specify the number of days of data you want the job to process. Here's how to use this option:
  - Number of Days: Enter the number of days of data you want to include.
  - Ending on: Choose when the timeframe should end:
    - Yesterday: Process data up to the previous day.
    - Current Date: Process data up to today.
    - Last Available Date: Process data up to the most recent date available in the system (this can be checked under Data Preview).
  - Offset Dates (applicable only for N Days): The 'Offset Dates' toggle allows you to shift the start and end dates of the execution. When the toggle is disabled, the start and end dates are set based on the configured fields. When the toggle is enabled, the 'Move end date by' field becomes active, allowing you to enter a number of days by which to move the start and end dates forward or backward.
    For example, in the screenshot below, the toggle is enabled and the value in the 'Move end date by' field is '-2'. This means the start and end dates will be shifted back by two days from what they would have been if the toggle were disabled.
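The 'N Days' window with an offset can be sketched as plain date arithmetic. The helper below is a hypothetical illustration of the calculation (Syntasa performs it internally); the function name and parameters are assumptions for the example.

```python
from datetime import date, timedelta

def n_days_range(num_days, end="yesterday", offset=0, today=None):
    """Illustrative sketch of the 'N Days' execution date range.

    end    : "yesterday" or "current_date" (the 'Ending on' choice)
    offset : mirrors the 'Move end date by' field; negative values
             shift the whole window back in time.
    """
    today = today or date.today()
    end_date = today - timedelta(days=1) if end == "yesterday" else today
    end_date += timedelta(days=offset)           # apply the offset, if any
    start_date = end_date - timedelta(days=num_days - 1)
    return start_date, end_date

# 7 days ending yesterday, shifted back by 2 days, relative to 2024-06-15:
print(n_days_range(7, end="yesterday", offset=-2, today=date(2024, 6, 15)))
# → (datetime.date(2024, 6, 6), datetime.date(2024, 6, 12))
```

This matches the screenshot scenario above: with the offset of '-2', both the start and end dates land two days earlier than the plain 'ending yesterday' window.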
- Last Week: Process data from the previous week.
- Last Month: Process data from the previous month.
- Month to Date: Process data from the first day of the current month to today.
- Quarter to Date: Process data from the first day of the current quarter to today.
- Year to Date: Process data from the first day of the current year to today.
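The 'to date' options above are simple calendar calculations. As a hedged illustration (the function name and keyword values are assumptions, not Syntasa identifiers), they can be expressed as:

```python
from datetime import date

def to_date_range(kind, today=None):
    """Illustrative sketch of the Month/Quarter/Year to Date options.

    Returns (start, end), where end is always today.
    """
    today = today or date.today()
    if kind == "month_to_date":
        start = today.replace(day=1)
    elif kind == "quarter_to_date":
        # First month of the current quarter: 1, 4, 7, or 10.
        quarter_first_month = 3 * ((today.month - 1) // 3) + 1
        start = today.replace(month=quarter_first_month, day=1)
    elif kind == "year_to_date":
        start = today.replace(month=1, day=1)
    else:
        raise ValueError(f"unknown range kind: {kind}")
    return start, today

print(to_date_range("quarter_to_date", today=date(2024, 8, 20)))
# → (datetime.date(2024, 7, 1), datetime.date(2024, 8, 20))
```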
Copying the Job
If you want to duplicate an existing job, you can use the 'Copy Job' feature available on the 'Create Job' screen. Enabling the 'Copy Job' toggle will display a new field named 'Source', which lists all existing jobs. Upon selecting the desired job from this field, all other information will be populated based on the configuration of the selected job.
Note: You can only copy jobs available within the same application and environment (dev/prod).
In the next article 'Executing a job', we will cover how to run the job that we just created.