One of the key capabilities in Syntasa’s data platform is extracting and organizing data efficiently from external databases. The From DB process allows you to not only fetch data but also store it in a partitioned format—ideal for performance, scalability, and incremental processing.
Partitioned output is especially useful when dealing with large datasets or time-based data, as it helps in organizing the data into logical groups (e.g., by date or hour), making downstream processing more efficient and targeted.
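To picture what a date-partitioned layout looks like on storage, the sketch below uses a Spark-style write partitioned by a date column; this is purely conceptual (the From DB process manages partitioning for you), and the path and column names are placeholders, not anything Syntasa requires:

```python
# Conceptual sketch only: the From DB process handles partitioning for you.
# This Spark-style example just illustrates what a date-partitioned layout looks like.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-layout-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "2025-01-01"), (2, "2025-01-01"), (3, "2025-01-02")],
    ["visitor_id", "partition_date"],
)

# Writing with partitionBy produces one folder per date value, e.g.
#   .../partition_date=2025-01-01/
#   .../partition_date=2025-01-02/
df.write.mode("overwrite").partitionBy("partition_date").parquet("/tmp/partition_layout_demo")
```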
In this article, we’ll learn how to extract data from a database and store it in a partitioned format based on date values.
Prerequisites
Before you begin, ensure you have the following set up in your Syntasa environment:
- A Configured Database Connection: You must have an existing, tested connection to your source database (e.g., PostgreSQL, SQL Server, Amazon Athena). This is created and managed under Resources > Connections. To learn more about connections, see the article Getting Started with Connection.
- An Application Workflow: You need an existing app or a new one where you can build your data extraction workflow.
- The source table must contain a column with valid date values to be used for partitioning (see the optional check after this list).
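If you want to confirm the date column before configuring anything in Syntasa, a quick check run outside the platform might look like the sketch below. The connection details and the table/column names (tb_user_logins, login_time) are hypothetical; substitute your own:

```python
# Optional sanity check, run outside Syntasa: confirm the source table has a usable
# date/timestamp column before configuring the From DB process.
# Connection details and table/column names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="your-db-host", dbname="your_db", user="your_user", password="your_password"
)
try:
    with conn.cursor() as cur:
        # Look at a few values from the intended Time Source column.
        cur.execute("SELECT login_time FROM tb_user_logins LIMIT 5;")
        for (login_time,) in cur.fetchall():
            print(login_time)
finally:
    conn.close()
```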
Step-by-Step Guide For Daily Partitioned Output
Follow these steps to configure the 'From DB' process to pull data with daily partitioning in the output table:
Step 1: Add and Configure the Database Connection
- First, navigate to your application's development workflow canvas.
- From the component palette on the left, find and drag the Database connection component onto the canvas.
- Click on the new Database connection node to open its configuration panel.
- From the dropdown menu, select the specific database connection you previously created under Resources > Connections. In this article, for demonstration purposes, we will use a PostgreSQL database connection as the source.
- Click the tick icon to confirm your selection and save the node configuration.
Step 2: Add and Connect the 'From DB' Process
- Next, you will add the From DB process itself.
- Find the From DB process in the left component palette and drag it onto the canvas.
- Click and drag a line from the output port of your database connection node to the input port of the From DB node. This link tells the process which database to use for the extraction.
Step 3: Configure the 'From DB' Process
Click the From DB component on the canvas to open its configuration panel. You will see several tabs on the left. Here is how to configure each for a daily partitioned load:
- General Screen: Here, you can optionally change the display name of the process as it appears on the workflow canvas.
- Input Screen: This screen is critical for defining the source data.
  - Table Name: In this required field, enter the exact name of the table you want to extract (e.g., tb_user_logins).
  - Incremental Load Toggle: Turn this ON to enable partitioning. Turning on incremental load automatically creates a special field named partition_date in the Schema screen.
- Schema Screen: This screen defines the table's structure and lets you control which columns to extract.
  - Enabling Incremental Load in the Input screen automatically adds the special partition_date field to the Schema screen. This field is used to generate partitioned output based on the selected Time Source column, and it can be renamed if needed.
  - Time Source Column: Select the column from your input table that contains the date values to be used for partitioning (e.g., login_time). The selected column's values are used by the partition_date column to create date partitions.
  - Time Format Type: Enter the date-time format that matches the values in the Time Source column (see the sketch after this list for how the format and Time Source column relate to partition_date).
  - Auto-Fill all columns: To automatically load all columns from your source table, click the Auto-fill link. This populates a list of all available columns along with their data types. Once populated, you can customize the list by reordering, renaming, or removing columns as needed.
  - Unlike the non-partitioned approach, where leaving this screen blank fetches all columns and rows, the partitioned setup requires you to explicitly list the columns you want to extract. Since the partition_date column is added automatically, only the columns displayed in this list will be fetched from the source.
  - To extract specific columns: Enter the exact names of the columns you want. Only the columns that you list here will be extracted. Please note that the extraction process will fail if you specify a column name that does not exist in the source table.
- Output Screen: Here you can manage the output settings.
  - Partition Type: Choose the Daily option. This partitions the output data daily; data will be stored in folders like partition_date=2025-01-01/.
  - File Format: Choose from Parquet, ORC, Delta, Avro, or TextFile.
  - Output Table Name: (Optional) Rename the output dataset.
  - Storage Location: View the cloud path where the data will be written.
- Once you have reviewed the settings, save the configuration and close the panel.
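Conceptually, the partition_date value for each row is just the date portion of the Time Source column, parsed using the format you supply in Time Format Type. The sketch below illustrates that relationship in plain Python; the actual format tokens Syntasa expects (for example, a pattern like yyyy-MM-dd HH:mm:ss) may differ from Python's, so treat the names and pattern here as assumptions:

```python
# Illustration only: how a Time Source value maps to a partition_date.
# The From DB process does this internally; the column value and format are assumptions.
from datetime import datetime

time_source_value = "2025-01-01 14:32:05"   # e.g., a login_time value
time_format = "%Y-%m-%d %H:%M:%S"           # Python equivalent of yyyy-MM-dd HH:mm:ss

partition_date = datetime.strptime(time_source_value, time_format).date()
print(partition_date)  # 2025-01-01 -> stored under partition_date=2025-01-01/
```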
Step 4: Run the 'From DB' Process
- Your workflow is now configured to perform a daily, partition-based extraction.
- Create a new Job to execute the workflow. You can refer to the articles below to learn more about creating and executing a job:
- Select the execution date range that defines the period for which you want to extract data. For example, if you select a date range from 1st Jan to 3rd Jan 2025, the job will reference the column specified under the Time Source Column (e.g., login_time) and extract only the records that fall within this range (see the sketch after this list). Suppose the source table contains the following data; running the job for the specified range will fetch only the records where the date matches, such as those with visitor_id from 1 to 9, and store them in the output table.
- You can choose the process mode during job execution based on your specific requirements. To learn more about the available process modes and how they work, refer to the article: Process modes. Currently, the Add New and Replace Modified process mode in the From DB process behaves the same as the Add New Only mode.
- Running the job will trigger the From DB process, which connects to your database and extracts the data for the selected date range from the specified table into a new dataset in your cloud storage.
- You can view the list of generated partitions by navigating to the Output node and opening the State screen.
- You can preview the output data from the most recently generated partition by navigating to the Output node and selecting the Preview screen.
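To make the date-range behaviour concrete, a run for 1st to 3rd Jan 2025 is roughly equivalent to filtering the source rows on the Time Source column, as sketched below. This is illustrative only; the actual query the process issues may differ, and the column and table values shown follow the hypothetical examples above:

```python
# Rough illustration of the date-range filtering applied during a 1-3 Jan 2025 run.
# Column names (visitor_id, login_time) follow the examples used in this article.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("date-range-demo").getOrCreate()

# Stand-in for the source table, e.g. data read via spark.read.jdbc(...).
source_df = spark.createDataFrame(
    [(1, "2025-01-01 08:00:00"), (9, "2025-01-03 23:10:00"), (12, "2025-01-05 10:00:00")],
    ["visitor_id", "login_time"],
)

in_range = source_df.filter(
    (F.col("login_time") >= "2025-01-01 00:00:00")
    & (F.col("login_time") < "2025-01-04 00:00:00")
)
in_range.show()  # keeps visitor_id 1 and 9, drops 12
```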
Once data has been successfully extracted, the output dataset is ready for use in downstream processes within Syntasa. You can now connect it to:
- A Transform process to perform SQL-based transformations.
- A Spark processor for custom data manipulation with Scala or Python (see the example after this list).
- Other app processes to join it with different datasets or prepare it for analytics and reporting.
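For example, a downstream Spark processor could read the daily-partitioned output and restrict the read to specific partitions. This is a minimal sketch: the storage path is a placeholder for the Storage Location shown on the Output screen, and the partition value is illustrative:

```python
# Downstream usage sketch: read the daily-partitioned output in a Spark processor.
# The path below is a placeholder; use the Storage Location from the Output screen.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("read-partitioned-output").getOrCreate()

output = spark.read.parquet("s3://your-bucket/path/to/from_db_output/")

# Filtering on the partition column lets Spark scan only the matching
# partition_date=2025-01-01/ folder (partition pruning).
jan_first = output.filter(F.col("partition_date") == "2025-01-01")
jan_first.show()
```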
To run the execution on the production workflow, simply deploy the app from development to production, create a production job, and execute it. The output data will then be saved to the production database. For more details on development vs production workflows and the deployment process, refer to the following articles:
In the next article, we’ll explore how to extract data from a database and store it in an hourly partitioned format using the From DB process.