One of the most fundamental tasks in any data workflow is extracting data from a source database. The Syntasa From DB process provides a streamlined and powerful way to accomplish this. This guide will focus on the simplest extraction method: creating a non-partitioned output.
A non-partitioned extraction takes a complete snapshot of a source table and copies the entire dataset into a single output location within the Syntasa environment. This method is ideal for small-to-medium-sized tables, or for situations where you need to analyze or process the entire dataset as one unit.
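To make the distinction concrete, here is a minimal PySpark sketch contrasting the two storage layouts. It is an illustration only, not how Syntasa implements the process; the sample data, column names, and paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layout-illustration").getOrCreate()

# Hypothetical sample data standing in for a source table.
df = spark.createDataFrame(
    [(1, "widget", "2024-01-01"), (2, "gadget", "2024-01-02")],
    ["product_id", "product_name", "load_date"],
)

# Non-partitioned: one complete snapshot written to a single output location.
df.write.mode("overwrite").parquet("/tmp/output/tb_product")

# Partitioned (shown only for contrast): the same rows split into load_date=... sub-folders.
df.write.mode("overwrite").partitionBy("load_date").parquet("/tmp/output/tb_product_by_date")
```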
Prerequisites
Before you begin, ensure you have the following set up in your Syntasa environment:
- A Configured Database Connection: You must have an existing and tested connection to your source database (e.g., PostgreSQL, SQL Server, Amazon Athena). This is created and managed under Resources > Connections. To learn more about connections, see the article Getting Started with Connection.
- An Application Workflow: You need an existing app, or a new one, in which to build your data extraction workflow.
Step-by-Step Guide to Non-Partitioned Extraction
Follow these steps to configure the 'From DB' process to pull data into a non-partitioned output table:
Step 1: Add and Configure the Database Connection
- First, navigate to your application's development workflow canvas.
- From the component palette on the left, find and drag the Database connection component onto the canvas.
- Click on the new Database connection node to open its configuration panel.
- From the dropdown menu, select the specific database connection you previously created under Resources > Connections.
- Click the tick icon to confirm your selection and save the node configuration.
Step 2: Add and Connect the 'From DB' Process
- Next, you will add the 'From DB' process itself.
- Find the From DB app process in the left component palette and drag it onto the canvas.
- Click and drag a line from the output port of your database connection node to the input port of the 'From DB' node. This link tells the process which database to use for the extraction.
Step 3: Configure the 'From DB' Process
Click the From DB component on the canvas to open its configuration panel. You will see several tabs on the left. Here is how to configure each for a non-partitioned load:
- General Screen: Here you can optionally change the display name of the process as it appears on the workflow canvas.
- Input Screen: This screen is critical for defining the source data.
  - Table Name: In this required field, enter the exact name of the table you want to extract (e.g., tb_product).
  - Incremental Load Toggle: Ensure this toggle is turned off. Keeping it disabled instructs Syntasa to perform a full table dump, which is what we want for a non-partitioned output.
- Schema Screen: This screen defines the table's structure and lets you control which columns to extract.
  - Leave the Time Source Column field blank. This field is used for time-based partitioning, so it must be empty for a non-partitioned extraction.
  - To extract all columns: Leave this entire screen blank. The process will automatically fetch every column from the source table (the equivalent of a SELECT * query; a conceptual sketch of this full extraction appears after this list).
  - To auto-fill all columns: To automatically load all columns from your source table, click the Auto-fill link. This action will populate a list of every column, along with its data type. After the columns appear, you can modify the list by reordering, renaming, or removing columns.
  - To extract specific columns: Enter the exact names of the columns you want. Only the columns you list here will be extracted. Important: The extraction process will fail if you specify a column name that does not exist in the source table.
- Output Screen: Here you can manage the output settings.
  - You can select the file format for the stored data (e.g., Parquet, ORC, Avro). Parquet is selected by default.
  - You can rename the output table that is created.
  - You can also view the cloud storage location where the output data will be saved.
- Once you have reviewed the settings, save the configuration and close the panel.
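For readers who think in code, the following is a minimal PySpark sketch of what this configuration amounts to conceptually: a full-table read (the SELECT * equivalent) written to a single, non-partitioned Parquet dataset. Syntasa performs the actual extraction for you; the connection URL, credentials, and output path below are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("from-db-non-partitioned-sketch").getOrCreate()

# Full table dump: with Incremental Load off and the Schema screen left blank,
# every row and column of the source table is read (the SELECT * equivalent).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")  # hypothetical connection
    .option("driver", "org.postgresql.Driver")                  # JDBC driver jar must be on the classpath
    .option("dbtable", "tb_product")                             # Table Name from the Input screen
    .option("user", "syntasa_reader")                            # hypothetical credentials
    .option("password", "********")
    .load()
)

# Non-partitioned output: the entire snapshot is written to one dataset location
# in the selected format (Parquet by default), with no partition columns.
source_df.write.mode("overwrite").parquet("gs://your-bucket/app/output/tb_product")
```

Because nothing is partitioned on in the write, the output is a single logical dataset, which is exactly what the configuration above produces.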
Step 4: Run the 'From DB' Process
- Your workflow is now configured to perform a full, non-partitioned extraction.
- Create a new Job to execute the workflow. You can refer to the articles below to learn more about creating and executing a job:
- Running the job will trigger the From DB process, which connects to your database and extracts all data from the specified table into a new dataset in your cloud storage. You can preview the output data by navigating to the Output node and selecting the Preview screen.
Once your non-partitioned data has been successfully extracted, the output dataset is ready for use in downstream processes within Syntasa. You can now connect it to:
- A Transform process to perform SQL-based transformations.
- A Spark processor for custom data manipulation with Scala or Python (a brief example follows this list).
- Other app processes to join it with different datasets or prepare it for analytics and reporting.
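As an illustration of the Spark processor option, here is a minimal PySpark sketch that reads the non-partitioned Parquet output and applies a simple transformation. The storage path and column names are hypothetical, and in an actual Syntasa Spark process the input dataset is typically wired in through the workflow rather than read from a hard-coded path.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("downstream-transform-sketch").getOrCreate()

# Read the non-partitioned Parquet dataset produced by the From DB process.
products = spark.read.parquet("gs://your-bucket/app/output/tb_product")

# Example transformation: normalize product names and keep only active products.
cleaned = (
    products
    .withColumn("product_name", F.lower(F.trim(F.col("product_name"))))
    .filter(F.col("status") == "active")
)

cleaned.show(5)
```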
To run the extraction on the production workflow, deploy the app from development to production, create a production job, and execute it. The output data will then be saved to the production database. For more details on development vs. production workflows and the deployment process, refer to the following articles:
In the next article, we’ll explore how to extract data from a database and store it in a partitioned format using the From DB process.