Create External Dataset – SYNTASA™

Creating a reference to an external dataset in Syntasa is a simple and guided process. The Create Dataset screen allows you to define a dataset that points to a Hive table located outside the Syntasa-managed environment, such as cloud object storage. Once created, this external dataset can be used seamlessly within application processes, just like an internal dataset.

Steps to Create a Dataset Pointing to an External Location

Follow these steps to reach the Create Dataset screen:

Navigate to the Event Store screen that is being used in your application.
Under the Event Store, go to the Dataset tab.
Click the Create (➕) icon.

This action opens the Create Dataset screen.

Create Dataset – Field Details

On the Create Dataset screen, provide the following information:

1. Dataset Name

This is a logical reference name used within the Syntasa environment.
It represents the external dataset but does not create or move any physical data.
Choose a meaningful name so it is easily identifiable when used in application processes.
Once the dataset is created, it cannot be renamed.

2. External Location Path

Enter the path to the external storage location where the Hive table resides or where output data should be written.
This path must be accessible from the Syntasa runtime environment.

3. Development / Production Toggle

Since only one external path is provided during dataset creation, it is important to define which workflow (Development or Production) should write output data to the external location.

Background

In the Syntasa application:

When a job runs in the Development workflow, output data is normally written to the development database path.
When a job runs in the Production workflow, output data is normally written to the production database path.

With External Datasets, you must explicitly specify which workflow should use the external location.

Toggle Behavior

When the toggle is OFF, the dataset is created only in the development database when you click Save.

If the process is executed from the development workflow, the output is written to the external location.
When the application is deployed to the production workflow, a dataset is created in the production database that points only to the internal Syntasa location. In this case, production executions do not write to the external location.

When the toggle is ON, datasets are created in both the development and production databases when you click Save.

The development dataset is configured to use the internal Syntasa location.
The production dataset is configured to use the external location.

When you register a dataset for Production, Syntasa automatically creates a corresponding Development dataset internally using the event store’s default Development database path.
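
The toggle behavior above can be summarized as a small lookup. The following is a minimal sketch in plain Python that only models the rules described in this section. The names DatasetPlan and plan_for_toggle are hypothetical and are not part of any Syntasa API.

```python
from dataclasses import dataclass
from typing import Tuple

INTERNAL = "internal"  # internal Syntasa location
EXTERNAL = "external"  # the external path entered on the Create Dataset screen


@dataclass(frozen=True)
class DatasetPlan:
    """Hypothetical model of where each workflow's dataset points."""
    development: str
    production: str
    created_on_save: Tuple[str, ...]  # workflows whose dataset exists right after Save


def plan_for_toggle(toggle_on: bool) -> DatasetPlan:
    """Return the dataset layout described for the Development / Production toggle."""
    if toggle_on:
        # Toggle ON: both datasets are created on Save; Production uses the external path.
        return DatasetPlan(
            development=INTERNAL,
            production=EXTERNAL,
            created_on_save=("development", "production"),
        )
    # Toggle OFF: only the Development dataset is created on Save and it uses the
    # external path. The Production dataset appears at deployment and is internal.
    return DatasetPlan(
        development=EXTERNAL,
        production=INTERNAL,
        created_on_save=("development",),
    )


if __name__ == "__main__":
    for state in (False, True):
        print("toggle ON" if state else "toggle OFF", plan_for_toggle(state))
```

In both cases exactly one workflow writes to the external location, which is why the toggle has to be set when the dataset is created.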

4. Test Connection

Click Test to validate the external path.
If the path is correct and accessible, the validation succeeds.
Upon successful validation, Syntasa detects:
- Whether data already exists at the path.
- The underlying Hive file format (for example, TextFile, Parquet, or Avro).

Based on this detection, you are prompted to select the appropriate file format for reading or writing data.

5. Schema Definition

After selecting the file format, define the dataset schema using one of the following options:

Get Schema: Automatically populates the schema based on the files present in the external location. Syntasa calls an API that scans the most recently modified file in the external path to extract column names and types. Note that partition information is typically returned only for Text files.
Manual Schema Creation: Add columns manually.
Import Schema: Import schema details from an Excel file.
Export Schema: Export the schema shown in the grid to an Excel file.
Clear: Clears all data from the grid.

This flexibility allows you to either reuse an existing schema or define a new one as per your requirements.
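
To illustrate what "most recently modified file" means for Get Schema, here is a rough local-filesystem analogy in Python. It is not how Syntasa implements the feature. The function names are hypothetical, and reading column names from a header row of a delimited text file is an assumption made only for this example.

```python
import csv
import os
from typing import List


def latest_modified_file(directory: str) -> str:
    """Return the path of the most recently modified regular file in a directory."""
    files = [entry for entry in os.scandir(directory) if entry.is_file()]
    if not files:
        raise FileNotFoundError(f"no files found in {directory}")
    # Pick the entry with the largest modification timestamp.
    return max(files, key=lambda entry: entry.stat().st_mtime).path


def header_columns(path: str, delimiter: str = ",") -> List[str]:
    """Read column names from the first row of a delimited text file (assumed header)."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle, delimiter=delimiter))
```

One practical consequence: if files in the external path do not all share the same columns, the populated schema reflects only the newest file, so it is worth reviewing the grid before you create the dataset.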

6. Create the Dataset

Once all required details are filled in:

Review the dataset configuration.
Click Create.

The external dataset is now created and ready to be used in application processes for reading data from or writing data to the external location.

Identifying Internal or External Datasets in the List

On the Datasets screen, you can view all datasets available within the Event Store.

Datasets are organized by workflow:

Development datasets appear under the Development accordion
Production datasets appear under the Production accordion

Internal Dataset

An internal dataset is automatically created in the Event Store whenever a process is created within an associated application.

Datasets that appear in both Development and Production workflows without any visual indicator (icon or color) represent internal datasets. These datasets point to internal storage.

External Datasets (Green Icon)

Datasets shown with a green icon point to an external location.

When you open such a dataset, the Details & Schema screen differs from that of an internal dataset, reflecting its external configuration.

Internal Dataset with Grey Icon (Mixed Case)

A dataset shown with a grey icon indicates a mixed scenario: the dataset itself points to an internal location, but its corresponding dataset in the other workflow points to an external location.

Example

The order_output dataset under Development is shown with a green icon, indicating that it points to an external location. Its Details & Schema screen also differs from that of an internal dataset.
The same dataset under Production is shown with a grey icon. This means:
- The Production dataset points to an internal location
- Its corresponding dataset in Development points to an external location
The events_output dataset appears in both workflows without any icon, indicating that both Development and Production datasets point to internal locations.
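
The icon rules in this section reduce to a two-input decision. The sketch below restates them in Python purely as a reading aid. The function dataset_icon is hypothetical and does not correspond to any Syntasa API.

```python
from typing import Optional


def dataset_icon(points_external: bool, counterpart_external: bool) -> Optional[str]:
    """Return the icon color described for a dataset, or None when no icon is shown.

    points_external: this dataset points to an external location.
    counterpart_external: the same dataset in the other workflow points externally.
    """
    if points_external:
        return "green"  # external dataset
    if counterpart_external:
        return "grey"   # internal here, external in the other workflow
    return None         # internal in both workflows


if __name__ == "__main__":
    # order_output: external in Development, internal in Production.
    print("order_output / Development:", dataset_icon(True, False))
    print("order_output / Production:", dataset_icon(False, True))
    # events_output: internal in both workflows.
    print("events_output / either workflow:", dataset_icon(False, False))
```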

Athena Integration (AWS Environments)

In AWS environments, datasets that use the Delta file format are automatically integrated with Athena. When the dataset is registered, Syntasa creates an Athena table named <table_name>_athena, enabling users to run SQL queries on the data directly through the AWS Athena console.
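
As a small illustration of the naming rule, the Python sketch below derives the Athena table name and builds a preview query that could be pasted into the Athena console. The helper names are hypothetical, and order_output is used only as an example table name; whether a given dataset uses the Delta format depends on your configuration.

```python
def athena_table_name(table_name: str) -> str:
    """Apply the <table_name>_athena naming rule described above."""
    return f"{table_name}_athena"


def preview_query(table_name: str, limit: int = 10) -> str:
    """Build a simple SELECT statement against the derived Athena table."""
    return f'SELECT * FROM "{athena_table_name(table_name)}" LIMIT {limit}'


if __name__ == "__main__":
    print(athena_table_name("order_output"))
    print(preview_query("order_output", limit=5))
```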
