Creating a reference to an external dataset in Syntasa is a guided process. The Create Dataset screen lets you define a dataset that points to a Hive table stored outside the Syntasa-managed environment, for example in cloud object storage. Once created, the external dataset can be used within application processes in the same way as an internal dataset.
Steps to Create a Dataset Pointing to an External Location
Follow these steps to reach the Create Dataset screen:
- Navigate to the Event Store screen that is being used in your application.
- Under the Event Store, go to the Dataset tab.
- Click the Create (➕) icon.
This action opens the Create Dataset screen.
Create Dataset – Field Details
On the Create Dataset screen, provide the following information:
1. Dataset Name
- This is a logical reference name used within the Syntasa environment.
- It represents the external dataset but does not create or move any physical data.
- Choose a meaningful name so it is easily identifiable when used in application processes.
- Once the dataset is created, it cannot be renamed.
2. External Location Path
- Enter the path to the external storage location where the Hive table resides or where output data should be written.
- This path must be accessible from the Syntasa runtime environment.
3. Development / Production Toggle
Since only one external path is provided during dataset creation, it is important to define which workflow (Development or Production) should write output data to the external location.
Background
In the Syntasa application:
- When a job runs in the Development workflow, output data is normally written to the development database path.
- When a job runs in the Production workflow, output data is normally written to the production database path.
With External Datasets, you must explicitly specify which workflow should use the external location.
Toggle Behavior
When the toggle is OFF, the dataset is created only in the development database when you click Save.
- If the process is executed from the development workflow, the output is written to the external location.
When the application is deployed to the production workflow, a dataset is created in the production database that points only to the internal Syntasa location. In this case, production executions do not write to the external location.
When the toggle is ON, datasets are created in both the development and production databases when you click Save.
- The development dataset is configured to use the internal Syntasa location.
- The production dataset is configured to use the external location.
When you register a dataset for Production, Syntasa automatically creates a corresponding Development dataset internally using the event store’s default Development database path.
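The toggle behavior described above can be summarized as a small lookup. The following is a minimal sketch for illustration only, not Syntasa code: the names `DatasetLocations`, `resolve_locations`, `INTERNAL`, and `EXTERNAL` are hypothetical and simply restate the rules in this section.

```python
from dataclasses import dataclass

# Hypothetical labels used only in this sketch.
INTERNAL = "internal"  # Syntasa-managed location
EXTERNAL = "external"  # path entered on the Create Dataset screen


@dataclass(frozen=True)
class DatasetLocations:
    """Where each workflow's dataset points once it exists."""
    development: str
    production: str
    # True if the production dataset is created on Save,
    # False if it is created later, when the application is deployed.
    production_created_on_save: bool


def resolve_locations(toggle_on: bool) -> DatasetLocations:
    """Restate the Development / Production toggle rules from this section."""
    if toggle_on:
        # Toggle ON: production writes to the external path,
        # development uses the internal Syntasa location.
        return DatasetLocations(
            development=INTERNAL,
            production=EXTERNAL,
            production_created_on_save=True,
        )
    # Toggle OFF: development writes to the external path,
    # production (created on deployment) stays internal.
    return DatasetLocations(
        development=EXTERNAL,
        production=INTERNAL,
        production_created_on_save=False,
    )


if __name__ == "__main__":
    for state in (False, True):
        print("toggle ON" if state else "toggle OFF", resolve_locations(state))
```

In both cases exactly one workflow writes to the external path, which is why only one external path is requested when the dataset is created.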
4. Test Connection
- Click Test to validate the external path.
- If the path is correct and accessible, Syntasa validates the connection.
- Upon successful validation, Syntasa detects:
  - Whether data already exists at the path.
  - The underlying Hive file format (for example, TextFile, Parquet, or Avro).
Based on this detection, you are prompted to select the appropriate file format for reading or writing data.
5. Schema Definition
After selecting the file format, define the dataset schema using one of the following options:
- Get Schema: Automatically populates the schema from the files present in the external location. The system calls the API, which scans the most recently modified file in the external path to extract column names and types. Note that partition information is typically returned only for Text files.
- Manual Schema Creation: Add columns manually.
- Import Schema: Import schema details from an Excel file.
- Export Schema: Export the schema shown in the grid to an Excel file.
- Clear: Clears the data from the grid.
This flexibility allows you to either reuse an existing schema or define a new one as per your requirements.
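Because Get Schema reads the most recently modified file, it can help to check which file that is before relying on the detected schema. The sketch below illustrates the selection rule on a local directory as a stand-in for the external path. It is an assumption-laden illustration, not the Syntasa API: `latest_modified_file` and `demo` are hypothetical helpers, and the file names are made up.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def latest_modified_file(directory: str) -> Optional[Path]:
    """Return the file in `directory` with the newest modification time.

    Hypothetical helper: a local directory stands in for the external path.
    Returns None when the directory contains no files.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if not files:
        return None
    return max(files, key=lambda p: p.stat().st_mtime)


def demo() -> str:
    """Create two files with different timestamps and pick the newer one."""
    with tempfile.TemporaryDirectory() as tmp:
        older = Path(tmp) / "part-0000.csv"
        newer = Path(tmp) / "part-0001.csv"
        older.write_text("id,name\n")
        newer.write_text("id,name,country\n")
        # Set explicit modification times so the ordering is deterministic.
        os.utime(older, (1_000, 1_000))
        os.utime(newer, (2_000, 2_000))
        return latest_modified_file(tmp).name


if __name__ == "__main__":
    print(demo())
```

If the newest file has a different column layout from older files, as in this made-up example, the populated schema reflects only the newest file, so reviewing the grid after Get Schema is worthwhile.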
6. Create the Dataset
Once all required details are filled in:
- Review the dataset configuration.
- Click Create.
The external dataset is now created and ready to be used in application processes for reading data from or writing data to the external location.
Athena Integration (AWS Environments)
In AWS environments, datasets that use the Delta file format are automatically integrated with Athena. When the dataset is registered, Syntasa creates an Athena table named <table_name>_athena, enabling users to run SQL queries on the data directly through the AWS Athena console.
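As an illustration of the naming rule, the sketch below derives the Athena table name from a dataset's table name and builds a simple query string that could be pasted into the Athena console. The helpers `athena_table_name` and `sample_query`, the database name `my_database`, and the table name `web_events` are hypothetical examples, not values defined by Syntasa.

```python
def athena_table_name(table_name: str) -> str:
    """Apply the <table_name>_athena naming rule described above."""
    return f"{table_name}_athena"


def sample_query(database: str, table_name: str, limit: int = 10) -> str:
    """Build a preview query for the Athena table (names are placeholders)."""
    return (
        f'SELECT * FROM "{database}"."{athena_table_name(table_name)}" '
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Both names below are made-up examples.
    print(sample_query("my_database", "web_events"))
```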