When a process runs in the Syntasa application, its transformed output is written to a dataset stored as a Hive table within the Syntasa-managed environment. Prior to the 9.0 release, this was the only option for all application processes: users could not read data from or write data to locations outside the Syntasa environment, and all intermediate and final outputs were managed internally.
With the 9.0 release, Syntasa introduced a new capability called External Dataset, which allows processes to read data from and write output data to external storage locations. This enhancement provides greater flexibility in how data is stored, accessed, and shared across systems.
What’s New in 9.0
The External Dataset feature lets users create a dataset that points to a Hive table located outside the Syntasa environment, in an external storage location such as cloud object storage. Output data no longer has to be persisted only in Syntasa-managed Hive tables.
This change allows Syntasa processes to seamlessly interact with data that already exists outside Syntasa, as well as write transformed output directly to those external locations.
How External Dataset Helps
The External Dataset feature provides several benefits to clients:
- Data portability: Clients can store processed data outside Syntasa, making it easier to share data with other tools, platforms, or teams without additional export steps.
- Reduced data duplication: Instead of copying data into Syntasa-managed storage, clients can directly reference existing external Hive tables.
- Better integration with existing data lakes: Organizations that already maintain data lakes in cloud storage can continue to use them as the source and destination for Syntasa processing.
- Improved control over storage: Clients can manage data lifecycle, retention, and access policies directly in their own cloud storage environments.
Overall, this feature enables Syntasa to fit more naturally into modern data architectures where storage and compute are often decoupled.
Prerequisites
Before using the External Dataset feature, ensure the following prerequisites are met:
- A dataset reference must be created in Syntasa that points to a Hive table located in an external storage location.
- The service account that executes the Syntasa application job must have the required permissions (read and/or write) on the external storage location.
- The external location should already be accessible from the Syntasa runtime environment.
Without the appropriate access permissions, the job execution will fail when attempting to read from or write to the external dataset.
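To illustrate what "a Hive table located in an external storage location" typically looks like, here is a minimal sketch that builds a standard HiveQL CREATE EXTERNAL TABLE statement. It is not taken from Syntasa itself: the helper name external_table_ddl, the table name, the columns, and the bucket path are all hypothetical, and the statement Syntasa actually needs may differ.

```python
def external_table_ddl(table, columns, location, file_format="PARQUET"):
    """Build a HiveQL CREATE EXTERNAL TABLE statement (illustrative only)."""
    if not columns:
        raise ValueError("at least one column is required")
    # Render each (name, type) pair as a backtick-quoted column definition.
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS {file_format}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket path, used only for illustration.
ddl = external_table_ddl(
    "analytics.events_ext",
    [("event_id", "STRING"), ("event_ts", "TIMESTAMP")],
    "gs://example-bucket/events/",
)
print(ddl)
```

The LOCATION clause is what makes the table's data live in external storage, so the service account running the job needs read and/or write permission on that path.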
Compatibility and Limitations
The External Dataset feature does not work across cloud platforms: it can only access the object storage of the cloud on which Syntasa is deployed. This means:
- If Syntasa is deployed on GCP, it can read data from and write data to Google Cloud Storage (GCS) only.
- If Syntasa is deployed on AWS, it can read data from and write data to Amazon S3 only.
Cross-cloud access (for example, reading from S3 when running on GCP) is not supported yet.
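The platform rule above can be expressed as a simple pre-flight check on the storage URI. The sketch below is only an illustration under stated assumptions: the function name is_location_supported, the platform keys "gcp" and "aws", and the scheme mapping (gs for GCS; s3 and s3a for Amazon S3) are not part of Syntasa.

```python
from urllib.parse import urlparse

# Assumed mapping from deployment platform to reachable URI schemes.
SUPPORTED_SCHEMES = {
    "gcp": {"gs"},
    "aws": {"s3", "s3a"},
}


def is_location_supported(platform: str, location: str) -> bool:
    """Return True if the location's URI scheme matches the deployment platform."""
    scheme = urlparse(location).scheme
    return scheme in SUPPORTED_SCHEMES.get(platform.lower(), set())


# Hypothetical bucket paths, used only for illustration.
print(is_location_supported("gcp", "gs://example-bucket/events/"))
print(is_location_supported("gcp", "s3://example-bucket/events/"))
```

With this mapping, the first call accepts a GCS path on a GCP deployment, while the second rejects an S3 path on the same deployment, mirroring the cross-cloud limitation.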