Feature Overview
The Register Dataset feature allows users to manually onboard external data into the Syntasa platform without requiring a standard ingestion job. This is particularly useful for referencing existing data lakes (S3, GCS, HDFS) or pre-processed tables.
Key Capabilities:
- Manual Registration: Point directly to an external file path (S3/GCS/HDFS) to register it as a dataset.
- Environment Linking: Automatically handles the relationship between Development and Production datasets. When registering a Production dataset, the system can automatically provision the corresponding Development definition.
- Athena Integration (AWS): Automatically creates queryable Athena tables for Delta format datasets.
- Process Output: Registered external datasets can be selected as "Existing Outputs" in Process Nodes, allowing jobs to write to these external locations.
Configuration Reference
Registration Form
When creating a new dataset via the Register Dataset UI, the following fields are available:
| Field Name | Description | Options / Constraints |
|---|---|---|
| Dataset Name | The unique identifier for the dataset within the Event Store. | • Alphanumeric, underscores. • Immutable after creation. |
| Event Store | The logical container where the dataset metadata will reside. | Must be selected from available Event Stores. |
| File Format | The format of the underlying data files. | • Delta (triggers Athena table creation in AWS) • Parquet • Avro • Text/CSV • JSON |
| Location (Path) | The physical storage path for the data. | • Must be a valid URI (e.g., s3://bucket/path/). • Validation: the system calls dataset/validate to ensure the path is accessible. |
| Schema Definition | Defines the column structure (Name, Type). | • Auto-Fill: fetches the schema from the latest file at the Location. • Manual: user adds columns manually. • Mandatory: the dataset cannot be saved without a valid schema. |
| Partitioning | Defines how the data is physically organized. | Extracted automatically during Auto-Fill or defined manually. |
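As an illustration of the form constraints above, the fields could be assembled and validated in code before submission. This is a minimal sketch: the field names, the helper function, and its shape are assumptions for illustration, not the documented Syntasa API contract.

```python
# Hypothetical sketch: assembling a Register Dataset payload from the form
# fields described above. Field names are assumptions, not the real contract.
import json

def build_registration_payload(name, event_store, file_format, location,
                               schema, partitions=None):
    """Check the mandatory form constraints and return a payload dict."""
    if not name.replace("_", "").isalnum():
        raise ValueError("Dataset Name must be alphanumeric with underscores")
    if not schema:
        raise ValueError("Schema is mandatory: dataset cannot be saved without one")
    return {
        "datasetName": name,        # immutable after creation
        "eventStore": event_store,
        "fileFormat": file_format,  # Delta triggers Athena table creation on AWS
        "location": location,       # must be a valid URI; checked via dataset/validate
        "schema": schema,           # list of {"name", "type"} columns
        "partitions": partitions or [],
    }

payload = build_registration_payload(
    name="external_sales_data",
    event_store="dev_eventstore",
    file_format="Delta",
    location="s3://my-bucket/sales/",
    schema=[{"name": "order_id", "type": "string"},
            {"name": "amount", "type": "double"}],
    partitions=["event_date"],
)
print(json.dumps(payload, indent=2))
```

The two `ValueError` branches mirror the Dataset Name and Schema Definition constraints from the table; server-side validation would still apply.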
API: Update Metrics
For partitioned tables, metrics (row counts, size) do not update automatically. Use the dataset/update-metrics API to refresh metadata.
| Parameter | Description | Example |
|---|---|---|
| tableName | Name of the registered dataset. | "dev_external_sales_data" |
| database | The Hive/Glue database name. | "dev_eventstore_prod" |
| environment | The target environment. | "PRODUCTION" or "DEVELOPMENT" |
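A call to the dataset/update-metrics API with the parameters above might be prepared as follows. The base URL, bearer-token header, and HTTP method are placeholders; consult your Syntasa deployment for the actual values.

```python
# Sketch of preparing a dataset/update-metrics request. Base URL, auth
# scheme, and method are assumptions for illustration.
import json
import urllib.request

def update_metrics_request(base_url, token, table_name, database, environment):
    """Build (but do not send) the update-metrics POST request."""
    if environment not in ("PRODUCTION", "DEVELOPMENT"):
        raise ValueError("environment must be PRODUCTION or DEVELOPMENT")
    body = json.dumps({
        "tableName": table_name,
        "database": database,
        "environment": environment,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/dataset/update-metrics",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )

req = update_metrics_request("https://syntasa.example.com/api", "TOKEN",
                             "dev_external_sales_data",
                             "dev_eventstore_prod", "PRODUCTION")
print(req.full_url)
# A caller would then execute it with urllib.request.urlopen(req).
```

Calling this after each write to a partitioned external dataset keeps row counts and size metrics current.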
Operational Behavior
Deployment Logic (Dev → Prod)
When deploying an application containing an External Dataset from Development to Production, the system follows specific logic to ensure data safety and continuity:
- Reuse Existing: If a Production external dataset with the same name already exists, the deployment reuses it. It does not overwrite the path or settings.
- Create New: If the Production dataset does not exist, a new one is created using the Event Store's default Production Path (eventStore.productionPath/&lt;datasetName&gt;).
- Snapshots: A new Snapshot dataset is always created using the Snapshot Path (eventStore.snapshotPath/&lt;datasetName&gt;).
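The deployment rules above can be sketched as a small resolution function. The attribute names follow the eventStore.productionPath / eventStore.snapshotPath notation used above; the dataclass shape and function are illustrative, not the platform's internal implementation.

```python
# Illustrative sketch of the Dev -> Prod deployment logic for external
# datasets. Only the three rules mirror the documentation; the rest is assumed.
from dataclasses import dataclass

@dataclass
class EventStore:
    production_path: str  # eventStore.productionPath
    snapshot_path: str    # eventStore.snapshotPath

def resolve_prod_dataset(store, dataset_name, existing_prod):
    """Return (prod_path, snapshot_path) for a deployed external dataset."""
    if dataset_name in existing_prod:
        # Reuse Existing: keep the current path; never overwrite settings.
        prod_path = existing_prod[dataset_name]
    else:
        # Create New: default Production Path under the Event Store.
        prod_path = f"{store.production_path}/{dataset_name}"
    # Snapshots: a new snapshot dataset is always created.
    snapshot_path = f"{store.snapshot_path}/{dataset_name}"
    return prod_path, snapshot_path

store = EventStore("s3://prod-bucket/data", "s3://prod-bucket/snapshots")
# No existing prod dataset -> created under the default Production Path:
print(resolve_prod_dataset(store, "sales", {}))
# Existing prod dataset -> its path is reused untouched:
print(resolve_prod_dataset(store, "sales", {"sales": "s3://legacy/sales"}))
```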
Data Safety & Job Modes
- DROP_REPLACE Protection: Running a job in DROP_REPLACE mode against an External Dataset will NOT delete the physical data or the dataset definition. This prevents accidental data loss for externally managed files.
- Copy Mode: External datasets default to Copy Mode: None during deployment, meaning data is never physically copied between environments by the deployment process itself.
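The DROP_REPLACE safety rule can be summarized as a guard on the dataset type. This is a conceptual sketch only; the flag name and the destructive behavior assumed for internally managed datasets are illustrative.

```python
# Conceptual sketch of the DROP_REPLACE safety rule for external datasets.
# The function shape and the internal-dataset behavior are assumptions.
def drop_replace_actions(is_external):
    """Return which destructive actions a DROP_REPLACE run would take."""
    if is_external:
        # External Dataset: physical data and definition are always preserved.
        return {"delete_physical_data": False, "drop_definition": False}
    # Internally managed dataset: assumed to be replaced destructively.
    return {"delete_physical_data": True, "drop_definition": True}

print(drop_replace_actions(is_external=True))
# {'delete_physical_data': False, 'drop_definition': False}
```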
AWS Athena Integration
- Trigger: Occurs only when File Format is Delta and the environment is AWS.
- Action: The system automatically creates a table in AWS Athena.
- Naming Convention: &lt;table_name&gt;_athena.
- Timing: Happens immediately upon successful registration.
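The trigger condition and naming convention above can be expressed compactly. The function and the cloud flag are illustrative assumptions; only the Delta-on-AWS rule and the `<table_name>_athena` suffix come from the documentation.

```python
# Sketch of the Athena integration trigger and naming convention.
def athena_table_for(table_name, file_format, cloud):
    """Return the Athena table name created at registration, or None."""
    # Trigger: only Delta format in an AWS environment creates an Athena table.
    if file_format == "Delta" and cloud == "AWS":
        return f"{table_name}_athena"  # naming convention: <table_name>_athena
    return None

print(athena_table_for("external_sales_data", "Delta", "AWS"))    # external_sales_data_athena
print(athena_table_for("external_sales_data", "Parquet", "AWS"))  # None
```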
Troubleshooting
| Error / Issue | Technical Cause | Recommended Action |
|---|---|---|
| "Path validation failed" | The system cannot access the provided S3/GCS/HDFS URI. | • Verify the path exists. • Check IAM roles/permissions for the Syntasa backend service account. |
| Athena table missing | Dataset format is not Delta or environment is not AWS. | • Confirm File Format is set to Delta. • Verify the dataset was registered in an AWS environment. |
| Metrics show 0 rows | Partitioned data requires manual metric updates. | Call the dataset/update-metrics API after writing new data. |
| Schema Auto-fill fails | No files found at the location or unsupported format. | • Ensure at least one data file exists at the path. • For Text files, only partition info is returned; columns must be added manually. |