Overview
Syntasa’s Data Plane Access Control is a security framework that applies the Principle of Least Privilege to compute resources: Spark clusters and JupyterLab notebooks get access only to the specific data they need to perform a task.
By bridging the gap between Syntasa’s identity management and cloud-native security (AWS IAM and GCP IAM), this feature ensures that even if a user has access to a compute environment, their ability to read or write data is strictly governed by their Event Store assignments.
The Architecture: Control Plane vs. Data Plane
To understand this feature, it is helpful to distinguish between the two “planes” of operation:
- Control Plane (Syntasa UI/API): Where you manage workflows, share notebooks, and configure pipelines.
- Data Plane (AWS/GCP Infrastructure): Where the actual data resides (S3, GCS, BigQuery) and where the computation (Spark, Python) happens.
Data Plane Access Control ensures that the permissions you set in the Control Plane are enforced in the Data Plane by the cloud provider’s own IAM layer, not only by the Syntasa application.
Implementation in AWS
In AWS environments, Syntasa leverages IAM Session Policies to provide dynamic, fine-grained security.
How it Works
- Identity Resolution: When a user starts a Spark session or Notebook, Syntasa identifies the specific Event Stores shared with that user.
- Policy Generation: The Syntasa Auth Service generates a temporary IAM Policy JSON. This policy explicitly grants access only to the S3 buckets/prefixes and Glue databases associated with those Event Stores.
- STS AssumeRole: The system uses the AWS Security Token Service (STS) to “assume” the cluster’s IAM role, but it attaches the generated policy as a Session Policy.
- Scoped Credentials: AWS returns temporary credentials that represent the intersection of the cluster’s broad permissions and the user’s specific session policy.
- Injection: These credentials are automatically injected into the Spark configuration (fs.s3a.access.key and the related S3A credential properties), so the Spark engine cannot read or write data outside the session policy.
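The AWS steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Syntasa's actual Auth Service: the function names, the selection of S3 and Glue actions, and the idea that identity resolution yields (bucket, prefix) pairs are all hypothetical. The boto3 assume_role call and the Hadoop S3A property names are existing APIs.

```python
import json


def build_session_policy(s3_prefixes, glue_databases, region, account_id):
    """Build a session policy for the given S3 prefixes and Glue databases.

    s3_prefixes is a list of (bucket, prefix) tuples; this shape is an
    assumption about what identity resolution would produce.
    """
    statements = []
    for bucket, prefix in s3_prefixes:
        prefix = prefix.strip("/")
        # Object-level access restricted to the Event Store prefix.
        statements.append({
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        })
        # Listing is a bucket-level action, so it is limited by condition.
        statements.append({
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
            "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
        })
    if glue_databases:
        glue_arns = [f"arn:aws:glue:{region}:{account_id}:catalog"]
        for db in glue_databases:
            glue_arns.append(f"arn:aws:glue:{region}:{account_id}:database/{db}")
            glue_arns.append(f"arn:aws:glue:{region}:{account_id}:table/{db}/*")
        statements.append({
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetTable",
                       "glue:GetTables", "glue:GetPartitions"],
            "Resource": glue_arns,
        })
    return {"Version": "2012-10-17", "Statement": statements}


def assume_scoped_role(role_arn, session_name, policy, duration_seconds=3600):
    """Assume the cluster role with the session policy attached.

    The effective permissions are the intersection of the role's own
    policies and the session policy passed here.
    """
    import boto3  # imported lazily so the policy builder runs without boto3

    sts = boto3.client("sts")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        Policy=json.dumps(policy),
        DurationSeconds=duration_seconds,
    )
    return response["Credentials"]


def spark_s3a_conf(credentials):
    """Map temporary STS credentials onto Spark/Hadoop S3A properties."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": credentials["AccessKeyId"],
        "spark.hadoop.fs.s3a.secret.key": credentials["SecretAccessKey"],
        "spark.hadoop.fs.s3a.session.token": credentials["SessionToken"],
    }
```

A caller would chain the three functions: build the policy from the user's Event Stores, pass it to assume_scoped_role, and apply the resulting dictionary to the Spark session builder. One practical constraint worth keeping in mind is that AWS limits the plaintext of a session policy to 2,048 characters, so a user with many Event Stores may need prefixes to be consolidated.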
Implementation in GCP
In GCP environments, the framework utilizes Service Account Scoping and IAM Conditions to achieve similar isolation.
How it Works
- Service Account Impersonation: Instead of sharing a single “God-mode” service account across all users, Syntasa generates short-lived OAuth2 access tokens through service account impersonation.
- Resource Scoping: These tokens are scoped to specific BigQuery datasets and Google Cloud Storage (GCS) buckets.
- VPC Service Controls: For high-security environments, this integrates with GCP VPC Service Controls to ensure data cannot be exfiltrated outside of the authorized perimeter.
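The source does not name the exact GCP mechanism used for token scoping, so the following Python sketch is only one plausible way to express it. It assumes Credential Access Boundaries, a GCP feature for downscoping tokens, and builds the boundary document for a single bucket prefix. The function names, bucket, and prefix are hypothetical. Credential Access Boundaries apply to Cloud Storage only, so BigQuery dataset scoping would need a different mechanism such as dataset-level IAM bindings.

```python
def build_access_boundary(bucket, prefix, role="roles/storage.objectViewer"):
    """Build a Credential Access Boundary limited to one bucket prefix.

    The returned dict follows the JSON shape GCP expects when exchanging
    a token for a downscoped one.
    """
    prefix = prefix.strip("/")
    return {
        "accessBoundary": {
            "accessBoundaryRules": [
                {
                    "availableResource":
                        f"//storage.googleapis.com/projects/_/buckets/{bucket}",
                    "availablePermissions": [f"inRole:{role}"],
                    "availabilityCondition": {
                        "expression": (
                            "resource.name.startsWith("
                            f"'projects/_/buckets/{bucket}/objects/{prefix}/')"
                        )
                    },
                }
            ]
        }
    }


def impersonated_credentials_for(target_service_account, lifetime_seconds=3600):
    """Obtain short-lived credentials by impersonating a service account.

    Requires the google-auth package and application default credentials;
    imported lazily so the boundary builder runs without them.
    """
    import google.auth
    from google.auth import impersonated_credentials

    source, _project = google.auth.default()
    return impersonated_credentials.Credentials(
        source_credentials=source,
        target_principal=target_service_account,
        target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
        lifetime=lifetime_seconds,
    )
```

The boundary restricts what the token can do even if the impersonated service account holds broader roles, which mirrors the intersection behavior of AWS session policies.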
Key Capabilities
Secure Notebook Environments
Every JupyterLab notebook kernel is initialized with a unique set of scoped credentials. If two users are working on the same cluster, their individual kernels will have different data access rights based on their own Syntasa permissions.
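To make the two-users-on-one-cluster case concrete, here is a small Python model of how per-user access could be derived from Event Store sharing. The sharing table, user names, and bucket layout are invented for illustration, and the real resolution happens inside Syntasa rather than in a dictionary like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventStore:
    name: str
    bucket: str
    prefix: str


# Hypothetical sharing table: which Event Stores are shared with which user.
SHARED = {
    "alice": [
        EventStore("web", "acme-data", "events/web"),
        EventStore("crm", "acme-data", "events/crm"),
    ],
    "bob": [EventStore("web", "acme-data", "events/web")],
}


def allowed_prefixes(user):
    """Return the storage prefixes a user's kernel should be scoped to."""
    # The trailing slash stops "events/web" from also matching "events/web2".
    return sorted(f"s3://{s.bucket}/{s.prefix}/" for s in SHARED.get(user, []))


def can_access(user, path):
    """Check whether a path falls under one of the user's prefixes."""
    return any(path.startswith(p) for p in allowed_prefixes(user))
```

In this model, alice can read s3://acme-data/events/crm/ while bob cannot, even though both kernels run on the same cluster, and a user with no shared Event Stores gets no prefixes at all.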
Identity-Aware Scheduled Jobs
Production pipelines no longer run as a generic “System” user.
- Owner-Based Security: Scheduled jobs resolve the identity of the Job Owner.
- Consistent Enforcement: The same session policies applied during development are applied in production, ensuring that a job cannot access data that the owner is not authorized to see.
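The owner-based rule can be illustrated with a short Python sketch that picks the identity whose permissions scope a run. The RunContext type and its field names are hypothetical. Refusing to run when no identity is available, rather than falling back to a system user, is an assumption that follows from the statement that jobs no longer run as a generic user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunContext:
    trigger: str                 # "interactive" or "scheduled"
    requested_by: Optional[str]  # logged-in user for interactive runs
    job_owner: Optional[str]     # owner recorded on the scheduled job


def effective_principal(ctx):
    """Return the user whose Event Store assignments scope this run."""
    if ctx.trigger == "interactive":
        principal = ctx.requested_by
    elif ctx.trigger == "scheduled":
        principal = ctx.job_owner
    else:
        raise ValueError(f"unknown trigger: {ctx.trigger!r}")
    if not principal:
        # No fallback to a generic system identity.
        raise PermissionError("no identity available to scope credentials to")
    return principal
```

The returned principal would then feed the same policy generation used for interactive sessions, which is what keeps development and production enforcement consistent.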
Spark SQL Extension
Syntasa provides a custom Spark SQL Extension that acts as a final gatekeeper. It intercepts SQL queries (like SELECT * FROM table) and validates them against the user’s authorized metadata before the query even reaches the data source.
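The gatekeeper idea can be shown with a deliberately simplified Python sketch. A real Spark extension is registered through the spark.sql.extensions setting and inspects Spark's parsed query plan. The regex below is only a stand-in for that step and will miss cases such as CTEs, quoted identifiers, and comma joins. All function names are hypothetical.

```python
import re

# Simplified stand-in for plan analysis: table names after FROM or JOIN.
_TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def referenced_tables(sql):
    """Return the lower-cased table names referenced in a SQL string."""
    return {name.lower() for name in _TABLE_REF.findall(sql)}


def authorize_query(sql, authorized_tables):
    """Raise PermissionError if the query touches an unauthorized table."""
    allowed = {t.lower() for t in authorized_tables}
    denied = sorted(referenced_tables(sql) - allowed)
    if denied:
        raise PermissionError("not authorized for: " + ", ".join(denied))
```

Calling authorize_query before execution rejects a query early, but a check like this complements the scoped credentials rather than replacing them, since the credentials still block direct file access that bypasses SQL.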
Security & Compliance Benefits
- Elimination of Over-Privileged Roles: You no longer need to rely on broad "s3:*" permissions as the effective access level of your Spark clusters. The cluster starts with a base role, and the session policy narrows it down for each session.
- Multi-Tenant Isolation: Multiple teams can share the same Kubernetes cluster or EMR environment without the risk of one team accessing another team’s sensitive data.
- Enhanced Auditability: Because every session is unique, cloud audit logs (AWS CloudTrail or GCP Cloud Audit Logs) show exactly which Syntasa user accessed which file, providing a clear chain of custody.
Configuration
Data Plane Access Control is managed via feature flags in the deployment configuration:
- RUNTIME_SESSION_POLICY_ENABLED: Set to true to enable scoping for Spark clusters.
- NOTEBOOK_SESSION_POLICY_ENABLED: Set to true to enable scoping for JupyterLab notebooks.
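As an illustration of how a service might read these two flags, the Python sketch below treats them as environment variables. That is an assumption: the source only says they live in the deployment configuration, and the default of false for an unset flag is also assumed. The helper names are hypothetical.

```python
import os


def flag_enabled(name, env=None):
    """Return True only when the flag is set to the string 'true'."""
    env = os.environ if env is None else env
    return env.get(name, "false").strip().lower() == "true"


def scoping_flags(env=None):
    """Report which compute types have session-policy scoping enabled."""
    return {
        "spark": flag_enabled("RUNTIME_SESSION_POLICY_ENABLED", env),
        "notebook": flag_enabled("NOTEBOOK_SESSION_POLICY_ENABLED", env),
    }
```

Because the flags are independent, scoping can be turned on for notebooks first and for Spark clusters later, or the other way around.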
Note: Enabling this feature requires that the base IAM roles (AWS) or service accounts (GCP) have trust relationships that allow the AssumeRole or impersonation operations.
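For reference, the kind of trust relationship the note refers to can be sketched as follows. The principal ARN is a placeholder and the helper name is hypothetical. The exact principal that Syntasa uses to call AssumeRole depends on the deployment and is not specified in this document.

```python
def build_trust_policy(trusted_principal_arn):
    """Build an AWS role trust policy that lets one principal assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# On GCP, the caller typically needs this role on the target service
# account in order to mint access tokens through impersonation.
GCP_IMPERSONATION_ROLE = "roles/iam.serviceAccountTokenCreator"
```

The AWS document is attached to the cluster's base role as its trust policy, while the GCP role is granted on the service account being impersonated.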