Data Authorization – SYNTASA™

Data Authorization is the platform's first layer of defense, controlling who can read and write which data. It runs inside the Spark engine: every SQL statement, DataFrame operation, and DDL or DML command is intercepted and checked against your access before any data is touched. If you're allowed to access the resource, the query runs normally. If you're not, the query is blocked immediately with a clear message identifying what was denied.

A second layer of defense, Session Policy, is described in the Session Policy section. It covers what this layer can't see: direct boto3 calls, raw S3 reads, and anything else that bypasses Spark entirely. The two layers are designed to complement each other; the How They Work Together section compares them and lists what neither covers.

Both layers use the same source of truth — your data plane assignments — and both are entirely automatic. You don't configure anything as a notebook user; the platform handles it on every kernel start and every job submission.

Data planes — the unit of access

A data plane is the platform's name for a collection of data that belongs to a specific business domain or team — for example, "Marketing Events" or "Platform Telemetry." Every table, S3 path, and streaming source on the platform belongs to a data plane. Some data planes also include external datasets — additional S3 paths registered as part of the data plane. These are treated as part of the same access scope.

Access is binary:

Assigned to a data plane — you can read and write its tables, S3 paths, and external datasets.
Not assigned — you have no access to it, full stop.

There is no read-only tier today. If you're assigned to a data plane, you have read and write access; if you're not, you have neither.

The default Hive database is always accessible

The Hive `default` database is open to every user without any data plane assignment. It is intended for scratch and exploratory work — temporary tables, intermediate results, things that don't belong to a specific domain. Don't store persistent shared datasets there; put those in a proper data plane so access can be managed.

Users with the System Admin role bypass all authorization checks and have unrestricted access. The role is reserved for platform-level operations.
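The access rules above (binary data plane assignment, the always-open `default` database, and the System Admin bypass) can be summarized in a few lines. This is a minimal sketch of the decision logic as described here, not the platform's implementation; `User`, `can_access`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class User:
    # Hypothetical model of a platform user; field names are assumptions.
    user_id: str
    data_planes: FrozenSet[str] = frozenset()
    is_system_admin: bool = False


def can_access(user: User, data_plane: Optional[str], database: Optional[str] = None) -> bool:
    """Apply the rules from the text: admin bypass, open `default` database,
    otherwise all-or-nothing access based on data plane assignment."""
    if user.is_system_admin:
        return True  # System Admin bypasses all checks
    if database == "default":
        return True  # the Hive `default` database is open to everyone
    return data_plane in user.data_planes  # read and write, or nothing


alice = User("alice", frozenset({"Marketing Events"}))
print(can_access(alice, "Marketing Events"))        # assigned
print(can_access(alice, "Platform Telemetry"))      # not assigned
print(can_access(alice, None, database="default"))  # scratch space
```

Note that the function returns a single boolean with no separate read and write flags, which mirrors the absence of a read-only tier.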

In notebook flows

When you launch a notebook kernel, your identity is bound to that kernel session — there is no separate login, no credentials to provide. From that point on, every SQL cell and every DataFrame operation is checked against your data plane assignments before execution begins.

If the query is allowed, it runs normally. The check happens in the background and adds no perceptible delay.
If the query is denied, the kernel raises a SyntasaAccessDeniedException with a message identifying the specific database, table, or S3 path that was blocked, along with your user ID. You will not see partial results. Take the error message to your admin to request access.
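If you want a notebook cell to surface a denial cleanly instead of a long stack trace, a small wrapper can help. The sketch below assumes only that the exception text (or its class name) contains the string `SyntasaAccessDeniedException`; the exact Python exception type the kernel raises is not specified here, so the helper matches on the name rather than on a class. `is_access_denied` and `run_guarded` are hypothetical helper names.

```python
def is_access_denied(exc: BaseException) -> bool:
    """Return True if the exception, or anything in its cause chain,
    mentions SyntasaAccessDeniedException (assumption: the name appears
    in the class name or message)."""
    seen = set()
    current = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if "SyntasaAccessDeniedException" in f"{type(current).__name__}: {current}":
            return True
        current = current.__cause__ or current.__context__
    return False


def run_guarded(action):
    """Run a zero-argument callable. On an access denial, print the message
    to pass on to an admin and return None; re-raise anything else."""
    try:
        return action()
    except Exception as exc:
        if is_access_denied(exc):
            print(f"Access denied. Share this message with your admin:\n{exc}")
            return None
        raise


# In a notebook, usage might look like this (table name is hypothetical):
# run_guarded(lambda: spark.sql("SELECT * FROM marketing.events LIMIT 10").show())
```

Unrelated errors are re-raised unchanged, so the wrapper does not hide real bugs in the query.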

Notebook process runs (scheduled or triggered) work the same way, whether the source is a notebook card or a plain JupyterLab notebook: the job runs under the identity of the notebook owner, and their data plane assignments are checked on every operation. Same enforcement as an interactive run.

In Spark batch jobs

When a Spark batch job is submitted, it runs using a shared service account at the infrastructure level — but authorization is enforced against your identity as the job owner, not the service account. The job submission records your user ID, and the authorization extension uses that ID to check assignments throughout the run.

The check happens at plan analysis time — before execution begins. The extension inspects the full logical plan of the job and identifies every table, database, and S3 path it intends to touch. If any of those are out of scope for the owner, the job fails immediately with a SyntasaAccessDeniedException; nothing has been read or written yet. You get actionable feedback within seconds rather than after a long-running job partially completes.
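The fail-before-execution behavior can be pictured as a check over every resource in the plan that must pass before the job body runs at all. This is an illustrative model only: `AccessDenied` stands in for `SyntasaAccessDeniedException`, and `authorize_plan`, `run_job`, and the resource-to-data-plane mapping are hypothetical. Whether the real extension reports one blocked resource or all of them is also an assumption; this sketch reports all.

```python
class AccessDenied(Exception):
    """Stand-in for SyntasaAccessDeniedException in this sketch."""


def authorize_plan(plan_resources, resource_to_plane, owner_planes, owner_id):
    """Check every table, database, or S3 path the plan touches.
    Resources with no known data plane are treated as denied."""
    denied = sorted(
        r for r in plan_resources
        if resource_to_plane.get(r) not in owner_planes
    )
    if denied:
        raise AccessDenied(f"user {owner_id} denied access to: {', '.join(denied)}")


def run_job(plan_resources, resource_to_plane, owner_planes, owner_id, execute):
    # Authorization happens first; `execute` is never called on a denial,
    # so nothing is read or written.
    authorize_plan(plan_resources, resource_to_plane, owner_planes, owner_id)
    return execute()


mapping = {"marketing.events": "Marketing Events", "telemetry.raw": "Platform Telemetry"}
try:
    run_job(["marketing.events", "telemetry.raw"], mapping,
            {"Marketing Events"}, "u1", lambda: print("job body ran"))
except AccessDenied as err:
    print(err)
```

Because the whole plan is checked up front, a job that touches one out-of-scope table late in its logic still fails at the start.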

System jobs — scheduled pipelines and automated workflows — behave the same way. They run under the identity of the user who created or last configured them. If a previously working scheduled job starts failing with an authorization error, the most likely cause is that the owner's data plane assignments were changed.

What is and isn't covered

Data Authorization intercepts everything that goes through the Spark engine:

Spark SQL: SELECT, INSERT, UPDATE, DELETE, CREATE TABLE, DROP TABLE, and CREATE TABLE AS SELECT (CTAS).
DataFrame API read and write operations (DataSource V1 and V2).
Hive table access — any DDL or DML targeting a Hive-managed table.
Delta table operations, including MERGE INTO.
Streaming source reads and sink writes.
Spark checkpoint path writes.

It does not cover anything that bypasses the Spark engine:

Raw AWS SDK / boto3 calls that read or write S3 directly without going through a Spark DataFrame or SQL query. The Spark extension never sees them. Session Policy is the layer that handles this.
JDBC data sources — connections made through JDBC are not Spark-managed paths or Hive databases and are not intercepted.
Non-Spark Python in your notebook — Python-only cells that don't invoke Spark (a requests call, local file I/O, an arbitrary HTTP fetch) are not subject to Data Authorization.
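To make the boundary concrete, here are two ways of reading the same S3 data from a notebook. The first goes through Spark and is checked by Data Authorization; the second bypasses Spark and is left to Session Policy. Bucket, key, and path names are hypothetical placeholders, and `spark` is assumed to be the session object your kernel already provides.

```python
# Hypothetical bucket, key, and path names, for illustration only.

def read_via_spark(spark, path="s3://example-bucket/marketing/events/"):
    """Goes through the Spark engine (DataFrame reader), so Data
    Authorization checks the path before any data is touched."""
    return spark.read.parquet(path)


def read_via_boto3(bucket="example-bucket", key="marketing/events/part-0000.parquet"):
    """Bypasses Spark entirely. Data Authorization never sees this call;
    Session Policy is the layer that governs it."""
    import boto3  # imported lazily so this sketch loads without boto3 installed

    response = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```

Both functions may succeed or fail for the same user, but the denial comes from a different layer in each case, which matters when you report the error to your admin.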

How your admin enables it

Data Authorization is enabled at the platform level by an admin in syntasa-config:

syntasa-config
syntasa_authz_enabled: "true"

With that flag set, the platform automatically injects the necessary Spark configuration into every kernel and job at startup — settings that identify the current user, point to the authorization service, and control permission caching. You don't pass any of this manually.

If you're seeing access denied errors when you expect to have access, ask your admin to check that syntasa_authz_enabled is in the expected state and that your data plane assignments include the resource you're trying to reach.

Things to know

Restart your kernel to pick up new assignments. If your admin grants or revokes access, the simplest way to make the change take effect is to restart your kernel — a fresh session fetches current permissions on its first query. Decisions are also cached per session for five minutes, so if you wait long enough an active session will eventually pick up the change without a restart, but the restart is the reliable path.
No read-only tier. Access is binary. There is no way today to grant query-only access to a data plane.
System Admin bypass. Users with the System Admin role have unrestricted access. There is no enforcement layer above System Admin.
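The caching behavior behind the restart advice can be sketched as a per-session cache with a five-minute time-to-live. This is a simplified model under stated assumptions, not the platform's code: `DecisionCache` and `fetch_decision` are hypothetical names, and the clock is injectable only so the expiry can be demonstrated without waiting.

```python
import time

CACHE_TTL_SECONDS = 5 * 60  # five minutes, as described above


class DecisionCache:
    """Per-session cache of allow/deny decisions (illustrative model)."""

    def __init__(self, fetch_decision, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        self._fetch = fetch_decision  # callable: resource -> bool
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # resource -> (decision, fetched_at)

    def is_allowed(self, resource):
        now = self._clock()
        cached = self._entries.get(resource)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh: a recent change is not visible yet
        decision = self._fetch(resource)
        self._entries[resource] = (decision, now)
        return decision


# Simulated timeline: access is granted right after the first (denied) query.
now = [0.0]
grants = {"marketing.events": False}
session = DecisionCache(lambda r: grants[r], clock=lambda: now[0])

print(session.is_allowed("marketing.events"))  # denied, and now cached
grants["marketing.events"] = True              # admin grants access
now[0] = 60.0
print(session.is_allowed("marketing.events"))  # cached decision still applies
restarted = DecisionCache(lambda r: grants[r], clock=lambda: now[0])
print(restarted.is_allowed("marketing.events"))  # fresh session sees the grant
```

A restarted kernel corresponds to the new `DecisionCache` instance: it starts empty, so its first query always fetches current permissions.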

FAQ

How do I request access to a data plane?

Contact your admin through your organization's normal process. Tell them which data plane you need and what you're trying to do with it. Once assigned, your access is active within about 5 minutes.

I had access yesterday and now I'm denied. What happened?

Your data plane assignment was probably changed by an admin, or the table was moved to a different data plane. Confirm with your admin.

My scheduled pipeline started failing with an authorization error. Why?

System jobs run under the identity of the job owner. If the owner's assignments changed, or the pipeline now references a resource the owner can't access, it fails at plan analysis before any data is touched. Read the error message for the blocked resource and contact your admin.

How fast does access reflect after my admin updates it?

Up to 5 minutes due to permission caching. For immediate effect, restart your kernel — new sessions always fetch current permissions on their first query.

Can I see what data planes I currently have access to?

Yes — your assignments are visible in the Syntasa portal under your user settings or in the data catalog. There is no notebook-side method that lists your assignments today; the portal is the source of truth. Ask your admin if you need help finding the view.

I'm reading from S3 directly using Python, not Spark. Is that covered?

No — Data Authorization only intercepts operations through the Spark engine. Direct S3 access via boto3 or the AWS SDK is governed by Session Policy instead.

Related sections

Session Policy: The complementary layer that catches what Data Authorization can't see, such as direct boto3 calls, raw S3, and anything that bypasses Spark.
How They Work Together: Side-by-side comparison of the two layers, when each applies, and what neither covers.
