Scope
In: EMR + Kubernetes Spark steps, Athena, EnrichProcessor verification, notebook dependency-download flow, notebook Spark conf (for Spark history writes to system bucket).
Out: GCP / Azure / OnPrem (SA is AWS-only today), authz JAR distribution model, global init script sandboxing, backward compatibility (SA is unreleased — direct cutover).
Why
Service Accounts (SA) attach user AWS credentials to runtimes/workspaces. Today those credentials are injected globally into Spark conf (spark.hadoop.fs.s3a.access.key), so the worker uses the user SA for every S3 access — including Syntasa system folders (syntasa-config/, syntasa-logs/, syn-spark-history/, syn-workspace/, syn-file-uploads/, syn-volumes/).
Result: admins must grant every SA access to all Syntasa system folders. That list grows with the platform, leaks internal implementation details into the SA policy, and makes the SA model brittle: if an admin narrows an SA to "only their data," the platform breaks.
SA is unreleased — fix the model now before customer policies congeal around the wrong shape.
The Idea
Invariant: The user's SA credentials may only exist inside syn-data/notebook kernel pods as inert data destined for one sink — the Spark job submission conf for customer-data access. They never become an active credential for any in-pod operation, and the cluster never uses them for Syntasa-owned storage.
Enforced at three independent layers (defense in depth):
| Layer | Where | What |
|---|---|---|
| 1. Type-level | | New |
| 2. Call-site | | Code touching system folders derives system creds from the infra row, ignores caller-supplied security. |
| 3. Cluster-level | Spark conf (syn-data batch + notebook Spark sessions) | Per-bucket override pins the Syntasa bucket to the cluster's own identity. SA conf still injected, but never reaches the system bucket. |
After this lands, an SA's IAM policy needs only customer-bucket grants. The Syntasa bucket is owned by the cluster's IAM role (EMR EC2 instance profile or EKS node role), full stop.
Two S3 schemes are in play and both must be covered. Credential conf is injected under both fs.s3a.* (S3A, used for s3a:// URIs) and fs.s3.* (EMRFS / legacy S3, used for s3:// URIs which EMR resolves via EmrFileSystem). The per-bucket override must be set for both schemes:
Scheme | Per-bucket key | Provider class (EMR and K8s) |
|---|---|---|
S3A ( |
|
|
EMRFS ( |
|
|
One provider class works for both EMR and K8s because both inherit IAM via the EC2 instance metadata service:
- EMR (EC2 workers): workers query IMDS directly and receive the EMR EC2 instance profile.
- EKS / K8s (current Syntasa deployment): pods do not use IRSA; syntasa-sa has no eks.amazonaws.com/role-arn annotation (verified empirically on kqa, 2026-05-14). Pods inherit the EKS node IAM role via IMDS (http://169.254.169.254/...). InstanceProfileCredentialsProvider queries IMDS and resolves to the node role, which is exactly the system identity we want for system-bucket access.
Why this works (and why env vars don't shadow it): InstanceProfileCredentialsProvider queries IMDS directly and ignores the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars we inject for user-SA propagation. The per-bucket override therefore correctly resolves to cluster identity even though user-SA env vars are present on the same pod for other use.
Do NOT use DefaultAWSCredentialsProviderChain — that chain walks EnvironmentVariableCredentialsProvider first, and we inject AWS_ACCESS_KEY_ID env vars for the user SA. The chain would resolve to the user SA before reaching IMDS and silently defeat Layer 3.
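The ordering problem can be illustrated with a toy model. This is only a sketch of first-match chain resolution, not the AWS SDK: the provider functions, the identity labels, and the example key values are all made up for illustration.

```python
from typing import Callable, Mapping, Optional, Sequence

# A toy provider returns an identity label, or None when it has nothing to offer.
Provider = Callable[[Mapping[str, str]], Optional[str]]


def env_provider(env: Mapping[str, str]) -> Optional[str]:
    """Stand-in for an environment-variable provider."""
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "user-sa"
    return None


def imds_provider(env: Mapping[str, str]) -> Optional[str]:
    """Stand-in for an instance-profile provider: ignores env entirely."""
    return "node-role"


def resolve(chain: Sequence[Provider], env: Mapping[str, str]) -> str:
    """Return the identity of the first provider in the chain that answers."""
    for provider in chain:
        identity = provider(env)
        if identity is not None:
            return identity
    raise RuntimeError("no credentials available")


# Pod env as it looks after user-SA propagation (values are placeholders).
pod_env = {"AWS_ACCESS_KEY_ID": "AKIA-EXAMPLE", "AWS_SECRET_ACCESS_KEY": "example"}

default_like_chain = [env_provider, imds_provider]  # env vars checked first
pinned_chain = [imds_provider]                      # per-bucket override

print(resolve(default_like_chain, pod_env))  # user-sa
print(resolve(pinned_chain, pod_env))        # node-role
```

In this model the env-first chain resolves to the user SA as soon as the env vars exist, while the pinned single-provider chain is unaffected by them.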
Future-state note: if Syntasa migrates K8s to IRSA in a future release, InstanceProfileCredentialsProvider would stop working on K8s pods. At that time, switch to WebIdentityTokenCredentialsProvider for K8s. Not in scope for this work — kqa/kdev today use node-IAM, not IRSA.
The Spec
1. New model field — EmrStepJobRequest.sparkJobCredentials
Module: syntasa-core (com.syntasa.lib.infra.jobs)
Add to the EmrStepJobRequest subclass only (AWS-only scope visible in types):
private SparkJobAwsCredentials sparkJobCredentials; // null when no SA configured
New POJO SparkJobAwsCredentials { accessKey, secretKey, sessionToken, Source source } with Source ∈ { SA_IAM_USER, SA_IAM_ROLE_ASSUMED, SESSION_POLICY_ASSUMED } (telemetry).
In-memory only — no Kafka schema, no DB migration.
2. Credential services stop mutating Security
Module: syntasa-runtime-service
- AwsServiceAccountService.applyAwsServiceAccountCredentials: resolve SA (existing priority: iamUser keys → iamRole.roleArn AssumeRole). Populate sparkJobCredentials. Remove mutations at lines 45-46, 58-60.
- SessionPolicyService.applySessionCredentials: same shape. Remove mutations at lines 100-102.
StepJobService.executeJob flow is unchanged structurally.
Unit test (locks the invariant): assert request.getSecurity() is reference-equal and field-equal before vs. after SA application.
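The shape of that invariant test might look like the following. It is a language-neutral sketch written in Python, not the real Java test: the dataclasses are simplified stand-ins for Security, SparkJobAwsCredentials, and EmrStepJobRequest, and apply_service_account is a hypothetical placeholder for the two credential services.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Security:
    """Stand-in for the request's system Security object."""
    access_key: str
    secret_key: str
    session_token: Optional[str] = None


@dataclass
class SparkJobAwsCredentials:
    access_key: str
    secret_key: str
    session_token: Optional[str]
    source: str  # e.g. "SA_IAM_USER"


@dataclass
class StepRequest:
    """Stand-in for EmrStepJobRequest."""
    security: Security
    spark_job_credentials: Optional[SparkJobAwsCredentials] = None


def apply_service_account(request: StepRequest) -> None:
    """Hypothetical SA application: fills the new field, leaves Security alone."""
    request.spark_job_credentials = SparkJobAwsCredentials(
        "sa-key", "sa-secret", None, "SA_IAM_USER"
    )


def test_security_untouched_by_sa_application() -> None:
    request = StepRequest(Security("infra-key", "infra-secret"))
    before_ref = request.security
    before_copy = copy.deepcopy(request.security)

    apply_service_account(request)

    assert request.security is before_ref    # reference-equal
    assert request.security == before_copy   # field-equal
    assert request.spark_job_credentials is not None


test_security_untouched_by_sa_application()
```

Checking both identity and a deep-copied snapshot catches two different regressions: swapping the Security object for a new one, and mutating the existing one in place.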
3. Step utils: read new field + per-bucket override (both schemes) — syn-data batch path
Module: syntasa-runtime-service
EmrStepJobUtils (syntasa-runtime-service/.../utils/spark/amazon/EmrStepJobUtils.java, lines 199-222):
- Replace getSecurity() reads with getSparkJobCredentials(). Skip injection when null.
- Unconditionally add per-bucket overrides for both schemes:
// S3A: covers any s3a:// paths
addSparkConf(args, "spark.hadoop.fs.s3a.bucket." + systemBucket + ".aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider");
// EMRFS: covers s3:// paths (the scheme EmrCopyUtils writes; the scheme EMR uses for system-folder reads at runtime)
addSparkConf(args, "spark.hadoop.fs.s3.bucket." + systemBucket + ".customAWSCredentialsProvider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider");
where systemBucket = amazonStorage.getBucket().
Line 110 (getSecurity() for abort step) stays unchanged — now correctly uses system identity for the EMR control-plane call. (Latent bug fix.)
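The emission rule for EmrStepJobUtils (SA conf only when the new field is non-null, per-bucket overrides always) can be sketched as a small pure function. This is a Python model of the logic, not the Java change itself. The function name and the tuple argument are hypothetical, and only the S3A global keys are shown: fs.s3a.secret.key is the standard Hadoop companion to the access.key named in this document, and the EMRFS global keys are omitted because the spec does not name them.

```python
from typing import Dict, Optional, Tuple

INSTANCE_PROFILE = "com.amazonaws.auth.InstanceProfileCredentialsProvider"


def build_s3_conf(
    system_bucket: str, sa_keys: Optional[Tuple[str, str]]
) -> Dict[str, str]:
    """Return the S3-related Spark conf for one job submission."""
    conf: Dict[str, str] = {}

    # User-SA credentials: injected globally, and only when an SA is configured.
    if sa_keys is not None:
        access_key, secret_key = sa_keys
        conf["spark.hadoop.fs.s3a.access.key"] = access_key
        conf["spark.hadoop.fs.s3a.secret.key"] = secret_key

    # Per-bucket overrides: always emitted, keys copied from the spec's snippet.
    conf[
        f"spark.hadoop.fs.s3a.bucket.{system_bucket}.aws.credentials.provider"
    ] = INSTANCE_PROFILE
    conf[
        f"spark.hadoop.fs.s3.bucket.{system_bucket}.customAWSCredentialsProvider"
    ] = INSTANCE_PROFILE
    return conf


no_sa = build_s3_conf("example-system-bucket", None)
with_sa = build_s3_conf("example-system-bucket", ("sa-key", "sa-secret"))
print(len(no_sa), len(with_sa))  # 2 4
```

Keeping the override emission outside the null check is the point of the sketch: a job with no SA still pins the system bucket to the cluster identity.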
KubernetesSparkStepJobUtils has two injection blocks — fix both:
- Block 1 (lines 549-562): SparkConf. Read getSparkJobCredentials(). Emit per-bucket override for both S3A and EMRFS schemes, both pointing to com.amazonaws.auth.InstanceProfileCredentialsProvider (same as EMR; K8s pods inherit the EKS node IAM role via IMDS in the current Syntasa deployment; verified on kqa).
- Block 2 (lines 869-877): env vars. Read getSparkJobCredentials(). No per-bucket override needed here.
4. EmrCopyUtils derives system creds internally — signature unchanged
Module: syntasa-runtime-service
Keep the existing signature — copyDependencies(Security security, Storage storage, Config config, SparkCluster sparkCluster). Don't drop the Security parameter, because ClusterService.java:317-321 branches between cloud and OnPrem and passes different things:
if (providerType == ONPREM) {
    copyUtils.copyDependencies(runTime.getRuntimeSecurity(), ...);  // OnPrem HDFS/Kerberos creds
} else {
    copyUtils.copyDependencies(request.getSecurity(), ...);  // request Security (cloud)
}
OnPrem still needs its RuntimeSecurity path; cloud implementers should ignore the parameter.
Inside EmrCopyUtils.copyDependencies (and the Dataproc/Azure implementers when their specs follow): ignore the passed security parameter and build a fresh system AmazonSecurity from the infra row using the existing fallback chain (useInstanceProfile=true → IRSA / instance profile; else infra static keys — same as PR #2426). Document with a comment that the parameter is intentionally unused for cloud system-folder I/O.
Precondition (implementation note): at method entry, throw IllegalStateException if the passed AmazonSecurity has a non-empty sessionToken. A session token is a clear signal the caller passed STS- or SA-derived credentials (which are never valid for system-folder I/O). Today's call sites in ClusterService pass freshly-deserialized infra security with no session token, so the precondition does not fire on today's paths.
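A minimal model of that precondition, written in Python for brevity. The real check would live in Java and throw IllegalStateException; here RuntimeError plays that role, and the AmazonSecurity dataclass and the function name are simplified, hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AmazonSecurity:
    """Simplified stand-in for the Java AmazonSecurity class."""
    access_key: Optional[str] = None
    secret_key: Optional[str] = None
    session_token: Optional[str] = None


def require_no_session_token(security: Optional[AmazonSecurity]) -> None:
    """Fail fast if the caller passed STS- or SA-derived credentials."""
    if security is not None and security.session_token:
        raise RuntimeError(
            "copyDependencies received credentials with a session token; "
            "system-folder I/O must use infra credentials"
        )


# Freshly deserialized infra security: no session token, check passes.
require_no_session_token(AmazonSecurity("infra-key", "infra-secret"))

# STS-style credentials: check fires.
try:
    require_no_session_token(AmazonSecurity("k", "s", session_token="token"))
except RuntimeError as exc:
    print("rejected:", exc)
```

The check treats None and the empty string the same way, which matches the "non-empty sessionToken" wording above.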
OnPremCopyUtils is unchanged — it continues to use the passed RuntimeSecurity to access on-prem HDFS.
5. Athena writes to system bucket use system creds
Module: syntasa-athena-executor
In AthenaConnectionUtils.init (lines 63-88): after determining S3OutputLocation, check if its bucket equals amazonStorage.getBucket() (the system bucket). If yes, force the driver's credentials to com.amazonaws.auth.InstanceProfileCredentialsProvider (same provider class as §3 — pods/workers inherit the cluster identity via IMDS in both EMR and K8s). If no (customer bucket), leave as today.
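The bucket comparison itself is simple. The sketch below shows one way to do it in Python, assuming the output location is an s3:// (or s3a://) URI. The function names are hypothetical, and the real change would be made in the Java AthenaConnectionUtils.

```python
from urllib.parse import urlparse


def bucket_of(s3_uri: str) -> str:
    """Extract the bucket name from an s3:// or s3a:// URI."""
    parsed = urlparse(s3_uri)
    if parsed.scheme not in ("s3", "s3a") or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {s3_uri!r}")
    return parsed.netloc


def writes_to_system_bucket(output_location: str, system_bucket: str) -> bool:
    """True when Athena output lands in the Syntasa system bucket."""
    return bucket_of(output_location) == system_bucket


print(writes_to_system_bucket("s3://sys-bucket/athena/results/", "sys-bucket"))  # True
print(writes_to_system_bucket("s3://customer-data/out/", "sys-bucket"))          # False
```

Comparing parsed bucket names rather than string prefixes avoids a false match when a customer bucket name merely starts with the system bucket's name.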
6. EnrichProcessor — verify Action serialization carries system creds
Module: syntasa-app-library
EnrichProcessor.scala:290-313 reads SystemUtils.infra.amazonCloudProvider.getSecurity(). SystemUtils.infra is freshly deserialized from the Action (SystemUtils.scala:167) — not the same object as StepJobRequest.getSecurity().
Once §1 + §2 hold (request.getSecurity() = system creds), the Action carries system creds → EnrichProcessor automatically inherits the right identity.
Implementation task: trace once where the Action is serialized and confirm it reads from request.getSecurity() (now system), not from the new sparkJobCredentials. Document the trace in the PR.
7. Notebook flow: dep downloads + Spark conf builder
Module: syntasa-notebooks
The notebook flow has two distinct system-folder access paths that need fixing — and a small set of paths that are already correct. Verified end-to-end by reading bootstrap-kernel.sh, dependency_utils.py, jupyterhub_config.py, the AWS kernel pod template, aws_credential_config.py, and the three resource managers.
Broken paths (fix needed)
(7a) Three places download from syntasa-config/deps/ via aws s3 cp with user-SA env vars active:
- Driver kernel pod, runtime-init script: sourced by Python/Scala observer at Spark-session creation time. Bootstrap line 222 (fetch_aws_credentials) has already exported AWS_ACCESS_KEY_ID=<user SA>.
- Executor pod, global-init script: sourced at executor bootstrap (line 381). Pod inherits spark.executorEnv.AWS_* = user SA from the driver's Spark conf.
- Executor pod, runtime-init script: sourced at executor bootstrap (line 382). Same env.
(7b) Notebook Spark sessions write event logs to syn-spark-history/ using user SA:
aws_credential_config.py:configure_aws_credentials (lines 6-51) is the notebook-side equivalent of EmrStepJobUtils/KubernetesSparkStepJobUtils. It's called by kubernetes_resource_manager.py:172, yarn_resource_manager.py:74, and ray_resource_manager.py:52 — and it injects user-SA credentials into the Spark session globally. The same managers later set spark.eventLog.dir = s3://<bucket>/syn-spark-history/ (e.g. kubernetes_resource_manager.py:277). Result: Spark event log writes go to the system bucket using user SA. §3's per-bucket override does not apply here because §3 only touches the syn-data batch step utils — the notebook Spark session is built by separate code in the notebook repo.
Already-correct paths (no change needed)
- Driver kernel-pod global-init script (bootstrap line 215) runs before fetch_aws_credentials: no AWS_* env vars yet → falls through default chain → IMDS → node role.
- Authz JAR download (line 220) uses S3FileSystem boto3 with Configuration.SYN_S3_* (sourced from KERNEL_S3_* env vars). Per jupyterhub_config.py:213-219, KERNEL_S3_* is propagated from the notebook-service pod's own AWS env; that pod runs as syntasa-sa with system-tier identity (or empty, falling through to IMDS). Never user-SA.
- syn-workspace s3fs mount (line 245) uses -o iam_role=auto: s3fs queries IMDS directly and ignores AWS_* env vars. Even though fetch_aws_credentials runs before the mount and sets user-SA env, s3fs uses IMDS → node role.
- filesystem.sh workspace sync: same source as authz JAR (system tier).
- Standalone spark-history-server pod (history-server.sh): uses S3A with default credential chain (no AWS_* in pod) → IMDS → node role. Reads syn-spark-history/ under system identity.
- HTTP-only Flask CLI commands (download-global-init-script, notebook runtimes script): no S3.
Fix A — dep downloads (_cloud_storage_download_cmd)
Modify _cloud_storage_download_cmd in dependency_utils.py:361-383 (only ever called for system-bucket paths {configFolder}/deps/...). Wrap the AWS branch in an env-stripping subshell:
return (
    f'( unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; '
    f'aws s3 cp "s3://{bucket}/{src_path}" "{dest_path}"{endpoint_opt} )'
)
The unset applies only to the subshell; the parent shell's user-SA env is preserved for subsequent customer-bucket downloads. Inside, aws s3 cp falls through the default credential chain → IMDS → node role. One change covers all three broken cases in (7a).
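The subshell scoping that Fix A relies on can be checked without touching S3. The sketch below runs the same ( unset ...; command ) pattern through sh, with echo standing in for aws s3 cp. The key value is a placeholder, and sh is assumed to be on PATH.

```python
import os
import subprocess

# Same shape as the Fix A command, with echo in place of `aws s3 cp`.
script = (
    '( unset AWS_ACCESS_KEY_ID; echo "inside=${AWS_ACCESS_KEY_ID:-unset}" ); '
    'echo "outside=${AWS_ACCESS_KEY_ID:-unset}"'
)

# Simulate a pod where user-SA credentials were already exported.
env = dict(os.environ, AWS_ACCESS_KEY_ID="user-sa-key")

result = subprocess.run(
    ["sh", "-c", script],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout, end="")
# inside=unset
# outside=user-sa-key
```

The variable is gone only between the parentheses, so later customer-bucket downloads in the same script still see the user-SA env.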
Fix B — notebook Spark conf (configure_aws_credentials)
Modify aws_credential_config.py to accept a new system_bucket parameter; when supplied, emit the same per-bucket overrides as §3 (S3A + EMRFS, both InstanceProfileCredentialsProvider):
if system_bucket:
    spark_conf = spark_conf.set(
        f"spark.hadoop.fs.s3a.bucket.{system_bucket}.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    spark_conf = spark_conf.set(
        f"spark.hadoop.fs.s3.bucket.{system_bucket}.customAWSCredentialsProvider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider")
Update all three callers (kubernetes_resource_manager.py, yarn_resource_manager.py, ray_resource_manager.py) to pass system_bucket=os.environ.get("KERNEL_SYN_BUCKET"). After this, notebook Spark sessions write spark.eventLog.dir = s3://<bucket>/syn-spark-history/ using the cluster identity, not user SA.
Why not a Flask CLI proxy (originally considered)
Executor pods don't have KERNEL_REFRESH_TOKEN in their env, so they can't auth to such an endpoint. The unset-subshell + Spark-conf approach is simpler, covers driver and executor uniformly, and adds no new server-side endpoint.
Known limitation (not in scope)
_cloud_url_download_cmd (lines 386-406) handles spark.jars URLs the user/admin configures directly. If those URLs point at the system bucket (e.g., s3://<system-bucket>/syntasa-config/...), the bootstrap aws s3 cp runs with whatever AWS env is active — user SA on driver post-fetch and on executor. The legitimate path for Syntasa-supplied jars is syntasa.jar.dependencies.names. Pointing spark.jars directly at the system bucket is going around the supported config; file a follow-up issue if it becomes a real concern.
Implementation Order
Each step is independently shippable and bisectable.
- §1 + invariant test: add field; no behavior change yet.
- §2: credential services populate the field; remove mutations. Invariant test now passes.
- §3: syn-data step utils read new field, emit per-bucket override (S3A + EMRFS). Batch job flow is fully fixed at this point.
- §4: EmrCopyUtils derives system creds internally (signature kept; OnPrem unaffected).
- §5: Athena.
- §6: EnrichProcessor trace + PR docs.
- §7 Fix A: _cloud_storage_download_cmd unset-subshell wrap (notebook dep downloads).
- §7 Fix B: configure_aws_credentials per-bucket override (notebook Spark eventLog writes).
- Ship reference IAM policy doc (minimal post-fix SA policy).
Pre-flight Check
Run on the target cluster before merging — verifies the IMDS-based pod IAM model the design assumes:
kubectl --context=<cluster> exec -n syntasa <any-pod> -- sh -c 'wget -q -T 3 -O- http://169.254.169.254/latest/meta-data/iam/security-credentials/'
Expected output: the EKS node role name (e.g. EKS_EC2_DefaultRole). If this returns empty or times out, the cluster's pod IAM model differs from kqa/kdev — STOP and re-evaluate the provider class before deploying.
Testing
- Unit: §1 invariant: getSecurity() unchanged through SA application
- Unit: sparkJobCredentials populated with correct Source for all priority paths
- Unit: EmrStepJobUtils emits per-bucket override unconditionally for both S3A and EMRFS; SA conf only when field non-null
- Unit: KubernetesSparkStepJobUtils: both blocks, both schemes
- Unit: EmrCopyUtils rejects Security with session token (precondition); resolves system creds correctly otherwise
- Unit: AthenaConnectionUtils switches provider when output is on system bucket
- Unit: _cloud_storage_download_cmd for AWS provider returns a subshell-wrapped command with unset AWS_* and aws s3 cp inside; GCP/Azure branches unchanged
- Unit: configure_aws_credentials emits per-bucket override (S3A + EMRFS) for system bucket when system_bucket is supplied; preserves user-SA global conf for customer-bucket access
- Integration: EMR step + SA → assert Spark args contain SA access.key (S3A and EMRFS) AND both per-bucket overrides (S3A and EMRFS)
- Integration: K8s Spark step + SA → assert same on SparkConf builder, with InstanceProfileCredentialsProvider as the provider class
- E2E (kqa): SA with minimal IAM policy (customer bucket only, zero syntasa-bucket grants): submit Spark job, run notebook with runtime+deps + Spark session (Spark history writes succeed), run Enrich step with Athena view. All succeed.
- Regression: existing tests pass with updated mock signatures.
Open Questions for Implementation
System-bucket naming convention. The system bucket is configured per-infrastructure (not globally), so a single startup check doesn't fit. Trust the infra row at emission time.
Action serialization trace (§6) — confirm at impl time; document in PR.
Rollback
Each step is independently revertible. No on-disk format changes. Revert §3 → restores prior Spark-conf behavior; revert §7 → restores direct aws s3 cp and previous notebook Spark conf.
Behavior Changes (Beneficial, Not Regressions)
After §1 and §2 land, request.getSecurity() returns infra credentials. Four control-plane sites read this value for EMR / EKS API calls:
File | Line | Operation |
|---|---|---|
| 110 | EMR abort step |
| 329 | EMR cancel-steps API |
| 163 | EKS |
| 382 | EKS |
Today these read SA-mutated security (user SA). After this fix they read infra security. This is structurally correct — control-plane operations should run as system identity. System identity always has those permissions; a narrowly-scoped SA may not. The plan does not modify these four lines.
Audit Appendix — Verified Clean, No Changes
Module | Why clean |
|---|---|
| No credential code; Layer-3 covers all Spark-conf-driven access. |
| Fresh-deserialized |
| Reads |
| Per-request connection creds; never on SA path; no system-folder writes. |
|
|
| Cluster identity. |
| No distinct credential flow. |
| S3A with default credential chain — no AWS_* in pod env → IMDS → node role. Reads |
|
|
| boto3 client built from |
| Uses |
Audit Appendix — Other AmazonSecurity Mutation Sites (Out of Scope)
Eight other sites mutate AmazonSecurity, but with infra-row creds, not user SA — not violating the invariant. Listed so future reviewers don't re-litigate: DataGatewayInfrastructureDeserializerUtils, InfrastructureSecurityUtils, ActionStoreService, ProcessStatusToMetricsConverter, ClusterService (multiple), plus three more found via grep.
Follow-up Specs (Not This Work)
- GCP / Azure / OnPrem SA isolation (when SA expands to those clouds)
- K8s migration to IRSA: would require switching the per-bucket provider class to WebIdentityTokenCredentialsProvider
- Global init script sandboxing (privilege-escalation surface from user-supplied pre-SA scripts)
- _cloud_url_download_cmd system-bucket detection (if user-supplied spark.jars pointing at the system bucket becomes a real concern)
- Defensive refactor of the 8 AmazonSecurity mutation sites above