Every Syntasa notebook ships with a unified utility namespace called synutils. It is the one entry point for everything the platform exposes to your code: secrets, package installs, notifications, connection metadata, the data registry, Spark helpers, cloud storage, and uploaded files. Using synutils instead of pulling in raw SDKs means you get the platform's auth, credential decryption, and integration plumbing for free, in one consistent shape across Python and Scala.
The same name, synutils, works in both languages. In Python it is a module-level attribute already imported into every kernel; in Scala it is a REPL alias that resolves to the same backing object. Methods and module names are identical across languages; the differences are limited to language idioms (Python dict vs Scala Map, Python keyword arguments written name=value vs Scala named arguments written name = value). This chapter shows both forms for the headline method of each module and Python only for secondary methods, with a note when the Scala signature differs.
Quick start
python
# Python — synutils is already in scope
print(synutils.infrastructure.providerType) # AWS / GOOGLE / AZURE
print(synutils.infrastructure.bucket) # default storage bucket
df = synutils.spark.createDataFrame("user_events") # dataset → Spark DataFrame
scala
// Scala — synutils is already in scope
println(synutils.infrastructure.providerType) // AWS / GOOGLE / AZURE
println(synutils.infrastructure.bucket) // default storage bucket
val df = synutils.spark.createDataFrame("user_events") // dataset → Spark DataFrame
That is the entire setup. The rest of this chapter walks through each module in order of how often you will reach for it.
Working with secrets — SecretString
Three modules — credentials, connections, and infrastructure — return secret values. Rather than handing back raw strings, the platform wraps secrets in a SecretString type that redacts itself in print and log output. The actual secret value is still there; you have to ask for it explicitly to reveal it.
In Python, SecretString is a `str` subclass; in Scala, it has an implicit conversion to `String`. Both languages treat it the same way:
- Printing or logging renders ********** — your secrets won't accidentally leak into notebook output, exception messages, or log files.
- On a single SecretString, call .get() (preferred) or .unseal() (legacy alias) to reveal the raw string value — for passing to SDKs that need it.
- On a SecretDict / SecretMap (returned by methods like credentials.getAll(name)), call .unseal() to get back a regular dict / Map of plaintext strings.
- In Scala, the implicit conversion to String fires automatically when a JDBC connector or similar API expects a String — no explicit unseal needed.
String manipulation reveals the value
Rule of thumb: printing and logging are safe; anything that touches the underlying characters reveals the value.
Because the Python SecretString inherits from str, all of these leak:
"prefix-" + secret · secret[:5] · secret.upper() · secret.encode() · json.dumps({"key": secret})
Prefer f-strings or s-interpolation over + concatenation if you want redaction to hold across log lines.
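The leak rule is easier to see with a toy model. The sketch below is not the platform's SecretString; it is a hypothetical str subclass (called RedactedStr here) that masks itself in str(), repr(), and format(), on the assumption that the real type behaves along similar lines. It shows which operations keep the mask and which read the underlying characters.

```python
import json


class RedactedStr(str):
    """Hypothetical stand-in for a self-redacting secret (not the platform class)."""

    MASK = "**********"

    def __str__(self):
        return self.MASK

    def __repr__(self):
        return repr(self.MASK)

    def __format__(self, spec):
        return format(self.MASK, spec)

    def get(self):
        # Rebuild a plain str from the underlying characters.
        return "".join(self)


secret = RedactedStr("hunter2")

safe_print = str(secret)                 # what print(secret) would display
safe_fstring = f"token={secret}"         # f-strings go through __format__
leak_concat = "prefix-" + secret         # str.__add__ reads the raw characters
leak_slice = secret[:3]                  # slicing returns a plain str
leak_json = json.dumps({"key": secret})  # json treats it as an ordinary str
revealed = secret.get()                  # explicit reveal
```

The masked forms only survive as long as nothing touches the characters directly, which is why concatenation, slicing, and JSON encoding all bypass the redaction.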
credentials — secret store lookup
The platform's credential store is the right place to keep secrets your notebooks need: API keys, service account passwords, tokens. synutils.credentials reads them on demand and wraps every value in a SecretString so you can use them without leaking them into your notebook output.
| Method | Purpose |
|---|---|
| get(name, key) | Fetch a single secret value — returns SecretString. |
| getAll(name) | All secrets under a name — returns SecretDict (Python) / SecretMap (Scala). |
| describe(name) | Human-readable credential description (creator, last updated, type). |
| getMetadata(name) | Full raw metadata payload from the credential service. |
| read(name) | Retrieve the complete credential object in one call. |
| list() | All credential names you have access to. |
The most common pattern — fetch one secret, pass it to an SDK:
python
# Python
import boto3
secret_key = synutils.credentials.get("aws_creds", "secret_access_key")
print(secret_key) # **********
client = boto3.client("s3", aws_secret_access_key=secret_key.get())
scala
// Scala — implicit conversion to String fires automatically for JDBC
import java.sql.DriverManager
val pwd = synutils.credentials.get("db_creds", "password")
println(pwd) // **********
val conn = DriverManager.getConnection(jdbcUrl, "user", pwd)
If you need every value under a credential name as a regular dict, call .unseal() on the result of getAll():
python
# Python
raw = synutils.credentials.getAll("aws_creds").unseal()
client = boto3.client(
    "s3",
    aws_access_key_id=raw["access_key"],
    aws_secret_access_key=raw["secret_access_key"],  # same key name as in the get() example
)
lib — install packages and JARs at runtime
Bash-driven dependency declarations (covered in Init Scripts & Dependencies) run during kernel bootstrap and Spark session creation. synutils.lib is for the moment after the kernel is alive — you're in the middle of a notebook, you realize you need a library, and you want to install it without restarting. lib handles the install on the driver and (by default) ships it to every Spark worker in the same call.
Python — installPyPI / installCondaPackage
| Method | Purpose |
|---|---|
| installPyPI(packages, acrossAllNodes=True) | pip install. Auto-detects URLs, files in deps/python/, plain package names. Multiple packages space-delimited. |
| installCondaPackage(packages, acrossAllNodes=True) | conda install (channel conda-forge). Multiple packages space-delimited. |
By default the install runs on the kernel and on every Spark worker so distributed code can use the package. Pass acrossAllNodes=False if you only need it on the kernel.
Plain Jupyter %pip install and !pip install also work in a notebook cell, but they only install on the kernel — not on the Spark executors. Use synutils.lib.installPyPI when you need the package available to distributed Spark code.
python
# Plain PyPI package
synutils.lib.installPyPI("requests")
# Multiple packages, version-pinned
synutils.lib.installPyPI("requests pandas>=2.0 numpy")
# Local zip from the cluster's deps/python folder
# resolves to <bucket>/<config-folder>/deps/python/simple_module.zip
synutils.lib.installPyPI("simple_module.zip")
# Full cloud path (s3://, gs://, https://)
synutils.lib.installPyPI("s3://my-bucket/python_modules/simple_module.zip")
# Mix everything in one call
synutils.lib.installPyPI("requests simple_module.zip s3://my-bucket/other.whl")
# Kernel-only — skip distribution to Spark workers
synutils.lib.installPyPI("requests", acrossAllNodes=False)
# conda-forge package
synutils.lib.installCondaPackage("scipy")
Scala — installJars
Install JARs into a running Scala kernel. Three source styles, mixed freely in a single space-delimited call:
| Source | Example |
|---|---|
| Maven coordinates | org.joda:joda-money:1.0.4 |
| Filename in the deps folder (<bucket>/<config-folder>/deps/jars/) | greeterWithDollar.jar |
| Full cloud-storage path | gs://my-bucket/greeter.jar |
scala
// Single source
synutils.lib.installJars("org.joda:joda-money:1.0.4")
synutils.lib.installJars("greeterWithDollar.jar")
synutils.lib.installJars("gs://my-bucket/greeter.jar")
// Multiple, space-delimited
synutils.lib.installJars("org.joda:joda-money:1.0.4 greeterWithDollar.jar gs://my-bucket/greeter.jar")
lib vs dependency properties
lib runs interactively in your notebook after the kernel is up, for quick installs while you work. Dependency properties (the syntasa.python.dependencies.* / syntasa.jar.dependencies.* Spark configs) install at session creation, before any cell runs. Use Init Scripts & Dependencies for things every run of the notebook needs; use lib for ad-hoc additions during a single session.
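For ad-hoc installs, it can help to skip the install when the module is already importable. The sketch below is one way to do that; ensure_module is a hypothetical helper, not part of synutils, and it takes the installer as an argument so that it runs anywhere. Note that importlib only inspects the kernel's environment, so it says nothing about what is present on the Spark workers.

```python
import importlib.util


def ensure_module(module_name, install, package_spec=None):
    """Call install(package_spec) only if module_name is not importable here.

    `install` is any callable that accepts a package string, for example
    synutils.lib.installPyPI. Returns True if an install was triggered.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    install(package_spec or module_name)
    return True


# Demonstration with a fake installer (a list's append) so the sketch runs anywhere.
calls = []
already_there = ensure_module("json", calls.append)
missing = ensure_module("surely_not_a_real_module_xyz", calls.append, "some-package>=1.0")
```

In a notebook the call would look like ensure_module("requests", synutils.lib.installPyPI).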
notifications — send platform notifications
Send email from a notebook with optional attachments. Attachments can be a file path, in-memory bytes, or a (filename, bytes) / (filename, bytes, mimetype) tuple — mix them in a single call.
| Method | Purpose |
|---|---|
| send(recipients, subject, message, attachments=…, useDefaultHtmlTemplate=True) | Send an email notification. Recipients can be a single address or comma-separated. |
python
# Plain file path
synutils.notifications.send(
    recipients="a@x.com, b@y.com",
    subject="Daily report",
    message="See attached",
    attachments=["/tmp/report.pdf"],
)
# In-memory bytes — (filename, bytes) tuple
csv_bytes = b"id,name\n1,alice\n"
synutils.notifications.send(
    recipients="a@x.com",
    subject="Daily export",
    message="Attached CSV",
    attachments=[("daily.csv", csv_bytes)],
)
# In-memory bytes with explicit content type — (filename, bytes, mimetype)
import json
data = {"ok": True, "rows": 42}
synutils.notifications.send(
    recipients="a@x.com",
    subject="Run status",
    message="See attached JSON",
    attachments=[("run_status.json", json.dumps(data).encode(), "application/json")],
)
# Mix paths and in-memory content
synutils.notifications.send(
    recipients="a@x.com",
    subject="Job done",
    message="Logs + summary",
    attachments=[
        "/var/log/job.log",
        ("summary.csv", b"job,seconds\nfoo,42\n"),
    ],
)
Set useDefaultHtmlTemplate=False if you want the message body sent verbatim, without the platform's default HTML wrapper.
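When the attachment content is computed in the notebook, you can build the (filename, bytes, mimetype) tuple without writing a temporary file. The helper below is a small sketch using only the standard library; csv_attachment is a hypothetical name, and the text/csv content type is a choice made here rather than something the platform requires.

```python
import csv
import io


def csv_attachment(filename, header, rows):
    """Return a (filename, bytes, mimetype) tuple built from in-memory rows."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return (filename, buf.getvalue().encode("utf-8"), "text/csv")


attachment = csv_attachment("summary.csv", ["job", "seconds"], [["foo", 42]])
```

The resulting tuple can be passed as attachments=[attachment] in a synutils.notifications.send call.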
connections — connection metadata + decrypted parameters
Connections are platform-managed configurations for external systems — Snowflake, JDBC databases, S3 buckets, message queues, anything you've registered. synutils.connections looks them up by name. Encrypted parameter fields are auto-decrypted and wrapped in SecretString so the value is usable but doesn't leak in print output.
| Method | Purpose |
|---|---|
| get(name) | Full connection object — parameters auto-decrypted, encrypted fields wrapped in SecretString. |
| getParam(name, param) | Returns a single parameter value as a plain str / String. Encrypted fields come back unwrapped, so be careful when logging or printing the result. |
| getAllParams(name) | All parameters — preserves SecretString wrapping for encrypted fields. |
| clearCache() | Drop cached responses (force a re-fetch on next call). |
python
# Python
conn = synutils.connections.get("my_snowflake")
host = synutils.connections.getParam("my_snowflake", "host")
# Plain (non-encrypted) field — printed normally
print(conn["parameters"]["host"]) # snow.example.com
# Encrypted field — auto-wrapped, redacts in print, reveals via .get()
pw = conn["parameters"]["password"]
print(pw) # **********
print(pw.get()) # actual password
scala
// Scala
val conn = synutils.connections.get("my_snowflake")
val params = conn("parameters").asInstanceOf[Map[String, Any]]
// Encrypted field: cast to SecretString (asInstanceOf throws if the value has another type)
val pwd = params("password").asInstanceOf[SecretString]
println(pwd) // **********
println(pwd.get()) // actual password
// JDBC — implicit conversion fires (no explicit unseal needed)
import java.sql.DriverManager
DriverManager.getConnection(jdbcUrl, "svc_account", pwd)
Each connection type has a different set of encrypted fields (password, privateKey, accessKey, etc.). The platform decrypts whichever fields are encrypted for the connection type and wraps them in SecretString. Non-encrypted fields like host, port, and database come back as plain strings.
Scala differs here: it wraps every parameter, sensitive or not, as SecretString so the API has one uniform return type. Printing a non-sensitive value such as host therefore also redacts; call .get() to reveal it.
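In Python, one way to keep the password out of string manipulation is to build the connection URL from the non-encrypted fields only and hand the password to the driver as a separate argument. The sketch below assumes the connection's parameters include host, port, and database; jdbc_url is a hypothetical helper, and the URL layout shown is a common pattern that varies by driver.

```python
def jdbc_url(params, subprotocol):
    """Build a JDBC URL from non-secret connection parameters only."""
    return f"jdbc:{subprotocol}://{params['host']}:{params['port']}/{params['database']}"


# Stand-in for conn["parameters"]; on the platform the password would be a SecretString.
example_params = {
    "host": "db.example.com",
    "port": "5432",
    "database": "analytics",
    "password": "not-used-here",
}
url = jdbc_url(example_params, "postgresql")
```

Because the password never enters the f-string, the URL is safe to print or log.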
auth — tokens
The auth module generates the access tokens you need to call platform APIs.
| Method | Purpose |
|---|---|
| getAccessToken() | Fresh access token, generated on each call. |
| getRefreshToken() | Refresh token generated for this session. |
| getAuthType() | Auth type set for this session. |
python
# Python
access_token = synutils.auth.getAccessToken()
refresh_token = synutils.auth.getRefreshToken()
auth_type = synutils.auth.getAuthType()
scala
// Scala
val accessToken = synutils.auth.getAccessToken()
val refreshToken = synutils.auth.getRefreshToken()
val authType = synutils.auth.getAuthType()
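A typical use of the access token is to attach it to an HTTP request. The sketch below builds (but does not send) such a request with the standard library. The Bearer scheme and the endpoint URL are assumptions for illustration; check getAuthType() and your platform's API documentation for what your deployment expects.

```python
import urllib.request


def authorized_request(url, access_token):
    """Build (but do not send) a request carrying the token as a Bearer header."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {access_token}"})


# Hypothetical endpoint and a placeholder token, for illustration only.
req = authorized_request("https://platform.example.com/api/ping", "abc123")
auth_header = req.get_header("Authorization")
```

In a notebook, the token argument would come from synutils.auth.getAccessToken().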
infrastructure — platform infra metadata
Read-only access to the platform's infrastructure metadata: cloud provider, region, default storage bucket, metastore details, and so on. Useful when you want your notebook to behave differently on AWS vs GCP, or to pick up the platform's bucket without hard-coding it. Two ways to get at the data — bare attributes for the most-used values, section getters for fuller payloads.
Properties (read-only attributes)
| Property | Returns |
|---|---|
| providerType | Cloud provider — "AWS" / "GOOGLE" / "AZURE" |
| region | Cloud region |
| projectId | GCP project ID (empty for AWS/Azure) |
| bucket | Default storage bucket |
| fileSystemPrefix | Filesystem prefix (s3://, gs://, abfs://) |
| storagePath | Full storage path (prefix + bucket) |
| sshType | SSH connection type |
| metastoreType | Metastore type (AWS_GLUE / HIVE / BIGQUERY) |
| metastoreHostname | Metastore hostname |
Section getters
| Method | Returns |
|---|---|
| getConfig() | config section — region, projectId, etc. |
| getStorage() | storage section — bucket, paths |
| getNetwork() | network section |
| getMetastore() | metastore section |
| getSecurity() | security section — cloud creds, SSH |
| getAll() | Full payload (all sections) |
| getGlobalInitScript() | Global init script — Optional[str] / Option[String]. None / empty if no global script is configured. |
| clearCache() | Drop the cached payload. |
Python returns dict everywhere; Scala returns Map[String, AnyRef] — same shape, language-native types.
python
# Python
print(synutils.infrastructure.providerType)
print(synutils.infrastructure.bucket)
print(synutils.infrastructure.getStorage()) # full storage dict
# Encrypted fields are SecretString-wrapped (see "Working with secrets" above)
sec = synutils.infrastructure.getSecurity()
print(sec) # {'accessKey': '**********', ...}
sec["secretKey"].get() # decrypted plaintext
Legacy aliases on infrastructure
Older notebooks may use names like get_config_from_metadata(), get_storage_from_metadata(), asDict(). These still work — they delegate to the canonical methods above. Prefer the canonical names in new code.
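The properties above are enough to keep paths and provider checks out of your notebook code. The sketch below shows one way to do that; both helpers are hypothetical, and the stub object stands in for synutils.infrastructure so the example runs outside the platform. It assumes storagePath looks like prefix plus bucket, for example gs://my-bucket.

```python
from types import SimpleNamespace


def output_location(infra, *parts):
    """Join path parts onto storagePath (filesystem prefix + bucket)."""
    base = infra.storagePath.rstrip("/")
    return "/".join([base, *parts])


def needs_project_id(infra):
    """projectId is only populated on GCP, so only rely on it there."""
    return infra.providerType == "GOOGLE"


# Stub with the same attribute names; in a notebook, pass synutils.infrastructure.
stub = SimpleNamespace(providerType="GOOGLE", storagePath="gs://my-bucket")
location = output_location(stub, "exports", "daily.csv")
on_gcp = needs_project_id(stub)
```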
eventstores and datasets — the data registry
These two modules are paired: an event store is a logical data unit (a path + a metastore database, with separate values per environment), and a dataset is a registered table inside an event store. You'll use them together to address platform-managed data without hard-coding paths or database names.
eventstores
| Method | Purpose |
|---|---|
| get(name, lookupType="name", env="development") | Full event store object with resolved path + database for the requested environment. |
| getPath(name, lookupType="name", env="development") | Storage path only. |
| getDatabase(name, lookupType="name", env="development") | Hive database name only. |
| configure(defaultName=…, defaultEnv=…) | Set a default event store + environment so path / database / name properties resolve without arguments. |
| path / database / name | Properties returning the values of the default event store (requires configure(...) first). |
| clearCache() | Drop the response cache. |
lookupType controls how the first argument is interpreted: "name" (the default) treats it as the event store's logical name and uses env to pick the environment; "database" treats it as a database name and auto-detects which environment it belongs to.
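The two lookup modes can be pictured with a toy in-memory model. This is only an illustration of the semantics described above, not how the platform resolves event stores; the registry contents, paths, and the lookup function are all hypothetical.

```python
# Hypothetical registry; real event stores live in the platform, not in a dict.
REGISTRY = {
    "click_stream": {
        "development": {"path": "gs://my-bucket/dev/click_stream", "database": "click_stream_dev"},
        "production": {"path": "gs://my-bucket/prod/click_stream", "database": "click_stream_prod"},
    }
}


def lookup(value, lookupType="name", env="development"):
    """Toy model of the "name" and "database" lookup modes."""
    if lookupType == "name":
        # value is the logical name; env selects the environment.
        return {"name": value, "environment": env, **REGISTRY[value][env]}
    if lookupType == "database":
        # value is a database name; the environment is discovered, not passed in.
        for name, envs in REGISTRY.items():
            for env_name, details in envs.items():
                if details["database"] == value:
                    return {"name": name, "environment": env_name, **details}
        raise KeyError(value)
    raise ValueError(f"unknown lookupType: {lookupType}")


by_name = lookup("click_stream", env="production")
by_db = lookup("click_stream_dev", lookupType="database")
```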
python
# Lookup by name + environment
es = synutils.eventstores.get("click_stream", env="production")
print(es["path"], es["database"])
# Just the path or database
synutils.eventstores.getPath("click_stream", env="development")
synutils.eventstores.getDatabase("click_stream", env="production")
# Lookup by database — env auto-detected from the database name
es = synutils.eventstores.get("click_stream_dev", lookupType="database")
# Configure defaults — then use bare properties
synutils.eventstores.configure(defaultName="click_stream", defaultEnv="development")
print(synutils.eventstores.path)
print(synutils.eventstores.database)
# Switch environment without changing the name
synutils.eventstores.configure(defaultEnv="production")
print(synutils.eventstores.path)
scala
// Scala — same surface, named arguments
val es = synutils.eventstores.get("click_stream", env = "production")
println(s"${es("path")} ${es("database")}")
synutils.eventstores.configure(defaultName = "click_stream", defaultEnv = "development")
println(synutils.eventstores.path)
datasets
| Method | Purpose |
|---|---|
| get(datasetName) | DataSet object with table name, partition columns, etc. |
| create(datasetName, fileFormat=PARQUET) | Register a new dataset. |
| list(eventStoreName) | All datasets under an event store (dev + prod combined). Filter by environment field if needed. |
DataSet object methods: tableName(), getPartitionColumns(), getNonPartitionColumns(), isPartitioned().
python
# Inspect a dataset
ds = synutils.datasets.get("user_events")
print(ds.tableName(), ds.isPartitioned())
# Register a new dataset (default file format is PARQUET)
synutils.datasets.create("my_new_dataset")
synutils.datasets.create("my_avro_dataset", fileFormat="AVRO")
# List datasets under an event store, filter by environment
prod = [d for d in synutils.datasets.list("TestStore") if d["environment"] == "PRODUCTION"]
spark — dataset → DataFrame and write helpers
Spark-side helpers that wrap a dataset name into a DataFrame and handle the write back to an event store. These are the methods you'll reach for whenever a notebook needs to read or write a registered dataset rather than work against raw paths.
| Method | Purpose |
|---|---|
| createDataFrame(datasetName, from_date=None, to_date=None) | Read a dataset into a DataFrame, optionally filtered by date range. |
| isTableExists(dataset) | True if the Hive table backing this dataset exists. |
| createTable(df, name, partitionedDateColumn="") | Create a Hive table from a DataFrame. |
| writeToEventStore(df, datasetName, …) | Write a DataFrame to an event store. Auto-creates the dataset if missing. |
| writeDatasetToEventStore(df, datasetName) | Convenience wrapper around writeToEventStore — uses the dataset's defaults. |
| writeFileToEventStore(localPath, eventstorePath) | Push a local file into the event store. |
The most common pattern — read a dataset, transform, write back:
python
# Python
df = synutils.spark.createDataFrame("user_events")
df.show(5)
# Transform...
result = df.filter(df.country == "US")
# Write back as a registered dataset
synutils.spark.writeDatasetToEventStore(result, "user_events_us")
scala
// Scala
val df = synutils.spark.createDataFrame("user_events")
val result = df.filter(df("country") === "US")
synutils.spark.writeDatasetToEventStore(result, "user_events_us")
writeToEventStore — full signature
For writes that need more control than the convenience wrapper, use the full method:
| Parameter | Default | Purpose |
|---|---|---|
| df | — | Source DataFrame to write. |
| datasetName | — | database.tablename format. |
| numPartitions | None / 0 | If > 0 and the dataset is partitioned, adds DISTRIBUTE BY <partition_cols>, floor(rand()*numPartitions) to control output file count per partition. |
| partitionedDateColumn | None / "" | Override the dataset's configured partition column. If set, the dataset is updated before writing. |
| isOverwrite | True | True → INSERT OVERWRITE TABLE (replaces partition data); False → INSERT INTO TABLE (appends). |
| overrideProcessMode | True | When True, recreates the table even if it exists. Set False to preserve an existing table definition. |
| fileFormat | PARQUET | Used only when the dataset must be created (404 from the dataset API). Ignored if the dataset already exists. Supported: PARQUET, ORC, AVRO, DELTA, TEXTFILE. |
python
# Append (don't overwrite existing partitions)
synutils.spark.writeToEventStore(df, "analytics.user_events", isOverwrite=False)
# Control output file count per partition
synutils.spark.writeToEventStore(df, "analytics.user_events", numPartitions=8)
# Auto-create as Avro if dataset doesn't exist yet
synutils.spark.writeToEventStore(df, "analytics.new_avro_dataset", fileFormat="AVRO")
isOverwrite is partition-level for partitioned tables
For partitioned tables, isOverwrite=True overwrites at the partition level — only the partitions present in the DataFrame are replaced. For non-partitioned tables, the entire table is overwritten.
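To make the interaction between isOverwrite and numPartitions concrete, the sketch below assembles the kind of statement the parameter table describes. The exact SQL the platform generates is not shown in this chapter, so treat this as an approximation: build_insert_sql, the source_view name, and the event_date column are hypothetical. The floor(rand()*N) term adds a random salt in the range 0 to N-1, which spreads each partition's rows across up to N write tasks and therefore up to N output files.

```python
def build_insert_sql(table, partition_cols, is_overwrite=True, num_partitions=0):
    """Approximate the INSERT statement implied by the writeToEventStore options."""
    verb = "INSERT OVERWRITE TABLE" if is_overwrite else "INSERT INTO TABLE"
    sql = f"{verb} {table}"
    if partition_cols:
        sql += f" PARTITION ({', '.join(partition_cols)})"
    sql += " SELECT * FROM source_view"
    # DISTRIBUTE BY is only added for partitioned datasets with numPartitions > 0.
    if partition_cols and num_partitions and num_partitions > 0:
        sql += f" DISTRIBUTE BY {', '.join(partition_cols)}, floor(rand()*{num_partitions})"
    return sql


append_sql = build_insert_sql("analytics.user_events", ["event_date"], is_overwrite=False)
spread_sql = build_insert_sql("analytics.user_events", ["event_date"], num_partitions=8)
```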
fs — direct cloud storage
Direct access to the underlying object store, with the same API across S3, GCS, Azure, and HDFS. The synutils.fs object auto-routes by the URI scheme of the path you pass in — no separate clients to instantiate.
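Routing by URI scheme can be pictured as a simple dispatch on the part of the path before ://. The sketch below is a toy model of that idea, not the synutils.fs implementation; the scheme-to-backend table is an assumption, and the schemes your platform actually accepts may differ.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-backend table, for illustration only.
BACKENDS = {"s3": "S3", "gs": "GCS", "abfs": "Azure", "hdfs": "HDFS"}


def backend_for(path):
    """Return the backend label for a path, based on its URI scheme."""
    scheme = urlparse(path).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return BACKENDS[scheme]


routed = [
    backend_for(p)
    for p in ("s3://my-bucket/a.csv", "gs://my-bucket/a.csv", "hdfs:///tmp/a.csv")
]
```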
Most calls fall into one of four shapes:
- ls / listRecursive / exists — list and probe.
- upload / download / uploadFolder / downloadFolder — move data between local disk and the object store.
- copy / move / rename / delete / mkdir — manage what's there.
- content / head / stream / writeText / uploadStream — read or write contents directly. Use content for small files, head for previews, and stream for large files you don't want fully in memory.
| Method | Purpose |
|---|---|
| ls(path) | List entries (non-recursive). |
| listRecursive(path) | Recursive list. |
| exists(path) | True if file or folder-prefix exists. |
| upload(local, remote) | Upload single file. |
| download(remote, local) | Download single file. |
| uploadFolder(localDir, remoteDir) | Recursive upload. |
| downloadFolder(remoteDir, localDir) | Recursive download. |
| copy(src, dest) | Server-side copy. |
| move(src, dest) | Move (across buckets / containers permitted). |
| rename(old, new) | Rename in place — same bucket / container only. |
| delete(path) | Delete file or prefix. |
| mkdir(path) | Create directory marker. |
| content(path) | Read full text content. |
| head(path, maxBytes=65536) | Read first N bytes as text. |
| writeText(path, content) | Write a text file. |
| stream(path) | Lazy read stream for large files. |
| uploadStream(fileObj, path) | Upload from a file-like object. |
python
# Python
synutils.fs.upload("local.csv", "gs://my-bucket/remote.csv")
print(synutils.fs.exists("gs://my-bucket/remote.csv"))
print(synutils.fs.ls("gs://my-bucket/"))
scala
// Scala
synutils.fs.upload("local.csv", "gs://my-bucket/remote.csv")
println(synutils.fs.exists("gs://my-bucket/remote.csv"))
println(synutils.fs.ls("gs://my-bucket/"))
Legacy aliases on fs (Python only)
Older code may call put, rm, mv, exist, is_exists, create_folder, list, upload_folder, download_folder, upload_stream. They still work — they delegate to the canonical methods above. Prefer the canonical names in new code.
files — uploaded file objects
On the platform, a file object (a Syntasa-platform concept, not Python's file type) pairs a base cloud-storage path with a list of files registered under it — the object's parameters. synutils.files resolves these registered objects to full cloud paths and — for DATA_FILE objects in supported formats — reads them directly into a Spark DataFrame so you don't have to wire the read up by hand.
| Method | Purpose |
|---|---|
| get(name) | Full file object dict / Map from the API (cached per name). |
| getPath(name) | Full cloud paths for all files in this object — returns List[str]. |
| getMetadata(name) | Curated subset of metadata with renamed keys. |
| createDataFrame(name, fileName, sep=None, header=True, inferSchema=True) | Read a registered file into a Spark DataFrame. DATA_FILE objects only; supported fileFormat: DELIMITED, JSON, PARQUET, ORC, AVRO. |
| clearCache() | Drop cached responses. |
The sep, header, and inferSchema parameters apply to DELIMITED only and are ignored for the other formats. When sep is None (the default), it falls back to the delimiter configured on the file object, then to a comma. JSON, PARQUET, ORC, and AVRO use Spark's native readers.
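The delimiter fallback order for DELIMITED files can be written out as a short function. This is a sketch of the rule as described, not the platform's code; resolve_sep is a hypothetical helper, and the "delimiter" key is an assumed field name, so check the object returned by synutils.files.get(name) for where the delimiter is actually stored.

```python
def resolve_sep(sep, file_object, key="delimiter"):
    """Explicit sep wins, then the file object's configured delimiter, then a comma."""
    if sep is not None:
        return sep
    configured = file_object.get(key)
    return configured if configured else ","


explicit = resolve_sep("\t", {"delimiter": "|"})     # caller override
from_object = resolve_sep(None, {"delimiter": "|"})  # configured on the file object
fallback = resolve_sep(None, {})                     # nothing configured
```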
python
# Inspect a file object
info = synutils.files.get("daily_report")
print(info["objectTypeKey"], info["fileFormat"])
# Get full cloud paths for every file in the object
paths = synutils.files.getPath("daily_report")
# ['gs://my-bucket/reports/sales.csv', 'gs://my-bucket/reports/orders.csv']
# Read one file as a DataFrame (uses the object's configured delimiter)
df = synutils.files.createDataFrame("daily_report", "sales.csv")
df.show(5)
# Override CSV options for this read only
df = synutils.files.createDataFrame(
    "daily_report", "sales.tsv",
    sep="\t", header=False, inferSchema=False,
)
# JSON / PARQUET / ORC / AVRO — sep / header / inferSchema are ignored
events = synutils.files.createDataFrame("event_dump", "events.json")
sales = synutils.files.createDataFrame("sales_dump", "2024-01.parquet")
getPath() returns all files; createDataFrame() reads exactly one
getPath() gives you every cloud path under a file object — useful when you want Spark to read everything in one go via spark.read.csv(synutils.files.getPath("daily_report")). createDataFrame() reads exactly one file at a time, identified by fileName.