Notebook Utilities (synutils) – SYNTASA™

Every Syntasa notebook ships with a unified utility namespace called synutils. It is the one entry point for everything the platform exposes to your code: secrets, package installs, notifications, connection metadata, the data registry, Spark helpers, cloud storage, and uploaded files. Using synutils instead of pulling in raw SDKs means you get the platform's auth, credential decryption, and integration plumbing for free, in one consistent shape across Python and Scala.

The same name, synutils, works in both languages. In Python it is a module-level attribute already imported into every kernel; in Scala it is a REPL alias that resolves to the same backing object. Methods and module names are identical across languages; the differences are limited to language idioms (Python dict vs Scala Map, Python's name=value keyword arguments vs Scala's name = value named arguments). This chapter shows both forms for the headline method of each module and Python only for secondary methods, with a note when the Scala signature differs.

Quick start

python
# Python — synutils is already in scope
print(synutils.infrastructure.providerType)        # AWS / GOOGLE / AZURE
print(synutils.infrastructure.bucket)              # default storage bucket
df = synutils.spark.createDataFrame("user_events") # dataset → Spark DataFrame

scala
// Scala — synutils is already in scope
println(synutils.infrastructure.providerType)        // AWS / GOOGLE / AZURE
println(synutils.infrastructure.bucket)              // default storage bucket
val df = synutils.spark.createDataFrame("user_events") // dataset → Spark DataFrame

That is the entire setup. The rest of this chapter walks through each module in order of how often you will reach for it.

Working with secrets — SecretString

Three modules — credentials, connections, and infrastructure — return secret values. Rather than handing back raw strings, the platform wraps secrets in a SecretString type that redacts itself in print and log output. The actual secret value is still there; you have to ask for it explicitly to reveal it.

In Python, SecretString is a `str` subclass; in Scala, it has an implicit conversion to `String`. Both languages treat it the same way:

Printing or logging renders ********** — your secrets won't accidentally leak into notebook output, exception messages, or log files.
On a single SecretString, call .get() (preferred) or .unseal() (legacy alias) to reveal the raw string value — for passing to SDKs that need it.
On a SecretDict / SecretMap (returned by methods like credentials.getAll(name)), call .unseal() to get back a regular dict / Map of plaintext strings.
In Scala, the implicit conversion to String fires automatically when a JDBC connector or similar API expects a String — no explicit unseal needed.

String manipulation reveals the value

Rule of thumb: printing and logging are safe; anything that touches the underlying characters reveals the value.

Because SecretString is a str subclass in Python, all of these leak:

"prefix-" + secret · secret[:5] · secret.upper() · secret.encode() · json.dumps({"key": secret})

Prefer f-strings or s-interpolation over + concatenation if you want redaction to hold across log lines.
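
The rule of thumb above follows from how a str subclass behaves in Python. The sketch below uses a hypothetical stand-in (the class name RedactedStr and its internals are assumptions for illustration, not the platform's SecretString implementation). It overrides only __str__ and __repr__, so rendering is redacted while any operation that reads the underlying characters returns plaintext.

```python
class RedactedStr(str):
    """Hypothetical stand-in for a self-redacting secret; not the platform's class."""

    MASK = "**********"

    def __str__(self):
        return self.MASK

    def __repr__(self):
        return repr(self.MASK)

    def get(self):
        # str.__str__ bypasses the override above and returns a plain str
        return str.__str__(self)


secret = RedactedStr("s3cr3t")

rendered = f"token={secret}"        # formatting goes through __str__
concatenated = "prefix-" + secret   # concatenation reads the raw characters
upper = secret.upper()              # str methods read the raw characters too

print(secret)        # redacted
print(rendered)      # redacted
print(concatenated)  # plaintext leaks
```

In this sketch the f-string stays redacted because formatting with an empty format spec falls back to __str__, while + and .upper() operate on the stored characters directly.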

credentials — secret store lookup

The platform's credential store is the right place to keep secrets your notebooks need: API keys, service account passwords, tokens. synutils.credentials reads them on demand and wraps every value in a SecretString so you can use them without leaking them into your notebook output.

Method	Purpose
get(name, key)	Fetch a single secret value — returns SecretString.
getAll(name)	All secrets under a name — returns SecretDict (Python) / SecretMap (Scala).
describe(name)	Human-readable credential description (creator, last updated, type).
getMetadata(name)	Full raw metadata payload from the credential service.
read(name)	Retrieve the complete credential object in one call.
list()	All credential names you have access to.

The most common pattern — fetch one secret, pass it to an SDK:

python
# Python
import boto3
 
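# boto3 needs both halves of the key pair; passing only one raises an error
access_key = synutils.credentials.get("aws_creds", "access_key")
secret_key = synutils.credentials.get("aws_creds", "secret_key")
print(secret_key)              # **********
client = boto3.client(
    "s3",
    aws_access_key_id=access_key.get(),
    aws_secret_access_key=secret_key.get(),
)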

scala
// Scala — implicit conversion to String fires automatically for JDBC
import java.sql.DriverManager
 
val pwd = synutils.credentials.get("db_creds", "password")
println(pwd)                                     // **********
val conn = DriverManager.getConnection(jdbcUrl, "user", pwd)

If you need every value under a credential name as a regular dict, call .unseal() on the result of getAll():

python
# Python
raw = synutils.credentials.getAll("aws_creds").unseal()
client = boto3.client(
    "s3",
    aws_access_key_id=raw["access_key"],
    aws_secret_access_key=raw["secret_key"],
)

lib — install packages and JARs at runtime

Bash-driven dependency declarations (covered in Init Scripts & Dependencies) run during kernel bootstrap and Spark session creation. synutils.lib is for the moment after the kernel is alive — you're in the middle of a notebook, you realize you need a library, and you want to install it without restarting. lib handles the install on the driver and (by default) ships it to every Spark worker in the same call.

Python — installPyPI / installCondaPackage

Method	Purpose
installPyPI(packages, acrossAllNodes=True)	pip install. Auto-detects URLs, files in deps/python/, plain package names. Multiple packages space-delimited.
installCondaPackage(packages, acrossAllNodes=True)	conda install (channel conda-forge). Multiple packages space-delimited.

By default the install runs on the kernel and on every Spark worker so distributed code can use the package. Pass acrossAllNodes=False if you only need it on the kernel.

Plain Jupyter %pip install and !pip install also work in a notebook cell, but they only install on the kernel — not on the Spark executors. Use synutils.lib.installPyPI when you need the package available to distributed Spark code.

python
# Plain PyPI package
synutils.lib.installPyPI("requests")
 
# Multiple packages, version-pinned
synutils.lib.installPyPI("requests pandas>=2.0 numpy")
 
# Local zip from the cluster's deps/python folder
#   resolves to <bucket>/<config-folder>/deps/python/simple_module.zip
synutils.lib.installPyPI("simple_module.zip")
 
# Full cloud path (s3://, gs://, https://)
synutils.lib.installPyPI("s3://my-bucket/python_modules/simple_module.zip")
 
# Mix everything in one call
synutils.lib.installPyPI("requests simple_module.zip s3://my-bucket/other.whl")
 
# Kernel-only — skip distribution to Spark workers
synutils.lib.installPyPI("requests", acrossAllNodes=False)
 
# conda-forge package
synutils.lib.installCondaPackage("scipy")
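
installPyPI takes one space-delimited string and treats each token as a URL, a file in deps/python/, or a plain package name. How the platform tells these apart is not described here. The sketch below shows one plausible classification; the helper name classify_specs, the scheme list, and the suffix rules are all assumptions for illustration.

```python
from urllib.parse import urlparse

# Assumed rules for illustration only; the platform's detection may differ.
REMOTE_SCHEMES = {"s3", "gs", "http", "https"}
ARCHIVE_SUFFIXES = (".zip", ".whl", ".tar.gz", ".egg")


def classify_specs(packages: str) -> dict:
    """Split a space-delimited spec string and bucket each token by kind."""
    result = {"remote": [], "deps_file": [], "pypi": []}
    for token in packages.split():
        if urlparse(token).scheme in REMOTE_SCHEMES:
            result["remote"].append(token)
        elif token.lower().endswith(ARCHIVE_SUFFIXES):
            result["deps_file"].append(token)
        else:
            result["pypi"].append(token)
    return result


specs = classify_specs("requests pandas>=2.0 simple_module.zip s3://my-bucket/other.whl")
print(specs)
```

Because tokens are split on whitespace, a version specifier has to be written without spaces (pandas>=2.0, not pandas >= 2.0) to survive as a single token under this reading.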

Scala — installJars

Install JARs into a running Scala kernel. Three source styles, mixed freely in a single space-delimited call:

Source	Example
Maven coordinates	org.joda:joda-money:1.0.4
Filename in the deps folder (<bucket>/<config-folder>/deps/jars/)	greeterWithDollar.jar
Full cloud-storage path	gs://my-bucket/greeter.jar

scala
// Single source
synutils.lib.installJars("org.joda:joda-money:1.0.4")
synutils.lib.installJars("greeterWithDollar.jar")
synutils.lib.installJars("gs://my-bucket/greeter.jar")
 
// Multiple, space-delimited
synutils.lib.installJars("org.joda:joda-money:1.0.4 greeterWithDollar.jar gs://my-bucket/greeter.jar")

lib vs dependency properties

lib runs interactively in your notebook after the kernel is up — quick installs while you work. dependency properties (the syntasa.python.dependencies.* / syntasa.jar.dependencies.* Spark configs) install at session creation, before any cell runs. Use Init Scripts & Dependencies for things every run of the notebook needs; use lib for ad-hoc additions during a single session.

notifications — send platform notifications

Send email from a notebook with optional attachments. Attachments can be a file path, in-memory bytes, or a (filename, bytes) / (filename, bytes, mimetype) tuple — mix them in a single call.

Method	Purpose
send(recipients, subject, message, attachments=…, useDefaultHtmlTemplate=True)	Send an email notification. Recipients can be a single address or comma-separated.

python
# Plain file path
synutils.notifications.send(
    recipients="a@x.com, b@y.com",
    subject="Daily report",
    message="See attached",
    attachments=["/tmp/report.pdf"],
)
 
# In-memory bytes — (filename, bytes) tuple
csv_bytes = b"id,name\n1,alice\n"
synutils.notifications.send(
    recipients="a@x.com",
    subject="Daily export",
    message="Attached CSV",
    attachments=[("daily.csv", csv_bytes)],
)
 
# In-memory bytes with explicit content type — (filename, bytes, mimetype)
import json
data = {"ok": True, "rows": 42}
synutils.notifications.send(
    recipients="a@x.com",
    subject="Run status",
    message="See attached JSON",
    attachments=[("run_status.json", json.dumps(data).encode(), "application/json")],
)
 
# Mix paths and in-memory content
synutils.notifications.send(
    recipients="a@x.com",
    subject="Job done",
    message="Logs + summary",
    attachments=[
        "/var/log/job.log",
        ("summary.csv", b"job,seconds\nfoo,42\n"),
    ],
)

Set useDefaultHtmlTemplate=False if you want the message body sent verbatim, without the platform's default HTML wrapper.
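
Since attachments may mix paths, bare bytes, and tuples, it can help to picture them being normalized to a single shape before sending. The following is a sketch of that idea, not the platform's code: normalize_attachment, guess_mimetype, and the default name for bare bytes are hypothetical.

```python
import mimetypes
import os


def guess_mimetype(filename):
    # Fall back to a generic binary type when the extension is unknown
    return mimetypes.guess_type(filename)[0] or "application/octet-stream"


def normalize_attachment(item):
    """Return (filename, payload_bytes, mimetype) for one attachment entry."""
    if isinstance(item, str):
        # File path: read the bytes and use the base name as the filename
        with open(item, "rb") as fh:
            return os.path.basename(item), fh.read(), guess_mimetype(item)
    if isinstance(item, (bytes, bytearray)):
        # Bare bytes carry no name; this default name is an assumption
        return "attachment.bin", bytes(item), "application/octet-stream"
    if isinstance(item, tuple) and len(item) == 2:
        filename, payload = item
        return filename, payload, guess_mimetype(filename)
    if isinstance(item, tuple) and len(item) == 3:
        return item
    raise TypeError(f"unsupported attachment: {item!r}")


mixed = [
    ("summary.csv", b"job,seconds\nfoo,42\n"),
    ("run_status.json", b'{"ok": true}', "application/json"),
]
normalized = [normalize_attachment(item) for item in mixed]
```

Passing the three-element tuple is the way to pin a content type explicitly; the two-element form leaves it to be inferred from the filename extension in this sketch.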

connections — connection metadata + decrypted parameters

Connections are platform-managed configurations for external systems — Snowflake, JDBC databases, S3 buckets, message queues, anything you've registered. synutils.connections looks them up by name. Encrypted parameter fields are auto-decrypted and wrapped in SecretString so the value is usable but doesn't leak in print output.

Method	Purpose
get(name)	Full connection object — parameters auto-decrypted, encrypted fields wrapped in SecretString.
getParam(name, param)	Returns a single parameter value as a plain str / String. Encrypted fields come back unwrapped, so be careful when logging or printing the result.
getAllParams(name)	All parameters — preserves SecretString wrapping for encrypted fields.
clearCache()	Drop cached responses (force a re-fetch on next call).

python
# Python
conn = synutils.connections.get("my_snowflake")
host = synutils.connections.getParam("my_snowflake", "host")
 
# Plain (non-encrypted) field — printed normally
print(conn["parameters"]["host"])             # snow.example.com
 
# Encrypted field — auto-wrapped, redacts in print, reveals via .get()
pw = conn["parameters"]["password"]
print(pw)                                     # **********
print(pw.get())                               # actual password

scala
// Scala
val conn = synutils.connections.get("my_snowflake")
val params = conn("parameters").asInstanceOf[Map[String, Any]]
 
// Encrypted field: cast to SecretString for typed access
val pwd = params("password").asInstanceOf[SecretString]
println(pwd)              // **********
println(pwd.get())        // actual password
 
// JDBC — implicit conversion fires (no explicit unseal needed)
import java.sql.DriverManager
DriverManager.getConnection(jdbcUrl, "svc_account", pwd)

Each connection type has a different set of encrypted fields (password, privateKey, accessKey, etc.). The platform decrypts whichever fields are encrypted for the connection type and wraps them in SecretString. In Python, non-encrypted fields like host, port, and database come back as plain strings.

In Scala, every parameter, sensitive or not, is wrapped as SecretString so the API has one uniform return type. Printing a non-sensitive value such as host also redacts; call .get() to reveal it.

auth — tokens

synutils.auth generates the access tokens you need to call platform APIs.

Method	Purpose
getAccessToken()	Fresh access token, generated on each call.
getRefreshToken()	Refresh token generated for this session.
getAuthType()	Auth type set for this session.

python
# Python
access_token  = synutils.auth.getAccessToken()
refresh_token = synutils.auth.getRefreshToken()
auth_type     = synutils.auth.getAuthType()

scala
// Scala
val accessToken  = synutils.auth.getAccessToken()
val refreshToken = synutils.auth.getRefreshToken()
val authType     = synutils.auth.getAuthType()

infrastructure — platform infra metadata

Read-only access to the platform's infrastructure metadata: cloud provider, region, default storage bucket, metastore details, and so on. Useful when you want your notebook to behave differently on AWS vs GCP, or to pick up the platform's bucket without hard-coding it. Two ways to get at the data — bare attributes for the most-used values, section getters for fuller payloads.

Properties (read-only attributes)

Property	Returns
providerType	Cloud provider — "AWS" / "GOOGLE" / "AZURE"
region	Cloud region
projectId	GCP project ID (empty for AWS/Azure)
bucket	Default storage bucket
fileSystemPrefix	Filesystem prefix (s3://, gs://, abfs://)
storagePath	Full storage path (prefix + bucket)
sshType	SSH connection type
metastoreType	Metastore type (AWS_GLUE / HIVE / BIGQUERY)
metastoreHostname	Metastore hostname

Section getters

Method	Returns
getConfig()	config section — region, projectId, etc.
getStorage()	storage section — bucket, paths
getNetwork()	network section
getMetastore()	metastore section
getSecurity()	security section — cloud creds, SSH
getAll()	Full payload (all sections)
getGlobalInitScript()	Global init script — Optional[str] / Option[String]. None / empty if no global script is configured.
clearCache()	Drop the cached payload.

Python returns dict everywhere; Scala returns Map[String, AnyRef] — same shape, language-native types.

python
# Python
print(synutils.infrastructure.providerType)
print(synutils.infrastructure.bucket)
print(synutils.infrastructure.getStorage())   # full storage dict
 
# Encrypted fields are SecretString-wrapped (see "Working with secrets" above)
sec = synutils.infrastructure.getSecurity()
print(sec)                                    # {'accessKey': '**********', ...}
sec["secretKey"].get()                        # decrypted plaintext
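
One use of these properties is building output paths that follow the platform's cloud provider instead of hard-coding a bucket. The sketch below takes the values as plain arguments so it runs outside the platform; in a notebook they would come from synutils.infrastructure.providerType and synutils.infrastructure.bucket. The provider-to-prefix mapping is an assumption inferred from the property table, and build_output_path is a hypothetical helper. In real notebooks, reading fileSystemPrefix or storagePath directly is the safer choice.

```python
# Assumed mapping, inferred from the property table above.
PREFIX_BY_PROVIDER = {"AWS": "s3://", "GOOGLE": "gs://", "AZURE": "abfs://"}


def build_output_path(provider_type: str, bucket: str, relative_path: str) -> str:
    """Join prefix, bucket, and a relative path into one storage URI."""
    try:
        prefix = PREFIX_BY_PROVIDER[provider_type]
    except KeyError:
        raise ValueError(f"unknown provider: {provider_type!r}") from None
    return f"{prefix}{bucket}/{relative_path.lstrip('/')}"


path = build_output_path("GOOGLE", "my-bucket", "/exports/daily.csv")
print(path)
```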

Legacy aliases on infrastructure

Older notebooks may use names like get_config_from_metadata(), get_storage_from_metadata(), asDict(). These still work — they delegate to the canonical methods above. Prefer the canonical names in new code.

eventstores and datasets — the data registry

These two modules are paired: an event store is a logical data unit (a path + a metastore database, with separate values per environment), and a dataset is a registered table inside an event store. You'll use them together to address platform-managed data without hard-coding paths or database names.

eventstores

Method	Purpose
get(name, lookupType="name", env="development")	Full event store object with resolved path + database for the requested environment.
getPath(name, lookupType="name", env="development")	Storage path only.
getDatabase(name, lookupType="name", env="development")	Hive database name only.
configure(defaultName=…, defaultEnv=…)	Set a default event store + environment so path / database / name properties resolve without arguments.
path / database / name	Properties returning the values of the default event store (requires configure(...) first).
clearCache()	Drop the response cache.

lookupType controls how the first argument is interpreted: "name" (the default) treats it as the event store's logical name and uses env to pick the environment; "database" treats it as a database name and auto-detects which environment it belongs to.

python
# Lookup by name + environment
es = synutils.eventstores.get("click_stream", env="production")
print(es["path"], es["database"])
 
# Just the path or database
synutils.eventstores.getPath("click_stream", env="development")
synutils.eventstores.getDatabase("click_stream", env="production")
 
# Lookup by database — env auto-detected from the database name
es = synutils.eventstores.get("click_stream_dev", lookupType="database")
 
# Configure defaults — then use bare properties
synutils.eventstores.configure(defaultName="click_stream", defaultEnv="development")
print(synutils.eventstores.path)
print(synutils.eventstores.database)
 
# Switch environment without changing the name
synutils.eventstores.configure(defaultEnv="production")
print(synutils.eventstores.path)

scala
// Scala — same surface, named arguments
val es = synutils.eventstores.get("click_stream", env = "production")
println(s"${es("path")} ${es("database")}")
 
synutils.eventstores.configure(defaultName = "click_stream", defaultEnv = "development")
println(synutils.eventstores.path)
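
To make the two lookupType modes concrete, here is a small simulation against an in-memory registry. Everything in it is made up for illustration (the REGISTRY contents, the lookup function, and the returned keys); it only mirrors the behavior described above: "name" uses env to pick the environment, while "database" ignores env and finds the environment that owns the database.

```python
# Made-up registry for illustration; real values come from the platform.
REGISTRY = {
    "click_stream": {
        "development": {"path": "gs://my-bucket/click_stream/dev", "database": "click_stream_dev"},
        "production": {"path": "gs://my-bucket/click_stream/prod", "database": "click_stream_prod"},
    }
}


def lookup(value, lookup_type="name", env="development"):
    """Mimic the two lookupType modes against the in-memory registry."""
    if lookup_type == "name":
        return {"name": value, "env": env, **REGISTRY[value][env]}
    if lookup_type == "database":
        # env is ignored: scan for the environment that owns this database
        for name, envs in REGISTRY.items():
            for env_name, entry in envs.items():
                if entry["database"] == value:
                    return {"name": name, "env": env_name, **entry}
        raise KeyError(value)
    raise ValueError(f"unsupported lookupType: {lookup_type!r}")


by_name = lookup("click_stream", env="production")
by_database = lookup("click_stream_dev", lookup_type="database")
```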

datasets

Method	Purpose
get(datasetName)	DataSet object with table name, partition columns, etc.
create(datasetName, fileFormat=PARQUET)	Register a new dataset.
list(eventStoreName)	All datasets under an event store (dev + prod combined). Filter by environment field if needed.

DataSet object methods: tableName(), getPartitionColumns(), getNonPartitionColumns(), isPartitioned().

python
# Inspect a dataset
ds = synutils.datasets.get("user_events")
print(ds.tableName(), ds.isPartitioned())
 
# Register a new dataset (default file format is PARQUET)
synutils.datasets.create("my_new_dataset")
synutils.datasets.create("my_avro_dataset", fileFormat="AVRO")
 
# List datasets under an event store, filter by environment
prod = [d for d in synutils.datasets.list("TestStore") if d["environment"] == "PRODUCTION"]

spark — dataset → DataFrame and write helpers

Spark-side helpers that wrap a dataset name into a DataFrame and handle the write back to an event store. These are the methods you'll reach for whenever a notebook needs to read or write a registered dataset rather than work against raw paths.

Method	Purpose
createDataFrame(datasetName, from_date=None, to_date=None)	Read a dataset into a DataFrame, optionally filtered by date range.
isTableExists(dataset)	True if the Hive table backing this dataset exists.
createTable(df, name, partitionedDateColumn="")	Create a Hive table from a DataFrame.
writeToEventStore(df, datasetName, …)	Write a DataFrame to an event store. Auto-creates the dataset if missing.
writeDatasetToEventStore(df, datasetName)	Convenience wrapper around writeToEventStore — uses the dataset's defaults.
writeFileToEventStore(localPath, eventstorePath)	Push a local file into the event store.

The most common pattern — read a dataset, transform, write back:

python
# Python
df = synutils.spark.createDataFrame("user_events")
df.show(5)
 
# Transform...
result = df.filter(df.country == "US")
 
# Write back as a registered dataset
synutils.spark.writeDatasetToEventStore(result, "user_events_us")

scala
// Scala
val df = synutils.spark.createDataFrame("user_events")
val result = df.filter(df("country") === "US")
synutils.spark.writeDatasetToEventStore(result, "user_events_us")

writeToEventStore — full signature

For writes that need more control than the convenience wrapper, use the full method:

Parameter	Default	Purpose
df	—	Source DataFrame to write.
datasetName	—	database.tablename format.
numPartitions	None / 0	If > 0 and the dataset is partitioned, adds DISTRIBUTE BY <partition_cols>, floor(rand()*numPartitions) to control output file count per partition.
partitionedDateColumn	None / ""	Override the dataset's configured partition column. If set, the dataset is updated before writing.
isOverwrite	True	True → INSERT OVERWRITE TABLE (replaces partition data); False → INSERT INTO TABLE (appends).
overrideProcessMode	True	When True, recreates the table even if it exists. Set False to preserve an existing table definition.
fileFormat	PARQUET	Used only when the dataset must be created (404 from the dataset API). Ignored if the dataset already exists. Supported: PARQUET, ORC, AVRO, DELTA, TEXTFILE.

python
# Append (don't overwrite existing partitions)
synutils.spark.writeToEventStore(df, "analytics.user_events", isOverwrite=False)
 
# Control output file count per partition
synutils.spark.writeToEventStore(df, "analytics.user_events", numPartitions=8)
 
# Auto-create as Avro if dataset doesn't exist yet
synutils.spark.writeToEventStore(df, "analytics.new_avro_dataset", fileFormat="AVRO")
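
The isOverwrite and numPartitions rows in the parameter table describe the statement the write issues. As a rough sketch of how those two parameters could shape it, the function below assembles an illustrative statement string. Only the INSERT OVERWRITE TABLE / INSERT INTO TABLE choice and the DISTRIBUTE BY expression come from the table; the function name, the PARTITION clause layout, and the names staged_rows and event_date are assumptions.

```python
def build_insert_sql(table, source_view, partition_cols=(), is_overwrite=True, num_partitions=0):
    """Assemble an illustrative Hive-style INSERT statement as a string."""
    verb = "INSERT OVERWRITE TABLE" if is_overwrite else "INSERT INTO TABLE"
    cols = ", ".join(partition_cols)
    sql = f"{verb} {table}"
    if partition_cols:
        sql += f" PARTITION ({cols})"
    sql += f" SELECT * FROM {source_view}"
    # DISTRIBUTE BY applies only to partitioned datasets with numPartitions > 0
    if partition_cols and num_partitions and num_partitions > 0:
        sql += f" DISTRIBUTE BY {cols}, floor(rand()*{num_partitions})"
    return sql


overwrite_sql = build_insert_sql(
    "analytics.user_events", "staged_rows", ("event_date",), num_partitions=8
)
append_sql = build_insert_sql("analytics.user_events", "staged_rows", is_overwrite=False)
print(overwrite_sql)
print(append_sql)
```

The random bucket in DISTRIBUTE BY spreads each partition's rows over at most numPartitions writers, which is what caps the number of output files per partition.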

isOverwrite is partition-level for partitioned tables

For partitioned tables, isOverwrite=True overwrites at the partition level — only the partitions present in the DataFrame are replaced. For non-partitioned tables, the entire table is overwritten.
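
The partition-level behavior can be simulated with plain dictionaries, where each key is a partition value and each value is that partition's rows. This is a model of the semantics described above, not platform code; the write function and the sample data are hypothetical.

```python
def write(table, incoming, is_overwrite=True):
    """Both arguments map a partition value to a list of rows."""
    result = {part: list(rows) for part, rows in table.items()}
    for part, rows in incoming.items():
        if is_overwrite:
            result[part] = list(rows)                 # replace this partition only
        else:
            result.setdefault(part, []).extend(rows)  # append to the partition
    return result


existing = {"2024-01-01": ["a", "b"], "2024-01-02": ["c"]}
incoming = {"2024-01-02": ["d"], "2024-01-03": ["e"]}

overwritten = write(existing, incoming, is_overwrite=True)
appended = write(existing, incoming, is_overwrite=False)
```

In the overwrite case the 2024-01-01 partition is untouched because the incoming data has no rows for it; only 2024-01-02 is replaced and 2024-01-03 is added.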

fs — direct cloud storage

Direct access to the underlying object store, with the same API across S3, GCS, Azure, and HDFS. The synutils.fs object auto-routes by the URI scheme of the path you pass in — no separate clients to instantiate.

Most calls fall into one of four shapes:

ls / listRecursive / exists — list and probe.
upload / download / uploadFolder / downloadFolder — move data between local disk and the object store.
copy / move / rename / delete / mkdir — manage what's there.
content / head / stream / writeText / uploadStream — read or write contents directly. Use content for small files, head for previews, and stream for large files you don't want fully in memory.

Method	Purpose
ls(path)	List entries (non-recursive).
listRecursive(path)	Recursive list.
exists(path)	True if file or folder-prefix exists.
upload(local, remote)	Upload single file.
download(remote, local)	Download single file.
uploadFolder(localDir, remoteDir)	Recursive upload.
downloadFolder(remoteDir, localDir)	Recursive download.
copy(src, dest)	Server-side copy.
move(src, dest)	Move (across buckets / containers permitted).
rename(old, new)	Rename in place — same bucket / container only.
delete(path)	Delete file or prefix.
mkdir(path)	Create directory marker.
content(path)	Read full text content.
head(path, maxBytes=65536)	Read first N bytes as text.
writeText(path, content)	Write a text file.
stream(path)	Lazy read stream for large files.
uploadStream(fileObj, path)	Upload from a file-like object.
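
The difference between a bounded preview and a lazy stream can be shown against an in-memory object. In this sketch io.BytesIO stands in for a remote file, and the local head and stream functions are illustrative analogs with assumed signatures, not the synutils.fs implementations.

```python
import io

# 500,000 bytes standing in for a large remote object
data = io.BytesIO(b"line\n" * 100_000)


def head(fileobj, max_bytes=65536):
    """Read at most max_bytes from the start, decoded as text."""
    fileobj.seek(0)
    return fileobj.read(max_bytes).decode("utf-8", errors="replace")


def stream(fileobj, chunk_size=8192):
    """Yield the object chunk by chunk so it is never fully in memory."""
    fileobj.seek(0)
    while chunk := fileobj.read(chunk_size):
        yield chunk


preview = head(data)
total = sum(len(chunk) for chunk in stream(data))
```

The preview is capped regardless of object size, while the generator visits every byte but holds only one chunk at a time.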

python
# Python
synutils.fs.upload("local.csv", "gs://my-bucket/remote.csv")
print(synutils.fs.exists("gs://my-bucket/remote.csv"))
print(synutils.fs.ls("gs://my-bucket/"))

scala
// Scala
synutils.fs.upload("local.csv", "gs://my-bucket/remote.csv")
println(synutils.fs.exists("gs://my-bucket/remote.csv"))
println(synutils.fs.ls("gs://my-bucket/"))
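
Routing by URI scheme can be pictured as a lookup on the parsed path. The sketch below is one way to do it with the standard library; the BACKENDS table and the backend_for helper are assumptions, and the platform's actual routing is internal.

```python
from urllib.parse import urlparse

# Assumed scheme-to-backend table for illustration.
BACKENDS = {
    "s3": "S3", "s3a": "S3",
    "gs": "GCS",
    "abfs": "Azure", "abfss": "Azure",
    "hdfs": "HDFS",
}


def backend_for(path: str) -> str:
    """Pick a storage backend from the path's URI scheme."""
    scheme = urlparse(path).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"no backend for scheme {scheme!r} in {path!r}")
    return BACKENDS[scheme]


print(backend_for("gs://my-bucket/remote.csv"))
```

A path without a scheme, such as local.csv, has no backend in this sketch, which is why the upload and download calls take the local and remote paths as separate arguments.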

Legacy aliases on fs (Python only)

Older code may call put, rm, mv, exist, is_exists, create_folder, list, upload_folder, download_folder, upload_stream. They still work — they delegate to the canonical methods above. Prefer the canonical names in new code.

files — uploaded file objects

On the platform, a file object (a Syntasa-platform concept, not Python's file type) pairs a base cloud-storage path with a list of files registered under it — the object's parameters. synutils.files resolves these registered objects to full cloud paths and — for DATA_FILE objects in supported formats — reads them directly into a Spark DataFrame so you don't have to wire the read up by hand.

Method	Purpose
get(name)	Full file object dict / Map from the API (cached per name).
getPath(name)	Full cloud paths for all files in this object — returns List[str].
getMetadata(name)	Curated subset of metadata with renamed keys.
createDataFrame(name, fileName, sep=None, header=True, inferSchema=True)	Read a registered file into a Spark DataFrame. DATA_FILE objects only; supported fileFormat: DELIMITED, JSON, PARQUET, ORC, AVRO.
clearCache()	Drop cached responses.

The sep, header, and inferSchema parameters apply to DELIMITED only and are ignored for the other formats. When sep is None (the default), it falls back to the delimiter configured on the file object, then to a comma. JSON, PARQUET, ORC, and AVRO use Spark's native readers.

python
# Inspect a file object
info = synutils.files.get("daily_report")
print(info["objectTypeKey"], info["fileFormat"])
 
# Get full cloud paths for every file in the object
paths = synutils.files.getPath("daily_report")
# ['gs://my-bucket/reports/sales.csv', 'gs://my-bucket/reports/orders.csv']
 
# Read one file as a DataFrame (uses the object's configured delimiter)
df = synutils.files.createDataFrame("daily_report", "sales.csv")
df.show(5)
 
# Override CSV options for this read only
df = synutils.files.createDataFrame(
    "daily_report", "sales.tsv",
    sep="\t", header=False, inferSchema=False,
)
 
# JSON / PARQUET / ORC / AVRO — sep / header / inferSchema are ignored
events = synutils.files.createDataFrame("event_dump", "events.json")
sales  = synutils.files.createDataFrame("sales_dump", "2024-01.parquet")
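
The delimiter fallback for DELIMITED files (explicit sep, then the file object's configured delimiter, then a comma) amounts to a short precedence chain. A minimal sketch, assuming the configured delimiter lives under a key named "delimiter" (a hypothetical key name; inspect the result of synutils.files.get(name) for the real one):

```python
def resolve_sep(sep, file_object):
    """Explicit sep wins, then the object's configured delimiter, then a comma."""
    if sep is not None:
        return sep
    # "delimiter" is a hypothetical key name for the configured delimiter
    return file_object.get("delimiter") or ","


explicit = resolve_sep("\t", {"delimiter": "|"})
configured = resolve_sep(None, {"delimiter": "|"})
fallback = resolve_sep(None, {})
```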

getPath() returns all files; createDataFrame() reads exactly one

getPath() gives you every cloud path under a file object — useful when you want Spark to read everything in one go via spark.read.csv(synutils.files.getPath("daily_report")). createDataFrame() reads exactly one file at a time, identified by fileName.
