Syntasa Notebook Processes allow you to embed Jupyter notebooks directly into application workflows, enabling advanced analytics, custom transformations, and data science logic using Python or Scala—all while remaining fully schedulable and production-ready.
This article provides a step-by-step guide to creating an empty notebook, querying a Hive table using Spark SQL, and returning the results as a dataset that can be consumed by downstream workflow nodes.
Overview
A Notebook Process runs a Jupyter notebook as part of a Syntasa job using a Spark-compatible runtime. At execution time, Syntasa injects system parameters and manages output paths automatically, allowing notebooks to function as first-class, reusable workflow components.
Common use cases include:
Querying Hive tables
Performing feature engineering
Running custom aggregations
Returning structured datasets to downstream processes
Prerequisites
Before you begin, ensure the following requirements are met:
Notebooks Component Enabled
The Notebooks module must be enabled in your Syntasa environment.
Workspace Access
You must have access to a Notebook Workspace (for example, Default Workspace).
Compute Runtime Available
A Spark-compatible runtime must be configured, such as Kubernetes-based Spark.
Step 1: Create a New Notebook
Start by creating an empty notebook in your workspace.
Navigate to the Notebooks module from the main Syntasa sidebar.
Select your target Workspace.
Click + CREATE.
In the creation form, provide the following details:
Notebook Name: A unique name (for example, hive_query_tutorial)
Language: Select your preferred language (for example, PYTHON)
Workspace: Confirm the correct workspace is selected
Connect Notebook to Syntasa Runtime:
Toggle ON to attach a compute cluster
Select a Runtime Template and Instance
Click SAVE.
The notebook is now created and visible in the selected workspace.
Step 2: Add the Notebook Process to a Workflow
Next, integrate the notebook into an application workflow.
Open the Composer and select your application.
From the left-hand palette, locate the Notebook node under the Process section.
Drag and drop the Notebook node onto the canvas.
Click the node to open the configuration panel.
In the General tab:
Process Name: Enter a descriptive name
Select Workspace: Choose the workspace containing your notebook
Select Notebook: Select hive_query_tutorial
Click SAVE.
The notebook is now part of your workflow and ready for configuration.
Step 3: Write the Hive Query
At runtime, Syntasa injects several system parameters directly into the notebook environment, including:
database – The event store (Hive) database name
fromDate – Start of the execution date range
toDate – End of the execution date range
location – Output path for writing results
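Before writing any queries, it can be helpful to confirm what was injected into your session. The cell below is a minimal sanity check that simply prints the parameters listed above; the variable names come from that list, and the output formatting is entirely up to you.
# Quick sanity check: print the parameters Syntasa injects into the notebook session.
# The names (database, fromDate, toDate, location) match the list above.
print(f"database = {database}")
print(f"fromDate = {fromDate}")
print(f"toDate   = {toDate}")
print(f"location = {location}")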
Launch the Notebook
In the Notebook node configuration panel, click LAUNCH NOTEBOOK.
This opens the Jupyter interface in a new browser tab.
Example: Querying a Hive Table (Python)
In a new code cell, use the Spark session to query a Hive table:
# Syntasa injects 'database', 'fromDate', and 'toDate' as global variables
# The Spark session is available as 'spark'
query = f"""
SELECT
user_id,
event_type,
event_time
FROM {database}.events
WHERE event_time BETWEEN '{fromDate}' AND '{toDate}'
"""
df = spark.sql(query)
# Display a preview of the results
df.show(10)
This query:
Uses the injected database variable to reference the correct Hive schema
Filters records using the injected execution date range
Returns a Spark DataFrame that can be further transformed or written out
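Because the result is an ordinary Spark DataFrame, you can apply further transformations before writing it out. The cell below is a small illustrative sketch of a custom aggregation; the column names (user_id, event_type) are taken from the example query above, so adjust them to match your own table.
# Optional transformation before writing out: count events per user and event type.
# This is only an example of "further transformation"; it is not required.
from pyspark.sql import functions as F

event_counts = (
    df.groupBy("user_id", "event_type")
      .agg(F.count("*").alias("event_count"))
)
event_counts.show(10)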
Step 4: Return the Results as a Dataset
To make the query results available to downstream workflow nodes, write the DataFrame to the injected output location.
# Write the result dataset to the output location provided by Syntasa
df.write.mode("overwrite").parquet(location)When the Notebook Process completes:
The written dataset becomes the formal output of the process
Downstream nodes can consume this data just like outputs from any other Syntasa process
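If you want to confirm inside the notebook that the dataset was written as expected, you can read it back from the same injected location. This is an optional check, not something Syntasa requires:
# Optional verification: read the written dataset back from the output location
# and inspect the row count and schema.
written_df = spark.read.parquet(location)
print(written_df.count())
written_df.printSchema()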
Execution and Validation
Run the application job from the Composer
Once the job completes successfully, the Notebook Process output node is marked as complete
You can inspect the rendered notebook output and logs directly from the Output node in the workflow
Summary
Querying Hive tables using a Notebook Process in Syntasa provides a powerful and flexible way to combine Spark SQL, Jupyter-based development, and production-grade orchestration. By leveraging injected runtime parameters and managed output paths, notebooks can seamlessly transition from exploratory analysis to fully operational workflow components.