Syntasa Notebook Processes allow you to embed Jupyter notebooks directly into application workflows, enabling advanced analytics, custom transformations, and data science logic using Python or Scala—all while remaining fully schedulable and production-ready.
This article provides a step-by-step guide to creating an empty notebook, querying a Hive table using Spark SQL, and returning the results as a dataset that can be consumed by downstream workflow nodes.
Overview
A Notebook Process runs a Jupyter notebook as part of a Syntasa job using a Spark-compatible runtime. At execution time, Syntasa injects system parameters and manages output paths automatically, allowing notebooks to function as first-class, reusable workflow components.
Common use cases include:
Querying Hive tables
Performing feature engineering
Running custom aggregations
Returning structured datasets to downstream processes
Prerequisites
Before you begin, ensure the following requirements are met:
Notebooks Component Enabled
The Notebooks module must be enabled in your Syntasa environment.
Workspace Access
You must have access to a Notebook Workspace (for example, Default Workspace).
Compute Runtime Available
A Spark-compatible runtime must be configured, such as Kubernetes-based Spark.
Step 1: Create a New Notebook
Start by creating an empty notebook in your workspace.
Navigate to the Notebooks module from the main Syntasa sidebar.
Select your target Workspace.
Click + CREATE.
In the creation form, provide the following details:
Notebook Name: A unique name (for example, hive_query_tutorial)
Language: Select your preferred language (for example, PYTHON)
Workspace: Confirm the correct workspace is selected
Connect Notebook to Syntasa Runtime:
Toggle ON to attach a compute cluster
Select a Runtime Template and Instance
Click SAVE.
The notebook is now created and visible in the selected workspace.
Step 2: Add the Notebook Process to a Workflow
Next, integrate the notebook into an application workflow.
Open the Composer and select your application.
From the left-hand palette, locate the Notebook node under the Process section.
Drag and drop the Notebook node onto the canvas.
Click the node to open the configuration panel.
In the General tab:
Process Name: Enter a descriptive name
Select Workspace: Choose the workspace containing your notebook
Select Notebook: Select hive_query_tutorial
Click SAVE.
The notebook is now part of your workflow and ready for configuration.
Step 3: Write the Hive Query
At runtime, Syntasa injects several system parameters directly into the notebook environment, including:
database – The event store (Hive) database name
fromDate – Start of the execution date range
toDate – End of the execution date range
location – Output path for writing results
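Before writing any queries, it can be helpful to confirm what was injected into your session. The cell below is a minimal sanity check that simply prints the parameters listed above; the variable names come from that list, and the output formatting is entirely up to you.
# Quick sanity check: print the parameters Syntasa injects into the notebook session.
# The names (database, fromDate, toDate, location) match the list above.
print(f"database = {database}")
print(f"fromDate = {fromDate}")
print(f"toDate   = {toDate}")
print(f"location = {location}")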
Launch the Notebook
In the Notebook node configuration panel, click LAUNCH NOTEBOOK.
This opens the Jupyter interface in a new browser tab.
Example: Querying a Hive Table (Python)
In a new code cell, use the Spark session to query a Hive table:
# Syntasa injects 'database', 'fromDate', and 'toDate' as global variables
# The Spark session is available as 'spark'
query = f"""
SELECT
user_id,
event_type,
event_time
FROM {database}.events
WHERE event_time BETWEEN '{fromDate}' AND '{toDate}'
"""
df = spark.sql(query)
# Display a preview of the results
df.show(10)
This query:
Uses the injected database variable to reference the correct Hive schema
Filters records using the injected execution date range
Returns a Spark DataFrame that can be further transformed or written out
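Because the result is an ordinary Spark DataFrame, you can apply further transformations before writing it out. The cell below is a small illustrative sketch of a custom aggregation; the column names (user_id, event_type) are taken from the example query above, so adjust them to match your own table.
# Optional transformation before writing out: count events per user and event type.
# This is only an example of "further transformation"; it is not required.
from pyspark.sql import functions as F

event_counts = (
    df.groupBy("user_id", "event_type")
      .agg(F.count("*").alias("event_count"))
)
event_counts.show(10)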
Step 4: Return the Results as a Dataset
To make the query results available to downstream workflow nodes, write the DataFrame to the injected output location.
# Write the result dataset to the output location provided by Syntasa
df.write.mode("overwrite").parquet(location)When the Notebook Process completes:
The written dataset becomes the formal output of the process
Downstream nodes can consume this data just like outputs from any other Syntasa process
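If you want to confirm inside the notebook that the dataset was written as expected, you can read it back from the same injected location. This is an optional check, not something Syntasa requires:
# Optional verification: read the written dataset back from the output location
# and inspect the row count and schema.
written_df = spark.read.parquet(location)
print(written_df.count())
written_df.printSchema()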
Execution and Validation
Run the application job from the Composer
Once the job completes successfully, the Notebook Process output node is marked as complete
You can inspect the rendered notebook output and logs directly from the Output node in the workflow
Summary
Querying Hive tables using a Notebook Process in Syntasa provides a powerful and flexible way to combine Spark SQL, Jupyter-based development, and production-grade orchestration. By leveraging injected runtime parameters and managed output paths, notebooks can seamlessly transition from exploratory analysis to fully operational workflow components.