The Spark Processor in Syntasa provides a flexible interface for performing data transformations in different programming languages and for managing output datasets effectively. It has two key screens, Code and Output, which together form the core of Spark-based data processing within an application workflow.
Code Screen
The Code screen is where users write, upload, and manage their Spark code. It supports multiple programming languages and SQL, offering flexibility for data engineers and analysts to choose their preferred language for data transformation.
Common Features Across All Languages
Regardless of which language you use, the Code screen provides several common functionalities:
- Parameters: You can reference parameters such as @InputTable1, @OutputTable1, and custom variables directly in your code. These parameters dynamically link to datasets or configuration values defined within the processor (see the sketch after this list).
- Upload Code: Allows you to upload a code file (for example, .py, .scala, or .r file). The uploaded code replaces the existing code in the editor.
- Download Code: You can download the existing code as a text file by clicking the Download button. This is helpful for maintaining versioned backups or sharing scripts.
- Copy to Clipboard: Enables quick copying of the entire code to the clipboard with one click.
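As a quick illustration of how these parameters appear in code, here is a minimal Python sketch. It assumes a SparkSession is available as spark, that the processor resolves @InputTable1 and @OutputTable1 to the connected datasets before execution, and that writeToEventStore() (the Syntasa utility shown later in this article) is available; the column name event_date is purely illustrative.

# Minimal sketch: @InputTable1 resolves to the input dataset connected to the processor.
df = spark.sql("SELECT * FROM @InputTable1")

# Illustrative transformation; event_date is a hypothetical column.
clean_df = df.filter("event_date IS NOT NULL")

# Write the result to the first configured output dataset.
writeToEventStore(clean_df, "@OutputTable1")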
Each programming language has its own specialized interface and options as described below:
Python Interface
The Python interface allows you to run PySpark code inside a Conda environment, which provides an isolated runtime for Python libraries and dependencies.
Key Features:
- Python Version Selection: You can choose a specific Python version (e.g., 3.9, 3.10, 3.11) for your Spark Processor. Each processor can use a different Python version, independent of the runtime.
- Custom Conda Script: The interface supports a custom Conda setup script that runs before the Spark code executes. You can use it to install additional packages, set environment variables, or perform pre-execution setup tasks. (Refer to the article Understanding Conda Environment for details.)
- Library UI Interface: Provides a graphical interface to install and manage Python libraries under the Conda environment. You can learn more about Python library handling in the article Handling Python Libraries in Spark Processor.
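For a quick sanity check of the Conda environment, a sketch like the one below can be run inside the processor. It assumes the interpreter is the one selected for the processor and that requests (used here only as a stand-in for any third-party package) was installed through the Library UI or the custom Conda setup script.

import sys

# The interpreter version should match the Python version selected for this processor.
print(sys.version_info)

# 'requests' is only an example; any package installed via the Library UI or the
# custom Conda setup script can be imported the same way.
import requests
print(requests.__version__)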
Scala Interface
The Scala interface is designed for developers who prefer to write Spark transformations in Scala, the native language of Apache Spark.
Key Features:
- Code Validation: Before execution, you can validate the Scala code to catch compile-time errors. This feature helps prevent job failures due to syntax or type issues.
- Library Management via Runtime: Scala library management is handled through the runtime configuration. You can install Scala libraries either via Maven or by uploading JAR files to the /deps/jars directory. (For more details, see the article How to Install Scala Libraries in Spark Processor.)
R Interface
The R interface enables users to write and execute R code within the Spark environment.
Key Features:
- Supports running R-based Spark transformations and statistical computations.
- Allows integration with SparkR and data frame manipulation using familiar R syntax.
SQL Interface
The SQL interface is designed for users who prefer a declarative approach to data manipulation.
Key Features:
- You can write Spark SQL queries directly in the editor.
- Queries can reference input datasets (e.g., @InputTable1) and generate result DataFrames that can be further transformed or written to output Event Stores.
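In the SQL editor you enter only the query text; the PySpark sketch below expresses the same flow for illustration. The column names customer_id and amount are hypothetical, and writeToEventStore() is the Syntasa utility shown later in this article.

# Hedged sketch: aggregate the input dataset and persist the result.
orders_by_customer = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM @InputTable1
    GROUP BY customer_id
""")

writeToEventStore(orders_by_customer, "@OutputTable1")

The query string is the same text you would place in the SQL editor; only the surrounding Python differs.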
Output Screen
The Output screen defines how and where your transformed data will be written after Spark execution. It provides complete flexibility to define output datasets, partitioning, and file formats.
Key Features:
- Dataset Configuration: You can rename the output dataset and specify a display name that appears in your application workflow.
- Partition Type: You can choose whether your output should be partitioned by Daily, Hourly, or not partitioned (None). Partitioning helps optimize query performance and data organization.
- File Format Selection: The processor supports multiple output formats such as Avro, ORC, Parquet, TextFile, and Delta. The format you choose determines how the data is stored in the Event Store.
- Cloud Integration:
  - On GCP, you can enable the Load to BigQuery toggle to automatically push your output data into a BigQuery table.
  - On AWS, you can enable Load to Redshift for automatic loading into Amazon Redshift.
Multiple Output Datasets
The Spark Processor allows you to configure and write to multiple output datasets within the same processor. This is particularly useful when you want to split a single source dataset into multiple logical tables.
For example, if your input dataset contains raw customer data, you can create three outputs:
- @OutputTable1 → Personal Details
- @OutputTable2 → Payment Details
- @OutputTable3 → Order History

Using the writeToEventStore() utility provided by Syntasa, you can easily write different transformed DataFrames to each output dataset:

writeToEventStore(df_personal, "@OutputTable1")
writeToEventStore(df_payment, "@OutputTable2")
writeToEventStore(df_orders, "@OutputTable3")
This setup enables a clean, modular, and scalable approach to handling complex transformations within a single Spark Processor.