Starting with version 8.0, Syntasa introduced the option to run Spark Processors inside a Conda environment. Conda is a package and environment manager that allows you to create isolated environments with specific Python versions and dependencies. This feature provides greater flexibility and consistency compared to traditional execution, where all Spark processors had to rely on the default Python version or the version defined in the runtime settings.
With Conda, every Spark processor in a pipeline can run independently with its own Python version and libraries—while still sharing the same runtime. This ensures reproducibility, avoids version conflicts, and makes it easier to manage complex data pipelines.
Key Features of the Conda Environment for Spark Processors
- Python Version Selection – You can choose a specific Python version from the dropdown in the Spark Processor UI. This means each processor can run with its own Python version, independent of the runtime, allowing mixed environments within a single pipeline.
- Library UI Interface – Introduced in version 8.2.0, the Libraries section in the Spark Processor UI allows you to add Python libraries directly. These libraries are automatically installed inside the Conda environment before execution. You can specify single or multiple libraries, pin versions, or even install from cloud storage and Git repositories—all without modifying your Spark code (see the sketch after this list). To learn more, see the article Handling Python Libraries in Spark Processor.
- Custom Conda Script – Under the Customize option, you can provide a custom Conda script for additional setup, configurations, or package installations. This script is executed inside the Conda environment, after libraries are installed but before your Spark code runs.
- Separate Environments per Processor – Unlike traditional execution, where all processors share the same Python version and libraries, each Spark Processor now gets its own isolated Conda environment. This ensures consistent results, avoids dependency conflicts, and allows multiple processors in a pipeline to run with different Python versions or libraries on the same cluster runtime.
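As a small illustration of the Libraries section described above, the sketch below assumes pandas==2.3.0 and scikit-learn==1.5.0 were added in the Libraries UI (the pins are examples only). The processor code then simply imports them; no install commands are needed in the Spark code itself.

```python
# Minimal sketch: libraries declared in the Libraries section of the Spark
# Processor UI (assumed here: pandas==2.3.0, scikit-learn==1.5.0) are installed
# into the Conda environment before this code runs, so they can simply be imported.
import sys

import pandas as pd
import sklearn

print("Python:", sys.version.split()[0])   # the version selected in the processor UI
print("pandas:", pd.__version__)           # the version pinned in the Libraries section
print("scikit-learn:", sklearn.__version__)
```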
How a Spark Processor Works with a Conda Environment
When you execute a Spark Processor with Conda enabled, the platform follows a structured sequence to ensure your job runs in an isolated and consistent environment across the cluster:
- Cluster Initialization – At this stage, the system sets up the Spark runtime environment.
  - Cluster Global Script – Runs once when the cluster starts. This script is configured under Admin Center → Infrastructure and applies globally to all runtimes.
  - Cluster Runtime Script – If a custom runtime script is defined for the specific runtime, it is executed here. This is useful for environment-wide setup, such as system configurations or installing OS-level dependencies.
- Spark Processor Execution – Once the cluster is ready, the Spark Processor goes through the following sequence:
  - Conda Environment Setup – The platform downloads and installs the Conda environment for the selected Python version (online or offline, depending on configuration).
  - Library Installation via UI – Any Python libraries defined in the Libraries section of the Spark Processor are installed into this Conda environment.
  - Custom Conda Script Execution – If provided, the custom Conda script is executed to apply additional setup, configurations, or package installations specific to the Spark Processor.
  - Spark Processor Code Execution – Finally, the Spark Processor code runs within the prepared Conda environment (see the verification sketch after this list). Spark automatically distributes the environment files and installed dependencies from the driver to all executor nodes, ensuring consistent execution across the cluster.
- Cluster Termination – After the Spark Processor completes execution, the cluster shuts down (unless otherwise configured). Once the cluster is shut down, the Conda environment and its cached contents are destroyed.
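If you want to confirm which environment your code actually runs in, a quick debugging sketch like the one below can be added to a Spark Processor. It only inspects the interpreter; getOrCreate() is used as a generic placeholder for however the processor obtains its Spark session.

```python
# Optional debugging sketch: print which interpreter the driver uses and confirm
# that an executor reports the same Python version, since the prepared Conda
# environment is distributed across the cluster.
import sys

from pyspark.sql import SparkSession

# Generic placeholder; a Syntasa processor may already expose an active session.
spark = SparkSession.builder.getOrCreate()

print("Driver Python:", sys.version.split()[0], "at", sys.executable)

# Run a trivial task on one executor and report its Python version.
executor_python = (
    spark.sparkContext
    .parallelize([0], numSlices=1)
    .map(lambda _: __import__("sys").version.split()[0])
    .collect()[0]
)
print("Executor Python:", executor_python)
```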
How Spark Uses Conda Across Cluster Nodes
When you run a Spark Processor with Conda enabled, your job doesn’t just execute on a single machine. Spark distributes the workload across a driver node and multiple executor nodes. To ensure consistency, Spark automatically ships the Conda environment from the driver to all executors.
Here’s what happens:
- The Conda environment (with the selected Python version and installed libraries) is first created on the driver node.
- Spark then packages the necessary files and dependencies from this environment.
- These packages are distributed to all executor nodes, ensuring they run with the exact same Python setup.
- As a result, your code behaves the same on every node in the cluster, avoiding version mismatches or missing dependencies.
Example:
If you install pandas==2.3.0 and scikit-learn==1.5.0 in the Conda environment, Spark makes sure these libraries are available on all executors, so data transformations using Pandas or ML models with scikit-learn will run seamlessly across the cluster.
This mechanism removes the need to manually configure every cluster node and ensures a consistent, isolated environment for your Spark workloads.
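For readers familiar with open-source PySpark, the sketch below shows roughly how this kind of distribution is commonly done by hand, using conda-pack and Spark's spark.archives setting. Syntasa performs the equivalent packaging and shipping automatically when Conda is enabled, so this is purely illustrative: the archive name is hypothetical, and pandas UDFs additionally require pyarrow in the environment.

```python
# Illustrative only: how a packed Conda environment is typically shipped to
# executors in plain PySpark. Syntasa automates these steps; the archive
# "conda_env.tar.gz" (created beforehand with `conda pack`) is hypothetical.
import os

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

# Point the workers at the Python interpreter inside the unpacked archive.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    # The "#environment" suffix is the directory the archive is unpacked into
    # on the driver and on every executor node.
    .config("spark.archives", "conda_env.tar.gz#environment")
    .getOrCreate()
)

@pandas_udf("double")
def doubled(v: pd.Series) -> pd.Series:
    # pandas is importable here because the same environment (with pandas
    # installed) was unpacked on each executor.
    return v * 2.0

spark.range(5).withColumn("twice", doubled("id")).show()
```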
How Conda Works with Multiple Spark Processors in a Job
When a job includes multiple Spark Processors, the way Conda environments are handled depends on whether the cluster is still active, the runtime being used, and the Python versions configured.
- Cluster Restart: If the cluster has been terminated and a Spark Processor is executed again, a fresh Conda environment (with Python version, libraries, and custom Conda script) will be set up from scratch.
- Cluster Reuse: If the cluster is still running and the same Spark Processor is executed again, the same Conda environment is reused (since it is cached in memory until the cluster shuts down).
- Multiple Processors with Same Runtime & Python Version: Conda setup is only performed once. Subsequent processors using the same runtime and Python version will reuse the existing Conda environment, but their own libraries and custom Conda scripts will still be executed.
- Multiple Processors with Different Python Versions or Runtimes: Each unique combination of runtime and Python version requires its own Conda setup.
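Conceptually, these reuse rules behave like a cache keyed by runtime and Python version that lives only as long as the cluster. The sketch below is a simplified illustration with hypothetical names and library choices, not actual platform code.

```python
# Conceptual sketch only (hypothetical names): Conda environments are reused per
# (runtime, Python version) while the cluster is alive, but libraries and the
# custom Conda script still run for every processor.
conda_env_cache = {}  # emptied when the cluster terminates


def run_spark_processor(runtime, python_version, libraries, conda_script):
    key = (runtime, python_version)
    if key not in conda_env_cache:
        print(f"Setting up Conda with Python {python_version} on {runtime}")
        conda_env_cache[key] = f"conda-env-{runtime}-py{python_version}"
    else:
        print(f"Reusing cached Conda environment for {key}")

    # These steps always run, even when the environment itself is reused.
    print(f"Installing {libraries} into {conda_env_cache[key]}")
    print(f"Running custom Conda script: {conda_script}")
    print("Executing Spark Processor code\n")


# Mirrors the walkthrough that follows: A and C share Runtime A / Python 3.9.
run_spark_processor("Runtime A", "3.9", ["pandas==2.3.0"], "script_A")
run_spark_processor("Runtime B", "3.10", ["numpy"], "script_B")
run_spark_processor("Runtime A", "3.9", ["scikit-learn==1.5.0"], "script_C")
```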
Example Walkthrough
Suppose a job includes four Spark Processors:
- Spark Processor A → Runtime A, Python 3.9
  - Cluster A created → Conda 3.9 set up → Libraries of A installed → Custom Conda script of A executed → Spark code runs.
- Spark Processor B → Runtime B, Python 3.10
  - New Cluster B created → Conda 3.10 set up → Libraries of B installed → Custom Conda script of B executed → Spark code runs.
- Spark Processor C → Runtime A, Python 3.9
  - Cluster A already running → Conda 3.9 already set up → Only libraries of C installed → Custom Conda script of C executed → Spark code runs.
- Spark Processor D → Runtime C, Python 3.9
  - New Cluster C created → Conda 3.9 set up (separate from Cluster A, because the runtime is different) → Libraries of D installed → Custom Conda script of D executed → Spark code runs.
The example above applies only when the Terminate on Completion toggle is turned off for the runtimes. This ensures that runtimes are not shut down after completing a job step, allowing the Conda environment to remain cached in memory for the next Spark Processor execution.
If the cluster is stopped or terminated at any point, the Conda environment is destroyed. In such cases, it must be reinstalled during the next job step—even if the processors share the same runtime and Python version.
Key Takeaways
- Conda setup happens once per runtime–Python version combination per active cluster.
- Libraries and custom Conda scripts are always executed per processor, even if Conda setup is reused.
- Conda environments are cached only while the cluster is alive; once terminated, the environment is destroyed and must be reinstalled for the next run.
Offline Installation of Conda
By default, Conda environments are downloaded from the web during execution. If your environment does not have internet access, you can configure the runtime to install Conda offline using Syntasa’s pre-packaged files:
syntasa.conda.offline.install=true
This setting ensures the Conda environment is installed from local packages instead of fetching them online.
Disabling Conda Environment
If you do not want to use the Conda environment, you can disable it by adding this configuration in the runtime:
syntasa.override.conda.env=true
When Conda is disabled:
- The Library UI section is ignored.
- The selected Python version in the UI is ignored.
- The Spark processor executes using the default Python version or the version set in the runtime configuration.