Python libraries are the building blocks that enable data scientists and engineers to perform complex operations efficiently. For example, a data scientist working on an e-commerce application may rely on Pandas for customer purchase analysis, Scikit-learn for building recommendation models, or the Google Cloud Storage SDK to pull customer event logs directly from cloud storage. Without these libraries, building data pipelines or machine learning workflows in Spark Processors would be cumbersome and time-consuming. Hence, proper library management is a critical part of working in the Syntasa environment.
A Spark Processor comes with key Python standard libraries (os, sys, json, re, math, datetime) along with pyspark preinstalled, so you can start writing and running Spark code in Python without extra setup. For more advanced needs—like using pandas, numpy, scikit-learn, or cloud SDKs (google-cloud-storage, boto3)—you’ll need to install them separately.
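As a quick illustration, the sketch below uses only preinstalled modules (json, datetime, and pyspark). Creating the SparkSession explicitly is an assumption made for this example; the processor environment may already provide a session for you.

```python
# Illustrative sketch: only preinstalled modules (json, datetime, pyspark) are used.
# Creating the SparkSession explicitly is an assumption for this example; the
# processor environment may already provide a session for you.
import json
from datetime import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parse a raw JSON event with the standard library, then load it into a DataFrame.
raw_event = '{"user": "u1", "event": "purchase"}'
record = json.loads(raw_event)

df = spark.createDataFrame(
    [(record["user"], record["event"], datetime.now())],
    ["user", "event", "loaded_at"],
)
df.show()
```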
Traditional Ways of Installing Libraries (Not Recommended)
There are two traditional methods for installing Python libraries in Spark Processors, neither of which is recommended:
- Using pip commands directly in code: Developers often wrote installation commands such as os.system("pip install pandas") inside their Spark Processor code. While this worked, the installation ran during each job execution, making it inefficient and inconsistent. Since the library was tied to a processor step rather than the underlying cluster, it slowed down runs and introduced reliability issues (a sketch of this pattern is shown below).
- Configuring libraries in runtime setup: Another approach was to configure the Spark runtime with a list of dependencies. While this ensured that libraries were available at cluster startup, it came with its own drawback: every Spark job on that runtime would install the libraries, even those that didn't need them. This made environments heavier than necessary and could lead to conflicts between unrelated processes.

Both methods are functional but not optimal: they either waste runtime resources or force unnecessary dependencies into every Spark job.
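For reference, the first (discouraged) pattern typically looked something like the sketch below. It is shown only to illustrate why per-run installs are problematic, not as a recommended approach; pandas stands in for any dependency.

```python
# Discouraged pattern: installing a dependency from inside the processor code.
# The install repeats on every job run and is tied to a single processor step,
# which is why it is slower and less reliable than the Libraries UI.
import os
import sys

os.system(f"{sys.executable} -m pip install pandas")  # runs on every execution

import pandas as pd  # only usable if the install above succeeded
print(pd.__version__)
```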
UI-Based Library Installation (Recommended Approach)
Starting from Syntasa 8.2.0, a much cleaner solution was introduced: the Libraries section in the Spark Processor configuration UI. This approach eliminates the inefficiencies of the traditional methods and gives you fine-grained control over your dependencies.
With this feature, you can:
- Directly add package names with versions (e.g., pandas==2.3.0, google-cloud-storage==2.14.0).
- Include extra dependencies (e.g., cloudpathlib[s3,gs,azure]).
- Install libraries from cloud storage paths (e.g., S3/GCS zip or wheel files).
- Install directly from GitHub repositories for bleeding-edge or unreleased versions.
These libraries are installed automatically when the runtime initializes for the Spark Processor. This ensures they are available cluster-wide during execution without being forced onto other jobs that don't need them.
For example, if you’re building a recommendation engine that needs scikit-learn and pandas, you can simply add those packages in the UI for that Spark Processor. Another Spark Processor in your pipeline that only needs boto3 for cloud data ingestion can install just that—keeping environments lean, clean, and efficient.
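To make this concrete, the sketch below shows what the recommendation-engine processor's code might look like once pandas and scikit-learn have been added through the Libraries section; the column names and data are illustrative only.

```python
# Assumes pandas and scikit-learn were added in this processor's Libraries section.
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy purchase data standing in for real customer events.
purchases = pd.DataFrame(
    {"user_id": [1, 1, 2, 3], "product_id": [10, 20, 10, 30], "qty": [1, 2, 1, 5]}
)

# Build a user x product matrix and find similar users with a nearest-neighbour model.
matrix = purchases.pivot_table(
    index="user_id", columns="product_id", values="qty", fill_value=0
)
model = NearestNeighbors(n_neighbors=2).fit(matrix.values)
_, similar_users = model.kneighbors(matrix.values)
print(similar_users)
```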
Different Ways of Installing Libraries
The Libraries section of the Spark Processor UI (introduced in version 8.2.0) provides flexibility in how you define and install Python packages. You can add single or multiple libraries, specify versions, or let Syntasa automatically pick the latest compatible version for your selected Python runtime.
Installing a Single Library
As shown in the screenshot below, a single library such as pandas can be added in several ways:
- Library name with version in separate fields: Enter the library name in the Library Name field (e.g., pandas) and specify the version in the Version field (e.g., 2.3.0). This approach makes it easy to test or switch between different versions.
- Library name with inline version: Enter the library and version together in the Library Name field (e.g., pandas==2.3.0) and leave the Version field blank. This achieves the same result as above and installs the specified version.
- Library name only: Enter just the library name (e.g., pandas) in the Library Name field and leave the Version field blank. This installs the latest version available for the selected Python runtime, which is useful if you always want to work with the most up-to-date release.
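Whichever of these options you choose, a quick way to confirm the result is to check the installed version from within the processor code, for example:

```python
# Verify that the version requested in the Libraries section is the one installed.
import pandas as pd

print(pd.__version__)  # prints "2.3.0" if the version was pinned as in the example above
```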
Installing Multiple Libraries
As shown in the screenshot below, you can install multiple libraries (e.g., pandas and tensorflow) in different formats:
- Installing Multiple Libraries with Space Separator: You can install more than one library at a time by typing them in the Library Name/Path field with a space between each, for example: pandas==2.2 tensorflow==2.20. This installs pandas version 2.2 and tensorflow version 2.20 together, similar to how you would specify multiple packages in a pip command.
- Installing Multiple Libraries with Double Colon (::) Separator: Syntasa also supports :: as a separator for multiple libraries, for example: pandas==2.2::tensorflow==2.20. This works the same way as using spaces but provides a cleaner, unambiguous format when handling multiple libraries in a single entry. It's particularly helpful in configuration exports/imports or when avoiding issues with whitespace parsing.
- Installing Multiple Libraries Without Version Numbers: If you don't specify a version, the processor installs the latest available version of each listed library, for example: pandas::tensorflow. This installs the newest versions of pandas and tensorflow, keeping your environment up to date.
Installing a Python Library from Cloud Storage
Python libraries are not always installed directly from the internet; they can also be packaged and stored in your cloud environment for secure and offline installation. These libraries may come in different formats such as .zip, .tar.gz, or .whl (Wheel) files. Using Syntasa’s library management feature, you can simply provide the cloud storage path of the library file, and the system will automatically fetch and install it when the Spark Processor job starts.
A few key points to keep in mind:
- Cloud Storage Provider: You must use the storage service corresponding to your environment:
  - GCP → use Google Cloud Storage (GCS) paths (gs://...)
  - AWS → use Amazon S3 paths (s3://...)
  - Azure → use Azure Blob Storage paths (wasbs://... or abfs://...)
- Access Permissions: When a Spark Processor runs, it uses a service account to execute jobs. You can only reference buckets or containers where this service account has the appropriate access permissions.
- Practical Usage: For example, if you have a wheel file (pandas-2.2.2-py3-none-any.whl) or a zipped Python module (simple_module.zip) stored in your project bucket, you just need to provide the full cloud path (e.g., gs://my-bucket/libs/pandas-2.2.2-py3-none-any.whl). Syntasa will handle the rest. In the screenshot above, two libraries are being installed from Google Cloud Storage:
  - A wheel file (pandas-2.2.2-py3-none-any.whl)
  - A zip module (simple_module.zip)
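Once the job starts and the files above are installed, they behave like any other installed package. The short check below assumes the zip file exposes an importable module named simple_module (matching the file name); adjust the import to whatever the package actually provides.

```python
# Sanity check after installing from cloud storage.
import pandas as pd
import simple_module  # hypothetical module packaged inside simple_module.zip

print(pd.__version__)          # expected to match the wheel, e.g. 2.2.2
print(simple_module.__file__)  # shows where the module was installed
```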
Installing a Python Library from a Git Repository
In addition to PyPI and cloud storage, Syntasa also supports installing Python libraries directly from Git repositories such as GitHub, GitLab, or Bitbucket. This is useful if a library is not published on PyPI, or if you want to use a specific development version directly from source.
- Public repositories can be installed directly using the git+https:// format: git+https://github.com/django/django.git
- A specific branch, tag, or commit can be referenced to ensure reproducibility: git+https://github.com/django/django.git@stable/4.0.x or git+https://github.com/django/django.git@80d38de52bb2721a7b44fce4057bcff571afc23a
- Private repositories can also be installed by embedding credentials in the path, for example: git+https://<username>:<token>@github.com/your-org/private-repo.git. Please note: while embedding credentials works, it is not secure, as the credentials will be visible in logs and in the processor configuration.
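After the runtime initializes, a package installed from Git can be verified just like a PyPI package. For the Django examples above, a minimal check might look like this:

```python
# Confirm which Django version was installed from the Git reference.
import django

print(django.get_version())  # e.g. a 4.0.x release when installed from stable/4.0.x
```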
With these flexible options, Syntasa’s Spark Processor makes it easy to manage Python dependencies—whether you need a single package, multiple libraries, or custom builds from cloud storage or Git repositories. By leveraging the right installation method, you can ensure your processes remain reproducible, efficient, and aligned with your project’s environment setup.
The Libraries section in the UI works only when the Conda environment is enabled. If Conda is disabled, you will need to rely on the traditional methods of installing Python libraries—either by configuring them in the runtime or installing them directly within the Spark code. For a deeper understanding of Conda and its role in managing environments and dependencies, please refer to the article Understanding Conda Environment for Spark Processor (Python).