Syntasa offers various process types, including those where users can write custom code in languages like SQL, Python, Scala, or R. These "code processes" are fundamental to building custom analytics datasets and solutions. To make these code processes flexible and adaptable, Syntasa utilizes parameters. The most commonly used code processes are the Spark processor, the BQ process, and the Code Container process.
Parameters enable users to pass in values and influence how the code within the process executes, without necessarily altering the underlying script itself. This design allows processes to handle different data sources and apply operations based on user-defined inputs.
Advantages of Using Parameters
Using parameters in Syntasa code processes offers significant advantages:
- Flexibility and Reusability: Parameters allow a single code script to be used in different scenarios or with different data sources without needing to modify the script itself. This means processes can handle varying inputs and outputs dynamically.
- Dynamic Execution: Parameters, especially system parameters, can pull in values determined by the job run, such as date ranges, allowing the process to operate on specific periods or partitions automatically.
- User Customization: Custom parameters provide a way for users to input specific values, allowing them to alter the behavior or logic of the script on the fly, such as applying custom filters to a dataset.
Types of Parameters
Syntasa code processes support the following four types of parameters, each serving a specific purpose in dynamic and configurable workflows:
- System Parameters
- Custom Parameters
- Input Parameters
- Output Parameters
System Parameters
System parameters are predefined variables made available by the Syntasa platform for use within your code processes. These parameters provide access to information related to the job execution environment or the data being processed, such as database names and date ranges selected for the job.
Here is the list of system parameters that can be used within code processes:
- @database: The @database parameter represents the database name of the event store used in the application for writing output. The value is automatically determined based on the workflow environment (development or production). For example, if the event store X is linked with two databases, x_dev and x_prod, then the @database parameter will resolve to x_dev when the code runs in the development workflow, and to x_prod when executed in the production workflow. Example:
CREATE TABLE IF NOT EXISTS @database.lookup_bq_process
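Within a Spark processor, the same parameter can be referenced from PySpark by embedding it in a SQL string, since Syntasa substitutes the value before the code runs. A minimal sketch, assuming a hypothetical table named events exists in the event store database:
# @database is substituted with the dev or prod database name before the code runs
# (the "events" table below is a hypothetical example)
df = spark.sql("SELECT * FROM @database.events")
df.show(5)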
- @fromDate: The @fromDate parameter represents the start time of a job execution. This includes both the readable date and the job's exact start time as an epoch (UNIX) timestamp, making it useful for uniquely identifying a job execution window. Example: If a job starts on January 1, 2025, at 13:00 hrs, then the value of @fromDate will be 2025-01-01_1735736400.
- @toDate: The @toDate parameter represents the end time (completion time) of a job execution, formatted as a combination of the date and the epoch (UNIX) timestamp. This parameter captures the exact moment when the job execution ends, which is useful for time-window-based queries, tracking, and output versioning. Example: If a job completes on January 1, 2025, at 14:00 hrs, then @toDate will have a value of 2025-01-01_1735740000.
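Because both values combine a readable date and an epoch timestamp separated by an underscore, the two parts can be split apart inside the code when only one is needed. A minimal sketch, assuming the parameters are substituted as plain strings in the format shown above:
from datetime import datetime, timezone
# Parameter values arrive as "yyyy-MM-dd_epochSeconds" strings
from_date = "@fromDate"
to_date = "@toDate"
# Separate the readable date from the epoch timestamp
from_day, from_epoch = from_date.split("_")
to_day, to_epoch = to_date.split("_")
# Convert the epoch part to a UTC datetime for logging or calculations
job_start = datetime.fromtimestamp(int(from_epoch), tz=timezone.utc)
job_end = datetime.fromtimestamp(int(to_epoch), tz=timezone.utc)
print("Job window:", from_day, "to", to_day)
print("Duration in seconds:", int(to_epoch) - int(from_epoch))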
- @fixedFromDate: The @fixedFromDate parameter represents the start date selected for job execution. It returns the date in the format yyyy-MM-dd (e.g., 2025-01-01). This value is based on the input provided during job execution and is useful when you want to filter records from a specific date onward. Example: You want to filter data for all records starting from the selected start date:
start_date = "@fixedFromDate"
df_filtered = df.filter(df["event_date"] >= start_date)
If the user selects January 1, 2025, as the job's start date, @fixedFromDate will resolve to 2025-01-01.
- @fixedToDate: The @fixedToDate parameter represents the end date selected for job execution. Like @fixedFromDate, it returns the date in yyyy-MM-dd format. This is used when you want to filter records up to and including the end date. Example: You want to filter data for all records up to the selected end date:
end_date = "@fixedToDate"
df_filtered = df.filter(df["event_date"] <= end_date)
If the user selects January 7, 2025, as the job's end date, @fixedToDate will resolve to 2025-01-07.
- @datesToProcess: The @datesToProcess parameter provides a comma-separated list of all dates within the job execution range. Each date is returned in the format yyyy-MM-dd_epochTime. This includes every individual date between the job's selected start and end dates (inclusive), where each date is paired with its corresponding epoch timestamp (in seconds). This is especially useful for processing or filtering data partitioned by date, or for dynamically iterating over multiple days of data. If a job is executed from 2025-01-01 to 2025-01-03, then @datesToProcess will return a string like:
2025-01-01_1735689600, 2025-01-02_1735776000, 2025-01-03_1735862400
Here’s an example of how this parameter can be used in a Spark-based code process to dynamically read or transform data for each date in the range:
# Access the datesToProcess parameter
dates_string = "@datesToProcess"
# Split the string to get individual date_epoch entries
date_epoch_list = dates_string.split(",")
# Trim any surrounding whitespace and extract just the date part for each entry
date_list = [entry.strip().split("_")[0] for entry in date_epoch_list]
# Example: Read or process data for each date
for date_str in date_list:
    daily_df = spark.read.format("parquet").load(f"s3://your-bucket/data/date={date_str}")
    # Apply transformations or aggregations
    transformed_df = daily_df.filter(daily_df["status"] == "completed")
    # Optionally write or union with other data
    # e.g., accumulate into a final DataFrame
- @numPartitions: The @numPartitions system parameter returns the number of date-based partitions created during the current job execution. It reflects how many new partitions (based on date) are being written by the processor during the run. Note: This value has no relation to existing partitions in the output dataset or the number of records processed. It strictly indicates how many new date partitions are created as part of the current job execution.
Example: If a job is scheduled to run from January 1, 2025, to January 4, 2025, and it generates output partitions for January 1st to 3rd, but not for January 4th for any reason, then @numPartitions will return 3.
Use Case: The @numPartitions parameter is particularly useful in various scenarios, such as logging or auditing how many date partitions were generated during a job run. It also enables conditional logic, allowing you to take action based on the number of partitions created, for example, skipping downstream steps if fewer or more than a specific number of partitions were generated.
num_partitions = @numPartitions
if num_partitions == 0:
    print("No new data partitions were created. Skipping downstream processing.")
    # Skip or exit early instead of continuing the pipeline
else:
    print(f"{num_partitions} date partitions created.")
    # Continue with the next step in the pipeline
- @location: The @location parameter represents the full storage path of the dataset where the output will be written when the code process runs. This applies to datasets stored in either Google Cloud Storage (GCS) or Amazon S3, depending on your environment. The value of @location is automatically determined by the system and points to the location configured in the Output tab of the code process. Each output dataset (e.g., @OutputTable1) has a corresponding path that can be viewed in the UI and accessed via this parameter. If your output is configured to store data in GCS, the parameter might resolve to:
gs://syntasa-output-data/marketing/campaign_results/
The @location parameter is especially useful when you want to reference or log the output path programmatically, or when you need to extract a portion of the path, such as the bucket name or a folder, for conditional logic or audits. Here is an example of fetching the bucket name and folder name:
# Get full output path from parameter
full_path = "@location"
# Extract bucket name from path
# E.g., 'gs://syntasa-output-data/marketing/campaign_results/' → 'syntasa-output-data'
bucket_name = full_path.split("/")[2]
# Extract the folder path after the bucket
# → 'marketing/campaign_results/'
folder_path = "/".join(full_path.split("/")[3:])
print("Full Output Path:", full_path)
print("Bucket Name:", bucket_name)
print("Folder Path:", folder_path) - @environment - The
@environment
parameter returns the current environment in which the process is being executed. This value is automatically set by Syntasa and will return 'Development' if the job is run within the development workflow and 'Production' if the job is run within the production workflow.This parameter is especially useful when you need to control logic, configurations, or data access based on the execution environment. It allows you to write conditional code that behaves differently in development vs. production, without modifying the script.
Example use case: To load sample data in development, but full data in production. Here is the code:environment = "@environment"
if environment == "Development":
    # Read limited rows from sample dataset
    input_df = spark.read.csv("s3a://demo-dev/sample_data.csv", header=True).limit(1000)
    print("Running in Development mode: Using sample data.")
else:
    # Read full production dataset
    input_df = spark.read.csv("s3a://demo-prod/full_data.csv", header=True)
    print("Running in Production mode: Using full dataset.")
Custom Parameters
Custom parameters in Syntasa allow users to define their own key-value pairs that can be accessed dynamically within a code processor. These are essentially user-defined variables, configured through the UI, which can be reused across your code logic without hardcoding the values.
Custom parameters are especially useful for storing sensitive or frequently changing values, such as bucket paths, file names, or region codes. By externalizing these values instead of hardcoding them directly in the script, you keep the code flexible, secure, and easier to update and maintain.
Example:
Suppose you want to store the GCS path where raw input files are located. You can define a custom parameter in the code processor:
- Key: @InputFilePath
- Value: gs://syntasa-demo-bucket/raw/input/
You can use it in your code like this:
# Get the custom path parameter
file_path = "@InputFilePath"
# Use the value to read the CSV file
df = spark.read.csv(file_path, header=True)
This allows you to change the file path from the UI without modifying the script.
Input Parameters
Input parameters are automatically generated when you connect an input (such as a connection or dataset) to a code process in Syntasa. These parameters serve as dynamic references to the connected resources, allowing your code to interact with them in a flexible and maintainable way.
How do Input Parameters work?
- When you connect a connection (e.g., an S3 or GCS connection) to a code processor, Syntasa auto-generates a parameter such as @InputConnection1, which points to that connection.
- If you connect another connection, a second parameter like @InputConnection2 is created.
- You can rename the parameter (e.g., @InputConnection1), but you cannot change its value, which reflects the exact name of the connection and cannot be altered after creation.
Why Use Input Parameters?
Input parameters are especially helpful when you want to programmatically access configuration details from a connection, such as:
- Access keys or secret keys (for cloud storage)
- Hostname and port (for databases)
- Project ID, bucket name, or keyfile (for GCS or BigQuery)
They eliminate the need to hardcode connection-specific information, keeping your code dynamic and environment-agnostic.
Example:
Let’s say you have connected an S3 connection to a Spark processor. Instead of hardcoding the access key and secret key, you can retrieve them using the input parameter.
# Get access credentials from the connection using the input parameter
access_key = getConnectionParam("@InputConnection1", "awsAccessKey")
secret_key = getConnectionParam("@InputConnection1", "awsSecretKey")
bucket = getConnectionParam("@InputConnection1", "bucketName")
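As a follow-up, the retrieved values could then be applied to the Spark session so that subsequent reads from the bucket authenticate correctly. This is only a sketch assuming a standard S3A setup; the exact configuration keys and object path are examples, not part of the Syntasa API:
# Apply the connection credentials to the active Spark session (S3A filesystem)
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", secret_key)
# Read a file from the connected bucket using the retrieved bucket name
# (the object path below is a hypothetical example)
raw_df = spark.read.csv(f"s3a://{bucket}/raw/input.csv", header=True)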
Output Parameters
Output parameters are automatically generated by Syntasa to represent the output datasets associated with a code processor. These parameters point to the event store datasets (or other configured outputs) where the results of your transformation logic will be written.
By default, most code processors come with a primary output node, and a corresponding parameter like @OutputTable1 is generated. If the processor is configured with multiple output datasets, the parameters are incrementally named (e.g., @OutputTable1, @OutputTable2, etc.).
How do Output Parameters work?
- Auto-generated: When you connect a dataset to the output of a processor, a parameter such as @OutputTable1 is created automatically.
- Fixed value: This parameter points to the name of the event store table or dataset and cannot be changed manually. However, you can change the parameter name (e.g., @OutputTable1).
Why Use Output Parameters?
- To dynamically reference the target dataset without hardcoding its name or location.
- To make your code environment-independent, allowing the same code to run in development or production with different output destinations.
Example:
Let's say you're transforming user data in a Spark processor and want to save the cleaned result to a configured event store table. You don't need to manually reference the dataset path; instead, you can use the @OutputTable1 parameter directly.
# Final transformed DataFrame
final_df = filtered_df.select("user_id", "region", "last_login")
# Write to the output dataset (event store) using output parameter
writeToEventStore(final_df, "@OutputTable1")
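If a processor writes with Spark's native DataFrame API rather than a helper function, the output-related parameters described earlier can be combined in the same way. The following is only a sketch under stated assumptions: it assumes the configured output is a Parquet dataset at the @location path and that event_date is an existing column in final_df; your event store's actual format and partition layout may differ.
# Hypothetical alternative: write the result directly to the configured output path.
# @location resolves to the GCS/S3 path shown in the Output tab of the code process.
output_path = "@location"
# Partition the output by date so each run adds date-based partitions
# (the event_date column and Parquet format are assumptions for this sketch).
final_df.write.mode("append").partitionBy("event_date").parquet(output_path)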