How to Run Code in a Syntasa App
Running code in a Syntasa app differs from working in an interactive environment such as Syntasa Notebooks or JupyterLab. In a Syntasa app, you write code the way you would in a script or an IDE such as VS Code: it is executed all at once, not cell by cell.
Here’s a breakdown of how to structure, execute, and manage your code within a Syntasa application:
Understanding Runtime and Jobs
- Runtime: Each Syntasa app is associated with a runtime, typically a cloud-based virtual machine (VM) hosted on AWS or GCP.
- Job: When you execute your code, it runs as a job tied to the associated runtime.
- Non-interactive Execution: Unlike notebooks, you won’t see intermediate outputs. Any syntax or runtime errors will only appear after the job has finished executing.
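Because execution is non-interactive, a common pattern is to print or log intermediate state explicitly so it shows up in the job logs once the run completes. A minimal sketch (the logger name and format are arbitrary choices, and it assumes the runtime captures standard output, as is typical for Spark-based jobs):
import logging
# Messages emitted here surface in the job logs after the run,
# standing in for the cell-by-cell output a notebook would give you.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('syntasa_job')
logger.info('Job started')
# ... processing steps go here ...
logger.info('Job finished')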
Managing Dependencies (Libraries)
To install Python libraries within a Syntasa app:
- Navigate to the Customize >> Libraries dropdown on the Spark Processor process node in the UI.
- Specify the required libraries along with their versions (e.g., pandas 1.5.3, spacy 3.7.2).
- These libraries are automatically installed when the runtime is initialized.
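Since installation happens when the runtime is initialized rather than in your code, it can be helpful to print the installed versions at the start of the job so they appear in the logs. A minimal sketch, assuming pandas and spacy were the libraries requested above:
import pandas as pd
import spacy
# Confirm the versions the runtime actually installed
print(f'pandas version: {pd.__version__}')
print(f'spacy version: {spacy.__version__}')
With the libraries available on the runtime, import everything the propensity-model example below needs: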
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score
print('Dependencies Imported')
Create or Load Your Data
For demonstration purposes, we’ll create a synthetic dataset with two features and a binary outcome. The outcome (denoted as treatment) represents the event of interest.
# Setting a seed for reproducibility
np.random.seed(42)
# Generate 1000 samples
n_samples = 1000
# Create two features: age (normally distributed around 40)
# and income (normally distributed around 60,000)
age = np.random.normal(40, 10, n_samples)
income = np.random.normal(60000, 15000, n_samples)
# Generate an outcome with some relationship to age and income
# Using a logistic function: P(treatment) = 1 / (1 + exp(-(β0 + β1*age + β2*income)))
# Here, we define coefficients for simulation purposes.
beta0 = -15
beta1 = 0.2
beta2 = 0.0001
# Calculate probabilities
linear_combination = beta0 + beta1 * age + beta2 * income
probability = 1 / (1 + np.exp(-linear_combination))
# Generate a binary outcome (treatment) based on the probability
treatment = np.random.binomial(1, probability)
# Create a DataFrame
data = pd.DataFrame({
'age': age,
'income': income,
'treatment': treatment
})
print('Dataset Created')
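Since you won't see any output until the job finishes, quick checks on the generated data need explicit print statements. An optional sketch:
# Optional sanity checks; printed output shows up in the job logs
print(data.head())
print(f"Rows: {len(data)}")
print(f"Treatment rate: {data['treatment'].mean():.2%}")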
Split the Data for Training and Testing
To evaluate your model’s performance, split the data into training and testing sets.
X = data[['age', 'income']]
y = data['treatment']
# Split data: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('Training and Testing Samples created')
Build the Propensity Model (Logistic Regression)
Instantiate the logistic regression model, fit it on the training data, and then predict probabilities on the test set.
# Initialize logistic regression model
model = LogisticRegression(solver='liblinear')
# Fit the model on the training data
model.fit(X_train, y_train)
# Get predicted probabilities for the test set (propensity scores)
y_proba = model.predict_proba(X_test)[:, 1]
# Convert the probabilities to a binary prediction (using 0.5 as the threshold)
y_pred = (y_proba >= 0.5).astype(int)
print('Results Generated')
# Compute evaluation metrics on the test set
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print('Metrics Generated')
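The imports above also include classification_report, confusion_matrix, roc_curve, and auc. For richer diagnostics in the job logs, an optional sketch using the same test-set predictions:
# Optional: fuller evaluation output, printed to the job logs
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Area under the ROC curve, computed from the propensity scores
fpr, tpr, _ = roc_curve(y_test, y_proba)
print(f'ROC AUC: {auc(fpr, tpr):.3f}')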
Producing and Saving Output
- The tabular output must be a Spark DataFrame.
- This output is persisted using the function writeToEventStore(spark_dataframe, @output_placeholder), where:
- spark_dataframe: Your final Spark DataFrame.
- @output_placeholder: A reference to an event store defined in your app configuration.
# Collect the metrics into a single-row pandas DataFrame, convert it
# to Spark, and persist it to the configured output event store
df = pd.DataFrame({'Accuracy_Score': accuracy, 'Precision_Score': precision, 'Recall_Score': recall, 'F1_Score': f1}, index=[0])
sdf = spark.createDataFrame(df)
writeToEventStore(sdf, '@OutputTable1')
Placeholders in Syntasa
Syntasa uses placeholders to reference various resources. These include:
- System placeholders: Predefined identifiers such as input tables or event stores.
- Custom placeholders: Created as needed; they must follow the syntax @your_placeholder_name.
Placeholders decouple your logic from resource locations, making workflows more portable and modular.
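To see the decoupling in practice, note that the write in the previous section references the output only through its placeholder; pointing @OutputTable1 at a different event store is an app-configuration change, not a code change. A brief illustration, reusing sdf and @OutputTable1 from above:
# The code names the output only by its placeholder. Which event
# store (and which cloud backend) '@OutputTable1' resolves to is
# defined in the app configuration, not in this code.
writeToEventStore(sdf, '@OutputTable1')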
Workflow Integration
Once the output DataFrame is written to an event store:
- It can be used as an input for the next process in the workflow.
- It can also be visualized in Superset dashboards, which are powered by the data stored in the cloud backend (AWS or GCP).