How to Run Code in a Syntasa App
Running code in a Syntasa app differs from working in an interactive environment such as Syntasa Notebooks or JupyterLab. In a Syntasa app, you write code the way you would in a script or an IDE such as VS Code: it is executed all at once, not cell by cell.
Here’s a breakdown of how to structure, execute, and manage your code within a Syntasa application:
Understanding Runtime and Jobs
- Runtime: Each Syntasa app is associated with a runtime, typically a cloud-based virtual machine (VM) hosted on AWS or GCP.
- Job: When you execute your code, it runs as a job tied to the associated runtime.
- Non-interactive Execution: Unlike notebooks, you won’t see intermediate outputs. Any syntax or runtime errors will only appear after the job has finished executing.
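Because execution is non-interactive, a common pattern is to print or log intermediate state explicitly so it shows up in the job logs once the run completes. A minimal sketch (the logger name and format are arbitrary choices, and it assumes the runtime captures standard output, as is typical for Spark-based jobs):
import logging
# Messages emitted here surface in the job logs after the run,
# standing in for the cell-by-cell output a notebook would give you.
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger('syntasa_job')
logger.info('Job started')
# ... processing steps go here ...
logger.info('Job finished')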
Managing Dependencies (Libraries)
To install Python libraries within a Syntasa app:
- Navigate to the Customize >> Libraries dropdown on the Spark Processor process node in the UI.
- Specify the required libraries along with their versions (e.g., pandas 1.5.3, spacy 3.7.2).
- These libraries are automatically installed when the runtime is initialized.
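Since installation happens when the runtime is initialized rather than in your code, it can be helpful to print the installed versions at the start of the job so they appear in the logs. A minimal sketch, assuming pandas and spacy were the libraries requested above:
import pandas as pd
import spacy
# Confirm the versions the runtime actually installed
print(f'pandas version: {pd.__version__}')
print(f'spacy version: {spacy.__version__}')
With the libraries available on the runtime, import everything the propensity-model example below needs: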
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score, precision_score, recall_score, f1_score
print('Dependencies Imported')
Create or Load Your Data
For demonstration purposes, we’ll create a synthetic dataset with two features and a binary outcome. The outcome (denoted as treatment) represents the event of interest.
# Setting a seed for reproducibility
np.random.seed(42)
# Generate 1000 samples
n_samples = 1000
# Create two features: age (normally distributed around 40)
# and income (normally distributed around 60,000)
age = np.random.normal(40, 10, n_samples)
income = np.random.normal(60000, 15000, n_samples)
# Generate an outcome with some relationship to age and income
# Using a logistic function: P(treatment) = 1 / (1 + exp(-(β0 + β1*age + β2*income)))
# Here, we define coefficients for simulation purposes.
beta0 = -15
beta1 = 0.2
beta2 = 0.0001
# Calculate probabilities
linear_combination = beta0 + beta1 * age + beta2 * income
probability = 1 / (1 + np.exp(-linear_combination))
# Generate a binary outcome (treatment) based on the probability
treatment = np.random.binomial(1, probability)
# Create a DataFrame
data = pd.DataFrame({
'age': age,
'income': income,
'treatment': treatment
})
print('Dataset Created')
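Since you won't see any output until the job finishes, quick checks on the generated data need explicit print statements. An optional sketch:
# Optional sanity checks; printed output shows up in the job logs
print(data.head())
print(f"Rows: {len(data)}")
print(f"Treatment rate: {data['treatment'].mean():.2%}")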
Split the Data for Training and Testing
To evaluate your model’s performance, split the data into training and testing sets.
X = data[['age', 'income']]
y = data['treatment']
# Split data: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print('Training and Testing Samples created')
Build the Propensity Model (Logistic Regression)
Instantiate the logistic regression model, fit it on the training data, and then predict probabilities on the test set.
# Initialize logistic regression model
model = LogisticRegression(solver='liblinear')
# Fit the model on the training data
model.fit(X_train, y_train)
# Get predicted probabilities for the test set (propensity scores)
y_proba = model.predict_proba(X_test)[:, 1]
# Convert the probabilities to a binary prediction (using 0.5 as the threshold)
y_pred = (y_proba >= 0.5).astype(int)
print('Results Generated')
# Compute evaluation metrics on the test set
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print('Metrics Generated')
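The imports above also include classification_report, confusion_matrix, roc_curve, and auc. For richer diagnostics in the job logs, an optional sketch using the same test-set predictions:
# Optional: fuller evaluation output, printed to the job logs
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
# Area under the ROC curve, computed from the propensity scores
fpr, tpr, _ = roc_curve(y_test, y_proba)
print(f'ROC AUC: {auc(fpr, tpr):.3f}')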
Producing and Saving Output
- The tabular output must be a Spark DataFrame.
- This output is persisted using the function writeToEventStore(spark_dataframe, @output_placeholder), where:
- spark_dataframe: Your final Spark DataFrame.
- @output_placeholder: A reference to an event store defined in your app configuration.
# Collect the metrics into a single-row pandas DataFrame, convert it
# to Spark, and persist it to the configured output event store
df = pd.DataFrame({'Accuracy_Score': accuracy, 'Precision_Score': precision, 'Recall_Score': recall, 'F1_Score': f1}, index=[0])
sdf = spark.createDataFrame(df)
writeToEventStore(sdf, '@OutputTable1')
Placeholders in Syntasa
Syntasa uses placeholders to reference various resources. These include:
- System placeholders: Predefined identifiers such as input tables or event stores.
- Custom placeholders: Created as needed; they must follow the syntax @your_placeholder_name.
Placeholders decouple your logic from resource locations, making workflows more portable and modular.
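To see the decoupling in practice, note that the write in the previous section references the output only through its placeholder; pointing @OutputTable1 at a different event store is an app-configuration change, not a code change. A brief illustration, reusing sdf and @OutputTable1 from above:
# The code names the output only by its placeholder. Which event
# store (and which cloud backend) '@OutputTable1' resolves to is
# defined in the app configuration, not in this code.
writeToEventStore(sdf, '@OutputTable1')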
Workflow Integration
Once the output DataFrame is written to an event store:
- It can be used as an input for the next process in the workflow.
- It can also be visualized in Superset dashboards, which are powered by the data stored in the cloud backend (AWS or GCP).