Below is an example of creating a simple propensity model using Python in a JupyterLab/Syntasa Notebook. In this example, we'll use a logistic regression model from scikit-learn to estimate the probability (i.e., the propensity) of an event such as a conversion or treatment, based on a couple of features, and then write the results to an S3 bucket.
1. Set up Your Notebook
Begin by importing the necessary libraries. Certain cloud providers, such as AWS, offer prebuilt packages like boto3 and mlflow, so you might not need to install them—just import them directly. You’ll need:
- Python 3.7+
- NumPy and Pandas for data manipulation.
- Matplotlib for plotting.
- scikit-learn for building and evaluating the model.
If any of these packages are missing from your environment, install them with pip:
pip install pandas numpy scikit-learn matplotlib boto3 mlflow
Then import the libraries in your notebook:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
import boto3
2. Create or Load Your Data
For demonstration purposes, we'll create a synthetic dataset with two features and a binary outcome. The outcome (denoted as treatment) represents the event of interest.
# Setting a seed for reproducibility
np.random.seed(42)
# Generate 1000 samples
n_samples = 1000
# Create two features (e.g., age and income)
age = np.random.normal(40, 10, n_samples)            # normally distributed around 40
income = np.random.normal(60000, 15000, n_samples)   # normally distributed around 60,000
# Generate an outcome with some relationship to age and income
# Using a logistic function: P(treatment) = 1 / (1 + exp(-(β0 + β1*age + β2*income)))
# Here, we define coefficients for simulation purposes.
beta0 = -15
beta1 = 0.2
beta2 = 0.0001
# Calculate probabilities
linear_combination = beta0 + beta1 * age + beta2 * income
probability = 1 / (1 + np.exp(-linear_combination))
# Generate a binary outcome (treatment) based on the probability
treatment = np.random.binomial(1, probability)
# Create a DataFrame
data = pd.DataFrame({
    'age': age,
    'income': income,
    'treatment': treatment
})
print(data.head())
3. Split the Data for Training and Testing
To evaluate your model’s performance, split the data into training and testing sets.
X = data[['age', 'income']] # Predictors
y = data['treatment'] # Outcome
# Split data: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
4. Build the Propensity Model (Logistic Regression)
Instantiate the logistic regression model, fit it on the training data, and then predict probabilities on the test set.
# Initialize logistic regression model
model = LogisticRegression(solver='liblinear')
# Fit the model on the training data
model.fit(X_train, y_train)
# Get predicted probabilities for the test set (propensity scores)
y_proba = model.predict_proba(X_test)[:, 1]
# Convert the probabilities to a binary prediction (using 0.5 as the threshold)
y_pred = (y_proba >= 0.5).astype(int)
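As a quick sanity check before evaluating, you can inspect the fitted intercept and coefficients (on the log-odds scale). With this synthetic data they should loosely resemble the simulation coefficients from step 2, though the difference in feature scales can affect the fit:
# Inspect the fitted intercept and coefficients (log-odds scale)
print("Intercept:", model.intercept_[0])
for feature, coef in zip(X_train.columns, model.coef_[0]):
    print(f"  {feature}: {coef:.5f}")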
5. Evaluate the Model
Evaluate the model by looking at a confusion matrix, a classification report, and by plotting an ROC curve.
# Confusion Matrix and Classification Report
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# ROC Curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:0.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()
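Because propensity scores are used directly as probabilities (not just for ranking), it is also worth checking how well calibrated they are. Below is a minimal sketch using scikit-learn's calibration_curve; the choice of 10 bins is arbitrary:
from sklearn.calibration import calibration_curve
# Compare predicted probabilities to observed frequencies in 10 bins
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Observed fraction of positives')
plt.title('Calibration Curve')
plt.legend(loc='lower right')
plt.show()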
6. Interpreting Your Propensity Scores
The model estimates the probability that a given observation belongs to the positive class (e.g., receiving a treatment or converting). These probabilities can be used for further analysis, such as:
- Propensity score matching: pairing units with similar scores from the treatment and control groups (see the matching sketch after the results preview below).
- Inverse probability weighting: using the scores as weights to control for confounding factors.
For example, you can inspect the first few propensity scores along with the features and actual outcomes:
# Create a DataFrame with test set results
results = X_test.copy()
results['Actual_Treatment'] = y_test.values
results['Propensity_Score'] = y_proba
results.sort_values(by='Propensity_Score', ascending=False, inplace=True)
print(results.head(10))
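To make the matching idea from the list above concrete, here is a minimal sketch of 1-nearest-neighbor matching on the propensity score, with replacement and no caliper. A production matching workflow (calipers, matching without replacement, balance diagnostics) would need more care:
# Minimal 1-nearest-neighbor propensity matching sketch (with replacement)
treated = results[results['Actual_Treatment'] == 1]
control = results[results['Actual_Treatment'] == 0]
matches = []
for idx, row in treated.iterrows():
    # Find the control unit with the closest propensity score
    closest = (control['Propensity_Score'] - row['Propensity_Score']).abs().idxmin()
    matches.append((idx, closest))
print(f"Formed {len(matches)} treated/control pairs")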
7. Save the results to S3
Note: We're writing our results to the S3 bucket in the same AWS Account as our Syntasa platform. This offers a major advantage: our JupyterLab Integration and Syntasa Notebooks can access these resources natively, so they're already authenticated and don't need credentials. The same would apply to other cloud providers like GCP/Azure.
You might be wondering whether you need to manually set up an S3 bucket and folder before using them with Syntasa. For this example, no manual creation is required: it uses an S3 bucket that Syntasa already has access to, and you can write files into an existing folder within that bucket.
However, it is essential to define the specific S3 bucket and folder paths within your Syntasa code. To obtain these paths, navigate to the "Files" section in a new browser tab. Click the "+" icon and then the link icon to copy the current path.
Using a Different S3 Bucket
Should your process require utilizing a different S3 bucket, we strongly recommend coordinating with your Cloud DevOps or Cloud Engineering teams. They can provide expert guidance and assist with the creation of the bucket, along with the necessary IAM (Identity and Access Management) role.
Please note that Syntasa requires read and write permissions for any S3 bucket it interacts with.
Additionally, if you use an S3 bucket other than the default Syntasa-accessible one, your code will need to be modified to include the appropriate connection credentials for that S3 bucket.
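As a rough sketch of what that modification might look like, you could build the boto3 client from a session with explicit credentials. The values below are placeholders, not real configuration; in practice, source credentials from a secrets manager rather than hard-coding them:
# Example only: explicit credentials for a non-default bucket (placeholder values)
session = boto3.Session(
    aws_access_key_id='YOUR_ACCESS_KEY_ID',          # placeholder
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',  # placeholder
    region_name='us-east-1'                          # adjust to your bucket's region
)
s3_external = session.client('s3')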
# Save the results to a local CSV using the pandas DataFrame
local_path = "/tmp/propensity_model_results.csv"
results.to_csv(local_path, index=False)
print(f"\nSaved local CSV: {local_path}")
# Upload to S3
s3 = boto3.client('s3')
bucket_name = 'YOUR_BUCKET'  # ⬅️ Replace with your actual bucket name
output_path = 'PATH/TO/YOUR/FILE/propensity_results.csv'  # ⬅️ Replace with your actual output path/file name
try:
    s3.upload_file(local_path, bucket_name, output_path)
    print(f"Uploaded to: s3://{bucket_name}/{output_path}")
except Exception as e:
    print("Failed to upload to S3:", e)
Conclusion
This notebook demonstrates a basic approach to building a propensity model. By using logistic regression, you can compute the probability of an event occurring based on various features. This is particularly useful in observational studies or A/B tests where you need to adjust for covariates that predict treatment assignment.
Feel free to experiment with different features, models, and evaluation metrics to tailor the model to your specific problem.
The best way to understand this workflow is through hands-on experience. Follow the steps below to create the sample notebook in your Syntasa environment:
- Download the sample notebook .ipynb file from this article.
- Create a new notebook in your Syntasa environment using the import notebook option.