This guide walks you through creating a simple propensity model using logistic regression, tracking it with MLflow, and saving results to an S3 cloud bucket.
Set up Your Notebook
Begin by importing the necessary libraries. Certain cloud providers, such as AWS, offer prebuilt packages like boto3 and mlflow, so you might not need to install them; you can simply import them directly.
Note: We're writing our results to the S3 bucket in the same AWS account as our Syntasa platform. This offers a major advantage: our JupyterLab Integration and Syntasa Notebooks can access these resources natively, so they're already authenticated and don't need credentials. The same applies to other cloud providers such as GCP and Azure.
You'll need:
- Python 3.7+
- Connection credentials for the AWS S3 bucket, if you're using a different bucket from the one used by the Syntasa Platform (a minimal credentials sketch follows this list).
- The required packages, which you can install with:
pip install pandas numpy scikit-learn matplotlib boto3 mlflow
- NumPy and Pandas for data manipulation.
- Matplotlib for plotting.
- scikit-learn for training, testing, and evaluating the logistic regression model, plus the confusion matrix, classification report, ROC curve, and AUC.
- boto3 for AWS S3 access.
- MLflow for ML experiment tracking and model logging.
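If you do write to a bucket in a different account, you may need to create the boto3 client with explicit credentials. This is a minimal sketch, assuming placeholder key values and region that you would replace with your own (an IAM role or environment variables are usually preferable):
import boto3
# Only needed when the target bucket is not natively accessible from the platform.
# The values below are placeholders; prefer IAM roles or environment variables where possible.
s3 = boto3.client(
    's3',
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
    region_name='YOUR_BUCKET_REGION'
)
With the prerequisites in place, import the libraries: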
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, accuracy_score, precision_score, recall_score
import boto3
import mlflow
import mlflow.sklearn
Generate Sample Data
Set up MLflow tracking and simulate features (age, income) and a treatment outcome using a logistic function:
# MLflow setup
mlflow.set_experiment("Propensity_Model")
# Start a run explicitly so the snippets in the following sections log to the same run;
# the run is ended after the model is logged at the end of the workflow.
mlflow.start_run()
# Parameters
solver = 'liblinear'
mlflow.log_param("solver", solver)
# Data generation
np.random.seed(42)
n_samples = 1000
age = np.random.normal(40, 10, n_samples)
income = np.random.normal(60000, 15000, n_samples)
beta0, beta1, beta2 = -15, 0.2, 0.0001
linear_combination = beta0 + beta1 * age + beta2 * income
probability = 1 / (1 + np.exp(-linear_combination))
treatment = np.random.binomial(1, probability)
data = pd.DataFrame({
    'age': age,
    'income': income,
    'treatment': treatment
})
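Before training, it can help to sanity-check the simulated data. The following optional snippet (not part of the original workflow) prints a few rows and the overall treatment rate:
# Optional: quick sanity check on the simulated data
print(data.head())
print(f"Treatment rate: {data['treatment'].mean():.2%}")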
Train the Logistic Regression Model
Split your data, train the model, and make predictions.
X = data[['age', 'income']]
y = data['treatment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(solver=solver)  # use the same solver logged as an MLflow param
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]
y_pred = (y_proba >= 0.5).astype(int)
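Because the data was generated with known coefficients, you can optionally compare them with what the model recovers. Note that LogisticRegression applies L2 regularization by default and the features are unscaled, so the fitted values will only roughly match the true ones. An optional sketch:
# Optional: compare fitted coefficients with the true generating values
print(f"Intercept: {model.intercept_[0]:.3f} (true beta0 = -15)")
print(f"Age coefficient: {model.coef_[0][0]:.3f} (true beta1 = 0.2)")
print(f"Income coefficient: {model.coef_[0][1]:.6f} (true beta2 = 0.0001)")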
Evaluate Model Performance and Log with MLflow
- Use confusion_matrix and classification_report to understand prediction accuracy and class-level metrics (precision, recall, F1-score).
- Generate an ROC curve and calculate AUC to measure the model's ability to separate treated from untreated cases.
- Visualize the ROC curve to quickly assess how well the model performs across thresholds.
- Track parameters, metrics, and model artifacts with MLflow.
# Metrics
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
# Log everything with MLflow
mlflow.log_metric("accuracy", acc)
mlflow.log_metric("precision", prec)
mlflow.log_metric("recall", rec)
mlflow.log_metric("auc", roc_auc)
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# ROC Curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
roc_path = "/tmp/roc_curve.png"
plt.savefig(roc_path)
plt.close()
mlflow.log_artifact(roc_path)
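You can also keep the text output with the run instead of only printing it. The snippet below is an optional addition that logs the classification report as a text artifact using mlflow.log_text (available in recent MLflow versions):
# Optional: store the classification report as a text artifact on the same run
report_text = classification_report(y_test, y_pred)
mlflow.log_text(report_text, "classification_report.txt")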
Save Results and Upload to S3
# Save results
results = X_test.copy()
results['Actual_Treatment'] = y_test.values
results['Propensity_Score'] = y_proba
results.sort_values(by='Propensity_Score', ascending=False, inplace=True)
local_path = "/tmp/propensity_model_results.csv"
results.to_csv(local_path, index=False)
print(f"Saved local CSV: {local_path}")
mlflow.log_artifact(local_path)
# Upload to S3
s3 = boto3.client('s3')
bucket_name = 'YOUR_BUCKET'  # ⬅️ Replace with your actual bucket name
output_path = 'PATH/TO/YOUR/FILE/propensity_results.csv'  # ⬅️ Replace with your actual output path/file name
try:
    s3.upload_file(local_path, bucket_name, output_path)
    print(f"Uploaded to: s3://{bucket_name}/{output_path}")
except Exception as e:
    print("Failed to upload to S3:", e)
# Optionally log the model, then end the MLflow run started earlier
mlflow.sklearn.log_model(model, "logistic_model")
mlflow.end_run()
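Optionally, you can confirm the object landed in S3 and read it back for downstream analysis. This is a minimal sketch; reading the file directly with pandas assumes the s3fs package is installed:
# Optional: verify the upload and read the results back
response = s3.head_object(Bucket=bucket_name, Key=output_path)
print(f"Uploaded object size: {response['ContentLength']} bytes")
results_check = pd.read_csv(f"s3://{bucket_name}/{output_path}")  # requires s3fs
print(results_check.head())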
Conclusion
You've now completed an end-to-end propensity modeling workflow, giving you a template for running propensity models with full observability via MLflow. In this guide, you:
Simulated a Dataset for Propensity Modeling
- Generated a synthetic dataset of 1,000 individuals with features like age and income.
- Applied a logistic function with known coefficients to simulate the likelihood (treatment) of receiving an intervention.
Built and Trained a Propensity Model
- Used LogisticRegression from scikit-learn to estimate the probability of treatment based on the simulated features.
- Split the dataset into training and test sets to validate model performance.
Evaluated Model Performance
- Calculated metrics like the confusion matrix, precision, recall, and F1-score using classification_report.
- Plotted an ROC curve and calculated AUC to visualize the model's ability to distinguish between treated and untreated cases.
- Converted predicted probabilities into a sorted output for easier analysis.
Logged the Experiment with MLflow
- Used MLflow to track:
  - Model parameters (e.g., the solver used in logistic regression)
  - Performance metrics (e.g., AUC, accuracy)
  - Artifacts like the trained model and ROC plot
- Enabled reproducibility and comparison across runs via MLflow's UI or tracking server.
Saved and Uploaded Results to Amazon S3
- Saved the test results (including actual treatment and predicted scores) to a local CSV.
- Uploaded the CSV file to an S3 bucket for persistent cloud storage and downstream use.
The best way to understand and learn this workflow is through hands-on experience. Follow the steps below to create the sample notebook in your Syntasa environment:
- Download the sample notebook .ipynb file from this article.
- Create a new notebook in your Syntasa environment using the import notebook option.