Please use the following Google Colaboratory project for a complete, runnable example of integrating the Seclea Platform into your data science workflow. Copy the Google Colaboratory project into your own account and run it from there.
Below is a step-by-step guide, followed by the Google Colaboratory project.
We will run through a sample project showing how to use Seclea's tools to record your data science work and explore the results in the Seclea Platform.
Please create a new project by clicking the "+ Create" button and giving it a name and description.
After you have created your project, it will appear in the All Projects section, as shown below:
Click the project name in the All Projects section to go to the project settings page on the Seclea Platform.
Clicking the settings option in the left panel takes you to that project's settings.
In the project settings, you can select, modify or set new templates for compliance, risk management and any internal standard/policy you must comply with. How to use the compliance, risk management and internal settings is covered in their respective sections.
If you want to include additional team members in the project, you can add them in the Access section of the settings.
These are optional settings; you can set/modify them later in the project.
Integrating with Seclea-AI
You can install the seclea-ai package from pip:
!pip install seclea_ai
When you initialise the SecleaAI object, you will be prompted to log in if you haven't already done so. You must use the same Project Name you used earlier and the Organization name provided with your credentials.
from seclea_ai import SecleaAI
# NOTE - use the organization name provided to you when issuing credentials.
seclea = SecleaAI(project_name="Car Insurance Fraud Detection", organization='')
Handling the Data
You can download the data for this tutorial if you are working in Colab or without access to the repo - it is an insurance claims dataset with various features and 1,000 samples.
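If you are starting from a fresh Colab runtime, you can also fetch the CSV programmatically before loading it. The snippet below is only a sketch: the URL is a placeholder for wherever you host or obtained the file, not an official Seclea link.
import urllib.request
# Placeholder URL - replace with the location of your copy of the dataset.
DATA_URL = "https://example.com/insurance_claims.csv"
# Save it to the working directory so the pd.read_csv call below can find it.
urllib.request.urlretrieve(DATA_URL, "insurance_claims.csv")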
Now we can upload the initial data to the Seclea Platform.
This should include whatever information we know about the dataset as metadata. There are only two keys to add in metadata for now - outputs and continuous_features.
You can leave out outputs if you haven't decided what you will be predicting yet, but you should know or be able to find the continuous features at this point.
You can also update these when uploading datasets during/after pre-processing.
import numpy as np
import pandas as pd
# load the data
data = pd.read_csv('insurance_claims.csv', index_col="policy_number")
# define the metadata for the dataset.
dataset_metadata = {
    "outputs": ["fraud_reported"],
    "favourable_outcome": "N",
    "unfavourable_outcome": "Y",
    "continuous_features": [
        "total_claim_amount",
        "policy_annual_premium",
        "capital-gains",
        "capital-loss",
        "injury_claim",
        "property_claim",
        "vehicle_claim",
        "incident_hour_of_the_day",
    ],
}
# ⬆️ upload the dataset - pick a meaningful name here; you'll see it a lot on the platform!
seclea.upload_dataset(dataset=data, dataset_name="Auto Insurance Fraud", metadata=dataset_metadata)
Evaluating the Dataset
After running the section above, head back to the Seclea Platform so that we can take a closer look at our dataset. To do so, navigate to the Datasets section under the Prepare tab.
Personally Identifiable Information (PII) and Format Check
Data Bias Check
Transformations
When using Seclea to record your data science work, you need to take care with how you handle transformations of the data.
We require that every transformation is encapsulated in a function that takes the data as input and returns the transformed data. There are a few things to be aware of, so please see the docs for more detail.
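As a rule of thumb, each transformation should behave like a pure function: everything it needs is passed in as arguments, and it returns a new object rather than modifying its input, so that the step can be recorded as a function plus its arguments (as the DatasetTransformation objects later in this tutorial do). A minimal sketch of the pattern - the function and column name here are purely illustrative:
def drop_columns(df, columns):
    # return a new DataFrame rather than mutating the one passed in,
    # so the original data stays untouched for tracking
    return df.drop(columns=columns)

# everything the function needs is passed explicitly as arguments
df_reduced = drop_columns(df2, columns=["incident_location"])  # example column only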
# Create a copy to isolate the original dataset
df1 = data.copy(deep=True)

def encode_nans(df):
    # convert the special characters to NaNs
    return df.replace('?', np.nan)

df2 = encode_nans(df1)
Data Cleaning
We will carry out some pre-processing and generate a few different datasets to see how to track these on the platform. This also means we can train our models on different data and see how that affects performance.
# Drop the columns that are more than some proportion NaN values
def drop_nulls(df, threshold):
    cols = [x for x in df.columns if df[x].isnull().sum() / df.shape[0] > threshold]
    return df.drop(columns=cols)

# We choose 95% as our threshold
null_thresh = 0.95

df3 = drop_nulls(df2, threshold=null_thresh)
def drop_correlated(data, thresh):
    import numpy as np

    # calculate correlations between the numeric columns
    corr_matrix = data.corr(numeric_only=True).abs()
    # get the upper part of the correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # columns with correlation above threshold
    redundant = [column for column in upper.columns if any(upper[column] >= thresh)]
    print(f"Columns to drop with correlation > {thresh}: {redundant}")
    new_data = data.drop(columns=redundant)
    return new_data

# drop columns that are too closely correlated
correlation_threshold = 0.95
df4 = drop_correlated(df3, correlation_threshold)
Upload Intermediate Dataset
Before balancing the datasets we will upload them to the Seclea Platform.
We define the metadata for the dataset - if anything has changed since the original dataset, we need to record it here; otherwise, we can reuse the original metadata. In this case we have dropped some of the continuous feature columns, so we need to redefine the continuous_features list.
We define the transformations that took place between the last state we uploaded and this dataset. This is a list of functions and arguments. Please have a look at docs.seclea.com for more details on the correct formatting.
from seclea_ai.transformations import DatasetTransformation
# define the updates to the metadata - only changes are updated - here a continuous feature has been dropped so now
# we remove it from the list of continuous features.
processed_metadata = {
    "continuous_features": [
        "total_claim_amount",
        "policy_annual_premium",
        "capital-gains",
        "capital-loss",
        "injury_claim",
        "property_claim",
        "incident_hour_of_the_day",
    ]
}
# 🔀 define the transformations - note the arguments
cleaning_transformations = [
DatasetTransformation(encode_nans, data_kwargs={"df": df1}, kwargs={}, outputs=["df"]),
DatasetTransformation(
drop_nulls, data_kwargs={"df": "inherit"}, kwargs={"threshold": null_thresh}, outputs=["data"]
),
DatasetTransformation(
drop_correlated, data_kwargs={"data": "inherit"}, kwargs={"thresh": correlation_threshold}, outputs=["df"]
),
]
# ⬆️ upload the cleaned datasets
seclea.upload_dataset(dataset=df4,
dataset_name="Auto Insurance Fraud - Cleaned",
metadata=processed_metadata,
transformations=cleaning_transformations)
def fill_nan_const(df, val):
    """Fill NaN values in the dataframe with a constant value."""
    return df.replace(['None', np.nan], val)
# Fill nans in 1st dataset with -1
const_val = -1
df_const = fill_nan_const(df4, const_val)
def fill_nan_mode(df, columns):
    """
    Fill NaNs in the specified columns with the mode of that column.
    Note that we want to make sure not to modify the dataset we passed in but to
    return a new copy.
    We do that by making a copy and specifying deep=True.
    """
    new_df = df.copy(deep=True)
    for col in df.columns:
        if col in columns:
            new_df[col] = df[col].fillna(df[col].mode()[0])
    return new_df
nan_cols = ['collision_type','property_damage', 'police_report_available']
df_mode = fill_nan_mode(df4, nan_cols)
# find columns with categorical data for both dataset
cat_cols = df_const.select_dtypes(include=['object']).columns.tolist()
def encode_categorical(df, cat_cols):
    from sklearn.preprocessing import LabelEncoder

    new_df = df.copy(deep=True)
    for col in cat_cols:
        if col in df.columns:
            le = LabelEncoder()
            le.fit(list(df[col].astype(str).values))
            new_df[col] = le.transform(list(df[col].astype(str).values))
    return new_df
df_const = encode_categorical(df_const, cat_cols)
df_mode = encode_categorical(df_mode, cat_cols)
# Update metadata with new encoded values for the outcome column.
encoded_metadata = {"favourable_outcome": 0,
                    "unfavourable_outcome": 1}
# 🔀 define the transformations - for the constant fill dataset
const_processed_transformations = [
DatasetTransformation(fill_nan_const, data_kwargs={"df": df4}, kwargs={"val": const_val}, outputs=["df"]),
DatasetTransformation(encode_categorical, data_kwargs={"df": "inherit"}, kwargs={"cat_cols":cat_cols}, outputs=["df"]),
]
# ⬆️ upload the constant fill dataset
seclea.upload_dataset(dataset=df_const,
dataset_name="Auto Insurance Fraud - Const Fill",
metadata=encoded_metadata,
transformations=const_processed_transformations)
# 🔀 define the transformations - for the mode fill dataset
mode_processed_transformations = [
DatasetTransformation(fill_nan_mode, data_kwargs={"df": df4}, kwargs={"columns": nan_cols}, outputs=["df"]),
DatasetTransformation(encode_categorical, data_kwargs={"df": "inherit"}, kwargs={"cat_cols": cat_cols}, outputs=["df"]),
]
# ⬆️ upload the mode fill dataset
seclea.upload_dataset(dataset=df_mode,
dataset_name="Auto Insurance Fraud - Mode Fill",
metadata=encoded_metadata,
transformations=mode_processed_transformations)
def get_samples_labels(df, output_col):
    X = df.drop(output_col, axis=1)
    y = df[output_col]
    return X, y
# split the datasets into samples and labels ready for modelling.
X_const, y_const = get_samples_labels(df_const, "fraud_reported")
X_mode, y_mode = get_samples_labels(df_mode, "fraud_reported")
def get_test_train_splits(X, y, test_size, random_state):
    from sklearn.model_selection import train_test_split

    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
# returns X_train, X_test, y_train, y_test
# split into test and train sets
X_train_const, X_test_const, y_train_const, y_test_const = get_test_train_splits(X_const, y_const, test_size=0.2, random_state=42)
X_train_mode, X_test_mode, y_train_mode, y_test_mode = get_test_train_splits(X_mode, y_mode, test_size=0.2, random_state=42)
# 🔀 define the transformations - for the constant fill training set
const_train_transformations = [
DatasetTransformation(
get_test_train_splits,
data_kwargs={"X": X_const, "y": y_const},
kwargs={"test_size": 0.2, "random_state": 42},
outputs=["X_train_const", None, "y_train_const", None],
split="train",
),
]
# ⬆️ upload the const fill training set
seclea.upload_dataset_split(
X=X_train_const,
y=y_train_const,
dataset_name="Auto Insurance Fraud - Const Fill - Train",
metadata={},
transformations=const_train_transformations
)
# 🔀 define the transformations - for the constant fill test set
const_test_transformations = [
DatasetTransformation(
get_test_train_splits,
data_kwargs={"X": X_const, "y": y_const},
kwargs={"test_size": 0.2, "random_state": 42},
outputs=[None, "X_test_const", None, "y_test_const"],
split="test"
),
]
# ⬆️ upload the const fill test set
seclea.upload_dataset_split(X=X_test_const,
y=y_test_const,
dataset_name="Auto Insurance Fraud - Const Fill - Test",
metadata={},
transformations=const_test_transformations)
# 🔀 define the transformations - for the mode fill training set
mode_train_transformations = [
DatasetTransformation(
get_test_train_splits,
data_kwargs={"X": X_mode, "y": y_mode},
kwargs={"test_size": 0.2, "random_state": 42},
outputs=["X_train_mode", None, "y_train_mode", None],
split="train",
),
]
# ⬆️ upload the mode fill train set
seclea.upload_dataset_split(X=X_train_mode,
y=y_train_mode,
dataset_name="Auto Insurance Fraud - Mode Fill - Train",
metadata=processed_metadata,
transformations=mode_train_transformations)
# 🔀 define the transformations - for the mode fill test set
mode_test_transformations = [
DatasetTransformation(
get_test_train_splits,
data_kwargs={"X": X_mode, "y": y_mode},
kwargs={"test_size": 0.2, "random_state": 42},
outputs=[None, "X_test_mode", None, "y_test_mode"],
split="test",
),
]
# ⬆️ upload the mode fill test set
seclea.upload_dataset_split(X=X_test_mode,
y=y_test_mode,
dataset_name="Auto Insurance Fraud - Mode Fill - Test",
metadata={},
transformations=mode_test_transformations)
def smote_balance(X, y, random_state):
    from imblearn.over_sampling import SMOTE

    sm = SMOTE(random_state=random_state)
    X_sm, y_sm = sm.fit_resample(X, y)
    print(
        f"""Shape of X before SMOTE: {X.shape}
Shape of X after SMOTE: {X_sm.shape}"""
    )
    print(
        f"""Shape of y before SMOTE: {y.shape}
Shape of y after SMOTE: {y_sm.shape}"""
    )
    return X_sm, y_sm
# returns X, y
# balance the training sets - creating new training sets for comparison
X_train_const_smote, y_train_const_smote = smote_balance(X_train_const, y_train_const, random_state=42)
X_train_mode_smote, y_train_mode_smote = smote_balance(X_train_mode, y_train_mode, random_state=42)
# 🔀 define the transformations - for the constant fill balanced train set
const_smote_transformations = [
DatasetTransformation(
smote_balance,
data_kwargs={"X": X_train_const, "y": y_train_const},
kwargs={"random_state": 42},
outputs=["X", "y"]
),
]
# ⬆️ upload the constant fill balanced train set
seclea.upload_dataset_split(X=X_train_const_smote,
y=y_train_const_smote,
dataset_name="Auto Insurance Fraud - Const Fill - Smote Train",
metadata={},
transformations=const_smote_transformations)
# 🔀 define the transformations - for the mode fill balanced train set
mode_smote_transformations = [
DatasetTransformation(
smote_balance,
data_kwargs={"X": X_train_mode, "y": y_train_mode},
kwargs={"random_state": 42},
outputs=["X", "y"]
),
]
# ⬆️ upload the mode fill balanced train set
seclea.upload_dataset_split(X=X_train_mode_smote,
y=y_train_mode_smote,
dataset_name="Auto Insurance Fraud - Mode Fill - Smote Train",
metadata={},
transformations=mode_smote_transformations)
Evaluating the Transformations
Now head to platform.seclea.com again to take another look at the Datasets section. You will see that there is much more to look at now.
You can see here how the transformations are used to show you the history of the data and how it arrived in its final state.
Modelling
Now we get started with the modelling. We will run the same models over each of our datasets to explore how the different processing of the data has affected our results.
We will use three classifiers from sklearn for this: DecisionTree, RandomForest and GradientBoosting.
Training
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
classifiers = {
"RandomForestClassifier": RandomForestClassifier(),
"DecisionTreeClassifier": DecisionTreeClassifier(),
"GradientBoostingClassifier": GradientBoostingClassifier()
}
datasets = [
("Const Fill", (X_train_const, X_test_const, y_train_const, y_test_const)),
("Mode Fill", (X_train_mode, X_test_mode, y_train_mode, y_test_mode)),
("Const Fill Smote", (X_train_const_smote, X_test_const, y_train_const_smote, y_test_const)),
("Mode Fill Smote", (X_train_mode_smote, X_test_mode, y_train_mode_smote, y_test_mode))
]
for name, (X_train, X_test, y_train, y_test) in datasets:
    for key, classifier in classifiers.items():
        # cross validate to get an idea of generalisation.
        training_score = cross_val_score(classifier, X_train, y_train, cv=5)
        # train on the full training set
        classifier.fit(X_train, y_train)
        # ⬆️ upload the fully trained model
        seclea.upload_training_run_split(model=classifier, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
        # test accuracy
        y_preds = classifier.predict(X_test)
        test_score = accuracy_score(y_test, y_preds)
        print(f"Classifier: {classifier.__class__.__name__} has a cross-validated training accuracy of {round(training_score.mean() * 100, 1)}% on {name}")
        print(f"Classifier: {classifier.__class__.__name__} has a test accuracy of {round(test_score * 100, 1)}% on {name}")