Please use the following Google Colaboratory project for a complete example you can work through showing how you can integrate the Seclea Platform into your data science workflow. Please copy the Google Colaboratory project into your account and run it from there.
Below is a step-by-step guide, followed by the Google Colaboratory project.
We will run through a sample project showing how to use Seclea's tools to record your data science work and explore the results in the Seclea Platform.
Please create a new project by clicking the "+ Create" button and giving it a name and description.
After you have created your project, it will appear in the All Projects section, as shown below:
Click the project name in the All Projects section to go to the project settings page on the Seclea Platform.
Clicking the settings option in the left panel will take you to that project's settings.
In the project settings, you can select, modify, or set new templates for compliance, risk management, and any internal standard/policy you must comply with. How to use the compliance, risk management, and internal settings is covered in their respective sections.
If you want to include additional team members in the project, you can add them in the Access section of the settings.
These are optional settings; you can set/modify them later in the project.
Integrating with Seclea-AI
You can get the seclea-ai package from pip
!pip install seclea_ai
When you initialise the SecleaAI object, you will be prompted to log in if you haven't already done so. You must use the same Project Name you used earlier and the Organization name provided with your credentials.
from seclea_ai import SecleaAI

# NOTE - use the organization name provided to you when issuing credentials.
seclea = SecleaAI(project_name="Car Insurance Fraud Detection", organization='')
Handling the Data
If you are working through this in Colab, or without reference to the repo, you can download the data for this tutorial - an insurance claims dataset with various features and 1000 samples.
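As a minimal sketch for Colab users, you can fetch the file before loading it; note that the URL below is a placeholder, not the real dataset location - substitute the link given in the repo or on the tutorial page.

import urllib.request

# NOTE: placeholder URL - replace with the actual link to insurance_claims.csv
DATA_URL = "https://example.com/insurance_claims.csv"
urllib.request.urlretrieve(DATA_URL, "insurance_claims.csv")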
Now we can upload the initial data to the Seclea Platform.
The upload should include whatever information we know about the dataset as metadata. There are only two keys to add to the metadata for now - outputs and continuous_features.
You can leave out outputs if you haven't decided what you will be predicting yet, but you should know or be able to find the continuous features at this point.
You can also update these when uploading datasets during/after pre-processing.
import numpy as np
import pandas as pd

# load the data
data = pd.read_csv('insurance_claims.csv', index_col="policy_number")

# define the metadata for the dataset.
dataset_metadata = {
    "outputs": ["fraud_reported"],
    "favourable_outcome": "N",
    "unfavourable_outcome": "Y",
    "continuous_features": [
        "total_claim_amount",
        'policy_annual_premium',
        'capital-gains',
        'capital-loss',
        'injury_claim',
        'property_claim',
        'vehicle_claim',
        'incident_hour_of_the_day',
    ]
}

# ⬆️ upload the dataset - pick a meaningful name here; you'll see it a lot on the platform!
seclea.upload_dataset(dataset=data, dataset_name="Auto Insurance Fraud", metadata=dataset_metadata)
Evaluating the Dataset
After running the above section, head back to the Seclea Platform to take a closer look at the dataset. To do so, navigate to the Datasets section under the Prepare tab.
Personally Identifiable Information (PII) and Format Check
Data Bias Check
Transformations
When using Seclea to record your data science work, you will need to take care with how you handle transformations of the data.
We require that all transformations are encapsulated in a function that takes the data and returns the transformed data. There are a few things to be aware of, so please see the docs for more.
# Create a copy to isolate the original dataset
df1 = data.copy(deep=True)

def encode_nans(df):
    # convert the special characters to nans
    return df.replace('?', np.NaN)

df2 = encode_nans(df1)
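Among the points to be aware of (see the docs for the authoritative list), two show up repeatedly in this tutorial: transformation functions should return a new DataFrame rather than modifying the one passed in, and any parameters should be passed as keyword arguments so they can be recorded later with DatasetTransformation. A rough sketch of the pattern, using a hypothetical drop_columns transformation rather than anything from the seclea_ai library:

# Hypothetical transformation illustrating the pattern - not part of seclea_ai.
def drop_columns(df, columns):
    # work on a copy so the DataFrame passed in is left untouched
    new_df = df.copy(deep=True)
    return new_df.drop(columns=columns)

# The column names are passed as keyword arguments, so the step can later be recorded with, e.g.:
# DatasetTransformation(drop_columns, data_kwargs={"df": df2}, kwargs={"columns": [...]}, outputs=["df"])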
Data Cleaning
We will carry out some pre-processing and generate a few different datasets to see how to track these on the platform. This also means we can train our models on different data and see how that affects performance.
# Drop the columns that are more than some proportion NaN values
def drop_nulls(df, threshold):
    cols = [x for x in df.columns if df[x].isnull().sum() / df.shape[0] > threshold]
    return df.drop(columns=cols)

# We choose 95% as our threshold
null_thresh = 0.95
df3 = drop_nulls(df2, threshold=null_thresh)

def drop_correlated(data, thresh):
    import numpy as np

    # calculate correlations
    corr_matrix = data.corr().abs()
    # get the upper part of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

    # columns with correlation above threshold
    redundant = [column for column in upper.columns if any(upper[column] >= thresh)]
    print(f"Columns to drop with correlation > {thresh}: {redundant}")
    new_data = data.drop(columns=redundant)
    return new_data

# drop columns that are too closely correlated
correlation_threshold = 0.95
df4 = drop_correlated(df3, correlation_threshold)
Upload Intermediate Dataset
Before balancing the datasets, we will upload them to the Seclea Platform.
We define the metadata for the dataset - if there have been any changes since the original dataset, we need to record them here; otherwise, we can reuse the original metadata. In this case, we have dropped some of the continuous feature columns, so we will need to redefine the list of continuous features.
We define the transformations that took place between the last state we uploaded and this dataset. This is a list of functions and arguments. Please have a look at docs.seclea.com for more details on the correct formatting.
from seclea_ai.transformations import DatasetTransformation

# define the updates to the metadata - only changes are updated - here a continuous feature has been dropped
# so now we remove it from the list of continuous features.
processed_metadata = {
    "continuous_features": [
        "total_claim_amount",
        'policy_annual_premium',
        'capital-gains',
        'capital-loss',
        'injury_claim',
        'property_claim',
        'incident_hour_of_the_day',
    ]
}

# 🔀 define the transformations - note the arguments
cleaning_transformations = [
    DatasetTransformation(encode_nans, data_kwargs={"df": df1}, kwargs={}, outputs=["df"]),
    DatasetTransformation(
        drop_nulls, data_kwargs={"df": "inherit"}, kwargs={"threshold": null_thresh}, outputs=["data"]
    ),
    DatasetTransformation(
        drop_correlated, data_kwargs={"data": "inherit"}, kwargs={"thresh": correlation_threshold}, outputs=["df"]
    ),
]

# ⬆️ upload the cleaned dataset
seclea.upload_dataset(dataset=df4, dataset_name="Auto Insurance Fraud - Cleaned", metadata=processed_metadata, transformations=cleaning_transformations)
def fill_nan_const(df, val):
    """Fill NaN values in the dataframe with a constant value"""
    return df.replace(['None', np.nan], val)

# Fill nans in 1st dataset with -1
const_val = -1
df_const = fill_nan_const(df4, const_val)

def fill_nan_mode(df, columns):
    """
    Fills nans in specified columns with the mode of that column.

    Note that we want to make sure not to modify the dataset we passed in but to
    return a new copy. We do that by making a copy and specifying deep=True.
    """
    new_df = df.copy(deep=True)
    for col in df.columns:
        if col in columns:
            new_df[col] = df[col].fillna(df[col].mode()[0])
    return new_df

nan_cols = ['collision_type', 'property_damage', 'police_report_available']
df_mode = fill_nan_mode(df4, nan_cols)

# find columns with categorical data for both datasets
cat_cols = df_const.select_dtypes(include=['object']).columns.tolist()

def encode_categorical(df, cat_cols):
    from sklearn.preprocessing import LabelEncoder

    new_df = df.copy(deep=True)
    for col in cat_cols:
        if col in df.columns:
            le = LabelEncoder()
            le.fit(list(df[col].astype(str).values))
            new_df[col] = le.transform(list(df[col].astype(str).values))
    return new_df

df_const = encode_categorical(df_const, cat_cols)
df_mode = encode_categorical(df_mode, cat_cols)

# Update metadata with new encoded values for the outcome column.
encoded_metadata = {
    "favourable_outcome": 0,
    "unfavourable_outcome": 1,
}

# 🔀 define the transformations - for the constant fill dataset
const_processed_transformations = [
    DatasetTransformation(fill_nan_const, data_kwargs={"df": df4}, kwargs={"val": const_val}, outputs=["df"]),
    DatasetTransformation(encode_categorical, data_kwargs={"df": "inherit"}, kwargs={"cat_cols": cat_cols}, outputs=["df"]),
]

# ⬆️ upload the constant fill dataset
seclea.upload_dataset(dataset=df_const, dataset_name="Auto Insurance Fraud - Const Fill", metadata=encoded_metadata, transformations=const_processed_transformations)

# 🔀 define the transformations - for the mode fill dataset
mode_processed_transformations = [
    DatasetTransformation(fill_nan_mode, data_kwargs={"df": df4}, kwargs={"columns": nan_cols}, outputs=["df"]),
    DatasetTransformation(encode_categorical, data_kwargs={"df": "inherit"}, kwargs={"cat_cols": cat_cols}, outputs=["df"]),
]

# ⬆️ upload the mode fill dataset
seclea.upload_dataset(dataset=df_mode, dataset_name="Auto Insurance Fraud - Mode Fill", metadata=encoded_metadata, transformations=mode_processed_transformations)

def get_samples_labels(df, output_col):
    X = df.drop(output_col, axis=1)
    y = df[output_col]
    return X, y

# split the datasets into samples and labels ready for modelling.
X_const, y_const = get_samples_labels(df_const, "fraud_reported")
X_mode, y_mode = get_samples_labels(df_mode, "fraud_reported")

def get_test_train_splits(X, y, test_size, random_state):
    from sklearn.model_selection import train_test_split

    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )  # returns X_train, X_test, y_train, y_test

# split into test and train sets
X_train_const, X_test_const, y_train_const, y_test_const = get_test_train_splits(X_const, y_const, test_size=0.2, random_state=42)
X_train_mode, X_test_mode, y_train_mode, y_test_mode = get_test_train_splits(X_mode, y_mode, test_size=0.2, random_state=42)
# 🔀 define the transformations - for the constant fill training set
const_train_transformations = [
    DatasetTransformation(
        get_test_train_splits,
        data_kwargs={"X": X_const, "y": y_const},
        kwargs={"test_size": 0.2, "random_state": 42},
        outputs=["X_train_const", None, "y_train_const", None],
        split="train",
    ),
]

# ⬆️ upload the const fill training set
seclea.upload_dataset_split(X=X_train_const, y=y_train_const, dataset_name="Auto Insurance Fraud - Const Fill - Train", metadata={}, transformations=const_train_transformations)

# 🔀 define the transformations - for the constant fill test set
const_test_transformations = [
    DatasetTransformation(
        get_test_train_splits,
        data_kwargs={"X": X_const, "y": y_const},
        kwargs={"test_size": 0.2, "random_state": 42},
        outputs=[None, "X_test_const", None, "y_test_const"],
        split="test",
    ),
]

# ⬆️ upload the const fill test set
seclea.upload_dataset_split(X=X_test_const, y=y_test_const, dataset_name="Auto Insurance Fraud - Const Fill - Test", metadata={}, transformations=const_test_transformations)

# 🔀 define the transformations - for the mode fill training set
mode_train_transformations = [
    DatasetTransformation(
        get_test_train_splits,
        data_kwargs={"X": X_mode, "y": y_mode},
        kwargs={"test_size": 0.2, "random_state": 42},
        outputs=["X_train_mode", None, "y_train_mode", None],
        split="train",
    ),
]

# ⬆️ upload the mode fill train set
seclea.upload_dataset_split(X=X_train_mode, y=y_train_mode, dataset_name="Auto Insurance Fraud - Mode Fill - Train", metadata=processed_metadata, transformations=mode_train_transformations)

# 🔀 define the transformations - for the mode fill test set
mode_test_transformations = [
    DatasetTransformation(
        get_test_train_splits,
        data_kwargs={"X": X_mode, "y": y_mode},
        kwargs={"test_size": 0.2, "random_state": 42},
        outputs=[None, "X_test_mode", None, "y_test_mode"],
        split="test",
    ),
]

# ⬆️ upload the mode fill test set
seclea.upload_dataset_split(X=X_test_mode, y=y_test_mode, dataset_name="Auto Insurance Fraud - Mode Fill - Test", metadata={}, transformations=mode_test_transformations)

def smote_balance(X, y, random_state):
    from imblearn.over_sampling import SMOTE

    sm = SMOTE(random_state=random_state)
    X_sm, y_sm = sm.fit_resample(X, y)
    print(
        f"""Shape of X before SMOTE: {X.shape}
Shape of X after SMOTE: {X_sm.shape}"""
    )
    print(
        f"""Shape of y before SMOTE: {y.shape}
Shape of y after SMOTE: {y_sm.shape}"""
    )
    return X_sm, y_sm  # returns X, y

# balance the training sets - creating new training sets for comparison
X_train_const_smote, y_train_const_smote = smote_balance(X_train_const, y_train_const, random_state=42)
X_train_mode_smote, y_train_mode_smote = smote_balance(X_train_mode, y_train_mode, random_state=42)

# 🔀 define the transformations - for the constant fill balanced train set
const_smote_transformations = [
    DatasetTransformation(
        smote_balance,
        data_kwargs={"X": X_train_const, "y": y_train_const},
        kwargs={"random_state": 42},
        outputs=["X", "y"],
    ),
]

# ⬆️ upload the constant fill balanced train set
seclea.upload_dataset_split(X=X_train_const_smote, y=y_train_const_smote, dataset_name="Auto Insurance Fraud - Const Fill - Smote Train", metadata={}, transformations=const_smote_transformations)

# 🔀 define the transformations - for the mode fill balanced train set
mode_smote_transformations = [
    DatasetTransformation(
        smote_balance,
        data_kwargs={"X": X_train_mode, "y": y_train_mode},
        kwargs={"random_state": 42},
        outputs=["X", "y"],
    ),
]

# ⬆️ upload the mode fill balanced train set
seclea.upload_dataset_split(X=X_train_mode_smote, y=y_train_mode_smote, dataset_name="Auto Insurance Fraud - Mode Fill - Smote Train", metadata={}, transformations=mode_smote_transformations)
Evaluating the Transformations
Now head to platform.seclea.com again to take another look at the Datasets section. You will see that there is much more to look at now.
You can see here how the transformations are used to show you the history of the data and how it arrived in its final state.
Modelling
Now we get started with the modelling. We will run the same models over each of our datasets to explore how the different processing of the data has affected our results.
We will use three models from sklearn for this: DecisionTree, RandomForest, and GradientBoosting classifiers.
Training
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

classifiers = {
    "RandomForestClassifier": RandomForestClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "GradientBoostingClassifier": GradientBoostingClassifier()
}

datasets = [
    ("Const Fill", (X_train_const, X_test_const, y_train_const, y_test_const)),
    ("Mode Fill", (X_train_mode, X_test_mode, y_train_mode, y_test_mode)),
    ("Const Fill Smote", (X_train_const_smote, X_test_const, y_train_const_smote, y_test_const)),
    ("Mode Fill Smote", (X_train_mode_smote, X_test_mode, y_train_mode_smote, y_test_mode))
]

for name, (X_train, X_test, y_train, y_test) in datasets:
    for key, classifier in classifiers.items():
        # cross validate to get an idea of generalisation.
        training_score = cross_val_score(classifier, X_train, y_train, cv=5)

        # train on the full training set
        classifier.fit(X_train, y_train)

        # ⬆️ upload the fully trained model
        seclea.upload_training_run_split(model=classifier, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)

        # test accuracy
        y_preds = classifier.predict(X_test)
        test_score = accuracy_score(y_test, y_preds)

        print(f"Classifier: {classifier.__class__.__name__} has a training score of {round(training_score.mean(), 3) * 100}% accuracy score on {name}")
        print(f"Classifier: {classifier.__class__.__name__} has a test score of {round(test_score, 3) * 100}% accuracy score on {name}")