How To Fit A Machine Learning Model To A Kaggle Dataset In 8 Lines

import pandas as pd
train_transaction_raw = pd.read_csv('data/ieee-fraud-detection.zip Folder/train_transaction.csv')

import TEF
train_transaction = TEF.auto_set_dtypes(train_transaction_raw, set_object=[0])

TEF.dfmeta(train_transaction)
TEF.plot_1var(train_transaction)
TEF.plot_1var_by_cat_y(train_transaction, 'isFraud')

TEF.fit(train_transaction, 'isFraud', verbose=2)

Disclaimer and Caveat

Every ML practitioner knows that fitting a model without understanding the data is risky. The purpose of this article is only to introduce the general usage of TEF, not to explore the data in detail. With this code alone, we can gain only a rough understanding of the dataset.

In the following sections I will walk through this code for the IEEE fraud detection dataset. A more detailed exploration, feature engineering, and model selection may be published in the future.

Load data

import pandas as pd
train_transaction_raw = pd.read_csv('data/ieee-fraud-detection.zip Folder/train_transaction.csv')

Set data types

import TEF
train_transaction = TEF.auto_set_dtypes(train_transaction_raw)
before dtypes: float64(376), int64(4), object(14)
after  dtypes: bool(1), category(12), float64(376), int64(3), object(2)

possible identifier cols: 0 TransactionID
consider using set_object=[0]

possible category cols: 55 V1 (2 levls), 56 V2 (9 levls), 58 V4 (7 levls), 59 V5 (7 levls), 62 V8 (9 levls), 63 V9 (9 levls), 64 V10 (5 levls), 65 V11 (6 levls), 66 V12 (4 levls), 67 V13 (7 levls), 68 V14 (2 levls), 69 V15 (8 levls), 73 V19 (8 levls), 75 V21 (6 levls), 76 V22 (9 levls), 79 V25 (7 levls), 81 V27 (4 levls), 82 V28 (4 levls), 83 V29 (6 levls), 84 V30 (8 levls), 85 V31 (8 levls), 87 V33 (7 levls), 89 V35 (4 levls), 90 V36 (6 levls), 95 V41 (2 levls), 96 V42 (9 levls), 97 V43 (9 levls), 100 V46 (7 levls), 101 V47 (9 levls), 102 V48 (6 levls), 103 V49 (6 levls), 104 V50 (6 levls), 105 V51 (7 levls), 106 V52 (9 levls), 107 V53 (6 levls), 108 V54 (7 levls), 111 V57 (7 levls), 115 V61 (7 levls), 117 V63 (8 levls), 118 V64 (8 levls), 119 V65 (2 levls), 120 V66 (8 levls), 121 V67 (9 levls), 122 V68 (3 levls), 123 V69 (6 levls), 124 V70 (7 levls), 125 V71 (7 levls), 127 V73 (8 levls), 128 V74 (9 levls), 129 V75 (5 levls), 130 V76 (7 levls), 133 V79 (8 levls), 136 V82 (8 levls), 137 V83 (8 levls), 138 V84 (8 levls), 139 V85 (8 levls), 142 V88 (2 levls), 143 V89 (3 levls), 144 V90 (6 levls), 145 V91 (7 levls), 146 V92 (8 levls), 147 V93 (8 levls), 148 V94 (3 levls), 161 V107 (2 levls), 162 V108 (8 levls), 163 V109 (8 levls), 164 V110 (8 levls), 168 V114 (7 levls), 169 V115 (7 levls), 170 V116 (7 levls), 171 V117 (4 levls), 172 V118 (4 levls), 173 V119 (4 levls), 174 V120 (4 levls), 175 V121 (4 levls), 176 V122 (4 levls), 195 V141 (6 levls), 227 V173 (8 levls), 228 V174 (9 levls), 248 V194 (8 levls), 294 V240 (6 levls), 295 V241 (5 levls), 314 V260 (9 levls), 340 V286 (9 levls), 359 V305 (2 levls)
consider using set_category=[55, 56, 58, 59, 62, 63, 64, 65, 66, 67, 68, 69, 73, 75, 76, 79, 81, 82, 83, 84, 85, 87, 89, 90, 95, 96, 97, 100, 101, 102, 103, 104, 105, 106, 107, 108, 111, 115, 117, 118, 119, 120, 121, 122, 123, 124, 125, 127, 128, 129, 130, 133, 136, 137, 138, 139, 142, 143, 144, 145, 146, 147, 148, 161, 162, 163, 164, 168, 169, 170, 171, 172, 173, 174, 175, 176, 195, 227, 228, 248, 294, 295, 314, 340, 359]

It detects that the first column, TransactionID, is potentially an identifier column, because it has a unique value for every row and its name contains the string “ID”. Add the argument set_object=[0] if you accept this suggestion.
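To make the rule concrete, here is a minimal sketch of such a check in plain pandas. It is not TEF's actual implementation. The helper name possible_identifier_cols and the toy DataFrame are hypothetical, and the only logic it uses is the two conditions described above (every value unique, "ID" in the column name).

```python
import pandas as pd

def possible_identifier_cols(df: pd.DataFrame) -> list:
    """Return positions of columns that look like row identifiers.

    Hypothetical helper: a column qualifies when its name contains "ID"
    and every one of its values is unique.
    """
    found = []
    for pos, name in enumerate(df.columns):
        has_id_in_name = "ID" in str(name)
        all_unique = df[name].is_unique
        if has_id_in_name and all_unique:
            found.append(pos)
    return found

# Toy data (made up for illustration, not the Kaggle file).
toy = pd.DataFrame({
    "TransactionID": [2987000, 2987001, 2987002],  # unique, name has "ID"
    "ProductID": [7, 7, 8],                        # name has "ID", not unique
    "isFraud": [0, 0, 1],                          # no "ID" in the name
})

candidates = possible_identifier_cols(toy)
print(candidates)
```

Only the first column satisfies both conditions here, which mirrors why position 0 is the one suggested for set_object.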

We ignore the suggestions for categorical variables because we don’t know what these columns actually mean. You can add verbose=0 to suppress the printouts.

train_transaction = TEF.auto_set_dtypes(train_transaction_raw, set_object=[0], verbose=0)
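Even though we skip the categorical suggestions here, it can help to see what kind of check might produce such a list. The sketch below is an assumption, not TEF's code: it flags numeric columns with a small number of distinct values, using a cutoff of 9 levels only because no column in the printout above has more than that. The helper name possible_category_cols, the max_levels parameter, and the toy data are hypothetical.

```python
import pandas as pd

def possible_category_cols(df: pd.DataFrame, max_levels: int = 9) -> dict:
    """Map column position -> number of levels for low-cardinality numeric columns.

    Hypothetical helper; the max_levels cutoff is an assumption.
    """
    found = {}
    for pos, name in enumerate(df.columns):
        col = df[name]
        if not pd.api.types.is_numeric_dtype(col):
            continue  # only numeric columns are candidates in this sketch
        n_levels = col.nunique(dropna=True)
        if 2 <= n_levels <= max_levels:
            found[pos] = n_levels
    return found

# Toy data (made up for illustration, not the Kaggle file).
toy = pd.DataFrame({
    "TransactionAmt": [i * 1.5 for i in range(20)],  # 20 distinct values
    "V1": [0.0, 1.0] * 10,                           # 2 levels
    "V12": [0.0, 1.0, 2.0, 3.0] * 5,                 # 4 levels
    "ProductCD": ["W", "C"] * 10,                    # not numeric, skipped
})

suggested = possible_category_cols(toy)
print(suggested)
```

The keys of the returned dict are column positions, the same form the printout uses in its set_category=[...] suggestion.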

Exploration

For exploration, we will use TEF.dfmeta, TEF.plot_1var, and TEF.plot_1var_by_cat_y. Note, however, that part of the output of TEF.dfmeta and TEF.plot_1var duplicates what Kaggle already provides in its Data section. The detail view and column view there show similar information, such as dtype, NaNs, mean, std, and quantiles.

Nevertheless, TEF is built for general use, so these functions are not limited to Kaggle datasets.
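For a dataset that comes without such a summary page, the statistics mentioned above (dtype, NaNs, mean, std, and quantiles) can be approximated with pandas alone. This is only a rough sketch for comparison and does not reproduce what TEF.dfmeta prints. The helper name column_summary and the toy data are hypothetical.

```python
import pandas as pd

def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, NaN count, and basic numeric statistics.

    Hypothetical helper; non-numeric columns get NaN in the statistic columns.
    """
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "NaNs": df.isna().sum(),
    })
    # describe() reports mean, std and the 25/50/75% quantiles by default.
    stats = df.select_dtypes(include="number").describe().T
    return summary.join(stats[["mean", "std", "25%", "50%", "75%"]])

# Toy data (made up for illustration, not the Kaggle file).
toy = pd.DataFrame({
    "TransactionAmt": [10.0, 20.0, None, 40.0],
    "ProductCD": ["W", "C", "W", None],
})

summary_table = column_summary(toy)
print(summary_table)
```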

TEF.dfmeta(train_transaction)