Tag Archives: TEF

How To Fit A Machine Learning Model To A Kaggle Dataset In 8 Lines

import pandas as pd
train_transaction_raw = pd.read_csv('data/ieee-fraud-detection.zip Folder/train_transaction.csv')

import TEF
train_transaction = TEF.auto_set_dtypes(train_transaction_raw, set_object=[0])

TEF.dfmeta(train_transaction)
TEF.plot_1var(train_transaction)
TEF.plot_1var_by_cat_y(train_transaction, 'isFraud')

TEF.fit(train_transaction, 'isFraud', verbose=2)

Disclaimer and Caveat

Every ML practitioner knows that fitting a model without understanding the data is risky. This article only introduces the general usage of TEF; it is not a detailed exploration. With this code alone, we can get only a rough understanding of the dataset.

In the following sections I will walk through this code using the IEEE fraud detection dataset. A more detailed exploration, with feature engineering and model selection, may be published in the future.


TEF Quick Start with Titanic

Installation

!pip install TEF -U
import TEF
TEF.__version__
'0.7.7'

Load dataset

titanic_raw = TEF.load_dataset('titanic_raw')

This famous dataset is merged from the Kaggle and seaborn versions.

We usually start with head(), but is it possible to understand a dataset from only 5 rows?

titanic_raw.head()

Set dtypes

titanic = TEF.auto_set_dtypes(titanic_raw)
before dtypes: bool(2), float64(2), int64(4), object(5)
after  dtypes: bool(2), category(3), datetime64[ns](1), float64(2), int64(4), object(1)

possible identifier cols: 1 passenger_id
consider using set_object=[1]

possible category cols: 3 pclass (3 levls), 6 sibsp (7 levls), 7 parch (7 levls)
consider using set_category=[3, 6, 7]

If you accept the first suggestion and treat passenger_id as an identifier:

titanic = TEF.auto_set_dtypes(titanic_raw, set_object=[1], verbose=0)
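The kind of suggestion printed above can be approximated in plain pandas. The following is a minimal sketch, not TEF's actual implementation: the helper name `suggest_dtypes`, the `max_levels` threshold, and the toy values are all assumptions made for illustration. It flags integer columns whose values are all distinct as possible identifiers, and integer columns with only a few distinct values as possible categories.

```python
import pandas as pd

def suggest_dtypes(df, max_levels=10):
    """Hypothetical helper: guess identifier and category columns by position."""
    id_cols, cat_cols = [], []
    for idx, col in enumerate(df.columns):
        s = df[col]
        if not pd.api.types.is_integer_dtype(s):
            continue  # only integer columns are treated as ambiguous here
        n_unique = s.nunique(dropna=True)
        if n_unique == len(s):
            id_cols.append(idx)   # every value is distinct: looks like an ID
        elif n_unique <= max_levels:
            cat_cols.append(idx)  # few distinct values: looks like a category
    return id_cols, cat_cols

# Made-up toy rows, not the real Titanic data
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0],
    "passenger_id": [1, 2, 3, 4, 5, 6],
    "pclass": [3, 1, 3, 1, 3, 2],
    "fare": [7.25, 71.28, 7.92, 53.1, 8.05, 8.46],
})
id_cols, cat_cols = suggest_dtypes(toy)
print("possible identifier cols:", id_cols)
print("possible category cols:", cat_cols)
```

On a table this small the heuristic is crude, so the output should be read as a hint to review by hand rather than a decision.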

Generating metadata

Just pass any dataset you are working on to TEF.dfmeta.

TEF.dfmeta(titanic)
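To give a rough idea of what a metadata table contains, here is a minimal pandas sketch. It is not how TEF.dfmeta is implemented; the helper name `simple_meta`, the chosen columns, and the toy values are assumptions for illustration only. It builds one row per variable with its dtype, the number of missing values, and the number of distinct values.

```python
import pandas as pd

def simple_meta(df):
    """Hypothetical helper: one row of summary information per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),            # dtype name as text
        "NaNs": df.isna().sum(),                   # count of missing values
        "unique counts": df.nunique(dropna=True),  # distinct non-missing values
    })

# Made-up toy rows, not the real Titanic data
toy = pd.DataFrame({
    "age": [22.0, None, 26.0],
    "who": ["man", "woman", "woman"],
})
meta_sketch = simple_meta(toy)
print(meta_sketch)
```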

Generate a description dictionary template and start filling it in.

TEF.get_desc_template(titanic)

Then call TEF.dfmeta again with the filled-in dictionary. Now you will have a column describing each variable.

desc = {
    "survived"    : "Survived (1) or died (0)",
    "passenger_id": "Unique ID of the passenger",
    "name"        : "Passenger's name",
    "pclass"      : "Passenger's class (1st, 2nd, or 3rd)",
    "age"         : "Passenger's age",
    "birth"       : "Birth date, created by subtracting age from the date the Titanic sank",
    "sibsp"       : "Number of siblings/spouses aboard the Titanic",
    "parch"       : "Number of parents/children aboard the Titanic",
    "fare"        : "Fare paid for ticket",
    "who"         : "Whether the passenger is man, woman, or child",
    "deck"        : "",
    "embark_town" : "Where the passenger got on the ship (C - Cherbourg, S - Southampton, Q - Queenstown)",
    "alone"       : ""
}
TEF.dfmeta(titanic, description=desc)

See the relations between target and variables

TEF.plot_1var_by_cat_y(titanic, 'survived')
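The numbers behind this kind of plot can be sketched with a pandas groupby. This is only an illustration of the idea, not what TEF.plot_1var_by_cat_y computes internally, and the toy rows are made up rather than taken from the Titanic data. It computes the share of survivors within each level of a categorical variable.

```python
import pandas as pd

# Made-up toy rows, not the real Titanic data
toy = pd.DataFrame({
    "who": ["man", "woman", "child", "man", "woman", "man"],
    "survived": [0, 1, 1, 0, 1, 1],
})

# Mean of a 0/1 target per group is the survival rate in that group
rate_by_who = toy.groupby("who")["survived"].mean()
print(rate_by_who)
```

A large gap between groups suggests the variable carries signal about the target; a flat profile suggests it does not, at least on its own.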

Fit models

Now fit default classification models with one line.

TEF.fit(titanic, 'survived')

Put fitted feature importance into dfmeta

If it looks okay, you can save the feature importance and pass it to dfmeta to add another column. You can also train your own model and pass its feature importance here.

feat_imp = TEF.fit(titanic, 'survived', verbose=0, return_agg_feat_imp=True)
TEF.dfmeta(titanic, description=desc, fitted_feat_imp=feat_imp)

Standardize metadata

After you have looked into and cleaned those dirty values, you probably don’t need “possible NaNs”, “possible dub lev” and the sample columns any more. Add another argument, standard=True, to remove them and generate the final standardized metadata.

meta = TEF.dfmeta(titanic, description=desc, fitted_feat_imp=feat_imp, standard=True)
meta

Generate a final metadata

Now everything is clean and neat. You can export it to an HTML file so that you can distribute it, or just open it in another window while you continue working.

TEF.dfmeta_to_htmlfile(meta, filename='titanic_dfmeta.html', head='titanic metadata')
'titanic_dfmeta.html saved'

Or, if you want the HTML source code to paste somewhere else:

print(meta.data)
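If your metadata lives in a plain pandas DataFrame rather than in the object returned by TEF.dfmeta, a similar export can be sketched with `DataFrame.to_html`. This is an assumption-laden illustration, not TEF's export code: the table contents, the page title, and the file name `toy_dfmeta.html` are made up, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Made-up metadata table for illustration
meta_df = pd.DataFrame(
    {"dtype": ["float64", "category"], "NaNs": [1, 0]},
    index=["age", "who"],
)

html = meta_df.to_html()  # HTML source of the table as a string
page = f"<html><head><title>toy metadata</title></head><body>{html}</body></html>"

# Write the page to a temporary directory so nothing is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "toy_dfmeta.html")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(page)

print(out_path, "saved")
```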

That’s All!

Feel free to leave any feedback!