TEF Quick Start with Titanic


!pip install TEF -U
import TEF

Load dataset

titanic_raw = TEF.load_dataset('titanic_raw')

This famous dataset is merged from Kaggle and seaborn.

We usually start from head(), but is it really possible to understand a dataset from only its first few rows?
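For reference, this is what that standard first look gives you in plain pandas (the toy frame below is mine, standing in for the real data; head() shows only five rows by default):

```python
import pandas as pd

# Toy frame standing in for titanic_raw
df = pd.DataFrame({"survived": [0, 1, 1, 0, 0, 1, 0],
                   "age": [22, 38, 26, 35, 35, 54, 2]})
print(df.head())  # only the first 5 of 7 rows are shown
```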


Set dtypes

titanic = TEF.auto_set_dtypes(titanic_raw)
before dtypes: bool(2), float64(2), int64(4), object(5)
after  dtypes: bool(2), category(3), datetime64[ns](1), float64(2), int64(4), object(1)

possible identifier cols: 1 passenger_id
consider using set_object=[1]

possible category cols: 3 pclass (3 levels), 6 sibsp (7 levels), 7 parch (7 levels)
consider using set_category=[3, 6, 7]
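For intuition, accepting these suggestions by hand would look roughly like this in plain pandas (the tiny frame is my own stand-in; the column names match the ones listed above):

```python
import pandas as pd

df = pd.DataFrame({"passenger_id": [101, 102, 103],
                   "pclass": [3, 1, 3],
                   "sibsp": [1, 0, 0]})

# set_object=[1]: treat the identifier column as a plain object, not a number
df["passenger_id"] = df["passenger_id"].astype(object)

# set_category=[...]: low-cardinality integer columns become categoricals
for col in ["pclass", "sibsp"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```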

If you accept the suggestions:

titanic = TEF.auto_set_dtypes(titanic_raw, set_object=[1], verbose=0)

Generating metadata

Just pass any dataset you are working on to TEF.dfmeta.


Have a description dictionary prepared and start filling it in.


And call TEF.dfmeta again. Now you will have a column explaining the data.

desc = {
    "survived"    : "Survived (1) or died (0)",
    "passenger_id": "Unique ID of the passenger",
    "name"        : "Passenger's name",
    "pclass"      : "Passenger's class (1st, 2nd, or 3rd)",
    "age"         : "Passenger's age",
    "birth"       : "Created from minusing the titanic happened date from Age",
    "sibsp"       : "Number of siblings/spouses aboard the Titanic",
    "parch"       : "Number of parents/children aboard the Titanic",
    "fare"        : "Fare paid for ticket",
    "who"         : "Whether the passenger is man, woman, or child",
    "deck"        : "",
    "embark_town" : "Where the passenger got on the ship (C - Cherbourg, S - Southampton, Q = Queenstown)",
    "alone"       : ""
TEF.dfmeta(titanic, description=desc)

See the relations between target and variables

TEF.plot_1var_by_cat_y(titanic, 'survived')

Fit models

Now fit default classification models with one line.

TEF.fit(titanic, 'survived')
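As a mental model of what a one-line fit does, here is a rough scikit-learn sketch; the particular models and cross-validation setup are my assumptions, not TEF's actual internals:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared Titanic features and target
X, y = make_classification(n_samples=200, random_state=0)

# Fit a few default classifiers and report cross-validated accuracy
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    score = cross_val_score(model, X, y, cv=3).mean()
    print(type(model).__name__, round(score, 3))
```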

Put fitted feature importance into dfmeta

If it looks okay, you can save the feature importance and pass it to dfmeta to get another column. Alternatively, you can train your own model and pass its feature importance here.

feat_imp = TEF.fit(titanic, 'survived', verbose=0, return_agg_feat_imp=True)
TEF.dfmeta(titanic, description=desc, fitted_feat_imp=feat_imp)

Standardize metadata

After you have cleaned and inspected those dirty values, you probably no longer need "possible NaNs", "possible dup lev", and the sample values. Add another argument, standard=True, to remove them and generate the final standardized metadata.

meta = TEF.dfmeta(titanic, description=desc, fitted_feat_imp=feat_imp, standard=True)

Generate a final metadata

Now everything is clean and neat. You can export it to an HTML file so you can distribute it, or keep it open in another window while you work.

TEF.dfmeta_to_htmlfile(meta, filename='titanic_dfmeta.html', head='titanic metadata')
'titanic_dfmeta.html saved'

Or, if you want the raw HTML source to paste somewhere:
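Conceptually, the export step is close to pandas' own HTML rendering. A minimal stand-in, assuming `meta` is a DataFrame (the one-row frame here is mine for illustration):

```python
import pandas as pd

# Stand-in for the metadata table produced earlier
meta = pd.DataFrame({"description": ["Survived (1) or died (0)"]},
                    index=["survived"])

html = meta.to_html()  # raw HTML table source, ready to paste
with open("titanic_dfmeta.html", "w") as f:
    f.write("<h1>titanic metadata</h1>\n" + html)
```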


That’s All!

Feel free to leave any feedback!

How To Fit A Machine Learning Model To A Kaggle Dataset In 8 Lines

import pandas as pd
train_transaction_raw = pd.read_csv('data/ieee-fraud-detection.zip Folder/train_transaction.csv')

import TEF
train_transaction = TEF.auto_set_dtypes(train_transaction_raw, set_object=[0])

TEF.plot_1var_by_cat_y(train_transaction, 'isFraud')

TEF.fit(train_transaction, 'isFraud', verbose=2)

Disclaimer and Caveat

Every ML practitioner knows it is risky to fit a model without understanding the data. The purpose of this article is only to introduce the general usage of TEF, not a detailed exploration. With this code alone, we can gain only a rough understanding of the dataset.

In the following section I will walk through this code for the IEEE fraud detection dataset. A more detailed exploration, feature engineering, and model selection may be published in the future.

Continue reading How To Fit A Machine Learning Model To A Kaggle Dataset In 8 Lines

Swear Words in Review: Regiospecificity and Predictability


This report aims to answer the following two questions. 1. Does the use of swear words show any regiospecificity that results in heterogeneity in the data? 2. Does the use of swear words in customers' reviews have an impact on the ratings they give, and can it predict the stars they award a business? The main analyses are an ANOVA across metropolitan areas and a multiple regression on ratings. Results indicate that swear-word usage differs by region and that 25 of the 45 swear words are predictive of the rating a customer gives. All code and files can be obtained from the link at the end.
Continue reading Swear Words in Review: Regiospecificity and Predictability

An Academic Geek