TEF Quick Start with Titanic


!pip install TEF -U
import TEF

Load dataset

titanic_raw = TEF.load_dataset('titanic_raw')

This version of the famous dataset merges the Kaggle and seaborn copies.

We usually start with head(), but can you really understand a dataset from only a few rows?


Set dtypes

titanic = TEF.auto_set_dtypes(titanic_raw)
before dtypes: bool(2), float64(2), int64(4), object(5)
after  dtypes: bool(2), category(3), datetime64[ns](1), float64(2), int64(4), object(1)

possible identifier cols: 1 passenger_id
consider using set_object=[1]

possible category cols: 3 pclass (3 levls), 6 sibsp (7 levls), 7 parch (7 levls)
consider using set_category=[3, 6, 7]

If you accept the identifier suggestion, pass it back in:

titanic = TEF.auto_set_dtypes(titanic_raw, set_object=[1], verbose=0)
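To make the suggestions above more concrete, here is a rough pandas sketch of the kind of check that could flag identifier and category candidates. It is not TEF's implementation: the helper suggest_dtypes, the max_levels threshold, and the toy values are all made up for illustration.

```python
import pandas as pd

# Toy stand-in for a few Titanic columns (hypothetical values, not the real data).
df = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4, 5, 6],
    "pclass": [3, 1, 3, 1, 3, 2],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

def suggest_dtypes(frame, max_levels=3):
    """Flag integer columns that look like identifiers or categories.

    max_levels is an arbitrary threshold chosen for this toy example.
    """
    id_cols, cat_cols = [], []
    for col in frame.columns:
        if not pd.api.types.is_integer_dtype(frame[col]):
            continue
        n_unique = frame[col].nunique()
        if n_unique == len(frame):
            id_cols.append(col)   # every value distinct: likely an identifier
        elif n_unique <= max_levels:
            cat_cols.append(col)  # few distinct values: likely categorical
    return id_cols, cat_cols

id_cols, cat_cols = suggest_dtypes(df)

# Apply the suggestions with plain pandas casts.
casts = {col: "object" for col in id_cols}
casts.update({col: "category" for col in cat_cols})
df = df.astype(casts)
print(df.dtypes)
```

On real data the thresholds need judgment, which is presumably why the report only suggests the casts and leaves the decision to you.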

Generating metadata

Just pass any dataset you are working on to TEF.dfmeta.


Have a description dictionary prepared and start filling it in.


Then call TEF.dfmeta again. Now you will have a column describing each variable.

desc = {
    "survived"    : "Survived (1) or died (0)",
    "passenger_id": "Unique ID of the passenger",
    "name"        : "Passenger's name",
    "pclass"      : "Passenger's class (1st, 2nd, or 3rd)",
    "age"         : "Passenger's age",
    "birth"       : "Birth date, computed by subtracting age from the date the Titanic sank",
    "sibsp"       : "Number of siblings/spouses aboard the Titanic",
    "parch"       : "Number of parents/children aboard the Titanic",
    "fare"        : "Fare paid for ticket",
    "who"         : "Whether the passenger is man, woman, or child",
    "deck"        : "",
    "embark_town" : "Where the passenger got on the ship (C - Cherbourg, S - Southampton, Q - Queenstown)",
    "alone"       : ""
}
TEF.dfmeta(titanic, description=desc)
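Some entries, such as deck and alone, are still blank. A small check in plain Python can list which descriptions remain to be written. The dictionary below is a shortened stand-in for the one above, used only to keep the sketch self-contained.

```python
# Shortened stand-in for the description dictionary (only four entries).
desc = {
    "survived": "Survived (1) or died (0)",
    "fare": "Fare paid for ticket",
    "deck": "",
    "alone": "",
}

# Collect the columns whose description is empty or whitespace only.
missing = [col for col, text in desc.items() if not text.strip()]
print(missing)
```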

See the relations between target and variables

TEF.plot_1var_by_cat_y(titanic, 'survived')
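The plots are not reproduced here. As a rough idea of what relating a categorical target to each variable means, the pandas sketch below computes the survival rate per level on a toy table. The rows are invented for illustration, and this is not how TEF draws its plots.

```python
import pandas as pd

# Toy rows (hypothetical values), not the real Titanic data.
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "pclass":   [3, 1, 3, 1, 3, 3, 2, 2],
    "who":      ["man", "woman", "woman", "woman", "man", "man", "child", "man"],
})

# Share of survivors within each level of each categorical variable.
rates_by_col = {}
for col in ["pclass", "who"]:
    rates_by_col[col] = df.groupby(col)["survived"].mean()
    print(rates_by_col[col], end="\n\n")
```

A variable whose levels show clearly different rates is a candidate predictor. With only a few rows per level, as in this toy table, the rates are too noisy to trust.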

Fit models

Now fit default classification models with one line.

TEF.fit(titanic, 'survived')

Put fitted feature importance into dfmeta

If the results look reasonable, you can save the feature importances and pass them to dfmeta to get another column. You can also train your own model and pass its feature importances here instead.

feat_imp = TEF.fit(titanic, 'survived', verbose=0, return_agg_feat_imp=True)
TEF.dfmeta(titanic, description=desc, fitted_feat_imp=feat_imp)

Standardize metadata

After you have cleaned up and looked into the dirty values, you probably don’t need “possible NaNs”, “possible dub lev” and the sample columns any more. Add the argument standard=True to remove them and generate the final standardized metadata.

meta = TEF.dfmeta(titanic, description=desc, fitted_feat_imp=feat_imp, standard=True)

Generate the final metadata

Now everything is clean and neat. You can export it to an HTML file, so that you can distribute it or keep it open in another window while you keep working.

TEF.dfmeta_to_htmlfile(meta, filename='titanic_dfmeta.html', head='titanic metadata')
'titanic_dfmeta.html saved'

You can also get the HTML source code if you would rather paste it somewhere.
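If the metadata is held as a plain pandas DataFrame (an assumption, since the object returned by TEF.dfmeta may be something else), one generic way to get the markup is DataFrame.to_html(), which returns a string when no output buffer is given. The table below is a made-up two-row stand-in, not real dfmeta output.

```python
import pandas as pd

# Hypothetical two-row metadata table; real dfmeta output has more columns.
meta_table = pd.DataFrame({
    "col_name": ["survived", "fare"],
    "description": ["Survived (1) or died (0)", "Fare paid for ticket"],
})

# With no target buffer, to_html() returns the markup as a string.
html_source = meta_table.to_html(index=False)
print(html_source)
```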


That’s All!

Feel free to leave any feedback!
