TEF Documentation

Documentation of TEF. See the quick start to learn more about it.

auto_set_dtypes

auto_set_dtypes(df, max_lev=10,
  set_datetime=[], set_category=[], set_int=[], set_object=[], set_bool=[],
  set_datetime_by_pattern=r'\d{4}-\d{2}-\d{2}',
  verbose=1)
  • sets a column to datetime if its values match a pattern like '2018-08-08'
    • it is designed for datasets in which all datetime columns share the same format, e.g. 2019-06-06 06:06:06 (such as data downloaded from DOMO)
  • sets a column to category if its number of unique levels is less than max_lev
  • the set_{dtype} args can be used for manual configuration; set_object is handy for ID columns
  • if verbose >= 1, it will also try to detect possible ID columns, by searching for the strings 'id', 'key', or 'number' in the column name where the proportion of unique values is > 0.5
  • it will try to detect possible categorical variables currently stored as int, by nunique < max_lev. There can be a lot of these; suppress the messages with verbose=0
  • set_datetime/set_object etc. can take either column indices or column names; duplicates are ignored.

args

  • df: pandas dataframe
  • max_lev: int, the maximum number of unique levels for a column to be converted to category
  • set_{datetime/category/int/object/bool}: a list of column indices (e.g. [0, 3, 5]) or names; forces these columns to the given dtype
  • set_datetime_by_pattern: a regular expression string; the default is recommended
  • verbose: int/string, 0/False, 1/'summary', or 2/'detailed'; controls the type of printout showing the transformations

return: a modified pd.DataFrame

example
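
A minimal usage sketch (the toy dataframe is invented for illustration; TEF is the module name used throughout, as in TEF.fit):

import pandas as pd
import TEF

# a date-like string column, a low-cardinality string column, and an ID column
df = pd.DataFrame({
    'created': ['2019-06-06', '2019-06-07', '2019-06-08'],
    'status':  ['open', 'closed', 'open'],
    'user_id': ['a1', 'b2', 'c3'],
})
df = TEF.auto_set_dtypes(df, max_lev=10, set_object=['user_id'])
print(df.dtypes)  # expect: created -> datetime64[ns], status -> category, user_id -> object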

dfmeta

dfmeta(df, description=None, max_lev=10, transpose=True, sample=True,
  style=True, color_bg_by_type=True, highlight_nan=0.5, in_cell_next_line=True,
  drop=None,
  check_possible_error=True, dup_lev_prop=0.9,
  fitted_feat_imp=None, plot=True,
  standard=False)
  • returns metadata for the given dataset
  • use .data to obtain the html source code. Use dfmeta_to_htmlfile to save the returned object to an html file
  • by default, for every column it will display:
    • idx shows the index of that column
    • dtype, with the cell background colored by dtype
    • description is a place where you can put your own explanation
    • NaNs shows the number of nulls and their percentage. It is highlighted in red if the percentage is > 0.5; change this with highlight_nan.
    • unique counts shows the number and percentage of unique values in that column. The text is blue if everything is unique (percentage = 100%) and red if everything is the same (count = 1)
    • summary shows
      • for datetime, the quantiles [0% (min), 25%, 50% (median), 75%, 100% (max)]. It displays down to the day level if the range of the series is larger than 1 day, otherwise down to ns
      • for int and float, the quantiles [0% (min), 25%, 50% (median), 75%, 100% (max)], mean, standard error, CV (coefficient of variation, std/mean), and skewness.
        • the skewness is followed by a star (*) if it fails the normality test (skewtest); then, after removing non-positives and taking the log, another skewtest is applied.
        • notice that skewtest automatically removes all nulls, so be cautious when there are a lot of nulls
      • for bool, category, and object, it gives the percentages of all levels if there are fewer than max_lev; for columns with more levels, it summarizes the rest as an "Other" level
    • summary plot calls plot_1var for each column. In my Jupyter notebook, double-clicking it enlarges it
    • possible NaNs tries to detect potential nulls caused by hand-coded values; for instance, sometimes a space ' ' or the string 'nan' actually means a NaN. It checks 'nan', 'need', 'null', spaces, and characters for now. Disable it with check_possible_error=False.
    • possible dup lev tries to detect potentially duplicated levels using fuzzywuzzy, e.g., 'object1111' that should actually be the same value as 'object111' because of a typo. The threshold is defined by dup_lev_prop; it is multiplied by 100, passed to ratio, partial_ratio, token_sort_ratio, and token_set_ratio in fuzzywuzzy, and a pair is displayed if any of them is larger than the threshold. See the fuzzywuzzy docs for details.
    • the last 3 columns are randomly sampled from the dataset, since we humans always like an example. Adjust this with sample

args

  • df: pandas dataframe
  • description: dict, where keys are column names and values are the descriptions for those columns; can contain html code
  • max_lev: int, the maximum acceptable number of unique levels
  • transpose: bool, if True, the dataset's columns stay as columns in the output
  • sample:
    • True: sample 3 rows
    • False: don’t sample
    • ‘head’: use first 3 rows
    • int: sample int rows
  • style: bool, if True, return a styled dataframe rendered as html; if False, return a plain pandas dataframe instead, which overrides color_bg_by_type, highlight_nan, and in_cell_next_line
  • color_bg_by_type: bool, color the cell background by dtype, column by column. Forced to False if style=False
  • highlight_nan: float in [0, 1] or False, the proportion of NaNs above which to highlight. Forced to False if style=False
  • in_cell_next_line: bool, if True, use <br/> to separate elements in a list; if False, use ', '
  • drop: columns (or rows if transpose=True) to drop; dropping NaNs and dtypes is not supported yet
  • check_possible_error: bool, check possible NaNs and duplicate levels or not
  • dup_lev_prop: float [0, 1], the threshold that will be passed to fuzzywuzzy
  • fitted_feat_imp: pd.Series with column names as its index. Fitted feature importance, which can be generated by TEF.fit or your own fitted model. See the quick start for usage
  • plot: whether to plot every variable or not. Calls plot_1var.
  • standard: bool, whether to use the standard settings. If True, sets check_possible_error=False and sample=False
    • Notice it doesn't affect feat_imp and plot.

return: an IPython.core.display.HTML object if style=True (default); a pd.DataFrame if style=False. Displays automatically in a Jupyter notebook. Use .data to obtain the html source code. See the example below.

example
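
A minimal sketch of typical usage (the file name and the description dict are hypothetical):

import pandas as pd
import TEF

df = pd.read_csv('dataset.csv')                    # hypothetical input file
desc = {'status': 'current state of the record'}   # your own column explanations
meta = TEF.dfmeta(df, description=desc, max_lev=10)
meta              # displays the styled table automatically in a Jupyter notebook
html = meta.data  # the raw html source, as noted above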

get_desc_template

get_desc_template(df, var_name='desc', suffix_idx=False)

A function that takes the original dataframe and prints a description template for the user to fill in.

args

  • df: pd.DataFrame
  • var_name: the variable name for the generated dictionary
  • suffix_idx: if True, appends '# idx' to each line, which can be useful when searching in a large dataset

return: None. The template will be printed.

example
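
A minimal sketch (the toy dataframe is invented; the printed skeleton is meant to be copied into your code and filled in):

import pandas as pd
import TEF

df = pd.DataFrame({'created': ['2019-06-06'], 'status': ['open']})  # toy data
TEF.get_desc_template(df, var_name='desc', suffix_idx=True)
# copy the printed `desc = {...}` skeleton, fill in the values,
# then pass it to dfmeta(df, description=desc)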

get_desc_template_file

get_desc_template_file(df, filename='desc.py', var_name='desc', suffix_idx=False)

Similar to the above, but saves a .py file to the working directory.
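
For instance (a minimal sketch; the toy dataframe and file name are placeholders):

import pandas as pd
import TEF

df = pd.DataFrame({'status': ['open', 'closed']})  # toy data
TEF.get_desc_template_file(df, filename='desc.py')
# writes desc.py to the working directory; fill it in and import the dict back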

dfmeta_to_htmlfile

dfmeta_to_htmlfile(styled_df, filename, head='')

save the styled meta dataframe to an html file

args

  • styled_df: IPython.core.display.HTML, the object returned by dfmeta
  • filename: string, can include a file path, e.g., 'dataset_dictionary.html'
  • head: the header of that html file (in an h1 tag)
  • original_df: the original dataframe that was passed to dfmeta; used to generate a verbose printout at the beginning of the file; can be omitted

return: None

example
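
A minimal sketch chaining it with dfmeta (the toy dataframe, file name, and header are placeholders):

import pandas as pd
import TEF

df = pd.DataFrame({'status': ['open', 'closed', 'open']})  # toy data
meta = TEF.dfmeta(df)
TEF.dfmeta_to_htmlfile(meta, filename='dataset_dictionary.html', head='My dataset')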

summary

summary(s, max_lev=10, br_way=', ', sum_num_like_cat_if_nunique_small=5)

A function that takes a series and returns a summary string, same as the one you see in dfmeta.

args

  • s: pandas.Series
  • max_lev: the max number of levels for which to display counts, for category and object; the rest are displayed as 'other'
  • br_way: the way to break lines; use <br/> to pass the result to html
  • sum_num_like_cat_if_nunique_small: int, for int dtype, if the number of unique levels is smaller than this, it will be summarized like a category

return: a string

example
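
A minimal sketch (the series is invented; expect level percentages like those shown in dfmeta):

import pandas as pd
import TEF

s = pd.Series(['a', 'a', 'b', None], name='grade', dtype='category')
print(TEF.summary(s, max_lev=10, br_way=', '))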

possible_dup_lev

possible_dup_lev(series, threshold=0.9, truncate=False)

The function used to produce the possible dup lev column in dfmeta. It simply saves a pair of strings if

any([fuzz.ratio(l[i], l[j]) > threshold,
     fuzz.partial_ratio(l[i], l[j]) > threshold,
     fuzz.token_sort_ratio(l[i], l[j]) > threshold,
     fuzz.token_set_ratio(l[i], l[j]) > threshold])

It skips object dtype columns if the number of levels is more than 100. You can set truncate=False or set the dtype of that column to category to force it to check, but it will take a while. For category, it checks no matter how many unique levels there are, so be mindful before setting the dtype.

args

  • series: pd.Series
  • threshold: the threshold passed to fuzzywuzzy
  • truncate: if True, truncates the returned string if it is longer than 1000 characters. It is set to True when called from dfmeta.

return: a string
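
example

A minimal sketch (the series is invented; with threshold=0.9 the near-duplicate pair below should be flagged):

import pandas as pd
import TEF

s = pd.Series(['object111', 'object1111', 'widget'], dtype='category')
print(TEF.possible_dup_lev(s, threshold=0.9))  # expect the object111 / object1111 pair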

plot_1var

plot_1var(df, max_lev=20, log_numeric=True, cols=None, save_plt=None)

plots one figure for every column, according to its dtype

args

  • df: pandas dataframe
  • max_lev: skip a column if it has too many levels; not needed if you used the auto_set_dtypes function
  • log_numeric: bool, plot two more plots for numerical columns with the log taken
  • cols: a list of int, the columns to plot; specify it if you don't want to plot all columns; can be used with the save_plt arg
  • save_plt: string, if not None, saves every plot to the working directory, using the string as the filename prefix; a folder prefix is okay but you need to create the folder yourself first

return: None

example
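
A minimal sketch (the toy dataframe is invented; remember to create the plots/ folder before using save_plt):

import pandas as pd
import TEF

df = pd.DataFrame({'age': [22, 35, 41, 29], 'group': ['a', 'b', 'a', 'b']})  # toy data
TEF.plot_1var(df)                                   # one figure per column
TEF.plot_1var(df, cols=[0], save_plt='plots/eda_')  # only column 0, saved with a prefix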

plot_1var_by_cat_y

plot_1var_by_cat_y(df, y, max_lev=20, log_numeric=True,
    kind_for_num='boxen')

plots one figure for every column, against the given dependent variable y.

Note that saving is not implemented yet, nor are datetime columns; cat_y means it can only handle a categorical y.

args

  • df: pandas dataframe
  • y: string, col name of the dependent var
  • max_lev: skip a column if it has too many levels; not needed if you used the auto_set_dtypes function
  • log_numeric: bool, take the log on the y axis if it is a numerical variable; notice the 0's and negatives are removed automatically
  • kind_for_num: string, 'boxen', 'box', 'violin', 'strip' (not recommended for big datasets), or 'swarm' (not recommended for big datasets), the type of plot for numerical variables

example
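
A minimal sketch (the toy dataframe is invented; y must be categorical):

import pandas as pd
import TEF

df = pd.DataFrame({'age': [22, 35, 41, 29], 'group': ['a', 'b', 'a', 'b']})  # toy data
df['group'] = df['group'].astype('category')
TEF.plot_1var_by_cat_y(df, y='group', kind_for_num='boxen')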

fit

fit(df, y_name, verbose=1, max_lev=10, transform_date=True, transform_time=False, impute=True, 
    CV=5, class_weight='balanced', use_metric=None, return_agg_feat_imp=False)

A universal function that detects the data type of the prediction variable (y_name) in the dataframe (df) and calls the corresponding fitting function.

Notice that for now it only deals with classification problems, i.e., it will only call fit_classification, for category and bool y.

args

  • df: pandas dataframe
  • y_name: string, column name of y variable, should be included in the df
  • verbose: int, 0 to 3, control the amount of printouts
  • max_lev: the maximum number of levels when converting categorical variables to dummies
  • transform_date: bool, whether to transform all datetime variables to year, month, week, dayofweek
  • transform_time: bool, whether to transform all datetime variables to hour, minute, second
  • impute: bool, whether to impute missing values; if False, every row that contains NaNs is dropped
  • CV: int, the number of folds for cross validation
  • class_weight: string, 'balanced' or None; the type of weighting for classification, passed to LogisticRegression, RandomForestClassifier, LinearSVC, and XGBClassifier
  • use_metric: string, 'f1' or 'f1_weighted' for classification tasks, 'neg_mean_squared_error' or 'r2' for regression; the metric used to select the final model after model comparison. 'f1' can be applied only to binary classification, while 'f1_weighted' can be used for multiclass. The defaults are 'f1_weighted' for classification and 'neg_mean_squared_error' for regression
  • return_agg_feat_imp: bool, whether to return a series of aggregated feature importances, which can then be passed to dfmeta

This function calls the following functions in order, which should be intuitive to understand from their names

  • data_preprocess
    • print the distribution bar plot of y
    • select X by dtypes
    • remove columns with too many levels (max_lev)
    • engineer features from datetime variables
    • impute missing values or drop
    • print the distribution of y again
    • convert all bool and category to dummies
  • model_selection
    • does CV-fold cross validation on 4 candidate models
      • for now, the defaults are
      • for classification:
        • LogisticRegression(random_state=random_state, class_weight=class_weight, solver='lbfgs', multi_class='auto')
        • RandomForestClassifier(random_state=random_state, class_weight=class_weight, n_estimators=100),
        • LinearSVC(random_state=random_state, class_weight=class_weight)
        • XGBClassifier(random_state=random_state, scale_pos_weight=y.value_counts(normalize=True).iloc[0] if class_weight=='balanced' and binary else 1)
      • for regression:
        • LinearRegression()
        • LassoCV(random_state=random_state, cv=CV)
        • RandomForestRegressor(random_state=random_state)
        • XGBRegressor(random_state=random_state, objective='reg:squarederror')
    • plot boxplots for them
  • train_test_CV
    • train using the best model from above again, with another random state
  • classification_result
    • print classification report and scores
  • coef_to_feat_imp
    • transform the coefficient matrix into a single feature importance array by taking the absolute maximum, if the model uses coefficients (e.g., logistic regression)
  • plot_detailed_feature_importance or plot_summary_feature_importance
    • plot_detailed_feature_importance tries to plot all feature importances for a multi-class problem
    • plot_summary_feature_importance only plots the abs max one
  • agg_feat_imp if return_agg_feat_imp is True
    • transform the dummies back; e.g., if the variable "class" contains 3 levels "A", "B", "C", it was converted to the dummy variables "class_A", "class_B", "class_C", each with its own feature importance. Here it takes the abs max and converts them back to a single "class".
    • warning: it uses column names for matching, so it can cause confusion when different columns share a pattern; e.g., a column like "cla_A" will be counted toward "class"

example
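
A minimal sketch feeding the aggregated feature importance back into dfmeta (the toy dataframe is invented and far smaller than anything you would actually fit on):

import pandas as pd
import TEF

df = pd.DataFrame({'age':    [22, 35, 41, 29, 50, 33],
                   'income': [1, 4, 5, 2, 6, 3],
                   'churn':  [True, False, False, True, False, True]})  # toy data; y is bool
feat_imp = TEF.fit(df, y_name='churn', CV=5, return_agg_feat_imp=True)
TEF.dfmeta(df, fitted_feat_imp=feat_imp)  # feed the importance back into the metadata view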

Utility Functions

reorder_col

reorder_col(df, to_move, after=None, before=None)

Reorder the columns by specifying the one to move and where. At least one of the arguments after or before should be given. Returns the modified dataframe.

args

  • df: pandas dataframe
  • to_move: string, the name of the column to be moved
  • after/before: string, the name of the column that to_move should be placed after/before

return: pd.DataFrame

example
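
A minimal sketch (the column names are placeholders):

import pandas as pd
import TEF

df = pd.DataFrame(columns=['a', 'b', 'c'])
df = TEF.reorder_col(df, to_move='c', after='a')  # expected column order: a, c, b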

rename_cols_by_words

rename_cols_by_words(df, words=[], mapper={}, verbose=1)

NOTE: Think twice before using this function! Sometimes column names get too long with this logic; in that case, I personally wouldn't specify any words.

A function that renames the columns by

  • replacing spaces ' ' with '_'
    • so that you can use df.col_name instead of df['col_name']
  • renaming column names using the mapper dictionary
  • detecting the given words in the column names and separating them word by word with '_'

args

  • df: pandas dataframe
  • words: list of strings, words that should be detected and separated
  • mapper: dict, where keys are the column names before renaming and values are the names after
  • verbose: int, 0, 1, or 2, how many messages you want printed

return: pd.DataFrame

example
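
A minimal sketch (the toy column names are invented; the expected renames are my reading of the rules above):

import pandas as pd
import TEF

df = pd.DataFrame(columns=['Created At', 'userid'])
df = TEF.rename_cols_by_words(df, words=['id'], mapper={'Created At': 'created'})
# expect something like: 'Created At' -> 'created', 'userid' -> 'user_id'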

convert_column_list

convert_column_list(columns, idx_or_labels, to_idx_only=False, to_label_only=False)

Convert a list of either indices or labels of columns to either indices or labels.

args

  • columns: the columns of a pandas dataframe, i.e., df.columns
  • idx_or_labels: list, can be a mix of indices and labels of columns
  • to_idx_only: if True, convert idx_or_labels all to indices
  • to_label_only: if True, convert idx_or_labels all to labels

If to_idx_only and to_label_only are both False (default), will convert index to label and label to index.

return: list

example
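
A minimal sketch (the column labels are placeholders; the commented results follow the default index/label swap described above):

import pandas as pd
import TEF

df = pd.DataFrame(columns=['a', 'b', 'c'])
TEF.convert_column_list(df.columns, [0, 'c'])                    # presumably ['a', 2]
TEF.convert_column_list(df.columns, [0, 'c'], to_idx_only=True)  # presumably [0, 2]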

ct

ct(s1, s2, style=True, col_name=None, sort=False, head=False)

An enhancement of pd.crosstab; generates counts and proportions (normalize='index') using pd.crosstab, with the cell background colored by column.

args

  • s1: pandas.Series
  • s2: pandas.Series
  • style: bool, to color the background or not
  • col_name: list, the same length as s2.unique(); renames the columns
  • sort: tuple or bool, sort the output dataframe. If sort == True, it is equivalent to sort=(s2.name, 'count', 'All'); other usages look like sort=(s2.name, 'proportion', True)
  • head: int or False, the number of rows to return from the head

return: pandas.io.formats.style.Styler

example
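
A minimal sketch (the two series are invented):

import pandas as pd
import TEF

s1 = pd.Series(['a', 'a', 'b', 'b'], name='group')
s2 = pd.Series(['x', 'y', 'y', 'y'], name='label')
TEF.ct(s1, s2, sort=True)  # counts and row proportions, background colored by column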

set_relation

set_relation(s1, s2, plot=True)

Return the union, intersection, and differences of two series, and plot them.

args

  • s1: pandas.Series
  • s2: pandas.Series
  • plot: bool, plot or not

return: pd.Series

example
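
A minimal sketch (the two series are invented):

import pandas as pd
import TEF

s1 = pd.Series(['a', 'b', 'c'])
s2 = pd.Series(['b', 'c', 'd'])
TEF.set_relation(s1, s2, plot=True)  # union {a, b, c, d}, intersection {b, c}, differences {a} and {d}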

correspondence

correspondence(s1, s2, verbose=1, fillna=True)

Return the counts (and keys) of 1-1 correspondences, 1-m (one-to-many), m-1, and m-m.

args

  • s1: pandas.Series
  • s2: pandas.Series
  • verbose: int, the amount of printouts
  • fillna: bool, if False, every NaN is treated as an independent value (Python's default, nan != nan). This may cause extremely heavy computation for a big dataset (about 2h for a million elements).

return: dictionary

example
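
A minimal sketch (the two series are invented):

import pandas as pd
import TEF

s1 = pd.Series(['u1', 'u1', 'u2'])
s2 = pd.Series(['x', 'y', 'z'])
TEF.correspondence(s1, s2, verbose=1)  # u1 maps to two values (1-m); u2 maps to one (1-1)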
