All About Python: Categorical Variable

Apr 24, 2020

Ordinal label encoder of categorical variables

By DC, April 24, 2020. Labels: Categorical Variable, Label Encoder, Machine Learning, Python

When developing a machine learning model, categorical variables usually have to be converted into numerical ones before most machine learning packages can fit them. This process is called categorical variable encoding. There are various ways to encode categorical variables as numbers, such as label encoding, one-hot encoding, weight of evidence encoding, etc.

Label encoding is the simplest method: it converts the categorical values into numbers without any requirement on their order. Suppose we have a pandas data frame with a categorical variable, "cat", and a target variable, "target". The following code snippet assigns a number to each unique value. The benefit compared to the sklearn LabelEncoder is that missing values are treated as a new category, whereas sklearn may raise an error on them.

A simple label encoder converts the values of the categorical variable into labels

def get_label(df, v):
    # Assumed implementation (a minimal sketch): number the unique values
    # in order of first appearance. unique() keeps NaN as a value, so
    # missing values end up with a label of their own.
    values = df[v].unique()
    return dict(zip(values, range(len(values))))

df['cat_label'] = df['cat'].map(get_label(df, 'cat'))
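To make the handling of missing values concrete, here is a small self-contained sketch on a made-up data frame. The helper name `label_with_missing` and the `"__missing__"` token are assumptions for illustration only, not part of the original post; the idea is simply to replace NaN with an explicit token before numbering the unique values.

```python
import numpy as np
import pandas as pd

def label_with_missing(series, missing_token="__missing__"):
    # Hypothetical helper: turn NaN into an explicit token so that
    # missing values become a category of their own.
    filled = series.fillna(missing_token)
    # pd.unique keeps the order of first appearance.
    values = pd.unique(filled)
    mapping = {value: code for code, value in enumerate(values)}
    return filled.map(mapping), mapping

# Toy data, invented for this example.
toy = pd.DataFrame({"cat": ["a", "b", np.nan, "a", "c", np.nan]})
toy["cat_label"], mapping = label_with_missing(toy["cat"])
print(toy)
print(mapping)
```

The numbers here carry no meaning beyond identifying a category: they only reflect the order in which the values first appear.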

What if we want the labels to follow an order? In that case, the categorical variable is truly converted into a numerical variable that is rank-ordered with respect to the dependent variable. One way is to compute the mean of the target variable for each value of the categorical variable. If we sort the (cat, mean of target) pairs by the mean of target, the values of the categorical variable come out in order.

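def ordinal_label(df, v, target):
    # mean of the target for each value of the categorical variable
    df_group = df.groupby([v])[target].mean().reset_index()
    # category values sorted by ascending mean of target
    values = df_group.sort_values([target])[v].values
    return dict(zip(values, range(len(values))))

df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))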
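As a quick check of the rank ordering, the sketch below runs the same idea on a toy data frame. The data is invented for illustration, and the function is a slightly condensed restatement that sorts the grouped means directly; it is not meant as the definitive implementation.

```python
import pandas as pd

def ordinal_label(df, v, target):
    # Mean of the target per category, sorted from lowest to highest.
    means = df.groupby(v)[target].mean().sort_values()
    # The position in the sorted order becomes the label.
    return {value: rank for rank, value in enumerate(means.index)}

# Toy data: mean of target is 1.0 for "a", 0.0 for "b", 0.5 for "c".
toy = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c"],
    "target": [1, 1, 0, 0, 1, 0],
})
mapping = ordinal_label(toy, "cat", "target")
toy["cat_label"] = toy["cat"].map(mapping)
print(mapping)
```

Note that `groupby` leaves out rows whose category is missing by default, so with this approach missing values are not assigned a label and map to NaN.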

Or, in a more compact way:

def ordinal_label(df, cat, target):
    values = df.groupby([cat])[target].mean().reset_index().\
        sort_values([target])[cat].values
    return dict(zip(values, range(len(values))))

df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))

A better strategy is to bin together the values that have a similar mean of target, which can be done manually.
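One possible way to sketch such binning in code is to cut the per-category means at hand-picked edges with `pd.cut`, so that categories whose means fall in the same interval share a label. The function name `binned_ordinal_label`, the `edges` values and the toy data are all assumptions made for this example; the edges would normally be chosen by inspecting the means.

```python
import pandas as pd

def binned_ordinal_label(df, v, target, edges):
    # Mean of the target per category.
    means = df.groupby(v)[target].mean()
    # Assign each mean to an interval; labels=False returns the bin index.
    # The edges must span every mean, otherwise pd.cut yields NaN.
    bins = pd.cut(means, bins=edges, labels=False, include_lowest=True)
    return bins.astype(int).to_dict()

# Toy data: means are a=1.0, b=0.0, c=0.5, d=0.6.
toy = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c", "d", "d", "d", "d", "d"],
    "target": [1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0],
})
# Hand-picked edges (an assumption): low, middle and high mean of target.
mapping = binned_ordinal_label(toy, "cat", "target", edges=[0.0, 0.25, 0.75, 1.0])
toy["cat_label"] = toy["cat"].map(mapping)
print(mapping)
```

Here "c" and "d" have similar means, so they end up with the same label, while "b" and "a" keep the lowest and highest labels.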
