Apr 24, 2020

Ordinal label encoder of categorical variables

During machine learning model development, categorical variables usually have to be converted into numerical values before they can be passed to machine learning packages. This process is called categorical variable encoding. There are various ways to encode categorical variables as numbers, such as label encoding, one-hot encoding, weight of evidence encoding, etc.

Label encoding is the simplest method: it converts the categorical values into numbers with no ordering requirement. Suppose we have a pandas data frame with a categorical variable "cat" and a target variable "target". The following code snippet assigns a number to each unique value. The benefit compared with sklearn's LabelEncoder is that missing values are treated as a new category, whereas sklearn can raise an error on them.

A simple label encoder converts the values of a categorical variable into labels:

def get_label(df, v):
    # Number the distinct values of column v in order of first appearance.
    # unique() keeps NaN, so missing values get a label of their own.
    values = df[v].unique()
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(get_label(df, 'cat'))
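To make the missing-value handling explicit, here is a minimal sketch that reserves one extra label for NaN/None instead of relying on how a NaN dictionary key is matched. The helper name `label_with_missing` and the toy data are assumptions for illustration, not part of the original post.

```python
import numpy as np
import pandas as pd


def label_with_missing(series):
    """Return integer codes for a Series, giving missing values their own code."""
    # Distinct non-missing values, in order of first appearance.
    values = series.dropna().unique()
    mapping = {value: code for code, value in enumerate(values)}
    # Reserve one extra label for NaN/None.
    missing_code = len(values)
    # map() leaves unmapped (missing) entries as NaN; fill them with the reserved code.
    codes = series.map(mapping).fillna(missing_code).astype(int)
    return codes, mapping, missing_code


df = pd.DataFrame({"cat": ["a", "b", np.nan, "a", "c", None]})
df["cat_label"], mapping, missing_code = label_with_missing(df["cat"])
print(df)
```

Here "a", "b" and "c" receive 0, 1 and 2, and both missing entries receive 3.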

What if we want the labels to be ordered? In that case, the categorical variable is converted into a numerical variable that is rank-ordered with respect to the dependent variable. One way is to compute the mean of the target variable for each value of the categorical variable. Sorting the (cat, mean target) pairs by mean target puts the values of the categorical variable in order.

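def ordinal_label(df, v, target):
    # Mean of the target for each value of the categorical variable
    df_group = df.groupby([v])[target].mean().reset_index()
    # Values sorted by mean target, lowest first
    values = df_group.sort_values([target])[v].values
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))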
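As a worked example of the ordering, the sketch below applies the same idea to a small made-up data frame. The data values are assumptions chosen so the means are easy to check by hand.

```python
import pandas as pd


def ordinal_label(df, v, target):
    # Mean target per category, sorted ascending; the rank becomes the label.
    means = df.groupby(v)[target].mean().sort_values()
    return {value: rank for rank, value in enumerate(means.index)}


df = pd.DataFrame({
    "cat": ["low", "high", "mid", "high", "low", "mid", "high"],
    "target": [0, 1, 0, 1, 0, 1, 1],
})

# Mean target: low = 0.0, mid = 0.5, high = 1.0
mapping = ordinal_label(df, "cat", "target")
df["cat_label"] = df["cat"].map(mapping)
print(mapping)  # {'low': 0, 'mid': 1, 'high': 2}
```

Note that `groupby` drops missing keys by default, so rows with a missing "cat" stay NaN after the map unless they are filled beforehand.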

Or, more compactly:

def ordinal_label(df, cat, target):
    # Mean target per category, sorted ascending, in a single chain
    values = df.groupby([cat])[target].mean().reset_index().\
        sort_values([target])[cat].values
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))

A better strategy is to bin together the values that have a similar mean of target; the bins can be chosen manually.
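One possible way to sketch this binning is to compute the mean of target per category and pass hand-picked bin edges to `pd.cut`, so that categories with similar means share a label. The helper name `binned_ordinal_label`, the edges and the toy data are all assumptions for illustration; in practice the edges would be chosen by inspecting the means.

```python
import pandas as pd


def binned_ordinal_label(df, v, target, edges):
    # Mean target per category.
    means = df.groupby(v)[target].mean()
    # Assign each category to the bin its mean falls into (0, 1, 2, ...).
    bins = pd.cut(means, bins=edges, labels=False, include_lowest=True)
    return bins.astype(int).to_dict()


df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "target": [0, 0, 0, 1, 1, 0, 1, 1],
})

# Manually chosen edges (an assumption): [0, 0.25], (0.25, 0.75], (0.75, 1]
edges = [0.0, 0.25, 0.75, 1.0]
mapping = binned_ordinal_label(df, "cat", "target", edges)
df["cat_label"] = df["cat"].map(mapping)
print(mapping)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2}
```

Categories "b" and "c" both have a mean target of 0.5, so they end up with the same label.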