Ordinal label encoder of categorical variables
When developing a machine learning model, the categorical variables usually have to be converted into numerical ones before they can be fed to machine learning packages. This process is called categorical variable encoding. There are various ways to encode categorical variables as numbers, such as label encoding, one-hot encoding, weight of evidence encoding, etc.
The label encoder is the simplest method: it converts the categorical values into numbers without any ordering requirement. Suppose we have a pandas data frame with a categorical variable "cat" and a target variable "target". The following code snippet assigns a number to each unique value. The benefit compared to sklearn's LabelEncoder is that missing values are treated as a new category, whereas sklearn's LabelEncoder can raise an error on them.
Simple label encoder: convert the values of a categorical variable into labels
def get_label(df, v):
    # Unique values in order of first appearance; NaN, if present, is kept as its own category
    values = df[v].unique()
    return dict(zip(values, range(len(values))))

df['cat_label'] = df['cat'].map(get_label(df, 'cat'))
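The missing-value behaviour described above can also be made explicit. The following is a minimal sketch, not the original snippet: the helper name `label_with_missing` is hypothetical, and it assumes missing values should receive the next free integer code after all observed categories.

```python
import numpy as np
import pandas as pd

def label_with_missing(s):
    # Label only the observed (non-missing) values, in order of first appearance
    values = s.dropna().unique()
    mapping = dict(zip(values, range(len(values))))
    # Missing entries are not in the mapping and come back as NaN;
    # assign them the next free code so they form their own category
    return s.map(mapping).fillna(len(values)).astype(int)

df = pd.DataFrame({"cat": ["a", "b", np.nan, "a"]})
df["cat_label"] = label_with_missing(df["cat"])
```

Here "a" and "b" receive 0 and 1, and the missing entry receives 2. Keeping the missing category out of the dictionary avoids relying on how NaN behaves as a dictionary key.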
What if we want the labels to be ordered? In that case the categorical variable is truly converted into a numerical variable whose values rank-order with the dependent variable. One way is to compute the mean of the target variable for each value of the categorical variable. Sorting the categories by that mean puts them in order, and the position of each category in the sorted list becomes its label.
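def ordinal_label(df, v, target):
    # Mean of the target for each value of the categorical variable
    df_group = df.groupby([v])[target].mean().reset_index()
    # Category values sorted by ascending mean of the target
    values = df_group.sort_values([target])[v].values
    return dict(zip(values, range(len(values))))

df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))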
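To see the ordering at work, here is a small worked example. It is only a sketch under assumptions: the toy data frame is made up for illustration, and the function is written slightly differently from the snippet above (sorting the grouped means directly), though it is meant to produce the same mapping.

```python
import pandas as pd

def ordinal_label(df, v, target):
    # Mean of the target per category value
    means = df.groupby(v)[target].mean()
    # Categories ordered by ascending mean; the rank becomes the label
    ordered = means.sort_values().index
    return {value: rank for rank, value in enumerate(ordered)}

# Toy data (illustrative only): mean target is a=1.0, b=0.0, c=0.5
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c"],
    "target": [1, 1, 0, 0, 1, 0],
})
mapping = ordinal_label(df, "cat", "target")
df["cat_label"] = df["cat"].map(mapping)
```

With this data, "b" has the lowest mean and gets label 0, "c" gets 1, and "a" gets 2. Note that `groupby` drops rows whose key is missing by default, so a missing "cat" value would get no label here and would map to NaN. Categories with tied means are also ordered arbitrarily.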
Or, more compactly:
def ordinal_label(df, cat, target):
    values = (
        df.groupby([cat])[target].mean().reset_index()
        .sort_values([target])[cat].values
    )
    return dict(zip(values, range(len(values))))

df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))
A better strategy is to bin together the values that have a similar mean of target, so that they share one label; this grouping can be done manually.
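As a rough illustration of the binning idea, the sketch below groups categories automatically instead of by hand. Everything specific in it is an assumption: the helper name `binned_ordinal_label`, the `n_bins` parameter, the use of equal-width bins via `pd.cut`, and the toy data. A manual grouping based on domain knowledge may well be preferable.

```python
import pandas as pd

def binned_ordinal_label(df, cat, target, n_bins=2):
    # Mean of the target per category value
    means = df.groupby(cat)[target].mean()
    # Equal-width bins over the range of the means;
    # labels=False returns the integer bin index for each category
    bins = pd.cut(means, bins=n_bins, labels=False)
    return bins.to_dict()

# Toy data (illustrative only): mean target is a=0.0, b=0.25, c=0.75, d=1.0
df = pd.DataFrame({
    "cat": ["a"] * 4 + ["b"] * 4 + ["c"] * 4 + ["d"] * 4,
    "target": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1],
})
mapping = binned_ordinal_label(df, "cat", "target", n_bins=2)
df["cat_label"] = df["cat"].map(mapping)
```

Here "a" and "b" fall into bin 0 and "c" and "d" into bin 1, so categories with a similar mean of target share a label while the bins themselves remain ordered.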