Showing posts with label Machine Learning. Show all posts

May 13, 2023

Machine Learning public datasets

There are many public datasets available for machine learning that can be used for research, experimentation, and model development. Here are some popular sources of public datasets for machine learning:

UCI Machine Learning Repository: This is a collection of datasets that cover a wide range of topics, including classification, regression, and clustering. The datasets are available in various formats, including CSV, ARFF, and others.

Kaggle Datasets: Kaggle is a platform for data science competitions and also provides a collection of public datasets. The datasets cover various domains, including computer vision, natural language processing, and tabular data.

Google Dataset Search: Google Dataset Search is a search engine for datasets that allows users to find datasets from a variety of sources, including government agencies, universities, and research institutions.

Amazon Web Services (AWS) Public Datasets: AWS provides a collection of public datasets that can be used for machine learning and other applications. The datasets cover a range of domains, including genomics, astronomy, and finance.

Open Data on AWS: This is a collection of public datasets that are hosted on AWS. The datasets cover various domains, including healthcare, finance, and transportation.

Data.gov: This is the US government's open data portal, which provides access to thousands of datasets from various government agencies.

Microsoft Research Open Data: This is a collection of datasets from Microsoft Research that cover various domains, including healthcare, education, and social media.

Jan 4, 2021

All About Python - what can Python do?

 


One of the most commonly asked questions about Python is: what can Python do? As a fast-growing language, Python is used in many domains, including data analytics, data visualization, model development, natural language processing, and many others.

1. Python for data analytics

This domain has usually been dominated by R, SAS, SQL, Matlab, etc. With Python's rich libraries, one can achieve almost everything that these languages and software packages can do, with only a single programming language to learn. Sounds amazing? If you use SAS for data processing, you are familiar with working on tables: summaries, aggregations, and table joins. The Python alternative is 'pandas', which provides a table interface called a data frame and a large set of functions covering most common data processing tasks. If you need to work on arrays, matrices, or high-dimensional data, 'numpy' is the library that provides an array interface to process lists of lists ...
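As a minimal sketch of the summary, aggregation, and join operations mentioned above, the example below uses two made-up tables. The names `sales`, `managers`, and all column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per sale, plus a small lookup table
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "amount": [100, 200, 150, 75],
})
managers = pd.DataFrame({
    "region": ["east", "west"],
    "manager": ["Ann", "Bob"],
})

# Aggregation: total amount per region
summary = sales.groupby("region", as_index=False)["amount"].sum()

# Table join: attach the manager of each region to the summary
joined = summary.merge(managers, on="region", how="left")

# numpy: turn a list of lists into an array and average each column
matrix = np.array([[1, 2], [3, 4]])
col_means = matrix.mean(axis=0)
```

Here `joined` has one row per region with the total amount and the manager, and `col_means` holds the mean of each column of `matrix`.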

2. Python for data visualization
Data visualization, or graphics, is not easy to do in many other languages. This area used to be dominated by software such as Matlab, Origin, etc. With Python, there are many excellent libraries for graphics, such as matplotlib, bokeh, seaborn, etc.
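A small matplotlib sketch of a line chart is shown below. The data is made up, and the non-interactive "Agg" backend and the in-memory buffer are assumptions chosen so that the example runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as plt

# Made-up data: the squares of a few integers
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line chart")
ax.legend()

# Write the figure as PNG into memory; pass a file name to save to disk instead
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```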

3. Python for model development

Model development, or machine learning, is one of the most attractive applications of Python. For research and study, sklearn (scikit-learn) is good enough to cover the major machine learning models and the facilities to build them. Its best feature is the consistent model development API, which has become popular among model developers. For more advanced areas, like deep learning, the most prevalently used libraries are TensorFlow and PyTorch, which originated from Google and Facebook respectively.
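The model development API mentioned above follows a fit/predict pattern. The sketch below illustrates it on synthetic data; the choice of logistic regression and the data sizes are assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumed sizes)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The same three steps apply to most sklearn estimators
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)              # train
predictions = model.predict(X_test)      # predict
accuracy = model.score(X_test, y_test)   # evaluate (mean accuracy)
```

Swapping in another estimator usually means changing only the line that creates `model`.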

4. Python for natural language processing

Natural language processing (NLP) is one of the trending subjects of artificial intelligence. Most programming languages offer little support for this subject. For study and research, the Natural Language Toolkit (NLTK) is a good library to get started. Other libraries, such as spaCy, provide more advanced capabilities for processing natural language.
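As a first taste of NLTK, the sketch below tokenizes a sentence, stems the words, and counts them. The sample text is made up, and only parts of NLTK that need no extra corpus download are used.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "Python makes text processing easy. Processing text with Python is fun."

# Split into word and punctuation tokens, keep lowercase alphabetic words only
tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

# Reduce each word to its stem, e.g. "processing" -> "process"
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Count how often each stem occurs
counts = FreqDist(stems)
```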

Apr 24, 2020

Ordinal label encoder of categorical variables

During machine learning model development, it is usually required to convert categorical variables into numerical ones so that they fit the machine learning packages. This process is called categorical variable encoding. There are various ways to encode categorical variables as numbers, such as label encoding, one-hot encoding, weight of evidence encoding, etc.
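For comparison with the label encoding covered in this post, one-hot encoding can be sketched with pandas as follows. The sample data frame and the `cat` prefix are assumptions for illustration.

```python
import pandas as pd

# Made-up categorical variable
df = pd.DataFrame({"cat": ["a", "b", "c", "a"]})

# One 0/1 indicator column per distinct value of "cat"
one_hot = pd.get_dummies(df["cat"], prefix="cat", dtype=int)
```

Each row of `one_hot` has exactly one 1, located in the column of its category.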

The label encoder is the simplest method: it converts the categorical values into numbers without any order requirement. Suppose we have a pandas data frame with a categorical variable, "cat", and a target variable, "target". The following code snippet assigns a number to each unique value. The benefit compared to the sklearn LabelEncoder is that missing values are treated as a new category, while sklearn may raise errors.

A simple label encoder converts the values of the categorical variable into labels:

def get_label(df, v):
    # unique() keeps missing values, so NaN gets its own label
    values = df[v].unique()
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(get_label(df, 'cat'))

What if we want the labels to be in order? In this case, the categorical variable is truly converted into a numerical variable that is rank-ordered with respect to the dependent variable. One way is to compute the mean of the target variable for each value of the categorical variable. If we sort the cat-target pairs by the target mean, the values of the categorical variable are sorted in order.

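def ordinal_label(df, v, target):
    # Mean of the target for each value of the categorical variable
    df_group = df.groupby([v])[target].mean().reset_index()
    # Sort the values by their target mean, lowest first
    values = df_group.sort_values([target])[v].values
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))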

Or, in a more compact way:

def ordinal_label(df, cat, target):
    values = df.groupby([cat])[target].mean().reset_index().\
        sort_values([target])[cat].values
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))
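To see the ordering concretely, here is a small worked example on made-up data. The function is restated so the block runs on its own, and the data frame contents are assumptions for illustration.

```python
import pandas as pd


def ordinal_label(df, cat, target):
    # Order the categories by their mean target, lowest first
    values = df.groupby([cat])[target].mean().reset_index().\
        sort_values([target])[cat].values
    return dict(zip(values, range(len(values))))


# Mean target per category: a = 0.0, b = 1.0, c = 0.5
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c"],
    "target": [0, 0, 1, 1, 0, 1],
})

mapping = ordinal_label(df, "cat", "target")
df["cat_label"] = df["cat"].map(mapping)
# mapping is {'a': 0, 'c': 1, 'b': 2}
```

Note that groupby drops missing values by default, so rows with a missing "cat" get no label in this version.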

A better strategy is to bin together the values with a similar mean of the target, which can be done manually.
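One possible way to automate a first pass of that binning is sketched below. It is not the manual approach described above: the helper name `binned_ordinal_label`, the use of equal-width bins via `pd.cut`, and the number of bins are all assumptions for illustration.

```python
import pandas as pd


def binned_ordinal_label(df, cat, target, n_bins):
    # Hypothetical helper: categories whose target means fall into the
    # same equal-width bin share one label
    means = df.groupby(cat)[target].mean()
    bins = pd.cut(means, bins=n_bins, labels=False)
    return bins.to_dict()


# Mean target per category: a = 0.0, b = 0.25, c = 0.75, d = 1.0
df = pd.DataFrame({
    "cat": list("aaaabbbbccccdddd"),
    "target": [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
})

mapping = binned_ordinal_label(df, "cat", "target", n_bins=2)
df["cat_label"] = df["cat"].map(mapping)
# mapping is {'a': 0, 'b': 0, 'c': 1, 'd': 1}
```

With equal-width bins some bins may be empty, so the labels are ordered but not necessarily consecutive. The resulting groups are best reviewed by hand before use.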