May 13, 2023

Pandas How to Load the File Efficiently

Here are some ways to load data efficiently with Pandas:

  • Use the right data structure. The structure and column types you choose have a big impact on memory use and processing speed. For a large dataset, prefer compact types: for example, a NumPy array for homogeneous numeric data, or a Pandas DataFrame with category and 32-bit numeric columns where the values allow it.
  • Use the right tools. Pandas provides options that help you load data more efficiently. For example, pandas.read_csv() accepts usecols (read only the columns you need), dtype (skip type inference), nrows (read only the first rows), and chunksize (read the file in pieces).
  • Optimize your code. For example, you can wrap loading logic in functions to avoid repeating it, and you can use generators or chunked readers to load data lazily instead of all at once.
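As a minimal sketch of the read_csv() options mentioned above, the example below reads only two columns and declares their types up front. The in-memory CSV and its column names are made up for illustration; with a real file you would pass a path instead of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = (
    "id,name,price,notes\n"
    "1,apple,0.5,fresh\n"
    "2,pear,0.75,ripe\n"
    "3,plum,0.2,small\n"
)

# usecols skips unneeded columns; dtype avoids type inference
# and stores the values in narrower types than the defaults.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "price"],
    dtype={"id": "int32", "price": "float32"},
)

print(df.dtypes)
```

Whether narrower types are safe depends on the value ranges in your data, so treat the chosen dtypes as an assumption to verify.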

Here are some specific examples of how to load data efficiently with Pandas:

  • To load a CSV file, you can use the pandas.read_csv() function. This function will read the CSV file and return a Pandas DataFrame.
Code snippet
import pandas as pd
df = pd.read_csv("data.csv")
  • To load a JSON file, you can use the pandas.read_json() function. This function will read the JSON file and return a Pandas DataFrame.
Code snippet
df = pd.read_json("data.json")
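  • To load data from a SQL database, you can use the pandas.read_sql() function. This function runs a query over a database connection and returns a Pandas DataFrame. The second argument must be a connection object or a SQLAlchemy connection string, not a bare database name. The example below assumes SQLAlchemy is installed and uses a hypothetical SQLite file and table name.
Code snippet
df = pd.read_sql("SELECT * FROM my_table", "sqlite:///database.db")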

By following these tips, you can load data more efficiently with Pandas and improve the performance of your applications.

Here are some additional tips:

  • Use a smaller sample of the data. If you only need a small subset, avoid loading the entire dataset into memory: pass nrows or skiprows to pandas.read_csv() to read only part of the file. The .sample() method selects a random sample, but only from a DataFrame that is already loaded, so it helps with later processing rather than with loading.
  • Use a data cache. If you are loading the same data repeatedly, you can use a data cache to store the data in memory. This can improve the performance of loading the data by avoiding the need to read the data from disk each time.
  • Use a distributed computing framework. If you have a large dataset, you can use a distributed computing framework to load the data in parallel. This can significantly improve the performance of loading the data.
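The tips about partial loading can be sketched as follows. This is an illustration under assumptions, not a benchmark: the CSV is generated in memory, and the chunk size of 4 is arbitrary. nrows reads just the first rows, while chunksize returns an iterator so that only one chunk is held in memory at a time.

```python
import io

import pandas as pd

# Hypothetical data: a single "value" column holding 0..9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

# Read only the first three rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Process the file in chunks of four rows, keeping a running total.
total = 0
rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    rows += len(chunk)

print(len(head), rows, total)  # 3 10 45
```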

Machine Learning public datasets

There are many public datasets available for machine learning that can be used for research, experimentation, and model development. Here are some popular sources of public datasets for machine learning:

UCI Machine Learning Repository: This is a collection of datasets that cover a wide range of topics, including classification, regression, and clustering. The datasets are available in various formats, including CSV, ARFF, and others.

Kaggle Datasets: Kaggle is a platform for data science competitions and also provides a collection of public datasets. The datasets cover various domains, including computer vision, natural language processing, and tabular data.

Google Dataset Search: Google Dataset Search is a search engine for datasets that allows users to find datasets from a variety of sources, including government agencies, universities, and research institutions.

Registry of Open Data on AWS (formerly AWS Public Datasets): AWS hosts a collection of public datasets that can be used for machine learning and other applications. The datasets cover a range of domains, including genomics, astronomy, finance, healthcare, and transportation.

Data.gov: This is the US government's open data portal, which provides access to thousands of datasets from various government agencies.

Microsoft Research Open Data: This is a collection of datasets from Microsoft Research that cover various domains, including healthcare, education, and social media.

Best Machine Learning and Natural Language Processing courses

Machine Learning

  • Machine Learning by Andrew Ng on Coursera
  • Machine Learning A-Z™: Hands-On Artificial Intelligence with Python by Kirill Eremenko and Hadelin de Ponteves on Udemy
  • Introduction to Machine Learning by Stanford University on YouTube
  • Machine Learning for Absolute Beginners by freeCodeCamp on YouTube
  • Machine Learning with TensorFlow by Google on TensorFlow

Natural Language Processing

  • Natural Language Processing with Python by Manning Publications on Coursera
  • Natural Language Processing with Deep Learning (CS224n) by Stanford University on YouTube
  • Speech and Language Processing by Dan Jurafsky and James H. Martin (textbook)
  • Natural Language Processing with spaCy by Manning Publications on Pluralsight
  • Natural Language Processing with Hugging Face Transformers by Hugging Face on YouTube


Dec 21, 2021

Regular Expression with Python

 Metacharacters

[] : A set of characters

\ : Signals a special sequence (can also be used to escape special characters) "\d"

. : Single dot to match any character (except newline character) "he..o"

^ : Starts with "^hello"

$ : Ends with "planet$"

* : Zero or more occurrences "he.*o"

+ : One or more occurrences "he.+o"

? : Zero or one occurrences "he.?o"

{} : Exactly the specified number of occurrences "he{2}o"

| : Either or "x|y" matches either "x" or "y"

() : Capture and group
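A short sketch of the metacharacters above in action, using Python's re module. The sample strings are made up for illustration.

```python
import re

text = "hello planet"

# ^ and $ anchor the pattern to the start and end of the string.
starts = re.search(r"^hello", text) is not None
ends = re.search(r"planet$", text) is not None
print(starts, ends)  # True True

# . matches any single character except a newline.
print(re.findall(r"he..o", text))  # ['hello']

# * is greedy: it extends to the last "o" it can reach.
print(re.findall(r"he.*o", "hello world"))  # ['hello wo']

# + needs at least one occurrence, ? allows zero or one.
print(re.findall(r"l+", text))  # ['ll', 'l']
print(re.findall(r"he.?o", "heo hello helo"))  # ['heo', 'helo']

# {2} repeats the preceding item exactly twice.
print(re.findall(r"he{2}o", "heo heeo"))  # ['heeo']

# | is alternation; () captures groups.
print(re.findall(r"x|y", "xyz"))  # ['x', 'y']
groups = re.search(r"(\w+) (\w+)", text).groups()
print(groups)  # ('hello', 'planet')
```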


Special Sequences

\A Returns a match if the specified characters are at the beginning of the string "\AThe"

\b Returns a match where the specified characters are at the beginning or at the end of a word (the "r" prefix makes Python treat the pattern as a raw string) r"\bain", r"ain\b"

\B Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" prefix makes Python treat the pattern as a raw string) r"\Bain", r"ain\B"

\d Returns a match where the string contains digits (numbers from 0-9) "\d"

\D Returns a match where the string DOES NOT contain digits "\D"

\s Returns a match where the string contains a white space character "\s"

\S Returns a match where the string DOES NOT contain a white space character "\S"

\w Returns a match where the string contains any word characters (letters a to z and A to Z, digits from 0-9, and the underscore _ character) "\w"

\W Returns a match where the string DOES NOT contain any word characters "\W"

\Z Returns a match if the specified characters are at the end of the string "Spain\Z"
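The special sequences above can be tried out with a small sketch. The sample sentence is only an illustration.

```python
import re

text = "The rain in Spain 2021"

# \A and \Z anchor to the very start and end of the string.
print(re.search(r"\AThe", text) is not None)  # True
print(re.search(r"2021\Z", text) is not None)  # True

# \b is a word boundary: no word here starts with "ain",
# but "rain" and "Spain" both end with it.
print(re.findall(r"\bain", text))  # []
print(re.findall(r"ain\b", text))  # ['ain', 'ain']

# \B is the opposite: "ain" preceded by another word character.
print(re.findall(r"\Bain", text))  # ['ain', 'ain']

# \d matches digits, \s matches whitespace characters.
print(re.findall(r"\d+", text))  # ['2021']
print(len(re.findall(r"\s", text)))  # 4

# \W matches non-word characters; the underscore counts as a word character.
print(re.findall(r"\W", "a_b c!"))  # [' ', '!']
```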

Sets or Character Class

[arn] Returns a match where one of the specified characters (a, r, or n) are present

[a-n] Returns a match for any lower case character, alphabetically between a and n

[^arn] Returns a match for any character EXCEPT a, r, and n

[0123] Returns a match where any of the specified digits (0, 1, 2, or 3) are present

[0-9] Returns a match for any digit between 0 and 9

[0-5][0-9] Returns a match for any two-digit number from 00 to 59

[a-zA-Z] Returns a match for any character alphabetically between a and z, lower case OR upper case

[+] In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
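A few of the sets above, sketched with re.findall(). The input strings are made up for illustration.

```python
import re

# [arn] matches any one of a, r, n; [^arn] matches anything else.
print(re.findall(r"[arn]", "rain"))  # ['r', 'a', 'n']
print(re.findall(r"[^arn]", "rain"))  # ['i']

# [a-n] matches lower case letters from a to n.
print(re.findall(r"[a-n]", "python"))  # ['h', 'n']

# [0-5][0-9] matches two-digit numbers from 00 to 59, so "99" is skipped.
print(re.findall(r"[0-5][0-9]", "11:45 and 99"))  # ['11', '45']

# Inside a set, + is a literal plus sign.
print(re.findall(r"[+]", "1+1=2"))  # ['+']
```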

Python regular expression 

The re module in Python provides functions and support for regular expressions.

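re.search() scans the whole string and returns the first match, or None if there is no match.

re.match() matches only at the beginning of the string.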
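The difference between the two functions is easiest to see side by side. A minimal sketch with a made-up sentence:

```python
import re

text = "I love Python"

# re.match() only looks at the start of the string, so this finds nothing.
at_start = re.match(r"Python", text)
print(at_start)  # None

# re.search() scans the whole string and finds the word at index 7.
anywhere = re.search(r"Python", text)
print(anywhere.span())  # (7, 13)
```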

re.sub()

Replace "xxx" or "yyy" with "abc"

>>> import re
>>> old_string = "I xxx Python yyy"
>>> new_string = re.sub(r"xxx|yyy", "abc", old_string)
>>> print(new_string)
I abc Python abc

Replace only the first "xxx" with "abc"

>>> old_string = "abc xxx abc xxx"
>>> new_string = re.sub("xxx", "abc", old_string, count=1)
>>> print(new_string)
abc abc abc xxx

Jan 11, 2021

The buy vs. rent decision - value of money

 


The buy vs. rent decision is a classic case used in finance courses to illustrate the time value of money. There are different versions of this case; the context is much the same and only the numbers change (interest rate, price of the condo, etc.). Here is a solution in Python, built on the functions implemented in the loan amortization article.

The story started with the consideration to buy a property. In May 2013, Rebecca Young completed her MBA and moved to Toronto for a new job in investment banking. There, she rented a spacious, two-bedroom condominium for $3,000 per month, which included parking but not utilities or cable television. In July 2014, the virtually identical unit next door became available for sale with an asking price of $620,000, and Young believed she could purchase it for $600,000. She realized she was facing the classic buy-versus-rent decision. It was time for her to apply some of the analytical tools she had acquired in business school — including “time value of money” concepts — to her personal life...

Summary of the financial details

  • Purchase price of the property: $600,000
  • Down payment: 20% of property price
  • Mortgage interest (Semi-annual compound): 4% 
  • Mortgage periods: 25 years 
  • Owner monthly fees: Condo fee ($1055) + property tax ($300) + maintenance ($50)
  • Purchase closing cost: 1.5% + 1.5% + $2000
  • Rent: $3000 
Summary of scenarios
  1. The condo price remains unchanged
  2. The condo price drops 10% over the next two years, then increases back to its purchase price by the end of 5 years
  3. The condo price increases annually by the annual rate of inflation of 2% over the next 10 years
  4. The condo price increases annually by an annual rate of 5% over the next 10 years.
Summary of selling fees
  • 5% of selling price
  • $2000 flat fee
The loan amount is $600,000 * 80% = $480,000. Because the 4% nominal rate is compounded semi-annually, the equivalent monthly rate is r = (1 + 0.04/2)^(1/6) - 1. The total number of payments is 25 x 12 = 300. The monthly loan payment, calculated below, is $2,524.91. The opportunity cost is the extra monthly payment compared with renting: 2524.91 + 1055 + 300 + 50 - 3000 = 929.91.
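The conversion from a semi-annually compounded nominal rate to a monthly rate can be sketched as a small helper. The function name monthly_rate_semiannual is an assumption made for this example; it is not part of the original code.

```python
def monthly_rate_semiannual(nominal_annual_rate):
    """Effective monthly rate for a nominal annual rate compounded semi-annually.

    Each half-year earns nominal_annual_rate / 2, and a half-year has six
    months, so the monthly growth factor is the sixth root of that factor.
    """
    return (1 + nominal_annual_rate / 2) ** (1 / 6) - 1


r = monthly_rate_semiannual(0.04)
print(round(r, 8))  # 0.00330589
```

Note that this is slightly lower than the simple 0.04/12 = 0.00333333, which assumes monthly compounding.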

def get_monthly_pmt(loan_amt, r, n):
    """ calculate monthly payment
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)


def get_principle(loan_amt, A, r, t):
    """ calculate the remaining principal after t payments
    loan_amt: initial loan amount
    A: monthly payment
    r: monthly interest
    t: number of payments
    """
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r

"""
r = (1 + 0.04/2)^(1/6) - 1 = 0.00330589

>>> get_monthly_pmt(loan_amt=480000, r=0.00330589, n=300)

2524.905347597944
"""

Considering the time value of money, the opportunity cost of buying has two parts:
  • The initial out-of-pocket cost: had it been invested instead, this lump sum would grow over time. It is the down payment plus the closing costs: 600000*20% + 600000*3% + 2000 = $140,000.
  • The extra monthly cost of owning compared with renting: had it been saved instead, the monthly contribution of $929.91 would accumulate over time.
The future value of the opportunity cost after 2, 5 and 10 years can be calculated as below. The interesting observation here is that the buy-or-rent decision depends heavily on how the housing price changes.

def get_future_value_init(deposit, r, n):
    """ Calculate the future value of a single initial deposit
    deposit: initial lump-sum deposit
    r: monthly interest
    n: total number of months
    """
    return deposit*(1 + r)**n


def get_future_value(deposit, r, n):
    """ Calculate the future value of a series of monthly deposits
    deposit: monthly deposit
    r: monthly interest
    n: total number of deposits
    """
    return deposit*((1 + r)**n - 1)/r

for i in [24, 60, 120]:
    a = get_future_value_init(140000, r=0.00330589, n=i)
    b = get_future_value(929.91, r=0.00330589, n=i)
    print(a, b, a+b)
 

"""
151540.5012231941 23187.245216568885 174727.74643976297
170659.2154860795 61600.682143214086 232259.8976292936
208032.62735945804 136691.56848584456 344724.1958453026
"""

Scenario analysis

The principal balance after 2, 5, and 10 years can be calculated as follows. The take-home amount is the selling price minus the selling fee and the principal balance. The net future value is the take-home amount minus the future value of the opportunity cost.

for i in [24, 60, 120]:
    print(get_principle(480000, A=2524.91, r=0.00330589, t=i))
    
"""
456608.96654832666
417857.9213183897
342107.0756292622
"""

The results of the 4 scenarios are summarized below. Each table has one row for each of the 2-, 5-, and 10-year horizons. The final net future value is highly dependent on the property price, so the decision to buy or rent depends on the housing market.

Scenario 1:

Years  Opportunity  Principal  Selling-price  Selling-fee  Take-home  Net-future-value
2      174,728      456,609    600,000        32,000       111,391    -63,337
5      232,260      417,858    600,000        32,000       150,142    -82,118
10     344,724      342,107    600,000        32,000       225,893    -118,831

Scenario 2:

Years  Opportunity  Principal  Selling-price  Selling-fee  Take-home  Net-future-value
2      174,728      456,609    540,000        29,000       54,391     -120,337
5      232,260      417,858    600,000        32,000       150,142    -82,118
10     344,724      342,107    660,000        35,000       282,893    -61,831

Scenario 3:

Years  Opportunity  Principal  Selling-price  Selling-fee  Take-home  Net-future-value
2      174,728      456,609    624,240        33,212       134,419    -40,309
5      232,260      417,858    662,448        35,122       209,468    -22,792
10     344,724      342,107    731,397        38,570       350,720    5,996

Scenario 4:

Years  Opportunity  Principal  Selling-price  Selling-fee  Take-home  Net-future-value
2      174,728      456,609    661,500        35,075       169,816    -4,912
5      232,260      417,858    765,769        40,288       307,623    75,363
10     344,724      342,107    977,337        50,867       584,363    239,639
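The arithmetic behind each table row can be reproduced with a short sketch. The helper names selling_fee and net_future_value are hypothetical and introduced only for this example; the principal and opportunity-cost inputs are the 2-year figures printed earlier, rounded to cents.

```python
def selling_fee(price, rate=0.05, flat=2000):
    """Selling fee: 5% of the selling price plus a $2000 flat fee."""
    return price * rate + flat


def net_future_value(price, principal, opportunity):
    """Return (take-home amount, net future value) for one table row."""
    take_home = price - selling_fee(price) - principal
    return take_home, take_home - opportunity


# Scenario 1, year 2: the price stays at $600,000.
take_home, net = net_future_value(600000, 456608.97, 174727.75)
print(round(take_home), round(net))  # 111391 -63337

# Scenario 4, year 2: the price grows 5% per year for two years.
price_s4 = 600000 * 1.05 ** 2
take_home_s4, net_s4 = net_future_value(price_s4, 456608.97, 174727.75)
print(round(price_s4), round(take_home_s4), round(net_s4))  # 661500 169816 -4912
```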

Jan 10, 2021

Loan or Mortgage amortization function with Python


After some searching on the internet, I was not able to find a simple implementation of a loan or mortgage amortization schedule. In the financial world this is usually done with an Excel template: one uses the formula to compute the monthly payment and the template to work out the schedule. Some more complex tasks are not easy to do with a spreadsheet. For instance, we might want to look up the balance on the amortization schedule to decide when to pay off the loan. This is simple to implement in Python, and it is a nice illustration of how to use Python in finance. I decided to write a loan amortization function as the start of a financial calculator. The results have been cross-checked against financial calculators.

There are generally three steps.

  • Calculate the monthly payment for the loan
  • Calculate the remaining principal for the amortization schedule
  • Calculate the payment schedule

The monthly payment can be easily calculated with the following formula. The detailed derivation of the loan amortization formula can be found on Wikipedia.

$$A = P \, \frac{r (1 + r)^n}{(1 + r)^n - 1}$$

where

  • A = payment amount per period
  • P = initial principal (loan amount)
  • r = interest rate per period
  • n = total number of payments or periods


The first step is to calculate the monthly payment with the loan amortization formula. The inputs are the initial loan amount, the monthly interest rate, and the total length of the loan in months. The published rate is usually the nominal annual rate; with monthly compounding, the monthly rate is simply the nominal rate divided by 12. The Python function is listed below.

def get_monthly_pmt(loan_amt, r, n):
    """ calculate monthly payment
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)

"""
loan_amt = 500,000
r = 0.04/12
n = 300

>>> get_monthly_pmt(loan_amt=500000, r=0.04/12, n=300)

2639.1842014888507
"""

The second step is to calculate the remaining principal after any month t.

def get_principle(loan_amt, A, r, t):
    """ calculate the remaining principal after t payments
    loan_amt: initial loan amount
    A: monthly payment
    r: monthly interest
    t: number of payments
    """
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r
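A quick sanity check of the balance formula, as a sketch: before any payment the balance should equal the loan amount, and after the last payment it should be zero up to floating-point error. The two functions are repeated here only so that the snippet runs on its own; the loan figures are the ones used in this post.

```python
def get_monthly_pmt(loan_amt, r, n):
    """Monthly payment for a loan of loan_amt at monthly rate r over n payments."""
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)


def get_principle(loan_amt, A, r, t):
    """Remaining principal after t payments of size A."""
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r


loan_amt, r, n = 500000, 0.04/12, 300
A = get_monthly_pmt(loan_amt, r, n)

start = get_principle(loan_amt, A, r, 0)   # nothing paid yet
first = get_principle(loan_amt, A, r, 1)   # after the first payment
end = get_principle(loan_amt, A, r, n)     # after the last payment

print(start)            # 500000.0
print(round(first, 2))  # 499027.48
print(abs(end) < 1e-3)  # True
```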

The final step is to calculate the monthly amortization schedule. 
def get_schedule(loan_amt, r, n):
    """ calculate the amortization schedule
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    monthly = []
    A = get_monthly_pmt(loan_amt, r, n)
    last = loan_amt
    for i in range(1, n+1):
        curr = get_principle(loan_amt, A, r, i)
        p = last - curr
        monthly.append([i, p, A - p, curr])
        last = curr
    return pd.DataFrame(monthly, columns=['month', 'principle_pmt', 'interest_pmt', 'principle'])
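The balance lookup mentioned in the introduction (deciding when the loan could be paid off) can be sketched without the data frame by scanning the balances month by month. The function first_month_below and the $10,500 target are assumptions made for this illustration; the two helper functions are repeated so the snippet is self-contained.

```python
def get_monthly_pmt(loan_amt, r, n):
    """Monthly payment for a loan of loan_amt at monthly rate r over n payments."""
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)


def get_principle(loan_amt, A, r, t):
    """Remaining principal after t payments of size A."""
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r


def first_month_below(loan_amt, r, n, target):
    """Return the first month whose remaining balance is at or below target.

    Returns None if the balance never drops that low within n payments.
    """
    A = get_monthly_pmt(loan_amt, r, n)
    for t in range(1, n + 1):
        if get_principle(loan_amt, A, r, t) <= target:
            return t
    return None


# Hypothetical question: when does the balance first fall to $10,500 or less?
month = first_month_below(500000, 0.04/12, 300, 10500)
print(month)  # 296
```

With the data frame returned by get_schedule, the same lookup would be a filter on the 'principle' column.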

Putting everything together in a complete example: the output is a pandas data frame containing the amortization schedule. The balance for any month can be looked up directly from the data frame.
"""
Loan/Mortgage Amortization schedule
Financial calculator
"""

import pandas as pd


def get_monthly_pmt(loan_amt, r, n):
    """ calculate monthly payment
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)
    
def get_principle(loan_amt, A, r, t):
    """ calculate principle for month t
    loan_amt: initial loan amount
    A: monthly payment
    r: monthly interest
    t: number of payments
    """
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r

def get_schedule(loan_amt, r, n):
    """ calculate the amortization schedule
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    monthly = []
    A = get_monthly_pmt(loan_amt, r, n)
    last = loan_amt
    for i in range(1, n+1):
        curr = get_principle(loan_amt, A, r, i)
        p = last - curr
        monthly.append([i, p, A - p, curr])
        last = curr
    return pd.DataFrame(monthly, columns=['month', 'principle_pmt', 'interest_pmt', 'principle'])

"""
loan_amt = 500,000
r = 0.04/12
n = 300

>>> get_monthly_pmt(loan_amt=500000, r=0.04/12, n=300)

2639.1842014888507

>>> df = get_schedule(loan_amt=500000, r=0.04/12, n=300)

month	principle_pmt	interest_pmt	principle
0	1	972.517535	1666.666667	499027.482465
1	2	975.759260	1663.424942	498051.723205
2	3	979.011791	1660.172411	497072.711414
3	4	982.275163	1656.909038	496090.436251
4	5	985.549414	1653.634788	495104.886837
...	...	...	...	...
295	296	2595.634264	43.549938	10469.347082
296	297	2604.286378	34.897824	7865.060704
297	298	2612.967332	26.216869	5252.093371
298	299	2621.677224	17.506978	2630.416148
299	300	2630.416148	8.768054	0.000000
"""
    

Jan 4, 2021

All About Python - what can Python do?

 


One of the most commonly asked questions about Python is: what can Python do? As a fast-growing language, Python is used in many domains, including data analytics, data visualization, model development, natural language processing, and many others.

1. Python for data analytics

This domain is usually dominated by R, SAS, SQL, Matlab, etc. With Python's rich libraries, one can achieve almost everything that these languages and tools can do, with only a single programming language to learn. Sounds amazing? Suppose you use SAS for data processing and are familiar with working on tables: you need to do summaries, aggregations, and table joins. The Python alternative is 'pandas'. Pandas provides a table interface called a data frame, along with a large set of functions for working with it. If you need to work with arrays, matrices, or high-dimensional data, 'numpy' is the library that provides an array interface to process lists of lists ...
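As a small illustration of the table operations mentioned above (aggregation and joining), here is a sketch with pandas. The two tables and their column names are made up for this example.

```python
import pandas as pd

# Hypothetical tables: individual orders and a customer lookup.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})

# Aggregation: total amount per customer.
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()

# Join: attach the customer name to each total.
report = totals.merge(customers, on="customer_id", how="left")

print(report)
```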

2. Python for data visualization
Data visualization, or graphics, is not easy to do in many other languages. This area used to be dominated by software such as Matlab, Origin, etc. With Python, there are many excellent graphics libraries, such as matplotlib, bokeh, and seaborn.

3. Python for model development

Model development, or machine learning, is one of the most attractive applications of Python. For research and study, sklearn covers the major machine learning models and provides the facilities to build them. Its best feature is the model development API, which is now popular among model developers. For more advanced areas, like deep learning, the most widely used libraries are TensorFlow and PyTorch, which originated at Google and Facebook respectively.

4. Python for natural language processing

Natural language processing (NLP) is one of the trending subjects in artificial intelligence. Most programming languages offer little support for this subject. For study and research, the Natural Language Toolkit (NLTK) is a good library to get started with. Other libraries, such as spaCy, provide more advanced capabilities for processing natural language.