
Showing posts with label Python. Show all posts

May 13, 2023

Pandas How to Load the File Efficiently

Here are some ways to load data efficiently with Pandas:

  • Use the right data structure. The structure and column types you choose affect both load time and memory use. For a large dataset, prefer compact column types in a Pandas DataFrame (or a NumPy array for purely numeric data) over generic Python objects.
  • Use the right tools. The Pandas readers expose options that speed up loading. For example, pandas.read_csv() accepts usecols to read only some columns, dtype to skip type inference, and chunksize to read the file in pieces.
  • Optimize your code. Wrap repeated loading logic in functions to avoid duplicating code, and use generators or chunked reads to load data lazily instead of all at once.
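
The read_csv() options mentioned above can be sketched as follows. This is a minimal illustration, not a benchmark: the CSV content and its column names are made up, and an in-memory string stands in for a large file on disk. usecols, dtype, and chunksize are existing parameters of pandas.read_csv().

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = (
    "id,city,price,notes\n"
    "1,Paris,10.5,a\n"
    "2,Rome,7.25,b\n"
    "3,Paris,3.0,c\n"
    "4,Rome,8.0,d\n"
)

# Read only the needed columns and declare compact dtypes up front,
# so pandas does not have to infer types or store unused columns.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "city", "price"],
    dtype={"id": "int32", "city": "category", "price": "float32"},
)

# Process the file lazily, two rows at a time, keeping only a running total.
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["price"], chunksize=2):
    total += chunk["price"].sum()
```

Reading in chunks keeps only one piece of the file in memory at a time, which matters when the full file does not fit in memory.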

Here are some specific examples of how to load data efficiently with Pandas:

  • To load a CSV file, you can use the pandas.read_csv() function. This function will read the CSV file and return a Pandas DataFrame.
Code snippet
import pandas as pd
df = pd.read_csv("data.csv")
  • To load a JSON file, you can use the pandas.read_json() function. This function will read the JSON file and return a Pandas DataFrame.
Code snippet
df = pd.read_json("data.json")
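  • To load data from a SQL database, you can use the pandas.read_sql() function. This function runs a query over an open database connection and returns a Pandas DataFrame. The example assumes a SQLite file named database.db that contains a table named my_table.
Code snippet
import sqlite3
df = pd.read_sql("SELECT * FROM my_table", sqlite3.connect("database.db"))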

By following these tips, you can load data more efficiently with Pandas and improve the performance of your applications.

Here are some additional tips:

  • Use a smaller sample of the data. If you only need part of the data, limit what is read: the nrows option of pandas.read_csv() reads only the first rows of a file. The .sample() method selects a random sample, but only from a DataFrame that is already in memory, so it does not by itself avoid loading the entire dataset.
  • Use a data cache. If you are loading the same data repeatedly, you can use a data cache to store the data in memory. This can improve the performance of loading the data by avoiding the need to read the data from disk each time.
  • Use a distributed computing framework. If you have a large dataset, you can use a distributed computing framework to load the data in parallel. This can significantly improve the performance of loading the data.

Jan 10, 2021

Loan or Mortgage amortization function with Python


After some searching on the internet, I was not able to find a simple implementation of a loan or mortgage amortization schedule. In the financial world, this is usually done with an Excel template: one uses the formula to compute the monthly payment and the template to work out the schedule. Some more complex tasks are not easy to do in an Excel sheet. For instance, we might want to look up the balance on the amortization schedule to decide when to pay off the loan. This is simple to implement in Python, and it is a nice illustration of how to use Python in finance. I decided to take a shot at a loan amortization function as the start of a financial calculator. The results have been cross-checked against financial calculators.

There are generally three steps.

  • Calculate the monthly payment for the loan
  • Calculate the remaining principal for the amortization schedule
  • Calculate the payment schedule

The monthly payment can be calculated with the following formula. The detailed derivation of the loan amortization formula can be found on Wikipedia.

$$A = P \frac{r(1+r)^n}{(1+r)^n - 1}$$

where

  • A = payment amount per period
  • P = initial principal (loan amount)
  • r = interest rate per period
  • n = total number of payments or periods


The first step is to calculate the monthly payment with the loan amortization formula. The inputs are the initial loan amount, the monthly interest rate, and the total length of the loan in months. The published rate is usually the nominal annual interest rate, so the monthly rate is simply the nominal annual rate divided by 12. The Python function is listed below.

def get_monthly_pmt(loan_amt, r, n):
    """ calculate monthly payment
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)

"""
loan_amt = 500,000
r = 0.04/12
n = 300

>>> get_monthly_pmt(loan_amt=500000, r=0.04/12, n=300)

2639.1842014888507
"""

The second step is to calculate the remaining principal after payment t.

def get_principle(loan_amt, A, r, t):
    """ calculate the remaining principal after month t
    loan_amt: initial loan amount
    A: monthly payment
    r: monthly interest rate
    t: number of payments made so far
    """
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r
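
As a sanity check on the closed-form balance, here is a minimal sketch that compares it with rolling the balance forward month by month (add one month of interest, subtract the payment). The helper balance_by_recurrence is hypothetical and only used for this comparison. The loan values are the ones used in this post.

```python
def get_monthly_pmt(loan_amt, r, n):
    return loan_amt * r * (1 + r) ** n / ((1 + r) ** n - 1)


def get_principle(loan_amt, A, r, t):
    return loan_amt * (1 + r) ** t - A * ((1 + r) ** t - 1) / r


def balance_by_recurrence(loan_amt, A, r, t):
    """Hypothetical helper: roll the balance forward one month at a time."""
    balance = loan_amt
    for _ in range(t):
        balance = balance * (1 + r) - A  # add interest, subtract the payment
    return balance


loan_amt, r, n = 500000, 0.04 / 12, 300
A = get_monthly_pmt(loan_amt, r, n)

closed = get_principle(loan_amt, A, r, 120)          # closed-form balance
rolled = balance_by_recurrence(loan_amt, A, r, 120)  # month-by-month balance
final = get_principle(loan_amt, A, r, n)             # should be about zero
```

Both approaches should agree up to floating-point rounding, and the balance after the last payment should be approximately zero.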

The final step is to calculate the monthly amortization schedule.

import pandas as pd

def get_schedule(loan_amt, r, n):
    """ calculate the amortization schedule
    loan_amt: initial loan amount
    r: monthly interest rate
    n: total number of payments
    """
    monthly = []
    A = get_monthly_pmt(loan_amt, r, n)
    last = loan_amt
    for i in range(1, n+1):
        curr = get_principle(loan_amt, A, r, i)
        p = last - curr
        monthly.append([i, p, A - p, curr])
        last = curr
    return pd.DataFrame(monthly, columns=['month', 'principle_pmt', 'interest_pmt', 'principle'])

Here is everything in one place, with a real example. The output is a pandas data frame containing the amortization schedule, and the balance for any month can be looked up directly from the data frame.
"""
Loan/Mortgage Amortization schedule
Financial calculator
"""

import pandas as pd


def get_monthly_pmt(loan_amt, r, n):
    """ calculate monthly payment
    loan_amt: initial loan amount
    r: monthly interest
    n: total number of payments
    """
    return loan_amt*r*(1 + r)**n / ((1 + r)**n - 1)
    
def get_principle(loan_amt, A, r, t):
    """ calculate the remaining principal after month t
    loan_amt: initial loan amount
    A: monthly payment
    r: monthly interest rate
    t: number of payments made so far
    """
    return loan_amt*((1 + r)**t) - A*((1 + r)**t - 1)/r

def get_schedule(loan_amt, r, n):
    """ calculate the amortization schedule
    loan_amt: initial loan amount
    r: monthly interest rate
    n: total number of payments
    """
    monthly = []
    A = get_monthly_pmt(loan_amt, r, n)
    last = loan_amt
    for i in range(1, n+1):
        curr = get_principle(loan_amt, A, r, i)
        p = last - curr
        monthly.append([i, p, A - p, curr])
        last = curr
    return pd.DataFrame(monthly, columns=['month', 'principle_pmt', 'interest_pmt', 'principle'])

"""
loan_amt = 500,000
r = 0.04/12
n = 300

>>> get_monthly_pmt(loan_amt=500000, r=0.04/12, n=300)

2639.1842014888507

>>> df = get_schedule(loan_amt=500000, r=0.04/12, n=300)

month	principle_pmt	interest_pmt	principle
0	1	972.517535	1666.666667	499027.482465
1	2	975.759260	1663.424942	498051.723205
2	3	979.011791	1660.172411	497072.711414
3	4	982.275163	1656.909038	496090.436251
4	5	985.549414	1653.634788	495104.886837
...	...	...	...	...
295	296	2595.634264	43.549938	10469.347082
296	297	2604.286378	34.897824	7865.060704
297	298	2612.967332	26.216869	5252.093371
298	299	2621.677224	17.506978	2630.416148
299	300	2630.416148	8.768054	0.000000
"""
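
The balance lookup mentioned at the start of this post can be sketched without the data frame as well. The helper first_month_below and the 10,000 threshold are hypothetical, chosen only for illustration. The function walks the schedule and returns the first month whose remaining balance is at or below the threshold.

```python
def get_monthly_pmt(loan_amt, r, n):
    return loan_amt * r * (1 + r) ** n / ((1 + r) ** n - 1)


def get_principle(loan_amt, A, r, t):
    return loan_amt * (1 + r) ** t - A * ((1 + r) ** t - 1) / r


def first_month_below(loan_amt, r, n, target):
    """Hypothetical helper: first month with remaining balance <= target."""
    A = get_monthly_pmt(loan_amt, r, n)
    for t in range(1, n + 1):
        if get_principle(loan_amt, A, r, t) <= target:
            return t
    return None  # the balance never drops to the target within n payments


# Month 297 for this loan, consistent with the schedule printed above.
month = first_month_below(500000, 0.04 / 12, 300, target=10000)
```

With the data frame from get_schedule, the same lookup is a filter on the 'principle' column.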
    

Jan 4, 2021

All About Python - what can Python do?

 


One of the most commonly asked questions about Python is: what can Python do? As a fast-growing language, Python is used in many domains, including data analytics, data visualization, model development, natural language processing, and many others.

1. Python for data analytics

This domain has usually been dominated by R, SAS, SQL, Matlab, etc. With Python's rich libraries, one can achieve almost everything that these languages and tools can do, with only a single programming language to learn. Sounds amazing? Suppose you use SAS for data processing and are used to working with tables: summaries, aggregations, and table joins. The Python alternative is 'pandas'. Pandas provides a table interface called a data frame, along with a large set of functions that cover most table operations. If you need to work with arrays, matrices, or high-dimensional data, 'numpy' is the library that provides an array interface for processing lists of lists ...

2. Python for data visualization
Data visualization, or graphics, is not easy to do in many other languages. This area used to be dominated by software such as Matlab, Origin, etc. Python has many excellent plotting libraries, such as matplotlib, bokeh, and seaborn.

3. Python for model development

Model development, or machine learning, is one of the most attractive applications of Python. For research and study, sklearn is good enough to cover the major machine learning models and provides the facilities to build them. Its best feature is a consistent model development API, which is now popular among model developers. For more advanced areas, like deep learning, the most widely used libraries are TensorFlow and PyTorch, which originated from Google and Facebook, respectively.

4. Python for natural language processing

Natural language processing (NLP) is one of the trending subjects in artificial intelligence. Most programming languages have little support for this subject. For study and research, the Natural Language Toolkit (NLTK) is a good library to get started with. Other libraries, such as spaCy, provide more advanced capabilities for processing natural language.

Dec 30, 2020

Ways to format the code snippet and tables in the blog

 


I'm trying to embed Python code in the blog posts. I have played around with the formatting and found that Blogger has no good support for this task. After a deep dive, I found two ways to embed a code snippet with good formatting.

One way is to use the HTML formatting site http://hilite.me/. Here is a sample of Python code that checks whether a number is null (NaN).

# import numpy
import numpy as np

# function to judge if a number is null
def is_null(x):
   return x != x

# test samples
is_null(np.nan)
is_null(2)

Or, if you are not comfortable with HTML, simply copy and paste the code below into the HTML view and modify it in the Compose view.

<!-- HTML --><div style="background: #f0f0f0; overflow:auto;width:auto;border:None gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #60a0b0; font-style: italic"># print</span>
<span style="color: #007020; font-weight: bold">print</span>(<span style="color: #4070a0">&quot;All About Python&quot;</span>)
</pre></div>

The output of the above embedding.

# modify the code here
print("All About Python")

The other way is to use gist - https://gist.github.com/. Here are the steps to create formatted code in a gist. An example of Python code is shown below.

  1. Sign up for an account with GitHub if you don't have one
  2. Enter the file name with the proper extension. For example, I use isnull.py as the file name
  3. Copy or type your code
  4. Select "Create public gist" and click on it. Now the code is ready to share
  5. Choose "Embed", copy the link address, and paste it into the blog editor in HTML mode.

To add tables to the blog, it's easier to use tableizer to format the table as HTML code. Simply copy the table from the Excel sheet into tableizer and choose the font size, color, and font style. Click "Tableize it!", copy the HTML, and paste it in the HTML view. An example is shown below.

Date      Open    High    Low     Close
10/18/19  128.34  129.21  127.38  129.12
10/27/19  128.21  129.39  127.1   129.31
10/26/19  127.31  128.21  127.03  128.12
10/25/19  126.41  127.56  125.98  127.21

Dec 28, 2020

Python installation and environment setup

 


There are many ways to install Python on your machine. The easiest and most popular way is to download and install Anaconda. Python on its own, without any packages, is of limited use for data work, because most of the work relies on a set of Python packages. To use Python for numerical computation, we need NumPy and SciPy. To process data and tables, we need Pandas, and for data visualization, Matplotlib is the basic package. You may also need machine learning packages, such as scikit-learn, TensorFlow, etc. To manage package versions and their dependencies, it's essential to use a Python package manager. Anaconda is one of the best options: it ships with a package manager (conda) and the essential Python tools to get started.

For individual use, you can find the Anaconda installer for your machine on Anaconda's download page. Choose the operating system and the corresponding installer, follow the instructions, and install Anaconda on your machine.

Suppose you are using macOS and you have installed Anaconda using the graphical installer. There are multiple ways to open Python.

1. Open the terminal and type 'python'

$ python

2. Open the graphical Anaconda-Navigator and click on the notebook

3. Open the terminal and type 'jupyter notebook'

$ jupyter notebook

Introduction to Python language

 


There are so many programming languages. Why do we need to learn Python? What advantages make Python a good choice for beginners learning to code? Well, there might be many answers to this question. The best answer would be that Python is popular, especially in data science and machine learning. Python is becoming the basic, almost default tool for data scientists and model developers.


Another advantage of Python is the availability of open source libraries. Python is now widely used in machine learning and deep learning, with a range of high-level open source packages available, such as scikit-learn, TensorFlow, PyTorch, etc.

Python as a programming language was designed with a simple syntax, emphasizing readability and fast prototyping.

Writing the first line of code with Python in the terminal or a Jupyter notebook

print('All About Python')

Define a sequence of numbers

alist = [2, 3, 5, 9, 10]
print(alist)

Write a function to filter the even numbers in a list

def get_even_numbers(a):
    even_nums = []
    for x in a:
        if x % 2 == 0:  
            even_nums.append(x)
    return even_nums

get_even_numbers(alist)




Apr 25, 2020

Python methods cheat sheet


Python list methods

append(item)
>>> lst = [1, 2, 2]
>>> lst.append(4)
[1, 2, 2, 4]
count(item)
>>> lst.count(2)
2
extend(list)
>>> lst.extend([5, 6])
[1, 2, 2, 4, 5, 6]
index(item)
>>> lst.index(5)
4
insert(position, item)
>>> lst.insert(1, 9)
[1, 9, 2, 2, 4, 5, 6]
sort()
>>> lst.sort()
[1, 2, 2, 4, 5, 6, 9]
pop(index)
>>> lst.pop()
9
reverse()
>>> lst.reverse()
[6, 5, 4, 2, 2, 1]
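
One detail worth keeping in mind with the list methods above: most of them change the list in place and return None, so the lists shown are the state of lst after each call, not return values. A minimal sketch with made-up values:

```python
lst = [3, 1, 2]

# sort() reorders the list in place and returns None.
result = lst.sort()

# sorted() leaves its input alone and returns a new list.
original = [3, 1, 2]
copy_sorted = sorted(original)

# pop() is an exception: it mutates the list and returns the removed item.
last = lst.pop()
```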

Python string methods
>>> s = 'Hello2Python'
count(sub, start, end)
>>> s.count('l', 1, 5)
2
find(sub, start, end)
>>> s.find('l', 0, 5)
2
index(sub, start, end)
>>> s.index('t', 0, 9)
8
isalnum()
>>> s.isalnum()
True


isalpha()
>>> s.isalpha()
False
isdigit()
>>> s.isdigit()
False
islower()
>>> s.islower()
False
isupper()
>>> s.isupper()
False
isspace()
>>> s.isspace()
False
join(iterable)
>>> s.join('___')
'_Hello2Python_Hello2Python_'
lower()
>>> s.lower()
'hello2python'
partition(sep)
>>> s.partition('2')
('Hello', '2', 'Python')
split(sep)
>>> s.split('2')
['Hello', 'Python']
strip()
>>> s.strip('n')
'Hello2Pytho'
upper()
>>> s.upper()
'HELLO2PYTHON'



Apr 24, 2020

Adjust the cell width of the Jupyter notebook

The default width of a notebook cell is sometimes too narrow to display the content well. In this case, we would like to adjust the cell width. There are a few ways to do this.

The permanent solution is to create a file in the home folder for IPython notebook.

~/.ipython/profile_default/static/custom/custom.css

or Jupyter notebook

~/.jupyter/custom/custom.css

with content

.container { width:100% !important; }

Then restart the IPython/Jupyter notebooks. Note that this will affect all notebooks.

If you don't want to change your default settings, and you only want to change the width of the current notebook you're working on, you can enter the following into a cell:

from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

Python map and filter function

Python provides the map and filter functions to apply computations to a list of items. Here are some examples of how to use map and filter.

Python map function examples

Compute the squares of 1 to 10

def f(x):
    return x*x

# map returns an iterator in Python 3; list() collects the results
list(map(f, range(1, 11)))

or use a lambda function

list(map(lambda x: x*x, range(1, 11)))


Python filter function examples

Extract the even numbers from 1 to 10

def f(x):
    return x%2 == 0

# filter returns an iterator in Python 3; list() collects the results
list(filter(f, range(1, 11)))

or use a lambda function

list(filter(lambda x: x%2 == 0, range(1, 11)))
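
In Python 3, map and filter return lazy iterators rather than lists. The following minimal sketch shows what that means in practice: the results only appear once the iterator is consumed, and it can be consumed only once.

```python
squares = map(lambda x: x * x, range(1, 11))

# A map object is an iterator, not a list.
is_list = isinstance(squares, list)

# list() consumes the iterator and collects the results.
squares_list = list(squares)

# The iterator is now exhausted, so a second pass yields nothing.
second_pass = list(squares)

# filter behaves the same way; wrap it in list() to see the values.
evens = list(filter(lambda x: x % 2 == 0, range(1, 11)))
```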

Python - list comprehension examples

List comprehension is a Python feature that builds a new list from the elements of an iterable in a single expression, without writing a separate function.

Square of numbers from 0 to n - 1

Python script
n = 10
for x in range(n):
    print(x**2)

List comprehension
[x**2 for x in range(n)]

Map function
list(map(lambda x: x**2, range(n)))

Square of even numbers from 0 to n - 1

Python script
n = 10
for x in range(n):
    if x%2 == 0:
        print(x**2)

List comprehension with if clause
[x**2 for x in range(n) if x%2 == 0]


List comprehension with two items
[x + y for x in range(10) for y in range(10)]
[(x, y, x+y) for x in range(10) for y in range(10)]

List comprehension with two items with if clause
[(x, y, x+y) for x in range(10) for y in range(10) if (x%2==0) and (y%2==0)]
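
To check that the three styles above are equivalent, here is a small runnable sketch with n fixed at 10 (an arbitrary value). It also shows that two for clauses in a comprehension nest left to right, like nested loops.

```python
n = 10

# Explicit loop.
loop_result = []
for x in range(n):
    if x % 2 == 0:
        loop_result.append(x ** 2)

# List comprehension with an if clause.
comp_result = [x ** 2 for x in range(n) if x % 2 == 0]

# map and filter combined.
map_result = list(map(lambda x: x ** 2, filter(lambda x: x % 2 == 0, range(n))))

# Two for clauses: x is the outer loop, y is the inner loop.
pairs = [(x, y) for x in range(2) for y in range(3)]
```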

Ordinal label encoder of categorical variables


During machine learning model development, it is usually necessary to convert categorical variables into numerical ones so that machine learning packages can use them. This process is called categorical variable encoding. There are various ways to encode categorical variables as numbers, such as the label encoder, one-hot encoder, weight of evidence encoder, etc.

The label encoder is the simplest method: it converts the categorical values into numbers without any order requirement. Suppose we have a pandas data frame with a categorical variable "cat" and a target variable "target". The following code snippet maps each unique value to a number. The benefit compared to the sklearn LabelEncoder is that missing values are treated as a new category, whereas sklearn may raise an error.

A simple label encoder converts the values of the categorical variable into labels

def get_label(df, v):
    # fill missing values first so that they get their own label
    values = df[v].fillna('missing').unique()
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].fillna('missing').map(get_label(df, 'cat'))

What if we want the labels to be in order? In this case, the categorical variable is truly converted into a numerical variable whose values rank-order with the dependent variable. One way is to compute the mean of the target variable for each value of the categorical variable. If we sort the cat-target pairs by the target mean, the values of the categorical variable end up in order.

def ordinal_label(df, v, target):
    df_group = df.groupby([v])[target].mean().reset_index()
    values = df_group.sort_values([target])[v].values
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))

Or, in a more compact way,

def ordinal_label(df, cat, target):
    values = df.groupby([cat])[target].mean().reset_index().\
        sort_values([target])[cat].values
    return dict(zip(values, range(len(values))))


df['cat_label'] = df['cat'].map(ordinal_label(df, 'cat', 'target'))

A better strategy is to bin together the values with a similar mean of the target, which can be done manually.
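
To see the ordinal encoding on concrete numbers, here is a minimal sketch on a made-up data frame. The toy values are assumptions for illustration only. The function is the ordinal_label from this post.

```python
import pandas as pd


def ordinal_label(df, v, target):
    df_group = df.groupby([v])[target].mean().reset_index()
    values = df_group.sort_values([target])[v].values
    return dict(zip(values, range(len(values))))


# Hypothetical toy data: target means are a = 1.0, b = 0.0, c = 0.5.
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c"],
    "target": [1, 1, 0, 0, 1, 0],
})

# Categories are labeled in increasing order of their target mean.
mapping = ordinal_label(df, "cat", "target")
df["cat_label"] = df["cat"].map(mapping)
```

Here "b" gets label 0, "c" gets label 1, and "a" gets label 2, so the label increases with the mean of the target.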