🎉 Wipro's Lab45 AI Platform is now free with your Topcoder account!

December 29, 2020

Processing Excel Files with Pandas Library

M_CCM_CC

DURATION

15min

Introduction

Data analysis and statistical work are often reused in daily work. Before implementing data analysis, the most important thing is data collection and sorting. Excel is a commonly used tool for this task. When the data is small, Excel is very convenient, but when the amount of data is large, processing with Excel will be very slow, affecting work efficiency. For example: you are asked to merge more than 100 files into one. If you only rely on Excel to operate, this will involve multiple instances of copying and pasting, which may take several hours. But by implementing a few lines of code in Python we can automate Excel and help us process data efficiently.

Pandas is a powerful tool for analyzing structured data. It is based on numpy (providing high-performance matrix operations) for data mining and data analysis. It also provides data cleaning functions.

This article introduces the common operations of Pandas to handle Excel.

Import the Pandas library

import pandas as pd

Read Excel file

1
2
df = pd.read_excel('d:/python/excel_1.xlsx', header = 1， index_col = 'ID')
df = pd.read_excee('d:/python/excel_1.xlsx', skiprows = 3, usecols = 'G:L')

It should be noted that if the second parameter header is not set, Pandas will treat the first row of Excel as columns by default. When ‘header=None’, Pandas will regenerate numbers starting from 0 as columns. We can treat the second line as a column by ‘header=1’. The third parameter index_col indicates which column is used as index. If it is not set, Pandas will generate a string of numbers starting from 0 as index by default. We can set the specified column as index. For example, ‘index_col=“ID”’. If the data does not start from the first cell, you can use the skiprows parameter to skip rows, or ‘usecols=“F:H”’ to determine which column to start reading from. Besides, you can use the ‘sheet_name= ’parameter to determine which sheet to read.

Set columns

Modifying the columns attribute of the object is equivalent to resetting the columns for the object.

df.columns = ['ID', 'Column_A', 'Column_B']

Set index

We can set the index through the set_index() method. For example, we can set the ID column to index. But it should be noted that after setting, the ID will no longer appear in the df.columns attribute, and will become the name of the index. The second parameter inplace can be set to true and false.

df.set_index('ID', inplace=True)

Data filtering

In Pandas, you can directly filter the data frame. For example, you can use df[‘Value’]>5000 to filter a single condition (value greater than 5000). If you use multiple conditions to filter, you only need to use & (Union) and | (or) operator implementation.

df['Value']>5000

Data deletion

It is very simple to delete data in Pandas. For example, to delete the Column_B column.

del df['Column_B']

Data sorting

In Pandas, you can use sort_values to sort, and use ascending to control the ascending and descending order. The default is ascending order.

df.sort_values('Value',ascending=False,inplace=True)

Missing value processing

In Pandas, you can use data.isnull().sum() to check for missing values, and then you can use a variety of methods to fill or delete missing values. For example, we want to replace the missing value with the value before the missing value horizontally/vertically.

df = df.fillna(axis=0,method='ffill')

Data deduplication

In Pandas, you can use drop_duplicates to deduplicate data, and you can specify columns and retention order. For example, deduplicate sample data according to the Create_time column.

df.drop_duplicates(['Create_time'],inplace=True)

Format modification

There is no fixed format modification method in Pandas. Different data formats have different modification methods. For example, modify the creation time to the format of year-month-day.

1
df['Create_time'] = df['create time to modify the creation time to year-month-day in Excel. '].dt.strftime('%Y-%m-%d')

Data consolidation

Merge two columns into one

df['Combined_column'] = df['Column_A'] + df['Column_C']

Write to file and save

It can be saved locally through the to_excel() method.

df.to_excel('d:/new_filename.xlsx')

Combine multiple Excel files into one

For example, merge several tables in the Excel folder into one.

Import Pandas library, os module

1
2
import pandas as pd
import os

Get the path of the file

file_dir = r'C:\Users\mcc\Desktop\EXCEL'

Construct a new Excel file

new_filename = file_dir + '\\result.xlsx'

Find the names of all excels under the file path and return a list

file_list = os.listdir(file_dir)

Refactor the file path, convert Excel into a dataframe, and save it in a new list.

1
2
3
4
5
new_list = []
for file in file_list:
  file_path = os.path.join(file_dir, file)
dataframe = pd.read_excel(file_path)
new_list.append(dataframe)

Combine multiple dataframes into one and write them into a new Excel file.

1
2
df = pd.concat(new_list)
df.to_excel(new_filename, index = False)

The output is:

Reference:

https://pandas.pydata.org/

Chat on Discord