Data analysis and statistical work are often reused in daily work. Before implementing data analysis, the most important thing is data collection and sorting. Excel is a commonly used tool for this task. When the data is small, Excel is very convenient, but when the amount of data is large, processing with Excel will be very slow, affecting work efficiency. For example: you are asked to merge more than 100 files into one. If you only rely on Excel to operate, this will involve multiple instances of copying and pasting, which may take several hours. But by implementing a few lines of code in Python we can automate Excel and help us process data efficiently.
Pandas is a powerful tool for analyzing structured data. It is based on numpy (providing high-performance matrix operations) for data mining and data analysis. It also provides data cleaning functions.
This article introduces the common operations of Pandas to handle Excel.
import pandas as pd
1
2
df = pd.read_excel('d:/python/excel_1.xlsx', header = 1, index_col = 'ID')
df = pd.read_excee('d:/python/excel_1.xlsx', skiprows = 3, usecols = 'G:L')
It should be noted that if the second parameter header is not set, Pandas will treat the first row of Excel as columns by default. When ‘header=None’, Pandas will regenerate numbers starting from 0 as columns. We can treat the second line as a column by ‘header=1’. The third parameter index_col indicates which column is used as index. If it is not set, Pandas will generate a string of numbers starting from 0 as index by default. We can set the specified column as index. For example, ‘index_col=“ID”’. If the data does not start from the first cell, you can use the skiprows parameter to skip rows, or ‘usecols=“F:H”’ to determine which column to start reading from. Besides, you can use the ‘sheet_name= ’parameter to determine which sheet to read.
Modifying the columns attribute of the object is equivalent to resetting the columns for the object.
df.columns = ['ID', 'Column_A', 'Column_B']
We can set the index through the set_index() method. For example, we can set the ID column to index. But it should be noted that after setting, the ID will no longer appear in the df.columns attribute, and will become the name of the index. The second parameter inplace can be set to true and false.
df.set_index('ID', inplace=True)
In Pandas, you can directly filter the data frame. For example, you can use df[‘Value’]>5000 to filter a single condition (value greater than 5000). If you use multiple conditions to filter, you only need to use & (Union) and | (or) operator implementation.
df['Value']>5000
It is very simple to delete data in Pandas. For example, to delete the Column_B column.
del df['Column_B']
In Pandas, you can use sort_values to sort, and use ascending to control the ascending and descending order. The default is ascending order.
df.sort_values('Value',ascending=False,inplace=True)
In Pandas, you can use data.isnull().sum()
to check for missing values, and then you can use a variety of methods to fill or delete missing values. For example, we want to replace the missing value with the value before the missing value horizontally/vertically.
df = df.fillna(axis=0,method='ffill')
In Pandas, you can use drop_duplicates to deduplicate data, and you can specify columns and retention order. For example, deduplicate sample data according to the Create_time column.
df.drop_duplicates(['Create_time'],inplace=True)
There is no fixed format modification method in Pandas. Different data formats have different modification methods. For example, modify the creation time to the format of year-month-day.
1
df['Create_time'] = df['create time to modify the creation time to year-month-day in Excel. '].dt.strftime('%Y-%m-%d')
Merge two columns into one
df['Combined_column'] = df['Column_A'] + df['Column_C']
It can be saved locally through the to_excel()
method.
df.to_excel('d:/new_filename.xlsx')
For example, merge several tables in the Excel folder into one.
Import Pandas library, os module
1 2
import pandas as pd import os
Get the path of the file
file_dir = r'C:\Users\mcc\Desktop\EXCEL'
Construct a new Excel file
new_filename = file_dir + '\\result.xlsx'
Find the names of all excels under the file path and return a list
file_list = os.listdir(file_dir)
Refactor the file path, convert Excel into a dataframe, and save it in a new list.
1 2 3 4 5
new_list = [] for file in file_list: file_path = os.path.join(file_dir, file) dataframe = pd.read_excel(file_path) new_list.append(dataframe)
Combine multiple dataframes into one and write them into a new Excel file.
1
2
df = pd.concat(new_list)
df.to_excel(new_filename, index = False)
The output is: