This article explains how to use Pandas in an easy-to-understand way. If you read it through, you will have the basics covered.
If you're a complete beginner, first learn what a CSV file is before you start studying Pandas.
CSV (comma-separated values) is, literally, a file format in which values are separated by commas (,). Let's look at a concrete example. Suppose you have a file like the one below.
Language used,Years of experience,annual income
Python,10,"¥60,000,000.00"
Ruby,2,"¥3,500,000.00"
Swift,4,"¥5,000,000.00"
If you open this with Excel or Google Sheets, it is displayed as a table with three columns and one row per language. Conclusion: the only thing you need to keep in mind is that a CSV file is a comma-delimited version of an Excel sheet.
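To make the "values separated by commas" idea concrete, here is a minimal sketch that parses the same text with Python's standard csv module. Holding the text in a string (the variable name csv_text is just for illustration) avoids needing a real file. Note that the income values are wrapped in double quotes, which is why the commas inside them are not treated as delimiters.

```python
import csv
import io

# The sample CSV from this article, kept in a string so no file is needed.
csv_text = '''Language used,Years of experience,annual income
Python,10,"¥60,000,000.00"
Ruby,2,"¥3,500,000.00"
Swift,4,"¥5,000,000.00"
'''

# csv.reader splits each line on commas, but keeps quoted fields intact.
rows = list(csv.reader(io.StringIO(csv_text)))
header, records = rows[0], rows[1:]

print(header)
for record in records:
    print(record)
```

Every row comes back as a list of three strings, one per column.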
・ What is Pandas?
・ Installation procedure
・ Basic data types
・ How to retrieve data (loc, iloc, head, tail, etc.)
・ Data reading and output
・ Data sorting
・ Processing of missing values
Manipulate data
・ Series edition
・ DataFrame edition
・ Statistical processing
Pandas is a Python library for efficient data analysis. That is rather abstract, so let me be concrete. When you do machine learning or data analysis, the data is often not yet organized in a form suitable for learning. Pandas lets you conveniently reshape that data. The processing done before machine learning is called data preprocessing. Please keep in mind: for data preprocessing, use Pandas!
If you installed Python using Anaconda, Pandas is probably already installed. If it is not, install it with pip:
pip install pandas
When using Pandas, you need to load the Pandas library.
import pandas as pd
Typing pandas every time is tedious, so it is conventionally imported under the short name pd.
A Series is a data type with only one column. Put simply, it is a one-dimensional data structure.
import pandas as pd
l = [1,2,3,4,5]
series = pd.Series(l)
print(series)
==========>
0 1
1 2
2 3
3 4
4 5
dtype: int64
The number on the left is the index (row label) and the number on the right is the series data.
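The index does not have to be 0, 1, 2, and so on. As a small sketch (the labels 'a' to 'e' are made up for illustration), you can pass your own row labels and then look at the two halves of a Series separately:

```python
import pandas as pd

# Hypothetical labels 'a'..'e' replace the default 0..4 index.
series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

labels = series.index.tolist()  # the row labels (left side when printed)
values = series.tolist()        # the data (right side when printed)

print(labels)
print(values)
```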
A DataFrame is a two-dimensional labeled data structure and the most frequently used data structure in Pandas. It is easy to picture if you imagine a sheet in Excel or another spreadsheet.
import pandas as pd
df = pd.DataFrame({
'Program language' :['Python', 'Ruby', 'Go'],
'Years of experience' : [1, 1, 2],
'annual income' : [3000000, 2800000, 16900000]
})
print(df)
===========>
   Program language  Years of experience  annual income
0            Python                    1        3000000
1              Ruby                    1        2800000
2                Go                    2       16900000
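That is what a DataFrame looks like.
By the way, each row automatically gets a row label (index) starting from 0. In older versions of pandas, columns created from a dict could be reordered alphabetically, so the column order in your output may differ from the order in the code. Recent versions keep the order in which the keys were written.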
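If you want to check the structure of a DataFrame without printing all of it, a few attributes are handy. The sketch below reuses the same three-row data and assumes a reasonably recent pandas, where dict key order is kept as the column order:

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 1, 2],
    'annual income': [3000000, 2800000, 16900000],
})

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column labels
print(df.index.tolist())    # row labels, 0..n-1 by default

# Each column of a DataFrame is itself a Series.
income_is_series = isinstance(df['annual income'], pd.Series)
print(income_is_series)
```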
For a Series, you can access an element directly by its row label.
import pandas as pd
l = [1,2,3,4,5]
series = pd.Series(l)
print(series[1])
==========>
2
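With the default index, a label and a position are the same number, which hides the difference between them. The following sketch uses made-up string labels ('x', 'y', 'z') so that label-based and position-based access are clearly distinct:

```python
import pandas as pd

# Hypothetical string labels, so labels and positions differ visibly.
series = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

by_label = series['y']          # access by row label
by_label_loc = series.loc['y']  # the same, but explicitly label-based
by_position = series.iloc[1]    # explicitly position-based (counted from 0)

print(by_label, by_label_loc, by_position)
```

All three expressions pick the same element here. Using loc and iloc makes it explicit which kind of lookup you intend.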
DataFrames are where it gets tricky. There are several ways to extract data, so let's go through them in order. As a premise, assume that you have the following data.
import pandas as pd
df = pd.DataFrame({
'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
'Years of experience' : [1, 1, 2, 3, 1,3],
'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
'age' : [21,22,34,55,11,8]
})
print(df)
============>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
5                C#                    3         500000    8
print(df['Program language'])
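# Attribute access such as df.age gives the same kind of result, but it only works
# for column names that are valid Python identifiers. 'Program language' contains a space, so it needs df['...'].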
=================>
0    Python
1    Python
2      Ruby
3        Go
4        C#
5        C#
Name: Program language, dtype: object
print(df[0:2])
===============>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
This can be confusing, because df[] selected a column a moment ago and rows just now, so let me explain in detail.
If you pass an ordinary key to df[], pandas treats it as a column name.
If you pass a slice to df[], pandas treats it as a selection of rows. For an integer slice such as 0:2, the end is not included, just like an ordinary Python slice.
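The different forms of df[] can be put side by side. This is a small sketch with a shortened version of the data; the variable names are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 2, 3],
    'age': [21, 34, 55],
})

one_column = df['age']                         # a key -> one column, as a Series
two_columns = df[['Program language', 'age']]  # a list of keys -> a DataFrame
first_rows = df[0:2]                           # a slice -> rows at positions 0 and 1

# 'age' is a valid identifier, so attribute access works for it too.
same_column = df.age.equals(one_column)

print(type(one_column).__name__, two_columns.shape, first_rows.index.tolist())
```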
This time, specify both rows and columns.
Basic usage of loc: loc[row label, column label]. With loc, you specify the row name and the column name.
Basic usage of iloc: iloc[row number, column number]. With iloc, you specify the row number and the column number.
import pandas as pd
df = pd.DataFrame({
'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
'Years of experience' : [1, 1, 2, 3, 1,3],
'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
'age' : [21,22,34,55,11,8]
})
print(df.loc[0:2, 'Program language'])  # loc slices by row label, so the end label (2) is included.
print(df.iloc[0:2, 0])  # iloc slices by position, so the end position (2) is not included!
=================>
0    Python
1    Python
2      Ruby
Name: Program language, dtype: object
0    Python
1    Python
Name: Program language, dtype: object
Please read the comments in the code: the two results differ because loc includes the end of the slice and iloc does not. By the way, in current versions of pandas, accessing a column that does not exist does not return NaN. loc raises a KeyError, and iloc raises an IndexError for a position that is out of range.
head() returns the first 5 rows, and tail() returns the last 5 rows.
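The difference between loc and iloc is easier to see when the row labels are not numbers. The sketch below uses made-up row labels ('a', 'b', 'c') and a column name ('salary') that deliberately does not exist. It assumes a current pandas version, where a missing column label raises KeyError:

```python
import pandas as pd

df = pd.DataFrame(
    {'Program language': ['Python', 'Ruby', 'Go'], 'age': [21, 34, 55]},
    index=['a', 'b', 'c'],  # hypothetical row labels
)

by_label = df.loc['a':'b', 'age']  # labels 'a' through 'b', end included
by_position = df.iloc[0:2, 1]      # positions 0 and 1, end excluded

print(by_label.tolist())
print(by_position.tolist())

# 'salary' is not a column of df, so the lookup fails instead of giving NaN.
try:
    df.loc[:, 'salary']
    missing_column_error = None
except KeyError as exc:
    missing_column_error = exc

print(type(missing_column_error).__name__)
```

Both selections return the same two rows, even though one slice ends at 'b' and the other ends at 2.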
print(df.head())
==================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
print(df.tail())
==================>
   Program language  Years of experience  annual income  age
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
3                Go                    3        1230000   55
4                C#                    1        2000000   11
5                C#                    3         500000    8
# You can specify how many rows to return with an argument.
print(df.head(2))
====================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
print(df.tail(2))
=====================>
   Program language  Years of experience  annual income  age
4                C#                    1        2000000   11
5                C#                    3         500000    8
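As a small supplementary sketch (using only the age column for brevity), head() and tail() keep the original row labels, and asking for more rows than exist simply returns everything:

```python
import pandas as pd

df = pd.DataFrame({'age': [21, 22, 34, 55, 11, 8]})

first_two = df.head(2)
last_two = df.tail(2)
everything = df.head(100)  # more rows requested than exist: all rows come back

print(first_two.index.tolist())
print(last_two.index.tolist())
print(len(everything))
```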
With query(), you can give a condition on the values of a DataFrame and extract the rows that satisfy it. The condition is usually written with comparison operators.
import pandas as pd
df = pd.DataFrame({
'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
'Years of experience' : [1, 1, 2, 3, 1,3],
'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
'age' : [21,22,34,55,11,8]
})
# Column names that contain spaces must be wrapped in backticks inside query().
print(df.query('`Years of experience` <= 2'))
========================>
   Program language  Years of experience  annual income  age
0            Python                    1        3000000   21
1            Python                    1        2800000   22
2              Ruby                    2       16900000   34
4                C#                    1        2000000   11
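Conditions can be combined, and a Python variable can be referenced inside the query string with @. The sketch below assumes a pandas version that supports backtick-quoted column names in query(), and the threshold variable max_years is made up for illustration. It also shows the equivalent filter written with a boolean mask:

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
    'age': [21, 22, 34, 55, 11, 8],
})

max_years = 2  # hypothetical threshold, referenced below as @max_years

# Backticks quote the column name with spaces; 'and' combines two conditions.
with_query = df.query('`Years of experience` <= @max_years and age >= 20')

# The same filter as a boolean mask; each condition needs its own parentheses.
with_mask = df[(df['Years of experience'] <= max_years) & (df['age'] >= 20)]

print(with_query.index.tolist())
print(with_query.equals(with_mask))
```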
Pandas can read data from files and write the data back out to a file after manipulating it. Here, we will only introduce the functions.
import pandas as pd
# Reading: header and sep are optional keyword arguments (other options exist as well).
df = pd.read_csv('file name', header=0, sep=',')     # read_csv: the default delimiter is ","
df = pd.read_table('file name', header=0, sep='\t')  # read_table: the default delimiter is "\t"
# Writing: the output functions are methods of the DataFrame, not of pd.
df.to_csv('file name')
df.to_excel('file name')
df.to_html('file name')
# And so on.
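To try reading and writing without creating files, you can use an in-memory text buffer in place of a file name. This is only a sketch: io.StringIO stands in for a real file, and index=False is an optional setting that leaves the row labels out of the output.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', 'Go'],
    'Years of experience': [1, 1, 2],
})

# Write the DataFrame as CSV text into an in-memory buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV text back into a new DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

With a real file, you would pass a path such as 'data.csv' (a placeholder name) instead of the buffer.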
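There are two main sorting methods: sort_index() sorts by row label (index), and sort_values() sorts by the values of a column.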
import pandas as pd
df = pd.DataFrame({
'Program language' :['Python', 'Python','Ruby', 'Go','C#','C#'],
'Years of experience' : [1, 1, 2, 3, 1,3],
'annual income' : [3000000, 2800000, 16900000,1230000,2000000,500000],
'age' : [21,22,34,55,11,8]
})
print(df.sort_index(ascending=False))
===============================>
   Program language  Years of experience  annual income  age
5                C#                    3         500000    8
4                C#                    1        2000000   11
3                Go                    3        1230000   55
2              Ruby                    2       16900000   34
1            Python                    1        2800000   22
0            Python                    1        3000000   21
print(df.sort_values(by="annual income"))
=================================>
   Program language  Years of experience  annual income  age
5                C#                    3         500000    8
3                Go                    3        1230000   55
4                C#                    1        2000000   11
1            Python                    1        2800000   22
0            Python                    1        3000000   21
2              Ruby                    2       16900000   34
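sort_values() also accepts ascending=False and a list of columns. The sketch below reuses the same data. Neither call changes df itself, since each returns a new sorted DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'Program language': ['Python', 'Python', 'Ruby', 'Go', 'C#', 'C#'],
    'Years of experience': [1, 1, 2, 3, 1, 3],
    'annual income': [3000000, 2800000, 16900000, 1230000, 2000000, 500000],
    'age': [21, 22, 34, 55, 11, 8],
})

# Highest income first.
by_income_desc = df.sort_values(by='annual income', ascending=False)

# Sort by years of experience, and break ties by age.
by_years_then_age = df.sort_values(by=['Years of experience', 'age'])

print(by_income_desc.index.tolist())
print(by_years_then_age.index.tolist())
```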
In data analysis and machine learning, you will come across many missing values. A missing value is a part of the data that is absent (for example, an unanswered field in a questionnaire). Coming soon...
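Until that section is written, here is a very small sketch of what handling missing values can look like. It is not the author's planned content. The questionnaire data is made up, and isna(), dropna(), and fillna() are just three of the standard pandas methods for this purpose.

```python
import pandas as pd

# Hypothetical questionnaire data; None marks an unanswered field.
df = pd.DataFrame({
    'Program language': ['Python', 'Ruby', None],
    'age': [21, None, 34],
})

missing_per_column = df.isna().sum()  # count of missing values in each column
dropped = df.dropna()                 # keep only rows with no missing value
filled = df.fillna({'age': 0})        # replace missing ages with 0

print(missing_per_column)
print(dropped)
print(filled)
```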