What is a DataFrame

A DataFrame is a two-dimensional data structure that looks like a bundle of multiple Series. You can generate a DataFrame by passing a list of Series to pd.DataFrame(). Each Series becomes one row, and rows are automatically numbered from 0 in ascending order.

pd.DataFrame([Series, Series, ...])

A DataFrame can also be generated from a dictionary whose values are lists. Each key becomes a column name, and all of the lists must have the same length.

import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  year  time
0       apple  2001     1
1      orange  2002     4
2      banana  2001     5
3  strawberry  2008     6
4   kiwifruit  2006     3
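
The same-length requirement for the lists can be checked with a small sketch. The dictionary bad_data below is hypothetical and is not part of the original example; it only illustrates what happens when the lists differ in length.

```python
import pandas as pd

# Hypothetical data: "year" has one element fewer than "fruits"
bad_data = {"fruits": ["apple", "orange", "banana"],
            "year": [2001, 2002]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as e:
    # pandas refuses to build a DataFrame from lists of different lengths
    raised = True
    print("ValueError:", e)
```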

import pandas as pd

index = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
data1 = [10, 5, 8, 12, 3]
data2 = [30, 25, 12, 10, 8]
series1 = pd.Series(data1, index=index)
series2 = pd.Series(data2, index=index)

df = pd.DataFrame([series1,series2])

print(df)

#Output result
   apple  orange  banana  strawberry  kiwifruit
0     10       5       8          12          3
1     30      25      12          10          8

Set index and column

In a DataFrame, the row labels are called the index and the column labels are called the columns.

When a DataFrame is created without specifying an index, integers are assigned to the index in ascending order from 0.

The columns are taken from the index of the source Series or from the keys of the source dictionary.

The index of the DataFrame type variable df can be set by assigning a list of the same length as the number of rows to df.index. The columns of df can be set by assigning a list of the same length as the number of columns to df.columns.

df.index = ["name1", "name2"]

import pandas as pd

index = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
data1 = [10, 5, 8, 12, 3]
data2 = [30, 25, 12, 10, 8]

series1 = pd.Series(data1, index=index)
series2 = pd.Series(data2, index=index)
df = pd.DataFrame([series1, series2])

#Set df index to start at 1
df.index=[1,2]

print(df)

#Output result
   apple  orange  banana  strawberry  kiwifruit
1     10       5       8          12          3
2     30      25      12          10          8
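
The example above only sets df.index. As a minimal sketch, the columns can be replaced in the same way by assigning to df.columns. The names "a" to "e" are hypothetical placeholders chosen for illustration.

```python
import pandas as pd

index = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
series1 = pd.Series([10, 5, 8, 12, 3], index=index)
series2 = pd.Series([30, 25, 12, 10, 8], index=index)
df = pd.DataFrame([series1, series2])

# Replace the five column names; the list length must equal the number of columns
df.columns = ["a", "b", "c", "d", "e"]  # hypothetical names
print(df.columns.tolist())
print(df.loc[0, "a"])  # the value that was under "apple" in the first row
```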

Add a row (note the name attribute)

If you want to add one row of data, use append(). Here, df is a DataFrame and series is a Series. Note that DataFrame.append() was removed in pandas 2.0, so the code in this section requires an earlier version.

First, prepare a Series whose index corresponds to the columns of the DataFrame. Then call append() as shown in the sample code below, and one row of data is added to df.

If you add a Series whose index does not match the columns of the DataFrame (for example, adding a Series with index ["fruits", "time", "year", "date"] to a DataFrame with columns ["fruits", "time", "year"]), a new column is added to df, and elements that have no value in that column are treated as missing values and automatically filled with NaN (Not a Number).

Note that appending a Series that has no name attribute raises an error unless ignore_index=True is specified. The name attribute is the name attached to the Series itself (not a column name).


import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])

df = df.append(series, ignore_index=True) #ignore_index=True is required here because the Series has no name.
print(df)
#Output result
       fruits  year  time  date
0       apple  2001     1   NaN
1      orange  2002     4   NaN
2      banana  2001     5   NaN
3  strawberry  2008     6   NaN
4   kiwifruit  2006     3   NaN
5       mango  2008     7  12/1

#Series without name attribute
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])
print(series)
#Output result
fruits    mango
year       2008
time          7
date       12/1
dtype: object

#Series with name attribute
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])
series.name = 'test'
print(series)
#Output result
fruits    mango
year       2008
time          7
date       12/1
Name: test, dtype: object
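
DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, one possible way to add the same row is pd.concat(). This is a sketch of an alternative, not the method used in the sample code above; the variable names row and df2 are introduced here for illustration.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])

# Turn the Series into a one-row DataFrame, then concatenate it below df.
# ignore_index=True renumbers the rows from 0, as in the append() example.
row = series.to_frame().T
df2 = pd.concat([df, row], ignore_index=True)
print(df2)
```

As with append(), the "date" column is created for the new row, and the existing rows receive NaN in it.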

Add column

You may want to add a new item (column) to an existing DataFrame.

If df is a DataFrame:

df["new column name"] = values  #Assigning a list or Series adds a new column.

When a list is assigned, its elements fill the rows of df in order, starting from the first row. When a Series is assigned, its values are matched to the rows of df by index label.

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)

df["price"] = [150, 120, 100, 300, 150]
print(df)
#Output result
       fruits  year  time  price
0       apple  2001     1    150
1      orange  2002     4    120
2      banana  2001     5    100
3  strawberry  2008     6    300
4   kiwifruit  2006     3    150
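
The example above assigns a list. To illustrate the Series case, here is a minimal sketch in which the Series index is deliberately out of order and has no entry for one row. The price values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"fruits": ["apple", "orange", "banana"],
                   "year": [2001, 2002, 2001]})

# Hypothetical prices, out of order and with no entry for index 1
price = pd.Series([100, 150], index=[2, 0])

df["price"] = price  # values are matched to rows by index label, not by position
print(df)
```

Row 0 receives 150 and row 2 receives 100, while row 1, which has no matching label in the Series, is filled with NaN.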

Data reference

Data in a DataFrame can be referenced by specifying rows and columns. What is returned depends on how the rows and columns are specified.

There are several ways to reference data. For now, we will cover loc and iloc.

loc references data by name, and iloc references data by number.

Reference by name

Use loc to reference a DataFrame by index labels and column names.

If df is a DataFrame:

df.loc[list of index labels, list of column names]  #Returns the corresponding part of the DataFrame.

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  year  time
0       apple  2001     1
1      orange  2002     4
2      banana  2001     5
3  strawberry  2008     6
4   kiwifruit  2006     3

#Specify the list of index labels [1, 2] and the list of columns ["time", "year"] for the DataFrame above.

df = df.loc[[1, 2], ["time", "year"]]  #Note the double brackets: each argument is itself a list
print(df)
#Output result
   time  year
1     4  2002
2     5  2001

Reference by number

Use iloc to reference a DataFrame by row numbers and column numbers.

If df is a DataFrame:

df.iloc[list of row numbers, list of column numbers]  #Returns the corresponding part of the DataFrame.

Numbers start at 0 for both rows and columns. In addition to passing lists, you can also use slice notation.

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  year  time
0       apple  2001     1
1      orange  2002     4
2      banana  2001     5
3  strawberry  2008     6
4   kiwifruit  2006     3
#Specify the list of row numbers [1, 3] and the list of column numbers [0, 2] for the DataFrame above.

df = df.iloc[[1, 3], [0, 2]]
print(df)
#Output result
       fruits  time
1      orange     4
3  strawberry     6
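
Slice notation is mentioned above but not shown. The following sketch selects the same block once with iloc and once with loc. With iloc the end position is excluded, whereas a loc slice includes its end label. The variable names sub and sub_loc are introduced here for illustration.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)

# iloc: rows at positions 1 to 3 (position 4 is excluded) and the first two columns
sub = df.iloc[1:4, 0:2]

# loc: index labels 1 to 3 and columns "fruits" to "year", both ends included
sub_loc = df.loc[1:3, "fruits":"year"]

print(sub)
print(sub.equals(sub_loc))
```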

Delete row or column

If df is a DataFrame, you can delete rows or columns by passing index labels or column names to df.drop().

To delete several rows or columns at once, pass their labels as a list.

With this form, a single call deletes either rows or columns, not both. To delete columns, specify axis=1 as the second argument.

axis=1  #Specify this when deleting columns.

import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# Use drop() to delete rows 0 and 1 of df
df_1 = df.drop(range(0, 2))

# Use drop() to delete the "year" column of df
df_2 = df.drop("year", axis=1)

print(df_1)
print()
print(df_2)
#Output result
       fruits  time  year
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006

       fruits  time
0       apple     1
1      orange     4
2      banana     5
3  strawberry     6
4   kiwifruit     3
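
The restriction described above applies when labels are passed together with axis. As a sketch, assuming a reasonably recent pandas version, drop() also accepts the index and columns keyword arguments, which can remove rows and columns in one call. The name df_3 is introduced here for illustration.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# Delete rows 0 and 1 and the "year" column in a single call
df_3 = df.drop(index=[0, 1], columns=["year"])
print(df_3)
```

Like the other calls in this section, drop() returns a new DataFrame and leaves df unchanged.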

Sort

If df is a DataFrame:

df.sort_values(by=column name or list of column names)  #Returns the data sorted by those columns.

ascending=True  #This argument sorts the column values in ascending order (smallest first). It is the default.

import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006
#Sort data in ascending order (specify a column as an argument)
df = df.sort_values(by="year", ascending = True)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
2      banana     5  2001
1      orange     4  2002
4   kiwifruit     3  2006
3  strawberry     6  2008
#Sort data in ascending order (specify a list of columns as an argument)
df = df.sort_values(by=["time", "year"], ascending=True)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
4   kiwifruit     3  2006
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
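
The examples above sort only in ascending order. As a sketch of the opposite case, ascending=False sorts from largest to smallest, and a list of booleans sets the direction for each column separately. The names df_desc and df_mixed are introduced here for illustration.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# Sort by "year" from largest to smallest
df_desc = df.sort_values(by="year", ascending=False)
print(df_desc)

# Sort by "year" ascending, and within the same year by "time" descending
df_mixed = df.sort_values(by=["year", "time"], ascending=[True, False])
print(df_mixed)
```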

Filtering

As with a Series, you can filter a DataFrame by specifying a sequence of bool values, which extracts only the rows marked True.

Also as with a Series, a conditional expression on a DataFrame produces a sequence of bool values, so you can use a conditional expression to perform filtering.

For example, the code below extracts only the rows whose index is even.

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "year": [2001, 2002, 2001, 2008, 2006],
        "time": [1, 4, 5, 6, 3]}
df = pd.DataFrame(data)
print(df.index % 2 == 0)
print()
print(df[df.index % 2 == 0])
#Output result
[ True False  True False  True]

      fruits  year  time
0      apple  2001     1
2     banana  2001     5
4  kiwifruit  2006     3

If df is a DataFrame:

df.loc[conditional expression involving df["column"]]  #Returns a DataFrame containing only the rows that satisfy the condition.

import numpy as np
import pandas as pd

np.random.seed(0)
columns = ["apple", "orange", "banana", "strawberry", "kiwifruit"]

#Generate a DataFrame and add a column
df = pd.DataFrame()
for column in columns:
    df[column] = np.random.choice(range(1, 11), 10)
df.index = range(1, 11)

#Use filtering to keep only the rows where both the "apple" column and the "kiwifruit" column are 5 or more, and assign the resulting DataFrame to df
df = df.loc[df["apple"]>=5]
df = df.loc[df["kiwifruit"]>=5]

print(df)

#Output result
   apple  orange  banana  strawberry  kiwifruit
1      6       8       6           3         10
5      8       2       5           4          8
8      6       8       4           8          8
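
The two filtering steps above can also be written as one condition. The sketch below uses a small hypothetical DataFrame (not the random data above) and combines the two conditions with the & operator. Each condition needs its own parentheses because & binds more tightly than >=.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"apple": [6, 1, 8, 4],
                   "kiwifruit": [10, 9, 2, 7]},
                  index=[1, 2, 3, 4])

# Filtering in two steps, as in the example above
step = df.loc[df["apple"] >= 5]
step = step.loc[step["kiwifruit"] >= 5]

# The same filtering with a single combined condition
both = df.loc[(df["apple"] >= 5) & (df["kiwifruit"] >= 5)]

print(both)
print(both.equals(step))
```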

Python application: Pandas # 3: Dataframe