A DataFrame is a two-dimensional data structure that looks like a bundle of multiple Series. You can generate a DataFrame by passing a list of Series to pd.DataFrame(). The rows are automatically numbered from 0 in ascending order.
pd.DataFrame([Series, Series, ...])
A DataFrame can also be generated from a dictionary whose values are lists. Keep in mind that every list must have the same length.
import pandas as pd
#Each key becomes a column name and each list becomes the values of that column
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006
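As a quick check of the equal-length rule, the following minimal sketch passes lists of different lengths. The variable name bad_data and its contents are made up for illustration; pandas is expected to reject this input with a ValueError.

```python
import pandas as pd

# Hypothetical data: "year" has one element fewer than "fruits"
bad_data = {"fruits": ["apple", "orange", "banana"],
            "year": [2001, 2002]}

try:
    pd.DataFrame(bad_data)
    raised = False
except ValueError as err:
    # pandas refuses to build a DataFrame from lists of unequal length
    raised = True
    print("ValueError:", err)
```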
import pandas as pd
index = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
data1 = [10, 5, 8, 12, 3]
data2 = [30, 25, 12, 10, 8]
series1 = pd.Series(data1, index=index)
series2 = pd.Series(data2, index=index)
df = pd.DataFrame([series1,series2])
print(df)
#Output result
   apple  orange  banana  strawberry  kiwifruit
0     10       5       8          12          3
1     30      25      12          10          8
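Note that each Series passed in a list becomes one row. If you would rather have each Series become a column, one option is to pass a dictionary of Series instead. The sketch below reuses the same sample values; the column names "stock1" and "stock2" are made up for illustration.

```python
import pandas as pd

index = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
series1 = pd.Series([10, 5, 8, 12, 3], index=index)
series2 = pd.Series([30, 25, 12, 10, 8], index=index)

# A list of Series: each Series becomes one row
df_rows = pd.DataFrame([series1, series2])
print(df_rows.shape)

# A dictionary of Series: each Series becomes one column,
# and the shared Series index becomes the row index
df_cols = pd.DataFrame({"stock1": series1, "stock2": series2})
print(df_cols.shape)
```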
In a DataFrame, the row labels are called the index and the column labels are called the columns.
When a DataFrame is created without specifying an index, integers are assigned to the index in ascending order from 0.
The columns are taken from the index of the source Series or from the keys of the source dictionary.
The index of a DataFrame variable df can be set by assigning a list with the same length as the number of rows to df.index. Likewise, the columns of df can be set by assigning a list with the same length as the number of columns to df.columns.
df.index = ["name1", "name2"]
import pandas as pd
index = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
data1 = [10, 5, 8, 12, 3]
data2 = [30, 25, 12, 10, 8]
series1 = pd.Series(data1, index=index)
series2 = pd.Series(data2, index=index)
df = pd.DataFrame([series1, series2])
#Set df index to start at 1
df.index=[1,2]
print(df)
#Output result
   apple  orange  banana  strawberry  kiwifruit
1     10       5       8          12          3
2     30      25      12          10          8
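The same assignment works for df.columns. The following sketch, with made-up labels such as "row1", sets both the index and the columns, and also shows what happens when the list length does not match (pandas raises a ValueError).

```python
import pandas as pd

# Without labels, both the index and the columns default to 0, 1, 2, ...
df = pd.DataFrame([[10, 5, 8], [30, 25, 12]])
default_index = list(df.index)
default_columns = list(df.columns)

# Assign new labels; the list lengths must match the number of rows / columns
df.index = ["row1", "row2"]
df.columns = ["apple", "orange", "banana"]
print(df)

# A list of the wrong length is rejected
try:
    df.columns = ["apple", "orange"]
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
```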
If you want to add one row of data, use append(). Here, df is the DataFrame and series is the Series to add. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the sample code below runs only on older versions; on current versions, pd.concat() is the replacement.
First, prepare a Series whose index corresponds to the columns of the DataFrame. Then, if you call append() as shown in the sample code below, one row of data is added to df.
If you add a Series whose index does not match the columns of the DataFrame (for example, adding a Series with index ["fruits", "time", "year", "date"] to a DataFrame with columns ["fruits", "time", "year"]),
a new column is added to df, and the rows that have no value in that column are treated as missing values and automatically filled with NaN (Not a Number).
In this case, if the Series has no name attribute, an error occurs unless ignore_index=True is specified. The name attribute is the name attached to the Series itself (not a column name).
import pandas as pd
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])
#append() is available only in pandas versions before 2.0
df = df.append(series, ignore_index=True) #ignore_index=True is required because series has no name.
print(df)
#Output result
       fruits  time  year  date
0       apple     1  2001   NaN
1      orange     4  2002   NaN
2      banana     5  2001   NaN
3  strawberry     6  2008   NaN
4   kiwifruit     3  2006   NaN
5       mango     7  2008  12/1
#Series without name attribute
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])
print(series)
#Output result
fruits    mango
year       2008
time          7
date       12/1
dtype: object
#Series with name attribute
series = pd.Series(["mango", 2008, 7, "12/1"], index=["fruits", "year", "time", "date"])
series.name = 'test'
print(series)
#Output result
fruits    mango
year       2008
time          7
date       12/1
Name: test, dtype: object
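On pandas 2.0 and later, where DataFrame.append() is no longer available, the same row can be added with pd.concat(). The following is a minimal sketch of one way to do it, not the only one: the Series is turned into a one-row DataFrame with to_frame().T (the variable name new_row is made up for illustration) and then concatenated along the rows.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
series = pd.Series(["mango", 2008, 7, "12/1"],
                   index=["fruits", "year", "time", "date"])

# Turn the Series into a one-row DataFrame (its index becomes the columns)
new_row = series.to_frame().T

# Concatenate along the rows; ignore_index=True renumbers the rows from 0
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

As with append(), the "date" column that exists only in the new row is added to the result, and the existing rows are filled with NaN in that column.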
You may want to add a new item (column) to an existing DataFrame.
When the DataFrame variable is df:
df["new column name"] = values #Assigning a list or a Series adds a new column.
When a list is assigned, its elements are assigned in order starting from the first row of df. When a Series is assigned, the index of the Series is matched against the index of df.
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
#Add a "price" column; the list is assigned in order from the first row
df["price"] = [150, 120, 100, 300, 150]
print(df)
#Output result
       fruits  time  year  price
0       apple     1  2001    150
1      orange     4  2002    120
2      banana     5  2001    100
3  strawberry     6  2008    300
4   kiwifruit     3  2006    150
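The difference between assigning a list and assigning a Series is easiest to see with a Series whose index is in a different order. In the sketch below, the "stock" column and its values are made up for illustration: the Series is matched by index label, a row with no matching label gets NaN, and a label that does not exist in df is ignored.

```python
import pandas as pd

df = pd.DataFrame({"fruits": ["apple", "orange", "banana"]})

# A list is assigned by position: first element to the first row, and so on
df["price"] = [150, 120, 100]

# A Series is assigned by index label, not by position.
# Label 1 is missing (becomes NaN); label 5 is not in df (ignored).
stock = pd.Series([7, 3, 9], index=[2, 0, 5])
df["stock"] = stock
print(df)
```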
Data in a DataFrame can be referenced by specifying rows and columns. What is returned changes, as shown in the figure below, depending on how the rows and columns are specified.
There are several ways to reference data; for the time being, we will cover loc and iloc.
loc references data by name (label), and iloc references data by number (position).
Use loc to reference DataFrame data by index label and column name.
When the DataFrame variable is df:
df.loc[list of index labels, list of column names] #Returns the corresponding range of the DataFrame.
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006
#Specify the index list [1, 2] and the column list ["time", "year"] for the DataFrame above
df = df.loc[[1,2],["time","year"]] #Note the double brackets: each argument is a list
print(df)
#Output result
   time  year
1     4  2002
2     5  2001
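Because loc works with labels, it behaves the same way when the index holds strings rather than integers. The following sketch assumes the "fruits" column is moved into the index with set_index(), which is a choice made here for illustration and is not part of the example above.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
# Use the fruit names as the row labels
df = pd.DataFrame(data).set_index("fruits")

# Lists of labels return a DataFrame
subset = df.loc[["orange", "banana"], ["year"]]
print(subset)

# A single row label and a single column name return one value
value = df.loc["kiwifruit", "time"]
print(value)
```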
Use iloc to reference DataFrame data by row number and column number.
When the DataFrame variable is df:
df.iloc[list of row numbers, list of column numbers] #Returns the corresponding range of the DataFrame.
Numbers start at 0 for both rows and columns. In addition to passing lists, you can also use slice notation.
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006
#Specify the row number list [1, 3] and the column number list [0, 2] for the DataFrame above
df = df.iloc[[1, 3], [0, 2]]
print(df)
#Output result
       fruits  year
1      orange  2002
3  strawberry  2008
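Slice notation behaves slightly differently in the two accessors, which is worth checking once. In the sketch below (variable names by_position and by_label are made up), an iloc slice excludes its end position, as ordinary Python slices do, while a loc slice includes its end label.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# iloc slices by position and excludes the end: rows 1-2, columns 0-1
by_position = df.iloc[1:3, 0:2]
print(by_position)

# loc slices by label and includes the end: rows 1-3, columns "time" through "year"
by_label = df.loc[1:3, "time":"year"]
print(by_label)
```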
If the DataFrame variable is df, you can delete rows or columns by passing index labels or column names to df.drop().
Passing a list deletes all of the listed rows or columns at once.
However, with this form you cannot delete rows and columns in the same call. To delete columns, specify axis=1 as the second argument.
axis=1 #Specify this when deleting columns.
import pandas as pd
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
"time": [1, 4, 5, 6, 3],
"year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
#Use drop() to delete rows 0 and 1 of df
df_1 = df.drop(range(0, 2))
#Use drop() to delete the "year" column of df
df_2 = df.drop("year", axis=1)
print(df_1)
print()
print(df_2)
#Output result
       fruits  time  year
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006

       fruits  time
0       apple     1
1      orange     4
2      banana     5
3  strawberry     6
4   kiwifruit     3
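Two details of drop() are easy to miss: a list can name several rows or several columns at once, and drop() returns a new DataFrame rather than changing df itself. A small sketch with the same sample data (the names df_rows and df_cols are made up for illustration):

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# Delete the rows labeled 0 and 4
df_rows = df.drop([0, 4])

# Delete the "time" and "year" columns
df_cols = df.drop(["time", "year"], axis=1)

# drop() returned new DataFrames; df still has every row and column
print(df.shape, df_rows.shape, df_cols.shape)
```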
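To sort the rows of a DataFrame by the values in one or more columns, use sort_values(). When the DataFrame variable is df:
df.sort_values(by="column name or list of column names") #Sorts the data by the given column(s).
ascending=True #This argument sorts the column values in ascending (smallest first) order. It is the default.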
import pandas as pd
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
"time": [1, 4, 5, 6, 3],
"year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
4   kiwifruit     3  2006
#Sort data in ascending order (specify a column as the argument)
df = df.sort_values(by="year", ascending=True)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
2      banana     5  2001
1      orange     4  2002
4   kiwifruit     3  2006
3  strawberry     6  2008
#Sort data in ascending order (specify a list of columns as the argument)
df = df.sort_values(by=["time", "year"], ascending=True)
print(df)
#Output result
       fruits  time  year
0       apple     1  2001
4   kiwifruit     3  2006
1      orange     4  2002
2      banana     5  2001
3  strawberry     6  2008
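For completeness, here is a short sketch of the descending case. Passing ascending=False reverses the order, and when by is a list, ascending can also be a list with one flag per column. The variable names df_desc and df_mixed are made up for illustration.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# Sort by "year" in descending (largest first) order
df_desc = df.sort_values(by="year", ascending=False)
print(df_desc)

# Sort by "year" ascending, and break ties by "time" descending
df_mixed = df.sort_values(by=["year", "time"], ascending=[True, False])
print(df_mixed)
```

When several columns are given, the first column is the primary sort key and the later ones are used only to order rows that tie on the earlier keys.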
As with a Series, you can filter a DataFrame by passing a sequence of bool values; only the rows where the value is True are extracted.
Also as with a Series, a conditional expression on a DataFrame produces a bool sequence, which you can use for filtering.
For example, the code below extracts only the rows whose index is even.
data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)
#The conditional expression produces one bool value per row
print(df.index % 2 == 0)
print()
print(df[df.index % 2 == 0])
#Output result
[ True False  True False  True]

       fruits  time  year
0       apple     1  2001
2      banana     5  2001
4   kiwifruit     3  2006
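The bool sequence does not have to come from the index. The sketch below, with made-up variable names, first passes a hand-written list of bool values and then a condition on the "year" column, which yields a bool Series of the same length as df.

```python
import pandas as pd

data = {"fruits": ["apple", "orange", "banana", "strawberry", "kiwifruit"],
        "time": [1, 4, 5, 6, 3],
        "year": [2001, 2002, 2001, 2008, 2006]}
df = pd.DataFrame(data)

# A plain list of bool values: one entry per row, True rows are kept
mask = [True, False, False, True, False]
picked = df[mask]
print(picked)

# A condition on a column gives a bool Series that works the same way
condition = df["year"] >= 2006
recent = df[condition]
print(recent)
```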
You can also filter rows with loc by passing a conditional expression. When the DataFrame variable is df:
df.loc[conditional expression involving df["column"]] #Returns a DataFrame made up of the rows that satisfy the condition.
import numpy as np
import pandas as pd
np.random.seed(0)
columns = ["apple", "orange", "banana", "strawberry", "kiwifruit"]
#Generate a DataFrame and add a column
df = pd.DataFrame()
for column in columns:
    df[column] = np.random.choice(range(1, 11), 10)
df.index = range(1, 11)
#Use filtering to assign to df a DataFrame containing only the rows whose "apple" value is 5 or more and whose "kiwifruit" value is 5 or more
df = df.loc[df["apple"]>=5]
df = df.loc[df["kiwifruit"]>=5]
print(df)
#Output result
   apple  orange  banana  strawberry  kiwifruit
1      6       8       6           3         10
5      8       2       5           4          8
8      6       8       4           8          8
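The two filtering steps can also be written as one expression by combining the conditions with & (and) or | (or). The sketch below uses a small hand-written DataFrame instead of random values so the result is easy to follow; the data and the variable names are made up for illustration. Each condition needs its own parentheses because & and | bind more tightly than comparison operators.

```python
import pandas as pd

df = pd.DataFrame({"apple": [6, 1, 8, 6, 3],
                   "kiwifruit": [10, 9, 8, 2, 7]},
                  index=range(1, 6))

# Two-step filtering
chained = df.loc[df["apple"] >= 5]
chained = chained.loc[chained["kiwifruit"] >= 5]

# The same filter in one step: both conditions must hold
combined = df.loc[(df["apple"] >= 5) & (df["kiwifruit"] >= 5)]
print(combined)

# Rows where at least one of the conditions holds
either = df.loc[(df["apple"] >= 5) | (df["kiwifruit"] >= 5)]
print(either)
```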