pandas is built on top of NumPy, which is convenient because NumPy operations can be used on its data as they are. However, extracting rows and columns is hard to understand until you get used to it. I'm still not comfortable with it, so I'm writing it down.
pandas has two main data structures, DataFrame and Series. The former holds two-dimensional data and the latter one-dimensional data. I rarely use Series on its own, so I will focus on DataFrame. Note that when a single column is selected from a DataFrame, the result is a Series.
# DataFrame
foo bar
a 0 1
b 2 3
c 4 5
# Series
a 0
b 2
c 4
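As a quick check of the relationship between the two types, the following minimal sketch builds the DataFrame shown above and pulls out one column. The variable names `df` and `col` are only for illustration.

```python
import pandas as pd

# Build the 3x2 DataFrame shown above.
df = pd.DataFrame({'foo': [0, 2, 4], 'bar': [1, 3, 5]},
                  index=['a', 'b', 'c'])

# Selecting a single column by name returns a Series.
col = df['foo']
print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
print(col.tolist())        # [0, 2, 4]
```

The Series keeps the same index labels (a, b, c) as the DataFrame it came from.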
In a DataFrame, an element's position can be given in two ways: by NumPy-style integer position (nth row, mth column), or by user-defined labels set through index and columns. Unless otherwise specified, sequential integers starting from 0 are assigned as labels, but I don't rely on them in practice because that usage is no different from NumPy. Personally, I also wonder whether numbers are a good choice for the index.
To specify index and columns, do as follows.
df.columns = ['foo', 'bar']
df.index = ['a', 'b', 'c']
To check the index and column names of a DataFrame, do as follows.
df.columns
df.index
df.info() # columns, index, memory usage
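To see the default numbering and the relabeling side by side, here is a small sketch. It assumes the data comes from a plain NumPy array; the names `default_columns` and `default_index` are made up for this example.

```python
import numpy as np
import pandas as pd

# With no labels given, pandas numbers both axes 0, 1, 2, ...
df = pd.DataFrame(np.arange(6).reshape(3, 2))
default_columns = list(df.columns)  # [0, 1]
default_index = list(df.index)      # [0, 1, 2]

# Replace the default numbers with user-defined labels.
df.columns = ['foo', 'bar']
df.index = ['a', 'b', 'c']

print(df.columns)
print(df.index)
```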
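In a DataFrame, __getitem__ (the square-bracket operator) selects columns. Passing a single column name returns a Series, and passing a list of names returns a DataFrame, even when the list has only one element. The df[[0]] form (a list of column numbers) worked in the old pandas version this note was written against; current versions raise a KeyError unless 0 is actually a column label, so use iloc to select columns by position. A single row label or row number cannot be specified this way.
In the case of a Series, __getitem__ selects by index. That is natural because there is only one column.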
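df['foo']           # single column by name (returns a Series)
df.iloc[:, [0]]     # single column by position (df[[0]] in old pandas)
df[['foo', 'bar']]  # multiple columns by name
df.iloc[:, [0, 1]]  # multiple columns by position (df[[0, 1]] in old pandas)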
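The difference between passing a name and passing a list can be checked with a sketch like this one, which reuses the example DataFrame from the top of this note. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'foo': [0, 2, 4], 'bar': [1, 3, 5]},
                  index=['a', 'b', 'c'])

single = df['foo']             # one name -> Series
as_frame = df[['foo']]         # list with one name -> one-column DataFrame
multi = df[['foo', 'bar']]     # list of names -> DataFrame
by_position = df.iloc[:, [0]]  # column by position -> DataFrame

print(type(single).__name__, type(as_frame).__name__)
```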
As mentioned above, a DataFrame has two kinds of element position information: integer positions and user-defined labels. Three indexers, ix, iloc, and loc, make explicit which one is used for extraction. iloc selects by integer position only, loc selects by label only, and ix accepts both. Note that ix was deprecated in pandas 0.20.0 and removed in 1.0.0, so only iloc and loc work on current versions. Taking the example above, the element at [0, 0] can be extracted as follows.
# ix: positions and labels in any mix (pandas < 1.0 only)
df.ix[[0], [0]]
df.ix[[0], ['foo']]
df.ix[['a'], ['foo']]
df.ix[['a'], [0]]
# iloc: positions only; loc: labels only
df.iloc[[0], [0]]
df.loc[['a'], ['foo']]
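For reference, here is a minimal sketch of the iloc and loc forms that should run on current pandas. It also shows that list arguments give a 1x1 DataFrame while bare values give the element itself. The variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'foo': [0, 2, 4], 'bar': [1, 3, 5]},
                  index=['a', 'b', 'c'])

# Bare values return the element itself.
scalar_by_label = df.loc['a', 'foo']
scalar_by_pos = df.iloc[0, 0]

# Lists return a DataFrame, here of shape (1, 1).
frame_by_label = df.loc[['a'], ['foo']]
frame_by_pos = df.iloc[[0], [0]]

print(scalar_by_label, frame_by_label.shape)
```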
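By the way, to select multiple rows, you can slice as follows (iloc is used here because ix no longer exists in current pandas).
df.iloc[:, [0]]    # all rows
df.iloc[1:5, [0]]  # a range of rows by position (end excluded)
df.iloc[:]         # row slice only; all columns are returned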
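One thing worth checking when slicing is that iloc and loc treat the end of a slice differently: iloc excludes it, as Python lists do, while loc includes the end label. A small sketch with the same example data (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'foo': [0, 2, 4], 'bar': [1, 3, 5]},
                  index=['a', 'b', 'c'])

by_position = df.iloc[0:2]   # positions 0 and 1; position 2 is excluded
by_label = df.loc['a':'c']   # labels a through c; the end label is included
one_column = df.iloc[:, [0]] # all rows, first column only

print(list(by_position.index))
print(list(by_label.index))
```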
To extract the rows where a given column meets a condition, do as follows. All columns of the matching rows are returned.
print(foo.loc[foo['bar'] == condition])
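What happens here is that the comparison produces a boolean Series, and loc keeps the rows where it is True. The sketch below uses made-up data and an arbitrary condition (`bar >= 1`) purely for illustration.

```python
import pandas as pd

foo = pd.DataFrame({'bar': [0, 1, 2], 'baz': [3, 4, 5]},
                   index=['a', 'b', 'c'])

# The comparison yields one True/False value per row.
mask = foo['bar'] >= 1

# loc keeps only the rows where the mask is True, with all columns.
selected = foo.loc[mask]
print(selected)
```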
Alternatively, as an indirect approach, you can turn every element that does not meet the condition into NaN and then delete the columns that contain NaN.
foo = foo[foo == 1]  # every element that does not meet the condition becomes NaN
foo = foo.dropna(axis=1)  # drop the columns that contain NaN
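A step-by-step sketch of this NaN approach, with made-up data in which only the bar column consists entirely of 1s:

```python
import pandas as pd

foo = pd.DataFrame({'bar': [1, 1, 1], 'baz': [1, 0, 1]},
                   index=['a', 'b', 'c'])

# Elements that are not equal to 1 are replaced with NaN.
masked = foo[foo == 1]

# axis=1 drops every column that contains at least one NaN.
kept = masked.dropna(axis=1)
print(list(kept.columns))
```

Only the columns in which every element met the condition survive, so this is a column filter rather than a row filter.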
To iterate over each row of a pd.DataFrame:
for index, row in df.iterrows():
    print(index, row)  # row is a pd.Series
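To confirm what iterrows yields, this sketch collects the index label, the type of each row object, and one value from it. The list name `seen` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'foo': [0, 2, 4], 'bar': [1, 3, 5]},
                  index=['a', 'b', 'c'])

seen = []
for index, row in df.iterrows():
    # Each row arrives as a Series whose index is the column names.
    seen.append((index, type(row).__name__, int(row['foo'])))

print(seen)
```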
# Creating an empty DataFrame with only the columns defined
foo = pd.DataFrame(columns=['bar', 'baz'])
# Creating a DataFrame with data
foo = pd.DataFrame({'bar': [0, 1, 2],
                    'baz': [3, 4, 5]},
                   index=['a', 'b', 'c'])
# foo
bar baz
a 0 3
b 1 4
c 2 5
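The two ways of creating a DataFrame can be compared with a sketch like the following, where `empty` is a name introduced only for this example.

```python
import pandas as pd

# Columns only: zero rows, two columns.
empty = pd.DataFrame(columns=['bar', 'baz'])

# From a dict: keys become column names, values become column data.
foo = pd.DataFrame({'bar': [0, 1, 2],
                    'baz': [3, 4, 5]},
                   index=['a', 'b', 'c'])

print(empty.shape)
print(foo.shape)
```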
Adding a new column is easier than adding a row.
foo['qux'] = [6, 7, 8]
# foo
bar baz qux
a 0 3 6
b 1 4 7
c 2 5 8
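When adding a column, a list must have exactly one value per row, while a single value is copied to every row. A minimal sketch (the column names `flag` and `bad` are made up for illustration):

```python
import pandas as pd

foo = pd.DataFrame({'bar': [0, 1, 2], 'baz': [3, 4, 5]},
                   index=['a', 'b', 'c'])

foo['qux'] = [6, 7, 8]  # a list needs one value per row
foo['flag'] = True      # a single value is broadcast to every row

try:
    foo['bad'] = [1, 2]  # wrong length for a 3-row DataFrame
    length_error = False
except ValueError:
    length_error = True

print(foo)
```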
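# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
foo = pd.concat([foo, pd.DataFrame({'bar': [6, 7], 'baz': [8, 9]}, index=['d', 'e'])])
# foo (starting from the bar/baz frame, before qux was added)
# If you want different index labels, you need to specify them yourself.
bar baz
a 0 3
b 1 4
c 2 5
d 6 8
e 7 9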
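The following sketch adds the same two rows with pd.concat, which works on current pandas, and shows how the index is handled. By default the labels of both frames are kept as they are; passing `ignore_index=True` renumbers the rows from 0 instead. The variable names are illustrative.

```python
import pandas as pd

foo = pd.DataFrame({'bar': [0, 1, 2], 'baz': [3, 4, 5]},
                   index=['a', 'b', 'c'])
extra = pd.DataFrame({'bar': [6, 7], 'baz': [8, 9]}, index=['d', 'e'])

# Default: the index labels of both frames are kept.
combined = pd.concat([foo, extra])

# ignore_index=True discards the labels and renumbers from 0.
renumbered = pd.concat([foo, extra], ignore_index=True)

print(combined)
```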
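foo.drop('e')  # Delete the row. Returns a new DataFrame; foo itself is unchanged.
foo.drop('bar', axis=1)  # Delete the column. Also returns a new DataFrame.
del foo['bar']  # Delete the column in place (this is Python's del statement).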
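The difference between drop and del is easy to miss, so here is a sketch that checks it: drop returns a new DataFrame and leaves the original alone, while del modifies the original. The variable names are made up for this example.

```python
import pandas as pd

foo = pd.DataFrame({'bar': [0, 1, 2], 'baz': [3, 4, 5]},
                   index=['a', 'b', 'c'])

without_row = foo.drop('c')          # new DataFrame without row c
without_col = foo.drop('bar', axis=1)  # new DataFrame without column bar
shape_after_drop = foo.shape         # foo still has all rows and columns

del foo['bar']                       # removes the column from foo itself
print(list(foo.columns))
```

To keep the result of drop, assign it back, for example `foo = foo.drop('c')`.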
Reference URL: http://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas
Reference URL: http://sinhrks.hatenablog.com/entry/2014/11/12/233216