For readers who already write analysis code in Python, this article summarizes the corresponding R code. It is updated from time to time, and only the R base package is used.
Many people ask, "I know how to write this in Python, but how do I write it in R?"
Unless otherwise noted, module name aliases are as follows.
python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
The variable names used below are assumed to be defined as follows.
python
df = pd.DataFrame()
R
df = data.frame()
pd.DataFrame() Create a data frame
R
data.frame() # Generate an empty data frame
data.frame(col1=c(x1, x2, x3), col2=c(y1, y2, y3)) # Specify the values column by column
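pd.read_csv() Read a CSV file (comma-separated data)
R
read.csv(file_name) # file_name is the path to the file as a character string
pd.read_table() Read a TSV file (tab-delimited data)
R
read.table(file_name, header = TRUE, sep = "\t") # By default read.table assumes no header row and splits on any whitespace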
Set row names
R
rownames(df) <- c("row_name_1", "row_name_2", ...)
print(rownames(df)) # Called without an assignment, it returns the row names as a vector
Set column names
R
colnames(df) <- c("col_name_1", "col_name_2", ...)
print(colnames(df)) # Called without an assignment, it returns the column names as a vector
df.shape Get the number of rows and columns
R
dim(df)
len(df) Get the number of rows
R
nrow(df)
len(df.columns) Get the number of columns
R
ncol(df)
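For reference, the pandas side of the three size lookups above can be checked with a small sketch. The sample data is hypothetical, and the R calls in the comments are the base functions `dim()`, `nrow()`, and `ncol()`.

```python
import pandas as pd

# Hypothetical sample data: 3 rows, 2 columns.
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})

n_rows, n_cols = df.shape    # R: dim(df)
row_count = len(df)          # R: nrow(df)
col_count = len(df.columns)  # R: ncol(df)

print(n_rows, n_cols, row_count, col_count)
```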
df.head() Output the first rows
R
head(df) # Shows 6 rows by default; the number of rows can be given as an argument, e.g. head(df, 5)
df.tail() Output the last rows
R
tail(df) # Shows 6 rows by default; the number of rows can be given as an argument, e.g. tail(df, 5)
df.info() Display the number and type information of each column
R
str(df)
df.describe() Output basic statistics
R
summary(df) # Note that the standard deviation (std) is not included
# One way to get the standard deviation of each column (assumes every column is numeric):
sds <- sapply(df, sd) # Returns a vector named after the columns
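When comparing the standard deviation between the two languages, it helps to know which denominator is used. The sketch below, with hypothetical sample data, shows that pandas' `std()` (and the `std` row of `describe()`) divides by n - 1, which is also what R's `sd()` does, whereas NumPy's `np.std()` divides by n unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})

pandas_std = df.std()                     # ddof=1 by default, like R's sd()
described_std = df.describe().loc["std"]  # the same values as df.std()
numpy_std = np.std(df["a"].to_numpy())    # ddof=0 by default, so it is smaller

print(pandas_std["a"], numpy_std)
```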
df.isna() Check for missing values (NA)
R
is.na(df)
df.isna().sum() Check the number of missing values (NA) in each column
R
colSums(is.na(df))
# summary(df) also prints the number of NAs per column, so it can be used as a check
df[df.isna().any(axis=1)] Extract rows that have at least one missing value (NA)
R
df[!complete.cases(df), ]
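The missing-value checks above can be tried on the pandas side with a small sketch. The data frame is hypothetical; the comments name the R expressions from this article that play the same role.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

na_per_column = df.isna().sum()            # R: colSums(is.na(df))
rows_with_na = df[df.isna().any(axis=1)]   # R: df[!complete.cases(df), ]

print(na_per_column)
print(rows_with_na)
```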
df.col.unique() Return the unique (non-duplicated) values that appear in a column
R
unique(df$col)
df.col.value_counts() Return the number of times each value appears in a column
R
table(df$col)
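One difference worth keeping in mind is the ordering of the result: `value_counts()` sorts by count in descending order, while R's `table()` orders its result by value. A minimal sketch with hypothetical data, using `sort_index()` to get an ordering closer to what `table()` prints:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"col": ["b", "a", "b", "c", "b", "a"]})

unique_values = df["col"].unique()     # order of first appearance; R: unique(df$col)
counts = df["col"].value_counts()      # sorted by count, descending
counts_by_value = counts.sort_index()  # sorted by value, closer to R's table(df$col)

print(list(unique_values))
print(counts_by_value)
```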
df.iloc[x1:x2, y1:y2] Specify the range using the row number and column number
R
df[x1:x2, y1:y2] # Note that R indices start at 1 and that the end of the range is included
df.iloc[[x1, x2, ...], [y1, y2, ...]] Specify a list using row and column numbers
R
df[c(x1, x2, ...), c(y1, y2, ...)]
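Because `iloc` is 0-based with an exclusive end while R is 1-based with an inclusive end, the numbers cannot be copied across as they are. The helper functions below are hypothetical (they are not part of pandas or R) and only sketch the arithmetic of the conversion.

```python
def iloc_slice_to_r_range(start, stop):
    """Convert a 0-based, end-exclusive slice start:stop to a 1-based, inclusive R range."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    # The start shifts by one; the exclusive stop equals the inclusive 1-based end.
    return start + 1, stop


def iloc_positions_to_r(positions):
    """Convert a list of 0-based positions to 1-based R positions."""
    return [p + 1 for p in positions]


r_first, r_last = iloc_slice_to_r_range(0, 2)  # df.iloc[0:2]    <-> df[1:2, ]
r_positions = iloc_positions_to_r([0, 2])      # df.iloc[[0, 2]] <-> df[c(1, 3), ]

print(r_first, r_last, r_positions)
```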
df.loc[x1:x2, y1:y2] Specify the range using the row name and column name
R
# There seems to be no direct way to do this, so one approach is to
# look up the positions (numbers) of the given row and column names and use them for the range.
x1 <- which(rownames(df) == "row_name_1")
x2 <- which(rownames(df) == "row_name_2")
y1 <- which(colnames(df) == "col_name_1")
y2 <- which(colnames(df) == "col_name_2")
df[x1:x2, y1:y2]
df.loc[[x1, x2, ...], [y1, y2, ...]] Specify a list using row and column names
R
df[c("row_name_1", "row_name_2", ...), c("col_name_1", "col_name_2", ...)]
df[df.col == x] Extract rows that match a condition
R
df[df$col == x, ]
# Or
subset(df, col == x)
df[new_col] = x Add a new column to the data frame
R
df[, new_col] <- x
df.drop() Delete rows and columns
R
# Columns can be deleted by selecting them and assigning NULL.
df[c(y1, y2)] <- NULL # Delete columns
# Rows cannot be deleted by assigning NULL (it raises an error).
# Instead, use the fact that a negative index returns the data frame without those positions:
df <- df[-c(x1, x2), ] # Delete rows
df <- df[, -c(y1, y2), drop = FALSE] # Delete columns (drop = FALSE keeps a data frame even if only one column remains)
df.fillna(x) Fill in missing values (NA)
R
df[is.na(df)] <- x
df.dropna() Delete rows that contain missing values (NA)
R
na.omit(df)
df.apply(func) Apply the function func to each column
R
sapply(df, FUN = func)
df.col.apply(func) Apply the function func to each element of the specified column
R
sapply(df$col, FUN = func)
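The difference between the two forms of `apply` is what the function receives: a whole column in the data frame case, a single value in the column case. A minimal sketch with hypothetical data and hypothetical lambda functions:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# DataFrame.apply: the function receives each column as a Series.
column_ranges = df.apply(lambda s: s.max() - s.min())

# Series.apply: the function receives each element of the column.
doubled = df["a"].apply(lambda v: v * 2)

print(column_ranges)
print(doubled.tolist())
```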
df.T Transpose the matrix
R
t(df) # The result is a matrix; wrap it in as.data.frame() if a data frame is needed
pd.to_datetime() Convert to date type
R
as.Date(df$col) # Date only (e.g. '2020-01-01')
as.POSIXct(df$col) # Date and time (e.g. '2020-01-01 12:00:00')
df.max(), df.min() Find the maximum and minimum values for each column
R
sapply(df, FUN = max)
sapply(df, FUN = min)
# Equivalent processing is possible with apply
apply(df, MARGIN = 2, FUN = max) # With MARGIN = 1, the function (FUN) is applied row by row instead
apply(df, MARGIN = 2, FUN = min) # max(df) returns the maximum over all elements (same for min)
df.groupby([x1, x2, ...]).agg(func) Group and aggregate
R
aggregate(. ~ x1 + x2, df, FUN = sum) # "." aggregates all of the other columns
aggregate(x ~ x1 + x2, df, FUN = sum) # Aggregates only the column specified by "x"
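For comparison, the two `aggregate` calls above roughly correspond to the following pandas code. This is a sketch with hypothetical sample data; the column names `x1`, `x2`, `x`, and `y` are chosen only to mirror the formulas, and the row order of the R output may differ.

```python
import pandas as pd

# Hypothetical sample data: x1 and x2 are grouping keys, x and y are values.
df = pd.DataFrame({
    "x1": ["a", "a", "b", "b"],
    "x2": ["p", "p", "q", "r"],
    "x": [1, 2, 3, 4],
    "y": [10, 20, 30, 40],
})

# Roughly: aggregate(. ~ x1 + x2, df, FUN = sum)
all_sums = df.groupby(["x1", "x2"], as_index=False).sum()

# Roughly: aggregate(x ~ x1 + x2, df, FUN = sum)
x_sums = df.groupby(["x1", "x2"], as_index=False)["x"].sum()

print(all_sums)
print(x_sums)
```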
pd.pivot_table(df, index, columns, values) As far as we know, there is no direct equivalent in the base package.
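To make clear what would need to be reproduced on the R side, here is a small sketch of `pd.pivot_table` with hypothetical data and column names. The `tapply` call in the comment is only a suggestion of a base-R approximation, not something taken from the original article, and it returns a matrix rather than a data frame.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "index_col": ["a", "a", "b", "b"],
    "columns_col": ["p", "q", "p", "q"],
    "values_col": [1, 2, 3, 4],
})

pivot = pd.pivot_table(
    df,
    index="index_col",
    columns="columns_col",
    values="values_col",
    aggfunc="sum",
)

# Possible base-R approximation (assumption):
# tapply(df$values_col, list(df$index_col, df$columns_col), FUN = sum)
print(pivot)
```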