For readers who already write analysis code in Python, this article summarizes the corresponding R code. It is updated from time to time. (Only the R base package is used in this article.)

Many people ask, "I know how to write this in Python, but how do I write it in R?"

Naming conventions in the document

Unless otherwise noted, module name aliases are as follows.

`python`


import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

The variable names that appear below are assumed to be defined as follows.

`python`


df = pd.DataFrame()

`R`


df <- data.frame()

How to write Python code in R

Data frame generation

pd.DataFrame() Create a data frame

`R`


data.frame() #Generate an empty data frame
data.frame(col1=c(x1, x2, x3), col2=c(y1, y2, y3)) #Specify the values column by column

pd.read_csv() Read CSV file (comma separated data)

`R`


read.csv(file name)

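pd.read_table() Read a TSV file (tab-delimited data)

`R`


read.table(file name, header=TRUE, sep="\t") #Tab-delimited with a header row; read.delim(file name) is a shortcut with similar defaults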

df.index = [row name 1, row name 2, ...]

Set the row names

`R`


rownames(df) <- c(Row name 1, Row name 2, ...)
print(rownames(df)) #Calling it without assignment returns the row names as a vector

df.columns = [column name 1, column name 2, ...]

Set the column names

`R`


colnames(df) <- c(Column name 1, Column name 2, ...)
print(colnames(df)) #Calling it without assignment returns the column names as a vector

Check the contents of the data frame

df.shape Get the number of rows and columns

`R`


dim(df)

len(df) Get the number of rows

`R`


nrow(df)

len(df.columns) Get the number of columns

`R`


ncol(df)
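
Because rows and columns are easy to mix up here, the following is a minimal sketch of the Python side. The data frame and its column names are made up for illustration; the comments name the R call from the list above that should return the same number.

```python
import pandas as pd

# Hypothetical example data: 3 rows and 2 columns
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})

n_rows, n_cols = df.shape     # R: dim(df)
row_count = len(df)           # R: nrow(df)
col_count = len(df.columns)   # R: ncol(df)

print(n_rows, n_cols, row_count, col_count)
```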

df.head() Output the first rows

`R`


head(df) #The number of rows to display can also be specified with an argument

df.tail() Output the last rows

`R`


tail(df) #The number of rows to display can also be specified with an argument

df.info() Display the number of elements and the type of each column

`R`


str(df)

df.describe() Output basic statistics

`R`


summary(df) #However, std is not output

#To get std, for example:
sds <- sapply(df, sd) #Named vector with the standard deviation of each column (assumes every column is numeric)
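
For comparison, here is a small sketch of what the Python side returns, using made-up numbers. Both `df.std()` in pandas and `sd()` in R compute the sample standard deviation (dividing by n - 1), so the values should agree.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 20.0, 20.0]})

stats = df.describe()   # count, mean, std, min, quartiles, max for each column
stds = df.std()         # sample standard deviation (ddof=1), like sd() in R

print(stats)
print(stds)
```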

df.isna() Check for missing values (NA)

`R`


is.na(df)

df.isna().sum()

Check the number of missing values (NA) for each column

`R`


colSums(is.na(df))
#summary(df) also outputs the number of NAs, so you can check it there as well

df[df.isna().any(axis=1)] Extract rows that have at least one missing value (NA)

`R`


df[!complete.cases(df), ]
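
The three missing-value checks above can be tried on a small example. This is only a sketch with made-up data; the comments show the R call from the list that it corresponds to.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

na_mask = df.isna()                        # R: is.na(df)
na_per_column = df.isna().sum()            # R: colSums(is.na(df))
rows_with_na = df[df.isna().any(axis=1)]   # R: df[!complete.cases(df), ]

print(na_per_column)
print(rows_with_na)
```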

df.col.unique() Returns the unique (non-duplicated) values that appear in a column

`R`


unique(df$col)

df.col.value_counts() Returns the number of occurrences of each value that appears in a column

`R`


table(df$col)
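
A short sketch of the Python side, with a made-up column. One difference to keep in mind: `value_counts()` sorts by count in descending order, while `table()` in R is ordered by the values themselves.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col": ["a", "b", "a", "c", "a"]})

uniques = df["col"].unique()        # R: unique(df$col), in order of first appearance
counts = df["col"].value_counts()   # R: table(df$col), but sorted by count here

print(list(uniques))
print(counts)
```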

Data extraction

df.iloc[x1:x2, y1:y2] Specify the range using the row number and column number

`R`


df[x1:x2, y1:y2] #Note that R indexes start at 1 and the end of the range is included

df.iloc[[x1, x2, ...], [y1, y2, ...]] Specify a list using row and column numbers

`R`


df[c(x1, x2, ...), c(y1, y2, ...)]
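
Since the index base and the handling of the range end differ between the two languages, a concrete case may help. The following is a sketch with made-up data; the R expressions in the comments are the ones that should select the same cells.

```python
import pandas as pd

# Hypothetical example data: 4 rows, 3 columns
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": [5, 6, 7, 8]})

# Python positions start at 0 and the end of a slice is excluded.
# R: df[1:2, 1:2]
block = df.iloc[0:2, 0:2]

# Listing positions explicitly.
# R: df[c(1, 4), c(1, 3)]
picked = df.iloc[[0, 3], [0, 2]]

print(block)
print(picked)
```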

df.loc[row name 1:row name 2, column name 1:column name 2]

Specify the range using the row name and column name

`R`


#There does not seem to be a direct equivalent, so if you need this,
#get the positions (numbers) of the specified row and column names and use them to specify the range.
x1 <- which(rownames(df) == Row name 1)
x2 <- which(rownames(df) == Row name 2)
y1 <- which(colnames(df) == Column name 1)
y2 <- which(colnames(df) == Column name 2)
df[x1:x2, y1:y2]
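
The same "look up the position, then slice by number" idea can be written in Python, which makes it easier to see what the R code above is doing. This is a sketch with made-up row and column names. Note that a `loc` label slice includes both ends, like a range in R.

```python
import pandas as pd

# Hypothetical example data with named rows
df = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [5, 6, 7, 8], "z": [9, 10, 11, 12]},
    index=["r1", "r2", "r3", "r4"],
)

# Label-based range: both end points are included
by_label = df.loc["r2":"r3", "x":"y"]

# Position-based version, mirroring the which() approach in R
x1 = df.index.get_loc("r2")
x2 = df.index.get_loc("r3")
y1 = df.columns.get_loc("x")
y2 = df.columns.get_loc("y")
by_position = df.iloc[x1:x2 + 1, y1:y2 + 1]  # +1 because iloc excludes the end

print(by_label)
```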

df.loc[[row name 1, row name 2, ...], [column name 1, column name 2, ...]]

Specify a list using row and column names

`R`


df[c(Row name 1, Row name 2, ...), c(Column name 1, Column name 2, ...)]

df[df.col == x] Extract rows that match the conditions

`R`


df[df$col == x, ]
#Or
subset(df, col == x)
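
A small sketch of the Python side, with made-up data. One point worth knowing when comparing results: if the column contains NA, `df[df$col == x, ]` in R returns extra rows filled with NA, while `subset()` drops them, so the two R forms are not always identical.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col": [1, 2, 1], "val": ["x", "y", "z"]})

# R: df[df$col == 1, ]  or  subset(df, col == 1)
matched = df[df["col"] == 1]

print(matched)
```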

Data processing

df[new_col] = x Add a new column to the data frame

`R`


df[, new_col] <- x

df.drop() Delete rows and columns

`R`


#You can delete columns by selecting them and assigning NULL (this does not work for rows).
df[c(y1, y2)] <- NULL #Delete columns

#A negative index returns the data frame without that row or column, so you can also write:
df <- df[c(-1, -2), ] #Delete rows
df <- df[, c(-1, -2)] #Delete columns
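
For reference, a sketch of the Python side with made-up data. `drop()` works with labels rather than positions; in this example the row labels 0 and 1 happen to match the first two positions. In R, if only one column remains after `df[, c(-1, -2)]`, the result is simplified to a vector unless `drop = FALSE` is added.

```python
import pandas as pd

# Hypothetical example data with the default row labels 0, 1, 2
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_rows = df.drop(index=[0, 1])          # R: df[c(-1, -2), ]
without_cols = df.drop(columns=["a", "b"])    # R: df[, c(-1, -2), drop = FALSE]

print(without_rows)
print(without_cols)
```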

df.fillna(x) Fill in missing values (NA)

`R`


df[is.na(df)] <- x

df.dropna() Delete rows that contain missing values (NA)

`R`


na.omit(df)
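
The following sketch shows the two Python calls on made-up data. Both return a new data frame and leave `df` unchanged. On the R side, `na.omit(df)` also returns a new data frame, but `df[is.na(df)] <- x` modifies `df` itself.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

filled = df.fillna(0)    # R: df[is.na(df)] <- 0  (applied to a copy of df)
dropped = df.dropna()    # R: na.omit(df)

print(filled)
print(dropped)
```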

df.apply(func) Apply the function func to each column

`R`


sapply(df, FUN = func)

df.col.apply(func) Apply the function func to each element of the specified column

`R`


sapply(df$col, FUN = func)
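
The difference between applying a function to a whole data frame and to a single column can be seen in a small sketch. The helper functions `value_range` and `double` are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})


def value_range(column):
    """Hypothetical helper: receives one whole column at a time."""
    return column.max() - column.min()


def double(value):
    """Hypothetical helper: receives one element at a time."""
    return value * 2


per_column = df.apply(value_range)    # R: sapply(df, FUN = func)
per_element = df["a"].apply(double)   # R: sapply(df$a, FUN = func)

print(per_column)
print(per_element)
```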

df.T Transpose rows and columns

`R`


t(df) #Returns a matrix; wrap it in as.data.frame() if you need a data frame

pd.to_datetime() Convert to date type

`R`


as.Date(df$col) #Date only (e.g. '2020-01-01')

Data aggregation

df.max(), df.min() Find the maximum and minimum values for each column

`R`


sapply(df, FUN = max)
sapply(df, FUN = min)

#Equivalent processing is possible with apply
apply(df, MARGIN = 2, FUN = max) #With MARGIN = 1, the function (FUN) is applied row by row instead
apply(df, MARGIN = 2, FUN = min) #max(df) finds the maximum value among all elements (same for min)

df.groupby([x1, x2, ...]).agg(func) Group and aggregate

`R`


aggregate(. ~ x1 + x2, df, FUN = sum) #"." aggregates all the other columns
aggregate(x ~ x1 + x2, df, FUN = sum) #Aggregates only the column specified by "x"
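
To make the correspondence concrete, here is a sketch of the Python side with made-up data, using the same column names `x1`, `x2` and `x` as the R formula above. The totals should match the `aggregate()` output, although the row order may differ.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame(
    {
        "x1": ["a", "a", "b", "b"],
        "x2": ["p", "q", "p", "p"],
        "x": [1, 2, 3, 4],
        "y": [10, 20, 30, 40],
    }
)

# R: aggregate(. ~ x1 + x2, df, FUN = sum)
all_columns = df.groupby(["x1", "x2"], as_index=False).agg("sum")

# R: aggregate(x ~ x1 + x2, df, FUN = sum)
one_column = df.groupby(["x1", "x2"], as_index=False)["x"].agg("sum")

print(all_columns)
print(one_column)
```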

pd.pivot_table(df, index, columns, values) There does not seem to be a direct equivalent in the base package.
