How do you extract data from a database as CSV? If you have a better way, please tell me! Here is how I do it ↓↓↓
I want to write a query, run it against the database, and save the extracted result as CSV.
In-house data analysts write SQL and create files for analysis. Connecting to ODBC from Excel works, but I would rather pull the data straight into the statistical software, and handle the numerical calculations and value replacements on the analysis-software side.
There is a mechanism called ODBC that acts as an intermediary for connecting to databases. I will leave the details of configuring ODBC to other sites. If you are not the one who set up the database, someone else is already using it, so ask them. For now, just get the IP address of the database server and the driver needed for the connection.
During setup you will assign a "Data Source Name". Make a note of it; it is used in the scripts below.
Download R from CRAN, install it, and type the following script into the R editor.
Suppose you have a table called user_log with 6 million records. Extracting everything at once with software that connects to the database in the usual way takes a tremendous amount of time, and if other people are connected as well, the CPU load on the server gets scary and everyone gets angry.
As a database client I like CSE. It can also connect as long as you know the ODBC data source name, ID, and password.
# Connect via ODBC and check how many rows the table has
library(RODBC)
conn_DB <- odbcConnect("Data source name", "Assigned ID", "Pass that will be issued with the ID")
query <- paste0("select count(*) from user_log")   # count the rows first
tbl <- sqlQuery(conn_DB, query)                    # returns a one-row data.frame with the count
odbcClose(conn_DB)                                 # always close the connection
Next, run this in a for loop.
First, connect to the database as above and check how many rows the table has. Suppose you confirm that it is 6 million. If you try to pull everything out at once, the server CPU will grind to a halt, so extract the data in small chunks instead. On the analysis-software side memory can also fill up, so it is safer to write each chunk out to a file and combine them at the end.
# OFFSET counts skipped rows, so start at 0; seq() then covers all 1200 chunks of 5000 rows
SEQ <- seq(0, 6000000 - 5000, 5000)
conn_DB <- odbcConnect("Data source name", "Assigned ID", "Pass that will be issued with the ID")
for (i in seq_along(SEQ)) {
  # build the query; add an ORDER BY on a unique key if you need the chunks to be deterministic
  query <- paste0(
    "select * from user_log ",
    "limit 5000 offset ",
    SEQ[i]
  )
  tbl <- sqlQuery(conn_DB, query)
  write.csv(tbl, paste0("this_is_", i, "th.csv"), row.names = FALSE)
  Sys.sleep(10)   # pause so the database server is not hammered continuously
}
odbcClose(conn_DB)
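Incidentally, if you would rather not hard-code the 6,000,000, the offsets can be built from the count(*) result obtained earlier. A minimal sketch, assuming tbl still holds that one-row result:
n_rows  <- tbl[1, 1]                       # total row count from the count(*) query
chunk   <- 5000                            # rows per chunk
offsets <- seq(0, n_rows - 1, by = chunk)  # one OFFSET value per chunk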
With this, the data can be extracted in chunks of 5,000 records. SQL queries can be assembled with paste0. If the SQL contains single quotation marks, you can simply wrap the R string in double quotes; the backslash escape is only needed when the R string itself is delimited with single quotes.
"where colum = '10'"      # fine inside a double-quoted R string
'where colum = \'10\''    # the escapes are only needed inside a single-quoted R string
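For example, a small sketch using the running user_log example (both calls build exactly the same query string):
# identical SQL; only the R quoting style differs
q1 <- paste0("select * from user_log ", "where colum = '10'")
q2 <- paste0('select * from user_log ', 'where colum = \'10\'')
identical(q1, q2)   # TRUE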
Also be careful that paste0 joins strings with no separator, so if you forget the spaces (or line breaks) between SQL fragments they simply disappear:
paste0("select * from user_log", "where colum = '10'")
select * from user_logwhere colum = '10'
The table name and the WHERE clause get stuck together, which is invalid SQL.
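A simple way to avoid this is to let paste() insert the separator, or just remember the trailing spaces yourself; a minimal sketch:
# paste() with sep = " " puts the space between the fragments automatically
query <- paste("select * from user_log", "where colum = '10'", sep = " ")
# "select * from user_log where colum = '10'"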
setwd("path of the file that spit out csv")
lf <- list.files(pattern="csv")
data <- data.frame()
for(i in 1:length(lf)){
add <- fread(lf[i])
data <- rbind(data,add)
}
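If memory allows, the same combination can be done more compactly (and usually faster) with data.table's rbindlist; a minimal alternative sketch:
# read every chunk file and stack them in a single call
data <- rbindlist(lapply(lf, fread))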
This puts all 6 million records into data. Note that the process may freeze if the PC does not have enough memory.
If you want to do machine learning, why not treat each CSV as one mini-batch?
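As a rough sketch of that idea (the model-update step is only a placeholder), you would read one chunk file per iteration instead of loading everything at once:
# use each chunk file as one mini-batch
for (f in lf) {
  batch <- fread(f)
  # update your model with `batch` here, e.g. one gradient step
}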
Since the overall approach has already been explained, I will keep the explanation brief from here on. The same thing in Python looks like this:
import os
import time
import pyodbc
import numpy as np
import pandas as pd

cnx = pyodbc.connect('DSN=Data source name; UID=Enter ID; PWD=Insert pass')
cursor = cnx.cursor()

os.makedirs('csv_data', exist_ok=True)      # make sure the output folder exists
list_for = np.arange(0, 6000000, 5000)      # OFFSET starts at 0; cover all 6 million rows
list_for = list_for.tolist()

for i in range(len(list_for)):
    made_que = "SELECT * FROM user_log " + "LIMIT 5000 OFFSET " + str(list_for[i])
    cursor.execute(made_que)
    tbl = cursor.fetchall()
    cols = [c[0] for c in cursor.description]                 # keep the original column names
    dataframe_format = pd.DataFrame(np.array(tbl), columns=cols)
    dataframe_format.to_csv('csv_data/' + str(i) + '.csv', index=False)
    time.sleep(10)                                            # be gentle on the server, as in the R version

cursor.close()
cnx.close()
After extraction, on to the actual analysis.