I ran into a situation where I wanted to read, in one go, multiple CSV files stored inside multiple zip files under a certain folder, so I wrote this note.
Given a folder structure like the one below, I want to read the CSV files in each zip file at once and store them in a list.
input/
┣ zip_files/
┃ ┣ test1.zip/
┃ ┃ ┣ test1_1.csv
┃ ┃ ┣ test1_2.csv
┃ ┃ ...
┃ ┣ test2.zip/
┃ ┃ ┣ test2_1.csv
┃ ┃ ┣ test2_2.csv
┃ ...
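In R, you can access the files inside a zip file without extracting it: the unzip function (with list = TRUE) lists the contents, and the unz function opens a connection to a single file inside the archive. I wanted to add each result to a list the way Python's append does, but I wasn't sure how to do that in R, so I settled for the code below.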
library(tidyverse)
library(data.table)
# Names of the zip files (assumes the working directory is input/)
zip_list <- list.files("zip_files", pattern = "\\.zip$")
# Function that reads the csv files inside the zip files
get_csv <- function(zip_list){
  zip_lists <- list()
  # Loop through the list of zip files
  for(j in seq_along(zip_list)) {
    zip_path <- paste0("zip_files/", zip_list[j])
    # Start a fresh list for each zip so earlier results do not leak in
    csv_list <- list()
    # List the files in the zip (a data frame with one row per file)
    file <- unzip(zip_path, list = TRUE)
    for(i in seq_len(nrow(file))){
      # If a file is a csv file, read it directly from the zip
      if(grepl("\\.csv$", file$Name[i], ignore.case = TRUE)) {
        print(paste0('reading following file...', file$Name[i]))
        csv_files <- read_csv(unz(zip_path, file$Name[i]),
                              col_names = TRUE)
        ########################
        # Add some processing here.
        ########################
        csv_list[[i]] <- csv_files
      }
    }
    zip_lists[[j]] <- csv_list
  }
  return(zip_lists)
}
system.time(csvs <- get_csv(zip_list))
In Python, the zipfile module lets you access the files inside a zip file without extracting it. Because enumerate returns both the index and the element of the object being iterated over in the for statement, it is handy when you add processing that needs the index. (In the example below, the result is the same even without it.)
import os
import zipfile
import glob
import pandas as pd
import time

# Paths of the zip files to read (assumes the working directory is input/)
zip_list = sorted(glob.glob(os.path.join("zip_files", "*.zip")))
df_list = list()
start = time.time()
for i, zips in enumerate(zip_list):
    with zipfile.ZipFile(zips) as zip_f:
        file_list = zip_f.namelist()  # names of the files in the zip
        for j, files in enumerate(file_list):
            print('reading following file...' + zips + '/' + files)
            df = pd.read_csv(zip_f.open(files))
            ########################
            # Add some processing here.
            # Use i and j too if needed.
            ########################
            df_list.append(df)
elapsed_time = time.time() - start
print("elapsed_time:{0}".format(elapsed_time) + "[sec]")
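The loop above puts every DataFrame into one flat list, while the R version keeps one list per zip file and skips anything that is not a csv. As a rough sketch of how the Python side could do the same, the function below returns a nested dict keyed by zip file name and then by csv name. The function name read_csvs_in_zips and its reader parameter are hypothetical names made up for this note, and the default parser uses only the standard library csv module (assuming UTF-8 files) so that the sketch runs without pandas.

```python
import csv
import io
import zipfile
from pathlib import Path


def read_csvs_in_zips(zip_dir, reader=None):
    """Return {zip file name: {csv member name: parsed data}}.

    Hypothetical helper for illustration. `reader` is any callable that
    accepts a binary file object and returns the parsed data.
    """
    if reader is None:
        # Default parser: a list of row dicts, assuming UTF-8 text.
        def reader(binary_file):
            text = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
            return list(csv.DictReader(text))

    result = {}
    # Sort the paths so the order of the result is reproducible.
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as zf:
            tables = {}
            for name in zf.namelist():
                if not name.lower().endswith(".csv"):
                    continue  # skip directories and non-csv members
                with zf.open(name) as f:
                    tables[name] = reader(f)
            result[zip_path.name] = tables
    return result
```

To get DataFrames as in the loop above, it should be enough to pass pandas in as the parser, for example read_csvs_in_zips("zip_files", reader=pd.read_csv), since pd.read_csv accepts an open file object. Keying by name instead of position also makes it easier to see which zip file a table came from.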