MongoDB has also reached its limit.
MongoDB Shortest Introduction (1): Install & start on EC2, then suddenly insert tens of thousands
[MongoDB Shortest Introduction (2): I searched through tens of thousands](http://qiita.com/tottoko/items/68e61daf6eafb03d4b4b)
[MongoDB Shortest Introduction (3): I tried to stay fast even with millions](http://qiita.com/tottoko/items/b6c9430b1e0ab7cb409c)
MongoDB gets tough once you are into tens of millions of records. With the 100-million wall coming into view, SQLite turned up in an unexpected place.
[Shock! NoSQL / KVS / easy DB (mongo, redis, sqlite): which was god-speed locally?](http://qiita.com/wan-liner/items/ee07442e5bd4afd5b1b5)
If no extra processing is needed, the one above ↑ was the strongest.
Strictly speaking, it is better to keep the file gzip-compressed and hand it to the CSV reader: decompressing in memory is physically faster than reading the uncompressed file from an ordinary disk. The exception, of course, is when reading the file really is faster than that (SSD? NVMe?).
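The claim above is easy to check for yourself. Below is a minimal timing sketch (Python 2, like the rest of this article); the plain file `honyahonya.csv` is an assumption, standing in for the same data as the gzipped file but uncompressed.

```python
# coding: utf-8
# Minimal timing sketch: read the same tab-separated data once from a plain
# file and once from a gzip-compressed file, and compare wall-clock time.
# The plain file name is a placeholder, not from the original setup.
import csv
import gzip
import time

def count_rows(fileobj):
    # Iterate the whole CSV so the entire file really gets read and parsed.
    reader = csv.reader(fileobj, delimiter="\t")
    n = 0
    for _ in reader:
        n += 1
    return n

start = time.time()
with open("honyahonya.csv", 'rb') as f:
    print("plain: %d rows in %.2f sec" % (count_rows(f), time.time() - start))

start = time.time()
with gzip.open("honyahonya.csv.gz", 'rb') as f:
    print("gzip : %d rows in %.2f sec" % (count_rows(f), time.time() - start))
```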
Now, a CSV has columns, so sometimes I want placeholders. To keep the SQL readable, I want named placeholders like `:column_name1`, which also escape / sanitize the values automatically.
data=dict(zip(keys,row))
`keys` is a tuple of column names, e.g. `('id', 'title')`. `row` is one row of the tab-separated CSV, e.g. the line `1111\taaaa` read as `['1111', 'aaaa']`. Now `data` is a dictionary like `{'id': '1111', 'title': 'aaaa'}`. It's magic.
cursor.execute('insert into tracks (id, title) values (:id, :title)', data)
So you can use placeholders just like with MySQL. What about when it is not a dictionary?
cursor.execute('insert into tracks (id, title) values (?, ?)', row)
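Here is a small self-contained sketch of both placeholder styles, using an in-memory database and a tiny `tracks` table just for illustration (the two-column schema here is assumed, not the article's real one):

```python
# coding: utf-8
# Sketch of both placeholder styles against a throwaway in-memory database.
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute('create table tracks (id text, title text)')

keys = ('id', 'title')
row = "1111\taaaa".split("\t")     # one tab-separated CSV line, already split

# Named placeholders: zip the column names with the row to get a dict.
data = dict(zip(keys, row))        # {'id': '1111', 'title': 'aaaa'}
cursor.execute('insert into tracks (id, title) values (:id, :title)', data)

# Positional placeholders: pass the row (any sequence) directly.
cursor.execute('insert into tracks (id, title) values (?, ?)', row)

conn.commit()
print(cursor.execute('select * from tracks').fetchall())
```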
One more piece of magic: `conn.text_factory = str`. Nobody told me about this, so I got yelled at:
sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings
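To see where that error comes from, here is a minimal Python 2 sketch (the in-memory database, the table `t`, and the sample string are just illustrations): without `text_factory = str`, binding a non-ASCII byte string raises the ProgrammingError above; with it, the insert goes through.

```python
# coding: utf-8
# Python 2 sketch: reproduce the 8-bit bytestring error and fix it with
# conn.text_factory = str. Table name and sample string are arbitrary.
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute('create table t (s text)')

utf8_bytes = 'あいう'    # in Python 2 this literal is a UTF-8 byte string (str)

try:
    cursor.execute('insert into t values (?)', (utf8_bytes,))
except sqlite3.ProgrammingError as e:
    print(e)             # "You must not use 8-bit bytestrings ..."

conn.text_factory = str  # the "magic" line: accept byte strings as-is
cursor.execute('insert into t values (?)', (utf8_bytes,))
print(cursor.execute('select s from t').fetchall())
```

With that sorted out, here is the whole loading script.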
# coding: utf-8
import sqlite3
import csv
import gzip

conn = sqlite3.connect("honyahonya.db")
conn.text_factory = str  # accept 8-bit (UTF-8) byte strings as-is
csr = conn.cursor()

# Create the target table if it does not exist yet (schema assumed from the insert below).
csr.execute('create table if not exists tracks (id text, title text, date text, article text)')

keys = (
    'id',
    'title',
    'date',
    'article'
)

with gzip.open("honyahonya.csv.gz", 'rb') as f:
    reader = csv.reader(f, delimiter="\t")
    for row in reader:
        # If you are confident in the data, it is faster to omit checking the number of columns.
        if len(row) == len(keys):
            d = dict(zip(keys, row))
            csr.execute('insert into tracks (id, title, date, article) values (:id, :title, :date, :article)', d)

# Commit once at the end; committing every row would be much slower.
conn.commit()
conn.close()
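As a quick sanity check after the load, something like the following works (the row count and the example id `'1111'` are just illustrations; the values are stored as the strings that came out of the CSV):

```python
# coding: utf-8
# Sanity-check sketch: count the loaded rows and look one up with a named
# placeholder. The id value '1111' is only an example.
import sqlite3

conn = sqlite3.connect("honyahonya.db")
conn.text_factory = str
csr = conn.cursor()

print(csr.execute('select count(*) from tracks').fetchone()[0])

for row in csr.execute('select id, title from tracks where id = :id', {'id': '1111'}):
    print(row)

conn.close()
```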