I used to collect tweet data and save it to SQLite with Python; this is a memo of the various things I looked into because I wanted to do the same thing fast, with multiple threads, in Java.
For SQLite, sqlite-jdbc is easy to use, so I pull it in from Maven.
At first I was collecting the data onto an HDD, and I wondered why it was so slow: the DB inserts were taking far longer than the collection itself, so I went looking for the cause.
Tackling the problems one at a time:

1. Add to the database in bulk with `addBatch` / `executeBatch`.
2. Checking for duplicates takes a long time, so prepare the insert as `INSERT OR IGNORE` with `prepareStatement`.
3. Set `journal_mode` to `MEMORY` and `synchronous` to `OFF`.

Here is the example of saving the collected tweets mentioned at the beginning.
DataBase.java
/**
 * Register the SQLite JDBC driver.
 */
public static void dbInit() {
    try {
        Class.forName("org.sqlite.JDBC");
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
}

/**
 * PRAGMA settings.
 * 3. journal_mode and synchronous are set here.
 */
public static Properties getProperties() {
    Properties prop = new Properties();
    prop.put("journal_mode", "MEMORY");
    prop.put("synchronous", "OFF"); // the SQLite pragma is "synchronous"; "sync_mode" is not recognized
    return prop;
}

/**
 * Add tweets to the database.
 * Without synchronized, heavy multithreaded writes will deadlock.
 */
public static synchronized void putTweet2SQL(File dbFile, List<Status> tweet) {
    Statement stmt;
    String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
    PreparedStatement pstmt;
    dbInit();
    try (Connection conn = DriverManager.getConnection(dbHeader, getProperties())) { // try-with-resources
        conn.setAutoCommit(false);
        stmt = conn.createStatement();
        // Create the table if it does not exist yet
        stmt.executeUpdate("CREATE TABLE IF NOT EXISTS tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, user_screen_name TEXT, tweet_text TEXT)");
        // Columns: tweet ID, user ID, screen name, tweet text
        // 2. Prepare the bulk insert as INSERT OR IGNORE, so duplicates are skipped without a separate check
        pstmt = conn.prepareStatement("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)");
        for (Status status : tweet) {
            pstmt.setLong(1, status.getId());
            pstmt.setLong(2, status.getUser().getId());
            pstmt.setString(3, status.getUser().getScreenName());
            pstmt.setString(4, status.getText());
            pstmt.addBatch(); // 1. Add to the batch
        }
        pstmt.executeBatch(); // 1. The actual inserts are executed here
        conn.commit();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
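Since putTweet2SQL is synchronized, several collector threads can hand their batches to it without any extra locking. Below is a minimal caller sketch that is not part of the original code: the Main class, the file path, and the collectTweets() helper are hypothetical placeholders, and the tweets are assumed to come from Twitter4J.

Main.java
import java.io.File;
import java.util.Collections;
import java.util.List;
import twitter4j.Status;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        File dbFile = new File("tweets.db"); // example path
        DataBase.dbInit();

        // Two collector threads writing into the same DB file.
        // putTweet2SQL is synchronized, so their batch writes do not collide.
        Runnable worker = () -> {
            List<Status> batch = collectTweets(); // hypothetical: gather tweets with Twitter4J (streaming, search, ...)
            DataBase.putTweet2SQL(dbFile, batch);
        };
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }

    // Placeholder for whatever collection logic is actually used.
    private static List<Status> collectTweets() {
        return Collections.emptyList();
    }
}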
Reading from the DB is also reasonably comfortable: with setFetchSize, the specified number of rows is fetched at a time.
DataBase.java
/**
 * Get tweets.
 * If you pull a huge amount of data the heap will run short, so rewrite this appropriately in that case.
 * @return tweet list
 */
public static List<Status> getTweetsFromSQL(File dbFile) {
    String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
    PreparedStatement pstmt;
    List<Status> userDetails = new ArrayList<>(); // result list
    dbInit();
    try (Connection conn = DriverManager.getConnection(dbHeader, getProperties())) {
        pstmt = conn.prepareStatement("SELECT * FROM tweets");
        pstmt.setFetchSize(1000); // set this to something absurd like 5,000 trillion and you get everything at once
        ResultSet rs = pstmt.executeQuery();
        while (rs.next()) { // columns: tweet ID, user ID, screen name, tweet text
            // Status here is a simple holder class with these setters, not twitter4j.Status
            Status status = new Status();
            status.setId(rs.getLong(1));
            status.setUserId(rs.getLong(2));
            status.setScreenName(rs.getString(3));
            status.setTweetText(rs.getString(4));
            userDetails.add(status);
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return userDetails;
}
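If the table is too large to hold in memory (the heap problem mentioned in the comment above), one option is to hand each row to a handler as it is read instead of building the whole list. The following is a rough sketch of that idea, not part of the original post; it assumes the same simple Status holder class as getTweetsFromSQL.

DataBase.java
/**
 * Streaming read sketch: pass each tweet to a handler instead of
 * collecting everything into one list, so heap usage stays flat.
 */
public static void forEachTweetFromSQL(File dbFile, java.util.function.Consumer<Status> handler) {
    String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
    dbInit();
    try (Connection conn = DriverManager.getConnection(dbHeader, getProperties());
         PreparedStatement pstmt = conn.prepareStatement("SELECT * FROM tweets")) {
        pstmt.setFetchSize(1000);
        try (ResultSet rs = pstmt.executeQuery()) {
            while (rs.next()) {
                Status status = new Status(); // same simple holder class as above (assumption)
                status.setId(rs.getLong(1));
                status.setUserId(rs.getLong(2));
                status.setScreenName(rs.getString(3));
                status.setTweetText(rs.getString(4));
                handler.accept(status); // process one tweet at a time
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
}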
It's much faster now, and I'm happy, since I was also able to buy more SSDs.