I used to collect tweet data and save it to SQLite with Python; this is a memo of the various things I looked into because I wanted to do the same thing fast, with multiple threads, in Java.
For SQLite, sqlite-jdbc is easy to use, so I pull it in from Maven.
At first I was collecting the data onto an HDD, and I wondered why it was so slow: the DB inserts were taking far longer than the collection itself, so I went looking for the cause.
Tackling the problems one at a time:

1. Add to the database in bulk with `addBatch` / `executeBatch`.
2. Checking for duplicates takes a long time, so prepare the insert as `INSERT OR IGNORE` with `prepareStatement`.
3. Set `journal_mode` to `MEMORY` and `synchronous` to `OFF`.

Here is the example of saving the collected tweets mentioned at the beginning.
DataBase.java
/**
 * Register the SQLite JDBC driver.
 */
public static void dbInit() {
    try {
        Class.forName("org.sqlite.JDBC");
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    }
}

/**
 * PRAGMA settings.
 * 3. journal_mode and synchronous are set here.
 */
public static Properties getProperties() {
    Properties prop = new Properties();
    prop.put("journal_mode", "MEMORY");
    prop.put("synchronous", "OFF"); // the SQLite pragma is "synchronous"; "sync_mode" is not recognized
    return prop;
}

/**
 * Add tweets to the database.
 * Without synchronized, heavy multithreaded writes will deadlock.
 */
public static synchronized void putTweet2SQL(File dbFile, List<Status> tweet) {
    Statement stmt;
    String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
    PreparedStatement pstmt;
    dbInit();
    try (Connection conn = DriverManager.getConnection(dbHeader, getProperties())) { // try-with-resources
        conn.setAutoCommit(false);
        stmt = conn.createStatement();
        // Create the table if it does not exist yet
        stmt.executeUpdate("CREATE TABLE IF NOT EXISTS tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, user_screen_name TEXT, tweet_text TEXT)");
        // Columns: tweet ID, user ID, screen name, tweet text
        // 2. Prepare the bulk insert as INSERT OR IGNORE, so duplicates are skipped without a separate check
        pstmt = conn.prepareStatement("INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)");
        for (Status status : tweet) {
            pstmt.setLong(1, status.getId());
            pstmt.setLong(2, status.getUser().getId());
            pstmt.setString(3, status.getUser().getScreenName());
            pstmt.setString(4, status.getText());
            pstmt.addBatch(); // 1. Add to the batch
        }
        pstmt.executeBatch(); // 1. The actual inserts are executed here
        conn.commit();
    } catch (SQLException e) {
        e.printStackTrace();
    }
}
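Since putTweet2SQL is synchronized, several collector threads can hand their batches to it without any extra locking. Below is a minimal caller sketch that is not part of the original code: the Main class, the file path, and the collectTweets() helper are hypothetical placeholders, and the tweets are assumed to come from Twitter4J.

Main.java
import java.io.File;
import java.util.Collections;
import java.util.List;
import twitter4j.Status;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        File dbFile = new File("tweets.db"); // example path
        DataBase.dbInit();

        // Two collector threads writing into the same DB file.
        // putTweet2SQL is synchronized, so their batch writes do not collide.
        Runnable worker = () -> {
            List<Status> batch = collectTweets(); // hypothetical: gather tweets with Twitter4J (streaming, search, ...)
            DataBase.putTweet2SQL(dbFile, batch);
        };
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }

    // Placeholder for whatever collection logic is actually used.
    private static List<Status> collectTweets() {
        return Collections.emptyList();
    }
}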
Reading from the DB is also reasonably comfortable: with setFetchSize, the specified number of rows is fetched at a time.
DataBase.java
/**
 * Get tweets.
 * If you pull a huge amount of data the heap will run short, so rewrite this appropriately in that case.
 * @return tweet list
 */
public static List<Status> getTweetsFromSQL(File dbFile) {
    String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
    PreparedStatement pstmt;
    List<Status> userDetails = new ArrayList<>(); // result list
    dbInit();
    try (Connection conn = DriverManager.getConnection(dbHeader, getProperties())) {
        pstmt = conn.prepareStatement("SELECT * FROM tweets");
        pstmt.setFetchSize(1000); // set this to something absurd like 5,000 trillion and you get everything at once
        ResultSet rs = pstmt.executeQuery();
        while (rs.next()) { // columns: tweet ID, user ID, screen name, tweet text
            // Status here is a simple holder class with these setters, not twitter4j.Status
            Status status = new Status();
            status.setId(rs.getLong(1));
            status.setUserId(rs.getLong(2));
            status.setScreenName(rs.getString(3));
            status.setTweetText(rs.getString(4));
            userDetails.add(status);
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
    return userDetails;
}
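If the table is too large to hold in memory (the heap problem mentioned in the comment above), one option is to hand each row to a handler as it is read instead of building the whole list. The following is a rough sketch of that idea, not part of the original post; it assumes the same simple Status holder class as getTweetsFromSQL.

DataBase.java
/**
 * Streaming read sketch: pass each tweet to a handler instead of
 * collecting everything into one list, so heap usage stays flat.
 */
public static void forEachTweetFromSQL(File dbFile, java.util.function.Consumer<Status> handler) {
    String dbHeader = "jdbc:sqlite:" + dbFile.getAbsolutePath();
    dbInit();
    try (Connection conn = DriverManager.getConnection(dbHeader, getProperties());
         PreparedStatement pstmt = conn.prepareStatement("SELECT * FROM tweets")) {
        pstmt.setFetchSize(1000);
        try (ResultSet rs = pstmt.executeQuery()) {
            while (rs.next()) {
                Status status = new Status(); // same simple holder class as above (assumption)
                status.setId(rs.getLong(1));
                status.setUserId(rs.getLong(2));
                status.setScreenName(rs.getString(3));
                status.setTweetText(rs.getString(4));
                handler.accept(status); // process one tweet at a time
            }
        }
    } catch (SQLException e) {
        e.printStackTrace();
    }
}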
It's much faster now, and I'm happy, since I was also able to buy more SSDs.