[Java + jsoup] Scraping Mercari's products for sale

【Overview】

Specify Mercari URLs, and the program scrapes the products for sale and lists them on a single HTML page. Opening my Mercari bookmarks one by one was tedious, so I automated it. The code is on GitHub → Mercari Scraping

【Preparation】

- Write the URLs of the products you want to fetch in a text file, one per line. Lines that begin with a half-width sharp "#" are treated as comments. (In this example, the file is C:\Users\nobu\Desktop\tmp\mercari_url.txt.)

mercari_url.txt


#Algorithms and data structures for standard C programmers
https://item.mercari.com/jp/product_key/1_28384941/
#Introduction to Oracle PL/SQL as a Professional [3rd Edition]
https://item.mercari.com/jp/product_key/1_33099446/
#Learning Perl for the first time
https://item.mercari.com/jp/product_key/1_32331276/
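
The filtering rule for this file can be sketched in isolation as follows. This is only a minimal illustration, not the code used in the article: the class name `UrlListFilter` and the method `extractUrls` are hypothetical, and the sketch additionally drops blank lines and trims surrounding whitespace, which is an assumption on top of the "#" rule described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the article's Main.java.
class UrlListFilter {

    /** Keeps only the lines that should be scraped: comment lines and blank lines are dropped. */
    static List<String> extractUrls(List<String> lines) {
        List<String> urls = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Assumption: a blank line is skipped just like a "#" comment line
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            urls.add(trimmed);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "#Learning Perl for the first time",
                "https://item.mercari.com/jp/product_key/1_32331276/",
                "");
        // Only the URL line survives the filter
        System.out.println(extractUrls(lines));
    }
}
```

Skipping blank lines matters in practice: an empty string passed on to the HTTP request would fail before any page is fetched.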

【Code】

Main.java


package scrap.main;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {
	public static String MERCARI_URL_FILE = "C:\\Users\\nobu\\Desktop\\tmp\\mercari_url.txt";
	public static String OUTPUT_HTML_FILE = "C:\\Users\\nobu\\Desktop\\tmp\\output_mercari.html";

	public static void main(String[] args) {

		//Read Mercari URL
		BufferedReader reader = null;
		try {
			reader = new BufferedReader(
					new InputStreamReader(new FileInputStream(MERCARI_URL_FILE), StandardCharsets.UTF_8));
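		} catch (FileNotFoundException e1) {
			e1.printStackTrace();
			//Without the URL file there is nothing to scrape; stop here to avoid a NullPointerException below
			return;
		}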

		String str;
		Document document = null;
		Elements element = null;
		String html = "";
		try {
			while ((str = reader.readLine()) != null) {
				//Skip blank lines and comment lines that start with a half-width sharp "#"
				if (str.trim().isEmpty() || str.startsWith("#")) {
					continue;
				}

				int pageNo = 1;
				//Keep requesting the next page until a page contains no products for sale
				do {
					//Get page source, request timeout set to 10 seconds
					document = Jsoup
							.connect(str + "?page=" + pageNo + "#sell-items")
							.timeout(10000).get();
					element = document.getElementsByClass("entertainment-product-sell-item-content");
					html += element.outerHtml();
					pageNo++;
				} while (!element.isEmpty());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}

		//Create the HTML file (written as UTF-8 so that Japanese text is not garbled)
		try {
			File file = new File(OUTPUT_HTML_FILE);
			PrintWriter pw = new PrintWriter(file, "UTF-8");
			pw.println("<!DOCTYPE html>");
			pw.println("<html lang=\"ja-JP\">");
			pw.println("<head>");
			pw.println("<meta charset=\"UTF-8\">");
			pw.println("<link href=\"https://item.mercari.com/jp/assets/css/app.jp.css?3062056556\" rel=\"stylesheet\">");
			pw.println("</head>");
			pw.println("<body>");
			pw.println("<main class=\" l-container clearfix\">");
			pw.println(html.replaceAll("class=\"lazyload\"", "").replaceAll("data-src", "src"));
			pw.println("</main>");
			pw.println("</body>");
			pw.println("</html>");
			pw.close();
		} catch (IOException e) {
			e.printStackTrace();
		}

	}
}
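
The paging loop in Main.java keeps requesting `?page=1`, `?page=2`, and so on until a page yields no matching elements. The stop condition can be sketched separately from the network access as below. This is a sketch under assumptions: `PageCollector`, `collectAllPages`, and the `fetchPage` callback are hypothetical names, and in the real program the callback would presumably wrap the jsoup request and return the outer HTML of the matched elements. A `StringBuilder` is used here instead of repeated string concatenation.

```java
import java.util.function.IntFunction;

// Hypothetical helper, not part of the article's Main.java.
class PageCollector {

    /**
     * Calls fetchPage for page 1, 2, 3, ... and concatenates the results,
     * stopping at the first page that returns an empty string.
     */
    static String collectAllPages(IntFunction<String> fetchPage) {
        StringBuilder html = new StringBuilder();
        int pageNo = 1;
        while (true) {
            String fragment = fetchPage.apply(pageNo);
            if (fragment.isEmpty()) {
                break; // no products on this page, so the listing has ended
            }
            html.append(fragment);
            pageNo++;
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the network call: pages 1 and 2 have items, page 3 is empty
        IntFunction<String> fakeFetcher = page -> page <= 2 ? "<div>page " + page + "</div>" : "";
        System.out.println(collectAllPages(fakeFetcher)); // prints <div>page 1</div><div>page 2</div>
    }
}
```

Passing the fetch step in as a function makes it possible to check the loop without accessing the site at all.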

【Execution result】

When you run the program, C:\Users\nobu\Desktop\tmp\output_mercari.html is created. The image shows part of the page and reflects the information at the time of writing.

メルカリ.png
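
The product images appear in the generated page because Main.java rewrites the scraped markup before writing it: it removes `class="lazyload"` and renames `data-src` to `src`. Presumably the original page relies on a lazy-loading script to do this in the browser, and the static output file has no such script. The sketch below shows the same rewrite on a made-up fragment; `LazyImageFixer` and the example.com URL are hypothetical, and plain `String.replace` is used because no regular expression is needed.

```java
// Hypothetical helper, not part of the article's Main.java.
class LazyImageFixer {

    /** Rewrites lazy-loading <img> markup so that a browser loads the image immediately. */
    static String fix(String html) {
        return html
                .replace("class=\"lazyload\"", "") // drop the marker class used by the lazy loader
                .replace("data-src", "src");       // move the real image URL into src
    }

    public static void main(String[] args) {
        // Made-up fragment for illustration; the real markup may carry more attributes
        String scraped = "<img class=\"lazyload\" data-src=\"https://example.com/photo.jpg\">";
        System.out.println(fix(scraped));
    }
}
```

Note that this is a plain text substitution, so it would also rewrite any other occurrence of `data-src` in the scraped HTML, just as the original code does.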
