[Java] Parse Excel (and various other) files with Apache Tika [Kotlin]

What I want to do

Parse an xlsx file. This time, I will parse it using Apache Tika.

Content to parse

This time, we will parse the file with the following contents.

As a side note, the file used for verification while writing this article was generated with a Google Docs spreadsheet (Google Sheets), so I have not confirmed that files generated by genuine Excel or by third-party software parse correctly. Also, the fonts in the actual file were changed in places, but this did not affect the parse result.

Sheet 1
hoge Fuga Piyo
fizz Bads fizzBuzz

How to do it

You can do this by installing a library and calling its parser. This time, I will try two methods: one parses only Office Open XML (OOXML) files such as xlsx, and the other has Tika detect the file type and parse it automatically.

Install Apache Tika Parsers

I installed it from Maven. I used version 1.21 for verification.
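
For reference, a dependency declaration might look like the following. This is a sketch that assumes a Gradle project using the Kotlin DSL (build.gradle.kts), which this article does not state. The coordinates are those of the tika-parsers artifact published on Maven Central, with the 1.21 version used for verification.

```kotlin
// build.gradle.kts (assumed build setup, not confirmed by the article)
dependencies {
    // Tika 1.x bundles its format parsers (including the OOXML one) in tika-parsers
    implementation("org.apache.tika:tika-parsers:1.21")
}
```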

How to parse only OOXML-format data

This can be done with `OOXMLParser`. The parse result is stored in a `ContentHandler` (here, `BodyContentHandler`). I couldn't find any way to get the information out of `BodyContentHandler` other than `toString`.

The sample code is as follows. It parses the OOXML files in src/main/resources. As a caveat, if that directory contains a file that is not in OOXML format (that is, one `OOXMLParser` cannot parse), the program crashes.

The execution result is the same as with the method that has Tika detect the file type and parse automatically, so I omit it here.

import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.FileInputStream

fun main() {
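    val parser = OOXMLParser() // Parser for OOXML files (xlsx, docx, pptx, ...)
    val metaData = TikaMetadata()
    val context = ParseContext()
    val handler = BodyContentHandler() // Handler that stores the parse result (one instance shared by all files)

    // listFiles() returns null if the directory does not exist, hence the safe call
    File(System.getProperty("user.dir") + "/src/main/resources").listFiles()?.forEach { file ->
        // use {} closes the stream once parsing has finished
        FileInputStream(file).use { stream ->
            parser.parse(stream, handler, metaData, context)
        }
        println(handler.toString())
    }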
}
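
Since a file that is not in OOXML format stops the sample above, one option is to catch the failure per file and carry on with the next one. The following is a minimal sketch rather than verified code: the helper name extractOoxmlTextOrNull is hypothetical, and catching Exception broadly is an assumption, made because the exact exception type can differ depending on what is wrong with the input.

```kotlin
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser
import org.apache.tika.sax.BodyContentHandler
import java.io.File

// Hypothetical helper: returns the extracted text, or null if the file could not be parsed.
fun extractOoxmlTextOrNull(file: File): String? {
    val handler = BodyContentHandler() // fresh handler for this file only
    return try {
        // use {} closes the stream even when parse() throws
        file.inputStream().use { stream ->
            OOXMLParser().parse(stream, handler, TikaMetadata(), ParseContext())
        }
        handler.toString()
    } catch (e: Exception) {
        // Assumption: any failure (I/O error, non-OOXML input, ...) is reported and skipped
        System.err.println("Skipped ${file.name}: ${e.message}")
        null
    }
}
```

Called from the loop as extractOoxmlTextOrNull(file)?.let(::println), this would print only the files that parsed successfully.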

How to have Tika detect the file type and parse it automatically

This can be achieved by replacing the parser in the sample code above with `AutoDetectParser`.

import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.FileInputStream

fun main() {
    val parser = AutoDetectParser() // Parser that automatically detects the file type
    val metaData = TikaMetadata()
    val context = ParseContext()
    val handler = BodyContentHandler() // Handler that stores the parse result (one instance shared by all files)

    // listFiles() returns null if the directory does not exist, hence the safe call
    File(System.getProperty("user.dir") + "/src/main/resources").listFiles()?.forEach { file ->
        println("-------------")
        println(file.name)

        // use {} closes the stream once parsing has finished
        FileInputStream(file).use { stream ->
            parser.parse(stream, handler, metaData, context)
        }
        println(handler.toString())
        println("-------------")
    }
}

Execution result

I tried every file format that the spreadsheet can be downloaded in, except zip.

-------------
xls_for_test.xlsx
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz



-------------
-------------
xls_for_test -Sheet 1.csv
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz

-------------
-------------
xls_for_test.ods
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
	hoge
Fuga
Piyo
	
	fizz
Bads
	fizzBuzz
	
	
	

???
Page 
??? (???)
00/00/0000, 00:00:00
Page  / 

-------------
-------------
xls_for_test -Sheet 1.pdf
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
	hoge
Fuga
Piyo
	
	fizz
Bads
	fizzBuzz
	
	
	

???
Page 
??? (???)
00/00/0000, 00:00:00
Page  / 

hoge Fuga Piyo

fizz buzz fizz buzz



-------------
-------------
xls_for_test -Sheet 1.tsv
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
	hoge
Fuga
Piyo
	
	fizz
Bads
	fizzBuzz
	
	
	

???
Page 
??? (???)
00/00/0000, 00:00:00
Page  / 

hoge Fuga Piyo

fizz buzz fizz buzz


hoge Fuga Piyo
fizz buzz fizz buzz

-------------
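
In the results above, each file's block also contains the text of the files parsed before it. This appears to be because the sample reuses a single BodyContentHandler for every file, and the handler keeps appending to the same internal buffer. Below is a minimal sketch of one way to avoid that, assuming it is acceptable to create a new handler and metadata object for each file. The helper name extractText is hypothetical.

```kotlin
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler
import java.io.File

// Hypothetical helper: extracts the text of a single file.
fun extractText(file: File, parser: AutoDetectParser = AutoDetectParser()): String {
    // New handler per call, so the result holds only this file's text.
    // -1 disables BodyContentHandler's default limit of 100,000 characters.
    val handler = BodyContentHandler(-1)
    // use {} closes the stream even when parse() throws
    file.inputStream().use { stream ->
        parser.parse(stream, handler, TikaMetadata(), ParseContext())
    }
    return handler.toString()
}
```

Calling println(extractText(file, parser)) inside the forEach loop, in place of the shared handler, should print each file's text on its own.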

Impressions

For now, I'm glad that I could parse files in various formats, not just xlsx. However, the parse result is hard to work with in the plain-text form that BodyContentHandler gives, so I will look for something that better suits my purpose.
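
As one possible starting point for making the plain-text result easier to handle, the text can be split back into rows and cells. This is only a sketch under an assumption not verified in this article: that each row arrives on its own line and that the cells within a row are separated by tab characters. The function name toRows is hypothetical.

```kotlin
// Hypothetical helper: splits extracted text into rows of cells.
// Assumption: one row per line, cells separated by tab characters.
fun toRows(text: String): List<List<String>> =
    text.lineSequence()
        .filter { it.isNotBlank() } // drop the empty lines between blocks
        .map { line -> line.split('\t').map { it.trim() } }
        .toList()
```

Under that assumption, a line such as the sheet name becomes a single-cell row, so the caller would still need to decide how to treat it.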
