What I want to do

Parse an xlsx file. This time, I will parse it with Apache Tika.

File to parse

This time, I will parse a file with the following contents.

As a side note, the file I used for verification while writing this article was exported from a Google Docs spreadsheet, so I have not confirmed that the same code works on files produced by genuine Excel or by third-party software. Also, some fonts were changed in the actual file, but that did not affect the parse result.

Sheet 1
hoge	Fuga	Piyo
fizz	Bads	fizzBuzz

How to do it

You can do this by adding a library and calling its parser. This time, I will try two methods: one parses only Office Open XML (OOXML) files, which includes xlsx, and the other has Tika detect the file type and choose the parser automatically.

Installing Apache Tika Parsers

I installed it from Maven. I used version 1.21 for verification.

Maven Repository: org.apache.tika » tika-parsers
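
For reference, here is how the dependency might look in a Gradle project. This is only a sketch: the use of Gradle with the Kotlin DSL is my assumption (the article only names the Maven artifact), while the coordinates and version are the ones mentioned above.

```kotlin
// build.gradle.kts (assumed build setup, not confirmed by the article)
dependencies {
    // org.apache.tika » tika-parsers, version 1.21 as used for verification
    implementation("org.apache.tika:tika-parsers:1.21")
}
```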

How to parse only OOXML-type data

These files can be parsed with `OOXMLParser`, which handles Office Open XML formats such as xlsx. The parse result is stored in a ContentHandler (this time, BodyContentHandler). I couldn't find any way to get the information out of BodyContentHandler other than `toString`.

The sample code is as follows. It parses the files in src/main/resources. As a caveat, if that directory contains a file that is not OOXML (= one that `OOXMLParser` cannot parse), the program will crash.

The execution result is the same as for the method where Tika detects the file type and parses automatically, so I will omit it here.

import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.FileInputStream

fun main() {
    val parser = OOXMLParser() // Parser for Office Open XML files (xlsx and similar)
    val metaData = TikaMetadata()
    val context = ParseContext()
    // Handler for storing parse results; it is shared, so text accumulates across files
    val handler = BodyContentHandler()

    // orEmpty() avoids a NullPointerException when the directory does not exist
    File(System.getProperty("user.dir") + "/src/main/resources").listFiles().orEmpty().forEach {
        // use {} closes the stream once parsing finishes
        FileInputStream(it).use { stream -> parser.parse(stream, handler, metaData, context) }
        println(handler.toString())
    }
}
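
To keep the `OOXMLParser` sample from crashing on other file types, one option is to filter the directory by extension before parsing. The following is a minimal sketch. `ooxmlExtensions` and `hasOoxmlExtension` are hypothetical names of my own, and the extension list is an assumption, not something provided by Tika.

```kotlin
import java.io.File

// Assumed list of OOXML extensions (hypothetical, extend as needed)
val ooxmlExtensions = setOf("xlsx", "docx", "pptx")

// Hypothetical helper: true when the file name ends with one of the extensions above
fun hasOoxmlExtension(file: File): Boolean =
    ooxmlExtensions.any { it.equals(file.extension, ignoreCase = true) }

fun main() {
    val dir = File(System.getProperty("user.dir"), "src/main/resources")
    // listFiles() returns null when the directory does not exist, hence orEmpty()
    val targets = dir.listFiles().orEmpty().filter { it.isFile && hasOoxmlExtension(it) }
    // Only these files would then be handed to OOXMLParser
    targets.forEach { println(it.name) }
}
```

This only looks at file names, so a file with a misleading extension would still make the parser fail.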

How to have Tika detect the file type and parse it automatically

This can be achieved by replacing the parser in the sample code above with `AutoDetectParser`.

import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.FileInputStream

fun main() {
    val parser = AutoDetectParser() // Parser that automatically detects file types
    val metaData = TikaMetadata()
    val context = ParseContext()
    // Handler for storing parse results; it is shared, so text accumulates across files
    val handler = BodyContentHandler()

    // orEmpty() avoids a NullPointerException when the directory does not exist
    File(System.getProperty("user.dir") + "/src/main/resources").listFiles().orEmpty().forEach {
        println("-------------")
        println(it.name)

        // use {} closes the stream once parsing finishes
        FileInputStream(it).use { stream -> parser.parse(stream, handler, metaData, context) }
        println(handler.toString())
        println("-------------")
    }
}
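
As a separate side check, it can help to see which type Tika detects for each file before parsing. The sketch below uses the `Tika` facade class and its `detect` method. It is not part of the sample above, and the directory path is the same assumption as before. It prints only MIME types, not parsed text.

```kotlin
import org.apache.tika.Tika
import java.io.File

fun main() {
    val tika = Tika() // facade class that bundles detection and parsing
    val dir = File(System.getProperty("user.dir"), "src/main/resources")
    dir.listFiles().orEmpty().filter { it.isFile }.forEach {
        // detect(File) returns the detected media type as a string
        println("${it.name}: ${tika.detect(it)}")
    }
}
```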

Execution result

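I tried every file format that the spreadsheet can be downloaded as, except zip. Note that the sample code reuses a single handler for all files, so the output printed for each file also contains the text extracted from the files before it.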

-------------
xls_for_test.xlsx
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz



-------------
-------------
xls_for_test -Sheet 1.csv
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz

-------------
-------------
xls_for_test.ods
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
	hoge
Fuga
Piyo
	
	fizz
Bads
	fizzBuzz
	
	
	

???
Page 
??? (???)
00/00/0000, 00:00:00
Page  / 

-------------
-------------
xls_for_test -Sheet 1.pdf
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
	hoge
Fuga
Piyo
	
	fizz
Bads
	fizzBuzz
	
	
	

???
Page 
??? (???)
00/00/0000, 00:00:00
Page  / 

hoge Fuga Piyo

fizz buzz fizz buzz



-------------
-------------
xls_for_test -Sheet 1.tsv
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz


hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
	hoge
Fuga
Piyo
	
	fizz
Bads
	fizzBuzz
	
	
	

???
Page 
??? (???)
00/00/0000, 00:00:00
Page  / 

hoge Fuga Piyo

fizz buzz fizz buzz


hoge Fuga Piyo
fizz buzz fizz buzz

-------------
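
The output grows from file to file because one BodyContentHandler keeps collecting text across every `parse` call. If you want each file's text on its own, a possible variation is to create a new handler per file. This is a sketch rather than the code I actually ran, and `extractText` is a hypothetical helper name.

```kotlin
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.InputStream

// Hypothetical helper: parses one stream with its own handler and returns only that text
fun extractText(parser: AutoDetectParser, input: InputStream): String {
    val handler = BodyContentHandler() // new handler per call, so nothing carries over
    parser.parse(input, handler, TikaMetadata(), ParseContext())
    return handler.toString()
}

fun main() {
    val parser = AutoDetectParser()
    val dir = File(System.getProperty("user.dir"), "src/main/resources")
    dir.listFiles().orEmpty().filter { it.isFile }.forEach { file ->
        println("-------------")
        println(file.name)
        // use {} closes the stream even if parsing throws
        println(file.inputStream().use { extractText(parser, it) })
        println("-------------")
    }
}
```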

Impressions

For now, I'm glad that I could parse files in various formats, not just xlsx. However, the parsed result is painful to work with when all you get is the text from BodyContentHandler, so I will look for something that better suits my purpose.
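
One candidate I have not evaluated in this article is `ToXMLContentHandler`, which serializes the parser's output as XHTML instead of flattening it to plain text. The sketch below only shows how it would be swapped in. `extractXhtml` is a hypothetical helper name, the file path is an assumption based on the test file name above, and I have not confirmed what markup it produces for the files used here.

```kotlin
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.ToXMLContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.InputStream

// Hypothetical helper: returns the parse result as an XHTML string
fun extractXhtml(input: InputStream): String {
    val handler = ToXMLContentHandler() // keeps the element structure instead of plain text
    AutoDetectParser().parse(input, handler, TikaMetadata(), ParseContext())
    return handler.toString()
}

fun main() {
    // Assumed location of the test file; adjust to your own layout
    val file = File(System.getProperty("user.dir"), "src/main/resources/xls_for_test.xlsx")
    println(file.inputStream().use { extractXhtml(it) })
}
```

The resulting string could then be fed to an XML or HTML parser to pick out individual elements, which may be easier than splitting the plain text by hand.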

[Java] Parse Excel (and various other) files with Apache Tika [Kotlin]