This article parses an xlsx file with Apache Tika.
The file to be parsed has the contents shown below.
As a side note, the file I used for verification was exported from a Google Docs spreadsheet, so I have not confirmed that a file generated by the genuine software or by third-party software parses the same way. In the actual file the fonts were also changed cell by cell, but that did not affect the parse result.
| Sheet 1 | | |
|---|---|---|
| hoge | Fuga | Piyo |
| fizz | Bads | fizzBuzz |
All it takes is adding the library and calling a parser.
I will try two methods: one parses only xml-type (OOXML, including xlsx) files, and the other has Tika detect the file type and parse it automatically.
I installed Tika from Maven.
I used version 1.21 for verification.
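As a sketch of that setup, the fragment below declares the dependency in a Gradle Kotlin DSL build. The build tool and the `tika-parsers` artifact are my assumptions, since the article only says that Tika came from Maven at version 1.21. The same coordinates can be used in a `pom.xml`.

```kotlin
// build.gradle.kts (fragment). Assumes a Gradle Kotlin DSL build.
repositories {
    mavenCentral()
}

dependencies {
    // Assumed artifact: tika-parsers brings in tika-core transitively.
    // 1.21 is the version used for verification in this article.
    implementation("org.apache.tika:tika-parsers:1.21")
}
```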
xml-type data can be parsed using `OOXMLParser`. The parsed result is stored in a `ContentHandler` (this time, `BodyContentHandler`). I couldn't find any way to get information out of `BodyContentHandler` other than `toString`.
The sample code is as follows.
It parses the files in `src/main/resources`.
As a caveat, if you put a file there that is not xml type (that is, one `OOXMLParser` cannot parse), the program crashes.
The execution result is the same as with the method of having Tika detect the file type and parse automatically, so I will omit it.
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.FileInputStream

fun main() {
    val parser = OOXMLParser() // Parser for XML data
    val metaData = TikaMetadata()
    val context = ParseContext()
    val handler = BodyContentHandler() // Handler for storing parse results
    File(System.getProperty("user.dir") + "/src/main/resources").listFiles().forEach {
        parser.parse(FileInputStream(it), handler, metaData, context)
        println(handler.toString())
    }
}
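To keep one unparseable file from ending the whole run, each file can be parsed inside a `try` block. This is a minimal sketch, not part of the verified code: `parseOoxmlOrNull` is a hypothetical helper name, and the broad `catch` is there because I have not checked exactly which exception `OOXMLParser` throws for a non-OOXML input.

```kotlin
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.ParseContext
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser
import org.apache.tika.sax.BodyContentHandler
import java.io.File

// Hypothetical helper: returns the extracted text, or null when OOXMLParser cannot parse the file.
fun parseOoxmlOrNull(file: File): String? {
    val handler = BodyContentHandler()
    return try {
        // use {} closes the stream even when parsing fails
        file.inputStream().use { stream ->
            OOXMLParser().parse(stream, handler, TikaMetadata(), ParseContext())
        }
        handler.toString()
    } catch (e: Exception) {
        // Broad catch on purpose: the exact exception type is an unverified detail.
        System.err.println("Skipped ${file.name}: ${e.message}")
        null
    }
}

fun main() {
    File(System.getProperty("user.dir"), "src/main/resources").listFiles().orEmpty()
        .filter { it.isFile }
        .forEach { file -> parseOoxmlOrNull(file)?.let { text -> println(text) } }
}
```

Files that fail are reported on standard error and skipped, so the remaining files are still printed.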
To have Tika detect the file type and parse automatically, replace the parser in the sample code above with `AutoDetectParser`.
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler
import org.apache.tika.metadata.Metadata as TikaMetadata
import java.io.File
import java.io.FileInputStream

fun main() {
    val parser = AutoDetectParser() // Parser that automatically detects file types
    val metaData = TikaMetadata()
    val context = ParseContext()
    val handler = BodyContentHandler() // Handler for storing parse results
    File(System.getProperty("user.dir") + "/src/main/resources").listFiles().forEach {
        println("-------------")
        println(it.name)
        parser.parse(FileInputStream(it), handler, metaData, context)
        println(handler.toString())
        println("-------------")
    }
}
I tried every file format that can be downloaded from the spreadsheet, except zip.
-------------
xls_for_test.xlsx
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz
-------------
-------------
xls_for_test -Sheet 1.csv
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz
hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
-------------
-------------
xls_for_test.ods
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz
hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
hoge
Fuga
Piyo
fizz
Bads
fizzBuzz
???
Page
??? (???)
00/00/0000, 00:00:00
Page /
-------------
-------------
xls_for_test -Sheet 1.pdf
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz
hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
hoge
Fuga
Piyo
fizz
Bads
fizzBuzz
???
Page
??? (???)
00/00/0000, 00:00:00
Page /
hoge Fuga Piyo
fizz buzz fizz buzz
-------------
-------------
xls_for_test -Sheet 1.tsv
Sheet 1
hoge Fuga Piyo
fizz buzz fizz buzz
hoge,Fuga,Piyo
fizz,Bads,fizzBuzz
hoge
Fuga
Piyo
fizz
Bads
fizzBuzz
???
Page
??? (???)
00/00/0000, 00:00:00
Page /
hoge Fuga Piyo
fizz buzz fizz buzz
hoge Fuga Piyo
fizz buzz fizz buzz
-------------
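Each listing above also contains the text of the files parsed before it. That is because the sample code shares one `BodyContentHandler` across all files, and the handler keeps appending to the same buffer. A sketch of a variant that creates a fresh handler (and fresh metadata) per file follows. `extractText` is a hypothetical helper name, and I have not re-run the files above with it.

```kotlin
import org.apache.tika.metadata.Metadata as TikaMetadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler
import java.io.File

// Hypothetical helper: parses one file with its own handler and metadata.
fun extractText(parser: AutoDetectParser, file: File): String {
    val handler = BodyContentHandler() // new handler, so text from earlier files is not carried over
    file.inputStream().use { stream ->
        parser.parse(stream, handler, TikaMetadata(), ParseContext())
    }
    return handler.toString()
}

fun main() {
    val parser = AutoDetectParser() // the parser itself can be reused across files
    File(System.getProperty("user.dir"), "src/main/resources").listFiles().orEmpty()
        .filter { it.isFile }
        .forEach { file ->
            println("-------------")
            println(file.name)
            println(extractText(parser, file))
            println("-------------")
        }
}
```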
For the time being, I'm glad that I could parse files in various formats, not just xlsx.
However, the parsed result is painful to work with when all you get is the plain text from `BodyContentHandler`, so I will look for something that better suits the purpose.
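As one possible stopgap, the text from `toString` can be split back into rows and cells. The sketch below assumes that each row is one line and that cells are separated by tab characters. I believe that is how Tika renders table cells as text, but the listings above do not confirm it. `toRows` is a hypothetical helper name and uses only the Kotlin standard library.

```kotlin
// Hypothetical helper: splits extracted text into rows of cells.
// Assumes one row per line and tab-separated cells. Blank cells and blank lines are dropped,
// so column positions are lost for rows that contain empty cells.
fun toRows(text: String): List<List<String>> =
    text.lines()
        .map { line -> line.split('\t').map { it.trim() }.filter { it.isNotEmpty() } }
        .filter { it.isNotEmpty() }

fun main() {
    // Sample input written by hand to mimic the assumed tab-separated layout.
    val sample = "Sheet 1\n\thoge\tFuga\tPiyo\n\tfizz\tBads\tfizzBuzz\n"
    println(toRows(sample)) // prints [[Sheet 1], [hoge, Fuga, Piyo], [fizz, Bads, fizzBuzz]]
}
```

This only reshapes the text. If the cell boundaries turn out not to be tabs, a different handler that keeps the table structure would be the better route.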