[JAVA] Apache POI Addictive Point List

What kind of article?

Apache POI is very useful, but it has a lot of addictive points. This article is a record of the points that got stuck in the process of creating the Excel file comparison tool "Graph paper Diff" [^ 1]. I hope it helps someone.

Again, ** Apache POI is very useful. ** I am grateful to the contributors for using this kind of thing for free. ** If you have any complaints, contribute yourself, at least raise an issue ** [^ 2]. In other words, this article is a record of my own lack of power to do that.

Please note that this article is likely to contain content based on my misunderstandings and lack of understanding. If you notice them, I would appreciate it if you could comment.

[^ 1]: It's very convenient, so please use it. It is published on here, and article is also written.

[^ 2]: One day, contributing to OSS is my next goal as an amateur programmer.

Version assumed by this article

List of fitting points

Addictive point Target POI ver. Date of description
1 I can't tell the type of sheet (worksheet, graph sheet, macro sheet, dialog sheet) Sheetinterface 4.1.0 2019/7/15 First draft
2 Is the case of the class name a typo? XSSFDialogsheetclass 4.1.0 2019/7/15 First draft
3 Note method names that differ from common naming conventions HSSFSheet#getDialog()Method 4.1.0 2019/7/15 First draft
4 HSSFSheet#getDialog()Does not work HSSFSheet#getDialog()Method 4.1.0 2019/7/15 First draft
5 .xlsx/.xlsm format dialog sheet / macro sheet is ignored WorkbookFactory#create(File)Method 4.1.0 2019/7/15 First draft
6 .xls-style dialog sheets cannot be identified by the event model API WSBoolRecord#getDialog()Method 4.1.0 2019/7/15 First draft
7 Book format (.xls vs .xlsx/.xlsm) may or may not reproduce whitespace in formulas Cell#getCellFormula()Method 4.1.0 2019/8/12 addition
8 API for manipulating diagonal ruled lines is not provided CellStyleinterface 4.1.0 2019/10/5 added
9 Behavior regarding empty comments depends on the file format Comment#getString()Method 4.1.2 2020/3/25 added
10 XSSFComment#setVisible(boolean)Does not work XSSFComment#setVisible(boolean)Method 4.1.2 2020/3/25 added
11

As soon as I remember, I will add it.

Individual explanation

No.1 Sheet type (worksheet, graph sheet, macro sheet, dialog sheet) cannot be distinguished

Description

There are the following types of Excel sheets.

You may want to distinguish between these types, especially worksheets, but for some reason Apache POI's user model API doesn't provide such functionality.

The first choice when working with Excel sheets in Apache POI is ʻorg.apache.poi, which is included in a set of features called the User Model API and can handle files in .xls / .xlsx / .xlsm format transparently. It is to use .ss.usermodel.Sheet. There is no API provided for this interface, such as Sheet # isWorksheet ()`.

Then what should we do? The Sheet interface has the following hierarchical structure.

① ** Sheet **: .xls / .xlsx / .xlsm format sheet ├─ ② ** HSSFSheet **: .xls format sheet ├─ ③ ** XSSFSheet **: .xlsx / .xlsm format sheet │ ├─ ④ ** XSSFChartSheet **: Graph sheet in .xlsx / .xlsm format │ └─ ⑤ ** XSSFDialogsheet **: Dialog sheet in .xlsx / .xlsm format └─ ⑥ ** SXSSFSheet **: .xlsx format sheet (streaming method)

For .xlsx / .xlsm format sheets, you can check the sheet type by checking if it is an instance of ④⑤ [^ 3]. However, for .xls format sheets, ② provides a method to return whether it is a dialog sheet or not [^ 4], but it does not provide a method to distinguish whether it is a graph sheet or not. This is a dead end.

[^ 3]: However, in reality, there is a pitfall introduced in No.5. [^ 4]: However, in reality, this method does not work as introduced in No.4.

solution

This can be solved by using a lower layer solution.

Details are introduced in this article.

[^ 5]: However, in reality, this also has the pitfalls introduced in No.6.

No.2 Is the case of the class name a typo?

Description

The derived hierarchy of the Sheet interface introduced in No. 1 is reprinted.

org.apache.poi.ss.usermodel.Sheet  ├─ org.apache.poi.hssf.usermodel.HSSFSheet  ├─ org.apache.poi.xssf.usermodel.XSSFSheet  │  ├─ org.apache.poi.xssf.usermodel.XSSFChartSheet  │  └─ org.apache.poi.xssf.usermodel.XSSFDialogsheet  └─ org.apache.poi.xssf.streaming.SXSSFSheet

Do you understand? Only the XSSFDialogsheet has a lowercase "s". Intentional? No, it's a typo.

solution

If the implementor so named it, that's the correct class name. Belief-based Ichamon is no solution. Take advantage of IDE completions and suggestions.

No.3 Note the method name that is different from the general naming convention

Description

In the explanation of No. 1, I wrote that (2) the HSSFSheet interface provides a method to return whether or not it is a dialog sheet. The next method is that. (Reprinted from [API documentation] provided by Apache POI (https://poi.apache.org/apidocs/4.1/org/apache/poi/hssf/usermodel/HSSFSheet.html#getDialog--))

getDialog   public boolean getDialog() get whether sheet is a dialog sheet or not Returns:   isDialog or not

get returns boolean. Isn't it good? There is no rule that it must be named ʻisDialogSheet ()`.

solution

When looking for a method, don't assume that method name. If you want a method that returns boolean, you should hit the method whose return type is boolean. That is the truth. You should read the API docs from top to bottom without any prejudice and look for the functionality you want.

No.4 HSSFSheet # getDialog () does not work

Description

Note that the HSSFSheet # getDialog () method introduced in No. 3 does not work. In particular,

I will demonstrate it.

1) Create a test Excel file (.xls format)

You can see that you can create the following types of sheets by right-clicking on the sheet name tab and selecting Insert ...

No.4-a.png

Create each sheet and describe the appropriate contents.

No.4-b.png

Save this as an Excel workbook in .xls format.

2) Creating test code

For example, write the following test code.

package mypackage;

import java.io.File;

import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class Main {
    
    public static void main(String[] args) {
        
        try (Workbook wb = WorkbookFactory.create(
                new File("C:\\Users\\xxx\\No.4 tests.xls"))) {
            
            System.out.println(wb.getNumberOfSheets());
            
            wb.sheetIterator().forEachRemaining(s -> {
                System.out.print(String.format(
                        "%s : %s : ",
                        s.getSheetName(),
                        s.getClass().getName()));
                
                if (s instanceof HSSFSheet) {
                    HSSFSheet sheet = (HSSFSheet) s;
                    try {
                        System.out.println(sheet.getDialog());
                    } catch (NullPointerException e) {
                        System.out.println("★ NPE occurrence ★");
                    }
                } else {
                    System.out.println();
                }
            });
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

3) Test execution

When I run the above test code, I get the following results:

4
A_worksheet: org.apache.poi.hssf.usermodel.HSSFSheet : false
B_Graph sheet: org.apache.poi.hssf.usermodel.HSSFSheet :★ NPE occurrence ★
C_Macro sheet: org.apache.poi.hssf.usermodel.HSSFSheet : false
D_Dialog sheet: org.apache.poi.hssf.usermodel.HSSFSheet : false

You can see that the "B_graphsheet" is throwing a NullPointerException and the "D_dialog sheet" is returning false instead of true.

solution

I think you shouldn't use the HSSFSheet # getDialog () method.

No.5 .xlsx / .xlsm format dialog sheet / macro sheet is ignored

Description

It seems that dialog sheets and macro sheets cannot be loaded by the Apache POI user model API, at least for .xlsx / .xlsm format files created in Excel for Office 365. I will demonstrate it.

1) Create a test Excel file (.xlsm format)

Save the test file used in No.4 again in .xlsm format.

2) Creating test code

Use the same test code as No.4.

3) Test execution

When you run the test code, you get the following results:

2
A_worksheet: org.apache.poi.xssf.usermodel.XSSFSheet : 
B_Graph sheet: org.apache.poi.xssf.usermodel.XSSFChartSheet : 

You can see that "C_Macro Sheet" and "D_Dialog Sheet" are not loaded.

solution

If you want to read macro sheets and dialog sheets from an Excel file in .xlsx / .xlsm format, you should give up using POI's user model API and use event model API or SAX (Simple API for XML). I think (unverified).

No.6 .xls format dialog sheet cannot be identified even by the event model API

Description

As introduced in No. 1, the Apache POI user model API cannot distinguish between .xls format worksheet types. A feature called the Event Model API is your next choice.

The .xls format Excel file format is called the Binary Exchange File Format (BIFF) and its contents are a collection of BIFF records. For an overview of BIFF, this blog is easy to understand, and [detailed specifications](https://docs.microsoft.com/en-us/openspecs/office_file_formats/ ms-xls / cd03cb5f-ca02-4934-a391-bb674cb8aa06) has been published by Micorosoft.

I can read the contents of the BIFF record with Apache POI's event model API, but apparently the WSBoolRecord # getDialog () method isn't working properly and I still can't tell the dialog sheet correctly. I will demonstrate it below.

1) Prepare a test Excel file (.xls format)

Use the same file used in No.4.

2) Test code

For example, write the following test code.

package mypackage;

import java.io.FileInputStream;

import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class Main {
    
    private static class TestListener implements HSSFListener {
        
        @Override
        public void processRecord(Record record) {
            System.out.println(record);
        }
    }
    
    public static void main(String[] args) throws Exception {
        
        try (FileInputStream fin = new FileInputStream(
                "C:\\Users\\xxx\\No.6 tests.xls");
                POIFSFileSystem poifs = new POIFSFileSystem(fin)) {
            
            HSSFRequest req = new HSSFRequest();
            TestListener listener = new TestListener();
            req.addListenerForAllRecords(listener);
            HSSFEventFactory factory = new HSSFEventFactory();
            factory.abortableProcessWorkbookEvents(req, poifs);
        }
    }
}

3) Test execution

When you run the test code, the contents of the BIFF record will be output.

If you look closely, it will be long, so I will explain it briefly. The following BOF record marks the beginning of the dialog sheet definition. It's hard to understand why, but the .type attribute of the BOF record in the dialog sheet is defined as 0x0010, which is the same as in the worksheet. (On the other hand, the graph sheet is defined as 0x0020 and the macro sheet is defined as 0x0040, which can be distinguished.)

...
5201: 
5202: [BOF RECORD]
5203:     .version  = 0x0600
5204:     .type     = 0x0010 (worksheet)
5205:     .build    = 0x4F5A
5206:     .buildyear= 1997
5207:     .history  = 0x000200C9
5208:     .reqver   = 0x00000806
5209: [/BOF RECORD]
5210: 
...

Worksheets and dialog sheets are supposed to be distinguished by the .dialog attribute of subsequent WSBOOL records (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/ccbd73f9- For some reason, ff1d-4069-be31-13d16c074ec4) is actually output as .dialog = false despite the dialog sheet.

...
5263: 
5264: [WSBOOL]
5265:     .wsbool1        = 4
5266:         .autobreaks = false
5267:         .dialog     = false
5268:         .rowsumsbelw= false
5269:         .rowsumsrigt= false
5270:     .wsbool2        = ffffffd1
5271:         .fittopage  = true
5272:         .displayguts= false
5273:         .alternateex= true
5274:         .alternatefo= true
5275: [/WSBOOL]
5276: 
...

solution

WSBoolRecord I suspect there is a problem with the class implementation. This BIFF specification and [Source code published by Apahe] You may be able to find out the cause and countermeasures by glancing at (https://poi.apache.org/download.html#POI-4.1.0).

Or maybe a sheet that is a bit tricky but doesn't have a ROW record can be considered a dialog sheet. (Unverified)

No.7 Book format (.xls vs .xlsx / .xlsm) may or may not reproduce whitespace in formulas

Description

The Apache POI user model API does not preserve whitespace in formulas. This is [specified] in the Apache POI documentation (https://poi.apache.org/components/spreadsheet/formula.html).

It would be nice if the whitespace was always removed without being retained, but the behavior depends on the format of the original Excel file.

Specifically, when a cell is filled with = 1 + 2 and this is read by theCell # getCellFormula ()method,

This is a problem, for example, if you want to compare formulas in Excel files in different formats.

solution

One approach is to standardize the math string by creating your own function that removes unnecessary spaces from the math string obtained with the Cell # getCellFormula () method. At this time, there are some points to be aware of.

The first is that whitespace in string literals must not be removed. This is because the formulas =" Hello, World !! " and the formulas =" Hello, World !! " have different meanings. However, it shouldn't be too difficult to deal with this kind of problem.

The second point to note is that spaces (half-width spaces) also have a role as a "reference operator" in Excel. For example, when the formula = MAX (A1: C3 B2: E5) is entered in the cell, this is "in the common area of cell area A1: C3 and cell area B2: E5, that is, in cell area B2: C3. Represents the maximum value of. Excel allows you to name cell areas. Therefore, the string = region A region B can also be a good formula. (If the shared part of "Area A" and "Area B" is multiple cells, an error "#VALUE!" Will occur, and if it is a single cell, the value of that cell will be the calculation result.) Of course, you shouldn't remove whitespace as such a "reference operator".

It is one way to create a standardized function with these points in mind.

Alternatively, you may be able to understand the Excel file format by referring to PDF file here and create your own parser.

Either way, it's a daunting task.

No.8 API for manipulating diagonal ruled lines is not provided

Description

CellStyle Interface and its derivative interface HSSFCellStyle .org / apidocs / 4.1 / org / apache / poi / hssf / usermodel / HSSFCellStyle.html), [XSSFCellStyle](https://poi.apache.org/apidocs/4.1/org/apache/poi/xssf/usermodel/ The XSSFCellStyle.html) interface provides several APIs for working with borders.

As you can see, only the API for vertical (Left, Right) and horizontal (Top, Bottom) ruled lines is provided, not the API for diagonal ruled lines.

solution

It is published on this blog (Program memorandum etc --Java Apache POI xls file only to get diagonal ruled lines) You need to refer to the information and implement the processing related to diagonal ruled lines yourself. In that case, it is necessary to understand the class structure inside POI.

No.9 Behavior regarding empty comments differs depending on the file format

Description

For empty comments (cells with comments but nothing in the comments), the xls format workbook returns the empty string " ", while xlsx / xlsm The format workbook returns null.

Cell cell = (Get a cell object in some way)
Comment comment = cell.getCellComment();  //Returns null if there are no comments
RichTextString richText = comment.getString();  //A non-null value is also returned for empty comments
String str = richText.getString();  //★ The behavior of empty comments differs depending on the book format ★

solution

For example, it would be better to normalize to the empty string " " by the following method.

String str = Optional.ofNullable(richText.getString).orElse("");

No.10 XSSFComment # setVisible (boolean) does not work

Description

Running setVisible (false) on a visible cell comment does not hide it, and running setVisible (true) on a hidden cell comment does not show it. Hmm. I wondered if true and false were the opposite, but it didn't work either.

solution

The XSSFComment # setVisible (boolean) method cannot be used. No alternative has been found.

Middle tightening

I'm a little tired, so I'll stop here.

It seems to be persistent, but Apache POI is very useful. It's very convenient, but as mentioned above, it's also true that when you try to use it, you're getting crazy.

One day I want to contribute myself, but I want to leave for a while to maintain my mental health.

Recommended Posts

Apache POI Addictive Point List
[Apache POI] Corresponding Excel version
Excel operation using Apache POI
Apache Commons BeanUtils fit point
apache POI personal notes crossfish21
How to use Apache POI
Manipulate Excel with Apache POI
Apache POI Excel in Kotlin
Excel output using Apache POI!
[Apache POI] Judgment of unnecessary cells