Apache POI is very useful, but it has a lot of addictive points. This article is a record of the points that got stuck in the process of creating the Excel file comparison tool "Graph paper Diff" [^ 1]. I hope it helps someone.
Again, ** Apache POI is very useful. ** I am grateful to the contributors for using this kind of thing for free. ** If you have any complaints, contribute yourself, at least raise an issue ** [^ 2]. In other words, this article is a record of my own lack of power to do that.
Please note that this article is likely to contain content based on my misunderstandings and lack of understanding. If you notice them, I would appreciate it if you could comment.
[^ 1]: It's very convenient, so please use it. It is published on here, and article is also written.
[^ 2]: One day, contributing to OSS is my next goal as an amateur programmer.
As soon as I remember, I will add it.
There are the following types of Excel sheets.
You may want to distinguish between these types, especially worksheets, but for some reason Apache POI's user model API doesn't provide such functionality.
The first choice when working with Excel sheets in Apache POI is ʻorg.apache.poi, which is included in a set of features called the User Model API and can handle files in .xls / .xlsx / .xlsm format transparently. It is to use .ss.usermodel.Sheet. There is no API provided for this interface, such as
Sheet # isWorksheet ()`.
Then what should we do? The Sheet interface has the following hierarchical structure.
① ** Sheet **: .xls / .xlsx / .xlsm format sheet ├─ ② ** HSSFSheet **: .xls format sheet ├─ ③ ** XSSFSheet **: .xlsx / .xlsm format sheet │ ├─ ④ ** XSSFChartSheet **: Graph sheet in .xlsx / .xlsm format │ └─ ⑤ ** XSSFDialogsheet **: Dialog sheet in .xlsx / .xlsm format └─ ⑥ ** SXSSFSheet **: .xlsx format sheet (streaming method)
For .xlsx / .xlsm format sheets, you can check the sheet type by checking if it is an instance of ④⑤ [^ 3]. However, for .xls format sheets, ② provides a method to return whether it is a dialog sheet or not [^ 4], but it does not provide a method to distinguish whether it is a graph sheet or not. This is a dead end.
[^ 3]: However, in reality, there is a pitfall introduced in No.5. [^ 4]: However, in reality, this method does not work as introduced in No.4.
This can be solved by using a lower layer solution.
Details are introduced in this article.
[^ 5]: However, in reality, this also has the pitfalls introduced in No.6.
The derived hierarchy of the Sheet interface introduced in No. 1 is reprinted.
org.apache.poi.ss.usermodel.Sheet ├─ org.apache.poi.hssf.usermodel.HSSFSheet ├─ org.apache.poi.xssf.usermodel.XSSFSheet │ ├─ org.apache.poi.xssf.usermodel.XSSFChartSheet │ └─ org.apache.poi.xssf.usermodel.XSSFDialogsheet └─ org.apache.poi.xssf.streaming.SXSSFSheet
Do you understand? Only the XSSFDialogsheet has a lowercase "s". Intentional? No, it's a typo.
If the implementor so named it, that's the correct class name. Belief-based Ichamon is no solution. Take advantage of IDE completions and suggestions.
In the explanation of No. 1, I wrote that (2) the HSSFSheet
interface provides a method to return whether or not it is a dialog sheet. The next method is that. (Reprinted from [API documentation] provided by Apache POI (https://poi.apache.org/apidocs/4.1/org/apache/poi/hssf/usermodel/HSSFSheet.html#getDialog--))
getDialog
public boolean getDialog()
get whether sheet is a dialog sheet or not Returns: isDialog or not
get returns boolean. Isn't it good? There is no rule that it must be named ʻisDialogSheet ()`.
When looking for a method, don't assume that method name. If you want a method that returns boolean, you should hit the method whose return type is boolean. That is the truth. You should read the API docs from top to bottom without any prejudice and look for the functionality you want.
Note that the HSSFSheet # getDialog ()
method introduced in No. 3 does not work. In particular,
I will demonstrate it.
You can see that you can create the following types of sheets by right-clicking on the sheet name tab and selecting Insert ...
Create each sheet and describe the appropriate contents.
Save this as an Excel workbook in .xls format.
For example, write the following test code.
package mypackage;
import java.io.File;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
public class Main {
public static void main(String[] args) {
try (Workbook wb = WorkbookFactory.create(
new File("C:\\Users\\xxx\\No.4 tests.xls"))) {
System.out.println(wb.getNumberOfSheets());
wb.sheetIterator().forEachRemaining(s -> {
System.out.print(String.format(
"%s : %s : ",
s.getSheetName(),
s.getClass().getName()));
if (s instanceof HSSFSheet) {
HSSFSheet sheet = (HSSFSheet) s;
try {
System.out.println(sheet.getDialog());
} catch (NullPointerException e) {
System.out.println("★ NPE occurrence ★");
}
} else {
System.out.println();
}
});
} catch (Exception e) {
e.printStackTrace();
}
}
}
When I run the above test code, I get the following results:
4
A_worksheet: org.apache.poi.hssf.usermodel.HSSFSheet : false
B_Graph sheet: org.apache.poi.hssf.usermodel.HSSFSheet :★ NPE occurrence ★
C_Macro sheet: org.apache.poi.hssf.usermodel.HSSFSheet : false
D_Dialog sheet: org.apache.poi.hssf.usermodel.HSSFSheet : false
You can see that the "B_graphsheet" is throwing a NullPointerException and the "D_dialog sheet" is returning false instead of true.
I think you shouldn't use the HSSFSheet # getDialog ()
method.
It seems that dialog sheets and macro sheets cannot be loaded by the Apache POI user model API, at least for .xlsx / .xlsm format files created in Excel for Office 365. I will demonstrate it.
Save the test file used in No.4 again in .xlsm format.
Use the same test code as No.4.
When you run the test code, you get the following results:
2
A_worksheet: org.apache.poi.xssf.usermodel.XSSFSheet :
B_Graph sheet: org.apache.poi.xssf.usermodel.XSSFChartSheet :
You can see that "C_Macro Sheet" and "D_Dialog Sheet" are not loaded.
If you want to read macro sheets and dialog sheets from an Excel file in .xlsx / .xlsm format, you should give up using POI's user model API and use event model API or SAX (Simple API for XML). I think (unverified).
As introduced in No. 1, the Apache POI user model API cannot distinguish between .xls format worksheet types. A feature called the Event Model API is your next choice.
The .xls format Excel file format is called the Binary Exchange File Format (BIFF) and its contents are a collection of BIFF records. For an overview of BIFF, this blog is easy to understand, and [detailed specifications](https://docs.microsoft.com/en-us/openspecs/office_file_formats/ ms-xls / cd03cb5f-ca02-4934-a391-bb674cb8aa06) has been published by Micorosoft.
I can read the contents of the BIFF record with Apache POI's event model API, but apparently the WSBoolRecord # getDialog ()
method isn't working properly and I still can't tell the dialog sheet correctly.
I will demonstrate it below.
Use the same file used in No.4.
For example, write the following test code.
package mypackage;
import java.io.FileInputStream;
import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class Main {
private static class TestListener implements HSSFListener {
@Override
public void processRecord(Record record) {
System.out.println(record);
}
}
public static void main(String[] args) throws Exception {
try (FileInputStream fin = new FileInputStream(
"C:\\Users\\xxx\\No.6 tests.xls");
POIFSFileSystem poifs = new POIFSFileSystem(fin)) {
HSSFRequest req = new HSSFRequest();
TestListener listener = new TestListener();
req.addListenerForAllRecords(listener);
HSSFEventFactory factory = new HSSFEventFactory();
factory.abortableProcessWorkbookEvents(req, poifs);
}
}
}
When you run the test code, the contents of the BIFF record will be output.
If you look closely, it will be long, so I will explain it briefly.
The following BOF record marks the beginning of the dialog sheet definition. It's hard to understand why, but the .type attribute of the BOF record in the dialog sheet is defined as 0x0010
, which is the same as in the worksheet. (On the other hand, the graph sheet is defined as 0x0020
and the macro sheet is defined as 0x0040
, which can be distinguished.)
...
5201:
5202: [BOF RECORD]
5203: .version = 0x0600
5204: .type = 0x0010 (worksheet)
5205: .build = 0x4F5A
5206: .buildyear= 1997
5207: .history = 0x000200C9
5208: .reqver = 0x00000806
5209: [/BOF RECORD]
5210:
...
Worksheets and dialog sheets are supposed to be distinguished by the .dialog attribute of subsequent WSBOOL records (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/ccbd73f9- For some reason, ff1d-4069-be31-13d16c074ec4) is actually output as .dialog = false
despite the dialog sheet.
...
5263:
5264: [WSBOOL]
5265: .wsbool1 = 4
5266: .autobreaks = false
5267: .dialog = false
5268: .rowsumsbelw= false
5269: .rowsumsrigt= false
5270: .wsbool2 = ffffffd1
5271: .fittopage = true
5272: .displayguts= false
5273: .alternateex= true
5274: .alternatefo= true
5275: [/WSBOOL]
5276:
...
WSBoolRecord I suspect there is a problem with the class implementation. This BIFF specification and [Source code published by Apahe] You may be able to find out the cause and countermeasures by glancing at (https://poi.apache.org/download.html#POI-4.1.0).
Or maybe a sheet that is a bit tricky but doesn't have a ROW record can be considered a dialog sheet. (Unverified)
The Apache POI user model API does not preserve whitespace in formulas. This is [specified] in the Apache POI documentation (https://poi.apache.org/components/spreadsheet/formula.html).
It would be nice if the whitespace was always removed without being retained, but the behavior depends on the format of the original Excel file.
Specifically, when a cell is filled with = 1 + 2
and this is read by theCell # getCellFormula ()
method,
1 + 2
. That is, the whitespace is removed.This is a problem, for example, if you want to compare formulas in Excel files in different formats.
One approach is to standardize the math string by creating your own function that removes unnecessary spaces from the math string obtained with the Cell # getCellFormula ()
method. At this time, there are some points to be aware of.
The first is that whitespace in string literals must not be removed. This is because the formulas =" Hello, World !! "
and the formulas =" Hello, World !! "
have different meanings. However, it shouldn't be too difficult to deal with this kind of problem.
The second point to note is that spaces (half-width spaces) also have a role as a "reference operator" in Excel.
For example, when the formula = MAX (A1: C3 B2: E5)
is entered in the cell, this is "in the common area of cell area A1: C3 and cell area B2: E5, that is, in cell area B2: C3. Represents the maximum value of.
Excel allows you to name cell areas. Therefore, the string = region A region B
can also be a good formula. (If the shared part of "Area A" and "Area B" is multiple cells, an error "#VALUE!" Will occur, and if it is a single cell, the value of that cell will be the calculation result.)
Of course, you shouldn't remove whitespace as such a "reference operator".
It is one way to create a standardized function with these points in mind.
Alternatively, you may be able to understand the Excel file format by referring to PDF file here and create your own parser.
Either way, it's a daunting task.
CellStyle Interface and its derivative interface HSSFCellStyle .org / apidocs / 4.1 / org / apache / poi / hssf / usermodel / HSSFCellStyle.html), [XSSFCellStyle](https://poi.apache.org/apidocs/4.1/org/apache/poi/xssf/usermodel/ The XSSFCellStyle.html) interface provides several APIs for working with borders.
As you can see, only the API for vertical (Left, Right) and horizontal (Top, Bottom) ruled lines is provided, not the API for diagonal ruled lines.
It is published on this blog (Program memorandum etc --Java Apache POI xls file only to get diagonal ruled lines) You need to refer to the information and implement the processing related to diagonal ruled lines yourself. In that case, it is necessary to understand the class structure inside POI.
For empty comments (cells with comments but nothing in the comments), the xls format workbook returns the empty string " "
, while xlsx / xlsm The format workbook returns null
.
Cell cell = (Get a cell object in some way)
Comment comment = cell.getCellComment(); //Returns null if there are no comments
RichTextString richText = comment.getString(); //A non-null value is also returned for empty comments
String str = richText.getString(); //★ The behavior of empty comments differs depending on the book format ★
For example, it would be better to normalize to the empty string " "
by the following method.
String str = Optional.ofNullable(richText.getString).orElse("");
Running setVisible (false)
on a visible cell comment does not hide it, and running setVisible (true)
on a hidden cell comment does not show it. Hmm.
I wondered if true and false were the opposite, but it didn't work either.
The XSSFComment # setVisible (boolean)
method cannot be used.
No alternative has been found.
I'm a little tired, so I'll stop here.
It seems to be persistent, but Apache POI is very useful. It's very convenient, but as mentioned above, it's also true that when you try to use it, you're getting crazy.
One day I want to contribute myself, but I want to leave for a while to maintain my mental health.
Recommended Posts