When loading Excel using Apache POI, I wanted to skip cells with strikethroughs and hidden cells, so I tried to find out how to do it. I will write it as my own memo, but please refer to it if you like.
Prevents reading of row / column data that is hidden (or width is 0).
Determines if the row is hidden.
sample.java
for(Row row : sheet){
//Determine if a row is hidden
if(row.getZeroHeight()){
//If hidden, read the next line
continue;
}
...
}
Determines if the column is hidden. Apache POI reads books ➡ sheets ➡ rows ➡ columns, so In the case of columns, it is necessary to judge by cell. In addition, since there is no hidden judgment in the Cell class for columns, it is necessary to describe the judgment process in the Sheet class.
sample.java
for(Cell cell : row){
//Determine if the column is hidden
if(sheet.isColumnHidden(cell.getColumnIndex())){
//If hidden, read the next cell
continue;
}
...
}
Prevents reading cells with strikethroughs. Since the correspondence method differs depending on whether the cell has a strikethrough or a part of the cell (a part of the character) has a strikethrough, we will introduce them separately.
This is a method for determining whether or not there is a strikethrough in the entire cell.
sample.java
for(Cell cell : row){
//Get cell font
CellStyle style = c.getCellStyle();
Font font = c.getSheet().getWorkbook().getFontAt(style.getFontIndex());
//Determine if a cell has a strikethrough
if(font.getStrikeout()){
//If there is a strikethrough, read the next cell
continue;
}
...
}
If there is a strikethrough in a part of the cell, it is necessary to use the writing method properly depending on the type of Excel extension to be read.
Gets the string with the strikethrough. Example) Ai ~~ Ue ~~ O → Ue
sample.java
//Get cell value in XSSFRichTextString format
XSSFRichTextString richStr = (XSSFRichTextString) cell.getRichStringCellValue();
//Get the number of font chunks that hit the string in the cell
int cnt = richStr.numFormattingRuns();
//0 when the font is not correct
if (cnt == 0 ) {
continue;
}
for (int i = 0; i < cnt; i++;) {
//Get the i-th font chunk index
XSSFFont xssfFont = richStr.getFontOfFormattingRun(i);
//If the font does not hit the chunk
if (xssfFont == null) {
continue;
}
//If the font hits the block, determine if there is a strikethrough
if (xssfFont.getStrikeout()) {
//If there is a strikethrough, get the string of the i-th index chunk
System.out.println(richStr.getCTRst().getRArray(i).getT());
}
}
As mentioned above, in the case of xlsx, it can be obtained relatively simply. I myself had a little trouble understanding the meaning of the Formatting Run system processing, but as far as I can see the processing, if it is "Ai ~~ Ue ~~ O", try dividing it into "Ai", "Ue", and "O" chunks. It seemed to be there. (However, "Ai" was obtained without font, and "Ue" and "O" were obtained by character size, character font (Meiryo, etc.), presence / absence of strikethrough, etc.)
HSSFFont cannot get the font itself like XSSF, probably because the type is old version, and it seems to get only the index with strikethrough.
sample.java
String cellStr = cell.getStringCellValue();
HSSFRichTextString richStr = (HSSFRichTextString)cell.getRichStringCellValue();
//Get the number of chunks the font hits
int cnt = richStr.numFormattingRuns();
if (cnt == 0){
continue;
} else {
int startStrikeoutIndex = -1;
boolean isStrikeoutStartIndex = false;
for (int i = 0; i < cnt; i++;) {
//Get the i-th font chunk index
Font hssfFont = wb.getFontAt(richStr.getFontOfFormattingRun(i));
int index = richStr.getIndexOfFormattingRun(i);
//If the font does not hit the chunk
if (hssfFont == null) {
if (isStrikeoutStartIndex) {
System.out.println(cellStr.substring(startStrikeoutIndex, index));
isStrikeoutStartIndex = false;
}
continue;
}
//If you have font settings but no strikethrough
if (!hssfFont.getStrikeout()) {
if (isStrikeoutStartIndex) {
System.out.println(cellStr.substring(startStrikeoutIndex, index));
isStrikeoutStartIndex = false;
}
}
//If there is a strikethrough in part
else if (hssfFont.getStrikeout()) {
startStrikeoutIndex = index;
isStrikeoutStartIndex = true;
if (i + 1 == cnt) {
//In the case of the last chunk, it is output here.
System.out.println(cellStr.substring(startStrikeoutIndex, cellStr.length()));
}
}
}
}
In my research, I implemented it using usable processing, so it became quite a skill. To briefly explain the mechanism, in the case of "Ah ~~ i ~~ u ~~ e ~~ o", the Index of the first character of the part with the strikethrough is acquired as startStrikeoutIndex (in the case of "I" 1), isStrikeoutStartIndex. Is set to true. After that, if isStrikeoutStartIndex is true, the character between startStrikeoutIndex and the Index of the first character of the next block (2 because it is the "U" part) is the character that is struck through, so get that part. I am.
It was a very difficult part, so I summarized it so that I can look back when needed. I hope it helps those who want to do the same.
Recommended Posts