Introduction

When loading Excel using Apache POI, I wanted to skip cells with strikethroughs and hidden cells, so I tried to find out how to do it. I will write it as my own memo, but please refer to it if you like.

I will omit the basic method of writing Apache POI.

[Hidden Matrix](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#1-%E9%9D%9E%E8%A1%A8%E7%A4%BA%E8%A1%8C%E5 % 88% 97)
[Judgment of hidden lines](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#1-%E9%9D%9E%E8%A1%A8%E7%A4%BA%E8%A1%8C % E3% 81% AE% E5% 88% A4% E5% AE% 9A)
[Judgment of hidden columns](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#2-%E9%9D%9E%E8%A1%A8%E7%A4%BA%E5%88%97 % E3% 81% AE% E5% 88% A4% E5% AE% 9A )
Cancellation
[There is a strikethrough throughout the cell](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#1-%E3%82%BB%E3%83%AB%E5%85%A8%E4%BD % 93% E3% 81% AB% E5% 8F% 96% E6% B6% 88% E7% B7% 9A% E3% 81% 8C% E3% 81% 82% E3% 82% 8B)
[There is a strikethrough in a part of the cell](https://qiita.com/milestone/items/8fef7ec91bab31354fdb#2-%E3%82%BB%E3%83%AB%E5%86%85% E3% 81% AE% E4% B8% 80% E9% 83% A8% E3% 81% AB% E5% 8F% 96% E6% B6% 88% E7% B7% 9A% E3% 81% 8C% E3% 81% 82% E3% 82% 8B)

1. Hidden matrix

Prevents reading of row / column data that is hidden (or width is 0).

As I learned for the first time this time, it seems that if the row / column width of Excel is set to 0, it will be treated the same as hidden.

1. Judgment of hidden lines

Determines if the row is hidden.

`sample.java`


for(Row row : sheet){
   //Determine if a row is hidden
   if(row.getZeroHeight()){
      //If hidden, read the next line
      continue;
   }
   ...
}

2. Judgment of hidden columns

Determines if the column is hidden. Apache POI reads books ➡ sheets ➡ rows ➡ columns, so In the case of columns, it is necessary to judge by cell. In addition, since there is no hidden judgment in the Cell class for columns, it is necessary to describe the judgment process in the Sheet class.

`sample.java`


for(Cell cell : row){
   //Determine if the column is hidden
   if(sheet.isColumnHidden(cell.getColumnIndex())){
      //If hidden, read the next cell
      continue;
   }
   ...
}

This process is judged every time in all cells, but if you read a large amount of files or multiple files at once, the process will be considerably slower. When dealing with a large amount of data, it may be good to keep the column numbers in a list.

2. Cancellation line

Prevents reading cells with strikethroughs. Since the correspondence method differs depending on whether the cell has a strikethrough or a part of the cell (a part of the character) has a strikethrough, we will introduce them separately.

1. There is a strikethrough throughout the cell

This is a method for determining whether or not there is a strikethrough in the entire cell.

`sample.java`


for(Cell cell : row){
   //Get cell font
   CellStyle style = c.getCellStyle();
   Font font = c.getSheet().getWorkbook().getFontAt(style.getFontIndex());
   //Determine if a cell has a strikethrough
   if(font.getStrikeout()){
      //If there is a strikethrough, read the next cell
      continue;
   }
   ...
}

2. There is a strikethrough in a part of the cell

If there is a strikethrough in a part of the cell, it is necessary to use the writing method properly depending on the type of Excel extension to be read.

For more information about HSSF and XSSF, click here [https://qiita.com/milestone/items/07d22bb1bb36e5fcf2eb#excel%E3%81%AE%E8%AA%AD%E3%81%BF%E6%9B % B8% E3% 81% 8D% E3% 81% AB% E4% BD% BF% E3% 81% 86% E3% 82% B3% E3% 83% B3% E3% 83% 9D% E3% 83% BC Please see% E3% 83% 8D% E3% 83% B3% E3% 83% 88).

When the extension is xlsx

Gets the string with the strikethrough. Example) Ai ~~ Ue ~~ O → Ue

`sample.java`


//Get cell value in XSSFRichTextString format
XSSFRichTextString richStr = (XSSFRichTextString) cell.getRichStringCellValue();
//Get the number of font chunks that hit the string in the cell
int cnt = richStr.numFormattingRuns();
//0 when the font is not correct
if (cnt == 0 ) {
   continue;
}
for (int i = 0; i < cnt; i++;) {
   //Get the i-th font chunk index
   XSSFFont xssfFont = richStr.getFontOfFormattingRun(i);
   //If the font does not hit the chunk
   if (xssfFont == null) {
       continue;
   }
   //If the font hits the block, determine if there is a strikethrough
   if (xssfFont.getStrikeout()) {
   //If there is a strikethrough, get the string of the i-th index chunk
       System.out.println(richStr.getCTRst().getRArray(i).getT());
   }
}

As mentioned above, in the case of xlsx, it can be obtained relatively simply. I myself had a little trouble understanding the meaning of the Formatting Run system processing, but as far as I can see the processing, if it is "Ai ~~ Ue ~~ O", try dividing it into "Ai", "Ue", and "O" chunks. It seemed to be there. (However, "Ai" was obtained without font, and "Ue" and "O" were obtained by character size, character font (Meiryo, etc.), presence / absence of strikethrough, etc.)

When the extension is xls (Excel 97-2003 ver.)

HSSFFont cannot get the font itself like XSSF, probably because the type is old version, and it seems to get only the index with strikethrough.

`sample.java`


String cellStr = cell.getStringCellValue();
HSSFRichTextString richStr = (HSSFRichTextString)cell.getRichStringCellValue();
//Get the number of chunks the font hits
int cnt = richStr.numFormattingRuns();
if (cnt == 0){
   continue;
} else {
   int startStrikeoutIndex = -1;
   boolean isStrikeoutStartIndex = false;
   for (int i = 0; i < cnt; i++;) {
      //Get the i-th font chunk index
      Font hssfFont = wb.getFontAt(richStr.getFontOfFormattingRun(i));
      int index = richStr.getIndexOfFormattingRun(i);
      //If the font does not hit the chunk
      if (hssfFont == null) {
          if (isStrikeoutStartIndex) {
             System.out.println(cellStr.substring(startStrikeoutIndex, index));
             isStrikeoutStartIndex = false;
          }
          continue;
      }
      //If you have font settings but no strikethrough
      if (!hssfFont.getStrikeout()) {
          if (isStrikeoutStartIndex) {
             System.out.println(cellStr.substring(startStrikeoutIndex, index));
             isStrikeoutStartIndex = false; 
          }
      } 
      //If there is a strikethrough in part
      else if (hssfFont.getStrikeout()) {
          startStrikeoutIndex = index;
          isStrikeoutStartIndex = true;
          if (i + 1 == cnt) {
             //In the case of the last chunk, it is output here.
             System.out.println(cellStr.substring(startStrikeoutIndex, cellStr.length()));
          }
      }
   }
}

In my research, I implemented it using usable processing, so it became quite a skill. To briefly explain the mechanism, in the case of "Ah ~~ i ~~ u ~~ e ~~ o", the Index of the first character of the part with the strikethrough is acquired as startStrikeoutIndex (in the case of "I" 1), isStrikeoutStartIndex. Is set to true. After that, if isStrikeoutStartIndex is true, the character between startStrikeoutIndex and the Index of the first character of the next block (2 because it is the "U" part) is the character that is struck through, so get that part. I am.

in conclusion

It was a very difficult part, so I summarized it so that I can look back when needed. I hope it helps those who want to do the same.

reference

[How to judge the "strikethrough" setting in the Excel sheet] (http://blog.livedoor.jp/itawasa7/archives/52144136.html)

[JAVA] [Apache POI] Judgment of unnecessary cells

Introduction

Table of contents

1. Hidden matrix

1. Judgment of hidden lines

`sample.java`

2. Judgment of hidden columns

`sample.java`

2. Cancellation line

1. There is a strikethrough throughout the cell

`sample.java`

2. There is a strikethrough in a part of the cell

When the extension is xlsx

`sample.java`

When the extension is xls (Excel 97-2003 ver.)

`sample.java`

in conclusion

reference