Extracting elements from a large amount of HTML by regular expression replacement in Java

What I want to do this time

Read HTML text with Java and store it line by line in a list. Then go through the String data in the list, apply a regular expression replacement to the lines that match, and append the processed results to a single String named all.

Assumptions

- A large number of similar files are processed in a batch.
- Reading each file in Java and adding it line by line to a List is already done.
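The reading step itself is not shown in this article. As a minimal sketch, assuming the files are UTF-8, the line-by-line list could be filled roughly like this. The class name HtmlLineReader and both helper methods are hypothetical and not part of the original code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HtmlLineReader {
    // Hypothetical helper: reads one file into an ArrayList, one element per line.
    // Assumes the file is encoded as UTF-8.
    static ArrayList<String> readLines(Path file) {
        try {
            return new ArrayList<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper used only to make this sketch runnable: writes a temp file.
    static Path writeSample(List<String> lines) {
        try {
            Path file = Files.createTempFile("sample", ".html");
            return Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path sample = writeSample(Arrays.asList("<h3>Data 1 table</h3>", "<div>x</div>"));
        ArrayList<String> list = readLines(sample);
        System.out.println(list.size());  // 2
    }
}
```

For batch processing, the same call would be repeated for each file, with list refilled before every run.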

Advantages

- Data that is caught by the same regular expression in a different part of the file can be saved as separate data.
- Multiple hits are supported.
- The number of stored fields does not shift, because the case with no hit is handled explicitly.

Specification

The method searches every String in list, replaces the data it is looking for, and stores the result in all (a String). Arguments decide from which hit to start storing the data and what to do when the desired data is not found. When the no argument is "", nothing at all is added for a miss. Otherwise the specified string is added, and the fields are separated by commas.
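The effect of the no argument can be illustrated with a small sketch. It is a simplified stand-in, not the code from this article: the class NoHitDemo and the method appendField are hypothetical, and the kn delimiter used by the real add_all is left out for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NoHitDemo {
    // Hypothetical helper: appends one field to the running output string.
    // hits are the replaced strings found for this field; no is the placeholder.
    static String appendField(String all, List<String> hits, String no) {
        StringBuilder sb = new StringBuilder(all);
        if (hits.isEmpty()) {
            if (!no.isEmpty()) sb.append(no);   // a miss still fills the field
        } else {
            for (String hit : hits) sb.append(hit);
        }
        if (!no.isEmpty()) sb.append(",");      // separator only when no is not ""
        return sb.toString();
    }

    public static void main(String[] args) {
        String all = "";
        all = appendField(all, Arrays.asList("10"), "none");
        all = appendField(all, Collections.<String>emptyList(), "none"); // nothing found
        all = appendField(all, Arrays.asList("30"), "none");
        System.out.println(all);  // 10,none,30,
    }
}
```

Because the second field is filled with the placeholder, the third value stays in the third column.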

Argument  Meaning
be        Regular expression to search for (the text before replacement)
af        Replacement string (the text after replacement)
no        String stored when the search finds nothing
s         Number of the first hit to store (1-based)
num       How many hits to store
be_set    Regular expression; the search starts once a line matching it is found
af_set    Regular expression; the search ends when a line matching it is found
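How s and num pick a window out of the hits can be sketched as follows. The class HitWindow and the method select are hypothetical names for illustration. The sketch assumes that the window should simply be cut short when it would run past the last hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HitWindow {
    // Hypothetical helper: returns up to num hits starting at the s-th hit (1-based).
    static List<String> select(List<String> hits, int s, int num) {
        if (hits.isEmpty()) return new ArrayList<>();
        // Clamp s into the range 1..size
        int start = Math.min(Math.max(s, 1), hits.size());
        // Never take more elements than remain from start to the end
        int count = Math.min(Math.max(num, 0), hits.size() - start + 1);
        return new ArrayList<>(hits.subList(start - 1, start - 1 + count));
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("a", "b", "c", "d");
        System.out.println(select(hits, 2, 2));  // [b, c]
        System.out.println(select(hits, 3, 5));  // [c, d]
    }
}
```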

Source code flow

- For every element of list
  - Check for a be_set hit (sets be_flag)
  - Check for an af_set hit (break if it hits while be_flag is true)
  - When be_flag is true
    - Check for a hit on be; if it hits, replace the line and add it to save (a list)
    - However, when only one hit is requested, add it to all here and return

- For every element of save
  - Exception handling for save.size() = 0
  - Exception handling for the size relationship between save.size() and the hit-count arguments
  - Add the requested number of elements to all
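The first half of this flow, the range-limited scan, can be sketched in isolation as below. This is a simplified illustration under assumptions, not the full method: the class RangeScan, the method collect, and the sample lines are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeScan {
    // Hypothetical helper: collects the replaced text of every line matching be
    // between a line matching start and a line matching end.
    static List<String> collect(List<String> lines, String be, String af,
                                String start, String end) {
        List<String> save = new ArrayList<>();
        boolean inRange = false;
        for (String line : lines) {
            if (line.matches(start)) inRange = true;   // start marker seen
            if (inRange && line.matches(end)) break;   // end marker: stop scanning
            if (inRange && line.matches(be)) {
                save.add(line.replaceAll(be, af));     // keep the replaced text
            }
        }
        return save;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<h3>Data 1 table</h3>",
            "<td>10</td>",
            "<td>20</td>",
            "<h3>Data 2 table</h3>",
            "<td>30</td>");
        System.out.println(collect(lines, "<td>(.+)</td>", "$1",
            "<.+>Data 1 table</h3>", "<.+>Data 2 table</h3>"));  // [10, 20]
    }
}
```

Note that String.matches only succeeds when the whole line matches the pattern, which is why the marker patterns start with <.+>.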

How to call

call


sp("<.+><.+-(.+)\"></i></div>"
,"$1q"
,"noq"
,1
,14
,"<.+>Data 1 table</h3>"
,"<.+>Data 2 table</h3>");

In this case, among the lines of the HTML between the "Data 1 table" heading and the "Data 2 table" heading, those that match the regular expression <.+><.+-(.+)\"></i></div> are replaced with $1q. ($1 is the group reference used in regular expression replacement: the text matched inside the parentheses is inserted as is.) Up to 14 hits are stored, starting from the first. When there is no data, noq is stored instead, so that the position of the data cells does not shift in later processing, as it would if the field were left blank.

A comma is then added and the process is finished. The advantage is that even if Data 1 and Data 2 use the same format, the correct data is obtained, because the search is limited to the range between the two headings.
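To make the $1 replacement concrete, here is a small sketch that applies the same pattern and replacement to made-up sample lines. The class ReplaceDemo, the method extract, and the sample HTML are assumptions for illustration and are not taken from the real data.

```java
class ReplaceDemo {
    // Same pattern and replacement as in the call shown in this article
    static final String BE = "<.+><.+-(.+)\"></i></div>";
    static final String AF = "$1q";

    // Returns the replaced text when the whole line matches, otherwise the no marker.
    static String extract(String line, String no) {
        return line.matches(BE) ? line.replaceAll(BE, AF) : no;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines
        System.out.println(extract("<div><i class=\"icon-star\"></i></div>", "noq")); // starq
        System.out.println(extract("<div></div>", "noq"));                            // noq
    }
}
```

The group (.+) captures the text between the last hyphen and the closing quote, so only that part survives the replacement.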

Source code

global



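//Requires: import java.util.ArrayList;
//Declared static because the static methods below use these fields
static ArrayList<String> list = new ArrayList<String>();
static String all = "";
static String qq = "qqqqqqqqq";  //A string that is not expected to match any line
//kn is the delimiter appended after each piece of data
//(not declared in the original listing): "\n" is recommended when debugging, "," at release
static String kn = "\n";
public static void add_all(String index){
    all = all + index + kn;
}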

over


//Overloads with fewer arguments
//qq is a string that is not expected to match, so the whole list is searched
//With three arguments, only the first hit is looked up
public static void sp(String be,String af,String no){
    int s = 1; int num = 1;
    sp(be,af,no,s,num,qq,qq);
}
public static void sp(String be,String af,String no,int s,int num){
    sp(be,af,no,s,num,qq,qq);
}

sp


public static void sp(String be,String af,String no,int s,int num ,String be_set,String af_set){
  int i;
  boolean be_flag = false;
  boolean af_flag = false;
  boolean cutset = false;
  //If the start / end flag is not entered, perform a full search.
  if(be_set.equals(qq) && af_set.equals(qq)){
    be_flag=true;
  }
  //When only the first hit is requested, the cutset flag lets the search return as soon as it is found.
  if(s ==1 && num ==1){
    cutset = true;
  }
  ArrayList<String> save = new ArrayList<String>();
  save.clear();//I don't need it, but for the time being
  //Repeat for list size
  for(i = 0;i < list.size();i++){
    //Get list data
    String line = list.get(i);
    //Start flag operation
    if(line.matches(be_set)){
      be_flag=true;
    }
    //End flag operation
    if(line.matches(af_set) && be_flag){
      break;
    }
    if(line.matches(be) && be_flag){
      line = line.replaceAll(be,af);
      save.add(line);
      deb(0,line);//deb is not defined in this article (it appears to be a debug output helper)
      //Speeds up processing when only one piece of data is searched for
      if(cutset){
        String tem = save.get(0);
        add_all(tem+",");
        return;
      }
    }
  }

  //After reading all the data
  //When there was no hit:
  //add the argument no, unless no is ""
  if(save.size() == 0){
    if(!no.equals(""))add_all(no);
  //When there was at least one hit
  }else{
    //Exception handling: keep s and num within the number of hits
    if(s < 1){
      s = 1;
    }
    if(save.size() < s){
      s = save.size();
    }
    //Never read past the last hit (s+num-1 must not exceed save.size())
    if(save.size() - s + 1 < num){
      num = save.size() - s + 1;
    }
    //Add only the specified amount in the argument to all
    for(i=0;i<num;i++){
      String tem = save.get(s+i-1);
      add_all(tem);
    }
  }
  //Append a comma separator after the data has been added (skipped when no is "")
  if(!no.equals(""))add_all(",");
}

Afterword

This is the result of adding what I needed without thinking much about the structure, but for my purposes I think it is not bad. I used it to read data out of HTML. A Python library would probably do the job as well, but I went with this approach because it took little time to implement once I had the idea.
