I used to import XML at work, but at that time the other party loaded the control code.
The following method was used for importing.
Unmarshaller#unmarshal(XMLStreamReader reader, Class
The flow is to get the contents of XML as a character string ⇒ convert it to a stream ⇒ make it an object.
And I get the following error.
Message: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
So, it seems that the other party will not fix it, so I decided to remove invalid XML characters by string replacement.
According to the above site, the character code values that can be used in XML are the following 6 patterns.
① # x9 ⇒ tab ② #xA ⇒ Line feed (LF) ③ #xD ⇒ Line feed (CR) ④ [# x20- # xD7FF] ⇒ Half-width space-Hangul ⑤ [# xE000- # xFFFD] ⇒ Gaiji-special purpose characters ⑥ [# x10000- # x10FFFF] ⇒Linear B syllabary-undefined
Well, basically you should think that the characters you use are in ④
For the time being, put a table of characters and code values in Unicode
When you specify a code value in java and replace it, write it like this. In the following, half-width spaces are replaced with blanks. You can use Matcher or something, but for the time being, you can use regular expressions with String # replaceAll.
python
String str = "XML stringized version";
str = str.replaceAll("\\u0020", "");
In Java, you can write a 2-digit character code with "\ x00" and a 4-digit character code with "\ u0000". Two backlashes are written for escape.
If you write all in 4 digits, it will be like this ① #x9 ⇒ "\u0009" ② #xA ⇒ "\u000A" ③ #xD ⇒ "\u000D" ④ [#x20-#xD7FF] ⇒ "[\u0020-\uD7FF]" ⑤ [#xE000-#xFFFD] ⇒ "[\uE000-\uFFFD]"
Wait ... Unicode has more than 5 digits ... I wondered how to express it. There was a way to specify a multi-digit code value with a regular expression.
⑥ [#x10000-#x10FFFF] ⇒ "[\x{10000}-\x{10FFFF}]"
This seems to be fine.
Regular expressions can be OR-judged, so stick them together with a pipe and deny them all.
python
String str = "XML stringized version";
str = str.replaceAll("(?!\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\x{10000}-\\x{10FFFF}]).", "");
With this, I was able to import after removing the unusable characters. I've been worried because it didn't respond unless I wrote a "." (Dot) at the end of the regular expression part. For the time being, I was able to remove the characters that could not be used.
** Added 2017/08/02 ** It was pointed out in the comment, but you can do this. I noticed that I was told, but for some reason I misunderstood that I had to separate them with pipes.
python
String str = "XML stringized version";
str = str.replaceAll("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]", "");
I wrote this article because I couldn't find "remove characters that are not included" when I searched on the WEB. There are many ways to search for rows that don't contain it.
Recommended Posts