Introduction

I was importing XML at work, and the data sent by the other party contained control characters.

The following method was used for importing.

Unmarshaller#unmarshal(XMLStreamReader reader, Class declaredType)

The flow is: get the contents of the XML as a string ⇒ convert it to a stream ⇒ turn it into an object.

Then I got the following error.

Message: An invalid XML character (Unicode: 0x2) was found in the element content of the document.

There were other invalid characters as well, but I have omitted them here.
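The error can be reproduced without JAXB. The following is a minimal sketch that feeds a string containing 0x2 to a StAX reader from the JDK. The class name `InvalidCharDemo` and the helper `parses` are made up for illustration, and the exact exception message depends on the parser implementation in use.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class InvalidCharDemo {
    /** Returns true if the StAX parser reads the whole document without an error. */
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            // Walk through every event so that the element content is actually read.
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // Thrown when the parser meets a character that XML does not allow.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<a>ok</a>"));
        // The string below contains the control character 0x2 in the element content.
        System.out.println(parses("<a>\u0002</a>"));
    }
}
```

With the parser bundled in the JDK, I would expect the first call to succeed and the second to fail with an error like the one quoted above.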

Since the other party did not seem likely to fix it, I decided to remove the invalid XML characters by string replacement.

Strings that can be used in XML

W3C Recommendation Document

According to the above document, the character code values that can be used in XML fall into the following 6 patterns.

① #x9 ⇒ tab
② #xA ⇒ line feed (LF)
③ #xD ⇒ carriage return (CR)
④ [#x20-#xD7FF] ⇒ from the space character through the Hangul range
⑤ [#xE000-#xFFFD] ⇒ from private-use (gaiji) characters through the special-purpose characters
⑥ [#x10000-#x10FFFF] ⇒ from the Linear B syllabary through the end of the Unicode code space

Basically, you can assume that the characters you normally use are in ④.
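The six ranges can also be written as a plain code point check, which is handy for confirming what the regular expressions later in this article are supposed to accept. This is only a sketch: the names `XmlChars` and `isXmlChar` are hypothetical.

```java
final class XmlChars {
    /** Returns true if the code point falls in one of the six ranges listed above. */
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9                                  // (1) tab
            || codePoint == 0xA                                  // (2) line feed
            || codePoint == 0xD                                  // (3) carriage return
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)        // (4)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)      // (5)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);  // (6)
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0x2));  // the control character from the error message
        System.out.println(isXmlChar('A'));  // an ordinary letter, inside range (4)
    }
}
```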

For reference, here is a table of Unicode characters and their code values.

Unicode List

Representation in Java

When you specify a code value in Java and replace it, write it like this. The following removes half-width spaces by replacing them with the empty string. You could use Matcher, but for now String#replaceAll, which accepts a regular expression, is enough.

`java`


String str = "XML stringized version";
str = str.replaceAll("\\u0020", "");

In a Java regular expression, you can write a 2-digit character code as "\x00" and a 4-digit character code as "\u0000". Inside a Java string literal, the backslash is doubled for escaping.

If you write them all in 4 digits, it looks like this.

① #x9 ⇒ "\u0009"
② #xA ⇒ "\u000A"
③ #xD ⇒ "\u000D"
④ [#x20-#xD7FF] ⇒ "[\u0020-\uD7FF]"
⑤ [#xE000-#xFFFD] ⇒ "[\uE000-\uFFFD]"

Wait ... Unicode code values can have 5 or more digits ... I wondered how to express those. It turns out there is a way to specify a code value with more digits in a regular expression.

⑥ [#x10000-#x10FFFF] ⇒ "[\x{10000}-\x{10FFFF}]"

This seems to be fine.
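To check that the "\x{...}" form really covers characters above 0xFFFF, a small sketch like the following can be used. In a Java string such a character is stored as a surrogate pair, but the regular expression engine treats the pair as a single code point. The class and method names are made up for this example.

```java
import java.util.regex.Pattern;

final class SupplementaryDemo {
    // Range (6): every code point above the Basic Multilingual Plane.
    static final Pattern SUPPLEMENTARY = Pattern.compile("[\\x{10000}-\\x{10FFFF}]");

    /** Returns true if the whole string is exactly one supplementary character. */
    static boolean isSupplementary(String s) {
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair
        System.out.println(isSupplementary(emoji));
        System.out.println(isSupplementary("A"));
    }
}
```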

Combine and refuse

Regular expressions support alternation (OR), so join the patterns with pipes and negate them all at once.

`java`


String str = "XML stringized version";
str = str.replaceAll("(?!\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\x{10000}-\\x{10FFFF}]).", "");

With this, I was able to import the XML after removing the unusable characters. I was stuck for a while because nothing was replaced unless I wrote a "." (dot) at the end of the regular expression. The negative lookahead (?!...) is zero-width, so on its own it only matches an empty string; the dot is what actually consumes the invalid character.
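The role of the trailing dot is easy to see with a shortened pattern. The sketch below keeps only range ④ to stay readable, so it is not the full pattern from above, and the class and method names are hypothetical.

```java
final class LookaheadDemo {
    /** Lookahead only: every match is an empty string, so nothing is removed. */
    static String withoutDot(String s) {
        return s.replaceAll("(?![\\u0020-\\uD7FF])", "");
    }

    /** Lookahead plus dot: the dot consumes the character that the lookahead rejected. */
    static String withDot(String s) {
        return s.replaceAll("(?![\\u0020-\\uD7FF]).", "");
    }

    public static void main(String[] args) {
        String input = "a\u0002b"; // contains the control character 0x2
        System.out.println(withoutDot(input).length()); // prints 3
        System.out.println(withDot(input).length());    // prints 2
    }
}
```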

**Added 2017/08/02** As pointed out in the comments, you can also write it like this, using a negated character class. It was obvious once it was pointed out, but for some reason I had assumed that the patterns had to be separated with pipes.

`java`


String str = "XML stringized version";
str = str.replaceAll("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]", "");
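If the replacement runs often, the same expression can be compiled once and reused. The following is a sketch under that assumption; `XmlSanitizer` and `strip` are names I made up, not part of any library.

```java
import java.util.regex.Pattern;

final class XmlSanitizer {
    // Same negated character class as above, compiled once instead of on every call.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    /** Removes every character that is not allowed in an XML 1.0 document. */
    static String strip(String xml) {
        return INVALID_XML_CHARS.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String dirty = "<a>x\u0002y</a>"; // 0x2 between x and y
        System.out.println(strip(dirty)); // prints <a>xy</a>
    }
}
```

The cleaned string can then be handed to the stream conversion and Unmarshaller#unmarshal as before.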

Finally

I wrote this article because, when I searched the web, I couldn't find how to "remove characters that are not in an allowed set". There are plenty of articles on searching for lines that do not contain something.
