Remove invalid XML characters in Java by specifying code values

Introduction

At work I had to import XML, and the data the other party sent contained control characters.

The following method was used for importing.

Unmarshaller#unmarshal(XMLStreamReader reader, Class declaredType)

The flow is: get the XML content as a string ⇒ convert it to a stream ⇒ unmarshal it into an object.

And I got the following error.

Message: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
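The failure can be reproduced without the unmarshalling step. The following is a minimal sketch, assuming the parser bundled with the JDK; the class name InvalidCharRepro and the method parses are made up for illustration, and JAXB is left out so that the example only depends on javax.xml.stream.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original import code.
final class InvalidCharRepro {

    // Walks the whole document with StAX and reports whether parsing succeeded.
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // Thrown when the parser meets a character that XML 1.0 does not allow.
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<a>xy</a>";
        String bad = "<a>x" + (char) 0x2 + "y</a>"; // U+0002 in the element content
        System.out.println("good parses: " + parses(good));
        System.out.println("bad parses:  " + parses(bad));
    }
}
```

The second document differs from the first only by the single control character, which is enough to make the parser give up.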

The other party was apparently not going to fix this, so I decided to remove the invalid XML characters with string replacement.

Characters that can be used in XML

W3C Recommendation Document

According to the above document, the character code values that can be used in XML fall into the following six patterns.

① #x9 ⇒ tab
② #xA ⇒ line feed (LF)
③ #xD ⇒ carriage return (CR)
④ [#x20-#xD7FF] ⇒ from the half-width space through the Hangul blocks
⑤ [#xE000-#xFFFD] ⇒ from the private use area (gaiji) through the special-purpose characters
⑥ [#x10000-#x10FFFF] ⇒ from the Linear B syllabary through the end of the Unicode range (much of it unassigned)
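The six patterns translate directly into a range check on a code point. This is only a sketch of that check; the class name XmlChars and the method isValidXmlChar are hypothetical names chosen for this example.

```java
// Hypothetical helper class illustrating the six allowed patterns.
final class XmlChars {

    // True if the code point falls into one of the six patterns listed above.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9                              // (1) tab
                || cp == 0xA                          // (2) line feed
                || cp == 0xD                          // (3) carriage return
                || (cp >= 0x20 && cp <= 0xD7FF)       // (4)
                || (cp >= 0xE000 && cp <= 0xFFFD)     // (5)
                || (cp >= 0x10000 && cp <= 0x10FFFF); // (6)
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar(0x2));     // control character from the error message
        System.out.println(isValidXmlChar('A'));     // ordinary letter, inside pattern (4)
        System.out.println(isValidXmlChar(0xD800));  // surrogate, between patterns (4) and (5)
    }
}
```

The gaps between the patterns are what matter: the control characters below #x20 other than tab, LF and CR, the surrogate range, and #xFFFE and #xFFFF.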

Basically, you can assume that the characters you normally use fall within ④.

For reference, here is a table of Unicode characters and their code values.

Unicode List

Representation in Java

To replace a character in Java by specifying its code value, write it like this. The example below removes half-width spaces by replacing them with an empty string. You could use Matcher as well, but String#replaceAll already accepts a regular expression, so that is enough for now.



String str = "XML content as a string";
// Remove every half-width space (U+0020).
str = str.replaceAll("\\u0020", "");

In a Java regular expression, you can write a 2-digit hexadecimal character code as "\x00" and a 4-digit one as "\u0000". Inside a Java string literal the backslash itself has to be escaped, which is why two backslashes are written.

Written entirely in 4 digits, the patterns look like this:

① #x9 ⇒ "\u0009"
② #xA ⇒ "\u000A"
③ #xD ⇒ "\u000D"
④ [#x20-#xD7FF] ⇒ "[\u0020-\uD7FF]"
⑤ [#xE000-#xFFFD] ⇒ "[\uE000-\uFFFD]"
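To see the two escape forms side by side, here is a small sketch. The class and method names are invented for the example; the only library call involved is String#replaceAll.

```java
// Hypothetical helper class comparing the two-digit and four-digit regex escapes.
final class RegexEscapes {

    // Four-digit form: the regex engine reads the escape as the character U+0020.
    static String stripSpacesFourDigit(String s) {
        return s.replaceAll("\\u0020", "");
    }

    // Two-digit form of the same character.
    static String stripSpacesTwoDigit(String s) {
        return s.replaceAll("\\x20", "");
    }

    // A range inside a character class, here U+0041 to U+005A (A to Z).
    static String stripUppercaseAscii(String s) {
        return s.replaceAll("[\\u0041-\\u005A]", "");
    }

    public static void main(String[] args) {
        System.out.println(stripSpacesFourDigit("a b c"));
        System.out.println(stripSpacesTwoDigit("a b c"));
        System.out.println(stripUppercaseAscii("Hello World"));
    }
}
```

In each string literal the compiler turns the doubled backslash into a single one, so the regular expression engine receives the escape exactly as written in the list above.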

But wait: some Unicode code points need five or more hexadecimal digits, so I wondered how to express those. It turns out that a regular expression can specify a code value with any number of digits.

⑥ [#x10000-#x10FFFF] ⇒ "[\x{10000}-\x{10FFFF}]"

This seems to be fine.
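A quick way to check pattern ⑥ is to run it against a character outside the Basic Multilingual Plane. The sketch below uses U+1F600 as an arbitrary test character; the class and method names are made up for the example.

```java
// Hypothetical helper class exercising the brace form of the hexadecimal escape.
final class SupplementaryDemo {

    // Removes every code point from U+10000 to U+10FFFF.
    static String stripSupplementary(String s) {
        return s.replaceAll("[\\x{10000}-\\x{10FFFF}]", "");
    }

    public static void main(String[] args) {
        // U+1F600 is stored as a surrogate pair, that is, two chars in a Java String.
        String emoji = new String(Character.toChars(0x1F600));
        String input = "a" + emoji + "b";
        System.out.println(input.length());             // 4
        System.out.println(stripSupplementary(input));  // ab
    }
}
```

The regular expression engine works on code points, so the surrogate pair is matched and removed as one character rather than as two separate chars.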

Combine and negate

Regular expressions support alternation (OR), so join the patterns with pipes and negate them all at once.



String str = "XML content as a string";
// Remove every character that is not preceded by a successful match of one of the six patterns.
str = str.replaceAll("(?!\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\x{10000}-\\x{10FFFF}]).", "");

With this, I was able to remove the unusable characters and import the XML. I was stuck for a while because nothing was removed unless I wrote a "." (dot) at the end of the regular expression. The negative lookahead (?!...) is zero-width and only checks the next character, so the dot is needed to actually match the character that should be removed.
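The effect of the trailing dot is easy to see by running the pattern both ways. This is a sketch with invented names (LookaheadDemo, withoutDot, withDot); the alternation is the same one used above.

```java
// Hypothetical helper class showing why the trailing dot matters.
final class LookaheadDemo {

    // The six allowed patterns joined with pipes.
    static final String VALID =
            "\\u0009|\\u000A|\\u000D|[\\u0020-\\uD7FF]|[\\uE000-\\uFFFD]|[\\x{10000}-\\x{10FFFF}]";

    // Lookahead only: every match is an empty position, so nothing is removed.
    static String withoutDot(String s) {
        return s.replaceAll("(?!" + VALID + ")", "");
    }

    // Lookahead plus dot: the dot consumes the character that failed the check.
    static String withDot(String s) {
        return s.replaceAll("(?!" + VALID + ").", "");
    }

    public static void main(String[] args) {
        String input = "a" + (char) 0x2 + "b";
        System.out.println(withoutDot(input).length()); // 3, the control character is still there
        System.out.println(withDot(input));             // ab
    }
}
```

Without the dot the pattern still matches, but each match has length zero, so replacing it with an empty string leaves the input unchanged.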

Added 2017/08/02: As pointed out in the comments, you can also write it like this, as a single negated character class. It seems obvious now that I have been told, but for some reason I had assumed that the patterns had to be separated with pipes.



String str = "XML content as a string";
// Remove every character outside the six allowed patterns.
str = str.replaceAll("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]", "");
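If the replacement runs on every import, it may be worth compiling the pattern once, since String#replaceAll compiles its regular expression on each call. The following is a sketch under that assumption; XmlSanitizer, sanitize and textOf are hypothetical names, and textOf merely stands in for the real unmarshalling step.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical utility class, not part of the original import code.
final class XmlSanitizer {

    // Same negated character class as above, compiled a single time.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns the input with every character outside the six patterns removed.
    static String sanitize(String xml) {
        return INVALID_XML_CHARS.matcher(xml).replaceAll("");
    }

    // Collects all character data in the document using StAX.
    static String textOf(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<a>x" + (char) 0x2 + "y</a>";
        System.out.println(textOf(sanitize(raw)));
    }
}
```

Sanitizing the string before it is turned into a stream keeps the parsing code itself unchanged, which matches the flow described in the introduction.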

Finally

I wrote this article because searching the web did not turn up a way to "remove characters that are not in a given set". There are plenty of articles on finding lines that do not contain something, though.
