This article summarizes how to deal with "Invalid codepoint xx" errors in AWS CloudSearch. I wrote the code in Python, but I think the key points apply to any language.
[Added on 2020/09/22] The function introduced here, which deletes the characters that make CloudSearch return an error, may also delete characters that do not need to be deleted. I will verify this and update the article at a later date.
The error occurs when an index field of a CloudSearch search domain is of type text or text-array and you try to upload characters that the field cannot accept. For example, if a text field called title contains the substitute character (SUB), the following error message is returned.
Validation error for field 'title': Invalid codepoint 1A
1A is the code point of a control character called the substitute character (SUB), which is an invalid code point for text type fields. I ran into this error because the data I was dealing with contained control characters such as 1A (SUB) and 08 (backspace).
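Before removing anything, it can help to see which characters in a string would be rejected. The following is a minimal sketch, not part of the original code: `find_invalid_code_points` is a hypothetical helper that reuses the same character class as the removal function in this article and reports each offending position with its code point in the two-digit hex form seen in the error message.

```python
import re
from typing import List, Tuple

# Same character class as the removal function in this article.
RE_ILLEGAL = re.compile("[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]")


def find_invalid_code_points(text: str) -> List[Tuple[int, str]]:
    """Hypothetical helper: return (index, hex code point) for each rejected character."""
    return [(m.start(), f"{ord(m.group()):02X}") for m in RE_ILLEGAL.finditer(text)]


sample = "abc\x1adef\x08"
print(find_invalid_code_points(sample))  # [(3, '1A'), (7, '08')]
```

Logging this output next to the document id makes it easier to trace where the bad data came from.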
Here is a function that removes illegal characters.
import re

def remove_invalid_code(text: str) -> str:
    # Keep tab, LF, CR, and U+0020-U+D7FF / U+E000-U+FFFD; remove everything else.
    RE_ILLEGAL = u"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)
Apply this function to the strings in every field typed as text or text-array.
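Here is a small usage sketch of the function. It also shows one behavior worth knowing: because the pattern stops at U+FFFD, characters above U+FFFF (emoji, for example) are removed as well. `remove_invalid_code_keep_supplementary` is a hypothetical variant that keeps U+10000 to U+10FFFF; whether CloudSearch accepts those characters is an assumption that has not been verified here.

```python
import re


def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


def remove_invalid_code_keep_supplementary(text: str) -> str:
    # Hypothetical variant: additionally keeps U+10000-U+10FFFF.
    # Assumption (unverified): CloudSearch accepts these code points.
    RE_ILLEGAL = (
        "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
    )
    return re.sub(RE_ILLEGAL, "", text)


print(remove_invalid_code("a\x1ab\x08c"))          # abc
print(remove_invalid_code("line1\nline2\tend"))    # newline and tab are kept
print(repr(remove_invalid_code("ok \U0001F600")))  # 'ok ' (the emoji is removed)
print(remove_invalid_code_keep_supplementary("ok \U0001F600") == "ok \U0001F600")  # True
```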
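The point is to apply the function to each string individually. Don't think, "If I serialize with json.dumps first, I can remove the invalid code points in one shot." You will regret it: json.dumps escapes control characters (for example, U+001A becomes the six-character text \u001a), so the pattern no longer matches them.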
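The difference between the two orders can be checked with the standard library alone. This is a minimal sketch using a made-up one-field document: filtering after `json.dumps` changes nothing, while filtering each value first removes the character.

```python
import json
import re

RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"

doc = {"title": "bad\x1achar"}  # made-up document containing U+001A

# Wrong order: serialize first, then filter the JSON text.
serialized = json.dumps(doc)
filtered_after = re.sub(RE_ILLEGAL, "", serialized)
# json.dumps already rewrote U+001A as the ASCII text \u001a,
# so the pattern finds nothing to remove.
print(filtered_after == serialized)                          # True
print(json.loads(filtered_after)["title"] == "bad\x1achar")  # True: still there

# Right order: filter each string value, then serialize.
cleaned = {k: re.sub(RE_ILLEGAL, "", v) for k, v in doc.items()}
print(json.dumps(cleaned))  # {"title": "badchar"}
```

The receiving side decodes the escape back into the control character, which is why the error still appears in the first case.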
Below is sample code to upload User class data to the search domain. I love types, so I code while checking types with mypy.
import json
import re
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict, List, Literal, TypedDict, Union
from uuid import uuid4

import boto3
from mypy_boto3_cloudsearchdomain import CloudSearchDomainClient


def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = u"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


class AddBatchItem(TypedDict):
    type: Literal["add"]
    id: str
    fields: Dict[str, Any]


class DeleteBatchItem(TypedDict):
    type: Literal["delete"]
    id: str


BatchItem = Union[AddBatchItem, DeleteBatchItem]
Type = Literal["add", "delete"]


@dataclass
class User:
    id: str
    name: str
    age: int
    short_description: str
    description: str
    # Names of the fields that are typed as text in the search domain.
    _text_fields: ClassVar[List[str]] = ["short_description", "description"]

    def get_batch_item(self, operation_type: Type) -> BatchItem:
        if operation_type == "delete":
            return {"id": self.id, "type": "delete"}
        fields = asdict(self)
        del fields["id"]
        # Point: apply the function to each value!
        fields = {
            k: remove_invalid_code(v) if k in self._text_fields else v
            for k, v in fields.items()
        }
        return {"id": self.id, "type": "add", "fields": fields}


if __name__ == "__main__":
    SEARCH_ENDPOINT = "http://xxxx.com"
    client: CloudSearchDomainClient = boto3.client(
        "cloudsearchdomain", endpoint_url=SEARCH_ENDPOINT
    )
    user = User(
        id=str(uuid4()),
        name="John",
        age=18,
        short_description="I'm fine",
        description="I'm fine! !!" + u"\b" + "Nice to meet you!",
    )
    batch_items = [user.get_batch_item("add")]
    docs = json.dumps(batch_items).encode("utf-8")
    client.upload_documents(documents=docs, contentType="application/json")
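The sample above only cleans plain string fields. For text-array fields, the same idea can be applied to each element. The following is a rough sketch under the assumption that a text-array value is held as a Python list of strings; `clean_text_fields` is a hypothetical helper, not part of the original code.

```python
import re
from typing import Any, Dict, Iterable


def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


def clean_text_fields(
    fields: Dict[str, Any], text_fields: Iterable[str]
) -> Dict[str, Any]:
    """Hypothetical helper: clean str values and lists of str in the named fields."""
    targets = set(text_fields)
    cleaned: Dict[str, Any] = {}
    for key, value in fields.items():
        if key not in targets:
            cleaned[key] = value  # not a text field: leave untouched
        elif isinstance(value, str):
            cleaned[key] = remove_invalid_code(value)
        elif isinstance(value, list):
            # Assumed text-array shape: clean each string element.
            cleaned[key] = [
                remove_invalid_code(v) if isinstance(v, str) else v for v in value
            ]
        else:
            cleaned[key] = value
    return cleaned


fields = {"name": "John", "age": 18, "tags": ["py\x1athon", "aws"], "description": "hi\x08!"}
print(clean_text_fields(fields, ["tags", "description"]))
# {'name': 'John', 'age': 18, 'tags': ['python', 'aws'], 'description': 'hi!'}
```

Passing the field names explicitly mirrors the `_text_fields` list in the User class, so non-text fields such as age are never touched.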
That's all for this article.
I had a hard time getting to the simple solution I've introduced here. I plan to write another article about what I learned from it.
Now I can casually write things like "a kind of control character" or "a code point", but when I first hit this error I wasn't familiar with character encodings, so I had a lot of trouble debugging. If your understanding of this area is also vague, please look forward to the next article.