This article summarizes how to deal with "Invalid codepoint xx" errors in AWS CloudSearch. I wrote the code in Python, but the key points should apply to any language.

[Added on 2020/09/22] The function introduced here, which deletes the characters that cause CloudSearch to return an error, may also delete characters that do not need to be deleted. I will verify this and update the article at a later date.

What is an "Invalid codepoint xx" error?

It occurs when an index field of a CloudSearch search domain is of type text or text-array and you try to upload characters that the field cannot accept. For example, if a text field called title contains a substitute character (SUB), the following error message is returned.

Validation error for field 'title': Invalid codepoint 1A

1A is the code point of the substitute character (SUB), a control character, and it is an invalid code point for text fields. I ran into this error because the data I was dealing with contained characters with code points such as 1A and 08 (backspace).
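Before removing anything, it can help to see which characters in your data would trigger the error. The following is a minimal sketch, not part of the original code: `find_invalid_codepoints` is a hypothetical helper name, and it assumes that the invalid characters are exactly those outside the ranges kept by the removal function shown later in this article. The two-digit uppercase hex format is chosen only to resemble the error message.

```python
import re
from typing import List

# Assumption: anything outside these ranges is what CloudSearch reports
# as an invalid code point (same ranges as the removal function in this article).
INVALID_CHARS = re.compile("[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]")


def find_invalid_codepoints(text: str) -> List[str]:
    """Return the hex code point of every character outside the allowed ranges."""
    return [format(ord(ch), "02X") for ch in INVALID_CHARS.findall(text)]


print(find_invalid_codepoints("Hello\x1a World\x08"))  # ['1A', '08']
```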

A simple solution

Here is a function that removes illegal characters.

import re

def remove_invalid_code(text: str) -> str:
    # Keep tab, LF, CR, U+0020-U+D7FF, and U+E000-U+FFFD; delete everything else.
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)

For strings in fields of type text or text-array, use this function to remove illegal characters.
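As a quick check of what the function keeps and removes, here is a small sketch with made-up input strings. Note that characters above U+FFFD, such as emoji, also fall outside the kept ranges and are deleted. Whether CloudSearch would actually reject those characters is not verified here, and this may be one case of the over-deletion mentioned in the note at the top.

```python
import re


def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


# Control characters such as SUB (1A) and backspace (08) are removed.
print(repr(remove_invalid_code("abc\x1adef\x08")))  # 'abcdef'
# Tab, line feed, and carriage return are kept.
print(repr(remove_invalid_code("a\tb\nc\r")))  # 'a\tb\nc\r'
# Characters above U+FFFD (for example emoji) are outside the kept ranges too.
print(repr(remove_invalid_code("ok \U0001F600")))  # 'ok '
```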

The point is to apply the function to each string individually. Don't think, "If I serialize everything with json.dumps first, I can strip the invalid characters from the result in one shot." You will regret it.
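A likely reason the one-shot approach fails, shown as a minimal sketch with a made-up document: json.dumps escapes control characters into plain ASCII sequences such as \u001a, so the regex no longer sees anything to remove, yet the character comes back when the JSON is parsed on the receiving side.

```python
import json
import re


def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


doc = {"title": "abc\x1adef"}

# json.dumps escapes the control character as the six ASCII characters \u001a.
serialized = json.dumps(doc)
print(serialized)  # {"title": "abc\u001adef"}

# The regex sees only ordinary ASCII, so nothing is removed ...
assert remove_invalid_code(serialized) == serialized
# ... and the control character is still there once the JSON is parsed.
assert "\x1a" in json.loads(serialized)["title"]

# Cleaning each string before serializing does remove it.
cleaned = {k: remove_invalid_code(v) for k, v in doc.items()}
print(json.dumps(cleaned))  # {"title": "abcdef"}
```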

Below is sample code that uploads User class data to the search domain. I love types, so I check them with mypy as I code.

import json
import re
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict, List, Literal, TypedDict, Union
from uuid import uuid4

import boto3
from mypy_boto3_cloudsearchdomain import CloudSearchDomainClient

def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = u"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


class AddBatchItem(TypedDict):
    type: Literal["add"]
    id: str
    fields: Dict[str, Any]


class DeleteBatchItem(TypedDict):
    type: Literal["delete"]
    id: str


BatchItem = Union[AddBatchItem, DeleteBatchItem]
Type = Literal["add", "delete"]


@dataclass
class User:
    id: str
    name: str
    age: int
    short_description: str
    description: str

    _text_fields: ClassVar[List[str]] = ["short_description", "description"]

    def get_batch_item(self, operation_type: Type) -> BatchItem:
        if operation_type == "delete":
            return {"id": self.id, "type": "delete"}

        fields = asdict(self)
        del fields["id"]

        # Point: apply the function to each text value individually!
        fields = {
            k: remove_invalid_code(v) if k in self._text_fields else v
            for k, v in fields.items()
        }

        return {"id": self.id, "type": "add", "fields": fields}


if __name__ == "__main__":

    SEARCH_ENDPOINT = "http://xxxx.com"
    client: CloudSearchDomainClient = boto3.client(
        "cloudsearchdomain", endpoint_url=SEARCH_ENDPOINT
    )

    user = User(
        id=str(uuid4()),
        name="John",
        age=18,
        short_description="I'm fine",
        description="I'm fine! !!" + u"\b" + "Nice to meet you!",
    )

    batch_items = [user.get_batch_item("add")]
    docs = json.dumps(batch_items).encode("utf-8")
    client.upload_documents(documents=docs, contentType="application/json")
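The User class above only has plain text fields. For a text-array field, the same rule of cleaning each string individually would apply to every element of the list. The following is a sketch under that assumption; `clean_text_value` is a hypothetical helper and is not part of the sample code above.

```python
import re
from typing import List, Union


def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


def clean_text_value(value: Union[str, List[str]]) -> Union[str, List[str]]:
    """Hypothetical helper: clean a text value, or every element of a text-array value."""
    if isinstance(value, str):
        return remove_invalid_code(value)
    return [remove_invalid_code(item) for item in value]


print(clean_text_value(["python\x08", "aws"]))  # ['python', 'aws']
print(clean_text_value("I'm fine!\x08"))  # I'm fine!
```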

That's all for this article.

I had a hard time getting to the simple solution I've introduced here. I plan to write another article about what I learned from it.

Now I can casually write things like "a kind of control character" or "a code point" as if everyone knew them, but when I first ran into the error I was not familiar with character encodings, so I had a lot of trouble debugging. If your own understanding of this area is a little shaky, please read the next article. Stay tuned.
