[AWS] How to deal with "Invalid codepoint" error in CloudSearch

This article summarizes how to deal with "Invalid codepoint xx" errors in AWS CloudSearch. The code is written in Python, but the approach should carry over to other languages.

[Added on 2020/09/22] The function introduced here, which deletes the characters that make CloudSearch return an error, may also delete characters that do not need to be deleted. I will verify this and update the article at a later date.

What is an "Invalid codepoint xx" error?

It occurs when an index field of a CloudSearch search domain has the type text or text-array and you try to upload characters that the field cannot accept. For example, if a text field called title contains the substitute character (SUB, U+001A), the following error message is returned.

Validation error for field 'title': Invalid codepoint 1A

1A is the code point of a control character called the substitute character (SUB), which is an invalid code point for text fields. I ran into this error because the data I was dealing with contained control characters such as 1A (SUB) and 08 (backspace).
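To see which code points in a string would trigger the error, a small diagnostic helper can be useful. The following is only a sketch: the names find_invalid_codepoints and INVALID_CHARS are hypothetical, and the character class is the same one used by the removal function shown later in this article, so it inherits the same caveat about possibly flagging too much.

```python
import re
import unicodedata
from typing import List, Tuple

# Same character class as the removal function in this article (assumption:
# anything outside these ranges is what CloudSearch rejects).
INVALID_CHARS = re.compile(r"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]")


def find_invalid_codepoints(text: str) -> List[Tuple[int, str, str]]:
    """Return (index, hex code point, Unicode category) for each suspect character."""
    return [
        (m.start(), f"{ord(m.group()):02X}", unicodedata.category(m.group()))
        for m in INVALID_CHARS.finditer(text)
    ]


if __name__ == "__main__":
    # SUB (1A) at index 2 and backspace (08) at index 5, both category Cc (control).
    print(find_invalid_codepoints("ab\x1acd\x08"))
```

The hex value in the second position is formatted the same way as the number in the CloudSearch error message, which makes it easier to match a report against the offending data.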

A simple fix

Here is a function that removes illegal characters.

import re


def remove_invalid_code(text: str) -> str:
    # Keep tab, LF, CR, U+0020-U+D7FF and U+E000-U+FFFD; remove everything else.
    RE_ILLEGAL = "[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)

For strings in fields of type text or text-array, use this function to remove illegal characters.
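One case where the function above may remove more than necessary is characters outside the Basic Multilingual Plane, such as emoji: the character class stops at U+FFFD, so they are deleted. The XML 1.0 definition of a valid character also includes U+10000 to U+10FFFF. The variant below keeps that range. This is a sketch under the assumption that CloudSearch accepts those characters, which I have not verified; the function name is hypothetical.

```python
import re

# Assumption: CloudSearch accepts everything in the XML 1.0 "Char" production,
# including the supplementary planes U+10000-U+10FFFF. Not verified.
RE_ILLEGAL_XML10 = re.compile(
    r"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def remove_invalid_code_keep_supplementary(text: str) -> str:
    """Remove control characters and similar, but keep characters above U+FFFF."""
    return RE_ILLEGAL_XML10.sub("", text)


if __name__ == "__main__":
    sample = "a\x1ab\U0001F600"  # SUB in the middle, an emoji at the end
    print(remove_invalid_code_keep_supplementary(sample) == "ab\U0001F600")
```

If uploads with supplementary characters still fail, falling back to the narrower character class is the safer choice.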

The point is to apply the function to each string value individually. Don't assume that you can serialize everything with json.dumps and then strip the invalid characters in one shot; you will regret it.
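One plausible reason the one-shot approach fails, which the text above does not spell out: json.dumps escapes control characters into ASCII sequences such as \u001a, so the regular expression no longer sees the raw character, yet the character reappears once the JSON is parsed. The sketch below compares both orders; clean_after_dumps and clean_before_dumps are hypothetical names used only for illustration.

```python
import json
import re

RE_ILLEGAL = re.compile(r"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]")


def remove_invalid_code(text: str) -> str:
    return RE_ILLEGAL.sub("", text)


def clean_after_dumps(doc: dict) -> str:
    """Anti-pattern: serialize first, then try to strip invalid characters."""
    return remove_invalid_code(json.dumps(doc))


def clean_before_dumps(doc: dict) -> str:
    """Clean each string value first, then serialize."""
    cleaned = {
        k: remove_invalid_code(v) if isinstance(v, str) else v
        for k, v in doc.items()
    }
    return json.dumps(cleaned)


if __name__ == "__main__":
    doc = {"title": "bad\x1avalue"}
    print(clean_after_dumps(doc))   # the escaped \u001a sequence is still there
    print(clean_before_dumps(doc))  # the SUB character is gone
```

In the first case the serialized text contains only printable ASCII, so nothing is removed and the parsed document still carries the SUB character.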

Below is sample code that uploads the data of a User class to the search domain. I like type annotations, so I check the code with mypy as I write it.

import json
import re
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict, List, Literal, TypedDict, Union
from uuid import uuid4

import boto3
from mypy_boto3_cloudsearchdomain import CloudSearchDomainClient

def remove_invalid_code(text: str) -> str:
    RE_ILLEGAL = u"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]"
    return re.sub(RE_ILLEGAL, "", text)


class AddBatchItem(TypedDict):
    type: Literal["add"]
    id: str
    fields: Dict[str, Any]


class DeleteBatchItem(TypedDict):
    type: Literal["delete"]
    id: str


BatchItem = Union[AddBatchItem, DeleteBatchItem]
Type = Literal["add", "delete"]


@dataclass
class User:
    id: str
    name: str
    age: int
    short_description: str
    description: str

    _text_fields: ClassVar[List[str]] = ["short_description", "description"]

    def get_batch_item(self, operation_type: Type) -> BatchItem:
        if operation_type == "delete":
            return {"id": self.id, "type": "delete"}

        fields = asdict(self)
        del fields["id"]

        # Point: apply the function to each text value individually.
        fields = {
            k: remove_invalid_code(v) if k in self._text_fields else v
            for k, v in fields.items()
        }

        return {"id": self.id, "type": "add", "fields": fields}


if __name__ == "__main__":

    SEARCH_ENDPOINT = "http://xxxx.com"
    client: CloudSearchDomainClient = boto3.client(
        "cloudsearchdomain", endpoint_url=SEARCH_ENDPOINT
    )

    user = User(
        id=str(uuid4()),
        name="John",
        age=18,
        short_description="I'm fine",
        description="I'm fine! !!" + u"\b" + "Nice to meet you!",
    )

    batch_items = [user.get_batch_item("add")]
    docs = json.dumps(batch_items).encode("utf-8")
    client.upload_documents(documents=docs, contentType="application/json")
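The sample above only cleans plain str fields. For a text-array field, whose value would be a list of strings, the same function has to be applied to each element. The following is a minimal sketch; sanitize_field is a hypothetical helper that is not part of the sample above.

```python
import re
from typing import Any

RE_ILLEGAL = re.compile(r"[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]")


def remove_invalid_code(text: str) -> str:
    return RE_ILLEGAL.sub("", text)


def sanitize_field(value: Any) -> Any:
    """Clean a text value (str) or a text-array value (list of str).

    Values of any other type (int, float, ...) are returned unchanged.
    """
    if isinstance(value, str):
        return remove_invalid_code(value)
    if isinstance(value, list):
        return [remove_invalid_code(v) if isinstance(v, str) else v for v in value]
    return value


if __name__ == "__main__":
    fields = {"tags": ["a\x08b", "c"], "age": 18, "description": "x\x1a"}
    print({k: sanitize_field(v) for k, v in fields.items()})
```

In get_batch_item, the dictionary comprehension could call such a helper instead of remove_invalid_code for the keys listed in _text_fields.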

That's all for this article.

I had a hard time getting to the simple solution I've introduced here. I plan to write another article about what I learned from it.

Now I can casually write things like "a kind of control character" or "code point", but when I first hit this error I wasn't familiar with character encodings, so I had a lot of trouble debugging. If your understanding of this area is also a bit vague, please read the next article. Stay tuned.
