It's not Near Field Communication, it's Normalization Form Canonical Composition.
Unicode Normalization @ Wikipedia
In Unicode, Å (U+212B ANGSTROM SIGN) and Å (U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE) are different characters.
Judging by its code point, the latter looks like it belongs to Latin-1, and in fact it does.
Under NFC normalization, the angstrom sign becomes A with ring above. Let's check this with Python.
>>> import unicodedata
>>> ord(unicodedata.normalize('NFC', '\N{ANGSTROM SIGN}'))
197
>>> unicodedata.name(unicodedata.normalize('NFC', '\N{ANGSTROM SIGN}'))
'LATIN CAPITAL LETTER A WITH RING ABOVE'
unicodedata is a standard library module. As an aside, NFD normalization, which sometimes comes up in connection with macOS, yields two characters.
>>> len(unicodedata.normalize('NFD', '\N{ANGSTROM SIGN}'))
2
>>> [ord(ch) for ch in unicodedata.normalize('NFD', '\N{ANGSTROM SIGN}')]
[65, 778]
>>> [unicodedata.name(ch) for ch in unicodedata.normalize('NFD', '\N{ANGSTROM SIGN}')]
['LATIN CAPITAL LETTER A', 'COMBINING RING ABOVE']
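For comparison, here is a small sketch that runs the same character through all four normalization forms. The `forms` dictionary is just an illustrative name, not something from the original session.

```python
import unicodedata

# Normalize ANGSTROM SIGN (U+212B) under each of the four Unicode
# normalization forms and record the resulting code points.
forms = {
    form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, "\N{ANGSTROM SIGN}")]
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, code_points in forms.items():
    print(form, code_points)
```

The composed forms (NFC, NFKC) give a single code point, and the decomposed forms (NFD, NFKD) give the base letter plus the combining ring. In none of the four does U+212B itself survive.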
In theory, this conversion can cause problems. For example, Shift_JIS can represent the angstrom sign but cannot represent A with ring above. If you read text from a file saved as Shift_JIS, apply NFC normalization, and then try to save it as Shift_JIS again, the write may fail.
>>> with open('from.txt', encoding='shift_jis') as fr:
...    with open('to.txt', 'w', encoding='shift_jis') as fw:
...        fw.write(unicodedata.normalize('NFC', fr.read()))
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeEncodeError: 'shift_jis' codec can't encode character '\xc5' in position 0: illegal multibyte sequence
Leaving out the file I/O, which is not essential here:
>>> '\N{ANGSTROM SIGN}'.encode('shift_jis')
b'\x81\xf0'
>>> unicodedata.normalize('NFC', '\N{ANGSTROM SIGN}').encode('shift_jis')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'shift_jis' codec can't encode character '\xc5' in position 0: illegal multibyte sequence
More directly:
>>> '\N{LATIN CAPITAL LETTER A WITH RING ABOVE}'.encode('shift_jis')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'shift_jis' codec can't encode character '\xc5' in position 0: illegal multibyte sequence
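If you want to know in advance which characters would trigger this, something like the following might help. This is only a minimal sketch: `unsafe_after_nfc` is a hypothetical helper name, and it normalizes one character at a time, so it only detects characters that change on their own (like the angstrom sign), not sequences that compose across neighboring characters.

```python
import unicodedata

def unsafe_after_nfc(text, encoding="shift_jis"):
    """Return (original, normalized) pairs for characters that can be
    encoded as-is but can no longer be encoded after NFC normalization."""
    problems = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # already unencodable; normalization is not to blame
        normalized = unicodedata.normalize("NFC", ch)
        try:
            normalized.encode(encoding)
        except UnicodeEncodeError:
            problems.append((ch, normalized))
    return problems

print(unsafe_after_nfc("1 \N{ANGSTROM SIGN} = 0.1 nm"))
```

With a check like this, you could decide to skip normalization for the offending characters, or to report them before attempting the write.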
I learned this example from "Introduction to Character Code Technology for Programmers", but the book does not explain the reason ("For some reason it is not", p. 353).
Then I happened to come across the following description on Wikipedia.
The unit symbol for the angstrom is this letter, but Unicode and JIS X 0213 define the unit symbol as a character separate from the original letter. However, the Unicode angstrom sign U+212B is a compatibility character that exists only to maintain backward compatibility with older standards, and its use is not recommended. (From Wikipedia)
So I understand there is a reason behind it: the character exists only for backward compatibility.
Still, Unicode normalization in general is quite a worry. Characters are hard.
As an aside, if you search a page in the browser for either of the two characters, both are matched. I suspect the browser normalizes the text with one of the four normalization forms before searching. I do not know whether the search behavior follows a specification common to all browsers.
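The kind of matching I imagine can be sketched as follows. This is only my guess at the idea, not how any browser is actually implemented; `contains_normalized` is a hypothetical helper name.

```python
import unicodedata

def contains_normalized(haystack, needle, form="NFC"):
    """Substring search after applying the same normalization form to both sides."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)

page = "wavelength: 5000 \N{ANGSTROM SIGN}"
query = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"

print(query in page)                     # plain comparison of code points
print(contains_normalized(page, query))  # comparison after NFC normalization
```

The plain `in` test fails because the code points differ, while the normalized comparison succeeds.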
When an end user's simple complaint, "This character is garbled," turns out to involve something like this, my reaction is "Eek." It is not someone else's problem for me, because I work with a system on Windows where CP932, shift_jis, and UTF-8 are mixed together.