[JAVA] The unfortunate world of case-insensitive wildcards (macOS)

About this article

By default, the macOS file system is case insensitive in most directories. In such a world, what does a wildcard actually match?

Being case insensitive

For example, suppose you run this script:

bash



#!/bin/bash

set -eu

# ASCII alphanumerics
touch foo.txt FOO.txt
ls *.txt && rm *.txt

# So-called full-width characters
touch ｚｅｎ.txt ＺＥＮ.txt
ls *.txt && rm *.txt

# Greek letters, Cyrillic letters, Roman numerals
touch ωяⅶ.txt ΩЯⅦ.txt
ls *.txt && rm *.txt

# DZ, NJ
touch ǳ.txt # U+01F3 Latin Small Letter DZ
touch ǲ.txt # U+01F2 Latin Capital Letter D with Small Letter Z
touch Ǳ.txt # U+01F1 Latin Capital Letter DZ
touch ǌ.txt # U+01CC Latin Small Letter NJ
touch ǋ.txt # U+01CB Latin Capital Letter N with Small Letter J
touch Ǌ.txt # U+01CA Latin Capital Letter NJ
ls *.txt && rm *.txt

# i without a dot, etc.
touch ı.txt # U+0131 Small I without dot
touch İ.txt # U+0130 Capital I with dot
touch i.txt # U+0069 Small I
touch I.txt # U+0049 Capital I
ls *.txt && rm *.txt

# Enclosed (circled) letters
touch ⓐ.txt # U+24D0
touch Ⓐ.txt # U+24B6
ls *.txt && rm *.txt

This is the result:

foo.txt
ｚｅｎ.txt
ωяⅶ.txt
ǌ.txt	ǳ.txt
i.txt	İ.txt	ı.txt
ⓐ.txt

foo and FOO are the easy case. Since the file system is not case sensitive, only one of them can survive.

As you can see from ｚｅｎ, ωяⅶ, and ⓐ, characters outside the 7-bit ASCII range also have case, and that case is not distinguished either.

ǲ is a letter in the category "Titlecase Letter" and is neither uppercase nor lowercase. The corresponding lowercase letter is ǳ and the corresponding uppercase letter is Ǳ. This character is also treated case-insensitively, so even if you run touch ǳ.txt ǲ.txt Ǳ.txt, only one file can survive.
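The category and case mappings described above can be checked with java.lang.Character. The following is only a minimal sketch; the class name TitlecaseDemo and the helper caseCategory are made up for illustration and are not part of the original experiment.

```java
class TitlecaseDemo {
    // Returns a short label for the Unicode general category of a code point.
    // Only the three "cased letter" categories are distinguished here.
    static String caseCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER: return "Lu";
            case Character.LOWERCASE_LETTER: return "Ll";
            case Character.TITLECASE_LETTER: return "Lt";
            default: return "other";
        }
    }

    public static void main(String[] args) {
        // U+01F1 (capital DZ), U+01F2 (capital D with small z), U+01F3 (small dz)
        int[] codePoints = {0x01F1, 0x01F2, 0x01F3};
        for (int cp : codePoints) {
            System.out.printf("U+%04X category=%s lower=U+%04X upper=U+%04X title=U+%04X%n",
                    cp, caseCategory(cp),
                    Character.toLowerCase(cp),
                    Character.toUpperCase(cp),
                    Character.toTitleCase(cp));
        }
    }
}
```

All three code points should report the same lowercase, uppercase, and titlecase partners, which is why a case-insensitive file system can treat them as one name.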

The table below shows the case relationships among ı, İ, i, and I.

Language    Lowercase of I    Uppercase of i
English     i                 I
Turkish     ı                 İ

If you touch ı.txt, İ.txt, i.txt, and I.txt, then i.txt, İ.txt, and ı.txt survive.

According to unicode.org, the dotless lowercase ı becomes the ordinary I when uppercased. Nevertheless, on APFS, ı.txt and I.txt are treated as different names.
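For reference, the Unicode mappings mentioned here can be inspected from Java. This is a sketch only: the class DotlessIDemo and its helper names are hypothetical, and it shows what the Java runtime reports, not what APFS does internally.

```java
import java.util.Locale;

class DotlessIDemo {
    // Locale-independent, single-code-point uppercase mapping.
    static int simpleUpper(int codePoint) {
        return Character.toUpperCase(codePoint);
    }

    // Locale-sensitive uppercase conversion; languageTag is e.g. "en" or "tr".
    static String upper(String s, String languageTag) {
        return s.toUpperCase(Locale.forLanguageTag(languageTag));
    }

    // Locale-sensitive lowercase conversion.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    // Formats the first code point of a string as U+XXXX so the output stays ASCII.
    static String hex(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // Dotless small i (U+0131) maps to the ordinary capital I (U+0049).
        System.out.printf("upper(U+0131) = U+%04X%n", simpleUpper(0x0131));
        // English rules: i <-> I
        System.out.println("en: upper(i) = " + hex(upper("i", "en")));
        System.out.println("en: lower(I) = " + hex(lower("I", "en")));
        // Turkish rules: i -> U+0130 (dotted capital), I -> U+0131 (dotless small)
        System.out.println("tr: upper(i) = " + hex(upper("i", "tr")));
        System.out.println("tr: lower(I) = " + hex(lower("I", "tr")));
    }
}
```

The locale-independent mapping pairs ı with I, so a file system that keeps ı.txt and I.txt apart is evidently not relying on that simple uppercase mapping alone.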

Behavior in various environments

I checked what happens in several environments.

shell script (bash) etc.

The result is in the following environment:

POSIX glob and bash behave the same way, so this appears to be the baseline behavior, but it is rather unpleasant.

Basically, as far as I can tell, the part of a pattern without wildcards is resolved the same way the file system resolves it (case-insensitively), while the part containing wildcards is matched case-sensitively. It was quite surprising that Foo.txt matches one file while F*.txt matches none.

F*/f*/Foo

wildcard foo.txt fred.txt
F*.txt
f*.txt
Foo.txt

i / I / ı / İ

wildcard i-lat-lo.txt I-lat-up.txt ı-tur-lo.txt İ-tur-up.txt
i*.txt
I*.txt
ı*.txt
İ*.txt
İ-tur-lo.txt
I-tur-lo.txt
ı-lat-up.txt

Ǳ / ǲ / ǳ

wildcard Ǳ-uu.txt ǲ-ul.txt ǳ-ll.txt
Ǳ*.txt
ǲ*.txt
ǳ*.txt
ǳ-uu.txt
ǳ-ul.txt
ǲ-uu.txt

ruby(Dir.glob)

Ruby's behavior is quite different from POSIX glob.

Basically, it seems to behave consistently as "case insensitive". The pattern Foo.txt matches only foo.txt, and F*.txt matches both foo.txt and fred.txt. Easy to understand.

But,

ruby


Dir.glob("ǳ-u*.txt") #=> []
Dir.glob("ǳ-uu.txt") #=> ["files/Ǳ-uu.txt"]

There are also cases where adding a wildcard reduces the number of matches. Is this a bug?

F*/f*/Foo

wildcard foo.txt fred.txt
F*.txt
f*.txt
Foo.txt

i / I / ı / İ

wildcard i-lat-lo.txt I-lat-up.txt ı-tur-lo.txt İ-tur-up.txt
i*.txt
I*.txt
ı*.txt
İ*.txt
İ-tur-lo.txt
I-tur-lo.txt
ı-lat-up.txt

Ǳ / ǲ / ǳ

wildcard Ǳ-uu.txt ǲ-ul.txt ǳ-ll.txt
Ǳ*.txt
ǲ*.txt
ǳ*.txt
ǳ-uu.txt
ǳ-ul.txt
ǲ-uu.txt

Java(PathMatcher)

java.nio.file has an interface called PathMatcher, so I tried using it. This is also quite different from POSIX glob: it always seems to be case sensitive. That differs from how the file system treats file names, but at least it is consistent.
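A minimal sketch of that kind of check follows. It assumes the matcher comes from the default file system via the "glob:" syntax; the class GlobCaseDemo and the helper globMatches are names invented here. The paths are only compared as names, so no files need to exist.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class GlobCaseDemo {
    // Returns true if fileName matches the glob according to the default file system's matcher.
    static boolean globMatches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        String[] patterns = {"F*.txt", "f*.txt", "Foo.txt"};
        String[] names = {"foo.txt", "fred.txt"};
        for (String pattern : patterns) {
            for (String name : names) {
                System.out.println(pattern + " vs " + name + " -> " + globMatches(pattern, name));
            }
        }
    }
}
```

On Unix-like default file systems, the matcher compares case-sensitively, so F*.txt and Foo.txt should not match foo.txt. The default file system on Windows is documented to behave differently, so the outcome there may not be the same.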

F*/f*/Foo

wildcard foo.txt fred.txt
F*.txt
f*.txt
Foo.txt

i / I / ı / İ

wildcard i-lat-lo.txt I-lat-up.txt ı-tur-lo.txt İ-tur-up.txt
i*.txt
I*.txt
ı*.txt
İ*.txt
İ-tur-lo.txt
I-tur-lo.txt
ı-lat-up.txt

Ǳ / ǲ / ǳ

wildcard Ǳ-uu.txt ǲ-ul.txt ǳ-ll.txt
Ǳ*.txt
ǲ*.txt
ǳ*.txt
ǳ-uu.txt
ǳ-ul.txt
ǲ-uu.txt

C#(.NET Core / Directory.GetFiles)

Similar to Ruby's behavior. Unlike Ruby, Ǳ*.txt properly (?) matches ǳ-ll.txt. Conversely, however, the pattern ǳ-uu.txt does not find Ǳ-uu.txt.

F*/f*/Foo

wildcard foo.txt fred.txt
F*.txt
f*.txt
Foo.txt

i / I / ı / İ

wildcard i-lat-lo.txt I-lat-up.txt ı-tur-lo.txt İ-tur-up.txt
i*.txt
I*.txt
ı*.txt
İ*.txt
İ-tur-lo.txt
I-tur-lo.txt
ı-lat-up.txt

Ǳ / ǲ / ǳ

wildcard Ǳ-uu.txt ǲ-ul.txt ǳ-ll.txt
Ǳ*.txt
ǲ*.txt
ǳ*.txt
ǳ-uu.txt
ǳ-ul.txt
ǲ-uu.txt

C#(Mono / Directory.GetFiles)

Surprisingly, .NET Core and Mono behave differently. Mono seems to be defeated by the letter ǲ, which is neither uppercase nor lowercase.

F*/f*/Foo

wildcard foo.txt fred.txt
F*.txt
f*.txt
Foo.txt

i / I / ı / İ

wildcard i-lat-lo.txt I-lat-up.txt ı-tur-lo.txt İ-tur-up.txt
i*.txt
I*.txt
ı*.txt
İ*.txt
İ-tur-lo.txt
I-tur-lo.txt
ı-lat-up.txt

Ǳ / ǲ / ǳ

wildcard Ǳ-uu.txt ǲ-ul.txt ǳ-ll.txt
Ǳ*.txt
ǲ*.txt
ǳ*.txt
ǳ-uu.txt
ǳ-ul.txt
ǲ-uu.txt

Perl(glob)

It behaves much like POSIX glob, but treats the dotless lowercase ı differently.

F*/f*/Foo

wildcard foo.txt fred.txt
F*.txt
f*.txt
Foo.txt

i / I / ı / İ

wildcard i-lat-lo.txt I-lat-up.txt ı-tur-lo.txt İ-tur-up.txt
i*.txt
I*.txt
ı*.txt
İ*.txt
İ-tur-lo.txt
I-tur-lo.txt
ı-lat-up.txt

Ǳ / ǲ / ǳ

wildcard Ǳ-uu.txt ǲ-ul.txt ǳ-ll.txt
Ǳ*.txt
ǲ*.txt
ǳ*.txt
ǳ-uu.txt
ǳ-ul.txt
ǲ-uu.txt

Summary

POSIX glob agrees with the file system about the part of a pattern without wildcards, but confusingly becomes case sensitive as soon as wildcards are involved.
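A comparable split between "literal name" and "wildcard" can be observed from Java, where a literal lookup is answered by the file system and a glob goes through a matcher. The sketch below is an illustration under assumptions: LiteralVsWildcardDemo, createSampleDir, and list are hypothetical names, and the scratch directory is created wherever the JVM's temporary directory lives, which may or may not be a case-insensitive volume.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LiteralVsWildcardDemo {
    // Creates a scratch directory containing foo.txt and fred.txt.
    static Path createSampleDir() {
        try {
            Path dir = Files.createTempDirectory("glob-case-demo");
            Files.createFile(dir.resolve("foo.txt"));
            Files.createFile(dir.resolve("fred.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Lists the names in dir that match the glob, sorted for stable output.
    static List<String> list(Path dir, String glob) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path entry : stream) {
                names.add(entry.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        Path dir = createSampleDir();
        // Literal name: the answer comes from the file system itself,
        // so it depends on whether the volume is case sensitive.
        System.out.println("exists(Foo.txt): " + Files.exists(dir.resolve("Foo.txt")));
        // Wildcards: the names are filtered by a glob matcher.
        System.out.println("F*.txt: " + list(dir, "F*.txt"));
        System.out.println("f*.txt: " + list(dir, "f*.txt"));
    }
}
```

On a case-insensitive volume the first line would presumably print true while F*.txt still lists nothing, which is the same kind of mismatch described above.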

Has the same opinion as POSIX glob.

On the other hand,

Seems to use its own algorithm and returns results different from POSIX glob. Things tend to go wrong around "digraph letters in which only the first letter can be capitalized" and "the lowercase i with its dot removed".
