The macOS file system is case insensitive in most directories by default. In such a world, what do wildcards match?
For example, suppose you run this script:
```bash
#!/bin/bash
set -eu

# ASCII alphanumerics
touch foo.txt FOO.txt
ls *.txt && rm *.txt

# So-called full-width (double-byte) characters
touch ｚｅｎ.txt ＺＥＮ.txt
ls *.txt && rm *.txt

# Greek letters, Cyrillic letters, Roman numerals
touch ωяⅶ.txt ΩЯⅦ.txt
ls *.txt && rm *.txt

# DZ, NJ
touch ǳ.txt # U+01F3 Latin Small Letter DZ
touch ǲ.txt # U+01F2 Latin Capital Letter D with Small Letter Z
touch Ǳ.txt # U+01F1 Latin Capital Letter DZ
touch ǌ.txt # U+01CC Latin Small Letter NJ
touch ǋ.txt # U+01CB Latin Capital Letter N with Small Letter J
touch Ǌ.txt # U+01CA Latin Capital Letter NJ
ls *.txt && rm *.txt

# Dotless i, etc.
touch ı.txt # U+0131 Latin Small Letter Dotless I
touch İ.txt # U+0130 Latin Capital Letter I with Dot Above
touch i.txt # U+0069 Latin Small Letter I
touch I.txt # U+0049 Latin Capital Letter I
ls *.txt && rm *.txt

# Enclosed characters
touch ⓐ.txt # U+24D0 Circled Latin Small Letter A
touch Ⓐ.txt # U+24B6 Circled Latin Capital Letter A
ls *.txt && rm *.txt
```
This is the result:
```
foo.txt
ｚｅｎ.txt
ωяⅶ.txt
ǌ.txt ǳ.txt
i.txt İ.txt ı.txt
ⓐ.txt
```
`FOO` and `foo` are easy. Since the file system is not case sensitive, only one of them can survive.
As you can see from `ｚｅｎ`, `ωяⅶ`, and `ⓐ`, characters outside the 7-bit ASCII range also have case, and the two cases are not distinguished either.
`ǲ` is a letter in the category "Titlecase Letter" and is neither uppercase nor lowercase.
The corresponding lowercase letter is `ǳ` and the corresponding uppercase letter is `Ǳ`.
These are not distinguished by case either, so even if you run `touch ǳ.txt ǲ.txt Ǳ.txt`, only one can survive.
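As a quick check of these mappings, Ruby's `String#upcase` and `String#downcase` (Unicode-aware since Ruby 2.4) can be applied to the three code points. This is a minimal sketch: the constant names and the `case_forms` helper are made up for illustration, and it says nothing about how the file system itself compares names.

```ruby
# The three forms of the DZ digraph, written as escapes to avoid ambiguity.
DZ_UPPER = "\u01F1" # Latin Capital Letter DZ
DZ_TITLE = "\u01F2" # Latin Capital Letter D with Small Letter Z
DZ_LOWER = "\u01F3" # Latin Small Letter DZ

# Hypothetical helper: returns the upper- and lowercase forms of a string.
def case_forms(char)
  { upcase: char.upcase, downcase: char.downcase }
end

[DZ_UPPER, DZ_TITLE, DZ_LOWER].each do |char|
  forms = case_forms(char)
  printf("U+%04X upcase=U+%04X downcase=U+%04X\n",
         char.ord, forms[:upcase].ord, forms[:downcase].ord)
end
```

All three forms share the same uppercase and the same lowercase form, so any comparison that ignores case would treat them as one name.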
The table below shows how `ı`, `İ`, `i`, and `I` relate to one another as upper- and lowercase.
Language | Lowercase of I | Uppercase of i |
---|---|---|
English | i | I |
Turkish | ı | İ |
If you touch `ı.txt`, `İ.txt`, `i.txt`, and `I.txt`, then `i.txt`, `İ.txt`, and `ı.txt` survive.
According to unicode.org, the dotless lowercase `ı` becomes the ordinary `I` when uppercased. Nevertheless, on APFS, `ı.txt` and `I.txt` are considered to be different names.
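The mappings described above can be checked with Ruby's `String#upcase` and `String#downcase`, which accept `:turkic` and `:fold` options (Ruby 2.4 or later). The sketch below is only an illustration. The helper `same_when_folded?` is hypothetical, and whether APFS actually compares names by case folding is an assumption here, not something the experiment confirms.

```ruby
DOTLESS_SMALL_I  = "\u0131" # ı
DOTTED_CAPITAL_I = "\u0130" # İ

# Default (language-independent) mapping versus the Turkic mapping.
puts DOTLESS_SMALL_I.upcase      # default uppercase of ı
puts "I".downcase                # default lowercase of I
puts "I".downcase(:turkic)       # Turkic lowercase of I
puts "i".upcase(:turkic)         # Turkic uppercase of i

# Hypothetical helper: compares two strings after Unicode case folding.
def same_when_folded?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

puts same_when_folded?("i", "I")
puts same_when_folded?(DOTLESS_SMALL_I, "I")
```

Under the default mapping `ı` uppercases to `I`, yet `ı` and `I` do not fold to the same string. If the comparison is based on case folding rather than on uppercasing, that would be consistent with the two files coexisting.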
I tried to see what happens in some environments.
The result is in the following environment:
POSIX glob and bash behave the same, so this is presumably the baseline, but it feels rather odd.
Basically, matching becomes case sensitive as soon as a wildcard is involved. It was quite surprising that `Foo.txt` matches one file while `F*.txt` matches none.
As far as I can tell, the parts of a pattern without wildcards are treated the same way the file system treats them.
F*/f*/Foo
wildcard | foo.txt | fred.txt |
---|---|---|
F*.txt | ❌ | ❌ |
f*.txt | ✅ | ✅ |
Foo.txt | ✅ | ❌ |
i / I / ı / İ
wildcard | i-lat-lo.txt | I-lat-up.txt | ı-tur-lo.txt | İ-tur-up.txt |
---|---|---|---|---|
i*.txt | ✅ | ❌ | ❌ | ❌ |
I*.txt | ❌ | ✅ | ❌ | ❌ |
ı*.txt | ❌ | ❌ | ✅ | ❌ |
İ*.txt | ❌ | ❌ | ❌ | ✅ |
İ-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
I-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
ı-lat-up.txt | ❌ | ❌ | ❌ | ❌ |
DZ / Dz / dz
wildcard | DZ-uu.txt | Dz-ul.txt | dz-ll.txt |
---|---|---|---|
DZ*.txt | ✅ | ❌ | ❌ |
Dz*.txt | ❌ | ✅ | ❌ |
dz*.txt | ❌ | ❌ | ✅ |
dz-uu.txt | ✅ | ❌ | ❌ |
dz-ul.txt | ❌ | ✅ | ❌ |
Dz-uu.txt | ✅ | ❌ | ❌ |
ruby(Dir.glob)
The behavior of ruby is quite different from POSIX glob.
Basically, it seems to behave consistently as "case insensitive".
With `Foo.txt`, only `foo.txt` matches, and with `F*.txt`, both `foo.txt` and `fred.txt` match. Easy to understand.
However,
```ruby
Dir.glob("ǳ-u*.txt") #=> []
Dir.glob("ǳ-uu.txt") #=> ["files/Ǳ-uu.txt"]
```
there are also cases where adding a wildcard reduces the number of matches. A bug?
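A table like the ones below can be reproduced with a small script. The following is a minimal sketch, not the script actually used for this article: `glob_matrix` is a hypothetical helper that creates the given files in a temporary directory and records which names `Dir.glob` returns for each pattern. The results depend on the file system the temporary directory lives on.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: creates file_names in a fresh temporary directory and
# returns a hash of { pattern => sorted base names returned by Dir.glob }.
def glob_matrix(file_names, patterns)
  Dir.mktmpdir do |dir|
    file_names.each { |name| FileUtils.touch(File.join(dir, name)) }
    patterns.map do |pattern|
      matches = Dir.glob(File.join(dir, pattern)).map { |path| File.basename(path) }
      [pattern, matches.sort]
    end.to_h
  end
end

result = glob_matrix(%w[foo.txt fred.txt], %w[F*.txt f*.txt Foo.txt])
result.each { |pattern, matches| puts "#{pattern}\t#{matches.inspect}" }
```

On a case-sensitive file system only `f*.txt` is expected to return anything, so running the same script on both kinds of volume makes the difference visible.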
F*/f*/Foo
wildcard | foo.txt | fred.txt |
---|---|---|
F*.txt | ✅ | ✅ |
f*.txt | ✅ | ✅ |
Foo.txt | ✅ | ❌ |
i / I / ı / İ
wildcard | i-lat-lo.txt | I-lat-up.txt | ı-tur-lo.txt | İ-tur-up.txt |
---|---|---|---|---|
i*.txt | ✅ | ✅ | ❌ | ❌ |
I*.txt | ✅ | ✅ | ❌ | ❌ |
ı*.txt | ❌ | ❌ | ✅ | ❌ |
İ*.txt | ❌ | ❌ | ❌ | ✅ |
İ-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
I-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
ı-lat-up.txt | ❌ | ❌ | ❌ | ❌ |
DZ / Dz / dz
wildcard | DZ-uu.txt | Dz-ul.txt | dz-ll.txt |
---|---|---|---|
DZ*.txt | ✅ | ❌ | ❌ |
Dz*.txt | ❌ | ✅ | ❌ |
dz*.txt | ❌ | ❌ | ✅ |
dz-uu.txt | ✅ | ❌ | ❌ |
dz-ul.txt | ❌ | ✅ | ❌ |
Dz-uu.txt | ✅ | ❌ | ❌ |
Java(PathMatcher)
`java.nio.file` has an interface called `PathMatcher`, so I tried using it.
This is also quite different from POSIX glob.
It always seems to be case sensitive.
It behaves differently from the way the file system treats file names, but it is consistent.
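A matcher of this kind appears to compare strings only, without asking the file system whether two names refer to the same file. As a rough Ruby analogue (an assumption for illustration, not a description of what `PathMatcher` does internally), `File.fnmatch` also matches a pattern against a plain string and is case sensitive unless `File::FNM_CASEFOLD` is passed. `matches?` is a hypothetical wrapper.

```ruby
# Hypothetical wrapper around File.fnmatch. No file needs to exist for these
# calls, because the pattern is matched against the string itself.
def matches?(pattern, name, casefold: false)
  flags = casefold ? File::FNM_CASEFOLD : 0
  File.fnmatch(pattern, name, flags)
end

puts matches?("F*.txt", "foo.txt")                 # false: case sensitive by default
puts matches?("f*.txt", "fred.txt")                # true
puts matches?("Foo.txt", "foo.txt")                # false, even without a wildcard
puts matches?("F*.txt", "foo.txt", casefold: true) # true
```

Because the outcome never depends on the volume, this style of matching is predictable, at the cost of disagreeing with a case-insensitive file system.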
F*/f*/Foo
wildcard | foo.txt | fred.txt |
---|---|---|
F*.txt | ❌ | ❌ |
f*.txt | ✅ | ✅ |
Foo.txt | ❌ | ❌ |
i / I / ı / İ
wildcard | i-lat-lo.txt | I-lat-up.txt | ı-tur-lo.txt | İ-tur-up.txt |
---|---|---|---|---|
i*.txt | ✅ | ❌ | ❌ | ❌ |
I*.txt | ❌ | ✅ | ❌ | ❌ |
ı*.txt | ❌ | ❌ | ✅ | ❌ |
İ*.txt | ❌ | ❌ | ❌ | ✅ |
İ-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
I-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
ı-lat-up.txt | ❌ | ❌ | ❌ | ❌ |
DZ / Dz / dz
wildcard | DZ-uu.txt | Dz-ul.txt | dz-ll.txt |
---|---|---|---|
DZ*.txt | ✅ | ❌ | ❌ |
Dz*.txt | ❌ | ✅ | ❌ |
dz*.txt | ❌ | ❌ | ✅ |
dz-uu.txt | ❌ | ❌ | ❌ |
dz-ul.txt | ❌ | ❌ | ❌ |
Dz-uu.txt | ❌ | ❌ | ❌ |
C#(.NET Core / Directory.GetFiles)
It behaves similarly to ruby.
Unlike ruby, `Ǳ*.txt` properly (?) matches `ǳ-ll.txt`.
On the other hand, `Ǳ-uu.txt` cannot be obtained with `ǳ-uu.txt`.
F*/f*/Foo
wildcard | foo.txt | fred.txt |
---|---|---|
F*.txt | ✅ | ✅ |
f*.txt | ✅ | ✅ |
Foo.txt | ✅ | ❌ |
i / I / ı / İ
wildcard | i-lat-lo.txt | I-lat-up.txt | ı-tur-lo.txt | İ-tur-up.txt |
---|---|---|---|---|
i*.txt | ✅ | ✅ | ❌ | ❌ |
I*.txt | ✅ | ✅ | ❌ | ❌ |
ı*.txt | ❌ | ❌ | ✅ | ❌ |
İ*.txt | ❌ | ❌ | ❌ | ✅ |
İ-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
I-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
ı-lat-up.txt | ❌ | ❌ | ❌ | ❌ |
DZ / Dz / dz
wildcard | DZ-uu.txt | Dz-ul.txt | dz-ll.txt |
---|---|---|---|
DZ*.txt | ✅ | ✅ | ✅ |
Dz*.txt | ✅ | ✅ | ✅ |
dz*.txt | ✅ | ✅ | ✅ |
dz-uu.txt | ❌ | ❌ | ❌ |
dz-ul.txt | ❌ | ❌ | ❌ |
Dz-uu.txt | ❌ | ❌ | ❌ |
C#(Mono / Directory.GetFiles)
Surprisingly, .NET Core and Mono behave differently.
Mono seems to be defeated by the letter `ǲ`, which is neither uppercase nor lowercase.
F*/f*/Foo
wildcard | foo.txt | fred.txt |
---|---|---|
F*.txt | ✅ | ✅ |
f*.txt | ✅ | ✅ |
Foo.txt | ✅ | ❌ |
i / I / ı / İ
wildcard | i-lat-lo.txt | I-lat-up.txt | ı-tur-lo.txt | İ-tur-up.txt |
---|---|---|---|---|
i*.txt | ✅ | ✅ | ❌ | ❌ |
I*.txt | ✅ | ✅ | ❌ | ❌ |
ı*.txt | ❌ | ❌ | ✅ | ❌ |
İ*.txt | ❌ | ❌ | ❌ | ✅ |
İ-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
I-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
ı-lat-up.txt | ❌ | ❌ | ❌ | ❌ |
DZ / Dz / dz
wildcard | DZ-uu.txt | Dz-ul.txt | dz-ll.txt |
---|---|---|---|
DZ*.txt | ✅ | ❌ | ✅ |
Dz*.txt | ❌ | ✅ | ❌ |
dz*.txt | ✅ | ❌ | ✅ |
dz-uu.txt | ✅ | ❌ | ❌ |
dz-ul.txt | ❌ | ❌ | ❌ |
Dz-uu.txt | ❌ | ❌ | ❌ |
Perl(glob)
It behaves much like POSIX glob, but treats the dotless lowercase `ı` differently.
F*/f*/Foo
wildcard | foo.txt | fred.txt |
---|---|---|
F*.txt | ❌ | ❌ |
f*.txt | ✅ | ✅ |
Foo.txt | ✅ | ❌ |
i / I / ı / İ
wildcard | i-lat-lo.txt | I-lat-up.txt | ı-tur-lo.txt | İ-tur-up.txt |
---|---|---|---|---|
i*.txt | ✅ | ❌ | ❌ | ❌ |
I*.txt | ❌ | ✅ | ❌ | ❌ |
ı*.txt | ❌ | ❌ | ✅ | ❌ |
İ*.txt | ❌ | ❌ | ❌ | ✅ |
İ-tur-lo.txt | ❌ | ❌ | ❌ | ❌ |
I-tur-lo.txt | ❌ | ❌ | ✅ | ❌ |
ı-lat-up.txt | ❌ | ✅ | ❌ | ❌ |
DZ / Dz / dz
wildcard | DZ-uu.txt | Dz-ul.txt | dz-ll.txt |
---|---|---|---|
DZ*.txt | ✅ | ❌ | ❌ |
Dz*.txt | ❌ | ✅ | ❌ |
dz*.txt | ❌ | ❌ | ✅ |
dz-uu.txt | ✅ | ❌ | ❌ |
dz-ul.txt | ❌ | ✅ | ❌ |
Dz-uu.txt | ✅ | ❌ | ❌ |
POSIX glob agrees with the file system about the parts of a pattern without wildcards, but it is hard to understand why it becomes case sensitive when wildcards are included.
bash agrees with POSIX glob.
The other implementations, on the other hand, seem to use their own algorithms and return different results from POSIX glob. They tend to misbehave around "a pair of letters of which only the first can be capitalized" and "a lowercase i with its dot removed".