2017/1/3 postscript: DataFrames.jl is planning to release v.0.9 in February 2017 with a major fix that also partially breaks API compatibility. Please note that this article is based on v.0.6.10. please. If I have time and feel like it (hey), I may rewrite it accordingly. However, I haven't had much time to touch Julia these days ...
This is a continuation of the article Last time. Last time I ended up without looking at the contents of the data, but today I will write about simple content confirmation and how to retrieve columns.
In the code execution example below, it is assumed that the REPL contains the data read in the previous content.
All the libraries and languages introduced this time have a function called "Return a summary of a simple data structure", so let's use it.
Pandas:
>>> unique_artists.describe()
artist_id artist_mbid \
count 3888 3489
unique 3888 3488
top AROBQ0B1187FB404F6 2c0be2ba-fa63-430f-afca-7499852158f6
freq 1 2
track_id artist_name
count 3888 3888
unique 3888 3875
top TRAPGXH128F426D205 Bill & Gloria Gaither
freq 1 3
>>> artist_term.describe()
artist_id term
count 97493 97493
unique 3885 3502
top ARLYKKT11F4C83B350 rock
freq 91 2346
DataFrames.jl:
julia> DataFrames.describe(unique_artists)
artist_id
Length 3888
Type UTF8String
NAs 0
NA% 0.0%
Unique 3888
artist_mbid
Length 3888
Type UTF8String
NAs 399
NA% 10.26%
Unique 3489
track_id
Length 3888
Type UTF8String
NAs 0
NA% 0.0%
Unique 3888
artist_name
Length 3888
Type UTF8String
NAs 0
NA% 0.0%
Unique 3875
Julia> DataFrames.describe(artist_term)
artist_id
Length 97493
Type Nullable{UTF8String}
NAs 0
NA% 0.0%
Unique 3885
term
Length 97493
Type Nullable{UTF8String}
NAs 0
NA% 0.0%
Unique 3502
R:
> summary(unique_artists)
artist_id artist_mbid
AR009211187B989185: 1 : 399
AR00A6H1187FB5402A: 1 2c0be2ba-fa63-430f-afca-7499852158f6: 2
AR00LNI1187FB444A5: 1 001a1d90-56dc-4992-bd7d-970ecd7105d0: 1
AR00MBZ1187B9B5DB1: 1 001e4a87-fc46-4d66-ba37-e9714e0fe58a: 1
AR01IP11187B9AF5D2: 1 0039c7ae-e1a7-4a7d-9b49-0cbc716821a6: 1
AR01VU31187B997DA0: 1 0039cd9c-ea95-4553-86c4-4f729984cc2f: 1
(Other) :3882 (Other) :3483
track_id artist_name
TRAAAFD128F92F423A: 1 Bill & Gloria Gaither : 3
TRAAARJ128F9320760: 1 2-Gether feat. Sarinah : 2
TRAABCL128F4286650: 1 Abbott & Chambers : 2
TRAABJL12903CDCF1A: 1 Art Ensemble Of Chicago : 2
TRAABRB128F9306DD5: 1 Billy Price_ Fred Chapellier: 2
TRAABVM128F92CA9DC: 1 Eddie Palmieri : 2
(Other) :3882 (Other) :3875
> summary(artist_term)
artist_id term
Length:97493 Length:97493
Class :character Class :character
Mode :character Mode :character
It's nice that both Pandas and DataFrames.jl give us unique elements and modes. R can see some of the top cases, but depending on the nature of the data, it may not be meaningful. DataFrames.jl is also good for giving you the ratio of NA. Numerical data is more interesting or thankful for the display results, so let me give you an example.
>>> pandas.DataFrame({"a":range(1,11)}).describe()
a
count 10.00000
mean 5.50000
std 3.02765
min 1.00000
25% 3.25000
50% 5.50000
75% 7.75000
max 10.00000
julia> DataFrames.describe(DataFrames.DataFrame(a=1:10))
a
Min 1.0
1st Qu. 3.25
Median 5.5
Mean 5.5
3rd Qu. 7.75
Max 10.0
NAs 0
NA% 0.0%
> summary(seq(10))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 3.25 5.50 5.50 7.75 10.00
So how to take out the line for the time being. We will take out the 4th column "artist_name" counting from the beginning of 1 of unique_artists.
Pandas returns the same result for the following two:
>>> unique_artists.artist_name
0 Carroll Thompson
1 The Meatmen
2 Bruce BecVar
3 Memphis Minnie
4 Call To Preserve
5 Grand Funk
6 3OH!3
7 Ross
8 Annie Philippe
9 Tito Puente
10 Hi-Fi Companions
11 Delta 9
12 Bersuit Vergarabat
13 BT
14 John Williams
15 Britney Spears
16 Sylver
17 The Bug Featuring Ricky Ranking
18 Alexisonfire
19 Yellowcard
20 The Disposable Heroes Of Hiphoprisy
21 Dead Kennedys
22 Wilks featuring pee wee ellis
23 Mike Bloomfield
24 Roy Drusky
25 Alex Ubago
26 D'Molls
27 Kassav'
28 Ol' Dirty Bastard
29 Fudge Tunnel
...
3858 Smokin' Joe Kubek & Bnois King
3859 William Orbit
3860 Warner mack
3861 Biddu
3862 Ideal
3863 Reginaldo Bessa
3864 Rainbow
3865 Babylon Disco
3866 Mawatari Matsuko
3867 Widespread Bloodshed
3868 Top Of The Poppers
3869 Elliot Goldenthal
3870 Alexander Zonjic
3871 Jim Reeves
3872 Mister Ries
3873 Martriden
3874 Günter Noris
3875 Rise Against
3876 Henry Mancini
3877 Jason Michael Carroll
3878 Los Ninos De Sara
3879 Juelz Santana / Bezel
3880 Vaggelis Perpiniadis
3881 Davol
3882 John Stevens
3883 Taboo
3884 Mt. Wilson Repeater
3885 Chapta_ Stage McCloud_ SupaStarr & Tony Manshino
3886 Rooney
3887 Delroy Wilson
Name: artist_name, dtype: object
>>> unique_artists["artist_name"]
(Abbreviation)
DataFrames.jl seems to write:
unique_artists[:artist_name]
For R, it looks like this: (* Please note that if you hit this easily with the REPL of R in the default state, all the data in the column will be displayed.)
unique_artists$artist_name
unique_artists[, "artist_name"]
Well, is there anyone who is interested in having "," in \ [] only for R? Is _____ in. I will say that.
I wrote "take out columns" easily, but all of the above is how to write when "take out the data structures that make up the columns of the data frame". The "data structures that make up the columns of a data frame" are Series for Pandas, vector for R, and DataArray for DataFrames.jl. Note that these are different from a single row of data frames. If you do it properly, if you think that the return value of the function is data.frame, it will become a vector and it will be absurd, so be careful.
The following is the case for "retrieving a row of data frames".
Pandas:
unique_artists[["artist_name"]]
DataFrames.jl:
unique_artists[[:artist_name]]
R:
unique_artists["artist_name"]
unique_artists[, "artist_name", drop=F]
Rather, it is basically the same as the operation to retrieve multiple columns, but in the case of R, it is a trap that drop is TRUE by default.
When extracting multiple columns, it is sufficient to make multiple elements of the array or list passed as the column name to be acquired as it is, and it will be as follows.
Pandas:
unique_artists[["artist_id", "artist_name"]]
DataFrames.jl:
unique_artists[[:artist_id, :artist_name]]
R:
unique_artists[c("artist_id", "artist_name")]
unique_artists[, c("artist_id", "artist_name"), drop=F]
To exclude specific columns:
Pandas:
>>> unique_artists.drop(["track_id", "artist_mbid"], axis=1)
artist_id artist_name
0 AR009211187B989185 Carroll Thompson
1 AR00A6H1187FB5402A The Meatmen
2 AR00LNI1187FB444A5 Bruce BecVar
3 AR00MBZ1187B9B5DB1 Memphis Minnie
4 AR01IP11187B9AF5D2 Call To Preserve
5 AR01VU31187B997DA0 Grand Funk
6 AR01W2D1187FB5912F 3OH!3
7 AR022JO1187B99587B Ross
8 AR02DB61187B9A0B5E Annie Philippe
9 AR02IU11187FB513F2 Tito Puente
10 AR02KZG1187FB3B3B4 Hi-Fi Companions
11 AR02PUA1187FB52574 Delta 9
12 AR02YGA1187B9B8AC4 Bersuit Vergarabat
13 AR035N21187FB3938E BT
14 AR039B11187B9B30D0 John Williams
15 AR03BDP1187FB5B324 Britney Spears
16 AR03GWG1187B9B6C85 Sylver
17 AR040M31187B98CA41 The Bug Featuring Ricky Ranking
18 AR040QX1187FB4CFE1 Alexisonfire
19 AR048JZ1187B9AEB85 Yellowcard
20 AR04KTB1187B99B6B7 The Disposable Heroes Of Hiphoprisy
21 AR050VJ1187B9B13A7 Dead Kennedys
22 AR051KA1187B98B2FF Wilks featuring pee wee ellis
23 AR0569B1187B9A4036 Mike Bloomfield
24 AR059HI1187B9A14D7 Roy Drusky
25 AR05IU31187B9B9A1A Alex Ubago
26 AR05KQA1187B9963B3 D'Molls
27 AR05NQH1187B98E875 Kassav'
28 AR05OJD1187B9B99A6 Ol' Dirty Bastard
29 AR05VW21187FB407B4 Fudge Tunnel
... ... ...
3858 ARZRVFK11F4C83DF46 Smokin' Joe Kubek & Bnois King
3859 ARZS5VW1187FB567E8 William Orbit
3860 ARZSFIJ1187B98B8BC Warner mack
3861 ARZSTXE1187B99DD77 Biddu
3862 ARZTGQE1187FB461D9 Ideal
3863 ARZTSMH122988F522A Reginaldo Bessa
3864 ARZTSYB1187FB54987 Rainbow
3865 ARZU9NI1187B9AEE08 Babylon Disco
3866 ARZUVRN1269FB3759D Mawatari Matsuko
3867 ARZUZYU12086C17110 Widespread Bloodshed
3868 ARZVBBM1241B9CB622 Top Of The Poppers
3869 ARZVEF91187B9AC770 Elliot Goldenthal
3870 ARZVJ641187FB36FA4 Alexander Zonjic
3871 ARZVTZN1187FB579D4 Jim Reeves
3872 ARZVZRN1241B9C4B14 Mister Ries
3873 ARZWAHK119B8668273 Martriden
3874 ARZWBH21187B99ACE1 Günter Noris
3875 ARZWK2R1187B98F09F Rise Against
3876 ARZWPWP1241B9CA793 Henry Mancini
3877 ARZXLZD11E2835CEA7 Jason Michael Carroll
3878 ARZXMYV1187FB5B99B Los Ninos De Sara
3879 ARZXTEY1187B9997A7 Juelz Santana / Bezel
3880 ARZXTUO12509411DAE Vaggelis Perpiniadis
3881 ARZXVY01187B9972EC Davol
3882 ARZXYRK1187B99E432 John Stevens
3883 ARZY28S1187FB40BAB Taboo
3884 ARZYFWI11F4C84225B Mt. Wilson Repeater
3885 ARZYP6O1187B9892E7 Chapta_ Stage McCloud_ SupaStarr & Tony Manshino
3886 ARZYPLF1187FB45B9B Rooney
3887 ARZZRK91187B9A5CA5 Delroy Wilson
[3888 rows x 2 columns]
Pandas drop is non-destructive if you don't specify inplace = True, and the column isn't dropped from the DataFrame after calling the drop method. It seems that DataFrames.jl and R do not have a function that does this non-destructive removal of a specific column. (It seems that you can define the following function yourself)
In DataFrames.jl, it seems good to create a bool array in which only the columns you want to keep are true, depending on the inclusion notation. cf: http://stackoverflow.com/questions/24665439/julia-dataframe-remove-column-by-name
julia> unique_artists[~[(x in [:artist_mbid, :track_id]) for x in names(unique_artists)]]
3888x2 DataFrames.DataFrame
| Row | artist_id |
|------|----------------------|
| 1 | "AR009211187B989185" |
| 2 | "AR00A6H1187FB5402A" |
| 3 | "AR00LNI1187FB444A5" |
| 4 | "AR00MBZ1187B9B5DB1" |
| 5 | "AR01IP11187B9AF5D2" |
| 6 | "AR01VU31187B997DA0" |
| 7 | "AR01W2D1187FB5912F" |
| 8 | "AR022JO1187B99587B" |
⋮
| 3880 | "ARZXTEY1187B9997A7" |
| 3881 | "ARZXTUO12509411DAE" |
| 3882 | "ARZXVY01187B9972EC" |
| 3883 | "ARZXYRK1187B99E432" |
| 3884 | "ARZY28S1187FB40BAB" |
| 3885 | "ARZYFWI11F4C84225B" |
| 3886 | "ARZYP6O1187B9892E7" |
| 3887 | "ARZYPLF1187FB45B9B" |
| 3888 | "ARZZRK91187B9A5CA5" |
| Row | artist_name |
|------|----------------------------------------------------|
| 1 | "Carroll Thompson" |
| 2 | "The Meatmen" |
| 3 | "Bruce BecVar" |
| 4 | "Memphis Minnie" |
| 5 | "Call To Preserve" |
| 6 | "Grand Funk" |
| 7 | "3OH!3" |
| 8 | "Ross" |
⋮
| 3880 | "Juelz Santana / Bezel" |
| 3881 | "Vaggelis Perpiniadis" |
| 3882 | "Davol" |
| 3883 | "John Stevens" |
| 3884 | "Taboo" |
| 3885 | "Mt. Wilson Repeater" |
| 3886 | "Chapta_ Stage McCloud_ SupaStarr & Tony Manshino" |
| 3887 | "Rooney" |
| 3888 | "Delroy Wilson" |
julia> # destructive
julia> # DataFrames.delete!(unique_artists, [:artist_mbid, :track_id])
The idea is the same for R.
> unique_artists[!names(unique_artists) %in% c("artist_mbid", "track_id")]
(Abbreviation)
> # destructive
> # unique_artists$artist_mbid <- NULL
> # unique_artists$track_id <- NULL
I only wrote about how to access columns, such as slices, but this time around.
environment:
Recommended Posts