Comparison of Python (+ Pandas), R, Julia (+ DataFrames) (summary of table contents, access by column)

2017/1/3 postscript: DataFrames.jl is planning to release v.0.9 in February 2017 with a major fix that also partially breaks API compatibility. Please note that this article is based on v.0.6.10. please. If I have time and feel like it (hey), I may rewrite it accordingly. However, I haven't had much time to touch Julia these days ...


This is a continuation of the article Last time. Last time I ended up without looking at the contents of the data, but today I will write about simple content confirmation and how to retrieve columns.

In the code execution example below, it is assumed that the REPL contains the data read in the previous content.

Table summary

All the libraries and languages introduced this time have a function called "Return a summary of a simple data structure", so let's use it.

Pandas:

>>> unique_artists.describe()
                 artist_id                           artist_mbid  \
count                 3888                                  3489   
unique                3888                                  3488   
top     AROBQ0B1187FB404F6  2c0be2ba-fa63-430f-afca-7499852158f6   
freq                     1                                     2   

                  track_id            artist_name  
count                 3888                   3888  
unique                3888                   3875  
top     TRAPGXH128F426D205  Bill & Gloria Gaither  
freq                     1                      3  
>>> artist_term.describe()
                 artist_id   term
count                97493  97493
unique                3885   3502
top     ARLYKKT11F4C83B350   rock
freq                    91   2346

DataFrames.jl:

julia> DataFrames.describe(unique_artists)
artist_id
Length  3888
Type    UTF8String
NAs     0
NA%     0.0%
Unique  3888

artist_mbid
Length  3888
Type    UTF8String
NAs     399
NA%     10.26%
Unique  3489

track_id
Length  3888
Type    UTF8String
NAs     0
NA%     0.0%
Unique  3888

artist_name
Length  3888
Type    UTF8String
NAs     0
NA%     0.0%
Unique  3875

Julia> DataFrames.describe(artist_term)
artist_id
Length  97493
Type    Nullable{UTF8String}
NAs     0
NA%     0.0%
Unique  3885

term
Length  97493
Type    Nullable{UTF8String}
NAs     0
NA%     0.0%
Unique  3502

R:

> summary(unique_artists)
              artist_id                                  artist_mbid  
 AR009211187B989185:   1                                       : 399  
 AR00A6H1187FB5402A:   1   2c0be2ba-fa63-430f-afca-7499852158f6:   2  
 AR00LNI1187FB444A5:   1   001a1d90-56dc-4992-bd7d-970ecd7105d0:   1  
 AR00MBZ1187B9B5DB1:   1   001e4a87-fc46-4d66-ba37-e9714e0fe58a:   1  
 AR01IP11187B9AF5D2:   1   0039c7ae-e1a7-4a7d-9b49-0cbc716821a6:   1  
 AR01VU31187B997DA0:   1   0039cd9c-ea95-4553-86c4-4f729984cc2f:   1  
 (Other)           :3882   (Other)                             :3483  
               track_id                          artist_name  
 TRAAAFD128F92F423A:   1   Bill & Gloria Gaither       :   3  
 TRAAARJ128F9320760:   1   2-Gether feat. Sarinah      :   2  
 TRAABCL128F4286650:   1   Abbott & Chambers           :   2  
 TRAABJL12903CDCF1A:   1   Art Ensemble Of Chicago     :   2  
 TRAABRB128F9306DD5:   1   Billy Price_ Fred Chapellier:   2  
 TRAABVM128F92CA9DC:   1   Eddie Palmieri              :   2  
 (Other)           :3882   (Other)                     :3875  
> summary(artist_term)
  artist_id             term          
 Length:97493       Length:97493      
 Class :character   Class :character  
 Mode  :character   Mode  :character  

It's nice that both Pandas and DataFrames.jl give us unique elements and modes. R can see some of the top cases, but depending on the nature of the data, it may not be meaningful. DataFrames.jl is also good for giving you the ratio of NA. Numerical data is more interesting or thankful for the display results, so let me give you an example.

>>> pandas.DataFrame({"a":range(1,11)}).describe()
              a
count  10.00000
mean    5.50000
std     3.02765
min     1.00000
25%     3.25000
50%     5.50000
75%     7.75000
max    10.00000
julia> DataFrames.describe(DataFrames.DataFrame(a=1:10))
a
Min      1.0
1st Qu.  3.25
Median   5.5
Mean     5.5
3rd Qu.  7.75
Max      10.0
NAs      0
NA%      0.0%


> summary(seq(10))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    3.25    5.50    5.50    7.75   10.00 

Column-by-column access

So how to take out the line for the time being. We will take out the 4th column "artist_name" counting from the beginning of 1 of unique_artists.

Pandas returns the same result for the following two:

>>> unique_artists.artist_name
0                                       Carroll Thompson
1                                            The Meatmen
2                                           Bruce BecVar
3                                         Memphis Minnie
4                                       Call To Preserve
5                                             Grand Funk
6                                                  3OH!3
7                                                   Ross
8                                         Annie Philippe
9                                            Tito Puente
10                                      Hi-Fi Companions
11                                               Delta 9
12                                    Bersuit Vergarabat
13                                                    BT
14                                         John Williams
15                                        Britney Spears
16                                                Sylver
17                       The Bug Featuring Ricky Ranking
18                                          Alexisonfire
19                                            Yellowcard
20                   The Disposable Heroes Of Hiphoprisy
21                                         Dead Kennedys
22                         Wilks featuring pee wee ellis
23                                       Mike Bloomfield
24                                            Roy Drusky
25                                            Alex Ubago
26                                               D'Molls
27                                               Kassav'
28                                     Ol' Dirty Bastard
29                                          Fudge Tunnel
                              ...                       
3858                      Smokin' Joe Kubek & Bnois King
3859                                       William Orbit
3860                                         Warner mack
3861                                               Biddu
3862                                               Ideal
3863                                     Reginaldo Bessa
3864                                             Rainbow
3865                                       Babylon Disco
3866                                    Mawatari Matsuko
3867                                Widespread Bloodshed
3868                                  Top Of The Poppers
3869                                   Elliot Goldenthal
3870                                    Alexander Zonjic
3871                                          Jim Reeves
3872                                         Mister Ries
3873                                           Martriden
3874                                        Günter Noris
3875                                        Rise Against
3876                                       Henry Mancini
3877                               Jason Michael Carroll
3878                                   Los Ninos De Sara
3879                               Juelz Santana / Bezel
3880                                Vaggelis Perpiniadis
3881                                               Davol
3882                                        John Stevens
3883                                               Taboo
3884                                 Mt. Wilson Repeater
3885    Chapta_ Stage McCloud_ SupaStarr & Tony Manshino
3886                                              Rooney
3887                                       Delroy Wilson
Name: artist_name, dtype: object
>>> unique_artists["artist_name"]
(Abbreviation)

DataFrames.jl seems to write:

unique_artists[:artist_name]

For R, it looks like this: (* Please note that if you hit this easily with the REPL of R in the default state, all the data in the column will be displayed.)

unique_artists$artist_name
unique_artists[, "artist_name"]

Well, is there anyone who is interested in having "," in \ [] only for R? Is _____ in. I will say that.

I wrote "take out columns" easily, but all of the above is how to write when "take out the data structures that make up the columns of the data frame". The "data structures that make up the columns of a data frame" are Series for Pandas, vector for R, and DataArray for DataFrames.jl. Note that these are different from a single row of data frames. If you do it properly, if you think that the return value of the function is data.frame, it will become a vector and it will be absurd, so be careful.

The following is the case for "retrieving a row of data frames".

Pandas:

unique_artists[["artist_name"]]

DataFrames.jl:

unique_artists[[:artist_name]]

R:

unique_artists["artist_name"]
unique_artists[, "artist_name", drop=F]

Rather, it is basically the same as the operation to retrieve multiple columns, but in the case of R, it is a trap that drop is TRUE by default.

When extracting multiple columns, it is sufficient to make multiple elements of the array or list passed as the column name to be acquired as it is, and it will be as follows.

Pandas:

unique_artists[["artist_id", "artist_name"]]

DataFrames.jl:

unique_artists[[:artist_id, :artist_name]]

R:

unique_artists[c("artist_id", "artist_name")]
unique_artists[, c("artist_id", "artist_name"), drop=F]

To exclude specific columns:

Pandas:

>>> unique_artists.drop(["track_id", "artist_mbid"], axis=1)
               artist_id                                       artist_name
0     AR009211187B989185                                  Carroll Thompson
1     AR00A6H1187FB5402A                                       The Meatmen
2     AR00LNI1187FB444A5                                      Bruce BecVar
3     AR00MBZ1187B9B5DB1                                    Memphis Minnie
4     AR01IP11187B9AF5D2                                  Call To Preserve
5     AR01VU31187B997DA0                                        Grand Funk
6     AR01W2D1187FB5912F                                             3OH!3
7     AR022JO1187B99587B                                              Ross
8     AR02DB61187B9A0B5E                                    Annie Philippe
9     AR02IU11187FB513F2                                       Tito Puente
10    AR02KZG1187FB3B3B4                                  Hi-Fi Companions
11    AR02PUA1187FB52574                                           Delta 9
12    AR02YGA1187B9B8AC4                                Bersuit Vergarabat
13    AR035N21187FB3938E                                                BT
14    AR039B11187B9B30D0                                     John Williams
15    AR03BDP1187FB5B324                                    Britney Spears
16    AR03GWG1187B9B6C85                                            Sylver
17    AR040M31187B98CA41                   The Bug Featuring Ricky Ranking
18    AR040QX1187FB4CFE1                                      Alexisonfire
19    AR048JZ1187B9AEB85                                        Yellowcard
20    AR04KTB1187B99B6B7               The Disposable Heroes Of Hiphoprisy
21    AR050VJ1187B9B13A7                                     Dead Kennedys
22    AR051KA1187B98B2FF                     Wilks featuring pee wee ellis
23    AR0569B1187B9A4036                                   Mike Bloomfield
24    AR059HI1187B9A14D7                                        Roy Drusky
25    AR05IU31187B9B9A1A                                        Alex Ubago
26    AR05KQA1187B9963B3                                           D'Molls
27    AR05NQH1187B98E875                                           Kassav'
28    AR05OJD1187B9B99A6                                 Ol' Dirty Bastard
29    AR05VW21187FB407B4                                      Fudge Tunnel
...                  ...                                               ...
3858  ARZRVFK11F4C83DF46                    Smokin' Joe Kubek & Bnois King
3859  ARZS5VW1187FB567E8                                     William Orbit
3860  ARZSFIJ1187B98B8BC                                       Warner mack
3861  ARZSTXE1187B99DD77                                             Biddu
3862  ARZTGQE1187FB461D9                                             Ideal
3863  ARZTSMH122988F522A                                   Reginaldo Bessa
3864  ARZTSYB1187FB54987                                           Rainbow
3865  ARZU9NI1187B9AEE08                                     Babylon Disco
3866  ARZUVRN1269FB3759D                                  Mawatari Matsuko
3867  ARZUZYU12086C17110                              Widespread Bloodshed
3868  ARZVBBM1241B9CB622                                Top Of The Poppers
3869  ARZVEF91187B9AC770                                 Elliot Goldenthal
3870  ARZVJ641187FB36FA4                                  Alexander Zonjic
3871  ARZVTZN1187FB579D4                                        Jim Reeves
3872  ARZVZRN1241B9C4B14                                       Mister Ries
3873  ARZWAHK119B8668273                                         Martriden
3874  ARZWBH21187B99ACE1                                      Günter Noris
3875  ARZWK2R1187B98F09F                                      Rise Against
3876  ARZWPWP1241B9CA793                                     Henry Mancini
3877  ARZXLZD11E2835CEA7                             Jason Michael Carroll
3878  ARZXMYV1187FB5B99B                                 Los Ninos De Sara
3879  ARZXTEY1187B9997A7                             Juelz Santana / Bezel
3880  ARZXTUO12509411DAE                              Vaggelis Perpiniadis
3881  ARZXVY01187B9972EC                                             Davol
3882  ARZXYRK1187B99E432                                      John Stevens
3883  ARZY28S1187FB40BAB                                             Taboo
3884  ARZYFWI11F4C84225B                               Mt. Wilson Repeater
3885  ARZYP6O1187B9892E7  Chapta_ Stage McCloud_ SupaStarr & Tony Manshino
3886  ARZYPLF1187FB45B9B                                            Rooney
3887  ARZZRK91187B9A5CA5                                     Delroy Wilson

[3888 rows x 2 columns]

Pandas drop is non-destructive if you don't specify inplace = True, and the column isn't dropped from the DataFrame after calling the drop method. It seems that DataFrames.jl and R do not have a function that does this non-destructive removal of a specific column. (It seems that you can define the following function yourself)

In DataFrames.jl, it seems good to create a bool array in which only the columns you want to keep are true, depending on the inclusion notation. cf: http://stackoverflow.com/questions/24665439/julia-dataframe-remove-column-by-name

julia> unique_artists[~[(x in [:artist_mbid, :track_id]) for x in names(unique_artists)]]
3888x2 DataFrames.DataFrame
| Row  | artist_id            |
|------|----------------------|
| 1    | "AR009211187B989185" |
| 2    | "AR00A6H1187FB5402A" |
| 3    | "AR00LNI1187FB444A5" |
| 4    | "AR00MBZ1187B9B5DB1" |
| 5    | "AR01IP11187B9AF5D2" |
| 6    | "AR01VU31187B997DA0" |
| 7    | "AR01W2D1187FB5912F" |
| 8    | "AR022JO1187B99587B" |
⋮
| 3880 | "ARZXTEY1187B9997A7" |
| 3881 | "ARZXTUO12509411DAE" |
| 3882 | "ARZXVY01187B9972EC" |
| 3883 | "ARZXYRK1187B99E432" |
| 3884 | "ARZY28S1187FB40BAB" |
| 3885 | "ARZYFWI11F4C84225B" |
| 3886 | "ARZYP6O1187B9892E7" |
| 3887 | "ARZYPLF1187FB45B9B" |
| 3888 | "ARZZRK91187B9A5CA5" |

| Row  | artist_name                                        |
|------|----------------------------------------------------|
| 1    | "Carroll Thompson"                                 |
| 2    | "The Meatmen"                                      |
| 3    | "Bruce BecVar"                                     |
| 4    | "Memphis Minnie"                                   |
| 5    | "Call To Preserve"                                 |
| 6    | "Grand Funk"                                       |
| 7    | "3OH!3"                                            |
| 8    | "Ross"                                             |
⋮
| 3880 | "Juelz Santana / Bezel"                            |
| 3881 | "Vaggelis Perpiniadis"                             |
| 3882 | "Davol"                                            |
| 3883 | "John Stevens"                                     |
| 3884 | "Taboo"                                            |
| 3885 | "Mt. Wilson Repeater"                              |
| 3886 | "Chapta_ Stage McCloud_ SupaStarr & Tony Manshino" |
| 3887 | "Rooney"                                           |
| 3888 | "Delroy Wilson"                                    |


julia> # destructive

julia> # DataFrames.delete!(unique_artists, [:artist_mbid, :track_id])

The idea is the same for R.

> unique_artists[!names(unique_artists) %in% c("artist_mbid", "track_id")]
(Abbreviation)
> # destructive
> # unique_artists$artist_mbid <- NULL
> # unique_artists$track_id <- NULL

I only wrote about how to access columns, such as slices, but this time around.

environment:

Recommended Posts

Comparison of Python (+ Pandas), R, Julia (+ DataFrames) (summary of table contents, access by column)
[Python] Summary of table creation method using DataFrame (pandas)
Comparison of data frame handling in Python (pandas), R, Pig
[Python] [Table of Contents Links] Python Programming
Python Math Series ⓪ Table of Contents
Introductory table of contents for python3
Summary of differences between Python and PHP (comparison table of main items)
[Python] Summary of how to use pandas
Reading from text files and SQLite in Python (+ Pandas), R, Julia (+ DataFrames)
[Python] Sort the table by sort_values (pandas DataFrame)
Automating simple tasks with Python Table of contents
Comparison of R and Python writing (Euclidean algorithm)
[Language processing 100 knocks 2020] Summary of answer examples by Python
Image processing by matrix Basics & Table of Contents-Reinventor of Python image processing-
Summary of pre-processing practices for Python beginners (Pandas dataframe)
Creating BINGO "Web Tools" with Python (Table of Contents)
Comparison table of frequently used processes of Python and Clojure
Summary of Python arguments