--An open source tool for easy handling and cleansing of csv files for data analysis --Can be installed from python ――The official page documents are available, so please check them if you want to use them in earnest.
If you can use pip, you can install it with the following one line
$ pip install csvkit
Please see the Official Page for details.
$ ls /usr/local/bin/*csv*
/usr/local/bin/csvclean /usr/local/bin/csvgrep /usr/local/bin/csvlook /usr/local/bin/csvsql /usr/local/bin/in2csv
/usr/local/bin/csvcut /usr/local/bin/csvjoin /usr/local/bin/csvpy /usr/local/bin/csvstack /usr/local/bin/sql2csv
/usr/local/bin/csvformat /usr/local/bin/csvjson /usr/local/bin/csvsort /usr/local/bin/csvstat
There are quite a few. Is this all?
--Bring the csv file you want to touch.
$ head -n 10 iris.csv
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa
csvlook
――It will be shaped and displayed for easy viewing.
--If the separator is other than a comma, `-d" | "`
etc.
$ csvlook iris.csv | head -n 10
|------+--------------+-------------+--------------+-------------+-------------|
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|------+--------------+-------------+--------------+-------------+-------------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
csvcut -n Like names () in R
$ csvcut -n iris.csv
1:
2: Sepal.Length
3: Sepal.Width
4: Petal.Length
5: Petal.Width
6: Species
csvcut -c Only one column (by column number)
$ csvcut -c 2 iris.csv | head
Sepal.Length
5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4
Only one column (by column name)
$ csvcut -c Sepal.Length iris.csv | head
Sepal.Length
5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4
Select two or more columns & csvlook
$ csvcut -c 2,4 iris.csv | csvlook | head
|---------------+---------------|
| Sepal.Length | Petal.Length |
|---------------+---------------|
| 5.1 | 1.4 |
| 4.9 | 1.4 |
| 4.7 | 1.3 |
| 4.6 | 1.5 |
| 5 | 1.4 |
| 5.4 | 1.7 |
| 4.6 | 1.4 |
Except for some columns & csvlook
$ csvcut -C Species iris.csv | csvlook | head
|------+--------------+-------------+--------------+--------------|
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|------+--------------+-------------+--------------+--------------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 |
| 2 | 4.9 | 3 | 1.4 | 0.2 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 |
| 5 | 5 | 3.6 | 1.4 | 0.2 |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 |
csvgrep --View data for a column under specified conditions --Grep-like --You can also define what goes into the interval
$ csvgrep -c Species -m setosa iris.csv | csvlook | head
|-----+--------------+-------------+--------------+-------------+----------|
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|-----+--------------+-------------+--------------+-------------+----------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
Even regular expressions
$ csvgrep -c Species -r ^s.*$ iris.csv | csvlook | head
|-----+--------------+-------------+--------------+-------------+----------|
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|-----+--------------+-------------+--------------+-------------+----------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
Even in the section
$ csvgrep -c Sepal.Length -r "[4.5-5]" iris.csv | csvlook | head
|------+--------------+-------------+--------------+-------------+-------------|
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|------+--------------+-------------+--------------+-------------+-------------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
csvformat --Format can be transformed such as tab delimiter
$ csvformat -T iris.csv
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
<snip>
csvjoin --You can join with another file --Can be specified as Outer, Left, Right
csvstat ――It's convenient.
csvstat iris.csv
1. _unnamed
<type 'int'>
Nulls: False
Min: 1
Max: 150
Sum: 11325
Mean: 75.5
Median: 75.5
Standard Deviation: 43.3003079281
Unique values: 150
2. Sepal.Length
<snip>
csvclean ――It will check if the number of columns is correct. If there is no problem, ** No errors. ** is output.
$ csvclean iris.csv
No errors.
csvjson --Converts to JSON --It is easier to see if you specify the indent with the -i option.
$ csvjson -i 2 iris.csv | head -n 20
[
{
"": "1",
"Sepal.Length": "5.1",
"Sepal.Width": "3.5",
"Petal.Length": "1.4",
"Petal.Width": "0.2",
"Species": "setosa"
},
{
"": "2",
"Sepal.Length": "4.9",
"Sepal.Width": "3",
"Petal.Length": "1.4",
"Petal.Width": "0.2",
"Species": "setosa"
},
{
"": "3",
"Sepal.Length": "4.7",
<snip>
csvsql -i --You can convert to SQL system such as MySQL, create a sample, change the table name and insert it.
List of things that can be converted
$ csvsql -h
<snip>
-i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
<snip>
It looks like this in mysql
$ csvsql -i mysql iris.csv
CREATE TABLE iris (
_unnamed INTEGER NOT NULL,
`Sepal.Length` FLOAT NOT NULL,
`Sepal.Width` FLOAT NOT NULL,
`Petal.Length` FLOAT NOT NULL,
`Petal.Width` FLOAT NOT NULL,
`Species` VARCHAR(10) NOT NULL
);
in2csv --Conversion from Excel file (.xlsx) to csv. The data received from the business division has many xlsx patterns and is uploaded to the server as it is, so it's actually quite convenient. --In R, I wonder if I don't have to ** read.xlsx **.
in2csv -f json --You can convert a json file to csv by specifying json with the -f option. --It is very convenient to be able to instantly convert the data (JSON format) acquired by REST API into a form that is easy to read with R.
sql2csv --Conversion from the one saved in SQL DB to csv ――This also seems to be useful for people who do not understand SQL grammar to touch it lightly.
--Since the delimiter can be specified, it can also be applied to tsv files, etc. ――Speaking of formatting csv files, there is also NYSOL, so if you have large data, you need to compare the speed with it, but the simple one is csvkit.
-Official page -Try using csvkit -Python csvkit that can be used for data preprocessing and processing -Data science starting from the command line
Recommended Posts