Summary of how to use csvkit

Introduction

--An open source tool for easy handling and cleansing of csv files for data analysis --Can be installed from python ――The official page documents are available, so please check them if you want to use them in earnest.

Installation

If you can use pip, you can install it with the following one line

$ pip install csvkit

Please see the Official Page for details.

Command list

$ ls /usr/local/bin/*csv*
/usr/local/bin/csvclean   /usr/local/bin/csvgrep  /usr/local/bin/csvlook  /usr/local/bin/csvsql    /usr/local/bin/in2csv
/usr/local/bin/csvcut     /usr/local/bin/csvjoin  /usr/local/bin/csvpy    /usr/local/bin/csvstack  /usr/local/bin/sql2csv
/usr/local/bin/csvformat  /usr/local/bin/csvjson  /usr/local/bin/csvsort  /usr/local/bin/csvstat

There are quite a few. Is this all?

Data preparation

--Bring the csv file you want to touch.

$ head -n 10 iris.csv
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa
7,4.6,3.4,1.4,0.3,setosa
8,5,3.4,1.5,0.2,setosa
9,4.4,2.9,1.4,0.2,setosa

Data confirmation

csvlook ――It will be shaped and displayed for easy viewing. --If the separator is other than a comma, `-d" | "` etc.

$ csvlook iris.csv | head -n 10
|------+--------------+-------------+--------------+-------------+-------------|
|      | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species     |
|------+--------------+-------------+--------------+-------------+-------------|
|  1   | 5.1          | 3.5         | 1.4          | 0.2         | setosa      |
|  2   | 4.9          | 3           | 1.4          | 0.2         | setosa      |
|  3   | 4.7          | 3.2         | 1.3          | 0.2         | setosa      |
|  4   | 4.6          | 3.1         | 1.5          | 0.2         | setosa      |
|  5   | 5            | 3.6         | 1.4          | 0.2         | setosa      |
|  6   | 5.4          | 3.9         | 1.7          | 0.4         | setosa      |
|  7   | 4.6          | 3.4         | 1.4          | 0.3         | setosa      |

Check only column names

csvcut -n Like names () in R

$ csvcut -n iris.csv
  1:
  2: Sepal.Length
  3: Sepal.Width
  4: Petal.Length
  5: Petal.Width
  6: Species

Column selection

csvcut -c Only one column (by column number)

$ csvcut -c 2 iris.csv | head
Sepal.Length
5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4

Only one column (by column name)

$ csvcut -c Sepal.Length iris.csv | head
Sepal.Length
5.1
4.9
4.7
4.6
5
5.4
4.6
5
4.4

Select two or more columns & csvlook

$ csvcut -c 2,4 iris.csv | csvlook | head
|---------------+---------------|
|  Sepal.Length | Petal.Length  |
|---------------+---------------|
|  5.1          | 1.4           |
|  4.9          | 1.4           |
|  4.7          | 1.3           |
|  4.6          | 1.5           |
|  5            | 1.4           |
|  5.4          | 1.7           |
|  4.6          | 1.4           |

Except for some columns & csvlook

$ csvcut -C Species iris.csv | csvlook | head
|------+--------------+-------------+--------------+--------------|
|      | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width  |
|------+--------------+-------------+--------------+--------------|
|  1   | 5.1          | 3.5         | 1.4          | 0.2          |
|  2   | 4.9          | 3           | 1.4          | 0.2          |
|  3   | 4.7          | 3.2         | 1.3          | 0.2          |
|  4   | 4.6          | 3.1         | 1.5          | 0.2          |
|  5   | 5            | 3.6         | 1.4          | 0.2          |
|  6   | 5.4          | 3.9         | 1.7          | 0.4          |
|  7   | 4.6          | 3.4         | 1.4          | 0.3          |

Row extraction / filtering

csvgrep --View data for a column under specified conditions --Grep-like --You can also define what goes into the interval

$ csvgrep -c Species -m setosa iris.csv | csvlook | head
|-----+--------------+-------------+--------------+-------------+----------|
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species  |
|-----+--------------+-------------+--------------+-------------+----------|
|  1  | 5.1          | 3.5         | 1.4          | 0.2         | setosa   |
|  2  | 4.9          | 3           | 1.4          | 0.2         | setosa   |
|  3  | 4.7          | 3.2         | 1.3          | 0.2         | setosa   |
|  4  | 4.6          | 3.1         | 1.5          | 0.2         | setosa   |
|  5  | 5            | 3.6         | 1.4          | 0.2         | setosa   |
|  6  | 5.4          | 3.9         | 1.7          | 0.4         | setosa   |
|  7  | 4.6          | 3.4         | 1.4          | 0.3         | setosa   |

Even regular expressions

$ csvgrep -c Species -r ^s.*$ iris.csv | csvlook | head
|-----+--------------+-------------+--------------+-------------+----------|
|     | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species  |
|-----+--------------+-------------+--------------+-------------+----------|
|  1  | 5.1          | 3.5         | 1.4          | 0.2         | setosa   |
|  2  | 4.9          | 3           | 1.4          | 0.2         | setosa   |
|  3  | 4.7          | 3.2         | 1.3          | 0.2         | setosa   |
|  4  | 4.6          | 3.1         | 1.5          | 0.2         | setosa   |
|  5  | 5            | 3.6         | 1.4          | 0.2         | setosa   |
|  6  | 5.4          | 3.9         | 1.7          | 0.4         | setosa   |
|  7  | 4.6          | 3.4         | 1.4          | 0.3         | setosa   |

Even in the section

$ csvgrep -c Sepal.Length  -r "[4.5-5]"  iris.csv | csvlook | head
|------+--------------+-------------+--------------+-------------+-------------|
|      | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species     |
|------+--------------+-------------+--------------+-------------+-------------|
|  1   | 5.1          | 3.5         | 1.4          | 0.2         | setosa      |
|  2   | 4.9          | 3           | 1.4          | 0.2         | setosa      |
|  3   | 4.7          | 3.2         | 1.3          | 0.2         | setosa      |
|  4   | 4.6          | 3.1         | 1.5          | 0.2         | setosa      |
|  5   | 5            | 3.6         | 1.4          | 0.2         | setosa      |
|  6   | 5.4          | 3.9         | 1.7          | 0.4         | setosa      |
|  7   | 4.6          | 3.4         | 1.4          | 0.3         | setosa      |

Format transformation

csvformat --Format can be transformed such as tab delimiter

$ csvformat -T  iris.csv
	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
<snip>

Join with other files

csvjoin --You can join with another file --Can be specified as Outer, Left, Right

Summary of statistical information

csvstat ――It's convenient.

csvstat iris.csv
  1. _unnamed
	<type 'int'>
	Nulls: False
	Min: 1
	Max: 150
	Sum: 11325
	Mean: 75.5
	Median: 75.5
	Standard Deviation: 43.3003079281
	Unique values: 150
  2. Sepal.Length
<snip>

csv file check

csvclean ――It will check if the number of columns is correct. If there is no problem, ** No errors. ** is output.

$ csvclean iris.csv
No errors.

Convert from csv

csvjson --Converts to JSON --It is easier to see if you specify the indent with the -i option.

$ csvjson -i 2 iris.csv | head -n 20
[
  {
    "": "1",
    "Sepal.Length": "5.1",
    "Sepal.Width": "3.5",
    "Petal.Length": "1.4",
    "Petal.Width": "0.2",
    "Species": "setosa"
  },
  {
    "": "2",
    "Sepal.Length": "4.9",
    "Sepal.Width": "3",
    "Petal.Length": "1.4",
    "Petal.Width": "0.2",
    "Species": "setosa"
  },
  {
    "": "3",
    "Sepal.Length": "4.7",
<snip>

csvsql -i --You can convert to SQL system such as MySQL, create a sample, change the table name and insert it.

List of things that can be converted

$ csvsql -h
<snip>
  -i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
<snip>

It looks like this in mysql

$ csvsql -i mysql iris.csv
CREATE TABLE iris (
	_unnamed INTEGER NOT NULL,
	`Sepal.Length` FLOAT NOT NULL,
	`Sepal.Width` FLOAT NOT NULL,
	`Petal.Length` FLOAT NOT NULL,
	`Petal.Width` FLOAT NOT NULL,
	`Species` VARCHAR(10) NOT NULL
);

Convert to csv

in2csv --Conversion from Excel file (.xlsx) to csv. The data received from the business division has many xlsx patterns and is uploaded to the server as it is, so it's actually quite convenient. --In R, I wonder if I don't have to ** read.xlsx **.

in2csv -f json --You can convert a json file to csv by specifying json with the -f option. --It is very convenient to be able to instantly convert the data (JSON format) acquired by REST API into a form that is easy to read with R.

sql2csv --Conversion from the one saved in SQL DB to csv ――This also seems to be useful for people who do not understand SQL grammar to touch it lightly.

in conclusion

--Since the delimiter can be specified, it can also be applied to tsv files, etc. ――Speaking of formatting csv files, there is also NYSOL, so if you have large data, you need to compare the speed with it, but the simple one is csvkit.

reference

-Official page -Try using csvkit -Python csvkit that can be used for data preprocessing and processing -Data science starting from the command line