Try to aggregate doujin music data with pandas

Hello. I'm writing this first draft while smoking shisha, so it's smoky in here. It's been about three months since my last Qiita post. Long time no see.

Synopsis

Do you know Chata? Chata is a doujin singer who works at a relaxed pace, but having sung "Dango Big Family," the voice may be familiar to you even if the name isn't. An acquaintance built a database of Chata's music and I was able to get the data, so I'd like to analyze it with pandas.

What to do this time

This isn't really analysis: I just count the CDs that contain Chata's songs by release year. I'll try two methods, a simple one using Seaborn's *countplot* and a more roundabout one that aggregates by year first and then draws the result with *barplot*.

About the dataset

The data is organized by CD. Each record consists of "CD name", "Circle name 1", "Circle name 2", "Release date", "Major label flag", and "Remarks", and the file is saved in CSV format.

Environment

Ubuntu 16.04
Python 3.5.2 :: Anaconda custom (64-bit)

Let's get started.

#coding:utf-8

import csv
import pandas as pd
import seaborn as sns

if __name__ == "__main__":

    # Read the CSV
    dfCD = pd.read_csv("ChataData_CD.csv")

    # Release years, one entry per CD
    releaseYear = []
    # Loop over the rows of the data frame read from the CSV
    # i: row label (index), row: Series holding that row's values
    for i, row in dfCD.iterrows():
        ymd = str(row['ReleaseYmd'])
        # Slice the first four characters of the release date to get the year
        releaseYear.append(ymd[0:4])

    # Put the list of release years into a data frame
    chataCD = pd.DataFrame({'year': releaseYear})

    # Set a Japanese font for seaborn
    sns.set(font='TakaoPGothic')

***iterrows*** loops over the data frame holding the original data to get each CD's release date. It is a method that yields tuples of **row label** and **row values**, one per row; think of it as walking down the data frame vertically. (Reference: http://sinhrks.hatenablog.com/entry/2015/06/18/221747) The dataset stores the full release date, so each value is sliced down to "yyyy" and appended to the list.
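As a small illustration of the loop described above, here is a sketch on made-up rows. Only the `ReleaseYmd` column name comes from the script; the `CDName` column, the sample values, and the assumption that dates are stored as yyyymmdd integers are mine. It also shows a vectorized alternative that gives the same result without an explicit loop.

```python
import pandas as pd

# Made-up sample rows; only the ReleaseYmd column name comes from the article.
# Assumption: release dates are stored as yyyymmdd integers.
df = pd.DataFrame({
    "CDName": ["Sample A", "Sample B", "Sample C"],
    "ReleaseYmd": [20040815, 20040101, 20161231],
})

# iterrows() yields (index label, row as a Series) pairs, one per row.
years_loop = []
for idx, row in df.iterrows():
    years_loop.append(str(row["ReleaseYmd"])[:4])

# Vectorized alternative: cast the whole column to str and slice every element.
years_vec = df["ReleaseYmd"].astype(str).str[:4]

print(years_loop)
print(years_vec.tolist())
```

The loop is easier to follow when you are learning pandas, while the `.str` accessor tends to be shorter and faster on large tables.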

pd.DataFrame({**label**: ***Series***}) puts the list of release years into a data frame. A Series is a one-dimensional labeled array; a plain list is accepted here as well. At first I tried adding the release-year strings one at a time while looping over the original data, and that failed.

Finally, set a Japanese font in Seaborn and everything is ready. Here I specify TakaoPGothic. For how to specify Japanese fonts, I referred to "[Seaborn] Display Japanese (change font)".

Let's visualize it with Seaborn, starting with the simple method.

    fig = sns.countplot(x='year', data=chataCD, palette='Greens_r').get_figure()
    fig.suptitle("Changes in the number of CD releases of Chata's songs (2000-2016)")
    # Save through the Figure; newer seaborn releases no longer expose sns.plt
    fig.savefig('countByYear_simple.png')

***countplot*** is a function that counts the occurrences of each value on the X or Y axis. It's easy because you only need to pass a data frame containing the release year of each CD. (Reference: [Beautiful graph drawing with python -seaborn makes data analysis and visualization easier Part 2](http://qiita.com/hik0107/items/7233ca334b2a5e1ca924)) ***palette*** specifies the color palette; "How to choose Seaborn color palette" was a helpful reference. The output graph is countByYear_simple.png.
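One thing to watch with the simple method: the bars follow the order in which years appear in the data unless you say otherwise. A minimal sketch, on made-up years, of passing an explicit `order` so the X axis is chronological. The sample values, the `seagreen` color, and the output file name are my own choices, and the `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import pandas as pd
import seaborn as sns

# Made-up release years, deliberately out of chronological order.
chata_cd = pd.DataFrame({"year": ["2005", "2004", "2005", "2016"]})

# Sort the distinct years so the bars appear in chronological order.
year_order = sorted(chata_cd["year"].unique())

ax = sns.countplot(x="year", data=chata_cd, order=year_order, color="seagreen")
ax.set_title("CDs per release year (sample data)")
ax.figure.savefig("countByYear_sample.png")
```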

Next is the more roundabout method. It takes more effort, but it is better practice for learning pandas.

    # Count the number of releases per year
    yrCount = chataCD['year'].value_counts(ascending=True).sort_index()
    year = []
    count = []
    # items() is the current name for iteritems(), which pandas 2.0 removed
    for row in yrCount.items():
        year.append(row[0])
        count.append(row[1])
    dfCount = pd.DataFrame({'year': year, 'count': count})

    # Clear the figure so the bars are not drawn on top of the count plot
    fig.clf()
    barplot = sns.barplot(x='year', y='count', data=dfCount, palette='Greens_r').get_figure()
    barplot.suptitle("Changes in the number of CD releases of Chata's songs (2000-2016)")
    barplot.savefig('countByYear.png')

For the roundabout route, we create a data frame with two columns, the release year and the number of releases. First, use **Series.value_counts** to count how many times each value occurs.

**Series.value_counts** returns a *Series*. Here the *index* is the year and the *values* are the number of CDs. Prepare one list for release years and one for release counts and fill them in. ***iteritems*** (called *items* in current pandas) yields tuples of *index* and *value*, so append *row[0]* to the release years and *row[1]* to the release counts.

After creating a data frame from the lists of release years and release counts, visualize it with Seaborn. ***barplot*** draws an ordinary bar graph. The output graph is countByYear.png.
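For comparison, the two-list loop in the roundabout route can probably be replaced by a method chain. This is a sketch on made-up years, not the method used in this article; the variable names are hypothetical.

```python
import pandas as pd

# Made-up release years standing in for the 'year' column built earlier.
chata_cd = pd.DataFrame({"year": ["2005", "2004", "2005", "2016", "2005"]})

df_count = (
    chata_cd["year"]
    .value_counts()             # Series: index = year, values = number of CDs
    .sort_index()               # order by year instead of by count
    .rename_axis("year")        # name the index so it becomes a 'year' column
    .reset_index(name="count")  # back to a two-column data frame
)
print(df_count)
```

The resulting frame has the same `year` and `count` columns as the one built with the loop, so it can be passed to `barplot` unchanged.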

Digression

When I import Seaborn and run the script, I get the following warning, and I haven't been able to work out the cause.

/.pyenv/versions/anaconda3-4.1.0/lib/python3.5/site-packages/PIL/Image.py:85: RuntimeWarning: The _imaging extension was built for another  version of Pillow or PIL
  warnings.warn(str(v), RuntimeWarning)

I would appreciate it if you could tell me the cause in the comments.
