Things to do

This time, we will do "3. Data preprocessing" from the list below.

[Data analysis basics]
1. Data collection (scraping)
2. Data storage
3. Data preprocessing
4. Data visualization and consideration
5. Conclusions and measures for data

By the way, the last time was "1. Data collection (scraping)".
Previous article
Link for people who want to watch the video

The preprocessing here uses the data collected by that scraping. Without the previous article the flow is hard to follow, so I hope you can at least skim it first.

Creation background

Search for "python data preprocessing" and almost every result uses the same Titanic or scikit-learn datasets. That was boring, so I wanted to try preprocessing that nobody else was doing, on data that nobody else was using.

Environment

python3.7.0
jupyter notebook
BicCamera csv
Download link: https://drive.google.com/open?id=1-PguvcT5kRIrvBifXhdiJDli2zyNsg62
jupyter code
Download link: https://drive.google.com/open?id=1pogwpzM4HHwTHrSDIsvXr8r79EDr4WhX

Required skills & environment

You know the basic syntax of python
You know a little pandas
You have a jupyter notebook or google colaboratory environment

1. Observe the data

■ First, read the csv file as a pandas data frame.

I scraped the data myself, so I already know roughly what it contains.

`python`


import pandas as pd
df = pd.read_csv("biccamera_all_laptop.csv")
df.head()

Screenshot 2019-11-09 20.09.01.png

■ Get the number of rows, check the column name, check the number of nulls

Normally you would not know this until you look at the data, but I processed it in advance so that the number of nulls is zero.

`python`


#Get the number of rows
print(len(df))
#Check the column name
print(df.columns)
#Check the number of nulls
print(df.isnull().sum())

Screenshot 2019-11-09 20.10.02.png
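To make those three checks concrete, here is a minimal sketch on a made-up frame. The column values are hypothetical and are not the real BicCamera data; one null is planted on purpose so the null count has something to report.

```python
import pandas as pd

# Hypothetical miniature frame; not the real BicCamera data.
sample = pd.DataFrame(
    {
        "title": ["Laptop A", "Laptop B", "Laptop C"],
        "price": ["100,000", None, "80,000"],
    }
)

n_rows = len(sample)                 # number of rows
columns = list(sample.columns)       # column names
null_counts = sample.isnull().sum()  # nulls per column (a Series)

print(n_rows)
print(columns)
print(null_counts)
```

If a column you expected to be complete shows a non-zero count here, that is usually the first thing worth investigating.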

■ Check dataframe information, count unique values for each column

info() is a familiar one.
Looking at the unique counts, though, a few titles seem to overlap.
There are 30 makers, and the number of unique prices differs from the number of unique points.
I had seen in advance that the points were shown as "point (10%)", so
I assumed they were 10% of the price, but maybe that is not the case.

`python`


#Check dataframe information
print(df.info())
#Count the number of uniques per column
print(df.nunique())

Screenshot 2019-11-09 20.15.28.png
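One way to test the "points are 10% of the price" guess is to compute the ratio per row. This is only a sketch: the numbers are made up and already numeric, whereas the real price and point columns are strings that need cleaning first.

```python
import pandas as pd

# Hypothetical, already-numeric values; not the real BicCamera data.
sample = pd.DataFrame(
    {
        "price": [100000, 80000, 50000],
        "point": [10000, 8000, 500],
    }
)

# Share of the price returned as points, per row.
sample["point_rate"] = sample["point"] / sample["price"]

# Rows whose rate is not (approximately) 10%.
not_ten_percent = sample[(sample["point_rate"] - 0.10).abs() > 1e-9]

print(sample["point_rate"].value_counts())
print(not_ten_percent)
```

If not_ten_percent comes back non-empty on the real data, the point rate varies by product, which would explain why the unique counts differ.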

■ Try various value_counts()

Nothing surprising here.
I don't have much to say (laughs)

`python`


#Delivery date
print(df.terms.value_counts())
#Inventory information
print(df.stock.value_counts())
#Manufacturer name
print(df.maker.value_counts())
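A couple of value_counts() options are handy at this stage. The sketch below uses a made-up maker column (the values are hypothetical) to show the default behaviour, counting missing values, and relative frequencies.

```python
import pandas as pd

# Hypothetical maker column; values are made up.
maker = pd.Series(["A", "B", "A", None, "A", "B"], name="maker")

counts = maker.value_counts()                    # missing values excluded by default
with_missing = maker.value_counts(dropna=False)  # also count missing values
shares = maker.value_counts(normalize=True)      # relative frequencies

print(counts)
print(with_missing)
print(shares)
```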

2. Extract the necessary data

■ The title column looks suspicious, so take a closer look

`python`


for t in df.title:
    print(t)
    print(len(t))
    print("*" * 100)

Screenshot 2019-11-09 20.25.17.png
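Printing every title works for a few hundred rows. As a complementary sketch, the title lengths can also be summarized in one go; the titles below are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical titles; not taken from the real CSV.
titles = pd.Series(
    ["short title", "a much longer product title [spec/spec]"]
)

lengths = titles.str.len()             # characters per title
print(lengths.describe())              # min / max / mean at a glance

longest = titles.loc[lengths.idxmax()] # the title with the most characters
print(longest)
```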

■ Since it is a mixture of full-width and half-width characters, I made a function to unify it.

This function converts half-width katakana to full-width and full-width alphanumeric characters to half-width.
I use it a lot personally.

`python`


import re
import jctconv

def han2zen2han(string):
    """
    Make half-width katakana full-width,
    Make full-width alphanumeric characters half-width
    :param string: string text
    :return: string text
    """
    string = jctconv.h2z(string, kana=True, digit=False, ascii=False)
    string = jctconv.z2h(string, kana=False, digit=True, ascii=True)
    return string
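If jctconv is not available, the standard library can do something similar. The sketch below swaps in Unicode NFKC normalization, which is not the same thing as han2zen2han: it also rewrites other compatibility characters, so treat it as an approximation. The name normalize_width is made up for this example.

```python
import unicodedata


def normalize_width(text):
    """Unify character width with Unicode NFKC normalization.

    NFKC maps half-width katakana to full-width and full-width
    alphanumerics to half-width, but it also rewrites other
    compatibility characters, so it is broader than han2zen2han.
    """
    return unicodedata.normalize("NFKC", text)


print(normalize_width("ＡＢＣ１２３ ﾊﾟｿｺﾝ"))
```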

■ Get a list from the [XXXX/XXXX/] parts of the title.

Get every [] group with the regular expression r"\[.+?\]".
Some titles contain more than one [] group.
The screen size of the notebook PC is picked up with a separate regular expression,
because some [] groups include the size and some do not.
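The "?" in that pattern matters when a title has several bracket groups. Here is a small sketch on a made-up title (not a real product) comparing the non-greedy pattern with a greedy one.

```python
import re

# Hypothetical title with two bracketed groups; not a real product.
title = "Example Note 15.6inch [Core i5/Memory 8GB/SSD 256GB] [2019 model]"

# Non-greedy: stops at the first "]" so each bracket pair is its own match.
lazy = re.findall(r"\[.+?\]", title)

# Greedy: runs from the first "[" to the last "]" and merges the groups.
greedy = re.findall(r"\[.+\]", title)

print(lazy)
print(greedy)
```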

`python`


#Apply to the whole column and get a Series of lists
df.title.apply(get_spec_list)
#Take out the first one and check what is inside
df.title.apply(get_spec_list)[0]

Screenshot 2019-11-09 20.30.53.png

The function is defined below (it has to be defined before the cell above is run).

`python`


def get_spec_list(title):
    """
    spec_list = the [] groups extracted from the product title, brackets included
    inch_list = the PC screen size text extracted from the product title
    l = the individual PC specs taken out of spec_list and collected into one list
    :param title: string text
    :return: list
    """
    l = []
    t = han2zen2han(title)
    spec_list = re.findall(r"\[.+?\]", t)
    # Size unit: assumed to appear as inch, インチ or 型 in the Japanese titles
    inch_list = re.findall(r"(\d\d\.\d|\d\d|\d\..|\d)(inch|インチ|型)", t)
    inch = "".join(inch_list[0]) if inch_list else ""
    for spec in spec_list:
        specs = spec.replace("[", "").replace("]", "").replace(" ", "").replace("・", "/").replace(":", "").split("/")
        for s in specs:
            l.append(s)
    if inch:
        l.append(inch)
    return list(set(l))
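One thing to be aware of: list(set(l)) removes duplicates but does not keep the original order of the specs. If the order matters for you, a possible alternative is sketched below. The helper name unique_in_order and the spec strings are made up for this example.

```python
def unique_in_order(items):
    """Drop duplicates while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Hypothetical spec strings; not taken from the real CSV.
specs = ["Corei5", "Memory8GB", "SSD256GB", "Corei5", "15.6inch"]
print(unique_in_order(specs))
```

For the lookups that follow, order does not matter, so the set-based version is fine there.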

■ Writing everything out and pasting screenshots is a lot of work, so I'll leave the explanation here.

For the rest, please try running the following.

`python`


#Helper functions such as get_intelcpu and get_memory are not listed in this article

#Extract the list that the PC specs are based on
df["spec_list"] = df.title.apply(get_spec_list)

#Get CPU data
df["intel_cpu"] = df.spec_list.apply(get_intelcpu)
df["amd_cpu"] = df.spec_list.apply(lambda x: "".join([i for i in x if re.search(r"amd", i.lower())]))

#Get memory size(int)
df["memory"] = df.spec_list.apply(get_memory)

#Get HDD size(int)
df["hdd"] = df.spec_list.apply(get_hdd)

#Get SSD size(int)
df["ssd"] = df.spec_list.apply(get_ssd)

#Get eMMC size(int)
df["emmc"] = df.spec_list.apply(get_emmc)

#Get screen size in inches(float)
df["inch"] = df.spec_list.apply(get_inch)

#Get screen size in whole inches(int)
df["int_inch"] = df.inch.astype("int")

#Get the cleaned manufacturer name(str)
df["new_maker"] = df.maker.apply(get_maker)

#Get PC price(int); regex=True is required on recent pandas versions
df["new_price"] = df.price.str.replace(r"\D", "", regex=True).astype("int")

#Get points when purchasing a PC(int)
df["new_point"] = df.point.str.replace(r"(point|\n).*", "", regex=True).str.replace(",", "").astype("int")

#Get PC rating(int)
df["new_ratings"] = df.ratings.str.replace(r"\D", "", regex=True).astype("int")

#Get the number of characters in the PC title(int)
df["string_len"] = df.title.str.len()

#Get the number of words in the PC title(int)
df["words_len"] = df.title.str.split().str.len()

The final result will be like this.

Screenshot 2019-11-09 20.34.08.png
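The helper functions applied to spec_list are not listed in this article, so here is a rough idea of what one of them could look like. This is a hypothetical sketch, not the author's implementation: the name get_memory_gb, the "Memory8GB" spec format and the price strings are all assumptions made for illustration.

```python
import re

import pandas as pd


def get_memory_gb(spec_list):
    """Hypothetical helper: memory size in GB from spec strings, or 0."""
    for spec in spec_list:
        # Assumed format, e.g. "Memory8GB"; the real titles may differ.
        m = re.search(r"memory(\d+)gb", spec.lower())
        if m:
            return int(m.group(1))
    return 0


# Hypothetical spec lists, shaped like the output of get_spec_list.
specs = pd.Series([["Corei5", "Memory8GB", "SSD256GB"], ["Celeron", "eMMC64GB"]])
memory = specs.apply(get_memory_gb)
print(memory.tolist())

# Stripping every non-digit character turns a price string into an int.
price = pd.Series(["¥100,000", "80,000 yen"])
new_price = price.str.replace(r"\D", "", regex=True).astype("int")
print(new_price.tolist())
```

Returning 0 when nothing matches keeps the column an int, at the cost of not distinguishing "no memory listed" from a real value; returning None instead would be another reasonable choice.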

At the end

I also posted this as a video, so if you want to see the processing flow, please watch it on YouTube.

Video link

If you want to see the code running, go to "data processing 02" at the link above. The explanation is quite long, so I recommend fast-forwarding.

python jupyter notebook Data preprocessing championship (target site: BicCamera)
