This series walks through the basics of data analysis in the steps below. This time, we will do "3. Data preprocessing".
[Data analysis basics]
1. Data collection (scraping)
2. Data storage
3. Data preprocessing
4. Data visualization and consideration
5. Conclusions and measures for data
By the way, last time was "1. Data collection (scraping)".
Previous article
Link for people who want to watch the video
The preprocessing here uses the data collected by that scraping. If you haven't read the previous article, the flow will be hard to follow, so I recommend skimming it first.
Search for "python data preprocessing" and every result uses the same Titanic or scikit-learn datasets. That was boring, so I wanted to do preprocessing that nobody else was doing, on data that nobody else was using.
Well, I collected the data myself, so the content is easy.
python
import pandas as pd
df = pd.read_csv("biccamera_all_laptop.csv")
df.head()
Normally you can't tell until you look at the data, but I already handled missing values when scraping, so the number of nulls is zero.
python
#Get the number of rows
print(len(df))
#Check the column name
print(df.columns)
#Check the number of nulls
print(df.isnull().sum())
You can also see this with info().
However, looking at the number of unique values, the titles barely overlap.
There are 30 makers, and the number of unique prices differs from the number of unique points.
I had seen beforehand that the point field reads "point (10%)", so I assumed the points were always 10% of the price, but maybe that is not the case.
python
#Check dataframe information
print(df.info())
#Count the number of uniques per column
print(df.nunique())
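To test the guess that points are always 10% of the price, the two values can be compared once they are numeric. This is only a minimal sketch on a made-up three-row frame: the sample values are invented, `new_price` and `new_point` stand in for the cleaned columns built later in this article, and the 1-point tolerance is my own assumption about rounding.

```python
import pandas as pd

# Made-up sample rows; the real data comes from biccamera_all_laptop.csv
sample = pd.DataFrame({
    "new_price": [100000, 59800, 128000],
    "new_point": [10000, 598, 12800],
})

# Points as a percentage of the price, rounded to one decimal place
sample["point_rate"] = (sample["new_point"] / sample["new_price"] * 100).round(1)

# True where the points are 10% of the price (allowing 1 point for rounding)
sample["is_10_percent"] = (sample["new_point"] - sample["new_price"] * 0.1).abs() <= 1

print(sample)
print(sample["point_rate"].value_counts())
```

If `point_rate` shows more than one value, the points are not a flat 10% across all products.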
Well, that's about it.
There isn't much to say here (laughs)
python
#Delivery date
print(df.terms.value_counts())
#Inventory information
print(df.stock.value_counts())
#Manufacturer name
print(df.maker.value_counts())
python
#Print every title with its length, separated by a line of asterisks
for t in df.title:
    print(t)
    print(len(t))
    print("*" * 100)
A function that converts half-width katakana to full-width and full-width alphanumeric characters to half-width.
I often use it personally.
python
import re
import jctconv

def han2zen2han(string):
    """
    Make half-width katakana full-width,
    Make full-width alphanumeric characters half-width
    :param string: string text
    :return: string text
    """
    string = jctconv.h2z(string, kana=True, digit=False, ascii=False)
    string = jctconv.z2h(string, kana=False, digit=True, ascii=True)
    return string
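If installing jctconv is not an option, Python's standard `unicodedata` module gives a similar result. This is a different technique from the function above, not a drop-in replacement: NFKC normalization also rewrites other compatibility characters (full-width symbols, for example), so treat it as a rough sketch. The name `normalize_width` is made up for illustration.

```python
import unicodedata

def normalize_width(text):
    """Hypothetical stand-in for han2zen2han, based on NFKC normalization."""
    # NFKC maps half-width katakana to full-width and
    # full-width alphanumerics to half-width in a single pass
    return unicodedata.normalize("NFKC", text)

print(normalize_width("ﾉｰﾄﾊﾟｿｺﾝ"))      # half-width katakana input
print(normalize_width("Ｃｏｒｅ　ｉ５"))  # full-width alphanumeric input
```

The jctconv version gives finer control, since kana, digits and ASCII can each be switched on or off.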
Get every [] group with the regular expression r"\[.+?\]".
Some titles contain more than one [].
Some [] groups contain the screen size and some do not, so the laptop's screen size is taken from the whole title with a separate regular expression.
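To make the two patterns concrete, here is a small sketch on a made-up title. The title is invented, and the unit strings `インチ` and `型` are my assumption about how the raw Japanese titles write the screen size.

```python
import re

# Made-up title in the style of the scraped data (not a real product)
title = "ノートパソコン [15.6型 /Core i5 /メモリ:8GB] [SSD:256GB]"

# Non-greedy match: each [...] group is returned separately
brackets = re.findall(r"\[.+?\]", title)
print(brackets)

# Screen size: a number directly followed by インチ or 型
size = re.findall(r"(\d\d\.\d|\d\d|\d\..|\d)(インチ|型)", title)
print(size)
```

With a greedy `.+` instead of `.+?`, the two bracket groups would be merged into a single match running from the first `[` to the last `]`.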
python
#Apply the function to the whole title Series
df.title.apply(get_spec_list)
#Take out the first result and check what is inside
df.title.apply(get_spec_list)[0]
The function is below.
python
def get_spec_list(title):
    """
    Extract PC specs from a product title.
    spec_list: every [...] group found in the title
    inch_list: screen-size text (number plus unit) found in the title
    l: the individual specs split out of spec_list, plus the screen size
    :param title: string text
    :return: list
    """
    l = []
    t = han2zen2han(title)
    spec_list = re.findall(r"\[.+?\]", t)
    # Screen size: a number followed by インチ (inch) or 型 (type), e.g. 15.6型
    inch_list = re.findall(r"(\d\d\.\d|\d\d|\d\..|\d)(インチ|型)", t)
    inch = "".join(inch_list[0]) if inch_list else ""
    for spec in spec_list:
        # Strip brackets, spaces and colons, then split on "/" (and "・")
        specs = spec.replace("[", "").replace("]", "").replace(" ", "").replace("・", "/").replace(":", "").split("/")
        for s in specs:
            l.append(s)
    if inch:
        l.append(inch)
    # Deduplicate; note that set() does not preserve the original order
    return list(set(l))
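The list returned by get_spec_list still has to be turned into columns such as memory size. The helpers that do this (get_memory and the others) are not listed in this article, so the following is only a guess at what one could look like. The name `get_memory_sketch`, the `メモリ` label and the `GB` suffix are all assumptions about the data, not the author's actual code.

```python
import re

def get_memory_sketch(spec_list):
    """Hypothetical helper: return the memory size in GB, or 0 if not found."""
    for spec in spec_list:
        # Assumes memory entries look like "メモリ8GB" once ":" and spaces are stripped
        m = re.search(r"メモリ(\d+)GB", spec)
        if m:
            return int(m.group(1))
    return 0

print(get_memory_sketch(["15.6型", "Corei5", "メモリ8GB", "SSD256GB"]))
print(get_memory_sketch(["14型", "SSD128GB"]))
```

Returning 0 when nothing matches keeps the column numeric, so it can be cast to int without handling missing values.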
For the time being, please try the following.
python
#get_intelcpu, get_memory, get_hdd, get_ssd, get_emmc, get_inch and get_maker
#are helper functions that are not listed in this article
#Extract the list that is the basis of the PC specs
df["spec_list"] = df.title.apply(get_spec_list)
#Get CPU data
df["intel_cpu"] = df.spec_list.apply(get_intelcpu)
df["amd_cpu"] = df.spec_list.apply(lambda x: "".join([i for i in x if re.search(r"amd", i.lower())]))
#Get memory size (int)
df["memory"] = df.spec_list.apply(get_memory)
#Get HDD size (int)
df["hdd"] = df.spec_list.apply(get_hdd)
#Get SSD size (int)
df["ssd"] = df.spec_list.apply(get_ssd)
#Get eMMC size (int)
df["emmc"] = df.spec_list.apply(get_emmc)
#Get screen size in inches (float)
df["inch"] = df.spec_list.apply(get_inch)
#Get screen size in inches, truncated (int)
df["int_inch"] = df.inch.astype("int")
#Get the cleaned manufacturer name (str)
df["new_maker"] = df.maker.apply(get_maker)
#Get PC price (int); regex=True is required on pandas 2.0 and later
df["new_price"] = df.price.str.replace(r"\D", "", regex=True).astype("int")
#Get points awarded when purchasing the PC (int)
df["new_point"] = df.point.str.replace(r"(point|\n).*", "", regex=True).str.replace(",", "").astype("int")
#Get PC rating (int)
df["new_ratings"] = df.ratings.str.replace(r"\D", "", regex=True).astype("int")
#Get the number of characters in the PC title (int)
df["string_len"] = df.title.str.len()
#Get the number of words in the PC title (int)
df["words_len"] = df.title.str.split().str.len()
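One thing to watch in the price and rating lines: on pandas 2.0 and later, `str.replace` treats the pattern as a literal string unless `regex=True` is passed, so a pattern like `\D` only works as a regular expression with that flag. A minimal sketch on made-up strings (the sample values are assumptions, not taken from the CSV):

```python
import pandas as pd

# Made-up raw strings; the real columns come from the scraped CSV
raw = pd.DataFrame({
    "price": ["¥123,456", "98,780円"],
    "ratings": ["(12)", "(0)"],
})

# \D matches every non-digit character, so only the digits survive
raw["new_price"] = raw["price"].str.replace(r"\D", "", regex=True).astype("int")
raw["new_ratings"] = raw["ratings"].str.replace(r"\D", "", regex=True).astype("int")

print(raw[["new_price", "new_ratings"]])
```

Without the flag, the pattern is searched for literally, nothing is replaced, and the cast to int fails on the leftover symbols.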
The final result looks like this.
I also posted this as a video, so if you want to see the processing flow, please watch it on YouTube.
If you want to see the code running, go to "data processing 02" in the link above. The explanation is quite long, so fast-forwarding is recommended.