Introduction

I scraped a job-change site with Python and took a quick look at which skills employers require. Since I'm aiming to become an engineer, I wanted to know what requirements actually appear in the job postings.

For the idea and the implementation, I used this article as a reference: <a href="https://qiita.com/kakiuchis/items/2c9b327cadf1e8dbdf6e" rel="nofollow noopener" target="_blank">Since I studied scraping, I considered in a data-driven way how an inexperienced person can become a web director</a>.

Please point out any mistakes, however small!

Required Skills Conclusion

For those who want to see only the results

- Common
  - Team development and communication skills
  - Git / GitHub

- Front end
  - HTML / CSS / JavaScript are mandatory
  - Ideally Vue or React, followed by Angular
  - Design skills and UI / UX knowledge
  - Server-side languages such as PHP, Ruby, Java
  - Photoshop / Illustrator skills
  - webpack
  - etc.

- Server side
  - Languages such as Ruby, PHP, Python, C (C#?), Java
  - Cloud knowledge such as AWS / GCP
  - General knowledge of server infrastructure
  - Web application frameworks such as Rails, Laravel, and Django
  - Databases in general
  - Networking in general
  - Front-end languages
  - Docker containers
  - CI/CD
  - etc.

The list gets a bit rough partway through, but please don't mind that.

Is this about what you expected?

Now, on to the details.

Environment & build order

- Environment
  - macOS Mojave 10.14.6
  - Python 3.8.1

- Build order
  - Install Homebrew (already installed)
  - Install MySQL (already installed)
  - Install pyenv with Homebrew
  - Install Python with pyenv
  - Create a Python environment under the project directory with venv
  - Install libraries such as bs4, MeCab, Jupyter Notebook, pandas

- Reference
  - <a href="https://qiita.com/sakaeda11/items/3832472b5eb923e0128f" rel="nofollow noopener" target="_blank">Python environment construction on Mac</a>
  - <a href="https://qiita.com/fiftystorm36/items/b2fd47cf32c7694adc2e" rel="nofollow noopener" target="_blank">venv: Python virtual environment management</a>
  - <a href="https://qiita.com/kakiuchis/items/2c9b327cadf1e8dbdf6e" rel="nofollow noopener" target="_blank">Since I studied scraping, I considered in a data-driven way how an inexperienced person can become a web director</a>

Collecting information from job change sites

The implementation from scraping through morphological analysis largely follows the <a href="https://qiita.com/kakiuchis/items/2c9b327cadf1e8dbdf6e" rel="nofollow noopener" target="_blank">reference article</a>, so I omit it here. The flow is: get the URL of each job posting → open the URL and extract the required-skills section → count the frequency of each noun via morphological analysis. I also referred to <a href="https://qiita.com/itkr/items/513318a9b5b92bd56185" rel="nofollow noopener" target="_blank">Scraping with Python and Beautiful Soup</a> and the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noopener" target="_blank">official documentation</a>.

When scraping, take particular care to:
- Sleep before opening each page
- Check robots.txt and respect its Disallow rules
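The two precautions above can be sketched with the standard library alone. This is a minimal illustration, not the code used for this article: `build_robot_parser`, `polite_fetch_all`, the `fetch` callable, and the one-second default delay are all assumptions made for the example.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_all(urls, fetch, parser, user_agent="*",
                     delay_sec=1.0, sleep=time.sleep):
    """Fetch each allowed URL, sleeping before every request.

    `fetch` is any callable that takes a URL and returns the page
    (hypothetical here; plug in your own HTTP client).
    """
    pages = {}
    for url in urls:
        # Skip anything robots.txt disallows for this user agent.
        if not parser.can_fetch(user_agent, url):
            continue
        # Wait before opening the page to keep the load on the site low.
        sleep(delay_sec)
        pages[url] = fetch(url)
    return pages
```

Passing `sleep` and `fetch` as arguments keeps the sketch easy to try out without touching a real site.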

The number of job postings collected was:
- Front end: 208
- Server side: 181
- Games: 175

Analysis

Here is the code used for the analysis.

`analysis_job_search.py`


import MeCab
import pandas as pd
import numpy as np
import mysql.connector as mydb
import pandas.io.sql as psql
import collections

pd.set_option('display.max_rows', 500)

#Method to extract nouns from text
def devide_by_mecab(text):
    tagger = MeCab.Tagger("-Ochasen")
    node = tagger.parseToNode(text)
    word_list = []
    while node:
        pos = node.feature.split(",")[0]
        if pos == "名詞":  # "名詞" (noun) is the part-of-speech tag MeCab emits for Japanese text
            word = node.surface
            word_list.append(word)
        node = node.next
    return "  ".join(word_list)

# Connect to MySQL (fill in your own settings here)
connection = mydb.connect(
  host = '',
  port = '',
  user = '',
  password = '',
  database = ''
)

#Get data from DB
df_frontend = psql.read_sql("SELECT * FROM table WHERE search_word = 'front end'",connection)
df_serverside = psql.read_sql("SELECT * FROM table WHERE search_word = 'Server side'",connection)
df_game = psql.read_sql("SELECT * FROM table WHERE search_word = 'Game programmer'",connection)

# Split every need_skills value into nouns and return them as one list
def get_all_words(df):
    all_words = []
    for index, row in df.iterrows():
        words = devide_by_mecab(row['need_skills']).split()
        all_words.extend(words)
    return all_words

count_of_words_frontend = collections.Counter(get_all_words(df_frontend))
count_of_words_serverside = collections.Counter(get_all_words(df_serverside))
count_of_words_game = collections.Counter(get_all_words(df_game))

#See all nouns in order of frequency
count_of_words_frontend.most_common()
count_of_words_serverside.most_common()
count_of_words_game.most_common()

# Hand-picked lists of skill-related words (count >= 6)
frontend_top_words = ['JavaScript', 'CSS', 'HTML', 'js', 'Vue', 'React', 'design', 'team', 'Git', 'UI', 'PHP', 'UX', 'Angular', 'Ruby', 'Javascript', 'jQuery', 'Photoshop', 'communication', 'API', 'TypeScript', 'SPA', 'Java', 'Sass', 'designer', 'Illustrator', 'JS', 'test', 'server', 'webpack', 'GitHub', 'AWS', 'AngularJS', 'WordPress', 'Webpack', 'Rails', 'iOS', 'CMS', 'Python', 'Redux', 'MySQL', 'Gulp', 'Android', 'gulp', 'C', 'SCSS', 'git', 'DB', 'Linux', 'Babel', 'Docker', 'CI']
serverside_top_words = ['Ruby', 'PHP', 'AWS', 'Python', 'C', 'Java', 'server', 'Rails', 'team', 'infrastructure', 'js', 'Git', 'server', 'Android', 'JavaScript', 'Go', 'Linux', 'Perl', 'HTML', 'MySQL', 'RDBMS', 'CSS', 'Kura', 'Udo', 'API', 'front', 'management', 'GitHub', 'iOS', 'DB', 'GCP', 'React', 'Vue', 'network', 'Node', 'HTTP', 'Swift', 'CI', 'Objective', 'Docker', 'Security', 'Javascript', 'Azure', 'native', 'PostgreSQL', 'architecture', 'SQL', 'test', '#', 'smart', 'phone', 'UI', 'MVC', 'communication', 'git', 'Scala', 'Kotlin' , 'CD', 'Database', 'TypeScript', 'Apache', 'LAMP', 'designer', 'container', 'RDB', 'Laravel']
game_top_words = ['C', '3', 'D', 'Unity', 'Java', 'PHP', 'design', '++', 'server', '++、', 'network', 'JavaScript', 'Android', 'management', '#、', 'Objective', 'Photoshop', 'Maya', 'team', 'designer', 'Linux', 'MySQL', 'Ruby', 'Python', 'infrastructure', '#', 'Graphics', 'server', 'Excel', 'graphic', 'communication', 'Unreal', 'DCG', 'AWS', 'Perl', 'Illustrator', 'Engine', 'planner', 'Word', 'native', 'motion', 'director', 'HTML', 'UI', 'Flash', 'effect', 'VB', 'sound', 'DS', 'OpenGL', 'iOS', 'DirectX']

# Build a DataFrame of each word and its number of occurrences
def get_top_word_df(top_words, count_of_words):
    # DataFrame.append was removed in pandas 2.0, so collect the rows first
    rows = [{'word': word, 'count': count_of_words[word]} for word in top_words]
    return pd.DataFrame(rows, columns=['word', 'count'])

df_frontend_top_words =  get_top_word_df(frontend_top_words,count_of_words_frontend)
df_serverside_top_words =  get_top_word_df(serverside_top_words,count_of_words_serverside)
df_game_top_words =  get_top_word_df(game_top_words,count_of_words_game)

for df in [df_frontend_top_words,df_serverside_top_words,df_game_top_words]:
    df['rank'] = df['count'].rank(ascending = False, method = 'min').astype(int)
    df['count'] = df['count'].astype(int)

df_frontend_top_words[['rank','word','count']]
df_serverside_top_words[['rank','word','count']]
df_game_top_words[['rank','word','count']]

I first output every word in order of frequency, manually removed the words that did not seem related to skills, and then output the list again. I added game programmers out of personal interest.
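The tied ranks in the result tables (for example, two words sharing rank 5, followed by rank 7) come from `rank(ascending=False, method='min')` in the code above. As a rough illustration of that rule, here is a minimal pure-Python sketch; `min_rank` is a hypothetical helper and is not part of the original script.

```python
def min_rank(counts):
    """Rank counts in descending order; ties share the smallest rank."""
    first_position = {}
    for position, value in enumerate(sorted(counts, reverse=True), start=1):
        # Remember only the first (smallest) position seen for each value.
        first_position.setdefault(value, position)
    return [first_position[count] for count in counts]


# Top seven front-end counts from the table: JavaScript ... design
top_counts = [147, 145, 131, 72, 63, 63, 60]
ranks = min_rank(top_counts)
```

Because Vue and React both have 63 occurrences, they share one rank and the next word skips a position.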

The result is as follows.

Front end

rank	word	count
1	JavaScript	147
2	CSS	145
3	HTML	131
4	js	72
5	Vue	63
5	React	63
7	design	60
8	team	40
9	Git	34
9	UI	34
11	PHP	31
12	UX	30
13	Angular	29
14	Ruby	23
15	Javascript	21
16	jQuery	20
16	Photoshop	20
18	communication	18
18	API	18
18	TypeScript	18
18	SPA	18
22	Java	16
22	Sass	16
24	designer	15
24	Illustrator	15
24	JS	15
24	test	15
28	server	14
29	webpack	13
29	GitHub	13
29	AWS	13
29	AngularJS	13
33	WordPress	12
33	Webpack	12
33	Rails	12
36	iOS	11
36	CMS	11
36	Python	11
36	Redux	11
40	MySQL	10
40	Gulp	10
42	Android	9
42	gulp	9
42	C	9
45	SCSS	8
45	git	8
47	DB	7
47	Linux	7
49	Babel	6
49	Docker	6
49	CI	6

Server side

rank	word	count
1	Ruby	81
2	PHP	67
3	AWS	50
4	Python	43
5	C	42
6	Java	41
7	server	37
8	Rails	34
9	team	33
10	infrastructure	31
11	js	29
12	Git	27
12	server	27
14	Android	26
14	JavaScript	26
14	Go	26
17	Linux	24
17	Perl	24
19	HTML	21
19	MySQL	21
19	RDBMS	21
22	CSS	19
23	Kura	18
23	Udo	18
23	API	18
26	front	17
26	management	17
26	GitHub	17
29	iOS	16
30	DB	15
30	GCP	15
30	React	15
33	Vue	14
34	network	12
34	Node	12
36	HTTP	11
36	Swift	11
36	CI	11
36	Objective	11
40	Docker	10
40	Security	10
40	Javascript	10
40	Azure	10
44	native	9
44	PostgreSQL	9
44	architecture	9
44	SQL	9
44	test	9
49	#	8
49	smart	8
49	phone	8
49	UI	8
49	MVC	8
49	communication	8
49	git	8
49	Scala	8
57	Kotlin	7
57	CD	7
57	Database	7
57	TypeScript	7
57	Apache	7
57	LAMP	7
63	designer	6
63	container	6
63	RDB	6
63	Laravel	6

Game programmer

rank	word	count
1	C	156
2	3	62
3	D	49
4	Unity	45
5	Java	32
6	PHP	31
7	design	29
8	++	26
9	server	22
10	++、	19
10	network	19
12	JavaScript	17
13	Android	15
14	management	14
15	#、	13
16	Objective	12
16	Photoshop	12
16	Maya	12
19	team	11
19	designer	11
19	Linux	11
19	MySQL	11
19	Ruby	11
19	Python	11
25	infrastructure	10
25	#	10
25	Graphics	10
28	server	9
28	Excel	9
30	graphic	8
30	communication	8
30	Unreal	8
30	DCG	8
30	AWS	8
30	Perl	8
36	Illustrator	7
36	Engine	7
36	planner	7
36	Word	7
36	native	7
36	motion	7
42	director	6
42	HTML	6
42	UI	6
42	Flash	6
42	effect	6
42	VB	6
42	sound	6
42	DS	6
42	OpenGL	6
42	iOS	6
42	DirectX	6

I feel that the results are almost as expected.

Ideally, variants of the same word such as 'javascript', 'Javascript', and 'js' should be merged into a single entry, but there were so many words that it looked difficult, and I gave up.
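One possible way to merge such variants is a small alias table applied to the counts. The sketch below is an assumption, not something I actually ran for this article: `ALIASES` and `merge_variants` are hypothetical names, and mapping 'js' to JavaScript is itself debatable because 'js' may also come from names like Vue.js.

```python
from collections import Counter

# Hypothetical alias table: lowercase variant -> canonical spelling.
ALIASES = {
    "javascript": "JavaScript",
    "js": "JavaScript",  # assumption: 'js' may also come from Vue.js etc.
    "git": "Git",
    "webpack": "webpack",
}


def merge_variants(counts, aliases=ALIASES):
    """Sum the counts of words that map to the same canonical name."""
    merged = Counter()
    for word, count in counts.items():
        # Words without an alias keep their original spelling.
        canonical = aliases.get(word.lower(), word)
        merged[canonical] += count
    return merged


# Front-end counts taken from the table above.
frontend = Counter({"JavaScript": 147, "Javascript": 21, "js": 72, "JS": 15,
                    "Git": 34, "git": 8, "webpack": 13, "Webpack": 12, "CSS": 145})
merged = merge_variants(frontend)
```

The table has to be maintained by hand, so this only helps for the handful of words that matter most.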

Conclusion

Front-end engineer: HTML, CSS, and JavaScript stand out and look indispensable. The next step seems to be learning frameworks such as Vue and React while strengthening design and UI / UX skills, and after that deepening your understanding of the server side.

Server-side engineer: The word for 'cloud' got split into 'Kura' and 'Udo'! ...But I digress. Ruby, PHP, Python, and Java seem to be the main languages (I can't tell which language the 'C' refers to). Cloud knowledge also looks mandatory. An understanding of databases, networking, and the front end is wanted as well. There seems to be a lot to study.
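The ambiguous 'C' arises because the tokenizer splits names like C++ and C# into separate pieces. One possible workaround, shown here only as a sketch under my own assumptions, is to pull ASCII technology names out with a regular expression before counting. `TECH_TERM` and `extract_tech_terms` are hypothetical names, the pattern covers only the `++`, `#`, and `.js` suffixes, and it does nothing for Japanese words such as the split 'cloud'.

```python
import re

# An ASCII word, optionally followed by "++", "#", or ".js".
TECH_TERM = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:\+\+|#|\.js)?")


def extract_tech_terms(text):
    """Return ASCII technology names, keeping C++, C#, and X.js intact."""
    return TECH_TERM.findall(text)


terms = extract_tech_terms("C++、C#、Unityの経験、Node.js")
```

With this, C, C++, and C# would be counted as three separate entries instead of one 'C' plus stray symbols.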

Game programmer: As expected, the direction is a little different here, which is interesting. C++, C#, Unity, and 3D are the main areas, and it seems you need to learn graphics alongside programming.

My knowledge is limited, so I can only draw very rough conclusions... orz. I plan to study based on these results!

Issues

- The search words were fairly narrow, so the number of postings collected is small (around 200)
- The survey covers only one site, so the results are biased
- Name normalization (case differences, abbreviations, typos, etc.) has not been done

In closing

I only took a brief look at word frequencies, but I'm glad it also gave me a guide to the fields I should study.

It might be more interesting to compare multiple sites, or to try a site that aggregates postings across several sites. If you are interested, please take a look for yourself!
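When comparing across categories or sites, raw counts are hard to compare because the number of postings differs (208, 181, and 175 here). A minimal sketch of one way to normalize, assuming a simple occurrences-per-posting rate; `per_posting_rate` is a hypothetical helper, and the figures are the JavaScript counts from the tables above.

```python
def per_posting_rate(word_count, n_postings):
    """Occurrences of a word per job posting."""
    if n_postings <= 0:
        raise ValueError("n_postings must be positive")
    return word_count / n_postings


# Number of postings collected per category (from this article).
postings = {"front end": 208, "server side": 181, "game": 175}
# Occurrences of 'JavaScript' per category (from the result tables).
javascript_counts = {"front end": 147, "server side": 26, "game": 17}

rates = {
    category: round(per_posting_rate(javascript_counts[category], n), 2)
    for category, n in postings.items()
}
```

Note that these are word occurrences, not the number of postings that mention the word, so a rate is not a true percentage of postings.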

I searched for the skills needed to become a web engineer in Python