I scraped a job-change site with Python and took a quick look at the "required skills" in its postings. Since I am aiming to become an engineer, I wondered what requirements are actually listed on job-change sites.
For the idea and the implementation, I used the article <a href="https://qiita.com/kakiuchis/items/2c9b327cadf1e8dbdf6e" rel="nofollow noopener" target="_blank">Since I studied scraping, I considered in a data-driven manner how an inexperienced person can become a web director</a> as a reference.
Please point out any mistakes, however small!
For those who want to see only the results
- Common
  - Team development and communication skills
  - Git / GitHub
- Server side
  - Languages such as Ruby, PHP, Python, C (C#?), and Java
  - Cloud knowledge such as AWS / GCP
  - General knowledge of server infrastructure
  - Web application frameworks such as Rails, Laravel, and Django
  - Databases in general
  - Networking in general
  - Front-end languages
  - Docker containers
Please don't mind that the list gets rough partway through.
Is this about what you expected?
Now, on to the details.
Environment
- macOS Mojave 10.14.6
- Python 3.8.1

Environment setup
- Install Homebrew (already installed)
- Install MySQL (already installed)
- Install pyenv with Homebrew
- Install Python with pyenv
- Create a Python environment under the project directory with venv
- Install libraries such as bs4, MeCab, Jupyter Notebook, and pandas
- Reference
  - <a href="https://qiita.com/sakaeda11/items/3832472b5eb923e0128f" rel="nofollow noopener" target="_blank">Python environment construction on Mac</a>
  - <a href="https://qiita.com/fiftystorm36/items/b2fd47cf32c7694adc2e" rel="nofollow noopener" target="_blank">venv: Python virtual environment management</a>
  - <a href="https://qiita.com/kakiuchis/items/2c9b327cadf1e8dbdf6e" rel="nofollow noopener" target="_blank">Since I studied scraping, I considered in a data-driven manner how an inexperienced person can become a web director</a>
The scraping and morphological analysis are implemented largely as in the <a href="https://qiita.com/kakiuchis/items/2c9b327cadf1e8dbdf6e" rel="nofollow noopener" target="_blank">reference article</a>, so the details are omitted here. The flow is: get the URL of each job posting → open the URL and extract the required-skills section → count the frequency of each noun by morphological analysis. I also referred to <a href="https://qiita.com/itkr/items/513318a9b5b92bd56185" rel="nofollow noopener" target="_blank">Scraping with Python and Beautiful Soup</a> and the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="nofollow noopener" target="_blank">official documentation</a>.
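As a rough illustration of the first two steps in that flow, here is a minimal sketch with Beautiful Soup. The HTML snippets, the CSS selectors (`a.job-link`, `div.required-skills`), and the function names are all hypothetical stand-ins; a real site needs its own selectors, and this is not the reference article's code.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search-results page and a job page.
LISTING_HTML = """
<ul>
  <li><a class="job-link" href="/jobs/1">Front-end engineer</a></li>
  <li><a class="job-link" href="/jobs/2">Server-side engineer</a></li>
</ul>
"""
JOB_HTML = """
<div class="required-skills">
  Experience with JavaScript and Git
</div>
"""


def extract_job_urls(listing_html):
    """Return the href of every job link on a listing page."""
    soup = BeautifulSoup(listing_html, "html.parser")
    return [a["href"] for a in soup.select("a.job-link") if a.has_attr("href")]


def extract_required_skills(job_html):
    """Return the text of the required-skills block, or "" if it is missing."""
    soup = BeautifulSoup(job_html, "html.parser")
    block = soup.select_one("div.required-skills")
    return block.get_text(strip=True) if block else ""


job_urls = extract_job_urls(LISTING_HTML)
skills_text = extract_required_skills(JOB_HTML)
```

The extracted text would then be stored (here, in MySQL) and passed to the morphological analysis step.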
When scraping in particular, take care to:
- Sleep before opening each page
- Check robots.txt for Disallow rules
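Those two precautions could be sketched with the standard library alone, roughly as follows. The user agent name, the example robots.txt rules, and the helper names are assumptions made for illustration; a real run would load the target site's actual robots.txt.

```python
import time
from urllib import robotparser

USER_AGENT = "my-study-bot"  # hypothetical user agent name


def build_robot_rules(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_targets(urls, rules, delay_seconds=1.0):
    """Yield only the allowed URLs, sleeping before each one is handed out."""
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # disallowed by robots.txt, so skip it
        time.sleep(delay_seconds)  # wait before opening the page
        yield url


# Example rules; a real run would read the site's own robots.txt instead.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

rules = build_robot_rules(EXAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/jobs/1",
    "https://example.com/private/admin",
]
allowed = list(polite_targets(candidates, rules, delay_seconds=0))
```

The delay is set to 0 here only so the example finishes instantly; for real scraping, a delay of a second or more per page is the safer choice.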
The number of job postings collected was: front end 208, server side 181, game 175.
Here is the code used for the analysis.
analysis_job_search.py
import MeCab
import pandas as pd
import numpy as np
import mysql.connector as mydb
import pandas.io.sql as psql
import collections
pd.set_option('display.max_rows', 500)
# Extract the nouns from a text and return them as a space-separated string
def devide_by_mecab(text):
    tagger = MeCab.Tagger("-Ochasen")
    node = tagger.parseToNode(text)
    word_list = []
    while node:
        # The first feature field is the part of speech; IPADIC tags nouns as "名詞"
        pos = node.feature.split(",")[0]
        if pos == "名詞":
            word_list.append(node.surface)
        node = node.next
    return " ".join(word_list)
# Connect to MySQL (fill in your own settings here)
connection = mydb.connect(
    host='',
    port='',
    user='',
    password='',
    database=''
)
#Get data from DB
df_frontend = psql.read_sql("SELECT * FROM table WHERE search_word = 'front end'",connection)
df_serverside = psql.read_sql("SELECT * FROM table WHERE search_word = 'Server side'",connection)
df_game = psql.read_sql("SELECT * FROM table WHERE search_word = 'Game programmer'",connection)
# Split the need_skills text of every row into nouns and return them as one list
def get_all_words(df):
    all_words = []
    for index, row in df.iterrows():
        words = devide_by_mecab(row['need_skills']).split()
        all_words.extend(words)
    return all_words
count_of_words_frontend = collections.Counter(get_all_words(df_frontend))
count_of_words_serverside = collections.Counter(get_all_words(df_serverside))
count_of_words_game = collections.Counter(get_all_words(df_game))
#See all nouns in order of frequency
count_of_words_frontend.most_common()
count_of_words_serverside.most_common()
count_of_words_game.most_common()
# Lists of skill-related words picked by hand (count >= 6)
frontend_top_words = ['JavaScript', 'CSS', 'HTML', 'js', 'Vue', 'React', 'design', 'team', 'Git', 'UI', 'PHP', 'UX', 'Angular', 'Ruby', 'Javascript', 'jQuery', 'Photoshop', 'communication', 'API', 'TypeScript', 'SPA', 'Java', 'Sass', 'designer', 'Illustrator', 'JS', 'test', 'server', 'webpack', 'GitHub', 'AWS', 'AngularJS', 'WordPress', 'Webpack', 'Rails', 'iOS', 'CMS', 'Python', 'Redux', 'MySQL', 'Gulp', 'Android', 'gulp', 'C', 'SCSS', 'git', 'DB', 'Linux', 'Babel', 'Docker', 'CI']
serverside_top_words = ['Ruby', 'PHP', 'AWS', 'Python', 'C', 'Java', 'server', 'Rails', 'team', 'infrastructure', 'js', 'Git', 'server', 'Android', 'JavaScript', 'Go', 'Linux', 'Perl', 'HTML', 'MySQL', 'RDBMS', 'CSS', 'Kura', 'Udo', 'API', 'front', 'management', 'GitHub', 'iOS', 'DB', 'GCP', 'React', 'Vue', 'network', 'Node', 'HTTP', 'Swift', 'CI', 'Objective', 'Docker', 'Security', 'Javascript', 'Azure', 'native', 'PostgreSQL', 'architecture', 'SQL', 'test', '#', 'smart', 'phone', 'UI', 'MVC', 'communication', 'git', 'Scala', 'Kotlin' , 'CD', 'Database', 'TypeScript', 'Apache', 'LAMP', 'designer', 'container', 'RDB', 'Laravel']
game_top_words = ['C', '3', 'D', 'Unity', 'Java', 'PHP', 'design', '++', 'server', '++、', 'network', 'JavaScript', 'Android', 'management', '#、', 'Objective', 'Photoshop', 'Maya', 'team', 'designer', 'Linux', 'MySQL', 'Ruby', 'Python', 'infrastructure', '#', 'Graphics', 'server', 'Excel', 'graphic', 'communication', 'Unreal', 'DCG', 'AWS', 'Perl', 'Illustrator', 'Engine', 'planner', 'Word', 'native', 'motion', 'director', 'HTML', 'UI', 'Flash', 'effect', 'VB', 'sound', 'DS', 'OpenGL', 'iOS', 'DirectX']
# Build a DataFrame of each word and its number of occurrences
def get_top_word_df(top_words, count_of_words):
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    return pd.DataFrame({
        'word': top_words,
        'count': [count_of_words[word] for word in top_words],
    })
df_frontend_top_words = get_top_word_df(frontend_top_words,count_of_words_frontend)
df_serverside_top_words = get_top_word_df(serverside_top_words,count_of_words_serverside)
df_game_top_words = get_top_word_df(game_top_words,count_of_words_game)
for df in [df_frontend_top_words, df_serverside_top_words, df_game_top_words]:
    df['rank'] = df['count'].rank(ascending=False, method='min').astype(int)
    df['count'] = df['count'].astype(int)
df_frontend_top_words[['rank','word','count']]
df_serverside_top_words[['rank','word','count']]
df_game_top_words[['rank','word','count']]
First I output every word in order of frequency, then manually removed the words that did not seem related to skills and output the result again. I added game programmer because I was personally interested in it.
The results are as follows.
Front end

rank | word | count |
---|---|---|
1 | JavaScript | 147 |
2 | CSS | 145 |
3 | HTML | 131 |
4 | js | 72 |
5 | Vue | 63 |
5 | React | 63 |
7 | design | 60 |
8 | team | 40 |
9 | Git | 34 |
9 | UI | 34 |
11 | PHP | 31 |
12 | UX | 30 |
13 | Angular | 29 |
14 | Ruby | 23 |
15 | Javascript | 21 |
16 | jQuery | 20 |
16 | Photoshop | 20 |
18 | communication | 18 |
18 | API | 18 |
18 | TypeScript | 18 |
18 | SPA | 18 |
22 | Java | 16 |
22 | Sass | 16 |
24 | designer | 15 |
24 | Illustrator | 15 |
24 | JS | 15 |
24 | test | 15 |
28 | server | 14 |
29 | webpack | 13 |
29 | GitHub | 13 |
29 | AWS | 13 |
29 | AngularJS | 13 |
33 | WordPress | 12 |
33 | Webpack | 12 |
33 | Rails | 12 |
36 | iOS | 11 |
36 | CMS | 11 |
36 | Python | 11 |
36 | Redux | 11 |
40 | MySQL | 10 |
40 | Gulp | 10 |
42 | Android | 9 |
42 | gulp | 9 |
42 | C | 9 |
45 | SCSS | 8 |
45 | git | 8 |
47 | DB | 7 |
47 | Linux | 7 |
49 | Babel | 6 |
49 | Docker | 6 |
49 | CI | 6 |

Server side

rank | word | count |
---|---|---|
1 | Ruby | 81 |
2 | PHP | 67 |
3 | AWS | 50 |
4 | Python | 43 |
5 | C | 42 |
6 | Java | 41 |
7 | server | 37 |
8 | Rails | 34 |
9 | team | 33 |
10 | infrastructure | 31 |
11 | js | 29 |
12 | Git | 27 |
12 | server | 27 |
14 | Android | 26 |
14 | JavaScript | 26 |
14 | Go | 26 |
17 | Linux | 24 |
17 | Perl | 24 |
19 | HTML | 21 |
19 | MySQL | 21 |
19 | RDBMS | 21 |
22 | CSS | 19 |
23 | Kura | 18 |
23 | Udo | 18 |
23 | API | 18 |
26 | front | 17 |
26 | management | 17 |
26 | GitHub | 17 |
29 | iOS | 16 |
30 | DB | 15 |
30 | GCP | 15 |
30 | React | 15 |
33 | Vue | 14 |
34 | network | 12 |
34 | Node | 12 |
36 | HTTP | 11 |
36 | Swift | 11 |
36 | CI | 11 |
36 | Objective | 11 |
40 | Docker | 10 |
40 | Security | 10 |
40 | Javascript | 10 |
40 | Azure | 10 |
44 | native | 9 |
44 | PostgreSQL | 9 |
44 | architecture | 9 |
44 | SQL | 9 |
44 | test | 9 |
49 | # | 8 |
49 | smart | 8 |
49 | phone | 8 |
49 | UI | 8 |
49 | MVC | 8 |
49 | communication | 8 |
49 | git | 8 |
49 | Scala | 8 |
57 | Kotlin | 7 |
57 | CD | 7 |
57 | Database | 7 |
57 | TypeScript | 7 |
57 | Apache | 7 |
57 | LAMP | 7 |
63 | designer | 6 |
63 | container | 6 |
63 | RDB | 6 |
63 | Laravel | 6 |

Game programmer

rank | word | count |
---|---|---|
1 | C | 156 |
2 | 3 | 62 |
3 | D | 49 |
4 | Unity | 45 |
5 | Java | 32 |
6 | PHP | 31 |
7 | design | 29 |
8 | ++ | 26 |
9 | server | 22 |
10 | ++、 | 19 |
10 | network | 19 |
12 | JavaScript | 17 |
13 | Android | 15 |
14 | management | 14 |
15 | #、 | 13 |
16 | Objective | 12 |
16 | Photoshop | 12 |
16 | Maya | 12 |
19 | team | 11 |
19 | designer | 11 |
19 | Linux | 11 |
19 | MySQL | 11 |
19 | Ruby | 11 |
19 | Python | 11 |
25 | infrastructure | 10 |
25 | # | 10 |
25 | Graphics | 10 |
28 | server | 9 |
28 | Excel | 9 |
30 | graphic | 8 |
30 | communication | 8 |
30 | Unreal | 8 |
30 | DCG | 8 |
30 | AWS | 8 |
30 | Perl | 8 |
36 | Illustrator | 7 |
36 | Engine | 7 |
36 | planner | 7 |
36 | Word | 7 |
36 | native | 7 |
36 | motion | 7 |
42 | director | 6 |
42 | HTML | 6 |
42 | UI | 6 |
42 | Flash | 6 |
42 | effect | 6 |
42 | VB | 6 |
42 | sound | 6 |
42 | DS | 6 |
42 | OpenGL | 6 |
42 | iOS | 6 |
42 | DirectX | 6 |
I feel that the results are roughly as expected.
Ideally, variants of the same word such as 'JavaScript', 'Javascript', and 'js' should be merged into one name, but there were too many words to handle, so I gave up on that.
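For reference, one possible way to merge such variants is a small alias table applied to the counter, sketched below. The alias table and the function name are hypothetical, and the mapping of `js` to JavaScript is itself an assumption: some `js` tokens may come from names such as Vue.js or Node.js being split by the tokenizer.

```python
import collections

# Hypothetical alias table: lowercase variant -> canonical name.
ALIASES = {
    "javascript": "JavaScript",
    "js": "JavaScript",  # assumption: may also include pieces of "Vue.js" etc.
    "git": "Git",
}


def merge_variants(counter, aliases=ALIASES):
    """Return a new Counter with spelling variants summed under one name."""
    merged = collections.Counter()
    for word, count in counter.items():
        canonical = aliases.get(word.lower(), word)
        merged[canonical] += count
    return merged


# Counts taken from the front-end table in this article.
frontend_sample = collections.Counter(
    {"JavaScript": 147, "js": 72, "Javascript": 21, "JS": 15, "Git": 34, "git": 8}
)
merged = merge_variants(frontend_sample)
```

Words without an entry in the table pass through unchanged, so the table can be grown gradually instead of covering every word at once.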
Front-end engineer: HTML, CSS, and JavaScript stand far above the rest and look indispensable. The second step is to become able to use frameworks such as Vue and React while strengthening design and UI/UX skills, and after that it seems good to deepen understanding of the server side.
Server-side engineer: 'Cloud' got split into 'Kura' and 'Udo'! ...But I digress. Ruby, PHP, Python, and Java appear to be the main languages (I can't tell which language 'C' refers to, probably because C, C++, and C# were all split into 'C' plus a symbol). The cloud also looks like a must. Understanding of databases, networks, and the front end is wanted as well. There seems to be a lot to study.
Game programmer: As expected, the direction is a little different, which is interesting. C++, C#, Unity, and 3D are the main areas, and it seems necessary to learn graphics alongside programming.
I have little knowledge, so I can only make very rough remarks... orz. I would like to study based on these results!
- Because the search terms were fairly narrow, the number of postings collected is small (around 200 per category)
- Because only one site was surveyed, the results are biased
- Variants of the same word (letter case, abbreviations, spelling variations) were not merged
I only took a quick look at the frequencies, but I'm glad it also gave me a guide to the fields I should study.
It might be more interesting to compare multiple sites, or to try a site that aggregates postings from several sources. If you are interested, please try it yourself!
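If results from several sites or categories are compared, the sample sizes will differ, so raw counts are not directly comparable. One simple option, sketched here under that assumption, is to divide each count by the number of postings. The function name is hypothetical; the numbers come from the tables and posting counts above. Because a word can appear more than once in a posting, the value is occurrences per posting, not the share of postings that mention the word.

```python
def occurrences_per_posting(counts, n_postings):
    """Convert raw word counts into occurrences per job posting."""
    return {word: count / n_postings for word, count in counts.items()}


# Counts and posting totals taken from this article's results.
frontend = occurrences_per_posting({"JavaScript": 147, "PHP": 31}, 208)
serverside = occurrences_per_posting({"JavaScript": 26, "PHP": 67}, 181)

# Print the two categories side by side for each word.
for word in ("JavaScript", "PHP"):
    print(f"{word}: front end {frontend[word]:.2f}, server side {serverside[word]:.2f}")
```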