It's time to seriously think about the definition and skill set of data scientists

What is a data scientist?

As the title suggests, let's think about the profession of data scientist, which gets described in all sorts of ways these days. Even within the industry, the definition of this occupation is ambiguous and there is no unified view.

Well, to be honest, it comes down to "anyone who wants to call themselves a data scientist can go ahead and do so", but since I have the occasion, I'll write down my personal thoughts here.

Note that there may be nothing new here for those who already think about data scientists on a regular basis.

If anything, this article is meant for people who are wondering "How can I become a data scientist?" or "I want to hire one of these trendy data scientists, but what kind of person should I hire?" Please keep that in mind.

A look at commonly held views

For now, let's take a look at some of the well-known definitions and sayings that are already out in the world.

"A 'Data Scientist' is a Data Analyst who lives in California"

"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."

Both seem to make a good point, yet I also feel they are too blunt to be of much practical help.

There is also this famous diagram:

"THE DATA SCIENCE VENN DIAGRAM" http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

For details, please read the link above, but the point is that a data scientist combines three things: "hacking skills," "math and statistics knowledge," and "substantive expertise."

In addition, a Japanese professional association has officially published the skills required of data scientists.

**Data Scientist Definitions, Skill Sets, Skill Levels** http://www.datascientist.or.jp/news/2014/pdf/1210.pdf

The wording is a little different, but it is very similar to the Venn diagram above.

image

I want a more understandable, structured definition

I understand what the existing definitions above are trying to say, but I feel that none of them is sufficient. There are two reasons.

**① They are not structured.** The value chain of analytical work in business is not shown, so it is hard to tell when and how each skill is needed.

**② They treat each skill as a binary of "have it" or "don't have it".** To be honest, it is very difficult to acquire all three skills in the Venn diagram above comprehensively and at a perfect level. In fact, it is not even clear what level one should aim for.

In this article, I would like to define the skill set of data scientists in a structured way that solves these two problems. For ② in particular, I will first introduce the idea that there are two skill levels, "reading" and "writing", and then discuss the necessary skill sets.

Data Scientist = "Multilingual" Theory

I've always thought that the concept of a data scientist is fairly close to that of a "multilingual translator." Data scientists must be able to handle several different languages and translate between them.

The "languages" I mean here are things like:

・ Business language
・ Numbers / KPI
・ Statistics / mathematical formulas
・ Programming language

Management and planning departments speak in business language and often do not understand the other languages. They say things like "Let's aim to increase sales!", but they do not express it as a concrete formula or algorithm.

Statisticians are good at dealing with difficult mathematical formulas, but they are sometimes not good at mapping them onto real-world business events. Even if they output a list of formulas and numbers, people in other roles will not look at it.

Engineers are not always strong with numbers.

And computers do not understand anything other than programming languages.

Even if all the players (except the computers) understand the importance of data analysis and of being data-driven, the situations above are common. Don't assume that "we're all speaking the same Japanese, so we'll understand each other!" Depending on their position, people use different words, and the meaning behind those words is completely different.

**The "data scientist" exists to overcome this situation**; that is the image of the data scientist I would like to discuss in this article. A data scientist needs to move freely between the different languages, act as a translator when needed, and serve as a handyman who travels along the whole data analysis value chain.

Analysis value chain and "reading" and "writing"

The people involved in the value chain of the analysis process in business are roughly structured as follows. This value chain forms a V-shaped flow through the round trip of "writing" and "reading". **Let me explain one step at a time.**

image

Most analysis processes begin with a request such as "Make our sales structure more visible!" from someone in business management (executives, product managers, business managers, and so on). The request then passes through several people and eventually reaches the computational resources. (The work is not necessarily divided one role per person, though; very often one person covers several stages.)

This is the **"writing" process**.

Roughly speaking, the following happens at each step of this process. The lines given to each player are symbolic (and stereotyped), and I don't think each of them actually talks like this all the time...

However, there is no doubt that at work the conversation centers on the language shown here (everyone may be speaking the same Japanese, but if you do not properly understand that language, the conversation will not work).

image

It takes very strong skills to move down through this process. In general, **the "writing" process is often much more difficult than the "reading" process**.

**Planning / management**

You must mobilize all your business knowledge and logic to design KPIs that are meaningful, convincing, and computable. It is also important that the content meets the expectations of business management.

**Analyst / statistician**

You need to work out concretely how to analyze and quantify the KPIs people want to see. You have to scrutinize the available data, decide the range of data actually used (excluding data that would distort the results), think about the granularity at which to look at it, and design a statistical model if necessary. You also have to think about what kind of chart will present the results.

**Engineer**

You implement the calculation logic that the analyst devised. The design needs to be general enough to cope with changes in data range and granularity, and you need to pay attention to processing speed. In some cases, knowledge of statistical modeling packages is also required. It is also desirable to think about the output format and, if necessary, to be familiar with techniques such as visualization.

**Computational infrastructure**

Depending on the amount of data, you may need someone with infrastructure-engineering knowledge for tuning computational resources, parallelization, and the like.

Basically, it is extremely hard, or almost impossible, to expect all of these skills from one person. Therefore, if only a person with the full stack of "writing" skills is defined as a "data scientist", this occupation will **quickly hit a dead end**.

"Writing" and "reading" have different difficulty levels

Whether it's English or Lebanese, reading is easier than writing. English education in Japan is often criticized because reading gets stronger while writing and speaking do not develop, but I personally think that being able to read (and ideally to listen as well) is already very meaningful.

That aside, after "writing", that is, carrying out the analysis, comes the process of "reading" the results, which is important and interesting. Written in the same style as before, it would look something like the following.

image

In this process, even if you do not have the ability to write, the ability to read lets you fully participate in the conversation and contribute your own opinions.

Even if you can't build a statistical model yourself, it is enough to know the structure of the input data and how to read the model results.

Even if you are not yet able to design KPIs on your own, it is enough to be able to read the value of each KPI and form business interpretations and hypotheses.

Also, even if you can't write code from scratch yourself, it is useful to have the skills to fix and reuse part of someone else's code.

In this case, the hurdle to learning is much lower than for "writing", yet there are many situations where this level is already useful in business.

Summarizing what you need

I think this is a rough summary of what has been discussed so far. If, in this table, you can "write" in about two areas and "read" in the rest, I think you are **strong enough as a person involved in the data analysis process**.

In other words, what I'm saying is: **if you want to, you can call yourself a data scientist, right?**

image

What I want to emphasize is:

**1. You don't have to be able to "write" in every field.** Being able to would be better, of course, but it is difficult in practice. In some cases, it pays more to specialize further and deepen the fields you can already write in than to struggle to widen your "writing" coverage.

**2. If you can "read" a language that you can't "write", that's good (・∀・)!!** As I mentioned earlier, whether it's English or Lebanese, reading is easier than writing. And depending on what you do, that may be enough. The first step is to be able to read / listen, without forcing yourself into writing / speaking.

That's it. The important thing is **"being able to take part in the conversation in every language"**. And when the conversation is in a "language you can write", you should take more initiative in it.
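The rule of thumb in this section (about two "writing" strengths, "reading" everywhere else) can be written down as a small check. This is only a minimal sketch: the names `Level`, `LANGUAGES`, and `is_strong_enough`, as well as the exact threshold of two, are assumptions made for illustration, and the four "languages" are taken from the list earlier in the article rather than from the summary table itself.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical skill levels: WRITE implies READ."""
    NONE = 0
    READ = 1
    WRITE = 2


# The four "languages" named earlier in the article (identifiers are illustrative).
LANGUAGES = ("business", "numbers_kpi", "statistics_math", "programming")


def is_strong_enough(skills, min_write=2):
    """Return True if the person can 'write' in at least `min_write`
    languages and at least 'read' in all of them.

    `skills` maps a language name to a Level; missing entries count as NONE.
    """
    levels = [skills.get(lang, Level.NONE) for lang in LANGUAGES]
    write_count = sum(1 for lv in levels if lv == Level.WRITE)
    can_read_all = all(lv >= Level.READ for lv in levels)
    return write_count >= min_write and can_read_all


# An analyst who writes numbers and statistics, and reads the rest.
analyst = {
    "business": Level.READ,
    "numbers_kpi": Level.WRITE,
    "statistics_math": Level.WRITE,
    "programming": Level.READ,
}

# An engineer who writes code and reads statistics, but nothing else.
engineer_only = {
    "programming": Level.WRITE,
    "statistics_math": Level.READ,
}

print(is_strong_enough(analyst))
print(is_strong_enough(engineer_only))
```

The point of modeling the levels as ordered values is that the check is not a yes/no question per skill: a person who cannot "write" a language still counts as long as they can "read" it.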

In closing

I wrote at length, but what I want to say is

"Give up on being a perfect quadrilingual, because I think that is impossible. But being bilingual and able to read and listen in the other two languages is still quite useful, so let's do our best."

That's all.

I hope there will be no more ridiculous data scientist job postings such as "Wanted: someone who has a Ph.D., **can use Spark, Hadoop, and SQL**, can use Python not only for analysis but also to **build algorithms to be incorporated into products**, has **abundant knowledge of statistical models and machine learning**, has sufficient **business experience**, is good at team management, and has **strong communication skills**".

Enjoy!

See also this article

If you are interested in data scientists, start by looking around here, a summary of literature and videos: http://qiita.com/hik0107/items/ef5e044d2f47940ba712
