As the title suggests, let's think about the profession of data scientist, which gets described in many different ways these days. Even within the industry, the definition of the job is ambiguous and there is no unified view.
Well, to be honest, it comes down to "anyone who wants to call themselves a data scientist can go ahead and do so", but since I have the chance, I'll write down my personal thoughts here.
Note that there may be nothing new here for people who already think about data scientists on a regular basis.
If anything, this article is aimed at people asking "How can I become a data scientist?" or "I want to hire one of these trendy data scientists, but what kind of person should I hire?" Please keep that in mind.
For now, let's take a look at some of the well-known definitions and popular sayings that are already out there.
"'Data Scientist' is a Data Analyst who lives in California"
"A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician."
Both seem to make a good point, but I also feel that they are so blunt that they undercut the whole discussion.
There is also this famous figure:
"THE DATA SCIENCE VENN DIAGRAM" http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram For details, please read the link above, but the point is that a data scientist combines three things: "hacking skills," "math and statistics knowledge," and "substantive expertise."
In addition, a Japanese professional association has officially published the skills required of data scientists.
**Data Scientist Definitions, Skill Sets, Skill Levels** http://www.datascientist.or.jp/news/2014/pdf/1210.pdf The wording is a little different, but the content is very similar to the Venn diagram above.
I understand what the existing definitions above are getting at, but I feel that none of them is sufficient, for two reasons.
**① They are not structural.** The value chain of analytical work in business is not shown, so it is hard to understand when and how each skill is needed.
**② They treat each skill as binary: you either have it or you don't.** To be honest, it is very difficult to acquire all three skills in the Venn diagram above at a comprehensive, perfect level. In fact, it isn't even clear what level one should aim for.
In this article, I would like to define the data scientist's skill set structurally, in a way that solves these two problems. For ② in particular, I will first introduce the idea that a skill has two levels, "reading" and "writing", and then discuss the necessary skill set.
I've always thought that the concept of a data scientist is close to that of a "multilingual translator." A data scientist must be able to handle several different languages and translate between them.
The "languages" I mean here are things like:
・ Business language
・ Numbers / KPIs
・ Statistics / mathematical formulas
・ Programming languages
Please think of them along those lines.
Management and planning departments speak the language of business and often do not understand the others. They say things like "Let's aim to grow sales!" but never state it as a concrete formula or algorithm. Statisticians are good with difficult formulas, but are sometimes poor at mapping them onto real business events, and if all you present is a list of formulas and numbers, people in other roles won't even look at it. Engineers are not always strong with numbers. And computers understand nothing but programming languages.
Even when every player (except the computer) understands the importance of data analysis and of being data-driven, situations like the above are common. Don't assume "We're all speaking Japanese, so we understand each other!" People in different positions use different words, and the meaning behind those words differs completely.
**The "data scientist" exists to overcome this situation**; that is the image of the data scientist I want to discuss in this article. They need to move freely between the different languages, act as a translator when needed, and serve as a jack-of-all-trades who travels the whole value chain of data analysis.
The people involved in the value chain of the analysis process in business are roughly structured as follows. This value chain forms a V-shaped flow through the round trip of "writing" and "reading". **Let's go through it step by step.**
Most analysis processes begin with a business owner (executive, product manager, business manager, etc.) saying something like "Make our sales structure more visible!" The request then passes through several people in between and eventually reaches the computational resources. (The roles are not necessarily split one per person; it is very likely that one person covers several of them.)
This is the **"writing" process**.
Roughly speaking, the following happens at each step of this process. The lines given to each player are symbolic (and probably prejudiced), and I don't suppose each of them actually talks like this all the time ...
However, there is no doubt that at work the conversation centers on the language described here (it may all be Japanese, but if you do not properly understand that language, the conversation will not work).
[Insert figure]
Moving down this process takes very strong skills. In general, **the "writing" process is often far more difficult than "reading"**.
**Planning / management person**
You must bring all your business knowledge and logic to bear to design KPIs that are meaningful, convincing, and computable. It is also important that they meet the expectations of the business owner.
**Analyst / statistician**
You need to work out concretely how to analyze and quantify the KPIs you want to see. That means scrutinizing the available data, deciding the range of data actually used (excluding data that would distort the results), choosing the granularity at which to look, and designing a statistical model if necessary. You also have to decide what kind of chart will present the results.
**Engineer**
Implement the calculation logic the analyst designed. The design needs to be general enough to handle changes in the data range and granularity, and processing speed needs attention. In some cases, knowledge of statistical modeling packages is also required. It is also desirable to think about the output format and, where needed, to be familiar with techniques such as visualization.
**Computational infrastructure**
Depending on the amount of data, you may need people with infrastructure-engineering knowledge to tune computational resources, parallelization, and so on.
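To make this "writing" chain a little more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the request "make our sales structure more visible" has been translated into the KPI decomposition sales = visitors × conversion rate × average order value. That decomposition, the `SegmentStats` and `decompose_sales` names, and all the numbers are hypothetical and are not taken from the article's figure.

```python
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Raw counts for one customer segment (hypothetical data)."""
    name: str
    visitors: int
    orders: int
    revenue: float


def decompose_sales(seg):
    """Express revenue as visitors x conversion rate x average order value."""
    # Guard against empty segments so the ratios never divide by zero.
    conversion_rate = seg.orders / seg.visitors if seg.visitors else 0.0
    avg_order_value = seg.revenue / seg.orders if seg.orders else 0.0
    return {
        "segment": seg.name,
        "visitors": seg.visitors,
        "conversion_rate": conversion_rate,
        "avg_order_value": avg_order_value,
    }


# Business language: "make our sales structure more visible"
# Numbers / KPI:     one row of KPIs per customer segment
segments = [
    SegmentStats("new", visitors=1000, orders=50, revenue=250000.0),
    SegmentStats("returning", visitors=400, orders=80, revenue=480000.0),
]
kpis = [decompose_sales(s) for s in segments]

for row in kpis:
    print(row)
```

Each layer here corresponds to one of the "languages": the comment is the business request, the formula in the docstring is the KPI design, and the function body is the engineer's implementation.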
Basically, it is harsh, or close to impossible, to ask one person for all of these skills. So if a "data scientist" is defined as someone with the full stack of "writing" skills, the occupation will **quickly run into a dead end**.
Whether it's English or Lebanese, reading is easier than writing. Japanese English education is often criticized for building reading skills while writing and speaking fail to develop, but I personally think that just being able to read is worth a lot (ideally listening too).
That aside, once the business problem has been "written", that is, analyzed, the process of "reading" the results is both important and interesting. Written out the same way as before, it would look something like the following.
In this process, even if you cannot "write", being able to "read" lets you take full part in the conversation and contribute your own opinions.
Even if you can't build a statistical model yourself, it is enough to know the structure of the input data and how to read the model's results.
Even if you are only half-competent at designing KPIs, it is enough to be able to read the value of each KPI and form business interpretations and hypotheses. And even if you can't write code from scratch, the ability to modify and reuse parts of someone else's code can be useful.
The bar for learning this is much lower than for "writing", yet there are many situations where this level is useful for the business.
I think this table is a rough summary of the discussion so far. If you have about two strengths in "writing" and can at least "read" everything else, I think you are **strong enough as a person involved in the data analysis process**.
In other words, what I'm saying is: **if you want to call yourself a data scientist at that point, go ahead, right?**
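The rule of thumb above can be written down as a small check. This is only a sketch: the four language names follow the list earlier in the article, but the `Level` enum, the `is_strong_enough` function, the `min_write=2` default (my reading of "about two"), and the example profile are all hypothetical.

```python
from enum import IntEnum


class Level(IntEnum):
    """Skill level in one 'language' (hypothetical three-step scale)."""
    NONE = 0
    READ = 1
    WRITE = 2


LANGUAGES = ("business", "numbers_kpi", "statistics", "programming")


def is_strong_enough(profile, min_write=2):
    """True if every language is at least READ and min_write are WRITE."""
    # A language missing from the profile counts as NONE.
    levels = [profile.get(lang, Level.NONE) for lang in LANGUAGES]
    can_read_all = all(level >= Level.READ for level in levels)
    write_count = sum(level == Level.WRITE for level in levels)
    return can_read_all and write_count >= min_write


# Example: writes statistics and code, reads business and KPIs.
analyst = {
    "business": Level.READ,
    "numbers_kpi": Level.READ,
    "statistics": Level.WRITE,
    "programming": Level.WRITE,
}
print(is_strong_enough(analyst))
```

Note that a profile with two "writing" strengths but a language it cannot even read fails the check, which matches the emphasis on being able to join the conversation in every language.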
What I want to emphasize is the following.
**1. You don't have to be able to "write" in every field.** Being able to is better, of course, but it is hard in practice. In some cases it is more valuable to specialize further and deepen the fields you can already write in than to strain to widen your "writing" coverage.
**2. If you can "read" a language that you can't "write", that's good (・∀・)!!** As I mentioned earlier, whether it's English or Lebanese, reading is easier than writing. And depending on what you do, that may be enough. The first step is to be able to read and listen, without forcing yourself toward writing and speaking.
That's all. The important thing is **being able to take part in the conversation in any of the languages**. And when the conversation is in a "language you can write", you should take more initiative in it.
I've written at length, but what I want to say is this:
"Becoming a perfect quadrilingual feels hopelessly out of reach. But being bilingual and able to read and listen in two more languages is reasonably achievable and still quite useful, so let's do our best."
That's all.
I hope we stop seeing ridiculous data scientist job postings such as: "Wanted: someone with a Ph.D. who **can use Spark, Hadoop, and SQL**, who can use Python not only for analysis but also to **build algorithms that go into products**, who has **extensive knowledge of statistical models and machine learning**, plenty of **business experience**, good team management skills, and **strong communication skills**."
Enjoy!
If you are interested in data scientists, start with this summary of literature and videos: http://qiita.com/hik0107/items/ef5e044d2f47940ba712