:warning: This article was first posted in June 2017, but as of February 2019 it is already out of date. The article is left unchanged for archival purposes, so please do not rely on its contents. Alternative articles include:
- A quick introduction to the AI platform "Watson Studio" announced at IBM's Think 2018
Hello! On 06/01/2017, **the Data Science Experience icon appeared on IBM Cloud!** Even if I get excited about this, I suspect most people will react coolly with "What is Data Science Experience?" (sigh...) There are already some articles on Data Science Experience (DSX) on Qiita, but now that it is registered in the IBM Cloud catalog, I would like to briefly introduce what it is.
(For skilled Qiita readers, I think the following description is the quickest.) In short, it is a SaaS service that provides a complete set of development and execution environments for the following open data science analysis tools, which have been gaining momentum recently. The assumed users are teams of data scientists who can code. (For those who dislike coding: :new: SPSS is also available on DSX! :-))
- Scala / Python on Jupyter Notebook (*)
- R on RStudio

In addition, the following are included:
- Articles for study, tutorials, and open data
- Collaboration features for analysis teams
- GitHub integration for notebooks
At present, **the point is that it is a SaaS service that integrates open source software**, so you could build a similar environment yourself, but I think it has the following advantages.
- (Because it is SaaS) there is no need to procure infrastructure or set up the environment in the first place.
- No knowledge of infrastructure configuration, such as connecting Jupyter to Spark, is required.
- As a result, you can start writing code immediately (or start experimenting and studying right away).
- The multilingual (polyglot) environment removes the need for the analysis team to "unify tools and environments".
- No need to build and operate a Spark cluster (which is quite difficult).
- Easy integration with services on IBM Cloud such as dashDB and Object Storage.
- You can easily publish your notebooks to GitHub.
DSX seems to be particularly focused on **"increasing the productivity of the analysis team"**. Each data scientist has a favorite language and tools they are good at, such as "I want to do it in R" or "Well, AI is the future, so Python it is." If you are analyzing on your own, you can use whatever you like, but that does not hold when you do analysis as a **"team"** and as **"work"**. Unless the language and tool environment are unified, it is inconvenient for the team to evaluate and share analysis results. Still, being told "this analysis will be done in xxx" is quite painful and demoralizing... DSX seems to be aiming for an environment where, in this respect, team members can **analyze with their favorite languages and tools** and still **collaborate** on the deliverables. (This can be inferred from the fact that pricing is not per user but a price for five people.)
Data Science Experience itself was offered in 2016 as a SaaS service independent of Bluemix, but only as a 30-day trial. (That is, it could not be used after the trial expired.) :new: It was then published in the Bluemix catalog with a Free version in 2017/06, and when Bluemix was renamed IBM Cloud in 2017/11, the Lite plan, which is free with no time limit, was introduced. DSX and WML (Watson Machine Learning) are both available on the Lite plan. The point is that (although resources are limited) **the Lite plan lets you try it for free with no time limit**, so I think it is a good place to start studying Jupyter / Python / Scala + Spark. (Tutorials for study and sample notebooks are also available.)
By the way, the resources available in the Lite plan are as follows. They are small, but I think they are sufficient for studying. (The Lite plan has the same functionality as the paid Enterprise version; only the available machine resources and the number of Spark clusters differ.)
Data Science Experience
Below, as an introduction to the features of DSX in the Free environment, I will walk through everything from creating a project to running an existing notebook that explains Python / Spark. In DSX, resources such as notebooks and data are collected, managed, and shared in a management unit called a "project".
Log in to IBM Cloud and select Data Science Experience from the catalog.
On the next screen, give the service a name of your choice, select the Lite plan, and click "Create". :warning: For the Lite plan, set **"Deployment area" to "Southern United States"**. As of November 2017, only "Southern United States" is available for Lite plans. (That seems reasonable, since "Southern United States" has the largest selection of services.)
When the screen changes, click "Get Started".
Select the IBM Cloud organization and space to use with DSX and click "Continue". (The defaults should be fine.)
Wait a while, and when the status becomes Done, click "Get Started".
Below is the initial screen of DSX. :new: The 2017/11 update made it look cooler.
- This panel is displayed by clicking "Get Started" in the upper right.

- ① The center of operation: creating projects and setting up data sources
- ② Links to documents and various settings
- ③ Shortcut icons

The menu in ① is as follows.
- Projects: access to created projects and notebooks
- Tools: access to Jupyter and RStudio
- Data Services: definition of various data sources such as databases and storage

:new: Although still in beta, SPSS Modeler and Stream Designer have also been added.
At the bottom of the screen:
- ④ Recently used projects
- ⑤ Community resources include many blog articles and tutorials, so you can start studying right away from here.
- ⑥ Click to ask DSX support a question. (I have never done so.)

Click "Create Project" from the shortcuts in ③.
Enter a project name of your choice in the Name field.
**To use DSX, you need instances of ① Spark and ② Object Storage.** You can also create these for free with the Lite plan. If they are not yet defined, you can define them right away by clicking the links on this panel; after creating them, click "Reload" and specify the instances to use. (If they are already defined, just select them.)
[If the account does not have an instance]
After specifying the instance, click "Create"
The project is ready. It is still empty, but you can see that notebooks and data assets are stored inside the project. From here you can create new notebooks and machine learning models.
Create a new notebook. Click "Add notebooks" in the upper right.
Enter a name of your choice in Name, select the language and Spark version, and click "Create Notebook". I chose the latest available here, Python 3.5 / Spark 2.1.
As a result, the familiar Jupyter Notebook environment has been created as shown below. The menu and color scheme at the top are different from the open source Jupyter Notebook, but since the substance is Jupyter itself, those who already have Jupyter experience will not get lost in operation.
By the way, the following menus on the upper right are the functions of DSX.
# | Explanation |
---|---|
① | Publish notebook to GitHub |
② | Share your notebook via direct link, Twitter, or LinkedIn |
③ | Notebook run scheduling |
④ | Insert project token (※) |
⑤ | Information about this notebook, such as environment and creation date |
⑥ | Save notebook versions (up to 10) |
⑦ | Add comment |
⑧ | File or data source connection |
⑨ | Search for bookmarks and community resources |
Once the notebook opens, all you have to do is start coding. As shown below, the Spark Context has already been initialized, and numpy, pandas, matplotlib, and other standard Python data science libraries are available. By the way, seaborn was not included, but I was able to install it with `!pip install seaborn`. In this way, it is easy to add a library that is missing.
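If you want to check which libraries are missing before reaching for `!pip install`, a minimal sketch like the following should work in a notebook cell. The helper name `missing_packages` and the list `wanted` are my own illustrations, not something DSX provides, and the result depends on the environment you run it in.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level package is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list of libraries we would like to use in the notebook.
wanted = ["numpy", "pandas", "matplotlib", "seaborn"]
missing = missing_packages(wanted)

if missing:
    # Each missing package can then be installed from a notebook cell, e.g.:
    #   !pip install seaborn
    print("Not installed:", ", ".join(missing))
else:
    print("All requested packages are available.")
```

Checking with `find_spec` avoids actually importing the libraries, so the check stays quick even for heavy packages.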
Starting from nothing is hard for those who are about to begin studying, but DSX has many notebooks (in English) that let you study by reading the explanations and actually running the code. Let's try running an existing notebook on using Spark with Python.
Search for "Apache Spark Lab" in Community Notebooks and you'll find the following three-part series of notebooks. Double-click Part 1 to open it.
A notebook with explanations will open as shown below. Select "Copy" from the icon on the upper right.
Select the project name and Spark environment to use and click "Create Notebook".
After a short wait, the notebook is copied to your environment and becomes runnable, as shown below.
As preparation before execution, clear any previous output that remains: "Cell" - "All Output" - "Clear".
All that remains is to execute the cells while reading the explanations. I think this is good for studying because you can immediately try what you have just read. (By the way, to execute cells one at a time, use the following button or "Shift + Enter".)
The contents of this notebook are beyond the scope of this article, so I will omit them, but there are various other notebooks, so you can choose the theme you are interested in and study in the same way.
So that was my "I tried it out" report.
To collaborate with multiple members on a single project, follow the steps below. As far as I have tried, Lite accounts can do this too.
Enter the email address of the user you want to invite, set the appropriate access rights, and then click the "Invite User" button.
The following email will be sent to the invited member, who accepts the invitation with "Join Now" and signs up for IBM Cloud.
If you already have an ID for logging in to IBM Cloud, sign in with "Already have an IBM Cloud account?" at the bottom right. However, at this point you cannot see anything yet, because the inviter has not shared the project.
Both IBM Cloud accounts and DSX accounts are involved here, which makes things complicated, so please refer to the document "Set up an enterprise account".
Note that the notebook is locked while someone is editing it so that multiple people do not update the same notebook.
Although not covered in this article, DSX also has DSX Local, which runs in a private cloud, and DSX Desktop, which can be used on desktops (open beta as of June 2017). If you are interested, please search the DSX documentation or the Internet.
DSX and WML are separate services on IBM Cloud, but their cooperation is steadily progressing. If you're doing data science / predictive analytics on the IBM Cloud, you'll probably use both. Watson Machine Learning is also available for free with the Lite plan, so please try it.