When I was interested in data preprocessing and was looking for materials, I came across a press release: [Recruit Artificial Intelligence Laboratory started offering "Big Gorilla", an open source ecosystem for data integration and preparation | Recruit Holdings](http://www.recruit.jp/news_data/release/2017/0630_17541.html).
At first glance, I wasn't sure what it was, so I looked up the outline.
BigGorilla - Data Integration & Preparation in Python
- A Python environment with recommended libraries for data preprocessing
- Includes some libraries of their own
From the naming and the figure on the official website it seemed like a huge framework, but it is, so to speak, an assortment of libraries. (There is no BigGorilla-specific class to inherit from.)
To actually do the preprocessing, you write ordinary Python code as usual.
Recommendation for building a portable Python environment with conda - Qiita
# Add anaconda first
$ conda env create biggorilla/py3gorilla
# If you are using pyenv, you need to invoke conda's activate script by its full path; a plain "source activate Py3Gorilla" crashes the shell.
$ source /Users/kkanazaw/.pyenv/versions/anaconda3-4.2.0/envs/Py3Gorilla/bin/activate Py3Gorilla
Reference: Let's get started | BigGorilla
~~Addendum: When I tried it as of July 12, 2017, this method did not work and the following error appeared. (It seems an older yml is being applied, probably because the name of the file updated in June is odd. It will probably be fixed by a future update.)~~
2017/07/21 postscript: The file has been updated. This should work as documented.
$ conda env create biggorilla/py3gorilla
Collecting urllib==1.21.1
Downloading urllib-1.21.1.tar.gz (226kB)
100% |████████████████████████████████| 235kB 640kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/bx/k4yrl_bd3nb0v8pz7fm60t8r0000gp/T/pip-build-58rsg5li/urllib/setup.py", line 191
s.connect((base64.b64decode(rip), 017620))
^
SyntaxError: invalid token
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/bx/k4yrl_bd3nb0v8pz7fm60t8r0000gp/T/pip-build-58rsg5li/urllib/
CondaValueError: Value error: pip returned an error.
You can work around this by downloading the yml from Files :: Anaconda Cloud, removing the line that specifies urllib, and creating the environment from the edited file.
# Remove the environment once
$ conda env remove -n Py3Gorilla
# Recreate the environment from the locally modified yml file
$ conda env create --name test --file ~/Downloads/Py3Gorilla.yml
# As above, if you are using pyenv, invoke conda's activate script by its full path; otherwise the shell crashes.
$ source /Users/kkanazaw/.pyenv/versions/anaconda3-4.2.0/envs/test/bin/activate test
# Download the sample notebook to check that the environment works, and start it
$ anaconda download biggorilla/hi_gorilla
$ jupyter notebook hi_gorilla.ipynb
The list of packages to be installed is on Files :: Anaconda Cloud, so take a look. It turns out that only a small part of the components introduced on the official website is actually installed; compared with the explanation on the site, the composition is surprisingly minimal. If something you need is missing, pip install it yourself.
- urllib: standard library for HTTP access
- requests: an HTTPS access library richer than urllib
- scrapy: scraping
  - (Tweepy is not included)
- beautifulsoup4: loading and parsing web pages
- lxml: XML parser
- nltk: natural language processing (morphological analysis, etc.)
- FlexMatcher (made by the Recruit Artificial Intelligence Laboratory)
- Magellan (developed by the University of Wisconsin)
  - provided as the py-entitymatching and py-stringmatching libraries
- xlrd: Excel handling
- json, csv: standard library
- Is something not included just because commercial tools exist for it?
- Or is it simply not included?
scikit-learn and jupyter-notebook are also included, probably as dependencies.
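As a quick way to see what actually ended up in the environment, a throwaway snippet like the following can just try importing the packages listed above. The import names here (e.g. bs4 for beautifulsoup4, sklearn for scikit-learn) are my assumptions, so adjust as needed.

```python
# Rough check of which of the listed packages can actually be imported in the
# activated Py3Gorilla environment. Module names are assumptions on my part.
import importlib

candidates = [
    'urllib', 'requests', 'scrapy', 'bs4', 'lxml', 'nltk',
    'flexmatcher', 'py_entitymatching', 'py_stringmatching',
    'xlrd', 'json', 'csv', 'sklearn',
]

for name in candidates:
    try:
        importlib.import_module(name)
        print('{}: OK'.format(name))
    except ImportError as exc:
        print('{}: missing ({})'.format(name, exc))
```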
According to the press release, the following three libraries are their own original implementations.
Currently, RIT is developing packages called KOKO and FlexMatcher, and Professor Doan's team is developing a package called Magellan.
FlexMatcher

- A schema matching library made by the Recruit Artificial Intelligence Laboratory
- Even when the column names differ between two datasets, it automatically finds the correspondence between them
- Does it estimate the similarity by using the contents of the data as training data?
- (Personally the most interesting one; a rough usage sketch follows below)
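Since I have not actually used it yet, here is only a minimal sketch of how schema matching with FlexMatcher seems to work, based on the example in the FlexMatcher README. The tables, column names and the mediated schema below are all made up for illustration.

```python
# Minimal FlexMatcher sketch: two small tables with known mappings to a common
# ("mediated") schema serve as training data, then the mapping for a third
# table with different column names is predicted. Data and names are made up.
import pandas as pd
import flexmatcher

data1 = pd.DataFrame({'year': ['2001', '2002', '2010'],
                      'Movie': ['Movie A', 'Movie B', 'Movie C'],
                      'imdb_rating': ['8.8', '7.1', '6.5']})
data2 = pd.DataFrame({'title': ['Movie D', 'Movie E'],
                      'rating': ['9.0', '5.5'],
                      'release': ['1999', '2015']})

# Known mappings from each table's columns to the mediated schema
mapping1 = {'year': 'movie_year', 'Movie': 'movie_name', 'imdb_rating': 'movie_rating'}
mapping2 = {'title': 'movie_name', 'rating': 'movie_rating', 'release': 'movie_year'}

fm = flexmatcher.FlexMatcher([data1, data2], [mapping1, mapping2], sample_size=100)
fm.train()

# A new table whose column names differ; FlexMatcher guesses the correspondence
data3 = pd.DataFrame({'name': ['Movie F'], 'score': ['7.7'], 'yr': ['2003']})
print(fm.make_prediction(data3))
```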
Magellan

- A data matching library developed by the University of Wisconsin
- It looks like you can merge data with notational variations into one, or do something like address identification / record linkage (a minimal sketch follows below)
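Likewise, a minimal sketch of the Magellan side, limited to the blocking step of py-entitymatching (narrowing down to pairs of records that could refer to the same thing). The tables below are made up, and a real workflow would continue with sampling, labeling and training a matcher, which I have not tried yet.

```python
# Minimal Magellan (py_entitymatching) sketch: keep only pairs of records from
# two tables whose 'name' values share at least one word (overlap blocking).
# Data is made up; a full workflow would go on to train an ML matcher.
import pandas as pd
import py_entitymatching as em

A = pd.DataFrame({'id': [1, 2],
                  'name': ['Tokyo Tower', 'Osaka Castle'],
                  'city': ['Tokyo', 'Osaka']})
B = pd.DataFrame({'id': [1, 2],
                  'name': ['Tokyo  Tower', 'Nagoya Castle'],  # notational variation
                  'city': ['Tokyo', 'Nagoya']})

# Tell py_entitymatching which column is the key of each table
em.set_key(A, 'id')
em.set_key(B, 'id')

# Overlap blocker: candidate pairs must share at least one word in 'name'
ob = em.OverlapBlocker()
candidates = ob.block_tables(A, B, 'name', 'name',
                             word_level=True, overlap_size=1,
                             l_output_attrs=['name', 'city'],
                             r_output_attrs=['name', 'city'])
print(candidates)
```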
KOKO

- Only the name appears in the press release. ~~Unpublished?~~
- The repository turned out to be publicly available
- Actually built the environment with conda env
- Next: try using FlexMatcher and Magellan