I am going to build a system that crawls information from Twitter, formats it, and stores it in a database. As a first step, this post sets up a test environment in a virtual machine using Vagrant.
First, install the latest version of VirtualBox from https://www.virtualbox.org/wiki/Downloads.
Next, download and install the latest version of Vagrant from http://www.vagrantup.com/downloads.
Then create a virtual machine with Vagrant.
$ mkdir -p ~/vagrant/debian7_twitter
$ cd ~/vagrant/debian7_twitter
$ vagrant box add debian7.6_twitter https://github.com/jose-lpa/packer-debian_7.6.0/releases/download/1.0/packer_virtualbox-iso_virtualbox.box
$ vagrant init debian7.6_twitter
$ vagrant up
$ vagrant ssh
You are now logged in to the virtual machine.
To shut it down, log out and then run:
$ vagrant halt
After that, you can log back in at any time by running vagrant up followed by vagrant ssh in the directory where the virtual machine was created.
Python 2.7.3 is installed by default, so use it.
$ python -V
Python 2.7.3
Use virtualenv to manage the modules used on a project-by-project basis.
Install it with apt.
$ sudo apt-get update
$ sudo apt-get install python-dev python-virtualenv
Move to any working directory and execute the following command.
$ virtualenv twi-py
This creates a twi-py directory in the current directory, containing an independent Python environment.
Move to the created directory and execute the following command.
$ source bin/activate
If (twi-py) appears at the beginning of the shell prompt, the environment has been switched to twi-py.
To leave the twi-py environment and return to the default one, execute the following command.
$ deactivate
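To confirm from inside Python whether the twi-py environment is active, a small check like the following can help. This is a minimal sketch, not part of the original setup: the function name in_virtualenv is made up for illustration, and it assumes that older virtualenv releases set sys.real_prefix while newer virtualenv releases and venv make sys.base_prefix differ from sys.prefix.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter appears to run inside a virtual environment."""
    # Older virtualenv releases store the system prefix in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer virtualenv releases and venv keep the system prefix in sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("prefix: %s" % sys.prefix)
    print("inside a virtual environment: %s" % in_virtualenv())
```

Running this once before and once after source bin/activate should show the prefix changing to the twi-py directory.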
Next, install the MeCab-related modules in this twi-py environment.
Install MeCab itself with apt.
$ sudo apt-get update
$ sudo apt-get -y install mecab
$ sudo apt-get -y install mecab-ipadic-utf8
$ sudo update-alternatives --config mecab-dictionary # check that ipadic-utf8 is selected
Install the required libraries with apt.
$ sudo apt-get -y install python-dev
$ sudo apt-get -y install libmecab-dev
$ sudo apt-get -y install build-essential
$ sudo apt-get -y install g++
Install the version of the Python bindings for Debian 7 (wheezy) in the twi-py environment.
(twi-py)$ pip install https://mecab.googlecode.com/files/mecab-python-0.99.tar.gz
Let's run morphological analysis on "すもももももももものうち" (sumomo mo momo mo momo no uchi).
(twi-py)$ python
>>> import MeCab
>>> mecab = MeCab.Tagger("-Ochasen")
>>> print mecab.parse("すもももももももものうち")
すもも	スモモ	すもも	名詞-一般
も	モ	も	助詞-係助詞
もも	モモ	もも	名詞-一般
も	モ	も	助詞-係助詞
もも	モモ	もも	名詞-一般
の	ノ	の	助詞-連体化
うち	ウチ	うち	名詞-非自立-副詞可能
EOS
>>>
The phrase was analyzed correctly.
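For later processing it is usually easier to work with structured tokens than with the printed text. The following is a minimal sketch of one way to do that, assuming the -Ochasen layout of tab-separated fields in the order surface form, reading, base form, and part of speech. The function name parse_chasen and the two-token sample string are made up for illustration; the sample stands in for the string returned by mecab.parse().

```python
# -*- coding: utf-8 -*-


def parse_chasen(output):
    """Split MeCab -Ochasen output into (surface, reading, base, pos) tuples."""
    tokens = []
    for line in output.splitlines():
        # Skip blank lines and the end-of-sentence marker.
        if not line or line == "EOS":
            continue
        fields = line.split("\t")
        # Ignore lines that do not carry the four expected fields.
        if len(fields) < 4:
            continue
        tokens.append(tuple(fields[:4]))
    return tokens


# Hypothetical sample in the same shape as the output of mecab.parse().
sample = "すもも\tスモモ\tすもも\t名詞-一般\t\t\nも\tモ\tも\t助詞-係助詞\t\t\nEOS\n"
tokens = parse_chasen(sample)

# Keep only the surface forms whose part of speech starts with 名詞 (noun).
nouns = [surface for surface, reading, base, pos in tokens if pos.startswith("名詞")]
```

In the interactive session above, parse_chasen(mecab.parse("すもももももももものうち")) would be the equivalent call, provided the bindings return the text in this layout.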
With that, the morphological analysis tooling for this system is in place.
Next, install MySQL using apt.
$ sudo apt-get -y install libmysqlclient-dev
$ sudo apt-get -y install mysql-server-5.5
During installation you will be asked to set a root password; enter vagrant.
Log in as the root user.
$ mysql -u root -pvagrant
mysql> SELECT user,host,password FROM mysql.user;
+------------------+----------------------------------+-------------------------------------------+
| user             | host                             | password                                  |
+------------------+----------------------------------+-------------------------------------------+
| root             | localhost                        | *04E6E1273D1783DF7D57DC5479FE01CFFDFD0058 |
| root             | packer-virtualbox-iso-1411922062 | *04E6E1273D1783DF7D57DC5479FE01CFFDFD0058 |
| root             | 127.0.0.1                        | *04E6E1273D1783DF7D57DC5479FE01CFFDFD0058 |
| root             | ::1                              | *04E6E1273D1783DF7D57DC5479FE01CFFDFD0058 |
| debian-sys-maint | localhost                        | *A5B3FEE41C7F1F2C147B4876D39D6A4F65E79B7D |
+------------------+----------------------------------+-------------------------------------------+
MySQL is working without problems.
Install the MySQL bindings for Python in the twi-py environment.
(twi-py)$ pip install MySQL-python
(twi-py)$ python
>>> import MySQLdb
If the import raises no error, the installation succeeded.
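The same import check can be run for several modules at once. The following is a minimal sketch using the standard importlib module; the function name check_modules is made up for illustration, and the module names MeCab and MySQLdb are the ones imported earlier in this post.

```python
import importlib


def check_modules(names):
    """Try importing each module and return a {name: True/False} mapping."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # The module is missing, or one of its dependencies failed to load.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in sorted(check_modules(["MeCab", "MySQLdb"]).items()):
        print("%-8s %s" % (name, "OK" if ok else "missing"))
```

Run it inside the twi-py environment; a module reported as missing there usually means pip install was run outside the environment.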
Now all the necessary tools are in place. Next, I will build the crawler, the part that formats the collected information, the part that stores it in the DB, and so on. Those will be covered in later articles.