Recently I have been running various analyses, such as collecting data by scraping and performing morphological analysis with MeCab.
Recent articles:
Clustering books from Aozora Bunko with Doc2Vec
Scraping & Negative / Positive Analysis of Bunshun Online Articles
I run all of these analyses inside a Docker environment, so this time I am publishing the Dockerfile I use.
Base: ubuntu
Included: anaconda, mecab, NEologd, gensim, janome, Beautiful Soup, etc.
One tweak: NEologd is set as the default dictionary, so you don't have to specify the NEologd dictionary every time you start mecab.
References:
Kame-san's Udemy Docker course: the source of my basic Docker knowledge. Highly recommended.
NEologd's GitHub page: NEologd is stronger on proper nouns than the default dictionary.
Change the default dictionary of MeCab [Mac]: I used this as a reference when setting mecab's default dictionary.
FROM ubuntu:latest
# System packages: MeCab itself plus the tools the NEologd installer needs
RUN apt-get update && apt-get install -y \
    sudo \
    wget \
    vim \
    mecab \
    libmecab-dev \
    mecab-ipadic-utf8 \
    git \
    make \
    curl \
    xz-utils \
    file
# Install Anaconda under /opt/anaconda3 and put it on PATH
WORKDIR /opt
RUN wget https://repo.anaconda.com/archive/Anaconda3-2020.07-Linux-x86_64.sh && \
    sh Anaconda3-2020.07-Linux-x86_64.sh -b -p /opt/anaconda3 && \
    rm -f Anaconda3-2020.07-Linux-x86_64.sh
ENV PATH=/opt/anaconda3/bin:$PATH
# Install NEologd and point /etc/mecabrc at it so it becomes the default dictionary
RUN git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
RUN cd mecab-ipadic-neologd && \
    ./bin/install-mecab-ipadic-neologd -n -y && \
    echo "dicdir=/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd" > /etc/mecabrc
# Python packages (-y keeps conda from waiting for confirmation during the build)
RUN conda update -y -n base -c defaults conda
RUN pip install --upgrade pip && \
    pip install mecab-python3 \
    Janome \
    jaconv \
    tinysegmenter==0.3 \
    gensim \
    unidic-lite \
    japanize-matplotlib
RUN conda install -y -c conda-forge \
    newspaper3k && \
    conda install -y beautifulsoup4 \
    lxml \
    html5lib \
    requests
WORKDIR /work
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root"]
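The /etc/mecabrc setting above applies to the mecab command. From Python, the behavior may differ: as far as I know, recent versions of mecab-python3 prefer unidic-lite when it is installed (as it is in this image), so the dictionary may need to be passed explicitly. The following is a minimal sketch under that assumption, not a verified part of the original setup. The helper names tagger_args, surfaces, and tokenize are made up for illustration; the dictionary path is the one written to /etc/mecabrc in the Dockerfile.

```python
from typing import List

# Dictionary path taken from the Dockerfile's /etc/mecabrc line.
NEOLOGD_DIR = "/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"


def tagger_args(dicdir: str = NEOLOGD_DIR) -> str:
    """Build the option string that selects a dictionary explicitly (hypothetical helper)."""
    return "-d {}".format(dicdir)


def surfaces(parsed: str) -> List[str]:
    """Extract surface forms from MeCab's default output (surface, tab, features; EOS at the end)."""
    return [
        line.split("\t")[0]
        for line in parsed.splitlines()
        if line and line != "EOS"
    ]


def tokenize(text: str, dicdir: str = NEOLOGD_DIR) -> List[str]:
    """Tokenize text with the given dictionary. Requires mecab-python3 and the dictionary on disk."""
    import MeCab  # imported here so the helpers above work without MeCab installed

    tagger = MeCab.Tagger(tagger_args(dicdir))
    return surfaces(tagger.parse(text))
```

In a notebook inside the container, calling tokenize("文春オンラインの記事を分析する") should return the list of surface forms as segmented by NEologd. If the result looks the same with and without the explicit path, the binding is already picking up the intended dictionary.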