I needed to extract the full text of my Evernote notes, so I'm publishing the method I used at the time. Apparently you can do this with the Evernote API, but setting that up is a hassle. Instead, I'll show how to export all notes in HTML format and scrape them with Beautiful Soup.
First, select all notes with Command + A, then export them from the right-click menu. Select HTML as the output format. This time, save the export to your desktop as mynote.
The index.html in mynote is a table of contents for all the exported files, with a link to each note's HTML file, so we'll use that.
The procedure is:

1. Export all notes from Evernote in HTML format
2. Get the link to each note from index.html
3. Extract the text of each note with Beautiful Soup
4. Save the extracted text to SQLite (or a text file)
To begin with, scraping is the act of extracting specific information from a website. The files we just exported aren't a website, but they're in HTML format, so they can be scraped all the same. Several Python modules can do this; this time I'll use one called Beautiful Soup.
Install Beautiful Soup with pip.
$ pip install beautifulsoup4
Beautiful Soup is basically used as follows.
import urllib2  # Python 2; on Python 3, use urllib.request instead
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://~ ~ ~")
soup = BeautifulSoup(html)
scrape = soup.find_all("a")
See the official documentation for details. http://www.crummy.com/software/BeautifulSoup/bs4/doc/
This time, only `soup.get_text()`, `soup.find_all("a")`, and `tag.get("href")` are used.
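As a quick illustration of those three methods (the HTML snippet here is made up, shaped like a link in the Evernote index):

```python
from bs4 import BeautifulSoup

# Made-up snippet in the same shape as an Evernote index link
html = '<a href="mynote/First Note.html">First Note</a>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")   # list of every <a> tag in the document
print(links[0].get("href"))  # -> mynote/First Note.html
print(links[0].get_text())   # -> First Note
print(soup.get_text())       # -> First Note (all text in the document)
```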
SQLAlchemy is an OR mapper, a convenient tool that lets you interact with a database without writing SQL. Let's install it with pip.
$ pip install sqlalchemy
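As a minimal sketch of what the OR mapper does (the `Memo` table and its data here are made up for illustration, not the note model used later), defining a class and adding an object replaces hand-written CREATE TABLE and INSERT statements:

```python
import sqlalchemy
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Memo(Base):  # hypothetical example table, not the article's model
    __tablename__ = 'memo'
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    body = sqlalchemy.Column(sqlalchemy.String)

engine = sqlalchemy.create_engine("sqlite://")  # in-memory SQLite
Base.metadata.create_all(engine)                # issues CREATE TABLE for us

session = sessionmaker(bind=engine)()
session.add(Memo(body="hello"))                 # becomes an INSERT on commit
session.commit()

print(session.query(Memo).first().body)         # -> hello
```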
Now that everything is ready, let's scrape. First, create a function that, given the URL of a note, extracts and returns just its text.
def scrape_evernote(url):
    note_url = "file:///(Notebook directory)" + url.encode('utf-8')
    html = urllib2.urlopen(note_url)
    soup = BeautifulSoup(html)
    all_items = soup.get_text()
    return "".join(all_items)
The first three lines build the note's URL and create a BeautifulSoup object from it. `all_items = soup.get_text()` gets the full text of the page. The result of `get_text()` is a sequence of characters, so the final line joins them all into a single string.
Next, create a function to save the extracted text in SQLite.
import sqlalchemy
import sqlalchemy.ext.declarative
import sqlalchemy.orm

def scrape_and_save2sql():
    Base = sqlalchemy.ext.declarative.declarative_base()

    class Evernote(Base):
        __tablename__ = 'mynote'
        id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
        title = sqlalchemy.Column(sqlalchemy.String)
        note = sqlalchemy.Column(sqlalchemy.String)

    db_url = "sqlite+pysqlite:///evernote.sqlite3"
    engine = sqlalchemy.create_engine(db_url, echo=True)
    Base.metadata.create_all(engine)

    # Create a session
    Session = sqlalchemy.orm.sessionmaker(bind=engine)
    session = Session()

    # Get the url of all notes from index.html
    index_url = "file:///(Notebook directory)/index.html"
    index_html = urllib2.urlopen(index_url)
    index_soup = BeautifulSoup(index_html)
    all_url = index_soup.find_all("a")

    for note_url in all_url:
        title = note_url.get_text()
        note = scrape_evernote(note_url.get("href"))
        evernote = Evernote(title=title, note=note)
        session.add(evernote)

    session.commit()
First, create `Base` with `sqlalchemy.ext.declarative.declarative_base()`.
Then create a model of the notebook.
class Evernote(Base):
    __tablename__ = 'mynote'
    id = sqlalchemy.Column(sqlalchemy.Integer, primary_key=True)
    title = sqlalchemy.Column(sqlalchemy.String)
    note = sqlalchemy.Column(sqlalchemy.String)
This time, simply save the title and contents of the note.
Create a SQLite storage location and session.
After that, get the title and URL of each note from `index.html`. The link to each note in index.html is structured as

<a href="Note url">Note title</a>

so `index_soup.find_all("a")` retrieves all the a tags. They come back as an array, so take each one out and get the link's URL and title from the a tag. Then extract the text from that URL with the `scrape_evernote()` function created earlier.
Finally commit and save to SQLite.
This completes the extraction.
If you want to output to a plain text file instead of SQLite:
def scrape_and_save2txt():
    file = open('evernote_text.txt', 'w')

    # Get the url of all notes from index.html
    index_url = "file:///(Notebook directory)/index.html"
    index_html = urllib2.urlopen(index_url)
    index_soup = BeautifulSoup(index_html)
    all_url = index_soup.find_all("a")

    for note_url in all_url:
        title = note_url.get_text()
        file.write(title)
        note = scrape_evernote(note_url.get("href"))
        file.write(note)

    file.close()
That works too. Of course, you can also output in CSV format.
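A CSV variant might look like this (a sketch in Python 3 using the standard `csv` module; the sample rows here stand in for the title/note pairs the scraping loop would produce):

```python
import csv

# Stand-in data; in this article these pairs would come from
# note_url.get_text() and scrape_evernote()
rows = [("First Note", "note body one"), ("Second Note", "note body two")]

with open('evernote_text.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["title", "note"])  # header row
    writer.writerows(rows)              # one row per note
```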
As I wrote at the beginning, the general procedure is: export all notes in HTML format, get each note's link from index.html, extract the text with Beautiful Soup, and save it to SQLite or a text file. This time it was text only, but images are exported too: each note gets a folder with the same name as its title, and the images are saved there. Using this well, you can extract all the images in Evernote as well.
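As a starting point for image extraction (a sketch only; the HTML fragment and folder/file names below are made up), you can collect the `<img>` tags in each note's HTML the same way we collected the `<a>` tags:

```python
from bs4 import BeautifulSoup

# Made-up fragment of an exported note that embeds an image
note_html = '<div><img src="My Note/photo.png"/>some text</div>'
soup = BeautifulSoup(note_html, "html.parser")

# Collect the relative path of every embedded image;
# these paths could then be used to copy the image files out
image_paths = [img.get("src") for img in soup.find_all("img")]
print(image_paths)  # -> ['My Note/photo.png']
```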