I work with machine learning, and having become interested in text mining, I am posting this as a memo of what I tried. This post covers scraping; the eventual goal is to analyze the scraped text, and I will post follow-ups as that work is completed.
When I thought about scraping with Python, the first things that came to mind were requests and [BeautifulSoup](http://kondou.com/BS4/), so I will use this combination this time.
By the way, I am usually a JavaScript developer (JSer), so I often use puppeteer for scraping. That aside, let's actually start scraping.
Of course, you can write Python code directly in a .py file, but Jupyter Notebook is convenient in various ways, such as making the output easy to see, so I recommend using it for testing. This time I will test with Google Colaboratory, provided by Google. The libraries used here need no installation; all you need to run it is a Google account.
**Open Colaboratory and create a new notebook**
Open Colaboratory in your web browser and create a new Python 3 notebook from File > New Notebook.
I don't think it needs further explanation, but the code goes in the cells in the central area.
**Import libraries**
import
import requests
from bs4 import BeautifulSoup
**Specify the URL**
This time I will scrape the latest "10-line news" headline article from Archi Future Web, a portal site for architecture x computation, which is my line of work.
Specify url
url = "http://www.archifuture-web.jp/headline/457.html"
**Visit the page using requests**
Let's check whether we can actually access the page.
Visit page
res = requests.get(url)
res
If you run this, you should get the following back:
response
<Response [200]>
A status code of 200 means the request succeeded. For the meaning of other HTTP response codes, consult an HTTP status code reference.
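To look up what a status code means without leaving the notebook, Python's standard library `http.HTTPStatus` enum can help. The following is a minimal sketch; the helper name `describe_status` is hypothetical and not part of the original procedure.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short 'code phrase' label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that are not registered in the standard library enum
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"


print(describe_status(200))
print(describe_status(404))
```

In the notebook, this could be called as `describe_status(res.status_code)`.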
If you want to see the contents of the page
res.text
response
'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja" dir="ltr" xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta http-equiv="Content-Style-Type" content="text/css" />\n<meta http-equiv="Content-Script-Type" content="text/javascript" />\n<!-- [if lt IE]><meta http-equiv="imagetoolbar" content="no" /><![endif] -->\n<title>"Archi Future 2019" is a great success with the highest number of visitors in history | Headline | Architecture x Computation Portal Site\u3000Archi Future Web</title>\n<meta name="Description" content="Architecture x Computation portal site "Archi Future Web"." />\n<meta name="keywords" content="Architecture,Computation,Archi Future">\n<meta property="og:site_name" content="Architecture × Computationのポータルサイト\u3000Archi Future Web">\n<meta property="og:title" content=""Archi Future 2019" is a great success with the highest number of visitors ever">\n<meta property="og:type" content="article">\n<meta property="og:description" content="The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in. 
On the day of the event, the number of visitors was high despite the unfortunate weather of heavy rain and wind....">\n<meta property="og:url" content="http://www.archifuture-web.jp/headline/457.html" />\n<meta property="og:image" content="http://www.archifuture-web.jp/headline/img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " />\n<meta property="og:image:width" content="700" />\n<meta property="og:image:height" content="467" />\n<meta property="og:locale" content="ja_JP">\n<link rel="stylesheet" type="text/css" href="../common/css/import.css" />\n<!-- [if lt IE 9]><link rel="stylesheet" type="text/css" href="../common/css/ie.css" /><![endif] -->\n<script type="text/javascript" src="../common/js/jquery.js"></script>\n<script type="text/javascript" src="../common/js/init.js"></script>\n<script type="text/javascript" src="../common/js/Nav.js"></script>\n</head>\n\n<body id="headline">\n<div id="container">\n<script type="text/javascript">header(\'../\');</script>\n<hr class="hide" />\n\n<h2><img src="img/title.jpg " width="960" height="60" alt="Headline"/></h2>\n\n<div id="content">\n<div class="section">\n<div id="mainContent">\n\n<!--========\u3000SNS\u3000========-->\n<div id="sns" class="clearfix">\n<div class="facebook"><div id="fb-root"></div>\n<script>(function(d, s, id) {\n var js, fjs = d.getElementsByTagName(s)[0];\n if (d.getElementById(id)) return;\n js = d.createElement(s); js.id = id;\n js.src = "//connect.facebook.net/ja_JP/sdk.js#xfbml=1&appId=152808698121811&version=v2.0";\n fjs.parentNode.insertBefore(js, fjs);\n}(document, \'script\', \'facebook-jssdk\'));</script><div class="fb-share-button" data-href="http://www.archifuture-web.jp/headline/457.html" data-layout="button"></div></div>\n<div class="twitter"><a href="https://twitter.com/share" class="twitter-share-button" data-lang="ja" data-count="none">Tweet</a> <script>!function(d,s,id){var 
js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?\'http\':\'https\';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+\'://platform.twitter.com/widgets.js\';fjs.parentNode.insertBefore(js,fjs);}}(document, \'script\', \'twitter-wjs\');</script></div>\n<!-- /#sns --></div>\n\n<!--========\u3000 text\u3000========-->\n\n<div class="page-title">\n<p><img src="img/icon_new.gif" width="112" height="20" alt="Latest 10 lines news"/></p>\n<h2>"Archi Future 2019" has the highest number of visitors ever<br />\r\n Collect and hold successfully</h2>\n</div>\n<p class="page-data">2019.10.28</p>\n\n<p>The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.<br />\r\n On the day of the event, despite the unfortunate weather of heavy rain and wind<span style="font-size:12px;">、</span>The number of visitors is 5 compared to the previous time.4% increase<br />\r\n5,With 509 people, it was a great event to attract the highest number of visitors in history.<br />\r\n Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors<br />\r\Panel design by 5 people in n-long class<span style="font-size:12px;">I</span>Ska<span style="font-size:12px;">Tsu</span>Shi<span style="font-size:12px;">Yo</span>Is<span style="font-size:12px;">、</span>The venue expanded to 600 seats is full<span style="font-size:12px;">、</span>Panel design<span style="font-size:12px;">I</span><br />\r\nSkaTsuShiYoIs席をさらに100席増設するほどの盛況ぶりだ<span style="font-size:12px;">Tsu</span>Ta. 
Which course is the lecture / seminar?<br />\r\n is almost full<span style="font-size:12px;">、</span>The exhibition hall is also visited by a large number of visitors<span style="font-size:12px;">、</span>The whole venue was very lively and a great success<br />\r\nな開催となTsuた。岡田氏と山梨氏の特別対談1、豊田氏と松島氏の特別対談2をはじめ、どの<br />\r\nセTsuShi<span style="font-size:12px;">Yo</span>ンもArchitectureの新しい方向性と明るい未来を感じさせてくれる、充実した内容であTsuた。<br />\r\The report of nArchi Future 2019 will be introduced on this site in the future.<br />\r\n<br />\r\n<a href="http://www.archifuture.jp/2019/" target="_blank"><p class="image al_center"><img src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " alt="\u3000「Archi Future 2019」オフIShiャルサイトのトTsuプページ" width="600" height="400" /></p><p class="caption">\u3000「Archi Future 2019」オフIShiャルサイトのトTsuプページ</p></a></p>\n\n<!--========\u3000ナビゲーShiYoン\u3000========-->\n<div id="page-navi" class="clearfix clr">\n<ul>\n<li class="go-top"><a href="../index.html"><span>></span> トTsuプページに戻る</a></li>\n<li class="go-list"><a href="index.html"><span>></span>Return to article list</a></li>\n</ul>\n</div>\n\n<!--========\u3000 latest article\u3000========-->\n<div id="page-new" class="clr">\n<h2>Latest articles</h2>\n<ul class="list01">\n<li><a href="457.html" target="">"Archi Future 2019" is a great success with the highest number of visitors ever<br />\n2019.10.28\u3000<img src="img/icon_new.gif" width="112" height="20" alt="Latest 10 lines news"/></a></li>\n<li><a href="456.html" target="">Archi Future 2019 will finally be held on the 25th tomorrow on the largest scale ever<br />\n2019.10.24\u3000<img src="img/icon_new.gif" width="112" height="20" alt="Latest 10 lines news"/></a></li>\n<li><a href="455.html" target="">October 25th of this week(Money)`` Archi Future 2019'' will be held in<br />\n2019.10.21\u3000<img src="img/icon_new.gif" width="112" height="20" alt="Latest 10 lines news"/></a></li>\n</ul>\n</div>\n\n<!--========\u3000 premium banner\u3000========-->\n\r\n<div 
id="premiumbanner" class="al_center clr">\r\n<a href="http://www.archifuture.jp/2019/" class="premiumbanner-left banner" target="_blank" id="premium-24"><img src="../img_banner/premium/img/8/2/8280d8bb17d4c09e872441a1ba21eae0.png " width="270" height="180" alt="Archi Future 2019"/></a>\r\n<a href="http://www.archifuture.jp/2019/" class="premiumbanner-right banner" target="_blank" id="premium-25"><img src="../img_banner/premium/img/6/a/6ad8af52988f0560acbb6c08377d79f3.png " width="270" height="180" alt="Archi Future 2019"/></a>\r\n</div>\r\n\n\n<!--========\u3000 Rectangle Super Banner\u3000========-->\n\r\n<p id="superbanner" class="al_center"><a href="http://www.archifuture.jp/2019/" class="banner" id="super-14"><img src="../img_banner/super/img/9/9/99ae81e84701bf687561a0ca026bdef0.png " width="600" height="90" alt="Archi Future 2019"/></a></p>\r\n\n\n<!-- /#mainContent --></div>\n\n<div id="sidebar">\n<!--========\u3000 advertising banner\u3000========-->\n\r\n<ul id="banner" class="clr">\r\n<li><a href="https://www.cradle.co.jp/" target="_blank" class="banner" id="default-5"><img src="../img_banner/default/img/2/f/2f1b60f601b0f99e6094e32d7fd0b26d.gif" width="270" height="80" alt="Software cradle"/></a></li>\r\n<li><a href="https://product.metamoji.com/gemba/eyacho/" target="_blank" class="banner" id="default-24"><img src="../img_banner/default/img/2/8/280e8426c1fb78ee0e67b2d009d7c9d2.gif" width="270" height="80" alt="MetaMoJi"/></a></li>\r\n<li><a href="https://www.izumi-soft.jp/product-category/bim-%E7%A9%BA%E8%AA%BF%E8%A8%AD%E5%82%99%E8%A8%AD%E8%A8%88/" target="_blank" class="banner" id="default-16"><img src="../img_banner/default/img/1/8/18ef602ddf1e9f1e3c5f00a7674725a2.gif" width="270" height="80" alt="イズミShiステム設計様"/></a></li>\r\n<li><a href="http://www.nyk-systems.co.jp/" target="_blank" class="banner" id="default-6"><img src="../img_banner/default/img/3/b/3b747d65472ce7be37b8235fc703432d.gif" width="270" height="80" alt="NYKShiステムズ様"/></a></li>\r\n<li><a 
href="http://www.pivot.co.jp/" target="_blank" class="banner" id="default-12"><img src="../img_banner/default/img/6/d/6d10409aeb0b2d23bd73b9ccc70cc08d.gif" width="270" height="80" alt="ArchitectureピボTsuト様"/></a></li>\r\n<li><a href="http://www.applicraft.com/" target="_blank" class="banner" id="default-20"><img src="../img_banner/default/img/9/0/90cc824aac1eda2ba2c37046e55dd79c.gif" width="270" height="80" alt="Appcraft"/></a></li>\r\n<li><a href="http://bit.ly/2Bw8tEc" target="_blank" class="banner" id="default-3"><img src="../img_banner/default/img/7/d/7dbe65f17a1bf153277ba5b466580556.jpg " width="270" height="80" alt="グラフIソフトジャパン様"/></a></li>\r\n<li><a href="https://autode.sk/2TXDSqE" target="_blank" class="banner" id="default-11"><img src="../img_banner/default/img/e/c/ecb06a6b95c9e79935b6a7df88384ab3.jpg " width="270" height="80" alt="Autodesk"/></a></li>\r\n<li><a href="https://licensecounter.jp/aec-collection-bim/" target="_blank" class="banner" id="default-22"><img src="../img_banner/default/img/1/0/10bcd30f085e6ce4dab4b824c64817a6.gif" width="270" height="80" alt="SB C&Mr. 
S"/></a></li>\r\n<li><a href="https://www.nvidia.com/ja-jp/design-visualization/industries/architecture-engineering-construction/?nvid=nv-int-pcjp12rrdsfrqr-44523" target="_blank" class="banner" id="default-21"><img src="../img_banner/default/img/d/5/d53b3fe7fec2bc10858a26f88556c8fb.jpg " width="270" height="80" alt="エヌビデIア様"/></a></li>\r\n<li><a href=" https://www.aanda.co.jp/Vectorworks2019/index.html?utm_source=af&utm_medium=banner&utm_campaign=bnr_20190921" target="_blank" class="banner" id="default-9"><img src="../img_banner/default/img/7/2/72745a704c3abe2513357559102be116.jpg " width="270" height="80" alt="A & A"/></a></li>\r\n<li><a href="http://j-bim.gloobe.jp/" target="_blank" class="banner" id="default-4"><img src="../img_banner/default/img/a/9/a9f022f44cac2878ac5936fcf4b26175.gif" width="270" height="80" alt="Fukui Computer Architect"/></a></li>\r\n<li><a href="http://www.env-simulation.com" target="_blank" class="banner" id="default-18"><img src="../img_banner/default/img/e/3/e3d7f1694a53705271dc5e751519d0d8.gif" width="270" height="80" alt="環境ShiミュレーShiYoン様"/></a></li>\r\n<li><a href="http://www.f-cadewa.com/" target="_blank" class="banner" id="default-8"><img src="../img_banner/default/img/f/0/f0a5e6fe83e322a2d62a2461855a6c2a.gif" width="270" height="80" alt="富士通四国インフォテTsuク様"/></a></li>\r\n<li><a href="https://www.photoruction.com/?utm_source=afw&utm_medium=banner&utm_campaign=201903" target="_blank" class="banner" id="default-23"><img src="../img_banner/default/img/2/8/28dba36bb5e98ff9a4514ae00e93844b.png " width="270" height="80" alt="フォトラクShiYoン様"/></a></li>\r\n</ul>\r\n<script type="text/javascript">\r\n<!--\r\nvar top_url = \'/\';\r\n//-->\r\n</script>\r\n<script type="text/javascript" src="../common/js/banner_track.js"></script>\r\n\n\n<!-- /#sidebar --></div>\n<!-- /#section --></div>\n<!-- /#content --></div>\n\n<script type="text/javascript">footer(\'../\');</script>\n\n<!-- /#container --></div>\n</body>\n</html>'
However, this returns the HTML as one densely packed string, which is hard to make sense of. So let's use Beautiful Soup to parse the HTML.
**Use Beautiful Soup**
Parse the HTML
soup = BeautifulSoup(res.text, 'html.parser')
soup
Let's see the result here
response
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="ja" xml:lang="ja" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
---(abridgement)---
<!--========Text========-->
<div class="page-title">
<p><img alt="Latest 10 lines news" height="20" src="img/icon_new.gif" width="112"/></p>
<h2>"Archi Future 2019" has the highest number of visitors ever<br/>
Collected and held successfully</h2>
</div>
<p class="page-data">2019.10.28</p>
<p>The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.<br/>
On the day of the event, despite the unfortunate weather of heavy rain and wind<span style="font-size:12px;">、</span>The number of visitors is 5 compared to the previous time.4% increase<br/>
5,With 509 people, it was a great event to attract the highest number of visitors in history.<br/>
Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors<br/>
Panel design by 5 long class students<span style="font-size:12px;">I</span>Ska<span style="font-size:12px;">Tsu</span>Shi<span style="font-size:12px;">Yo</span>Is<span style="font-size:12px;">、</span>The venue expanded to 600 seats is full<span style="font-size:12px;">、</span>Panel design<span style="font-size:12px;">I</span><br/>
Scution is so successful that it will add another 100 seats.<span style="font-size:12px;">Tsu</span>Ta. Which course is the lecture / seminar?<br/>
Is almost full<span style="font-size:12px;">、</span>The exhibition hall is also visited by a large number of visitors<span style="font-size:12px;">、</span>The whole venue was very lively and a great success<br/>
It was held. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which<br/>
Sessie<span style="font-size:12px;">Yo</span>It was a fulfilling content that made me feel a new direction of architecture and a bright future.<br/>
The report of Archi Future 2019 will be introduced on this site in the future.<br/>
<br/>
<a href="http://www.archifuture.jp/2019/" target="_blank"><p class="image al_center"><img alt="Top page of "Archi Future 2019" official site" height="400" src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " width="600"/></p><p class="caption">Top page of "Archi Future 2019" official site</p></a></p>
---(abridgement)---
</html>
The HTML is now parsed and displayed neatly.
Now for the main subject. This time we want the content of the 10-line article, so first we need to find where the article body is. You could find it by comparing the output with the text on the page, but in a web browser it is easier to use the developer tools.
It looks something like this. (Hmm... no id or class is assigned to it...)
If an id or class were assigned, you could easily get the element by specifying it with a CSS selector, but there is none this time, so I will get all the p tags, which is where the article is written.
Get p tag
p_tags = soup.select('p')
p_tags
Acquisition result of p tag
[<p><img alt="Latest 10 lines news" height="20" src="img/icon_new.gif" width="112"/></p>,
<p class="page-data">2019.10.28</p>,
<p>The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.<br/>
On the day of the event, despite the unfortunate weather of heavy rain and wind<span style="font-size:12px;">、</span>The number of visitors is 5 compared to the previous time.4% increase<br/>
5,With 509 people, it was a great event to attract the highest number of visitors in history.<br/>
Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors<br/>
Panel design by 5 long class students<span style="font-size:12px;">I</span>Ska<span style="font-size:12px;">Tsu</span>Shi<span style="font-size:12px;">Yo</span>Is<span style="font-size:12px;">、</span>The venue expanded to 600 seats is full<span style="font-size:12px;">、</span>Panel design<span style="font-size:12px;">I</span><br/>
Scution is so successful that it will add another 100 seats.<span style="font-size:12px;">Tsu</span>Ta. Which course is the lecture / seminar?<br/>
Is almost full<span style="font-size:12px;">、</span>The exhibition hall is also visited by a large number of visitors<span style="font-size:12px;">、</span>The whole venue was very lively and a great success<br/>
It was held. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which<br/>
Sessie<span style="font-size:12px;">Yo</span>It was a fulfilling content that made me feel a new direction of architecture and a bright future.<br/>
The report of Archi Future 2019 will be introduced on this site in the future.<br/>
<br/>
<a href="http://www.archifuture.jp/2019/" target="_blank"><p class="image al_center"><img alt="Top page of "Archi Future 2019" official site" height="400" src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " width="600"/></p><p class="caption">Top page of "Archi Future 2019" official site</p></a></p>,
<p class="image al_center"><img alt="Top page of "Archi Future 2019" official site" height="400" src="./img/4/c/4c57dc333a5c9d674ef327289a500800.jpg " width="600"/></p>,
<p class="caption">Top page of "Archi Future 2019" official site</p>,
<p class="al_center" id="superbanner"><a class="banner" href="http://www.archifuture.jp/2019/" id="super-14"><img alt="Archi Future 2019" height="90" src="../img_banner/super/img/9/9/99ae81e84701bf687561a0ca026bdef0.png " width="600"/></a></p>]
Apparently the article is the element at index 2 (counting from 0) among the p tags, so we will extract the text of that element.
Get the article
article = p_tags[2].get_text()
article
Article acquisition result
'The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.\r\n On the day of the event, the number of visitors was 5 compared to the previous time, despite the unfortunate weather of heavy rain and wind..4% increase\r\n5,With 509 people, it was a great event to attract the highest number of visitors in history.\r\n Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors\r\In the panel discussion by 5 people in the n-long class, the venue expanded to 600 seats was full and the panel day\r\n Scushion was so successful that it added 100 more seats. Which course is the lecture / seminar?\r\n was almost full, the exhibition hall was visited by a large number of visitors, and the entire venue was very lively, a great success.\r\It was held n. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which\r\The n-session was also a fulfilling content that made us feel a new direction of architecture and a bright future.\r\The report of nArchi Future 2019 will be introduced on this site in the future.\n\n\u3000 "Archi Future 2019" official site top page'
We're getting closer. Next, let's remove the unnecessary line breaks.
Extract only the text of the article
lines = [line.strip() for line in article.splitlines()]  # Split the tag-free text into lines and strip whitespace
ten_lines_news = lines[0:10]  # Keep the first 10 lines and drop the unneeded rest
ten_lines_news
Contents of 10-line news
['The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in.',
'On the day of the event, the number of visitors was 5 compared to the previous time, despite the unfortunate weather of heavy rain and wind..4% increase',
'5,With 509 people, it was a great event to attract the highest number of visitors in history.',
'Diller Scofidio, a well-known US design firm+Keynote speech by Renfro and current location of major general contractors',
'In the panel discussion by 5 people in the long class, the venue expanded to 600 seats was full, and the panel day',
'Scassion was so successful that it added another 100 seats. Which course is the lecture / seminar?',
'The exhibition hall was almost full, and the exhibition hall was very lively with a large number of visitors.',
'It was held. Special talk 1 between Mr. Okada and Mr. Yamanashi, special talk 2 between Mr. Toyota and Mr. Matsushima, which',
'The session was also fulfilling and made us feel a new direction of architecture and a bright future.',
'The report of Archi Future 2019 will be introduced on this site in the future.']
It worked. The satisfying part is that the list has exactly 10 elements.
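The fixed slice `lines[0:10]` relies on this page having exactly ten lines of body text. As an alternative, one could drop empty lines instead. Below is a minimal sketch on a made-up sample string; the sample text and the helper name `clean_lines` are assumptions for illustration, and on the real text the trailing image caption would still need to be removed separately.

```python
def clean_lines(raw_text):
    """Split text on any line ending, strip each line, and drop empty ones."""
    stripped = (line.strip() for line in raw_text.splitlines())
    return [line for line in stripped if line]


# Made-up sample that mimics the mixed "\r\n" and "\n" endings of the article text
sample = "first line\r\n second line\r\nthird line\n\n\u3000caption"
print(clean_lines(sample))
```

`str.strip()` also removes the full-width space (`\u3000`) that appears in front of the caption on this site.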
Finally, join the lines into a single string.
Combine into one text
ten_lines_news_text = ""
for line in ten_lines_news:
    ten_lines_news_text += line
ten_lines_news_text
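The loop above can also be written with `str.join`, which builds the string in one step. Here is a small sketch with stand-in data (the list contents are placeholders, not the real article lines):

```python
# Placeholder lines standing in for ten_lines_news
sample_lines = ["line one.", "line two.", "line three."]

# Equivalent to appending each line to an empty string in a for loop
joined_text = "".join(sample_lines)
print(joined_text)
```

If a separator between the lines is wanted, for example a space, `" ".join(sample_lines)` would do.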
The real thrill of scraping is getting a lot of information at once. As things stand, though, once we do that there is no way to tell which article a piece of acquired text came from.
So, as a two-pronged approach, I will get the date the article was posted, and also take the number in the article's URL and use it as the article ID. As you can see from the output when we fetched all the p tags at once, the date element has a class assigned. Let's use it to get the date.
Get Post Date
date = soup.select('.page-data')[0].string
date
Acquisition result
'2019.10.28'
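The date comes back as a plain string in `YYYY.MM.DD` form. If it is going to be sorted or compared later, converting it to a date object may be handy. A minimal sketch using only the standard library (the helper name `parse_post_date` is hypothetical):

```python
from datetime import datetime


def parse_post_date(date_text):
    """Convert a 'YYYY.MM.DD' string into a datetime.date."""
    return datetime.strptime(date_text.strip(), "%Y.%m.%d").date()


post_date = parse_post_date("2019.10.28")
print(post_date.isoformat())
```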
What remains is the article ID. For this we use the file-name part of the article page URL "http://www.archifuture-web.jp/headline/457.html", that is, 457. (It's a hassle this time, so let's just write the ID directly.)
Article ID
id = 457
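Instead of hard-coding the ID, it could be taken from the URL itself. The following is a sketch under the assumption that article URLs always end in `<number>.html`, as the one above does; the helper name `article_id_from_url` is hypothetical.

```python
import re
from urllib.parse import urlparse


def article_id_from_url(url):
    """Return the number in a '.../<number>.html' URL, or None if absent."""
    path = urlparse(url).path  # e.g. "/headline/457.html"
    match = re.search(r"/(\d+)\.html$", path)
    return int(match.group(1)) if match else None


print(article_id_from_url("http://www.archifuture-web.jp/headline/457.html"))
```

Returning `None` for URLs that do not match (such as an index page) lets the caller decide how to handle them.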
Let's turn the process created so far into a function.
- Access the page
- Parse the HTML
- Get the p tags
- Get the article
- Extract only the text of the article
- Combine into one text
These steps are combined into one function that takes a URL and returns the text of the 10-line article.
Turn the processing into a function
def get_article(url):
    # Access the page and parse the HTML
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Get the article: the body is the p tag at index 2
    p_tags = soup.select('p')
    article = p_tags[2].get_text()
    lines = [line.strip() for line in article.splitlines()]  # Split into lines and strip whitespace
    ten_lines_news = lines[0:10]  # Keep the first 10 lines and drop the unneeded rest
    # Store in one text string
    ten_lines_news_text = ""
    for line in ten_lines_news:
        ten_lines_news_text += line
    date = soup.select('.page-data')[0].string  # Post date
    id = 457  # Article ID (hard-coded for this page)
    return ten_lines_news_text
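Note that the function computes `date` and `id` but returns only the text. One possible variation, sketched here under assumptions, separates parsing from downloading and returns all three values together. The function name `parse_article`, the returned dictionary layout, and the inline sample HTML are hypothetical; the selectors (the `p` tag at index 2 and `.page-data`) are the ones used in this article. It drops empty lines instead of slicing to ten lines, so on the real page the image caption would still need to be removed.

```python
import re

from bs4 import BeautifulSoup


def parse_article(html, url):
    """Extract id, date and body text from one article page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    # Body text: the p tag at index 2, as in this article
    body = soup.select("p")[2].get_text()
    lines = [line.strip() for line in body.splitlines()]
    text = "".join(line for line in lines if line)
    # Post date: the element with class "page-data"
    date = soup.select(".page-data")[0].get_text()
    # Article ID: the number in the file name at the end of the URL
    match = re.search(r"(\d+)\.html$", url)
    article_id = int(match.group(1)) if match else None
    return {"id": article_id, "date": date, "text": text}


# Made-up miniature page with the same tag order as the real one
sample_html = """
<html><body>
<p><img alt="icon" src="img/icon_new.gif"/></p>
<p class="page-data">2019.10.28</p>
<p>First line<br/>
Second line<br/>
</p>
</body></html>
"""
print(parse_article(sample_html, "http://www.archifuture-web.jp/headline/457.html"))
```

With requests, the call would look like `parse_article(requests.get(url).text, url)`. Keeping the download outside the function makes the parsing step testable without network access.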
Check whether this also runs from a Python script file (.py); if it does, write the following steps in that script file.
It would be a waste to fetch the information and then just throw it away, so let's write it to a csv file. There are several libraries for handling csv data in Python, but this time I will use pandas. It is a library I personally like because it is very useful when working with row-and-column data.
First, let's prepare the csv file to write to (this time I created it with the name txt_data.csv). ~~articles.csv would be more suitable...~~
txt_data.csv
id,date,text
Read and write files using pandas.
Write to csv
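import pandas as pd

csv_file = 'csv/txt_data.csv'
df = pd.read_csv(csv_file)  # Existing rows (only the header at first)
text = get_article(url)  # Text of the 10-line article returned by the function above
results = pd.DataFrame([[id, date, text]], columns=['id', 'date', 'text'])  # One row per article
df = pd.concat([df, results])
df.to_csv(csv_file, index=False)
print("success writing to %s" % csv_file)
CSV data after writing is completed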
id,date,text
457,2019.10.28,The 12th "Archi Future 2019" will be held on October 25th last week.(Money)Was held in. On the day of the event, the number of visitors was 5 compared to the previous time, despite the unfortunate weather of heavy rain and wind..4% increase 5,With 509 people, it was a great event to attract the highest number of visitors in history. Diller Scofidio, a well-known US design firm+The keynote speech by Renfro and the panel discussion by the current location manager class of five major general contractors were full, and the panel discussion was so successful that the number of seats was increased by 100. All of the lectures and seminars were almost full, and the exhibition hall was visited by a large number of visitors, and the entire venue was very lively and was a great success. Every session, including the special dialogue 1 between Mr. Okada and Mr. Yamanashi and the special dialogue 2 between Mr. Toyota and Mr. Matsushima, was fulfilling and made us feel a new direction of architecture and a bright future. The report of Archi Future 2019 will be introduced on this site in the future.
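For simply appending one row, reading the whole file back is not strictly necessary. As an alternative sketch, the standard library `csv` module can append a row directly. This is a swapped-in technique, not the pandas approach used in this article; the helper name `append_row` and the temporary file path are assumptions for illustration.

```python
import csv
import os
import tempfile


def append_row(csv_path, row, header=("id", "date", "text")):
    """Append one row to a CSV file, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)


# Demonstration in a temporary directory so that no real file is touched
demo_path = os.path.join(tempfile.mkdtemp(), "txt_data.csv")
append_row(demo_path, [457, "2019.10.28", "sample text, with a comma"])
with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

The `csv` module quotes fields that contain commas, which matters for article text. This sketch assumes that an existing file ends with a newline; otherwise the appended row would be fused onto the last line.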
Now the acquired information can be saved as csv data. Next time, I will explain how to get all the articles posted so far and save their text data.
Qiita: A general-purpose method for extracting only characters by scraping Python