A Python super-beginner tries web scraping

What is scraping?

The word "scraping" is used loosely for two different things: "crawling" and "scraping". I kept mixing them up, so let me sort them out first.

Crawling
Following links on published web pages and downloading the pages they point to
Scraping
Extracting (part of) the information you want from the downloaded web pages
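
To make the distinction concrete, here is a minimal sketch that uses only Python's standard library rather than Scrapy. The HTML string, the example.com URLs, and the LinkAndTitleParser class are all made up for illustration. Collecting the links to visit next is the crawling half; pulling the title text out of the page is the scraping half.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# A made-up page standing in for something that was already downloaded.
PAGE_URL = "http://example.com/players/"
PAGE_HTML = """
<html><head><title>Player list</title></head>
<body>
  <a href="/players/1/">Player 1</a>
  <a href="detail/2/">Player 2</a>
</body></html>
"""


class LinkAndTitleParser(HTMLParser):
    """Hypothetical helper: collects link targets and the <title> text."""

    def __init__(self):
        super().__init__()
        self.links = []   # crawling: where to go next
        self.title = ""   # scraping: the piece of data we want
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(PAGE_URL, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


parser = LinkAndTitleParser()
parser.feed(PAGE_HTML)
print("scraped title:", parser.title)
print("links to crawl next:", parser.links)
```

Scrapy takes care of both halves for you, which is why it comes up next.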

So, for example, pulling my favorite shogi player's titles out of the Japan Shogi Association's pages is the kind of thing that counts as scraping.

Scrapy

Let's try scraping. Come to think of it, I've only used PHP so far, so I've been extracting the information I wanted from pages with Goutte and the like.

Then I learned that Python, which I recently started using, has a library (framework?) called Scrapy that makes scraping very easy.

So this time I'll use it to collect information about my favorite shogi players from the Japan Shogi Association's pages.

Installation

$ pip install scrapy

Done.

Tutorial

Well, I'm a super-beginner who really doesn't understand Python at all, so I'll go through the tutorial step by step to get a feel for it.

The documentation has a tutorial section. https://docs.scrapy.org/en/latest/intro/tutorial.html

It's in English, but it's perfectly manageable.

The steps described in the tutorial:

1. Create a new Scrapy project
2. Write a spider to crawl a site and extract the data you need
3. Export the extracted data from the command line
4. Change the spider to follow links (I couldn't quite parse the English)
5. Use spider arguments

I'll work through them in this order.

1. Create a new Scrapy project

scrapy startproject tutorial

That should do it.

[vagrant@localhost test]$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/lib64/python3.5/site-packages/scrapy/templates/project', created in:
    /home/vagrant/test/tutorial

You can start your first spider with:
    cd tutorial
    scrapy genspider example example.com
    
[vagrant@localhost test]$ ll
total 0
drwxr-xr-x 3 vagrant vagrant 38 Apr 16 04:15 tutorial

A directory called tutorial was created!

There are several things inside it; according to the documentation, each file has the following role.

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

I didn't understand anything except the deploy configuration file, lol

2. Write a spider to crawl a site and extract the data you need

Create a file called quotes_spider.py under tutorial/spiders/. The tutorial provides code to copy and paste, so I'll use that.

[vagrant@localhost tutorial]$ vi tutorial/spiders/quotes_spider.py

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

name
The spider's identifier. It has to be unique within a project.
start_requests()
Defines where the crawl starts. It has to return an iterable of Requests (a list, or a generator as in the code above).
parse()
The callback that gets called with the response once each page has been downloaded.
Its second argument, response, is an instance of TextResponse.
It has methods for extracting elements from the page by specifying them with selectors such as css and xpath.
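
To convince myself of when parse() gets called, here is a toy model of the flow written with only the standard library. It is not how Scrapy is implemented (the real engine is asynchronous and performs real downloads). ToyRequest, ToySpider, fake_download, and run are names invented for this sketch. It only shows the shape: start_requests() yields requests one at a time, and each request's callback is then called with the downloaded result.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class ToyRequest:
    """Invented stand-in for scrapy.Request: a URL plus a callback."""
    url: str
    callback: Callable[[str, str], None]


class ToySpider:
    name = "toy"

    def __init__(self):
        self.seen = []

    def start_requests(self) -> Iterator[ToyRequest]:
        # A generator: each `yield` hands one request to the caller.
        for page in (1, 2):
            url = "http://quotes.toscrape.com/page/%d/" % page
            yield ToyRequest(url=url, callback=self.parse)

    def parse(self, url: str, body: str) -> None:
        # Called once per "downloaded" page.
        self.seen.append((url, len(body)))


def fake_download(url: str) -> str:
    """Pretend download: returns a placeholder body, no network access."""
    return "<html>body of %s</html>" % url


def run(spider: ToySpider) -> None:
    """Toy engine loop: pull a request, 'download' it, call its callback."""
    for request in spider.start_requests():
        body = fake_download(request.url)
        request.callback(request.url, body)


spider = ToySpider()
run(spider)
for url, size in spider.seen:
    print(url, size)
```

Because start_requests() is a generator, nothing is built up front; the loop in run() asks for the next request only when it is ready for it.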

3. Export the extracted data from the command line

scrapy crawl quotes

Running this seems to do it.

After some log output scrolled by, quotes-1.html and quotes-2.html had been created.

[vagrant@localhost tutorial]$ ll
total 32
-rw-rw-r-- 1 vagrant vagrant 11053 Apr 16 04:27 quotes-1.html
-rw-rw-r-- 1 vagrant vagrant 13734 Apr 16 04:27 quotes-2.html
-rw-r--r-- 1 vagrant vagrant   260 Apr 16 04:15 scrapy.cfg
drwxr-xr-x 4 vagrant vagrant   129 Apr 16 04:15 tutorial

I titled this step "Export the extracted data from the command line", but when I actually looked at the parse method, it was only doing the following:

- Extract the page-number part of the crawled URL
- Substitute that number for the %s in quotes-%s.html
- Finally, write the body of the response (TextResponse) to that file and save it
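
The first two of those steps are plain string handling, so they can be tried without Scrapy at all. A small sketch, where filename_for is a helper name I made up to wrap the two lines from parse():

```python
def filename_for(url):
    """Mirror the first two lines of parse() for a given URL string."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


for url in ("http://quotes.toscrape.com/page/1/",
            "http://quotes.toscrape.com/page/2/"):
    print(url, "->", filename_for(url))
```

Note that this relies on the trailing slash: for "http://quotes.toscrape.com/page/1" (no slash at the end), index -2 would be "page" instead of the number.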

A shorter way to write start_requests

In the end, that method only yields scrapy.Request objects, and it seems the same thing can be achieved by simply defining start_urls.

    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

This works without having to define a start_requests method at all.
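
As far as I understand the tutorial, this works because the base Spider class already has a default start_requests() that makes one request per entry in start_urls and uses parse as the callback. The sketch below models just that idea in plain Python. ToyBaseSpider and QuotesToySpider are invented names, and the (url, callback) tuples stand in for real Request objects.

```python
class ToyBaseSpider:
    """Invented stand-in for scrapy.Spider, reduced to one default method."""
    start_urls = []

    def start_requests(self):
        # The default being modelled: one request per start URL,
        # each to be handled by self.parse.
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, url):
        raise NotImplementedError


class QuotesToySpider(ToyBaseSpider):
    # Only the URLs are declared; start_requests() is inherited.
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ]

    def parse(self, url):
        return "parsed " + url


spider = QuotesToySpider()
results = [callback(url) for url, callback in spider.start_requests()]
print(results)
```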

Finally, try extracting data

The tutorial says that the best way to learn how Scrapy extracts data is to try it out in the "Scrapy shell".

I'll try it right away.

[vagrant@localhost tutorial]$ scrapy shell 'http://quotes.toscrape.com/page/1/'

...(omitted)...

2017-04-16 04:36:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbb13dd0080>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fbb129308d0>
[s]   spider     <DefaultSpider 'default' at 0x7fbb11f14828>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

First, let's extract elements using css and see what we get.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

Oh, it looks like the title element was extracted.

Calling response.css(xxx) returns a SelectorList, a list-like object of selectors that wrap the XML/HTML elements. That means you can run further queries on it to extract more data. As a trial, let's extract the text of the title.

>>> response.css('title::text').extract()
['Quotes to Scrape']
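
The difference between getting an element and getting only its text can be seen outside Scrapy too. The following is a stand-in sketch using the standard library's xml.etree.ElementTree, not Scrapy's selectors. It only works because the made-up PAGE string is well-formed XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the downloaded page.
PAGE = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"

root = ET.fromstring(PAGE)
title = root.find(".//title")

# The whole element, tags included (roughly what 'title' selects).
whole = ET.tostring(title, encoding="unicode")
# Only the text inside it (roughly what 'title::text' selects).
text_only = title.text

print(whole)
print(text_only)
```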

::text means that only the text inside the <title> tag is extracted.
If you leave it off, you get the <title> tag itself as well.

>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']

You can see that the whole <title> tag is returned.

Get a single element

Extraction goes through a SelectorList (https://docs.scrapy.org/en/latest/topics/selectors.html#scrapy.selector.SelectorList), so what comes back is basically a list. (That's why everything above was wrapped in "[]".)

If you want one specific item, either index into the list or get the first element with extract_first.

- Use extract_first

>>> response.css('title::text').extract_first()
'Quotes to Scrape'

- Index into the list

>>> response.css('title::text')[0].extract()
'Quotes to Scrape'

# There is only one title on this page, so asking for a second one raises an error
>>> response.css('title::text')[1].extract()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python3.5/site-packages/parsel/selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range

Extract using xpath

What is XPath? I wondered, but @merrill's article explained it very clearly.

http://qiita.com/merrill/items/aa612e6e865c1701f43b

It lets you specify things like "the a tag inside the fourth td of the tbody" in the HTML.

Using it right away on this example looks like this:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

Try extracting more

Let's extract the quote text and the author from http://quotes.toscrape.com/page/1/, the page being scraped now.

[Image: screenshot of the page, 2017-04-16, https://qiita-image-store.s3.amazonaws.com/0/23276/c047e05f-dd6c-b813-9c2a-78bbacc49db1.png]

First, put the first div into a variable called quote

>>> quote = response.css("div.quote")[0]

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

The quote text was extracted successfully.

- Next, try the author

>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

It's ridiculously easy.

- Try getting the list of tags

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

They come back properly as a list.

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
...
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}

Try this in the spider instead of the shell

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

I rewrote it like this and ran it.

[vagrant@localhost tutorial]$ scrapy crawl quotes
2017-04-16 05:27:09 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tutorial)
2017-04-16 05:27:09 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'BOT_NAME': 'tutorial', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True}

...(omitted)...

{'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'tags': ['abilities', 'choices']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein', 'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Marilyn Monroe', 'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'tags': ['be-yourself', 'inspirational']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'Albert Einstein', 'text': '“Try not to become a man of success. Rather become a man of value.”', 'tags': ['adulthood', 'success', 'value']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'tags': ['life', 'love']}
2017-04-16 05:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>

...(omitted)...

There's a lot of output, but it looks like the data was extracted.

Write it to a file and take a look

[vagrant@localhost tutorial]$ scrapy crawl quotes -o result.json

Let's look at the result

[vagrant@localhost tutorial]$ cat result.json
[
{"tags": ["change", "deep-thoughts", "thinking", "world"], "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein"},
{"tags": ["abilities", "choices"], "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling"},
{"tags": ["inspirational", "life", "live", "miracle", "miracles"], "text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein"},
{"tags": ["aliteracy", "books", "classic", "humor"], "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"tags": ["be-yourself", "inspirational"], "text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe"},
{"tags": ["adulthood", "success", "value"], "text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein"},
{"tags": ["life", "love"], "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide"},
{"tags": ["edison", "failure", "inspirational", "paraphrased"], "text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison"},
{"tags": ["misattributed-eleanor-roosevelt"], "text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt"},
{"tags": ["humor", "obvious", "simile"], "text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe"},
{"tags": ["courage", "friends"], "text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling"},
{"tags": ["simplicity", "understand"], "text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein"},
{"tags": ["love"], "text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley"},
{"tags": ["fantasy"], "text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss"},
{"tags": ["life", "navigation"], "text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams"},
{"tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"], "text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel"},
{"tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"], "text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche"},
{"tags": ["books", "contentment", "friends", "friendship", "life"], "text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain"},
{"tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"], "text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders"}
]

Easy-peasy!!! Way too easy, lol

4. Change the spider to follow links (I couldn't quite parse the English)

So far I've listed every URL to visit directly in start_urls. More typically, though, you'll want to follow particular links on a page and recursively collect the data you're after.

In that case, the approach seems to be to get the link's URL and yield a new request whose callback is your own parse method.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Something like this. If there is a next_page, the same parse runs again on that page.

response.urljoin appears to turn the relative href into an absolute URL that can be crawled.

Crawl some more and play around

Each quote on http://quotes.toscrape.com has a link next to the author's name, and the tutorial shows how to follow it to get more information.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        # Get the links to the author detail pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_author)

        # Get the pagination link
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        # Extract the first match for the query from the response and strip() it (similar to trim)
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Done this way:

1. Follow each author link and run parse_author (extract the name, birth date, and bio)
2. If pagination exists, run parse again on the next page
3. Repeat until there is no more pagination

It's amazing that this much can be written in a few dozen lines...

5. Use spider arguments

I couldn't figure out how to use them, so I skipped this.

Summary

- Create a project with scrapy
- Write what you want to do in spiders
- Crawling by following links is also possible
- Extraction is super easy

Note: Unicode-escaped, unreadable output

When I export to JSON with -o, non-ASCII strings are written as Unicode escapes and can't be read. This can be fixed by adding the line FEED_EXPORT_ENCODING = 'utf-8' to [project_name]/settings.py.

Bonus

I made something that scrapes data about shogi players.

What I did

- Start from the Japan Shogi Association's player list
- Follow the links to the detail pages
- Extract the name, date of birth, and mentor

The actual code looks like this (it's easy, lol)

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "kisi"
    start_urls = [
        'https://www.shogi.or.jp/player/',
    ]

    def parse(self, response):
        # Get the links to the player detail pages
        for href in response.css("p.ttl a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_kisi)

    def parse_kisi(self, response):
        def extract_with_xpath(query):
            return response.xpath(query).extract_first().strip()

        yield {
            'name': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/div/div/h1/span[1]/text()'),
            'birth': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/table/tbody/tr[2]/td/text()'),
            'sisho': extract_with_xpath('//*[@id="contents"]/div[2]/div/div[2]/table/tbody/tr[4]/td/text()'),
        }

Result

[vagrant@localhost tutorial]$ head kisi.json
[
{"name": "Akira Watanabe", "birth": "April 23, 1984 (age 32)", "sisho": "Kazuharu Toshiji 7-dan"},
{"name": "Masahiko Urano", "birth": "March 14, 1964 (age 53)", "sisho": "(the late) Nakai Ryukichi 8-dan"},
{"name": "Masaki Izumi", "birth": "January 11, 1961 (age 56)", "sisho": "Sekine Shigeru 9-dan"},
{"name": "Koji Tosa", "birth": "March 30, 1955 (age 62)", "sisho": "(the late) Shizuo Kiyono 8-dan"},
{"name": "Hiroshi Kamiya", "birth": "April 21, 1961 (age 55)", "sisho": "(the late) Hisao Hirotsu 9-dan"},
{"name": "Kensuke Kitahama", "birth": "December 28, 1975 (age 41)", "sisho": "Masayu Saeki 9-dan"},
{"name": "Chikara Akutsu", "birth": "June 24, 1982 (age 34)", "sisho": "Seiichiro Taki 8-dan"},
{"name": "Takayuki Yamazaki", "birth": "February 14, 1981 (age 36)", "sisho": "Nobuo Mori 7-dan"},
{"name": "Akihito Hirose", "birth": "January 18, 1987 (age 30)", "sisho": "Katsuura Shu 9-dan"},

You can see that everyone's data was retrieved correctly. It really is simple.

What I want to do in the future

- Start from a specific page
- Specify search conditions
- Extract the search results according to rules

I'll write an article if I manage it. (Well, I still don't really understand yield, I can't debug, and I need to study Python.)