Scrapy rules do not call parsing method

I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL that matches the pattern /example/.*

Crawling works, but scraping the title does not: the output file is empty. Most likely I got the rules wrong. Is this the right syntax for the rules to achieve what I am looking for?

items.py

    import scrapy


    class BidItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()

spider.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from bid.items import BidItem


    class GetbidSpider(CrawlSpider):
        name = 'getbid'
        allowed_domains = ['domain.de']
        start_urls = ['https://www.domain.de/']

        rules = (
            Rule(
                LinkExtractor(),
                follow=True
            ),
            Rule(
                LinkExtractor(allow=['example/.*']),
                callback='parse_item'
            ),
        )

        def parse_item(self, response):
            href = BidItem()
            href['url'] = response.url
            href['title'] = response.css("h1::text").extract()
            return href

crawl: scrapy crawl getbid -o 012916.csv

Tags: python, scrapy, scrapy-spider

asked Nov 22 at 19:12 by merlin
edited Nov 22 at 19:45 by stranac

1 Answer

From the CrawlSpider docs:

"If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute."

Since your first rule matches all links, it is always the one used, and every other rule is ignored.

Fixing the problem is as simple as switching the order of the rules.
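To make the first-match behaviour concrete, here is a minimal sketch in plain Python. It does not use Scrapy itself: `SketchRule` and `pick_callback` are hypothetical stand-ins that only model "the first matching rule wins", and the URL is made up for illustration. It assumes, as with LinkExtractor, that an empty `allow` matches every URL and that patterns are applied as regex searches.

```python
import re
from typing import NamedTuple, Optional, Sequence


class SketchRule(NamedTuple):
    """Hypothetical stand-in for a Scrapy Rule: URL patterns plus a callback name."""
    allow: Sequence[str]       # regexes; an empty sequence matches every URL
    callback: Optional[str]    # name of the parsing method, or None


def pick_callback(url: str, rules: Sequence[SketchRule]) -> Optional[str]:
    """Return the callback of the first rule that matches the URL."""
    for rule in rules:
        if not rule.allow or any(re.search(p, url) for p in rule.allow):
            return rule.callback  # first match wins; later rules are never tried
    return None


# The two rules from the question, reduced to pattern + callback.
follow_all = SketchRule(allow=(), callback=None)
parse_example = SketchRule(allow=(r"example/.*",), callback="parse_item")

url = "https://www.domain.de/example/page-1"  # made-up URL on the question's domain

original_order = pick_callback(url, [follow_all, parse_example])
swapped_order = pick_callback(url, [parse_example, follow_all])

print(original_order)  # None
print(swapped_order)   # parse_item
```

In the spider, the equivalent change is to list the Rule with callback='parse_item' before the catch-all Rule. Note that a Rule with a callback does not follow links from the pages it matches unless follow=True is passed as well; whether that matters depends on how the site links its pages.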






answered Nov 22 at 19:52 by stranac