Scrapy rules do not call parsing method

I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL that matches the pattern /example/.*

Crawling works, but scraping the title does not: the output file is empty. Most likely I got the rules wrong. Is this the right syntax for the rules to achieve what I am looking for?

items.py

import scrapy

class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

spider.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem


class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']

    rules = (
        Rule(
            LinkExtractor(),
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href

crawl: scrapy crawl getbid -o 012916.csv

asked Nov 22 at 19:12 by merlin
edited Nov 22 at 19:45 by stranac


Tags: python, scrapy, scrapy-spider


1 Answer

From the CrawlSpider docs:

If multiple rules match the same link, the first one will be used,
according to the order they’re defined in this attribute.

Since your first rule will match all links, it will always be used and all other rules will be ignored.
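To make the first-match behaviour concrete, here is a small standard-library sketch that imitates it. It is not Scrapy's actual code: `assign_rules`, the rule names, and the sample URLs are made up for illustration, and the patterns are applied with `re.search`, with an empty pattern standing in for a `LinkExtractor()` that has no `allow` filter.

```python
import re


def assign_rules(links, rules):
    """Map each link to the name of the first rule whose pattern matches it."""
    seen = set()
    assigned = {}
    for name, pattern in rules:      # rules are tried in definition order
        for link in links:
            if link in seen:
                continue             # already claimed by an earlier rule
            if re.search(pattern, link):
                seen.add(link)
                assigned[link] = name
    return assigned


# Hypothetical links found on a crawled page.
links = ["https://www.domain.de/about", "https://www.domain.de/example/42"]

# Same order as in the question: the catch-all pattern comes first.
original = [("follow_all", r""), ("parse_item", r"example/.*")]
# Same rules with the specific pattern first.
swapped = list(reversed(original))

print(assign_rules(links, original))
print(assign_rules(links, swapped))
```

With the original order, both links end up with the catch-all rule, so nothing is ever handed to `parse_item`. With the order reversed, the /example/ link is claimed by the rule that has the callback.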

Fixing the problem is as simple as switching the order of the rules.

answered Nov 22 at 19:52 by stranac
