Website allows robots in robots.txt, but my scraper is being identified and rejected

I need to scrape a website that allows robot access. Below is the content of its robots.txt file.

User-agent: *

Disallow:

Sitemap:https://www.sample.com/sitemap-index.xml

But when I try to fetch the website's content using Nokogiri, the request is detected and blocked.

Nokogiri::HTML(URI.open('https://www.sample.com/search?q=test', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE))

Here is the output:

> (Document:0x3fda40e7cf70 {
  name = "document",
  children = [
    #(DTD:0x3fda40e9591c { name = "html" }),
    #(Element:0x3fda40e8c95c {
      name = "html",
      attributes = [ #(Attr:0x3fda4071a598 { name = "style", value = "height:100%" })],
      children = [
        #(Element:0x3fda3fefa28c {
          name = "head",
          children = [
            #(Element:0x3fda401a3088 {
              name = "meta",
              attributes = [ #(Attr:0x3fda40ebd7a0 { name = "name", value = "ROBOTS" }), #(Attr:0x3fda40ebd778 { name = "content", value = "NOINDEX, NOFOLLOW" })]
              }),
            #(Element:0x3fda4074faf4 {
              name = "meta",
              attributes = [ #(Attr:0x3fda3ff0beec { name = "name", value = "format-detection" }), #(Attr:0x3fda3ff0bed8 { name = "content", value = "telephone=no" })]
              }),
            #(Element:0x3fda401ca700 {
              name = "meta",
              attributes = [ #(Attr:0x3fda401c2050 { name = "name", value = "viewport" }), #(Attr:0x3fda401c217c { name = "content", value = "initial-scale=1.0" })]
              }),
            #(Element:0x3fda4079a284 {
              name = "meta",
              attributes = [ #(Attr:0x3fda4078bfb8 { name = "http-equiv", value = "X-UA-Compatible" }), #(Attr:0x3fda4078bf04 { name = "content", value = "IE=edge,chrome=1" })]
              })]
          }),
        #(Element:0x3fda407e2e6c {
          name = "body",
          attributes = [ #(Attr:0x3fda430205f0 { name = "style", value = "margin:0px;height:100%" })],
          children = [
            #(Element:0x3fda4072e2a0 {
              name = "iframe",
              attributes = [
                #(Attr:0x3fda3ff45214 {
                  name = "src",
                  value = "/_Incapsula_Resource?SWUDNSAI=28&xinfo=5-66719320-0%200NNN%20RT%281543054979096%20247%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U2&incident_id=245000650118470008-256430953704260629&edet=12&cinfo=04000000"
                  }),
                #(Attr:0x3fda3ff451d8 { name = "frameborder", value = "0" }),
                #(Attr:0x3fda3ff451b0 { name = "width", value = "100%" }),
                #(Attr:0x3fda3ff45188 { name = "height", value = "100%" }),
                #(Attr:0x3fda3ff45174 { name = "marginheight", value = "0px" }),
                #(Attr:0x3fda3ff4514c { name = "marginwidth", value = "0px" })],
              children = [ #(Text "Request unsuccessful. Incapsula incident ID: 245000650118470008-256430953704260629")]
              })]
          })]
      })]
  })

How can I achieve this web scraping?

asked Nov 24 '18 at 10:45 by Ramiz Raja

edited Nov 24 '18 at 15:34 by lacostenycoder

Nokogiri has zero to do with robots.txt. Your issue is bypassing Incapsula's DDoS protection, which it's unlikely anyone here will want to help you with.

– pguardiario
Nov 24 '18 at 12:20
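
To make the distinction in this comment concrete, here is a minimal sketch of what the robots.txt from the question actually says. It assumes a deliberately simplified reader: the helper names `disallowed_paths` and `allowed?` are made up for illustration, and grouping of several User-agent lines and wildcard patterns are ignored. An empty Disallow permits every path, but the file is only advisory for crawlers; it has no bearing on a separate filter such as Incapsula.

```ruby
# Simplified robots.txt reader (illustrative only).
# Collects the non-empty Disallow prefixes that apply to all agents ("*").
def disallowed_paths(robots_txt)
  applies = false
  paths = []
  robots_txt.each_line do |line|
    field, value = line.split(':', 2).map { |part| part.to_s.strip }
    case field.to_s.downcase
    when 'user-agent'
      applies = (value == '*')
    when 'disallow'
      # An empty Disallow value means "nothing is disallowed".
      paths << value if applies && !value.to_s.empty?
    end
  end
  paths
end

# True when no Disallow prefix matches the given path.
def allowed?(robots_txt, path)
  disallowed_paths(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# The robots.txt content quoted in the question.
ROBOTS = <<~TXT
  User-agent: *
  Disallow:
  Sitemap:https://www.sample.com/sitemap-index.xml
TXT

puts allowed?(ROBOTS, '/search?q=test')
```

So the search URL is permitted as far as robots.txt is concerned, and the rejection has to come from somewhere else.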

This could be caused by a whole host of things. I can only hypothesise why your request is being blocked, and how you may try to bypass it.

– Tom Lord
Nov 24 '18 at 15:08

For example, perhaps: (1) After making too many requests, their server has blacklisted your IP. You could try using another host, or a proxy. (2) Similarly, perhaps they're throttling you. Try making fewer requests. (3) Perhaps you need to set a User-Agent header, or a Cookie header, to "trick" their server into thinking you're a human. (4) Perhaps you don't need to scrape this site in the first place. Do they have an API? ...

– Tom Lord
Nov 24 '18 at 15:11

Assuming you do need to achieve whatever it is you're doing via web scraping, and you're not violating their terms of use, I would start by investigating what's different between your browser's web request (which presumably works) and Ruby's web request. In particular, check whether you can get past the filter with appropriate headers or a different IP.

– Tom Lord
Nov 24 '18 at 15:12
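
One way to start the comparison suggested above is sketched below with Ruby's standard Net::HTTP. The helper names (`build_request`, `fetch`, `incapsula_block?`) and the example header value are assumptions for illustration, not something from the original post. The block-page markers are the two strings visible in the output in the question. The sketch only sends explicit headers and reports whether the block page came back; it does not guarantee the request will be accepted.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a GET request carrying explicit headers,
# so they can be compared with what the browser sends.
def build_request(url, headers = {})
  request = Net::HTTP::Get.new(URI.parse(url))
  headers.each { |name, value| request[name] = value }
  request
end

# Hypothetical helper: send the request and return the Net::HTTP response.
def fetch(url, headers = {})
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(build_request(url, headers))
  end
end

# True when the body looks like the Incapsula block page shown in the question.
def incapsula_block?(body)
  body.include?('_Incapsula_Resource') || body.include?('Incapsula incident ID')
end

# Example usage (performs a real request, so it is left commented out;
# the User-Agent value is a placeholder):
#
#   response = fetch('https://www.sample.com/search?q=test',
#                    'User-Agent' => 'my-scraper/0.1 (contact@example.com)')
#   puts 'blocked by Incapsula' if incapsula_block?(response.body)
```

Checking the body before handing it to Nokogiri makes a block obvious, instead of it surfacing later as a document that contains nothing but an iframe.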

Thanks @TomLord and pguardiario. I will try your suggestions.

– Ramiz Raja
Nov 25 '18 at 8:09


ruby-on-rails ruby web-scraping nokogiri

