Web scraping python not returning any content
I am trying to web scrape from "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq". Specifically, under the div with class = "socrata-table frozen-columns", I want all of the data column names and data column descriptions. However, the code that I've written doesn't seem to be working (it's not returning anything).
import requests
from bs4 import BeautifulSoup

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')

for col in soup.find_all("div", attrs={"class": "socrata-visualization-container loaded"})[0:1]:
    for tr in col.find_all("div", attrs={"class": "socrata-table frozen-columns"}):
        for data in tr.find_all("div", attrs={"class": "column-header-content"}):
            print(data.text)
Is my code wrong?
python web-scraping beautifulsoup
asked Nov 25 '18 at 18:06 by judebox
3 Answers
The page is loaded dynamically and the data set is paged, which would mean using browser automation to retrieve it, and that is slow. There is an API you can use instead, with arguments that let you return results in batches.

Read the API documentation here. This is going to be a much more efficient and reliable way of retrieving the data. Use the limit parameter to determine the number of records retrieved at a time, and the offset parameter to set the starting point of the next batch of records. Example call here.

As it is a query, you can tailor the other parameters as you would a SQL query to retrieve the desired result set. This also means you can write a very quick initial query to return the record count from the database, which you can use to determine the end point for your batch requests. You could write a class-based script that uses multiprocessing to grab these batches more efficiently.
import requests
import pandas as pd
from pandas.io.json import json_normalize
response = requests.get('https://data.lacity.org/api/id/y8tr-7khq.json?$select=`dr_no`,`date_rptd`,`date_occ`,`time_occ`,`area_id`,`area_name`,`rpt_dist_no`,`crm_cd`,`crm_cd_desc`,`mocodes`,`vict_age`,`vict_sex`,`vict_descent`,`premis_cd`,`premis_desc`,`weapon_used_cd`,`weapon_desc`,`status`,`status_desc`,`crm_cd_1`,`crm_cd_2`,`crm_cd_3`,`crm_cd_4`,`location`,`cross_street`,`location_1`&$order=`date_occ`+DESC&$limit=100&$offset=0')
data = response.json()
data = json_normalize(data)
df = pd.DataFrame(data)
print(df)
Example record in JSON response: (screenshot not captured in this copy)

answered Nov 25 '18 at 18:45 by QHarr (edited Nov 25 '18 at 19:17)
Oh whoa, this is so much easier. I think I'll change to using the API, thank you so much. May I ask how the offset parameter works? The dataset has over 1.8 million rows; if I limit each batch to 100,000, does that mean offset = 18?
– judebox
Nov 25 '18 at 18:57
Thanks again. I've accepted your answer even though it wasn't a direct answer to my original question. I will read up on the API, and I think I understand how the offset works (i.e. the starting point). Unfortunately, I am still incapable of writing a class-based script.
– judebox
Nov 25 '18 at 19:14
Check the documentation. My guess is it specifies the start position for record retrieval. So, if you have retrieved 1000 records in the first run, you specify 1001 to start at the next record. You should verify this, but that is often the logic.
– QHarr
Nov 25 '18 at 19:14
You can always write a loop to process in batches. You can loop over a list (of lists?) that contains the various parameters you insert into your main request URL. I will update the answer with what I should have specified re dynamic loading.
– QHarr
Nov 25 '18 at 19:16
Yep, offset is the starting point for the next batch. Thanks again!
– judebox
Nov 25 '18 at 19:24
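To make the limit/offset batching discussed above concrete, here is a minimal sketch of a paging loop against the same endpoint. It assumes the endpoint supports the standard SODA count(*) aggregate and the $limit/$offset parameters shown in the answer's example URL; the batch size, alias name, and the choice of dr_no as the ordering field are illustrative, not a definitive implementation.

import requests

BASE = 'https://data.lacity.org/api/id/y8tr-7khq.json'

# Assumed: the standard SODA count(*) aggregate (with an "AS" alias) is
# available on this endpoint; the value comes back as a string.
total = int(requests.get(BASE, params={'$select': 'count(*) AS n'}).json()[0]['n'])

limit = 50000                     # illustrative batch size
records = []
for offset in range(0, total, limit):
    # offset counts records, not batches: 0, 50000, 100000, ...
    resp = requests.get(BASE, params={'$limit': limit,
                                      '$offset': offset,
                                      '$order': 'dr_no'})  # stable order while paging
    resp.raise_for_status()
    records.extend(resp.json())

print(len(records))

Once this basic paging works, the same loop could be wrapped in a function or in the class-based, multiprocessed script the answer mentions.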
If you look at the page source (Ctrl + U), you'll notice that there is no such element as <div class="socrata-table frozen-columns">. That's because the content you want to scrape is added to the page dynamically. Check out these questions: web scraping dynamic content with python or Web scraping a website with dynamic javascript content.

answered Nov 25 '18 at 18:17 by Adrian
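As a quick way to confirm this yourself, you can fetch the page with requests and search the raw HTML for the class name from the question; it will most likely not be there, because the grid is rendered client-side. A minimal sketch, using the URL from the question:

import requests

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
html = requests.get(url).text

# The table markup is built by JavaScript after load, so the class is
# unlikely to appear anywhere in the server-rendered HTML.
print("socrata-table frozen-columns" in html)   # expected: False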
This is because the data is filled in dynamically by React after the page loads. If you download the page via requests, you can't see the data. You need to use the Selenium WebDriver to open the page and let all the JavaScript run; then you can get the data you expect.

answered Nov 25 '18 at 18:21 by Vishnudev
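A minimal Selenium sketch along those lines, assuming a Chrome driver is installed and that the class names from the question ("socrata-table frozen-columns" and "column-header-content") are still what the rendered grid uses:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq")
    # Wait for the JavaScript-rendered grid to appear before touching the DOM.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.socrata-table.frozen-columns"))
    )
    for header in driver.find_elements(By.CSS_SELECTOR, "div.column-header-content"):
        print(header.text)
finally:
    driver.quit()

Note that browser automation is much slower than the API approach in the accepted answer, so it is only worth it if you specifically need the rendered page.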