Web scraping python not returning any content
I am trying to web scrape from "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq". Specifically, under the div with class = "socrata-table frozen-columns", I want all of the data column names and data column descriptions. However, the code that I've written doesn't seem to be working (it's not returning anything).
import requests
from bs4 import BeautifulSoup

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')

for col in soup.find_all("div", attrs={"class": "socrata-visualization-container loaded"})[0:1]:
    for tr in col.find_all("div", attrs={"class": "socrata-table frozen-columns"}):
        for data in tr.find_all("div", attrs={"class": "column-header-content"}):
            print(data.text)
Is my code wrong?
python web-scraping beautifulsoup
asked Nov 25 '18 at 18:06 by judebox
3 Answers
The page is loaded dynamically and the data set is paged, which would mean using browser automation to retrieve it, and that is slow. There is an API you can use instead, with arguments that let you return results in batches.

Read the API documentation here. This is going to be a much more efficient and reliable way of retrieving the data. Use the limit parameter to determine the number of records retrieved at a time, and the offset parameter to set the starting point of the next batch of records. Example call here.

As it is a query, you can tailor the other parameters as you would a SQL query to retrieve the desired result set. This also means you can write a very quick initial query to return the record count from the database, which you can use to determine the end point for your batch requests. You could write a class-based script that uses multiprocessing to grab these batches more efficiently.
import requests
import pandas as pd
from pandas.io.json import json_normalize
response = requests.get('https://data.lacity.org/api/id/y8tr-7khq.json?$select=`dr_no`,`date_rptd`,`date_occ`,`time_occ`,`area_id`,`area_name`,`rpt_dist_no`,`crm_cd`,`crm_cd_desc`,`mocodes`,`vict_age`,`vict_sex`,`vict_descent`,`premis_cd`,`premis_desc`,`weapon_used_cd`,`weapon_desc`,`status`,`status_desc`,`crm_cd_1`,`crm_cd_2`,`crm_cd_3`,`crm_cd_4`,`location`,`cross_street`,`location_1`&$order=`date_occ`+DESC&$limit=100&$offset=0')
data = response.json()
data = json_normalize(data)
df = pd.DataFrame(data)
print(df)
Example record in JSON response: (screenshot not captured in this copy)

answered Nov 25 '18 at 18:45 by QHarr (edited Nov 25 '18 at 19:17)
Oh whoa, this is so much easier. I think I'll change to using the API, thank you so much. May I ask how the offset parameter works? The dataset has over 1.8 million rows; if I limit each batch to 100,000, does that mean offset = 18?
– judebox
Nov 25 '18 at 18:57
Thanks again. I've accepted your answer even though it wasn't a direct answer to my original question. I will read up on the API, and I think I understand how the offset works (i.e. the starting point). Unfortunately, I am still incapable of writing a class-based script.
– judebox
Nov 25 '18 at 19:14
Check the documentation. My guess is it specifies the start position for record retrieval. So, if you have retrieved 1000 records in the first run, you specify 1001 to start at the next record. You should verify this, but that is often the logic.
– QHarr
Nov 25 '18 at 19:14
You can always write a loop to process in batches. You can loop over a list (of lists?) that contains the various parameters you insert into your main request URL. I will update the answer with what I should have specified re dynamic loading.
– QHarr
Nov 25 '18 at 19:16
Yep, offset is the starting point for the next batch. Thanks again!
– judebox
Nov 25 '18 at 19:24
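To make the limit/offset batching discussed above concrete, here is a minimal sketch of a paging loop against the same endpoint. It assumes the endpoint supports the standard SODA count(*) aggregate and the $limit/$offset parameters shown in the answer's example URL; the batch size, alias name, and the choice of dr_no as the ordering field are illustrative, not a definitive implementation.

import requests

BASE = 'https://data.lacity.org/api/id/y8tr-7khq.json'

# Assumed: the standard SODA count(*) aggregate (with an "AS" alias) is
# available on this endpoint; the value comes back as a string.
total = int(requests.get(BASE, params={'$select': 'count(*) AS n'}).json()[0]['n'])

limit = 50000                     # illustrative batch size
records = []
for offset in range(0, total, limit):
    # offset counts records, not batches: 0, 50000, 100000, ...
    resp = requests.get(BASE, params={'$limit': limit,
                                      '$offset': offset,
                                      '$order': 'dr_no'})  # stable order while paging
    resp.raise_for_status()
    records.extend(resp.json())

print(len(records))

Once this basic paging works, the same loop could be wrapped in a function or in the class-based, multiprocessed script the answer mentions.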
If you look at the page source (Ctrl + U), you'll notice that there is no such element as <div class="socrata-table frozen-columns">. That's because the content you want to scrape is added to the page dynamically. Check out these questions: web scraping dynamic content with python or Web scraping a website with dynamic javascript content.

answered Nov 25 '18 at 18:17 by Adrian
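As a quick way to confirm this yourself, you can fetch the page with requests and search the raw HTML for the class name from the question; it will most likely not be there, because the grid is rendered client-side. A minimal sketch, using the URL from the question:

import requests

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
html = requests.get(url).text

# The table markup is built by JavaScript after load, so the class is
# unlikely to appear anywhere in the server-rendered HTML.
print("socrata-table frozen-columns" in html)   # expected: False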
This is because the data is filled in dynamically by React after the page loads. If you download the page via requests, you can't see the data. You need to use the Selenium WebDriver to open the page and let all the JavaScript run; then you can get the data you expect.

answered Nov 25 '18 at 18:21 by Vishnudev
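A minimal Selenium sketch along those lines, assuming a Chrome driver is installed and that the class names from the question ("socrata-table frozen-columns" and "column-header-content") are still what the rendered grid uses:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq")
    # Wait for the JavaScript-rendered grid to appear before touching the DOM.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.socrata-table.frozen-columns"))
    )
    for header in driver.find_elements(By.CSS_SELECTOR, "div.column-header-content"):
        print(header.text)
finally:
    driver.quit()

Note that browser automation is much slower than the API approach in the accepted answer, so it is only worth it if you specifically need the rendered page.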