Web scraping a table and can't locate the table
I am scraping this website: https://www.misoenergy.org/markets-and-operations/market-reports/market-report-archives/#nt=%2FMarketReportType%3ABids%2FMarketReportName%3AArchived%20Cleared%20Bids%20%20(zip)&t=10&p=0&s=FileName&sd=desc
I am trying to download all the zip files from the table. However, I cannot locate the table in the soup; the search below returns nothing.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request(
    'https://www.misoenergy.org/markets-and-operations/market-reports/market-report-archives/#nt=%2FMarketReportType%3ABids%2FMarketReportName%3AArchived%20Cleared%20Bids%20%20(zip)&t=10&p=0&s=FileName&sd=desc',
    headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read()
soup = BeautifulSoup(page, "html.parser")
tables = soup.find('div', class_='table table-bordered docnav-metadata dataTable no-footer')
Tags: web-scraping, beautifulsoup
asked Nov 26 '18 at 18:46 – Chen. B
The content gets loaded dynamically. Try using selenium or requests_html or something like that to fetch it. – SIM, Nov 26 '18 at 19:03
Thank you so much. Can you be more explicit, please? – Chen. B, Nov 26 '18 at 19:16
If you disable JavaScript in your browser and reload the page, you won't see that tabular content. BeautifulSoup can't catch such content. – SIM, Nov 26 '18 at 19:22
Thank you so much! – Chen. B, Nov 26 '18 at 19:29
Why not just use Firebug or Chrome developer tools to check the AJAX call and emulate it? – Carlos Alves Jorge, Nov 26 '18 at 20:11
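[Editor's note] Regarding that last suggestion: the idea is to watch the browser's Network tab while the page loads and replay the XHR request that actually returns the table data. A minimal sketch of that approach follows; the endpoint URL here is a placeholder, not a verified MISO address, so substitute whatever request your Network tab shows:

import requests

# PLACEHOLDER: copy the real XHR URL (and any query parameters) from the
# browser's Network tab; this address is hypothetical.
API_URL = "https://www.misoenergy.org/some/endpoint-from-network-tab"

resp = requests.get(API_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()
# Assumes the endpoint returns JSON; check the response's Content-Type in dev tools.
print(resp.json())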
2 Answers
As stated, you need something like selenium to load the page, since it's dynamic. You'll also need to let it wait for the page to load before grabbing the table.
NOTE: I used time.sleep() for the wait, but I've read that's not the best solution. The suggestion is to use WebDriverWait, but I'm still in the process of understanding how that would work, so I will update this once I play around with it. In the meantime, this should get you started.
import time

import bs4
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.misoenergy.org/markets-and-operations/market-reports/market-report-archives/#nt=%2FMarketReportType%3ABids%2FMarketReportName%3AArchived%20Cleared%20Bids%20%20(zip)&t=10&p=0&s=FileName&sd=desc')
time.sleep(5)  # crude wait for the JavaScript-rendered table to appear

html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table', {'class': 'table table-bordered docnav-metadata dataTable no-footer'})
This worked for me with WebDriverWait:
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.misoenergy.org/markets-and-operations/market-reports/market-report-archives/#nt=%2FMarketReportType%3ABids%2FMarketReportName%3AArchived%20Cleared%20Bids%20%20(zip)&t=10&p=0&s=FileName&sd=desc')
# Block (up to 10 seconds) until the table is actually present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table.table-bordered.docnav-metadata.dataTable.no-footer")))

html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table', {'class': 'table table-bordered docnav-metadata dataTable no-footer'})
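[Editor's note] Once tables is populated, the original goal was downloading the zip files. A minimal sketch of that last step, continuing from the script above and assuming each table row links to its archive through an ordinary <a href> ending in .zip (the exact markup hasn't been verified):

import os
from urllib.parse import urljoin
from urllib.request import urlretrieve

os.makedirs("zips", exist_ok=True)
for table in tables:  # `tables` from the selenium script above
    for a in table.find_all("a", href=True):
        if a["href"].lower().endswith(".zip"):
            # Resolve relative hrefs against the site root before downloading
            url = urljoin("https://www.misoenergy.org", a["href"])
            urlretrieve(url, os.path.join("zips", os.path.basename(a["href"])))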
answered Nov 26 '18 at 19:43 (edited Nov 26 '18 at 20:04) – chitown88
To fetch the tabular content from that webpage using the Requests-HTML library, you can try the following script:
import requests_html

link = "https://www.misoenergy.org/markets-and-operations/market-reports/market-report-archives/#nt=%2FMarketReportType%3ABids%2FMarketReportName%3AArchived%20Cleared%20Bids%20%20(zip)&t=10&p=0&s=FileName&sd=desc"
with requests_html.HTMLSession() as session:
    r = session.get(link)
    r.html.render(sleep=5, timeout=8)  # render the JavaScript so the table exists
    for items in r.html.find("table.dataTable tr.desktop-row"):
        data = [item.text for item in items.find("td")]
        print(data)
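[Editor's note] If you want the download URLs rather than the cell text, each row element also exposes absolute_links, which resolves relative hrefs against the page URL. A variation on the same script, assuming the archive links end in .zip:

import requests_html

link = "https://www.misoenergy.org/markets-and-operations/market-reports/market-report-archives/#nt=%2FMarketReportType%3ABids%2FMarketReportName%3AArchived%20Cleared%20Bids%20%20(zip)&t=10&p=0&s=FileName&sd=desc"
with requests_html.HTMLSession() as session:
    r = session.get(link)
    r.html.render(sleep=5, timeout=8)
    for row in r.html.find("table.dataTable tr.desktop-row"):
        # absolute_links resolves any relative hrefs in the row against the page URL
        zip_urls = [u for u in row.absolute_links if u.lower().endswith(".zip")]
        print(zip_urls)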
answered Nov 26 '18 at 19:52 – SIM
Finally someone using requests_html 😁 – Kamikaze_goldfish, Nov 26 '18 at 20:09