How to send crawler4j data to CrawlerManager?
I'm working on a project where users can search websites for pictures that carry a unique identifier.
public class ImageCrawler extends WebCrawler {

    private static final Pattern filters = Pattern.compile(
            ".*(\\.(css|js|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
            "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    private static final Pattern imgPatterns =
            Pattern.compile(".*(\\.(bmp|gif|jpe?g|png|tiff?))$");

    public ImageCrawler() {
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        return imgPatterns.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        byte[] imageBytes = page.getContentData();
        String imageBase64 = Base64.getEncoder().encodeToString(imageBytes);
        try {
            SecurityContextHolder.getContext().setAuthentication(
                    new UsernamePasswordAuthenticationToken(urlScan.getOwner(), null));
            DecodePictureResponse decodePictureResponse = decodePictureService.decodePicture(imageBase64);
            URLScanResult urlScanResult = new URLScanResult();
            urlScanResult.setPicture(
                    pictureRepository.findByUuid(decodePictureResponse.getPictureDTO().getUuid()).get());
            urlScanResult.setIntegrity(decodePictureResponse.isIntegrity());
            urlScanResult.setPictureUrl(url);
            urlScanResult.setUrlScan(urlScan);
            urlScan.getResults().add(urlScanResult);
            urlScanRepository.save(urlScan);
        } catch (ResourceNotFoundException ex) {
            // Picture is not in our database
        }
    }
}
The crawlers run independently. The ImageCrawlerManager class, which is a singleton, starts them.
public class ImageCrawlerManager {

    private static ImageCrawlerManager instance = null;

    private ImageCrawlerManager() {
    }

    public static synchronized ImageCrawlerManager getInstance() {
        if (instance == null) {
            instance = new ImageCrawlerManager();
        }
        return instance;
    }

    @Transactional(propagation = Propagation.REQUIRED)
    @PersistenceContext(type = PersistenceContextType.EXTENDED)
    public void startCrawler(URLScan urlScan, DecodePictureService decodePictureService,
                             URLScanRepository urlScanRepository, PictureRepository pictureRepository) {
        try {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp");
            config.setIncludeBinaryContentInCrawling(true);
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed(urlScan.getUrl());
            controller.start(ImageCrawler.class, 1);
            urlScan.setStatus(URLScanStatus.FINISHED);
            urlScanRepository.save(urlScan);
        } catch (Exception e) {
            e.printStackTrace();
            urlScan.setStatus(URLScanStatus.FAILED);
            urlScan.setFailedReason(e.getMessage());
            urlScanRepository.save(urlScan);
        }
    }
}
How can I send every image's data to the manager, which decodes the image, identifies the initiator of the search, and saves the results to the database? With the code above I can run multiple crawlers and save results to the database. Unfortunately, when I run two crawlers simultaneously, both sets of search results are stored, but all of them end up connected to the crawler that was started first.
spring asynchronous crawler4j
asked Nov 22 at 12:44
Przemek
104
1 Answer
accepted
You should inject your database service into your WebCrawler instances and not use a singleton to manage the result of your web-crawl.
crawler4j supports a custom CrawlController.WebCrawlerFactory (see here for reference), which can be used with Spring to inject your database service into an ImageCrawler instance.
Every single crawler thread should be responsible for the whole process you described (e.g. by using some specific services for it):
decode this image, get the initiator of search and save results to
database
Set up like this, your database will be the only source of truth, and you will not have to deal with synchronizing crawler states between different instances or user sessions.
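The wiring can be sketched roughly as follows. Note this is a minimal, self-contained illustration of the factory-injection idea: `WebCrawler`, `WebCrawlerFactory`, `DecodePictureService`, and the `List<String>` result sinks here are simplified stand-ins for the real crawler4j and project classes, not their actual APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the project's decode service.
interface DecodePictureService {
    String decodePicture(String imageBase64);
}

// Stand-in for crawler4j's WebCrawler base class.
abstract class WebCrawler {
    abstract void visit(String url);
}

// The crawler receives its collaborators per crawl run
// instead of reaching into a shared singleton manager.
class ImageCrawler extends WebCrawler {
    private final DecodePictureService decodeService;
    private final List<String> results; // per-scan result sink

    ImageCrawler(DecodePictureService decodeService, List<String> results) {
        this.decodeService = decodeService;
        this.results = results;
    }

    @Override
    void visit(String url) {
        results.add(decodeService.decodePicture(url));
    }
}

// Mirrors the shape of CrawlController.WebCrawlerFactory<T>:
// one factory per URLScan, so every crawler instance created for
// that scan is bound to the right scan context.
interface WebCrawlerFactory<T extends WebCrawler> {
    T newInstance();
}

public class FactoryInjectionSketch {
    public static void main(String[] args) {
        DecodePictureService service = b64 -> "decoded:" + b64;

        List<String> scanA = new ArrayList<>();
        List<String> scanB = new ArrayList<>();
        WebCrawlerFactory<ImageCrawler> factoryA = () -> new ImageCrawler(service, scanA);
        WebCrawlerFactory<ImageCrawler> factoryB = () -> new ImageCrawler(service, scanB);

        // Two concurrent scans: each result stays attached to its own scan.
        factoryA.newInstance().visit("imgA.png");
        factoryB.newInstance().visit("imgB.png");

        System.out.println(scanA); // [decoded:imgA.png]
        System.out.println(scanB); // [decoded:imgB.png]
    }
}
```

With the real library, you would pass such a factory to `controller.start(factory, numberOfCrawlers)` rather than the `ImageCrawler.class` overload, letting Spring (or the factory's closure) supply the services and the owning URLScan.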
answered Dec 7 at 13:42
rzo