How to send crawler4j data to CrawlerManager?
I'm working on a project where users can search websites for pictures that carry a unique identifier.
public class ImageCrawler extends WebCrawler {

    private static final Pattern filters = Pattern.compile(
            ".*(\\.(css|js|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
            "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    private static final Pattern imgPatterns =
            Pattern.compile(".*(\\.(bmp|gif|jpe?g|png|tiff?))$");

    public ImageCrawler() {
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        return imgPatterns.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        byte[] imageBytes = page.getContentData();
        String imageBase64 = Base64.getEncoder().encodeToString(imageBytes);
        try {
            SecurityContextHolder.getContext().setAuthentication(
                    new UsernamePasswordAuthenticationToken(urlScan.getOwner(), null));
            DecodePictureResponse decodePictureResponse = decodePictureService.decodePicture(imageBase64);
            URLScanResult urlScanResult = new URLScanResult();
            urlScanResult.setPicture(
                    pictureRepository.findByUuid(decodePictureResponse.getPictureDTO().getUuid()).get());
            urlScanResult.setIntegrity(decodePictureResponse.isIntegrity());
            urlScanResult.setPictureUrl(url);
            urlScanResult.setUrlScan(urlScan);
            urlScan.getResults().add(urlScanResult);
            urlScanRepository.save(urlScan);
        } catch (ResourceNotFoundException ex) {
            // Picture is not in our database
        }
    }
}
The crawlers run independently. The ImageCrawlerManager class, which is a singleton, starts them.
public class ImageCrawlerManager {

    private static ImageCrawlerManager instance = null;

    private ImageCrawlerManager() {
    }

    public static synchronized ImageCrawlerManager getInstance() {
        if (instance == null) {
            instance = new ImageCrawlerManager();
        }
        return instance;
    }

    @Transactional(propagation = Propagation.REQUIRED)
    @PersistenceContext(type = PersistenceContextType.EXTENDED)
    public void startCrawler(URLScan urlScan, DecodePictureService decodePictureService,
                             URLScanRepository urlScanRepository, PictureRepository pictureRepository) {
        try {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp");
            config.setIncludeBinaryContentInCrawling(true);
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed(urlScan.getUrl());
            controller.start(ImageCrawler.class, 1);
            urlScan.setStatus(URLScanStatus.FINISHED);
            urlScanRepository.save(urlScan);
        } catch (Exception e) {
            e.printStackTrace();
            urlScan.setStatus(URLScanStatus.FAILED);
            urlScan.setFailedReason(e.getMessage());
            urlScanRepository.save(urlScan);
        }
    }
}
How can I send every image's data to the manager, which decodes the image, identifies the initiator of the search, and saves the results to the database? With the code above I can run multiple crawlers and save results to the database. Unfortunately, when I run two crawlers simultaneously, both sets of search results are stored, but all of them end up connected to the crawler that was started first.
spring asynchronous crawler4j
asked Nov 22 at 12:44
Przemek
104
1 Answer
accepted
You should inject your database service into your WebCrawler instances and not use a singleton to manage the result of your web-crawl.
crawler4j supports a custom CrawlController.WebCrawlerFactory (see here for reference), which can be used with Spring to inject your database service into an ImageCrawler instance.
Every single crawler thread should be responsible for the whole process you described (e.g. by using some specific services for it):
decode this image, get the initiator of search and save results to
database
Set up like this, your database will be the only source of truth, and you will not have to deal with synchronizing crawler states between different instances or user sessions.
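The wiring can be sketched roughly as follows. Note this is a minimal, self-contained illustration of the factory-injection idea: `WebCrawler`, `WebCrawlerFactory`, `DecodePictureService`, and the `List<String>` result sinks here are simplified stand-ins for the real crawler4j and project classes, not their actual APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the project's decode service.
interface DecodePictureService {
    String decodePicture(String imageBase64);
}

// Stand-in for crawler4j's WebCrawler base class.
abstract class WebCrawler {
    abstract void visit(String url);
}

// The crawler receives its collaborators per crawl run
// instead of reaching into a shared singleton manager.
class ImageCrawler extends WebCrawler {
    private final DecodePictureService decodeService;
    private final List<String> results; // per-scan result sink

    ImageCrawler(DecodePictureService decodeService, List<String> results) {
        this.decodeService = decodeService;
        this.results = results;
    }

    @Override
    void visit(String url) {
        results.add(decodeService.decodePicture(url));
    }
}

// Mirrors the shape of CrawlController.WebCrawlerFactory<T>:
// one factory per URLScan, so every crawler instance created for
// that scan is bound to the right scan context.
interface WebCrawlerFactory<T extends WebCrawler> {
    T newInstance();
}

public class FactoryInjectionSketch {
    public static void main(String[] args) {
        DecodePictureService service = b64 -> "decoded:" + b64;

        List<String> scanA = new ArrayList<>();
        List<String> scanB = new ArrayList<>();
        WebCrawlerFactory<ImageCrawler> factoryA = () -> new ImageCrawler(service, scanA);
        WebCrawlerFactory<ImageCrawler> factoryB = () -> new ImageCrawler(service, scanB);

        // Two concurrent scans: each result stays attached to its own scan.
        factoryA.newInstance().visit("imgA.png");
        factoryB.newInstance().visit("imgB.png");

        System.out.println(scanA); // [decoded:imgA.png]
        System.out.println(scanB); // [decoded:imgB.png]
    }
}
```

With the real library, you would pass such a factory to `controller.start(factory, numberOfCrawlers)` rather than the `ImageCrawler.class` overload, letting Spring (or the factory's closure) supply the services and the owning URLScan.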
answered Dec 7 at 13:42
rzo