Metrics collection and analysis architecture












We are working on HomeKit-enabled IoT devices. HomeKit is designed for consumer use and does not have the ability to collect metrics (power, temperature, etc.), so we need to implement it separately.



Let's say we have 10,000 devices, each sending one collection of metrics every 5 seconds. That means we need to receive 10,000 / 5 = 2,000 collections per second, or about 172.8 million records per day. The end user needs to see graphs of each metric over a chosen period (1 week, 1 month, 1 year, etc.). This raises a lot of questions.
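The arithmetic can be checked with a few lines of Python. This is only a sketch; the function name `ingest_rates` is made up for illustration, and it assumes one record per collection.

```python
from typing import Tuple

SECONDS_PER_DAY = 24 * 60 * 60

def ingest_rates(devices: int, interval_s: int) -> Tuple[float, float]:
    """Return (collections per second, collections per day)."""
    per_second = devices / interval_s
    per_day = per_second * SECONDS_PER_DAY
    return per_second, per_day

per_second, per_day = ingest_rates(devices=10_000, interval_s=5)
print(f"{per_second:,.0f} collections/s, {per_day:,.0f} collections/day")
# prints "2,000 collections/s, 172,800,000 collections/day"
```

If each collection is split into one record per metric, the daily record count is multiplied by the number of metrics.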



First of all, there is no need to store all the raw data: the user only needs graphs for the chosen period, so some aggregation is needed. Which database solution fits this? I don't believe any RDBMS will handle this amount of data. And how do we compute averaged metric data to present to the end user?
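To make the aggregation idea concrete, here is a minimal sketch of averaging raw samples into fixed-width time buckets. The function `downsample` and the sample data are hypothetical, and it assumes samples arrive as `(unix_timestamp, value)` pairs.

```python
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_s):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, average) pairs sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_s)  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Hypothetical temperature readings taken every 5 seconds.
raw = [(0, 20.0), (5, 21.0), (10, 22.0), (15, 23.0), (20, 24.0), (25, 25.0)]
per_15s = downsample(raw, bucket_s=15)
print(per_15s)  # [(0, 21.0), (15, 24.0)]
```

The same function with `bucket_s=3600` or `bucket_s=86400` would give hourly or daily averages, which is probably enough resolution for month and year graphs.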



AWS has published a time-series data processing architecture:
[Figure: AWS time-series data processing architecture]



In very simplified form, I think of it this way:




  1. Devices push data directly to DynamoDB using the HTTP API

  2. Metrics are stored in one table per 24 hours

  3. At the end of the day, a job runs on Elastic MapReduce and
    produces ready-made JSON files with the data required to show graphs per time
    period.

  4. Old tables are moved to Redshift for further applications.
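Steps 2 and 3 can be sketched in plain Python to show the shape of the data flow. This is not DynamoDB or EMR code: the `metrics_YYYY_MM_DD` naming scheme, the row layout `(device_id, metric, unix_ts, value)`, and both function names are assumptions made for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

def table_for(ts: float) -> str:
    """Name of the per-day table holding a sample (hypothetical naming scheme)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"metrics_{day:%Y_%m_%d}"

def daily_rollup(rows) -> str:
    """Reduce one day's rows to hourly averages per device and metric, as JSON.

    Each row is (device_id, metric, unix_ts, value).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for device_id, metric, ts, value in rows:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        acc = sums[(device_id, metric, hour)]
        acc[0] += value
        acc[1] += 1
    points = [
        {"device": d, "metric": m, "hour": h, "avg": total / count}
        for (d, m, h), (total, count) in sorted(sums.items())
    ]
    return json.dumps(points)

rows = [
    ("dev1", "temperature", 1543190400, 20.0),  # 2018-11-26 00:00:00 UTC
    ("dev1", "temperature", 1543190405, 22.0),
]
print(table_for(1543190400))  # metrics_2018_11_26
print(daily_rollup(rows))
```

In a real deployment the rollup would run as a distributed job over the day's table rather than in a single process, but the output (small pre-aggregated JSON per device and period) is the same idea.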


Has anyone done something similar before? Is there perhaps a simpler architecture?










      database amazon-web-services architecture bigdata iot














      edited Nov 26 '18 at 10:01







      Nikita Zernov

















      asked Nov 26 '18 at 9:55









      Nikita Zernov
























          1 Answer














          This calls for big data infrastructure such as:
          1) Hadoop cluster
          2) Spark
          3) HDFS
          4) HBase



          You can use Spark to read the data as a stream. The streamed data can be stored in HDFS, a file system that lets you store large files across the Hadoop cluster. You can then use a MapReduce job to extract the required data set from HDFS and store it in HBase, the Hadoop database. HDFS is a distributed, scalable store for big data records. Finally, you can use query tools to query HBase.



          IoT data --> Spark --> HDFS --> MapReduce --> HBase --> query HBase
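The MapReduce stage of that pipeline can be illustrated in plain Python. This is a single-process stand-in, not Spark or Hadoop API code; the record layout `(device_id, metric, value)` and the function names are assumptions.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Emit ((device_id, metric), (value, 1)) pairs, one per raw record."""
    for device_id, metric, value in records:
        yield (device_id, metric), (value, 1)

def reduce_phase(pairs):
    """Sum values and counts per key, then turn them into averages."""
    averages = {}
    by_key = sorted(pairs, key=itemgetter(0))  # stands in for the shuffle step
    for key, group in groupby(by_key, key=itemgetter(0)):
        total, count = reduce(
            lambda a, b: (a[0] + b[0], a[1] + b[1]),
            (value for _, value in group),
        )
        averages[key] = total / count
    return averages

records = [("dev1", "power", 10.0), ("dev1", "power", 14.0), ("dev2", "power", 7.0)]
averages = reduce_phase(map_phase(records))
print(averages)  # {('dev1', 'power'): 12.0, ('dev2', 'power'): 7.0}
```

Carrying `(sum, count)` pairs rather than averages through the reduce step is what lets partial results from different nodes be combined correctly.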



          The reason I am suggesting this architecture is scalability. The input volume grows with the number of IoT devices. In the architecture above, the infrastructure is distributed, so the cluster can be scaled out by adding nodes.



          This is a proven architecture in big data analytics applications.
















                answered Dec 1 '18 at 4:14









                challenger































