Count the number of true and false conditions in a Spark data frame












I come from a MATLAB background, where I can simply do this:

age_sum_error = sum(age > prediction - 4 & age < prediction + 4);

This counts the age values that lie strictly between prediction - 4 and prediction + 4. I want to do something similar with a Spark data frame.



Say this is my Spark data frame:



+-----+--------+------------+
| age | gender | prediction |
+-----+--------+------------+
| 35  | M      | 30         |
| 40  | F      | 42         |
| 45  | F      | 38         |
| 26  | F      | 29         |
+-----+--------+------------+


I want my result to look something like this



+-------+----------+
| false | positive |
+-------+----------+
| 2     | 2        |
+-------+----------+















      python apache-spark pyspark apache-spark-sql
















      asked Nov 24 '18 at 21:13









      Jam1
























          2 Answers
First calculate the condition, and then aggregate the result by summing up the 1s and 0s:

df.selectExpr(
    # 1 where the age is within 4 of the prediction, otherwise 0
    'cast(abs(age - prediction) < 4 as int) as condition'
).selectExpr(
    # rows that satisfy the condition, and rows that do not
    'sum(condition) as positive',
    'sum(1 - condition) as negative'
).show()
+--------+--------+
|positive|negative|
+--------+--------+
|       2|       2|
+--------+--------+
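
The two-step logic above can be checked on the four sample rows without a Spark session. The following is only a plain-Python sketch of the same idea, not Spark code: the names `rows`, `TOLERANCE` and `count_conditions` are made up for illustration. It also confirms that the `abs(age - prediction) < 4` form agrees with the two-sided comparison in the MATLAB line from the question.

```python
# Sample rows from the question: (age, gender, prediction)
rows = [(35, "M", 30), (40, "F", 42), (45, "F", 38), (26, "F", 29)]
TOLERANCE = 4  # the +/-4 window from the question (hypothetical name)


def count_conditions(rows, tolerance=TOLERANCE):
    """Return (positive, negative) counts, mirroring the two selectExpr steps."""
    # Step 1: one 0/1 flag per row, like cast(abs(age - prediction) < 4 as int)
    condition = [int(abs(age - pred) < tolerance) for age, _, pred in rows]
    # Step 2: sum(condition) and sum(1 - condition)
    positive = sum(condition)
    negative = sum(1 - c for c in condition)
    return positive, negative


# The abs() form agrees with the two-sided comparison from the MATLAB line
two_sided = [int(pred - TOLERANCE < age < pred + TOLERANCE) for age, _, pred in rows]
assert two_sided == [int(abs(age - pred) < TOLERANCE) for age, _, pred in rows]

positive, negative = count_conditions(rows)
print(positive, negative)  # 2 2
```

Both comparisons are strict, so a row whose age differs from the prediction by exactly 4 is counted as negative.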
















                  answered Nov 24 '18 at 21:57









                  Psidom

























It's a lot more code than MATLAB, but here's how I would do it with NumPy.

import numpy as np

# NumPy arrays rather than plain lists, so pred - tolerance works element-wise
ages = np.array([35, 40, 45, 26])
pred = np.array([30, 42, 38, 29])
tolerance = 4

# boolean arrays: age above the lower limit, age below the upper limit
is_older = np.greater(ages, pred - tolerance)
is_younger = np.less(ages, pred + tolerance)

# convert these boolean arrays to ints, then multiply. True = 1, False = 0.
in_range = is_older.astype(int) * is_younger.astype(int)  # 0's cancel 1's

# add up the entries that are still 1 (the ages within the tolerance)
senior_count = np.sum(in_range)

Hope this helps.
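
The question asks for both counts, and the NumPy version above only produces the in-range one. A shorter sketch that yields both is shown below. It assumes the data is already in local NumPy arrays, not in a Spark data frame, and the names `positive` and `negative` are borrowed from the other answer for illustration.

```python
import numpy as np

ages = np.array([35, 40, 45, 26])
pred = np.array([30, 42, 38, 29])
tolerance = 4

# Boolean mask: True where the age lies strictly within +/- tolerance of pred
in_range = np.abs(ages - pred) < tolerance

positive = int(in_range.sum())     # rows that satisfy the condition
negative = int((~in_range).sum())  # rows that do not

print(positive, negative)  # 2 2
```

Summing a boolean array counts its True entries, so no explicit cast to int is needed before the sum.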

















                          answered Nov 24 '18 at 21:54









                          Charles Strauss





























