pyspark convert row to json with nulls
Goal:
For a dataframe with schema



id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string


I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post, PySpark - Convert to JSON row by row, and related questions.
My code:



df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))


I am having one issue:



Issue:
When a row has a null value for a column (and my data has many...), the JSON string doesn't contain that key. That is, if only 9 of the 27 columns have values, the JSON string only has 9 keys. What I would like to do is keep all keys, and for the null values just pass an empty string "".



Any tips?
      json apache-spark pyspark apache-spark-sql
      edited Nov 28 '18 at 18:24
      pault
      asked Nov 28 '18 at 18:00
mikeyoung
1 Answer
          You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.



          Consider the following example DataFrame:



data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]

sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)


Use when to implement the if-then-else logic: use the column's value if it is not null; otherwise return an empty string.



from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(col(x).isNotNull(), col(x)).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show(truncate=False)
#+-----+----+---+-----------------------------+
#|A    |B   |C  |JSON                         |
#+-----+----+---+-----------------------------+
#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |
#|null |2   |20 |{"A":"","B":"2","C":"20"}    |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"}     |
#+-----+----+---+-----------------------------+




          Another option is to use pyspark.sql.functions.coalesce instead of when:



from pyspark.sql.functions import coalesce

sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same output as above

• Thanks! That gets me close. Only issue now is that the JSON string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out – mikeyoung, Nov 28 '18 at 18:58

• @mikeyoung I added an update - you just need to alias the column with the original name. – pault, Nov 28 '18 at 19:03

• perfect, def a simple question ;) – mikeyoung, Nov 28 '18 at 19:10
          edited Nov 28 '18 at 19:03
          answered Nov 28 '18 at 18:17
pault