pyspark convert row to json with nulls

Goal:
For a dataframe with schema

id:string

Cold:string

Medium:string

Hot:string

IsNull:string

annual_sales_c:string

average_check_c:string

credit_rating_c:string

cuisine_c:string

dayparts_c:string

location_name_c:string

market_category_c:string

market_segment_list_c:string

menu_items_c:string

msa_name_c:string

name:string

number_of_employees_c:string

number_of_rooms_c:string

Months In Role:integer

Tenured Status:string

IsCustomer:integer

units_c:string

years_in_business_c:string

medium_interactions_c:string

hot_interactions_c:string

cold_interactions_c:string

is_null_interactions_c:string

I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code

df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))

I am having one issue:

Issue:
When any row has a null value for a column (and my data has many...) the Json string doesn't contain the key. I.e. if only 9 out of the 27 columns have values then the JSON string only has 9 keys... What I would like to do is maintain all keys but for the null values just pass an empty string ""

Any tips?

edited Nov 28 '18 at 18:24

pault

16.4k42652

asked Nov 28 '18 at 18:00

mikeyoung

134

add a comment |

Goal:
For a dataframe with schema

id:string

Cold:string

Medium:string

Hot:string

IsNull:string

annual_sales_c:string

average_check_c:string

credit_rating_c:string

cuisine_c:string

dayparts_c:string

location_name_c:string

market_category_c:string

market_segment_list_c:string

menu_items_c:string

msa_name_c:string

name:string

number_of_employees_c:string

number_of_rooms_c:string

Months In Role:integer

Tenured Status:string

IsCustomer:integer

units_c:string

years_in_business_c:string

medium_interactions_c:string

hot_interactions_c:string

cold_interactions_c:string

is_null_interactions_c:string

I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code

df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))

I am having one issue:

Any tips?

edited Nov 28 '18 at 18:24

pault

16.4k42652

asked Nov 28 '18 at 18:00

mikeyoung

134

add a comment |

Goal:
For a dataframe with schema

id:string

Cold:string

Medium:string

Hot:string

IsNull:string

annual_sales_c:string

average_check_c:string

credit_rating_c:string

cuisine_c:string

dayparts_c:string

location_name_c:string

market_category_c:string

market_segment_list_c:string

menu_items_c:string

msa_name_c:string

name:string

number_of_employees_c:string

number_of_rooms_c:string

Months In Role:integer

Tenured Status:string

IsCustomer:integer

units_c:string

years_in_business_c:string

medium_interactions_c:string

hot_interactions_c:string

cold_interactions_c:string

is_null_interactions_c:string

I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code

df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))

I am having one issue:

Any tips?

edited Nov 28 '18 at 18:24

pault

16.4k42652

asked Nov 28 '18 at 18:00

mikeyoung

134

Goal:
For a dataframe with schema

id:string

Cold:string

Medium:string

Hot:string

IsNull:string

annual_sales_c:string

average_check_c:string

credit_rating_c:string

cuisine_c:string

dayparts_c:string

location_name_c:string

market_category_c:string

market_segment_list_c:string

menu_items_c:string

msa_name_c:string

name:string

number_of_employees_c:string

number_of_rooms_c:string

Months In Role:integer

Tenured Status:string

IsCustomer:integer

units_c:string

years_in_business_c:string

medium_interactions_c:string

hot_interactions_c:string

cold_interactions_c:string

is_null_interactions_c:string

I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code

df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))

I am having one issue:

Any tips?

json apache-spark pyspark apache-spark-sql

edited Nov 28 '18 at 18:24

pault

16.4k42652

asked Nov 28 '18 at 18:00

mikeyoung

134

edited Nov 28 '18 at 18:24

pault

16.4k42652

asked Nov 28 '18 at 18:00

mikeyoung

134

edited Nov 28 '18 at 18:24

pault

16.4k42652

edited Nov 28 '18 at 18:24

pault

16.4k42652

edited Nov 28 '18 at 18:24

pault

16.4k42652

asked Nov 28 '18 at 18:00

mikeyoung

134

asked Nov 28 '18 at 18:00

mikeyoung

134

asked Nov 28 '18 at 18:00

mikeyoung

134

add a comment |

1 Answer
1

active

oldest

votes

You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [

    ('one', 1, 10),

    (None, 2, 20),

    ('three', None, 30),

    (None, None, 40)

]



sdf = spark.createDataFrame(data, ["A", "B", "C"])

sdf.printSchema()

#root

# |-- A: string (nullable = true)

# |-- B: long (nullable = true)

# |-- C: long (nullable = true)

Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.

from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [

                when(

                    col(x).isNotNull(),

                    col(x)

                ).otherwise(lit("")).alias(x) 

                for x in sdf.columns

            ]

        )

    )

)

sdf.show()

#+-----+----+---+-----------------------------+

#|A    |B   |C  |JSON                         |

#+-----+----+---+-----------------------------+

#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |

#|null |2   |20 |{"A":"","B":"2","C":"20"}    |

#|three|null|30 |{"A":"three","B":"","C":"30"}|

#|null |null|40 |{"A":"","B":"","C":"40"}     |

#+-----+----+---+-----------------------------+

Another option is to use pyspark.sql.functions.coalesce instead of when:

from pyspark.sql.functions import coalesce



sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]

        )

    )

).show(truncate=False)

## Same as above

edited Nov 28 '18 at 19:03

answered Nov 28 '18 at 18:17

pault

16.4k42652

Thanks! That gets me close. Only issue now is that the json string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out

– mikeyoung
Nov 28 '18 at 18:58

@mikeyoung I added an update- you just need to alias the column with the original name.

– pault
Nov 28 '18 at 19:03

perfect, def a simple question ;)

– mikeyoung
Nov 28 '18 at 19:10

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53525470%2fpyspark-convert-row-to-json-with-nulls%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [

    ('one', 1, 10),

    (None, 2, 20),

    ('three', None, 30),

    (None, None, 40)

]



sdf = spark.createDataFrame(data, ["A", "B", "C"])

sdf.printSchema()

#root

# |-- A: string (nullable = true)

# |-- B: long (nullable = true)

# |-- C: long (nullable = true)

Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.

from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [

                when(

                    col(x).isNotNull(),

                    col(x)

                ).otherwise(lit("")).alias(x) 

                for x in sdf.columns

            ]

        )

    )

)

sdf.show()

#+-----+----+---+-----------------------------+

#|A    |B   |C  |JSON                         |

#+-----+----+---+-----------------------------+

#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |

#|null |2   |20 |{"A":"","B":"2","C":"20"}    |

#|three|null|30 |{"A":"three","B":"","C":"30"}|

#|null |null|40 |{"A":"","B":"","C":"40"}     |

#+-----+----+---+-----------------------------+

Another option is to use pyspark.sql.functions.coalesce instead of when:

from pyspark.sql.functions import coalesce



sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]

        )

    )

).show(truncate=False)

## Same as above

edited Nov 28 '18 at 19:03

answered Nov 28 '18 at 18:17

pault

16.4k42652

Thanks! That gets me close. Only issue now is that the json string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out

– mikeyoung
Nov 28 '18 at 18:58

@mikeyoung I added an update- you just need to alias the column with the original name.

– pault
Nov 28 '18 at 19:03

perfect, def a simple question ;)

– mikeyoung
Nov 28 '18 at 19:10

add a comment |

You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [

    ('one', 1, 10),

    (None, 2, 20),

    ('three', None, 30),

    (None, None, 40)

]



sdf = spark.createDataFrame(data, ["A", "B", "C"])

sdf.printSchema()

#root

# |-- A: string (nullable = true)

# |-- B: long (nullable = true)

# |-- C: long (nullable = true)

Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.

from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [

                when(

                    col(x).isNotNull(),

                    col(x)

                ).otherwise(lit("")).alias(x) 

                for x in sdf.columns

            ]

        )

    )

)

sdf.show()

#+-----+----+---+-----------------------------+

#|A    |B   |C  |JSON                         |

#+-----+----+---+-----------------------------+

#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |

#|null |2   |20 |{"A":"","B":"2","C":"20"}    |

#|three|null|30 |{"A":"three","B":"","C":"30"}|

#|null |null|40 |{"A":"","B":"","C":"40"}     |

#+-----+----+---+-----------------------------+

Another option is to use pyspark.sql.functions.coalesce instead of when:

from pyspark.sql.functions import coalesce



sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]

        )

    )

).show(truncate=False)

## Same as above

edited Nov 28 '18 at 19:03

answered Nov 28 '18 at 18:17

pault

16.4k42652

Thanks! That gets me close. Only issue now is that the json string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out

– mikeyoung
Nov 28 '18 at 18:58

@mikeyoung I added an update- you just need to alias the column with the original name.

– pault
Nov 28 '18 at 19:03

perfect, def a simple question ;)

– mikeyoung
Nov 28 '18 at 19:10

add a comment |

You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [

    ('one', 1, 10),

    (None, 2, 20),

    ('three', None, 30),

    (None, None, 40)

]



sdf = spark.createDataFrame(data, ["A", "B", "C"])

sdf.printSchema()

#root

# |-- A: string (nullable = true)

# |-- B: long (nullable = true)

# |-- C: long (nullable = true)

Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.

from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [

                when(

                    col(x).isNotNull(),

                    col(x)

                ).otherwise(lit("")).alias(x) 

                for x in sdf.columns

            ]

        )

    )

)

sdf.show()

#+-----+----+---+-----------------------------+

#|A    |B   |C  |JSON                         |

#+-----+----+---+-----------------------------+

#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |

#|null |2   |20 |{"A":"","B":"2","C":"20"}    |

#|three|null|30 |{"A":"three","B":"","C":"30"}|

#|null |null|40 |{"A":"","B":"","C":"40"}     |

#+-----+----+---+-----------------------------+

Another option is to use pyspark.sql.functions.coalesce instead of when:

from pyspark.sql.functions import coalesce



sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]

        )

    )

).show(truncate=False)

## Same as above

edited Nov 28 '18 at 19:03

answered Nov 28 '18 at 18:17

pault

16.4k42652

You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.

Consider the following example DataFrame:

data = [

    ('one', 1, 10),

    (None, 2, 20),

    ('three', None, 30),

    (None, None, 40)

]



sdf = spark.createDataFrame(data, ["A", "B", "C"])

sdf.printSchema()

#root

# |-- A: string (nullable = true)

# |-- B: long (nullable = true)

# |-- C: long (nullable = true)

Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.

from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [

                when(

                    col(x).isNotNull(),

                    col(x)

                ).otherwise(lit("")).alias(x) 

                for x in sdf.columns

            ]

        )

    )

)

sdf.show()

#+-----+----+---+-----------------------------+

#|A    |B   |C  |JSON                         |

#+-----+----+---+-----------------------------+

#|one  |1   |10 |{"A":"one","B":"1","C":"10"} |

#|null |2   |20 |{"A":"","B":"2","C":"20"}    |

#|three|null|30 |{"A":"three","B":"","C":"30"}|

#|null |null|40 |{"A":"","B":"","C":"40"}     |

#+-----+----+---+-----------------------------+

Another option is to use pyspark.sql.functions.coalesce instead of when:

from pyspark.sql.functions import coalesce



sdf.withColumn(

    "JSON",

    to_json(

        struct(

           [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]

        )

    )

).show(truncate=False)

## Same as above

edited Nov 28 '18 at 19:03

answered Nov 28 '18 at 18:17

pault

16.4k42652

edited Nov 28 '18 at 19:03

answered Nov 28 '18 at 18:17

pault

16.4k42652

answered Nov 28 '18 at 18:17

pault

16.4k42652

answered Nov 28 '18 at 18:17

pault

16.4k42652

Thanks! That gets me close. Only issue now is that the json string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out

– mikeyoung
Nov 28 '18 at 18:58

@mikeyoung I added an update- you just need to alias the column with the original name.

– pault
Nov 28 '18 at 19:03

perfect, def a simple question ;)

– mikeyoung
Nov 28 '18 at 19:10

add a comment |

Thanks! That gets me close. Only issue now is that the json string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out

– mikeyoung
Nov 28 '18 at 18:58

@mikeyoung I added an update- you just need to alias the column with the original name.

– pault
Nov 28 '18 at 19:03

perfect, def a simple question ;)

– mikeyoung
Nov 28 '18 at 19:10

Thanks! That gets me close. Only issue now is that the json string doesn't preserve the column names. I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.} This last point I'm 100% sure is a noob question... just can't figure it out

– mikeyoung
Nov 28 '18 at 18:58

@mikeyoung I added an update- you just need to alias the column with the original name.

– pault
Nov 28 '18 at 19:03

perfect, def a simple question ;)

– mikeyoung
Nov 28 '18 at 19:10

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Btukfyl