PySpark: convert row to JSON with nulls
Goal:
For a dataframe with the following schema:
id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string
I want to add a new column that is a JSON string of all the keys and values for the columns. I have used the approach from this post, PySpark - Convert to JSON row by row, and related questions.
My code:
df = df.withColumn("JSON", func.to_json(func.struct([df[x] for x in df.columns])))
Issue:
When a row has a null value for a column (and my data has many), the JSON string doesn't contain that key. That is, if only 9 of the 27 columns have values, the JSON string only has 9 keys. What I would like to do is keep all keys and, for the null values, just pass an empty string "".
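For reference, here is a minimal sketch that reproduces the behavior (toy column names, not my real schema):
import pyspark.sql.functions as func
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy frame with illustrative column names; "Cold" is null in the second row
toy = spark.createDataFrame([("a1", "yes"), ("a2", None)], ["id", "Cold"])
toy.withColumn(
    "JSON", func.to_json(func.struct([toy[c] for c in toy.columns]))
).show(truncate=False)
# to_json drops null fields by default, so the second row becomes {"id":"a2"}
# rather than {"id":"a2","Cold":""}.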
Any tips?
json apache-spark pyspark apache-spark-sql
asked Nov 28 '18 at 18:00 by mikeyoung, edited Nov 28 '18 at 18:24 by pault
1 Answer
You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.
Consider the following example DataFrame:
data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]
sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)
Use when to implement if-then-else logic: use the column if it is not null, otherwise return an empty string.
from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                # keep the value when it is not null, otherwise substitute "",
                # then alias back to the original column name
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A |B |C |JSON |
#+-----+----+---+-----------------------------+
#|one |1 |10 |{"A":"one","B":"1","C":"10"} |
#|null |2 |20 |{"A":"","B":"2","C":"20"} |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"} |
#+-----+----+---+-----------------------------+
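One thing to note: when and otherwise need a common type, so the numeric columns B and C are coerced to strings in the JSON above ("B":"1" instead of "B":1). If you'd rather keep numbers as numbers and only blank-fill the string columns, a variant along these lines should work (a sketch, not tested against your full schema):
from pyspark.sql.functions import col, to_json, struct, when, lit
from pyspark.sql.types import StringType

# Blank-fill only the string columns; numeric columns keep their type
# (their nulls will still be dropped from the JSON unless handled separately).
cols = [
    when(col(f.name).isNotNull(), col(f.name)).otherwise(lit("")).alias(f.name)
    if isinstance(f.dataType, StringType)
    else col(f.name)
    for f in sdf.schema.fields
]
sdf.withColumn("JSON", to_json(struct(cols))).show(truncate=False)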
Another option is to use pyspark.sql.functions.coalesce instead of when:
from pyspark.sql.functions import coalesce

sdf.withColumn(
    "JSON",
    to_json(
        struct(
            # coalesce falls back to "" whenever the column is null
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above
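Finally, if a JSON null (rather than an empty string) would be acceptable and you are on a newer Spark release, to_json also accepts an ignoreNullFields option (3.0+ as far as I know, so check your version); setting it to false keeps the keys for null values:
from pyspark.sql.functions import to_json, struct

# Spark 3.0+ only (an assumption, verify against your release); keeps nulls as
# JSON null, e.g. {"A":null,"B":2,"C":20}, rather than substituting "".
sdf.withColumn(
    "JSON",
    to_json(struct([sdf[c] for c in sdf.columns]), {"ignoreNullFields": "false"})
).show(truncate=False)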
answered Nov 28 '18 at 18:17 by pault, edited Nov 28 '18 at 19:03
Thanks! That gets me close. The only issue now is that the JSON string doesn't preserve the column names: I get {"col1": "val1", "col2": "", etc.} vs {"id": "val1", "IsCustomer": "", etc.}. This last point I'm 100% sure is a noob question... just can't figure it out. – mikeyoung Nov 28 '18 at 18:58
@mikeyoung I added an update: you just need to alias the column with the original name. – pault Nov 28 '18 at 19:03
perfect, def a simple question ;) – mikeyoung Nov 28 '18 at 19:10