How to replace white space with comma in Spark (with Scala)?
I have a log file like this. I want to create a DataFrame in Scala.
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2
I want to replace all the spaces with commas so that I can use spark.sql, but I am unable to do so.
Here is everything I tried:
- Tried importing it as a text file first to see if there is a replaceAll method.
- Tried splitting on spaces.
Any suggestions? I went through the documentation and there is no mention of a replace function like the one in Pandas.
scala apache-spark apache-spark-sql databricks
asked Nov 26 '18 at 20:51 by San
Possible duplicate of how to use Regexp_replace in spark – user10465355, Nov 26 '18 at 20:54
3 Answers
You can simply tell Spark that your delimiter is a whitespace character, like this:
val df = spark.read.option("delimiter", " ").csv("path/to/file")
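Because Spark's CSV reader already treats the double quote as its default quote character, the quoted request and user-agent fields survive this read intact, which also addresses the quote problem raised in the comments below. Here is a slightly fuller sketch; the column names follow the AWS ELB access-log layout and are my assumption, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("elb-logs").getOrCreate()

// Space as delimiter; the default quote character (") keeps
// "GET https://www.example.com:443/ HTTP/1.1" together as one field.
val df = spark.read
  .option("delimiter", " ")
  .csv("path/to/file")
  .toDF("timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol")

df.createOrReplaceTempView("logs")
spark.sql("SELECT request, user_agent FROM logs").show(false)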
answered Nov 27 '18 at 7:08 by Oli
Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert to a DataFrame with a schema.
Roughly:
val rdd = sc.textFile(loglinePath).map(line => line.split("\\s+"))  // loglinePath: wherever the log file lives
Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way would be to map your arrays to Row objects, as an RDD[Row] is what underlies a DataFrame; a sketch of that follows below.
A simpler way to get up and running, though, would be:
spark.createDataFrame(rdd).toDF("datetime", "host", "ip", ...)
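For the more robust Row-based route mentioned above, a minimal sketch, assuming string-typed columns throughout and illustrative column names (extend the list to cover every field in the log):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Every field starts life as a string, since nothing is typed yet.
val columns = Seq("datetime", "elb", "client", "backend") // ...extend to all fields
val schema = StructType(columns.map(name => StructField(name, StringType, nullable = true)))

// Each token array becomes a Row whose length matches the schema.
val rowRdd = rdd.map(tokens => Row.fromSeq(tokens.take(columns.length).toSeq))

val df = spark.createDataFrame(rowRdd, schema)
df.createOrReplaceTempView("logs") // spark.sql(...) now works against it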
answered Nov 26 '18 at 21:24 by benlaird
Almost precise. Thank you. – San, Nov 26 '18 at 21:30
It is also replacing the spaces inside the quotes. Looking for a way to overcome that. – San, Nov 26 '18 at 22:09
Now that I think of it, Spark DataFrames have a CSV reader; it probably makes sense to just use that. – benlaird, Nov 26 '18 at 22:23
Scala CSV reader: spark.apache.org/docs/2.1.0/api/scala/…*):org.apache.spark.sql.DataFrame – benlaird, Nov 26 '18 at 22:26
I want to use it, but my dataset is a group of arrays; each row is an array, as shown in the log above. So I am looking to split everything on spaces, give names to the columns, and then run SQL on it. – San, Nov 26 '18 at 22:45
If you just want to split on spaces while retaining the strings within double quotes, you can use the Apache Commons CSV library.
import org.apache.commons.csv.{CSVFormat, CSVParser}
val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)  // parse once, reuse the record
val http = record.get(11)  // 12th field: the quoted request
val curl = record.get(12)  // 13th field: the quoted user agent
println(http)
println(curl)
Results:
GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0
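To run the same quote-aware parsing over the whole file rather than a single string, one possible sketch; the file path is a placeholder, and JavaConverters is the Scala 2.11/2.12 import (use scala.jdk.CollectionConverters on 2.13):

import org.apache.commons.csv.{CSVFormat, CSVParser}
import scala.collection.JavaConverters._

// Space-delimited, quote-aware format, reused for every line.
val format = CSVFormat.newFormat(' ').withQuote('"')

// Each line parses to one record; quoted fields keep their inner spaces.
val fields = sc.textFile("path/to/file").map { line =>
  CSVParser.parse(line, format).getRecords.get(0).iterator().asScala.toArray
}

// fields is an RDD[Array[String]] and can be given column names and a
// schema exactly as in the RDD-based answer above.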
answered Nov 27 '18 at 13:06 by stack0114106