How to replace white space with comma in Spark (with Scala)?
I have a log file like this. I want to create a DataFrame in Scala.
2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2
I want to replace all the spaces with commas so that I can use spark.sql, but I am unable to do so.
Here is everything I tried:
- Tried importing it as a text file first to see if there is a replaceAll method.
- Tried splitting on spaces.
Any suggestions? I went through the documentation and there is no mention of a replace function like the one in Pandas.
scala apache-spark apache-spark-sql databricks
asked Nov 26 '18 at 20:51 by San
Possible duplicate of how to use Regexp_replace in spark – user10465355, Nov 26 '18 at 20:54
3 Answers
You can simply tell Spark that your delimiter is a whitespace character, like this:
val df = spark.read.option("delimiter", " ").csv("path/to/file")
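Because Spark's CSV reader already treats the double quote as its default quote character, the quoted request and user-agent fields survive this read intact, which also addresses the quote problem raised in the comments below. Here is a slightly fuller sketch; the column names follow the AWS ELB access-log layout and are my assumption, not part of the original answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("elb-logs").getOrCreate()

// Space as delimiter; the default quote character (") keeps
// "GET https://www.example.com:443/ HTTP/1.1" together as one field.
val df = spark.read
  .option("delimiter", " ")
  .csv("path/to/file")
  .toDF("timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol")

df.createOrReplaceTempView("logs")
spark.sql("SELECT request, user_agent FROM logs").show(false)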
answered Nov 27 '18 at 7:08 by Oli
Since you don't have typed columns yet, I'd start with an RDD, split the text with a map, then convert to a DataFrame with a schema.
Roughly:
val rdd = sc.textFile(loglinePath).map(line => line.split("\\s+"))  // loglinePath: wherever the log file lives
Then you need to turn your RDD (where each record is an array of tokens) into a DataFrame. The most robust way would be to map your arrays to Row objects, as an RDD[Row] is what underlies a DataFrame; a sketch of that follows below.
A simpler way to get up and running, though, would be:
spark.createDataFrame(rdd).toDF("datetime", "host", "ip", ...)
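For the more robust Row-based route mentioned above, a minimal sketch, assuming string-typed columns throughout and illustrative column names (extend the list to cover every field in the log):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Every field starts life as a string, since nothing is typed yet.
val columns = Seq("datetime", "elb", "client", "backend") // ...extend to all fields
val schema = StructType(columns.map(name => StructField(name, StringType, nullable = true)))

// Each token array becomes a Row whose length matches the schema.
val rowRdd = rdd.map(tokens => Row.fromSeq(tokens.take(columns.length).toSeq))

val df = spark.createDataFrame(rowRdd, schema)
df.createOrReplaceTempView("logs") // spark.sql(...) now works against it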
answered Nov 26 '18 at 21:24 by benlaird
Almost precise. Thank you. – San, Nov 26 '18 at 21:30
It is also replacing the spaces inside the quotes. Looking for a way to overcome that. – San, Nov 26 '18 at 22:09
Now that I think of it, Spark DataFrames have a CSV reader; it probably makes sense to just use that. – benlaird, Nov 26 '18 at 22:23
Scala CSV reader: spark.apache.org/docs/2.1.0/api/scala/…*):org.apache.spark.sql.DataFrame – benlaird, Nov 26 '18 at 22:26
I want to use it, but my dataset is a group of arrays; each row is an array, as shown in the log above. So I am looking to split everything on spaces, give names to the columns, and then run SQL on it. – San, Nov 26 '18 at 22:45
If you just want to split on spaces while retaining the strings within double quotes, you can use the Apache Commons CSV library.
import org.apache.commons.csv.{CSVFormat, CSVParser}
val str = """2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 10.0.0.1:80 0.000086 0.001048 0.001337 200 200 0 57 "GET https://www.example.com:443/ HTTP/1.1" "curl/7.38.0" DHE-RSA-AES128-SHA TLSv1.2"""
val record = CSVParser.parse(str, CSVFormat.newFormat(' ').withQuote('"')).getRecords.get(0)  // parse once, reuse the record
val http = record.get(11)  // 12th field: the quoted request
val curl = record.get(12)  // 13th field: the quoted user agent
println(http)
println(curl)
Results:
GET https://www.example.com:443/ HTTP/1.1
curl/7.38.0
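To run the same quote-aware parsing over the whole file rather than a single string, one possible sketch; the file path is a placeholder, and JavaConverters is the Scala 2.11/2.12 import (use scala.jdk.CollectionConverters on 2.13):

import org.apache.commons.csv.{CSVFormat, CSVParser}
import scala.collection.JavaConverters._

// Space-delimited, quote-aware format, reused for every line.
val format = CSVFormat.newFormat(' ').withQuote('"')

// Each line parses to one record; quoted fields keep their inner spaces.
val fields = sc.textFile("path/to/file").map { line =>
  CSVParser.parse(line, format).getRecords.get(0).iterator().asScala.toArray
}

// fields is an RDD[Array[String]] and can be given column names and a
// schema exactly as in the RDD-based answer above.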
answered Nov 27 '18 at 13:06 by stack0114106