R feature extraction for text
My question is about text mining and text processing.
I would like to build a data frame from my text.
My data is:
text <- c("#*TeX: The Program,
#@Donald E. Knuth,
#t1986,
#c,
#index68,
""
#*Foundations of Databases.,
#@Serge Abiteboul,Richard Hull,Victor Vianu,
#t1995,
#c,
#index69,
#%1118192,
#%189,
#%1088975,
#%971271,
#%832272,
#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.")
My expected output is:
expected <- data.frame(title=c("#*TeX: The Program", "#*Foundations of Databases."), authors=c("#@Donald E. Knuth", "#@Serge Abiteboul,Richard Hull,Victor Vianu"), year=c("#t1986", "#t1995"), revue=c("#c", "#c"), id_paper=c("#index68", "#index69"),
id_ref=c(NA,"#%1118192, #%189, #%1088975, #%971271, #%832272"), abstract=c(NA, "#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+."))
My code is:
coln <- c("title", "authors", "year", "revue","id_paper", "id_ref", "abstract")
title_index <- grep("^#[*]", text)
authors_index <- grep("#@", text)
year_index <- grep("#t", text)
revue_index <- grep("#c", text)
id_paper_index <- grep("#index", text)
id_ref_index <- grep("#%", text)
abstract_index <- grep("#!", text)
df <- matrix(NA, nrow=length(title_index), ncol=length(coln))
colnames(df) <- coln
stoc_index <- grep("#cSTOC", text)
sigir_index <- grep("#cSIGIR", text)
########## title
{der_pos <- length(title_index)
title_position <- c(title_index , der_pos)
for(i in 1:length(title_position)){
if(i != length(title_position)){
df[i, "title"] <- text[title_position[i]]
}
}
}
########## author
{der_pos <- length(authors_index)
authors_position <- c(authors_index )
for(i in 1:length(authors_position)){
if(i != length(authors_position)){
df[i, "authors"] <- text[authors_position[i]]
}
}
}
########## year
{der_pos <- length(year_index)
year_position <- c(year_index , der_pos)
for(i in 1:length(year_position)){
if(i != length(year_position)){
df[i, "year"] <- text[year_position[i]]
}
}
}
########## revue
{der_pos <- length(revue_index)
revue_position <- c(revue_index )
for(i in 1:length(revue_position)){
if(i != length(revue_position)){
df[i, "revue"] <- text[revue_position[i]]
}
}
}
########## id_paper
{der_pos <- length(id_paper_index)
id_paper_position <- c(id_paper_index , der_pos)
for(i in 1:length(id_paper_position)){
if(i != length(id_paper_position)){
df[i, "id_paper"] <- text[id_paper_position[i]]
}
}
}
########## id_ref
{der_pos <- length(id_ref_index)
id_ref_position <- c(id_ref_index , der_pos)
for(i in 1:length(id_ref_position)){
if(i != length(id_ref_position)){
df[i, "id_ref"] <- text[id_ref_position[i]]
}
}
}
########## abstract
{der_pos <- length(abstract_index)
abstract_position <- c(abstract_index , der_pos)
for(i in 1:length(abstract_position)){
if(i != length(abstract_position)){
df[i, "abstract"] <- text[abstract_position[i]]
}
}
}
So I would like to collect all the references of one article in a single row.
Thank you in advance if you have a solution for concatenating the many citations of one article into one column, separated by commas.
Thank you :)
r text nlp text-mining feature-extraction
edited Dec 1 '18 at 16:14
asked Nov 26 '18 at 20:07 by cincinnatus
What have you tried so far?
– Manuel Bickel
Nov 26 '18 at 20:12
I have tried to extract the fields with grep, but I can't concatenate the id_ref values into a single row.
– cincinnatus
Nov 26 '18 at 20:16
See my comment on the answer below...
– Manuel Bickel
Nov 26 '18 at 21:19
2 Answers
New and improved
# split the text into one chunk per article (each article starts with "#*")
text.n <- strsplit(text, "\n(?=#\\*)", perl=TRUE)[[1]]; text.n
# split each article into its individual lines
text.s <- lapply(text.n, function(x) strsplit(x, "\n")[[1]])
patterns <- list(title="^#\\*",
                 autors="^#@",
                 year="^#t",
                 revue="^#c",
                 id_paper="^#index",
                 id_ref="^#%",
                 abstract="^#!")
# for each article and field: strip the prefix and join all matching lines with ","
tex.l <- lapply(text.s, function(x)
  lapply(patterns, function(y)
    paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")
  )
)
tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)
tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)
colnames(tex.df) <- names(patterns)
str(tex.df)
# 'data.frame': 2 obs. of 7 variables:
# $ title : chr "TeX: The Program" "Foundations of Databases."
# $ autors : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"
# $ year : chr "1986" "1995"
# $ revue : chr "" ""
# $ id_paper: chr "68" "69"
# $ id_ref : chr "" "1118192,189,1088975,971271,832272"
# $ abstract: chr "" "From the Book: This book will teach you how to write
# specifications of computer systems, using the language TLA+."
edited Nov 26 '18 at 22:46
answered Nov 26 '18 at 21:01 by AkselA
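The output above stores a missing field as an empty string, whereas the expected output in the question uses NA. A minimal sketch of one way to convert, assuming a data frame shaped like tex.df (the small tex.df built here is only a hypothetical stand-in so the snippet runs on its own):

```r
# Hypothetical stand-in for the tex.df produced by the answer's code
tex.df <- data.frame(id_paper = c("68", "69"),
                     id_ref   = c("", "1118192,189"),
                     stringsAsFactors = FALSE)

# Replace every empty string in the data frame with NA
tex.df[tex.df == ""] <- NA

tex.df
```

Whether NA or "" is preferable depends on what is done with the column afterwards; NA makes is.na() checks possible.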
You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.
– Manuel Bickel
Nov 26 '18 at 21:16
The number of rows in the data frame will equal the number of titles, since each article necessarily has a title.
– cincinnatus
Nov 26 '18 at 21:20
@ManuelBickel: But then we'd just end up with a vector.
– AkselA
Nov 26 '18 at 21:34
@ManuelBickel: Thanks, but I already figured out a way.
– AkselA
Nov 26 '18 at 22:34
@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.
– AkselA
Nov 26 '18 at 22:55
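The two options mentioned in the comments, list() and paste0(..., collapse = ", "), can be sketched roughly as follows. The refs vector is an assumption for illustration, using three of the reference IDs from the question's sample data:

```r
# Assumed example input: the "#%" lines belonging to one article
refs <- c("#%1118192", "#%189", "#%1088975")

# Option 1: collapse all references into a single comma-separated string
as_string <- paste0(refs, collapse = ", ")

# Option 2: wrap the vector in a list so it occupies one cell
# while the individual references stay separate
as_list <- list(refs)

as_string
length(as_list)       # one entry ...
length(as_list[[1]])  # ... holding three references
```

The collapsed string is easier to print and export, while the list keeps each reference individually accessible.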
Here is a solution based on @AkselA's answer. I could not cover this in comments alone, hence an additional answer (I know I could have formatted it more nicely...)
# split into individual docs
text.s = strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]]
# function to extract information from individual docs
extract_info = function(x, patterns = list(title="^#\\*",
                                           autors="^#@",
                                           year="^#t",
                                           revue="^#c",
                                           id_paper="^#index",
                                           id_ref="^#%",
                                           abstract="^#!")) {
  lapply(patterns, function(p) {
    extract = grep(p, x, value = TRUE)
    # here you check the length of the potential output
    # and modify the type according to your needs
    if (length(extract) > 1) {
      extract = list(extract)
    } else if (length(extract) == 0) {
      extract = NA
    }
    return(extract)
  })
}
# apply the function to the data
# and rbind the results into a matrix of list cells (one row per doc)
do.call(rbind,
        lapply(text.s, function(x) {
          x = strsplit(x, "\n")[[1]]
          extract_info(x)
        })
)
# title autors year revue id_paper id_ref
# [1,] "#*TeX: The Program" "#@Donald E. Knuth" "#t1986" "#c" "#index68" NA
# [2,] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c" "#index69" List,1
# abstract
# [1,] NA
# [2,] "#!From the Book: This book will teach you how to write specifications of computer systems, using th" [truncated]
answered Nov 26 '18 at 22:28 by Manuel Bickel
Thank you very much, your answer is correct, but I could only accept one solution. Thank you, you are a genius.
– cincinnatus
Nov 27 '18 at 5:18
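Combining the ideas from both answers, here is a hedged sketch that aims at the expected output from the question: prefixes kept, multiple references joined with ", ", and NA for missing fields. It assumes text is a single newline-separated string without the trailing commas shown in the question, and collapse_field is a hypothetical helper name introduced only for this illustration:

```r
# Assumed input shape: one string, one field per line, blank line between articles
text <- paste(
  "#*TeX: The Program", "#@Donald E. Knuth", "#t1986", "#c", "#index68", "",
  "#*Foundations of Databases.", "#@Serge Abiteboul,Richard Hull,Victor Vianu",
  "#t1995", "#c", "#index69",
  "#%1118192", "#%189", "#%1088975", "#%971271", "#%832272",
  "#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.",
  sep = "\n")

patterns <- c(title = "^#\\*", authors = "^#@", year = "^#t", revue = "^#c",
              id_paper = "^#index", id_ref = "^#%", abstract = "^#!")

# One chunk per article (split before each "#*"), then one element per line
docs <- strsplit(strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]], "\n")

# Hypothetical helper: join all lines matching a pattern, or NA if none match
collapse_field <- function(lines, pattern) {
  hits <- grep(pattern, lines, value = TRUE)
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
}

# Build a one-row data frame per article and stack them
rows <- lapply(docs, function(lines) {
  as.data.frame(lapply(patterns, collapse_field, lines = lines),
                stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)

str(result)
```

Collapsing to a string keeps every column a plain character vector; if the individual references are needed later, strsplit(result$id_ref, ", ") recovers them.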