R feature extraction for text












My question is about text mining and text processing.

I would like to build a data frame from my text.



My data is:



text <- c("#*TeX: The Program,
#@Donald E. Knuth,
#t1986,
#c,
#index68,

#*Foundations of Databases.,
#@Serge Abiteboul,Richard Hull,Victor Vianu,
#t1995,
#c,
#index69,
#%1118192,
#%189,
#%1088975,
#%971271,
#%832272,
#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.")


My expected output is:



expected <- data.frame(title=c("#*TeX: The Program", "#*Foundations of Databases."), authors=c("#@Donald E. Knuth", "#@Serge Abiteboul,Richard Hull,Victor Vianu"), year=c("#t1986", "#t1995"), revue=c("#c", "#c"), id_paper=c("#index68", "#index69"),
id_ref=c(NA,"#%1118192, #%189, #%1088975, #%971271, #%832272"), abstract=c(NA, "#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+."))


My code is:



coln <- c("title", "authors", "year", "revue", "id_paper", "id_ref", "abstract")
# grep() matches per element, so split the single string into lines first
text_lines <- strsplit(text, "\n")[[1]]
title_index <- grep("^#[*]", text_lines)
authors_index <- grep("#@", text_lines)
year_index <- grep("#t", text_lines)
revue_index <- grep("#c", text_lines)
id_paper_index <- grep("#index", text_lines)
id_ref_index <- grep("#%", text_lines)
abstract_index <- grep("#!", text_lines)
stoc_index <- grep("#cSTOC", text_lines)
sigir_index <- grep("#cSIGIR", text_lines)
df <- matrix(NA, nrow=length(title_index), ncol=length(coln))
colnames(df) <- coln

########## title
for (i in seq_along(title_index)) {
  df[i, "title"] <- text_lines[title_index[i]]
}

########## authors
for (i in seq_along(authors_index)) {
  df[i, "authors"] <- text_lines[authors_index[i]]
}

########## year
for (i in seq_along(year_index)) {
  df[i, "year"] <- text_lines[year_index[i]]
}

########## revue
for (i in seq_along(revue_index)) {
  df[i, "revue"] <- text_lines[revue_index[i]]
}

########## id_paper
for (i in seq_along(id_paper_index)) {
  df[i, "id_paper"] <- text_lines[id_paper_index[i]]
}

########## id_ref
# Problem: an article can have several "#%" lines (or none), so the i-th match
# does not belong to the i-th article. With the sample data there are five
# matches but only two rows.
for (i in seq_along(id_ref_index)) {
  df[i, "id_ref"] <- text_lines[id_ref_index[i]]
}

########## abstract
# Same problem: not every article has an abstract, so the rows get misaligned.
for (i in seq_along(abstract_index)) {
  df[i, "abstract"] <- text_lines[abstract_index[i]]
}


So I would like to collect all the references of an article in a single row.



Thanks in advance for any solution that concatenates the citations of one article into a single column, separated by commas.



Thank you :)










  • What have you tried so far?

    – Manuel Bickel
    Nov 26 '18 at 20:12











  • I have tried to extract with grep, but I can't concatenate id_ref into a single row.

    – cincinnatus
    Nov 26 '18 at 20:16











  • See my comment on the answer below...

    – Manuel Bickel
    Nov 26 '18 at 21:19
















r text nlp text-mining feature-extraction






asked Nov 26 '18 at 20:07, edited Dec 1 '18 at 16:14

– cincinnatus













2 Answers
New and improved



# split the text into one chunk per article (each article starts with "#*")
text.n <- strsplit(text, "\n(?=#\\*)", perl=TRUE)[[1]]; text.n

# split each article into lines and drop the trailing comma
# that each line carries in the question's sample data
text.s <- lapply(text.n, function(x) sub(",$", "", strsplit(x, "\n")[[1]]))

patterns <- list(title="^#\\*",
                 authors="^#@",
                 year="^#t",
                 revue="^#c",
                 id_paper="^#index",
                 id_ref="^#%",
                 abstract="^#!")

# for each article and each field: keep the matching lines, strip the tag,
# and collapse multiple matches (e.g. several id_ref lines) into one string
tex.l <- lapply(text.s, function(x)
    lapply(patterns, function(y)
        paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")
    )
)

tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)
tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)
colnames(tex.df) <- names(patterns)

str(tex.df)

# 'data.frame': 2 obs. of 7 variables:
# $ title : chr "TeX: The Program" "Foundations of Databases."
# $ authors : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"
# $ year : chr "1986" "1995"
# $ revue : chr "" ""
# $ id_paper: chr "68" "69"
# $ id_ref : chr "" "1118192,189,1088975,971271,832272"
# $ abstract: chr "" "From the Book: This book will teach you how to write
# specifications of computer systems, using the language TLA+."
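
The question's expected output uses NA for missing fields, while the str() output above shows empty strings. A minimal sketch of one way to convert them, assuming a data frame shaped like tex.df. The small `demo` data frame below is a hand-made stand-in with only three of the columns, not something computed by the code above:

```r
# Hand-made stand-in with the same shape as part of tex.df (assumption).
demo <- data.frame(
  title  = c("TeX: The Program", "Foundations of Databases."),
  year   = c("1986", "1995"),
  id_ref = c("", "1118192,189,1088975,971271,832272"),
  stringsAsFactors = FALSE
)

# Replace empty strings with NA, column by column; demo[] keeps the data frame.
demo[] <- lapply(demo, function(col) ifelse(col == "", NA_character_, col))

# The years are still character; convert them now that the "#t" tag is gone.
demo$year <- as.integer(demo$year)

str(demo)
```

Whether to convert `year` or strip the tags at all depends on what the later analysis needs; the question's expected output keeps the tags.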





  • You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

    – Manuel Bickel
    Nov 26 '18 at 21:16











  • The data frame will have one row per title, since each article necessarily has a title.

    – cincinnatus
    Nov 26 '18 at 21:20











  • @ManuelBickel: But then we'd just end up with a vector.

    – AkselA
    Nov 26 '18 at 21:34






  • @ManuelBickel: Thanks, but I already figured out a way.

    – AkselA
    Nov 26 '18 at 22:34






  • @ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

    – AkselA
    Nov 26 '18 at 22:55
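
The two options raised in the comments above, paste0(..., collapse = ", ") and list(), can be sketched as follows. The `refs` vector is copied from the second article in the question; the names `df_chr` and `df_list` are illustrative only.

```r
refs <- c("#%1118192", "#%189", "#%1088975", "#%971271", "#%832272")

# Option 1: collapse into one string, giving an ordinary character column.
collapsed <- paste0(refs, collapse = ", ")
df_chr <- data.frame(id_paper = "#index69", id_ref = collapsed,
                     stringsAsFactors = FALSE)

# Option 2: keep the vector intact in a list column; I() stops data.frame()
# from spreading the list over several columns.
df_list <- data.frame(id_paper = "#index69", id_ref = I(list(refs)),
                      stringsAsFactors = FALSE)

df_chr$id_ref               # one string holding all references
length(df_list$id_ref[[1]]) # the references are still separate elements
```

The collapsed string is easier to print and export; the list column is more convenient if the individual references are needed again later.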





















Here is a solution based on @AkselA's answer. I could not fit this into comments, hence an additional answer (I know I could have formatted it more nicely...)



# split into individual docs
text.s = strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]]

# function to extract information from individual docs
extract_info = function(x, patterns = list(title="^#\\*",
                                           authors="^#@",
                                           year="^#t",
                                           revue="^#c",
                                           id_paper="^#index",
                                           id_ref="^#%",
                                           abstract="^#!")) {
  lapply(patterns, function(p) {
    extract = grep(p, x, value = TRUE)
    # here you check the length of the potential output
    # and modify the type according to your needs
    if (length(extract) > 1) {
      extract = list(extract)
    } else if (length(extract) == 0) {
      extract = NA
    }
    return(extract)
  })
}

# apply the function to the data
# and rbind the results into a matrix with one row per doc
do.call(rbind,
        lapply(text.s, function(x) {
          # split into lines and drop the trailing comma of each line
          x = sub(",$", "", strsplit(x, "\n")[[1]])
          extract_info(x)
        })
)

# title authors year revue id_paper id_ref
# [1,] "#*TeX: The Program" "#@Donald E. Knuth" "#t1986" "#c" "#index68" NA
# [2,] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c" "#index69" List,1
# abstract
# [1,] NA
# [2,] "#!From the Book: This book will teach you how to write specifications of computer systems, using th" [truncated]
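
The result above is a matrix whose cells are lists, which is why the id_ref cell of the second row prints as List,1. A possible follow-up, sketched here on a small hand-built matrix `m` (an assumption standing in for the real result, with only two columns and two references), flattens every cell to a single string and yields a regular data frame. The helper name `flatten_cell` is hypothetical.

```r
# Hand-built stand-in: rbind() of two named lists gives a list-mode matrix.
m <- rbind(
  list(title = "#*TeX: The Program", id_ref = NA),
  list(title = "#*Foundations of Databases.",
       id_ref = list(c("#%1118192", "#%189")))
)

# Collapse one cell to a single string; a missing cell stays NA.
flatten_cell <- function(cell) {
  values <- unlist(cell)
  if (all(is.na(values))) NA_character_ else paste(values, collapse = ", ")
}

# vapply() walks the cells in column-major order, so matrix() restores
# the original layout before the conversion to a data frame.
flat <- as.data.frame(
  matrix(vapply(m, flatten_cell, character(1)),
         nrow = nrow(m), dimnames = dimnames(m)),
  stringsAsFactors = FALSE
)
str(flat)
```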





  • Thank you very much, your answer is correct, but I could only accept one solution. Thank you, you are a genius.

    – cincinnatus
    Nov 27 '18 at 5:18











First answer ("New and improved"): answered Nov 26 '18 at 21:01, edited Nov 26 '18 at 22:46

– AkselA













  • You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

    – Manuel Bickel
    Nov 26 '18 at 21:16











  • The size of the data frame will be that of the title. Since each article necessarily has a title.

    – cincinnatus
    Nov 26 '18 at 21:20











  • @ManuelBickel: But then we'd just end up with a vector.

    – AkselA
    Nov 26 '18 at 21:34






  • 1





    @ManuelBickel: Thanks, but I already figured out a way.

    – AkselA
    Nov 26 '18 at 22:34






  • 1





    @ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

    – AkselA
    Nov 26 '18 at 22:55





















  • You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

    – Manuel Bickel
    Nov 26 '18 at 21:16











  • The size of the data frame will be that of the title. Since each article necessarily has a title.

    – cincinnatus
    Nov 26 '18 at 21:20











  • @ManuelBickel: But then we'd just end up with a vector.

    – AkselA
    Nov 26 '18 at 21:34






  • 1





    @ManuelBickel: Thanks, but I already figured out a way.

    – AkselA
    Nov 26 '18 at 22:34






  • 1





    @ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

    – AkselA
    Nov 26 '18 at 22:55



















You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16





You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16













The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20





The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20













@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34





@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34




1




1





@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34





@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34




1




1





@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55







@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55















1














Here a solution based on the answer of @AkselA. I could not deal with this only in comments, therefore, an additional answer (I know I could have formatted it more nicely...)



#split into individual docs
text.s = strsplit(text, "n(?=#\*)", perl = T)[[1]]

# function to extract information from individual docs
extract_info = function(x, patterns = list(title="^*#\*",
autors="^*#@",
year="^*#t",
revue="^*#c",
id_paper="^*#index",
id_ref="^*#%",
abstract="^*#!")) {
lapply(patterns, function(p) {
extract = grep(p, x, value = T)
# here you check the length of the potential output
# and modify the type according to your needs
if (length(extract) > 1) {
extract = list(extract)
} else if (length(extract) == 0) {
extract = NA
}
return(extract)
})
}

# apply the function to the data
# and rbind it into a data.frame
do.call(rbind,
lapply(text.s, function(x) {
x = strsplit(x, "\n")[[1]]
extract_info(x)
})
)

# title autors year revue id_paper id_ref
# [1,] "#*TeX: The Program" "#@Donald E. Knuth" "#t1986" "#c" "#index68" NA
# [2,] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c" "#index69" List,1
# abstract
# [1,] NA
# [2,] "#!From the Book: This book will teach you how to write specifications of computer systems, using th" [truncated]





share|improve this answer



















  • 1





    Thank you very much, your answer is correct. But I could only give one solution. Thank you, you are a genius.

    – cincinnatus
    Nov 27 '18 at 5:18
















1
























Here is a solution based on the answer of @AkselA. I could not fit this into comments, hence the additional answer (I know I could have formatted it more nicely...).



# split into individual docs: break at each newline that is followed by a "#*" title marker
text.s = strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]]

# function to extract information from individual docs
# (each pattern allows optional leading whitespace before the field marker)
extract_info = function(x, patterns = list(title = "^\\s*#\\*",
                                           autors = "^\\s*#@",
                                           year = "^\\s*#t",
                                           revue = "^\\s*#c",
                                           id_paper = "^\\s*#index",
                                           id_ref = "^\\s*#%",
                                           abstract = "^\\s*#!")) {
  lapply(patterns, function(p) {
    extract = grep(p, x, value = TRUE)
    # here you check the length of the potential output
    # and modify the type according to your needs
    if (length(extract) > 1) {
      # several matches (e.g. multiple references): keep them together in one cell
      extract = list(extract)
    } else if (length(extract) == 0) {
      # field missing in this doc
      extract = NA
    }
    return(extract)
  })
}

# apply the function to each doc
# and rbind the results into a matrix (cells with several matches hold a list)
do.call(rbind,
        lapply(text.s, function(x) {
          x = strsplit(x, "\n")[[1]]
          extract_info(x)
        })
)

# title autors year revue id_paper id_ref
# [1,] "#*TeX: The Program" "#@Donald E. Knuth" "#t1986" "#c" "#index68" NA
# [2,] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c" "#index69" List,1
# abstract
# [1,] NA
# [2,] "#!From the Book: This book will teach you how to write specifications of computer systems, using th" [truncated]
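
If a plain data.frame with one string per cell is preferred (as in the question's expected output, where several references sit in one comma-separated string), the matches could be collapsed with paste() instead of being wrapped in a list. The following is only a minimal sketch under assumptions: the sample input is a shortened, hypothetical version of the question's data without trailing commas, and the helper name extract_row is made up for illustration.

```r
# Hypothetical sample input: a single string, one field per line.
text <- paste(
  "#*TeX: The Program",
  "#@Donald E. Knuth",
  "#t1986",
  "#c",
  "#index68",
  "#*Foundations of Databases.",
  "#@Serge Abiteboul,Richard Hull,Victor Vianu",
  "#t1995",
  "#c",
  "#index69",
  "#%1118192",
  "#%189",
  "#!From the Book: an abstract.",
  sep = "\n"
)

# One regular expression per output column.
patterns <- c(title = "^#\\*", authors = "^#@", year = "^#t",
              revue = "^#c", id_paper = "^#index",
              id_ref = "^#%", abstract = "^#!")

# Split into records at each newline followed by "#*",
# then split every record into its lines.
records <- strsplit(strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]], "\n")

# Illustrative helper: returns one named character vector per record.
# Several matches are collapsed into one string, no match becomes NA.
extract_row <- function(lines) {
  vapply(patterns, function(p) {
    hit <- grep(p, lines, value = TRUE)
    if (length(hit) == 0) NA_character_ else paste(hit, collapse = ", ")
  }, character(1))
}

result <- as.data.frame(do.call(rbind, lapply(records, extract_row)),
                        stringsAsFactors = FALSE)
```

Because every cell is then a plain character value, result can be written out with write.csv() or filtered like any other data.frame; the trade-off is that the individual references have to be split again if they are needed separately.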






answered Nov 26 '18 at 22:28









Manuel Bickel

    Thank you very much, your answer is correct, but I could only accept one solution. Thank you, you are a genius.

    – cincinnatus
    Nov 27 '18 at 5:18































