R feature extraction for text

My question is about text mining and text processing.

I would like to build a data frame from my text.

My data is:

text <- c("#*TeX: The Program,
#@Donald E. Knuth,
#t1986,
#c,
#index68,

#*Foundations of Databases.,
#@Serge Abiteboul,Richard Hull,Victor Vianu,
#t1995,
#c,
#index69,
#%1118192,
#%189,
#%1088975,
#%971271,
#%832272,
#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.")

My expected output is:

expected <- data.frame(title=c("#*TeX: The Program", "#*Foundations of Databases."), authors=c("#@Donald E. Knuth", "#@Serge Abiteboul,Richard Hull,Victor Vianu"), year=c("#t1986", "#t1995"), revue=c("#c", "#c"), id_paper=c("#index68", "#index69"),
                       id_ref=c(NA,"#%1118192, #%189, #%1088975, #%971271, #%832272"), abstract=c(NA, "#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+."))

My code is:

coln <- c("title", "authors", "year", "revue","id_paper", "id_ref", "abstract")

# note: grep() looks at each element of `text`, so these indices
# assume one field per element
title_index <- grep("^#[*]", text)
authors_index <- grep("#@", text)
year_index <- grep("#t", text)
revue_index <- grep("#c", text)
id_paper_index <- grep("#index", text)
id_ref_index <- grep("#%", text)
abstract_index <- grep("#!", text)

df <- matrix(NA, nrow=length(title_index), ncol=length(coln))
colnames(df) <- coln

stoc_index <- grep("#cSTOC", text)
sigir_index <- grep("#cSIGIR", text)

########## title
der_pos <- length(title_index)
title_position <- c(title_index, der_pos)
for(i in 1:length(title_position)){
  if(i != length(title_position)){
    df[i, "title"] <- text[title_position[i]]
  }
}

########## authors
der_pos <- length(authors_index)
authors_position <- c(authors_index)
for(i in 1:length(authors_position)){
  if(i != length(authors_position)){
    df[i, "authors"] <- text[authors_position[i]]
  }
}

########## year
der_pos <- length(year_index)
year_position <- c(year_index, der_pos)
for(i in 1:length(year_position)){
  if(i != length(year_position)){
    df[i, "year"] <- text[year_position[i]]
  }
}

########## revue
der_pos <- length(revue_index)
revue_position <- c(revue_index)
for(i in 1:length(revue_position)){
  if(i != length(revue_position)){
    df[i, "revue"] <- text[revue_position[i]]
  }
}

########## id_paper
der_pos <- length(id_paper_index)
id_paper_position <- c(id_paper_index, der_pos)
for(i in 1:length(id_paper_position)){
  if(i != length(id_paper_position)){
    df[i, "id_paper"] <- text[id_paper_position[i]]
  }
}

########## id_ref
der_pos <- length(id_ref_index)
id_ref_position <- c(id_ref_index, der_pos)
for(i in 1:length(id_ref_position)){
  if(i != length(id_ref_position)){
    df[i, "id_ref"] <- text[id_ref_position[i]]
  }
}

########## abstract
der_pos <- length(abstract_index)
abstract_position <- c(abstract_index, der_pos)
for(i in 1:length(abstract_position)){
  if(i != length(abstract_position)){
    df[i, "abstract"] <- text[abstract_position[i]]
  }
}

So I would like to extract the references of each article into a single row.

Thank you in advance if you have a solution for concatenating the multiple citations of one article into one column, separated by commas.

Thank you :)

edited Dec 1 '18 at 16:14

asked Nov 26 '18 at 20:07

cincinnatus


What have you tried so far?

– Manuel Bickel
Nov 26 '18 at 20:12

I have tried to extract with grep, but I can't concatenate id_ref into a single row.

– cincinnatus
Nov 26 '18 at 20:16

See my comment on the answer below...

– Manuel Bickel
Nov 26 '18 at 21:19
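
The grep calls in the question return positions over the whole input, so nothing ties a #% line to the article it belongs to. Below is a minimal sketch of one way to keep that link. It assumes the input is a character vector with one field per element (as readLines() would give) and no trailing commas; the names lines, record_id and get_field are made up for illustration and do not come from the question.

```r
# Hypothetical sample: one field per element, as readLines() would return.
lines <- c("#*TeX: The Program", "#@Donald E. Knuth", "#t1986", "#c", "#index68",
           "",
           "#*Foundations of Databases.", "#@Serge Abiteboul,Richard Hull,Victor Vianu",
           "#t1995", "#c", "#index69", "#%1118192", "#%189")

# Every title line starts a new record, so a running count of title
# lines labels each line with the record it belongs to.
record_id <- cumsum(grepl("^#[*]", lines))

# Collect the lines matching `pattern` within each record: several
# matches are joined with ", ", no match becomes NA.
get_field <- function(pattern) {
  vapply(split(lines, record_id), function(rec) {
    hits <- grep(pattern, rec, value = TRUE)
    if (length(hits) == 0) NA_character_ else paste(hits, collapse = ", ")
  }, character(1), USE.NAMES = FALSE)
}

df <- data.frame(title    = get_field("^#[*]"),
                 id_paper = get_field("^#index"),
                 id_ref   = get_field("^#%"),
                 stringsAsFactors = FALSE)
```

The remaining columns would follow the same pattern with their own prefixes.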


r text nlp text-mining feature-extraction

2 Answers
New and improved

# split into one string per article: break at a newline that is followed by "#*"
text.n <- strsplit(text, "\n(?=#\\*)", perl=TRUE)[[1]]; text.n

# split each article into its lines
text.s <- lapply(text.n, function(x) strsplit(x, "\n")[[1]])

patterns <- list(title="^#\\*", 
                autors="^#@",
                  year="^#t",
                 revue="^#c",
              id_paper="^#index",
                id_ref="^#%",
              abstract="^#!")

# for each article and each field: keep the matching lines, strip the
# prefix, and join multiple matches with commas
tex.l <- lapply(text.s, function(x)
  lapply(patterns, function(y)
    paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")
  )
) 

tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)
tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)
colnames(tex.df) <- names(patterns)

str(tex.df)

# 'data.frame': 2 obs. of  7 variables:
# $ title   : chr "TeX: The Program" "Foundations of Databases."
# $ autors  : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"
# $ year    : chr "1986" "1995"
# $ revue   : chr "" ""
# $ id_paper: chr "68" "69"
# $ id_ref  : chr "" "1118192,189,1088975,971271,832272"
# $ abstract: chr "" "From the Book: This book will teach you how to write 
#                     specifications of computer systems, using the language TLA+."
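
The expected output in the question uses NA for missing fields, whereas the result above has empty strings. A small follow-up sketch, assuming a data frame shaped like tex.df (the three-column stand-in below is hypothetical and only there to make the snippet self-contained):

```r
# Hypothetical stand-in for the tex.df produced above (three columns only).
tex.df <- data.frame(title  = c("TeX: The Program", "Foundations of Databases."),
                     year   = c("1986", "1995"),
                     id_ref = c("", "1118192,189,1088975,971271,832272"),
                     stringsAsFactors = FALSE)

# paste(character(0), collapse=",") yields "", so absent fields show up
# as empty strings; recode them as NA.
tex.df[tex.df == ""] <- NA

# Years arrive as text; convert them now that the "#t" prefix is gone.
tex.df$year <- as.integer(tex.df$year)
```

Whether NA or "" is preferable depends on what is done with the table afterwards; NA tends to work better with is.na() and complete.cases().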

edited Nov 26 '18 at 22:46

answered Nov 26 '18 at 21:01

AkselA


You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16

The number of rows in the data frame will equal the number of titles, since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20

@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34

@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34

@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55
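
To make the list() versus paste0(..., collapse = ", ") suggestion from the comments concrete, here is a short sketch of both options. The refs vector and the two-row df are hypothetical sample values, not taken from either answer.

```r
refs <- c("#%1118192", "#%189", "#%1088975")   # hypothetical matches for one article

# Option 1: one string per article; easy to print and to write to CSV.
as_string <- paste0(refs, collapse = ", ")

# Option 2: a list column; each cell keeps the individual references.
df <- data.frame(id_paper = c("#index68", "#index69"), stringsAsFactors = FALSE)
df$id_ref <- list(NA_character_, refs)

lengths(df$id_ref)   # number of stored elements per article
```

The string form matches the expected output in the question; the list column is more convenient if the references need to be counted or joined against other tables later.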


Here is a solution based on the answer of @AkselA. I could not deal with this in comments alone, hence an additional answer (I know I could have formatted it more nicely...)

# split into individual docs
text.s = strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]]

# function to extract information from individual docs
extract_info = function(x, patterns = list(title="^#\\*", 
                                           autors="^#@",
                                           year="^#t",
                                           revue="^#c",
                                           id_paper="^#index",
                                           id_ref="^#%",
                                           abstract="^#!")) {
  lapply(patterns, function(p) {
    extract = grep(p, x, value = TRUE)
    # here you check the length of the potential output
    # and modify the type according to your needs
    if (length(extract) > 1) {
      extract = list(extract)
    } else if (length(extract) == 0) {
      extract = NA
    }
    return(extract)
  })
}

# apply the function to the data
# and rbind the results into a matrix with one row per doc
do.call(rbind, 
        lapply(text.s, function(x) {
          x = strsplit(x, "\n")[[1]]
          extract_info(x)
        })
)

# title                         autors                                        year     revue id_paper   id_ref
# [1,] "#*TeX: The Program"          "#@Donald E. Knuth"                           "#t1986" "#c"  "#index68" NA
# [2,] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c"  "#index69" List,1
# abstract
# [1,] NA
# [2,] "#!From the Book: This book will teach you how to write specifications of computer systems, using th" [truncated]
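
Both answers start by splitting the input at a newline that is followed by "#*". A minimal sketch of what that lookahead split does, on a made-up two-record string (the text value and the names records and fields are illustrative only):

```r
# Hypothetical two-record input as a single string.
text <- "#*A\n#t1986\n\n#*B\n#t1995\n#%1\n#%2"

# "\n(?=#\\*)" matches a newline only when "#*" follows; the lookahead
# consumes nothing, so "#*" stays at the start of the next record.
records <- strsplit(text, "\n(?=#\\*)", perl = TRUE)[[1]]

# Split each record into its lines and drop empty ones.
fields <- lapply(strsplit(records, "\n", fixed = TRUE), function(x) x[nzchar(x)])
```

Splitting on "#*" directly would remove the prefix from every title, which is why the lookahead is used here.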

answered Nov 26 '18 at 22:28

Manuel Bickel


Thank you very much, your answer is correct, but I could only accept one answer. Thank you, you are a genius.

– cincinnatus
Nov 27 '18 at 5:18

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53488266%2fr-feature-extraction-for-text%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

New and improved

text.n <- strsplit(text, "n(?=#\*)", perl=TRUE)[[1]]; text.n



text.s <- lapply(text.n, function(x) strsplit(x, "n")[[1]])



patterns <- list(title="^#\*", 

                autors="^#@",

                  year="^#t",

                 revue="^#c",

              id_paper="^#index",

                id_ref="^#%",

              abstract="^#!")



tex.l <- lapply(text.s, function(x)

  lapply(patterns, function(y)

    paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")

  )

) 



tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)

tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)

colnames(tex.df) <- names(patterns)



str(tex.df)



# 'data.frame': 2 obs. of  7 variables:

# $ title   : chr "TeX: The Program" "Foundations of Databases."

# $ autors  : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"

# $ year    : chr "1986" "1995"

# $ revue   : chr "" ""

# $ id_paper: chr "68" "69"

# $ id_ref  : chr "" "1118192,189,1088975,971271,832272"

# $ abstract: chr "" "From the Book: This book will teach you how to write 

#                     specifications of computer systems, using the language TLA+."

edited Nov 26 '18 at 22:46

answered Nov 26 '18 at 21:01

AkselA

4,51421325

You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16

The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20

@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34

1

@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34

1

@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55

|
show 3 more comments

New and improved

text.n <- strsplit(text, "n(?=#\*)", perl=TRUE)[[1]]; text.n



text.s <- lapply(text.n, function(x) strsplit(x, "n")[[1]])



patterns <- list(title="^#\*", 

                autors="^#@",

                  year="^#t",

                 revue="^#c",

              id_paper="^#index",

                id_ref="^#%",

              abstract="^#!")



tex.l <- lapply(text.s, function(x)

  lapply(patterns, function(y)

    paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")

  )

) 



tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)

tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)

colnames(tex.df) <- names(patterns)



str(tex.df)



# 'data.frame': 2 obs. of  7 variables:

# $ title   : chr "TeX: The Program" "Foundations of Databases."

# $ autors  : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"

# $ year    : chr "1986" "1995"

# $ revue   : chr "" ""

# $ id_paper: chr "68" "69"

# $ id_ref  : chr "" "1118192,189,1088975,971271,832272"

# $ abstract: chr "" "From the Book: This book will teach you how to write 

#                     specifications of computer systems, using the language TLA+."

edited Nov 26 '18 at 22:46

answered Nov 26 '18 at 21:01

AkselA

4,51421325

You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16

The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20

@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34

1

@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34

1

@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55

|
show 3 more comments

New and improved

text.n <- strsplit(text, "n(?=#\*)", perl=TRUE)[[1]]; text.n



text.s <- lapply(text.n, function(x) strsplit(x, "n")[[1]])



patterns <- list(title="^#\*", 

                autors="^#@",

                  year="^#t",

                 revue="^#c",

              id_paper="^#index",

                id_ref="^#%",

              abstract="^#!")



tex.l <- lapply(text.s, function(x)

  lapply(patterns, function(y)

    paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")

  )

) 



tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)

tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)

colnames(tex.df) <- names(patterns)



str(tex.df)



# 'data.frame': 2 obs. of  7 variables:

# $ title   : chr "TeX: The Program" "Foundations of Databases."

# $ autors  : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"

# $ year    : chr "1986" "1995"

# $ revue   : chr "" ""

# $ id_paper: chr "68" "69"

# $ id_ref  : chr "" "1118192,189,1088975,971271,832272"

# $ abstract: chr "" "From the Book: This book will teach you how to write 

#                     specifications of computer systems, using the language TLA+."

edited Nov 26 '18 at 22:46

answered Nov 26 '18 at 21:01

AkselA

4,51421325

New and improved

text.n <- strsplit(text, "n(?=#\*)", perl=TRUE)[[1]]; text.n



text.s <- lapply(text.n, function(x) strsplit(x, "n")[[1]])



patterns <- list(title="^#\*", 

                autors="^#@",

                  year="^#t",

                 revue="^#c",

              id_paper="^#index",

                id_ref="^#%",

              abstract="^#!")



tex.l <- lapply(text.s, function(x)

  lapply(patterns, function(y)

    paste(sub(y, "", grep(y, x, value=TRUE)), collapse=",")

  )

) 



tex.m <- matrix(unlist(tex.l), ncol=length(tex.l[[1]]), byrow=TRUE)

tex.df <- as.data.frame(tex.m, stringsAsFactors=FALSE)

colnames(tex.df) <- names(patterns)



str(tex.df)



# 'data.frame': 2 obs. of  7 variables:

# $ title   : chr "TeX: The Program" "Foundations of Databases."

# $ autors  : chr "Donald E. Knuth" "Serge Abiteboul,Richard Hull,Victor Vianu"

# $ year    : chr "1986" "1995"

# $ revue   : chr "" ""

# $ id_paper: chr "68" "69"

# $ id_ref  : chr "" "1118192,189,1088975,971271,832272"

# $ abstract: chr "" "From the Book: This book will teach you how to write 

#                     specifications of computer systems, using the language TLA+."

edited Nov 26 '18 at 22:46

answered Nov 26 '18 at 21:01

AkselA

4,51421325

edited Nov 26 '18 at 22:46

answered Nov 26 '18 at 21:01

AkselA

4,51421325

answered Nov 26 '18 at 21:01

AkselA

4,51421325

answered Nov 26 '18 at 21:01

AkselA

4,51421325

You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16

The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20

@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34

1

@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34

1

@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55

|
show 3 more comments

You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16

The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20

@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34

1

@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34

1

@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55

You could use list() or paste0(..., collapse = ", ") to concatenate multiple elements and store them as a single entry.

– Manuel Bickel
Nov 26 '18 at 21:16

The size of the data frame will be that of the title. Since each article necessarily has a title.

– cincinnatus
Nov 26 '18 at 21:20

@ManuelBickel: But then we'd just end up with a vector.

– AkselA
Nov 26 '18 at 21:34

@ManuelBickel: Thanks, but I already figured out a way.

– AkselA
Nov 26 '18 at 22:34

@ManuelBickel: No trouble, just had to pause and look at it anew. Thanks for the regex pattern, what I had was a bit less than optimal.

– AkselA
Nov 26 '18 at 22:55

|
show 3 more comments

Here a solution based on the answer of @AkselA. I could not deal with this only in comments, therefore, an additional answer (I know I could have formatted it more nicely...)

# split into individual docs: break at every newline that is followed by "#*"
text.s = strsplit(text, "\\n(?=#\\*)", perl = TRUE)[[1]]

# function to extract information from individual docs
# (x is a character vector holding the lines of one doc)
extract_info = function(x, patterns = list(title="^#\\*",
                                           authors="^#@",
                                           year="^#t",
                                           revue="^#c",
                                           id_paper="^#index",
                                           id_ref="^#%",
                                           abstract="^#!")) {
  lapply(patterns, function(p) {
    extract = grep(p, x, value = TRUE)
    # here you check the length of the potential output
    # and modify the type according to your needs
    if (length(extract) > 1) {
      # several matches (e.g. multiple references): keep them together in one cell
      extract = list(extract)
    } else if (length(extract) == 0) {
      # tag missing in this doc
      extract = NA
    }
    return(extract)
  })
}

# apply the function to the data
# and rbind the results (this yields a list matrix, one row per doc, not a data.frame)
do.call(rbind,
        lapply(text.s, function(x) {
          x = strsplit(x, "\n")[[1]]
          extract_info(x)
        })
)

#      title                         authors                                       year     revue id_paper   id_ref
# [1,] "#*TeX: The Program"          "#@Donald E. Knuth"                           "#t1986" "#c"  "#index68" NA
# [2,] "#*Foundations of Databases." "#@Serge Abiteboul,Richard Hull,Victor Vianu" "#t1995" "#c"  "#index69" List,1
#      abstract
# [1,] NA
# [2,] "#!From the Book: This book will teach you how to write specifications of computer systems, using th" [truncated]
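If a plain data.frame with one string per cell is preferred (as in the expected output of the question), the multiple matches can be collapsed with paste0(..., collapse = ", ") instead of being wrapped in a list. The following is only a minimal sketch under that assumption: the helper name extract_flat, the shortened sample input, and the variable names are made up for illustration and are not part of the original answers.

```r
# Shortened sample input in the question's format (one string, docs start with "#*").
text <- paste(
  "#*TeX: The Program",
  "#@Donald E. Knuth",
  "#t1986",
  "#c",
  "#index68",
  "",
  "#*Foundations of Databases.",
  "#@Serge Abiteboul,Richard Hull,Victor Vianu",
  "#t1995",
  "#c",
  "#index69",
  "#%1118192",
  "#%189",
  "#!From the Book: This book will teach you how to write specifications of computer systems, using the language TLA+.",
  sep = "\n"
)

# One regex per output column; names become the column names.
patterns <- c(title = "^#\\*", authors = "^#@", year = "^#t", revue = "^#c",
              id_paper = "^#index", id_ref = "^#%", abstract = "^#!")

# Split into docs at each newline that is followed by "#*".
records <- strsplit(text, "\\n(?=#\\*)", perl = TRUE)[[1]]

# Hypothetical helper: returns exactly one string per field,
# NA when the tag is absent, matches joined by ", " when there are several.
extract_flat <- function(record, patterns) {
  lines <- strsplit(record, "\n", fixed = TRUE)[[1]]
  vapply(patterns, function(p) {
    hit <- grep(p, lines, value = TRUE)
    if (length(hit) == 0) NA_character_ else paste0(hit, collapse = ", ")
  }, character(1))
}

rows <- lapply(records, extract_flat, patterns = patterns)
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

Because every cell is a single string, df is an ordinary character data.frame; the trade-off is that id_ref has to be split again (for example with strsplit(df$id_ref, ", ")) whenever the individual references are needed.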

answered Nov 26 '18 at 22:28

Manuel Bickel

Thank you very much, your answer is correct, but I could only accept one solution. Thank you, you are a genius.

– cincinnatus
Nov 27 '18 at 5:18
