Btukfyl

Question

I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories. There are 4 categories: in total values, up to 17, 18-64 and 65+. One sheet contains 4 quarters so basically 48 unique variables and the column with country names. One excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from a excel file where the name is "sick-leave blue collar workers". I have also two other files: "sick-leave worker" and "sick-leave self-employment". The goal is combine all three files and create a file with a structure like in the RESULT DATA. Could you please help me?

INPUT DATA: enter image description here

RESULT DATA: enter image description here

Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09 — Nov 23 '18 at 14:09

score 2 · Accepted Answer · 2018-12-03 00:16:16Z

Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.

# download Excel file from github repository



sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"

destinationFile <- "./soQuestion53446800.xlsx"

download.file(sourceFile,destinationFile,mode="wb")





library(readxl)

library(tidyr)



# set constants 

typeOfLeave <- "sick"

group <- "self employed"



# read date and extract the value

theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]



# setup column names using underscore so we can separate key column into Sex and Age columns 

theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")

theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)



# use tidyr / dplyr to transform the data

theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData



# assign constants



tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$date <- theDate



tidyData

...and the output:

> tidyData

# A tibble: 30 x 7

   Country    Sex   Age   Amount typeOfLeave group         date               

   <chr>      <chr> <chr>  <dbl> <chr>       <chr>         <dttm>             

 1 Total      both  all   151708 sick        self employed 2016-03-31 00:00:00

 2 Afganistan both  all      269 sick        self employed 2016-03-31 00:00:00

 3 Albania    both  all      129 sick        self employed 2016-03-31 00:00:00

 4 Algeria    both  all      308 sick        self employed 2016-03-31 00:00:00

 5 Andora     both  all      815 sick        self employed 2016-03-31 00:00:00

 6 Total      women all    49919 sick        self employed 2016-03-31 00:00:00

 7 Afganistan women all      104 sick        self employed 2016-03-31 00:00:00

 8 Albania    women all       30 sick        self employed 2016-03-31 00:00:00

 9 Algeria    women all       18 sick        self employed 2016-03-31 00:00:00

10 Andora     women all      197 sick        self employed 2016-03-31 00:00:00

# ... with 20 more rows

Key elements in the solution

Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.

In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:

Reading Excel files via exact cell references with readxl::read_excel()

Reading a single cell into a constant

Setting column names for ease of use with tidyr::separate()

Restructuring to narrow format Tidy Data

Assigning constants

1. Reading exact cell references

The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.

readxl::read_excel() enables reading of exact cell references with the range= argument.

2. Reading a constant from one cell

If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.

Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.

theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]

3. Setting column names for ease of use with `tidyr::separate()`

One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.

The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.

To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.

    theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")

4. Restructuring the data to narrow format Tidy Data

The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.

theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData

Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.

5. Assigning constants

The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.

Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.

tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$date <- theDate

Additional considerations

If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().

Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.

The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.

UPDATE: reading multiple quarters from a specific worksheet

On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.

Specify a worksheet with the sheet= parameter

Add _Q1 to distinguish each quarter's read, and save the quarter as a key variable

Set worksheet names to years

The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.

# download file from github to make script completely reproducible



sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"

destinationFile <- "./soQuestion53446800.xlsx"

download.file(sourceFile,destinationFile,mode="wb")



# set constants 

typeOfLeave <- "sick"

group <- "self employed"

year <- "2018"



# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns 

# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names 

sex <- rep(c("both","women","men"),16)

age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)

quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))

data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData

theCols <- unlist(c("Country",columnsData["excelColNames"]))



theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)



# use tidyr / dplyr to transform the data

theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData



# assign constants



tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$year <- year



tidyData

...and the output, reading from the 2018 sheet in the workbook.

> tidyData

# A tibble: 240 x 8

   Country    Sex   Age   Quarter Amount typeOfLeave group         year 

   <chr>      <chr> <chr> <chr>    <dbl> <chr>       <chr>         <chr>

 1 Total      both  all   Q1        2100 sick        self employed 2018 

 2 Afganistan both  all   Q1        2100 sick        self employed 2018 

 3 Albania    both  all   Q1        2100 sick        self employed 2018 

 4 Algeria    both  all   Q1        2100 sick        self employed 2018 

 5 Andora     both  all   Q1        2100 sick        self employed 2018 

 6 Total      women all   Q1         900 sick        self employed 2018 

 7 Afganistan women all   Q1         900 sick        self employed 2018 

 8 Albania    women all   Q1         900 sick        self employed 2018 

 9 Algeria    women all   Q1         900 sick        self employed 2018 

10 Andora     women all   Q1         900 sick        self employed 2018 

# ... with 230 more rows

>

If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.

# read second worksheet to illustrate multiple reads 



# set constants 

typeOfLeave <- "sick"

group <- "self employed"

year <- "2017"



theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)



# use tidyr / dplyr to transform the data

theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData



# assign constants



tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$year <- year



tidyData

...and the output:

> tidyData

# A tibble: 240 x 8

   Country    Sex   Age   Quarter Amount typeOfLeave group         year 

   <chr>      <chr> <chr> <chr>    <dbl> <chr>       <chr>         <chr>

 1 Total      both  all   Q1       33000 sick        self employed 2017 

 2 Afganistan both  all   Q1       33000 sick        self employed 2017 

 3 Albania    both  all   Q1       33000 sick        self employed 2017 

 4 Algeria    both  all   Q1       33000 sick        self employed 2017 

 5 Andora     both  all   Q1       33000 sick        self employed 2017 

 6 Total      women all   Q1       15000 sick        self employed 2017 

 7 Afganistan women all   Q1       15000 sick        self employed 2017 

 8 Albania    women all   Q1       15000 sick        self employed 2017 

 9 Algeria    women all   Q1       15000 sick        self employed 2017 

10 Andora     women all   Q1       15000 sick        self employed 2017 

# ... with 230 more rows

>

Pulling it all together...

At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().

## version that combines multiple years into a single narrow format tidy data file

# download file from github to make script completely reproducible



sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"

destinationFile <- "./soQuestion53446800.xlsx"

download.file(sourceFile,destinationFile,mode="wb")





library(readxl)

library(tidyr)



# set constants

years <- c("2017","2018")

typeOfLeave <- "sick"

group <- "self employed"



# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns 

# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names 

sex <- rep(c("both","women","men"),16)

age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)

quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))

data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData

theCols <- unlist(c("Country",columnsData["excelColNames"]))





lapply(years,function(x){

  theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)



  # use tidyr / dplyr to transform the data

  theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData



  # assign constants



  tidyData$typeOfLeave <- typeOfLeave

  tidyData$group <- group

  tidyData$year <- x



  tidyData

}) %>% do.call(rbind,.) -> combinedData

...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.

> head(combinedData)

# A tibble: 6 x 8

  Country    Sex   Age   Quarter Amount typeOfLeave group         year 

  <chr>      <chr> <chr> <chr>    <dbl> <chr>       <chr>         <chr>

1 Total      both  all   Q1       33000 sick        self employed 2017 

2 Afganistan both  all   Q1       33000 sick        self employed 2017 

3 Albania    both  all   Q1       33000 sick        self employed 2017 

4 Algeria    both  all   Q1       33000 sick        self employed 2017 

5 Andora     both  all   Q1       33000 sick        self employed 2017 

6 Total      women all   Q1       15000 sick        self employed 2017 

> tail(combinedData)

# A tibble: 6 x 8

  Country    Sex   Age         Quarter Amount typeOfLeave group         year 

  <chr>      <chr> <chr>       <chr>    <dbl> <chr>       <chr>         <chr>

1 Andora     women 65 and over Q4        2300 sick        self employed 2018 

2 Total      men   65 and over Q4        2400 sick        self employed 2018 

3 Afganistan men   65 and over Q4        2400 sick        self employed 2018 

4 Albania    men   65 and over Q4        2400 sick        self employed 2018 

5 Algeria    men   65 and over Q4        2400 sick        self employed 2018 

6 Andora     men   65 and over Q4        2400 sick        self employed 2018 

>

I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k — Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly. — Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow. — Nov 27 '18 at 17:29
I accepted your answer. Many thanks for helping me to understand the problem! — Dec 3 '18 at 10:58

score 2 · Accepted Answer · 2018-12-03 00:16:16Z

Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.

# download Excel file from github repository



sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"

destinationFile <- "./soQuestion53446800.xlsx"

download.file(sourceFile,destinationFile,mode="wb")





library(readxl)

library(tidyr)



# set constants 

typeOfLeave <- "sick"

group <- "self employed"



# read date and extract the value

theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]



# setup column names using underscore so we can separate key column into Sex and Age columns 

theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")

theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)



# use tidyr / dplyr to transform the data

theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData



# assign constants



tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$date <- theDate



tidyData

...and the output:

> tidyData

# A tibble: 30 x 7

   Country    Sex   Age   Amount typeOfLeave group         date               

   <chr>      <chr> <chr>  <dbl> <chr>       <chr>         <dttm>             

 1 Total      both  all   151708 sick        self employed 2016-03-31 00:00:00

 2 Afganistan both  all      269 sick        self employed 2016-03-31 00:00:00

 3 Albania    both  all      129 sick        self employed 2016-03-31 00:00:00

 4 Algeria    both  all      308 sick        self employed 2016-03-31 00:00:00

 5 Andora     both  all      815 sick        self employed 2016-03-31 00:00:00

 6 Total      women all    49919 sick        self employed 2016-03-31 00:00:00

 7 Afganistan women all      104 sick        self employed 2016-03-31 00:00:00

 8 Albania    women all       30 sick        self employed 2016-03-31 00:00:00

 9 Algeria    women all       18 sick        self employed 2016-03-31 00:00:00

10 Andora     women all      197 sick        self employed 2016-03-31 00:00:00

# ... with 20 more rows

Key elements in the solution

Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.

In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:

Reading Excel files via exact cell references with readxl::read_excel()

Reading a single cell into a constant

Setting column names for ease of use with tidyr::separate()

Restructuring to narrow format Tidy Data

Assigning constants

1. Reading exact cell references

The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.

readxl::read_excel() enables reading of exact cell references with the range= argument.

2. Reading a constant from one cell

If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.

Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.

theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]

3. Setting column names for ease of use with `tidyr::separate()`

One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.

The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.

To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.

    theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")

4. Restructuring the data to narrow format Tidy Data

The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.

theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData

Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.

5. Assigning constants

The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.

Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.

tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$date <- theDate

Additional considerations

If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().

Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.

The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.

UPDATE: reading multiple quarters from a specific worksheet

On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.

Specify a worksheet with the sheet= parameter

Add _Q1 to distinguish each quarter's read, and save the quarter as a key variable

Set worksheet names to years

The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.

# download file from github to make script completely reproducible



sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"

destinationFile <- "./soQuestion53446800.xlsx"

download.file(sourceFile,destinationFile,mode="wb")



# set constants 

typeOfLeave <- "sick"

group <- "self employed"

year <- "2018"



# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns 

# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names 

sex <- rep(c("both","women","men"),16)

age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)

quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))

data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData

theCols <- unlist(c("Country",columnsData["excelColNames"]))



theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)



# use tidyr / dplyr to transform the data

theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData



# assign constants



tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$year <- year



tidyData

...and the output, reading from the 2018 sheet in the workbook.

> tidyData

# A tibble: 240 x 8

   Country    Sex   Age   Quarter Amount typeOfLeave group         year 

   <chr>      <chr> <chr> <chr>    <dbl> <chr>       <chr>         <chr>

 1 Total      both  all   Q1        2100 sick        self employed 2018 

 2 Afganistan both  all   Q1        2100 sick        self employed 2018 

 3 Albania    both  all   Q1        2100 sick        self employed 2018 

 4 Algeria    both  all   Q1        2100 sick        self employed 2018 

 5 Andora     both  all   Q1        2100 sick        self employed 2018 

 6 Total      women all   Q1         900 sick        self employed 2018 

 7 Afganistan women all   Q1         900 sick        self employed 2018 

 8 Albania    women all   Q1         900 sick        self employed 2018 

 9 Algeria    women all   Q1         900 sick        self employed 2018 

10 Andora     women all   Q1         900 sick        self employed 2018 

# ... with 230 more rows

>

If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.

# read second worksheet to illustrate multiple reads 



# set constants 

typeOfLeave <- "sick"

group <- "self employed"

year <- "2017"



theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)



# use tidyr / dplyr to transform the data

theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData



# assign constants



tidyData$typeOfLeave <- typeOfLeave

tidyData$group <- group

tidyData$year <- year



tidyData

...and the output:

> tidyData

# A tibble: 240 x 8

   Country    Sex   Age   Quarter Amount typeOfLeave group         year 

   <chr>      <chr> <chr> <chr>    <dbl> <chr>       <chr>         <chr>

 1 Total      both  all   Q1       33000 sick        self employed 2017 

 2 Afganistan both  all   Q1       33000 sick        self employed 2017 

 3 Albania    both  all   Q1       33000 sick        self employed 2017 

 4 Algeria    both  all   Q1       33000 sick        self employed 2017 

 5 Andora     both  all   Q1       33000 sick        self employed 2017 

 6 Total      women all   Q1       15000 sick        self employed 2017 

 7 Afganistan women all   Q1       15000 sick        self employed 2017 

 8 Albania    women all   Q1       15000 sick        self employed 2017 

 9 Algeria    women all   Q1       15000 sick        self employed 2017 

10 Andora     women all   Q1       15000 sick        self employed 2017 

# ... with 230 more rows

>

Pulling it all together...

At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().

## version that combines multiple years into a single narrow format tidy data file

# download file from github to make script completely reproducible



sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"

destinationFile <- "./soQuestion53446800.xlsx"

download.file(sourceFile,destinationFile,mode="wb")





library(readxl)

library(tidyr)



# set constants

years <- c("2017","2018")

typeOfLeave <- "sick"

group <- "self employed"



# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns 

# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names 

sex <- rep(c("both","women","men"),16)

age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)

quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))

data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData

theCols <- unlist(c("Country",columnsData["excelColNames"]))





lapply(years,function(x){

  theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)



  # use tidyr / dplyr to transform the data

  theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData



  # assign constants



  tidyData$typeOfLeave <- typeOfLeave

  tidyData$group <- group

  tidyData$year <- x



  tidyData

}) %>% do.call(rbind,.) -> combinedData

...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.

> head(combinedData)

# A tibble: 6 x 8

  Country    Sex   Age   Quarter Amount typeOfLeave group         year 

  <chr>      <chr> <chr> <chr>    <dbl> <chr>       <chr>         <chr>

1 Total      both  all   Q1       33000 sick        self employed 2017 

2 Afganistan both  all   Q1       33000 sick        self employed 2017 

3 Albania    both  all   Q1       33000 sick        self employed 2017 

4 Algeria    both  all   Q1       33000 sick        self employed 2017 

5 Andora     both  all   Q1       33000 sick        self employed 2017 

6 Total      women all   Q1       15000 sick        self employed 2017 

> tail(combinedData)

# A tibble: 6 x 8

  Country    Sex   Age         Quarter Amount typeOfLeave group         year 

  <chr>      <chr> <chr>       <chr>    <dbl> <chr>       <chr>         <chr>

1 Andora     women 65 and over Q4        2300 sick        self employed 2018 

2 Total      men   65 and over Q4        2400 sick        self employed 2018 

3 Afganistan men   65 and over Q4        2400 sick        self employed 2018 

4 Albania    men   65 and over Q4        2400 sick        self employed 2018 

5 Algeria    men   65 and over Q4        2400 sick        self employed 2018 

6 Andora     men   65 and over Q4        2400 sick        self employed 2018 

>

I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k — Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly. — Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow. — Nov 27 '18 at 17:29
I accepted your answer. Many thanks for helping me to understand the problem! — Dec 3 '18 at 10:58

R - read Excel file and switch variables into observations

1 Answer 1

Key elements in the solution

1. Reading exact cell references

2. Reading a constant from one cell

3. Setting column names for ease of use with tidyr::separate()

4. Restructuring the data to narrow format Tidy Data

5. Assigning constants

Additional considerations

UPDATE: reading multiple quarters from a specific worksheet

Pulling it all together...

Your Answer

Sign up or log in

Post as a guest

Post as a guest

1 Answer 1

1 Answer 1

Key elements in the solution

1. Reading exact cell references

2. Reading a constant from one cell

3. Setting column names for ease of use with tidyr::separate()

4. Restructuring the data to narrow format Tidy Data

5. Assigning constants

Additional considerations

UPDATE: reading multiple quarters from a specific worksheet

Pulling it all together...

Key elements in the solution

1. Reading exact cell references

2. Reading a constant from one cell

3. Setting column names for ease of use with tidyr::separate()

4. Restructuring the data to narrow format Tidy Data

5. Assigning constants

Additional considerations

UPDATE: reading multiple quarters from a specific worksheet

Pulling it all together...

Key elements in the solution

1. Reading exact cell references

2. Reading a constant from one cell

3. Setting column names for ease of use with tidyr::separate()

4. Restructuring the data to narrow format Tidy Data

5. Assigning constants

Additional considerations

UPDATE: reading multiple quarters from a specific worksheet

Pulling it all together...

Key elements in the solution

1. Reading exact cell references

2. Reading a constant from one cell

3. Setting column names for ease of use with tidyr::separate()

4. Restructuring the data to narrow format Tidy Data

5. Assigning constants

Additional considerations

UPDATE: reading multiple quarters from a specific worksheet

Pulling it all together...

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

Futebolista

F# list compare

Jornalista

1 Answer
1

3. Setting column names for ease of use with `tidyr::separate()`

1 Answer
1

1 Answer
1

3. Setting column names for ease of use with `tidyr::separate()`

3. Setting column names for ease of use with `tidyr::separate()`

3. Setting column names for ease of use with `tidyr::separate()`

3. Setting column names for ease of use with `tidyr::separate()`