R - read Excel file and switch variables into observations
I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories. There are 4 categories: in total values, up to 17, 18-64 and 65+. One sheet contains 4 quarters so basically 48 unique variables and the column with country names. One excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from a excel file where the name is "sick-leave blue collar workers". I have also two other files: "sick-leave worker" and "sick-leave self-employment". The goal is combine all three files and create a file with a structure like in the RESULT DATA. Could you please help me?
INPUT DATA:
RESULT DATA: 
r excel tidyr
add a comment |
I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories. There are 4 categories: in total values, up to 17, 18-64 and 65+. One sheet contains 4 quarters so basically 48 unique variables and the column with country names. One excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from a excel file where the name is "sick-leave blue collar workers". I have also two other files: "sick-leave worker" and "sick-leave self-employment". The goal is combine all three files and create a file with a structure like in the RESULT DATA. Could you please help me?
INPUT DATA:
RESULT DATA: 
r excel tidyr
Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09
add a comment |
I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories. There are 4 categories: in total values, up to 17, 18-64 and 65+. One sheet contains 4 quarters so basically 48 unique variables and the column with country names. One excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from a excel file where the name is "sick-leave blue collar workers". I have also two other files: "sick-leave worker" and "sick-leave self-employment". The goal is combine all three files and create a file with a structure like in the RESULT DATA. Could you please help me?
INPUT DATA:
RESULT DATA: 
r excel tidyr
I am struggling with some data transformation. I have some huge xlsx files with insurance data. The data is structured a little bit like a "pyramid". The first line represents the quarter in which the survey took place. The next line is a breakdown by age categories. There are 4 categories: in total values, up to 17, 18-64 and 65+. One sheet contains 4 quarters so basically 48 unique variables and the column with country names. One excel file contains 3 sheets (2016, 2017 and 2018). The screenshot (INPUT DATA) comes from a excel file where the name is "sick-leave blue collar workers". I have also two other files: "sick-leave worker" and "sick-leave self-employment". The goal is combine all three files and create a file with a structure like in the RESULT DATA. Could you please help me?
INPUT DATA:
RESULT DATA: 
r excel tidyr
r excel tidyr
edited Nov 29 '18 at 20:14
asked Nov 23 '18 at 12:33
Michel
376
376
Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09
add a comment |
Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09
Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09
Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09
add a comment |
1 Answer
1
active
oldest
votes
Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read date and extract the value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
- Reading Excel files via exact cell references with
readxl::read_excel()
- Reading a single cell into a constant
- Setting column names for ease of use with
tidyr::separate()
- Restructuring to narrow format Tidy Data
- Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.
Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
- Specify a worksheet with the
sheet=parameter - Add
_Q1to distinguish each quarter's read, and save the quarter as a key variable - Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
1
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
|
show 4 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53446800%2fr-read-excel-file-and-switch-variables-into-observations%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read date and extract the value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
- Reading Excel files via exact cell references with
readxl::read_excel()
- Reading a single cell into a constant
- Setting column names for ease of use with
tidyr::separate()
- Restructuring to narrow format Tidy Data
- Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.
Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
- Specify a worksheet with the
sheet=parameter - Add
_Q1to distinguish each quarter's read, and save the quarter as a key variable - Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
1
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
|
show 4 more comments
Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read date and extract the value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
- Reading Excel files via exact cell references with
readxl::read_excel()
- Reading a single cell into a constant
- Setting column names for ease of use with
tidyr::separate()
- Restructuring to narrow format Tidy Data
- Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.
Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
- Specify a worksheet with the
sheet=parameter - Add
_Q1to distinguish each quarter's read, and save the quarter as a key variable - Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
1
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
|
show 4 more comments
Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read date and extract the value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
- Reading Excel files via exact cell references with
readxl::read_excel()
- Reading a single cell into a constant
- Setting column names for ease of use with
tidyr::separate()
- Restructuring to narrow format Tidy Data
- Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.
Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
- Specify a worksheet with the
sheet=parameter - Add
_Q1to distinguish each quarter's read, and save the quarter as a key variable - Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
Here is a solution that uses the readxl and tidyr packages from the Tidyverse. To make the script reproducible, I created an Excel version of the OP screen capture and saved it to my stackoverflowAnswers github repository. The script downloads the Excel file, reads it, and converts it to Tidy Data format.
# download Excel file from github repository
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
typeOfLeave <- "sick"
group <- "self employed"
# read date and extract the value
theDate <- read_excel(destinationFile,range="A2:A2",col_names=FALSE)[[1]]
# setup column names using underscore so we can separate key column into Sex and Age columns
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
theData <- read_excel(destinationFile,range="A5:G9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
tidyData
...and the output:
> tidyData
# A tibble: 30 x 7
Country Sex Age Amount typeOfLeave group date
<chr> <chr> <chr> <dbl> <chr> <chr> <dttm>
1 Total both all 151708 sick self employed 2016-03-31 00:00:00
2 Afganistan both all 269 sick self employed 2016-03-31 00:00:00
3 Albania both all 129 sick self employed 2016-03-31 00:00:00
4 Algeria both all 308 sick self employed 2016-03-31 00:00:00
5 Andora both all 815 sick self employed 2016-03-31 00:00:00
6 Total women all 49919 sick self employed 2016-03-31 00:00:00
7 Afganistan women all 104 sick self employed 2016-03-31 00:00:00
8 Albania women all 30 sick self employed 2016-03-31 00:00:00
9 Algeria women all 18 sick self employed 2016-03-31 00:00:00
10 Andora women all 197 sick self employed 2016-03-31 00:00:00
# ... with 20 more rows
Key elements in the solution
Microsoft Excel is frequently used as a data entry and reporting tool, which leads people to structure their spreadsheets in hierarchical table formats like the one illustrated in the OP. This format makes the data difficult to use in R, because the column names represent combinations of information that is rendered hierarchically in table headers within the spreadsheet.
In this section we'll explain some of the key design elements in the solution to the problem posed in the OP, including:
- Reading Excel files via exact cell references with
readxl::read_excel()
- Reading a single cell into a constant
- Setting column names for ease of use with
tidyr::separate()
- Restructuring to narrow format Tidy Data
- Assigning constants
1. Reading exact cell references
The OP question notes that there is a heading row containing a date for all the cells in a particular table. To simulate this in the sample spreadsheet I used to replicate the screen shot in the OP, I assigned the date of March 31, 2016 to cell A2 of Sheet 1 in an Excel workbook.
readxl::read_excel() enables reading of exact cell references with the range= argument.
2. Reading a constant from one cell
If we set the range= argument to a single cell and extract the cell with the [[ form of the extract operator, the resulting object is a single element vector instead of a data frame. This makes it possible to use vector recycling to assign this value to the tidy data frame later in the R script.
Since everything in R is an object, we can use the [[ extract operator on the result of read_excel() to assign the result to theDate.
theDate <- read_excel(theXLSX,range="A2:A2",col_names=FALSE)[[1]]
3. Setting column names for ease of use with tidyr::separate()
One of the characteristics that makes the original spreadsheet messy as opposed to Tidy Data is the fact that each column of data represents a combination of Sex and Age values.
The desired output data frame includes columns for both Sex and Age, and therefore we need a way to extract this information from the column names. The tidyr package provides a function to support this technique, the separate() function.
To facilitate use of this function, we assign column names with an underscore separator to distinguish the Sex and Age components in the column names.
theCols <- c("Country","both_all","women_all","men_all","both_up to 17","women_up to 17","men_up to 17")
4. Restructuring the data to narrow format Tidy Data
The key step in the script is a sequence of Tidyverse functions that takes the data frame read with read_excel(), uses tidyr::gather() on columns 2 - 7 to create one row per unique combination of Country, Sex, and Age, and then splits the resulting key column into the Sex and Age columns.
theData %>% gather(.,key="key",value="Amount",2:7) %>% separate(.,key,into=c("Sex","Age"),sep="_") -> tidyData
Data left of the underscore is assigned to the Sex column, and right of the underscore is assigned to Age. Note that the OP doesn't specify how the totals should be handled in the output. Since total doesn't make sense as a value for Sex, I used Both in its place. Similarly, for Age I assigned total as All.
5. Assigning constants
The OP does not explain where the constants sick and group are sourced, so I assigned them as constants at the start of the program. If these are included in the hierarchical part of the spreadsheet, they can easily be read using the technique I used to extract the date from the spreadsheet.
Once the data is in tidy format, we add the remaining constants via the assignment operator, taking advantage of vector recycling in R.
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$date <- theDate
Additional considerations
If the total values are not required in the output data frame, they can easily be eliminated by using the extract operator on the tidy data, or dropping columns from the messy data frame prior to using gather().
Note that I chose to leave the totals in the output data frame because almost all of the data in the screen capture represented totals of one form or another (i.e. only 2 of the 30 cells of data in the OP screen capture were not totals), and eliminating this data would make it difficult to confirm that the script worked correctly.
The solution can be extended to cover age categories referenced in the OP but not illustrated in the spreadsheet by adding appropriate column names to theCols vector, and by changing the range= argument in the read_excel() function that reads the bulk of the spreadsheet.
UPDATE: reading multiple quarters from a specific worksheet
On November 29th the original poster modified the question to explain that there were multiple worksheets in the Excel file, one for each year. This is easily handled with the following modifications.
- Specify a worksheet with the
sheet=parameter - Add
_Q1to distinguish each quarter's read, and save the quarter as a key variable - Set worksheet names to years
The resulting tidy data will have year and quarter columns. Note that I updated my Excel workbook with dummy data so worksheets representing different years have different data so the results are distinguishable.
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2018"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output, reading from the 2018 sheet in the workbook.
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 2100 sick self employed 2018
2 Afganistan both all Q1 2100 sick self employed 2018
3 Albania both all Q1 2100 sick self employed 2018
4 Algeria both all Q1 2100 sick self employed 2018
5 Andora both all Q1 2100 sick self employed 2018
6 Total women all Q1 900 sick self employed 2018
7 Afganistan women all Q1 900 sick self employed 2018
8 Albania women all Q1 900 sick self employed 2018
9 Algeria women all Q1 900 sick self employed 2018
10 Andora women all Q1 900 sick self employed 2018
# ... with 230 more rows
>
If we change the configuration parameters we can read the 2017 data from the workbook I posted to Github.
# read second worksheet to illustrate multiple reads
# set constants
typeOfLeave <- "sick"
group <- "self employed"
year <- "2017"
theData <- read_excel(destinationFile,sheet=year,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- year
tidyData
...and the output:
> tidyData
# A tibble: 240 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
7 Afganistan women all Q1 15000 sick self employed 2017
8 Albania women all Q1 15000 sick self employed 2017
9 Algeria women all Q1 15000 sick self employed 2017
10 Andora women all Q1 15000 sick self employed 2017
# ... with 230 more rows
>
Pulling it all together...
At this point we have built the basic ideas into a script that completely reads one worksheet. If we modify the code slightly and incorporate a function such as lapply(), we can start with a vector of worksheet names, read the files, convert them to tidy data format, and combine the files into a single tidy data set with do.call() and rbind().
## version that combines multiple years into a single narrow format tidy data file
# download file from github to make script completely reproducible
sourceFile <- "https://raw.githubusercontent.com/lgreski/stackoverflowanswers/master/data/soQuestion53446800.xlsx"
destinationFile <- "./soQuestion53446800.xlsx"
download.file(sourceFile,destinationFile,mode="wb")
library(readxl)
library(tidyr)
# set constants
years <- c("2017","2018")
typeOfLeave <- "sick"
group <- "self employed"
# setup column names using underscore so we can separate key column into Sex, Age, and Quarter columns
# after using rep() to build data with required repeating patterns, avoiding manual typing of all the column names
sex <- rep(c("both","women","men"),16)
age <- rep(c(rep("all",3),rep("up to 17",3),rep("18 to 64",3),rep("65 and over",3)),4)
quarter <- c(rep("Q1",12),rep("Q2",12),rep("Q3",12),rep("Q4",12))
data.frame(sex,age,quarter) %>% unite(excelColNames) -> columnsData
theCols <- unlist(c("Country",columnsData["excelColNames"]))
lapply(years,function(x){
theData <- read_excel(destinationFile,sheet=x,range="A5:AW9",col_names=theCols)
# use tidyr / dplyr to transform the data
theData %>% gather(.,key="key",value="Amount",2:49) %>% separate(.,key,into=c("Sex","Age","Quarter"),sep="_") -> tidyData
# assign constants
tidyData$typeOfLeave <- typeOfLeave
tidyData$group <- group
tidyData$year <- x
tidyData
}) %>% do.call(rbind,.) -> combinedData
...and the output, demonstrating that the combinedData data frame includes data from both 2017 and 2018 worksheets.
> head(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Total both all Q1 33000 sick self employed 2017
2 Afganistan both all Q1 33000 sick self employed 2017
3 Albania both all Q1 33000 sick self employed 2017
4 Algeria both all Q1 33000 sick self employed 2017
5 Andora both all Q1 33000 sick self employed 2017
6 Total women all Q1 15000 sick self employed 2017
> tail(combinedData)
# A tibble: 6 x 8
Country Sex Age Quarter Amount typeOfLeave group year
<chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 Andora women 65 and over Q4 2300 sick self employed 2018
2 Total men 65 and over Q4 2400 sick self employed 2018
3 Afganistan men 65 and over Q4 2400 sick self employed 2018
4 Albania men 65 and over Q4 2400 sick self employed 2018
5 Algeria men 65 and over Q4 2400 sick self employed 2018
6 Andora men 65 and over Q4 2400 sick self employed 2018
>
edited Dec 3 '18 at 0:16
answered Nov 23 '18 at 15:15
Len Greski
3,1201421
3,1201421
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
1
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
|
show 4 more comments
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
1
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
Wow! Your answer is just simply amazing, many thanks!
– Michel
Nov 26 '18 at 19:15
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
I should probably mention that in one sheet I have 49 variables. The screenshot presents just one of 4 quarters. The variable theDates should store a list of 4 dates. I will try tomorrow swiftly create a unique list of variables name and maybe use gather, will it work? screen: imgur.com/a/KtEsZ9k
– Michel
Nov 26 '18 at 22:24
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
Are subsequent quarters in rows below the first table, or are they in columns to the right of the data in the screenshot? Also, you’ll need to use a different technique than assigning theDates as a vector of four different dates. Update your question to more accurately describe the structure of data in the worksheet and I can update my answer to read it correctly.
– Len Greski
Nov 27 '18 at 4:10
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
The subsequent quarters are right of the data in the screenshot. I will update more details tomorrow.
– Michel
Nov 27 '18 at 17:29
1
1
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
I accepted your answer. Many thanks for helping me to understand the problem!
– Michel
Dec 3 '18 at 10:58
|
show 4 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53446800%2fr-read-excel-file-and-switch-variables-into-observations%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Where is the information on type of leave and group stored?
– Len Greski
Nov 23 '18 at 14:09