Geochemical data often gets treated rather poorly: it starts as a clean Excel spreadsheet and ends as a mess of calculations and plots. In this post I share what I learned about manipulating data in R with tidyverse packages. In particular, I focus on keeping the data clean, keeping the columns in good shape, and treating the data consistently.
For example, over the last 4 years I have been working in a stable isotope lab and have produced a lot of measurements. Having many measurements compiled into data sets is great; the hard part is treating the data consistently.
The main problem is the small day-to-day deviations in the data, which are monitored by analyzing standards. Because of that, each session includes a set of standards along with the unknowns. The difference between the measured and nominal value of the standards is subtracted from the rest of the data within the same analytical session. Having to do this for every analytical session makes Excel nearly unusable.
TASK: adjust the values of unknowns based on the standards measured during different analytical sessions. Each session has a different date, which helps to automate the procedure.
DETAILS: a standard named UOG is measured for d18O within each analytical session. It should have a value of 6.52 per mil, but deviations of 0.5 - 1 per mil occur. The difference (excess or deficiency) between the measured value of UOG and 6.52 should be applied to the data set from that particular day.
LOGICAL APPROACH: compile all the data in a single file, with each column a separate variable - at minimum the date of the analytical session, the sample name, and the measured d18O value. I will instruct the machine how to average the standards within each session and apply the difference between the nominal and average value to the rest of the data.
COMPUTING APPROACH: provided below. The instructions were created as an RMarkdown document; with some basic familiarity with R this should be easy to follow.
Let's start with installing and loading the needed packages. Each of these packages provides a set of important functions: dplyr and tidyr handle the data manipulation, ggplot2 and gridExtra contain the plotting tools, and readxl is responsible for importing Excel files. We will see what ggrepel has to offer later!
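If any of these packages are missing on your machine, a one-time install.packages() call takes care of it. This is just a setup hint, not part of the original workflow:

# one-time installation of the packages used below
install.packages(c("tidyr", "dplyr", "ggplot2", "gridExtra", "readxl", "ggrepel"))

After that, loading them looks like this: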
library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(readxl)
library(ggrepel)
Now, let's load our spreadsheet with measurements. Put the Excel file in the same folder as your R file, then click Session -> Set Working Directory -> To Source File Location. Load the data from the Excel file using the function read_excel() from the readxl package. My Excel file is called '2014_2017_18O_David.xlsx'.
raw_data1<- read_excel('2014_2017_18O_David.xlsx')
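(An aside, not in the original post: if you prefer code to menu clicks, the working-directory step can also be done programmatically. The path below is just a placeholder - use your own folder.)

# alternative to the menu route: point R at the folder holding the spreadsheet
setwd("path/to/your/data/folder")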
Now raw_data1 is the main object with all of our data. We can check what's inside.

raw_data1
## # A tibble: 49 x 4
## `Time Code` `FileHeader: Fil~ `d 18O/16O Mea~ `d 18O/16O Std~
## <chr> <chr> <dbl> <dbl>
## 1 2014/08/28 18:18:34 20212_UOG 6.64 0.049
## 2 2014/08/28 18:25:23 20206_WO-36 6.21 0.004
## 3 2014/08/28 18:41:21 20207_WO-39 6.25 0.036
## 4 2014/08/28 18:56:21 20208_WO-48 5.47 0.04
## 5 2014/08/28 19:10:49 20209_14ML-04 9.77 0.019
## 6 2014/08/28 19:56:32 20210_WO-11 5.08 0.041
## 7 2014/08/28 20:11:50 20211_WO-7 5.60 0.031
## 8 2014/08/29 11:13:45 20213_UOG 6.69 0.026
## 9 2014/08/29 11:29:52 20214_WO-50 6.01 0.041
## 10 2014/08/29 11:44:36 20215_WO-45 5.86 0.027
## # ... with 39 more rows
We should see something similar to the Excel spreadsheet view. A few things to notice: date and time are mixed together, the analysis number is mixed with the sample name, and the columns have ugly names that can't even fit in the window - for example, 'd 18O/16O Mean', which can be plainly called d18O. Also notice there are a total of 49 measurements. It could be any number; once the algorithm is created, you can apply it to any amount of data. That's what we're after - consistency.
Now the tidyr and dplyr magic kicks in. Let's take our raw_data1 and process it with a chain of functions using the pipe operator %>%. The object on the left is passed to the function on the right side of the pipe.
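If the pipe is new to you, here is a tiny throwaway illustration, not part of the workflow: the two lines below do exactly the same thing.

mean(c(1, 2, 3))        # the classic nested call
c(1, 2, 3) %>% mean()   # the same call written with the pipe

Now, back to the real data: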
raw_data1<- raw_data1 %>%
  separate("Time Code", c("Date", "Time"), sep=" ") %>%                     # split date from time
  separate("FileHeader: Filename", c("Analysis", "Sample"), sep="_") %>%    # split analysis number from sample name
  rename(d18O= "d 18O/16O Mean", d18Osd="d 18O/16O Std Dev")                # shorter column names
Now let's see what we have done with the original data.

raw_data1
## # A tibble: 49 x 6
## Date Time Analysis Sample d18O d18Osd
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2014/08/28 18:18:34 20212 UOG 6.64 0.049
## 2 2014/08/28 18:25:23 20206 WO-36 6.21 0.004
## 3 2014/08/28 18:41:21 20207 WO-39 6.25 0.036
## 4 2014/08/28 18:56:21 20208 WO-48 5.47 0.04
## 5 2014/08/28 19:10:49 20209 14ML-04 9.77 0.019
## 6 2014/08/28 19:56:32 20210 WO-11 5.08 0.041
## 7 2014/08/28 20:11:50 20211 WO-7 5.60 0.031
## 8 2014/08/29 11:13:45 20213 UOG 6.69 0.026
## 9 2014/08/29 11:29:52 20214 WO-50 6.01 0.041
## 10 2014/08/29 11:44:36 20215 WO-45 5.86 0.027
## # ... with 39 more rows
Sweet. We separated Date from Time. They had a space between them, so we had to tell R to use " " (a space) as the separator. Same for the ugly "FileHeader: Filename", with the separator being "_". Renaming was easy too, using the rename function: first tell R what you would like the column to be named, then indicate which existing column to rename. So we said d18O= "d 18O/16O Mean", and the same for the Std Dev (standard deviation) column.

Now we need to extract the standards from the data set, i.e. separate the samples from the standards. I am using the function filter in combination with grepl. The latter looks for cases containing "UOG" in the column Sample, and filter keeps only the rows that match the condition.

samples_raw<-filter(raw_data1, !grepl("UOG", Sample))
UOG_raw<- filter(raw_data1, grepl("UOG", Sample))

Oh, and !grepl is equal to saying 'does not contain'. The UOG_raw looks like
UOG_raw
## # A tibble: 10 x 6
## Date Time Analysis Sample d18O d18Osd
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2014/08/28 18:18:34 20212 UOG 6.64 0.049
## 2 2014/08/29 11:13:45 20213 UOG 6.69 0.026
## 3 2014/08/29 12:42:54 20218 UOG 6.84 0.041
## 4 2015/03/11 10:49:30 21577 UOG 6.88 0.022
## 5 2015/03/11 12:20:38 21581 UOG 7.05 0.019
## 6 2015/03/11 14:47:20 21585 UOG 7.28 0.042
## 7 2017/12/15 10:37:33 30155 UOG 7.06 0.047
## 8 2017/12/15 10:53:59 30156 UOG 6.94 0.016
## 9 2017/12/15 14:16:51 30164 UOG 5.69 0.017
## 10 2017/12/15 18:19:40 30177 UOG 5.92 0.022
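A small aside, not from the original post: if the standard name were ever typed in a different case (uog, Uog), grepl can be told to ignore case.

# hypothetical variant: match "UOG" regardless of capitalization
filter(raw_data1, grepl("UOG", Sample, ignore.case = TRUE))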
I'd like to see the standards in graphical form using ggplot2, plotting Date (x-axis) against the d18O value.

ggplot(data=UOG_raw, aes(Date, d18O))+
geom_point()+
geom_text_repel(aes(label=Analysis))+
theme_bw()
Ok! Note that I included the Analysis number using geom_text_repel from the ggrepel package to track down individual analyses. Pretty cool, eh? We can look at the plot and the table to make observations. To be honest, the d18O scatter in the standards is bigger than I wanted. It seems that the last two UOG analyses in the 2017/12/15 session are abnormally low. Say I choose to exclude them because they were contaminated by the low-d18O samples run just before them. Doing so requires strong justification; here I do it for the sake of example.
UOG_raw<-filter(UOG_raw, d18O>6)   # drop the two anomalously low UOG analyses
UOG_raw
## # A tibble: 8 x 6
## Date Time Analysis Sample d18O d18Osd
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2014/08/28 18:18:34 20212 UOG 6.64 0.049
## 2 2014/08/29 11:13:45 20213 UOG 6.69 0.026
## 3 2014/08/29 12:42:54 20218 UOG 6.84 0.041
## 4 2015/03/11 10:49:30 21577 UOG 6.88 0.022
## 5 2015/03/11 12:20:38 21581 UOG 7.05 0.019
## 6 2015/03/11 14:47:20 21585 UOG 7.28 0.042
## 7 2017/12/15 10:37:33 30155 UOG 7.06 0.047
## 8 2017/12/15 10:53:59 30156 UOG 6.94 0.016
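As an alternative to a value cutoff, the same two analyses could be dropped by their Analysis numbers, which makes the exclusion explicit. This is a hypothetical variant of the filter above, using the Analysis values from the earlier table:

# drop the two flagged analyses by number instead of by value (run instead of the filter above)
UOG_raw <- filter(UOG_raw, !(Analysis %in% c("30164", "30177")))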
Using filter again! Now I need to take the average value of the standards for each analytical session.
UOG_raw<- UOG_raw %>% group_by(Date) %>% summarize(d18Omean=mean(d18O),
sd=sd(d18O))
UOG_raw
## # A tibble: 4 x 3
## Date d18Omean sd
## <chr> <dbl> <dbl>
## 1 2014/08/28 6.64 NaN
## 2 2014/08/29 6.76 0.105
## 3 2015/03/11 7.07 0.203
## 4 2017/12/15 7.00 0.0905
The function group_by tells R to perform the following operations on blocks of data unified by the same parameter, the column Date in this case. The function summarize computes basic statistics for each of those groups; it works ideally here, and sd stands for standard deviation. Now we have a nice summary of what UOG looked like in each session. Obviously, when UOG was analyzed only once there is no standard deviation to compute, which results in NaN.
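A small variant, not in the original post: adding n = n() to the summarize call also records how many standards went into each mean, which makes the single-analysis sessions easy to spot. Note it would replace the summarize call above (not be run after it, since that call overwrites UOG_raw):

# variant of the summary above, with a count of standards per session
UOG_raw <- UOG_raw %>% group_by(Date) %>%
  summarize(d18Omean = mean(d18O), sd = sd(d18O), n = n())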
The offset created by factors beyond our control is defined as the difference between the UOG mean value and the nominal 6.52 within each analytical session. It doesn't have to be grouped by Date here, but grouping can be handy in other cases. The mutate line creates a new column with values calculated according to the formula, and select is used to keep just one column, the value of the offset.

adjust_offset<-UOG_raw %>% group_by(Date) %>%
  mutate(offset=d18Omean-6.52) %>% select(offset)
## Adding missing grouping variables: `Date`
adjust_offset
## # A tibble: 4 x 2
## # Groups: Date [4]
## Date offset
## <chr> <dbl>
## 1 2014/08/28 0.122
## 2 2014/08/29 0.241
## 3 2015/03/11 0.549
## 4 2017/12/15 0.479
OK! Now apply this offset to the rest of the data set. The punch line: it has to be done by Date!

corrected_data<-inner_join(samples_raw, adjust_offset) %>%
mutate(d18Ocor=d18O-offset)
## Joining, by = "Date"
corrected_data
## # A tibble: 39 x 8
## Date Time Analysis Sample d18O d18Osd offset d18Ocor
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2014/08/28 18:25:23 20206 WO-36 6.21 0.004 0.122 6.09
## 2 2014/08/28 18:41:21 20207 WO-39 6.25 0.036 0.122 6.13
## 3 2014/08/28 18:56:21 20208 WO-48 5.47 0.04 0.122 5.34
## 4 2014/08/28 19:10:49 20209 14ML-04 9.77 0.019 0.122 9.65
## 5 2014/08/28 19:56:32 20210 WO-11 5.08 0.041 0.122 4.95
## 6 2014/08/28 20:11:50 20211 WO-7 5.60 0.031 0.122 5.48
## 7 2014/08/29 11:29:52 20214 WO-50 6.01 0.041 0.241 5.77
## 8 2014/08/29 11:44:36 20215 WO-45 5.86 0.027 0.241 5.62
## 9 2014/08/29 12:02:40 20216 WO-49 5.79 0.038 0.241 5.55
## 10 2014/08/29 12:27:42 20217 WO-49 6.32 0.025 0.241 6.07
## # ... with 29 more rows
The inner_join merged the sample values and the offsets based on the Date. The function mutate creates a new column for the corrected d18O values, called d18Ocor here. That's all we wanted.

It's always good practice to plot the data before and after the correction. Remember, you can always add geom_text_repel(aes(label=Sample)) from ggrepel to track down the sample name of each data point.

p1<-ggplot(data=samples_raw, aes(Date, d18O))+
geom_point(color='grey')+
ggtitle("raw data")+
theme_bw()
p2<-ggplot(data=corrected_data, aes(Date, d18Ocor))+
geom_point()+
ggtitle("corrected data")+
theme_bw()
grid.arrange(p1,p2, ncol=1)
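Following up on the geom_text_repel hint above, a labelled version of the corrected-data panel could look like the sketch below; with many points the labels do get crowded.

# corrected data with sample names attached to each point
ggplot(data=corrected_data, aes(Date, d18Ocor))+
  geom_point()+
  geom_text_repel(aes(label=Sample))+
  ggtitle("corrected data, labelled")+
  theme_bw()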
It's time to get our data out of R and into the real world. In my experience, the easiest way is to export it to a .csv file.
write.csv(corrected_data, '2014_2017_18O_David.csv')
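One small tweak I find useful, though it is not in the original call above: write.csv adds a column of row numbers by default, which can be switched off with row.names = FALSE.

# same export, without the extra column of row numbers
write.csv(corrected_data, '2014_2017_18O_David.csv', row.names = FALSE)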