Sunday, June 17, 2018

Using R: consistent treatment of data in the lab


Typically, geochemical data are treated rather poorly: the work often starts with a clean Excel spreadsheet and ends with a mess of calculations and plots. In this post I share what I have learned about manipulating data in R using tidyverse packages. In particular, I focus on keeping the data clean, the columns in good shape, and the treatment of the data consistent.

For example, over the last four years I have been working in a stable isotope lab and have produced a lot of measurements. Having many measurements compiled into data sets is great; the hard part is treating them all consistently.

The main problem is the small day-to-day drift in the data, which is monitored by analyzing standards. For that reason, each session includes a set of standards along with the unknowns. The difference between the measured and nominal values of the standards is then subtracted from the rest of the data within the same analytical session. Doing this by hand in Excel for every analytical session quickly becomes impractical.

TASK: adjust the values of the unknowns based on the standards measured during each analytical session. Each session has a different date, which makes it easy to automate the correction.

DETAILS: a standard named UOG is measured for d18O within each analytical session. It should have a value of 6.52 per mil, but deviations of 0.5 - 1 per mil occur. The difference (excess or deficit) between the measured value of UOG and 6.52 should be applied to the data set from that particular day.
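In other words, for every sample measured in a given session, the correction boils down to this bit of arithmetic (just a sketch of the idea, using the numbers defined above):

# per-session correction, written out as a formula
# offset  = mean(d18O of UOG in that session) - 6.52
# d18Ocor = d18O - offset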

LOGICAL APPROACH: compile all the data in a single file, with each column being a separate variable. We need at least three of them: date of the analytical session, sample name, and measured d18O value. I will then instruct the machine to average the standards within each session and apply the difference between their nominal and average values to the rest of that session's data.
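To make that layout concrete, here is a tiny table in this shape, built from the first few rows you will see in the real data further down:

# a toy example of the expected layout, just for illustration
toy_data <- tibble::tribble(
  ~Date,        ~Sample, ~d18O,
  "2014/08/28", "UOG",    6.64,
  "2014/08/28", "WO-36",  6.21,
  "2014/08/29", "UOG",    6.69
)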

COMPUTING APPROACH: provided below. The instructions were written as an RMarkdown document; with some basic familiarity with R, they should be easy to follow.




normalizer_UOG
Let’s start by loading the needed packages (if any of them are missing, install them first with install.packages()). Each of these packages brings a set of useful functions: dplyr and tidyr handle the data manipulation, ggplot2 and gridExtra provide the plotting tools, and readxl takes care of importing Excel files. We will see what ggrepel has to offer later!
library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(readxl)
library(ggrepel)
Now, let’s load our spreadsheet with the measurements. Put the Excel file in the same folder as your R file, then click Session -> Set Working Directory -> To Source File Location in RStudio.
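If you prefer doing that in code instead of clicking through menus, setwd() achieves the same thing (the path below is only a placeholder; point it to wherever your file actually lives):

# equivalent to the menu click above; replace the placeholder path with your own
setwd("~/path/to/your/project")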
Load the data from Excel files using the function read_excel() from the readxl package. My Excel file is called ‘2014_2017_18O_David.xlsx’
raw_data1 <- read_excel('2014_2017_18O_David.xlsx')
Now raw_data1 is the main object with all of our data. We can check what’s inside.
raw_data1
## # A tibble: 49 x 4
##    `Time Code`         `FileHeader: Fil~ `d 18O/16O  Mea~ `d 18O/16O  Std~
##    <chr>               <chr>                        <dbl>            <dbl>
##  1 2014/08/28 18:18:34 20212_UOG                     6.64            0.049
##  2 2014/08/28 18:25:23 20206_WO-36                   6.21            0.004
##  3 2014/08/28 18:41:21 20207_WO-39                   6.25            0.036
##  4 2014/08/28 18:56:21 20208_WO-48                   5.47            0.04 
##  5 2014/08/28 19:10:49 20209_14ML-04                 9.77            0.019
##  6 2014/08/28 19:56:32 20210_WO-11                   5.08            0.041
##  7 2014/08/28 20:11:50 20211_WO-7                    5.60            0.031
##  8 2014/08/29 11:13:45 20213_UOG                     6.69            0.026
##  9 2014/08/29 11:29:52 20214_WO-50                   6.01            0.041
## 10 2014/08/29 11:44:36 20215_WO-45                   5.86            0.027
## # ... with 39 more rows
We should see something very similar to the Excel spreadsheet view. A few things to notice: date and time are lumped together, the analysis number is glued to the sample name, and the column names are ugly and don’t even fit in the window. For example, ‘d 18O/16O  Mean’ could simply be called d18O.
Notice there is a total of 49 measurements. It could be any number; once the algorithm is created, you can apply it to any amount of data. That’s what we’re after: consistency.
Now the tidyr and dplyr magic kicks in. Let’s take our raw_data1 and process it with a chain of functions using the pipe operator %>%. The object on the left of the pipe is passed to the function on its right.
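If the pipe is new to you, these two lines are equivalent; the pipe simply takes the object on its left and feeds it in as the first argument of the function on its right:

# two ways of writing the same thing
round(sqrt(16), digits = 1)
16 %>% sqrt() %>% round(digits = 1)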
raw_data1<- raw_data1 %>% separate("Time Code", c("Date", "Time"), sep =" ") %>% 
            separate("FileHeader: Filename", c("Analysis", "Sample"),sep="_") %>%
            rename(d18O= "d 18O/16O  Mean", d18Osd="d 18O/16O  Std Dev")
Now let’s see what we have done with the original data.
raw_data1
## # A tibble: 49 x 6
##    Date       Time     Analysis Sample   d18O d18Osd
##    <chr>      <chr>    <chr>    <chr>   <dbl>  <dbl>
##  1 2014/08/28 18:18:34 20212    UOG      6.64  0.049
##  2 2014/08/28 18:25:23 20206    WO-36    6.21  0.004
##  3 2014/08/28 18:41:21 20207    WO-39    6.25  0.036
##  4 2014/08/28 18:56:21 20208    WO-48    5.47  0.04 
##  5 2014/08/28 19:10:49 20209    14ML-04  9.77  0.019
##  6 2014/08/28 19:56:32 20210    WO-11    5.08  0.041
##  7 2014/08/28 20:11:50 20211    WO-7     5.60  0.031
##  8 2014/08/29 11:13:45 20213    UOG      6.69  0.026
##  9 2014/08/29 11:29:52 20214    WO-50    6.01  0.041
## 10 2014/08/29 11:44:36 20215    WO-45    5.86  0.027
## # ... with 39 more rows
Sweet. We separated Date from Time: they had a space between them, so we told R to use " " (a space) as the separator. The same goes for the ugly "FileHeader: Filename" column, where the separator is "_".
Renaming was easy too with the rename function: first tell R what you would like the column to be called, then point to the existing column to rename. So we wrote d18O = "d 18O/16O  Mean", and the same for the Std Dev (standard deviation) column.
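One optional extra: the new Date column is still plain text. If you ever want R to treat it as a real date (for instance so a plot orders the sessions chronologically), you could convert it; this assumes the dates keep the year/month/day format shown above and is not required for the rest of the post:

# optional: convert the character dates into Date objects
# (assumes the "2014/08/28" style format seen in the tibble above)
raw_data1 <- raw_data1 %>% mutate(Date = as.Date(Date, format = "%Y/%m/%d"))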
Now we need to extract the standards from the data set, i.e. separate the samples from the standards. I am using the function filter in combination with grepl. The latter looks for entries containing "UOG" in the Sample column, while filter keeps only the rows that match that condition.
samples_raw<-filter(raw_data1, !grepl("UOG", Sample)) 

UOG_raw<-  filter(raw_data1, grepl("UOG", Sample))
By the way, !grepl is the same as saying ‘does not contain’.
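If you like the stringr package, str_detect() does the same job and some people find it easier to read; this is just an alternative to the grepl approach, not something the rest of the post relies on:

# an equivalent way to split standards from samples using stringr
library(stringr)
samples_raw <- filter(raw_data1, !str_detect(Sample, "UOG"))
UOG_raw     <- filter(raw_data1,  str_detect(Sample, "UOG"))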
The UOG_raw looks like
UOG_raw
## # A tibble: 10 x 6
##    Date       Time     Analysis Sample  d18O d18Osd
##    <chr>      <chr>    <chr>    <chr>  <dbl>  <dbl>
##  1 2014/08/28 18:18:34 20212    UOG     6.64  0.049
##  2 2014/08/29 11:13:45 20213    UOG     6.69  0.026
##  3 2014/08/29 12:42:54 20218    UOG     6.84  0.041
##  4 2015/03/11 10:49:30 21577    UOG     6.88  0.022
##  5 2015/03/11 12:20:38 21581    UOG     7.05  0.019
##  6 2015/03/11 14:47:20 21585    UOG     7.28  0.042
##  7 2017/12/15 10:37:33 30155    UOG     7.06  0.047
##  8 2017/12/15 10:53:59 30156    UOG     6.94  0.016
##  9 2017/12/15 14:16:51 30164    UOG     5.69  0.017
## 10 2017/12/15 18:19:40 30177    UOG     5.92  0.022
I’d like to see the standards in graphical form, using ggplot2 to plot Date (x-axis) against the d18O value.
ggplot(data=UOG_raw, aes(Date, d18O))+
  geom_point()+
  geom_text_repel(aes(label=Analysis))+
  theme_bw()

OK! Note that I included the analysis number using geom_text_repel from the ggrepel package, so individual analyses can be tracked down. Pretty cool, eh? Now we can look at the plot and the table and make some observations.
To be honest, the d18O scatter in the standards is bigger than I would like. The last two UOG analyses in the 2017/12/15 session look abnormally low. Say I choose to exclude them because they were contaminated by low-d18O samples run just before them. Excluding standards requires strong justification; here I do it for the sake of the example.
UOG_raw<-filter(UOG_raw, d18O>6)
UOG_raw
## # A tibble: 8 x 6
##   Date       Time     Analysis Sample  d18O d18Osd
##   <chr>      <chr>    <chr>    <chr>  <dbl>  <dbl>
## 1 2014/08/28 18:18:34 20212    UOG     6.64  0.049
## 2 2014/08/29 11:13:45 20213    UOG     6.69  0.026
## 3 2014/08/29 12:42:54 20218    UOG     6.84  0.041
## 4 2015/03/11 10:49:30 21577    UOG     6.88  0.022
## 5 2015/03/11 12:20:38 21581    UOG     7.05  0.019
## 6 2015/03/11 14:47:20 21585    UOG     7.28  0.042
## 7 2017/12/15 10:37:33 30155    UOG     7.06  0.047
## 8 2017/12/15 10:53:59 30156    UOG     6.94  0.016
Using filter again!
Now I need to take the average value of standards for each analytical session.
UOG_raw<- UOG_raw %>%  group_by(Date) %>% summarize(d18Omean=mean(d18O),
                                                    sd=sd(d18O))
UOG_raw
## # A tibble: 4 x 3
##   Date       d18Omean       sd
##   <chr>         <dbl>    <dbl>
## 1 2014/08/28     6.64 NaN     
## 2 2014/08/29     6.76   0.105 
## 3 2015/03/11     7.07   0.203 
## 4 2017/12/15     7.00   0.0905
The function group_by tells R to apply the subsequent functions to blocks of data that share the same value of a variable, the Date column in this case.
The function summarize then computes basic summary statistics for each of those groups, which is exactly what we need here; sd stands for standard deviation.
Now we have a nice summary of what UOG looked like in each session. Naturally, where UOG was analyzed only once there is no spread to compute, hence the NaN in the sd column.
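As a small variation, the summarize call above could also record how many UOG analyses went into each mean, which makes the single-measurement sessions obvious at a glance:

# a variation of the summarize chunk above, run on the filtered (not yet summarized)
# UOG_raw, that also counts how many standards went into each session mean
UOG_raw %>% group_by(Date) %>% summarize(d18Omean = mean(d18O),
                                         sd = sd(d18O),
                                         n = n())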
The offset created by factors beyond our control is defined as the difference between the mean UOG value and the nominal 6.52 within each analytical session. The group_by is not strictly necessary here, but it can be handy in other cases. The mutate line creates a new column with the offset calculated according to that formula, and select keeps just that column (the grouping variable Date is kept automatically, which is what the message below is telling us).
adjust_offset<-UOG_raw %>%  group_by(Date) %>%
  mutate(offset=d18Omean-6.52) %>% select(offset)
## Adding missing grouping variables: `Date`
adjust_offset
## # A tibble: 4 x 2
## # Groups:   Date [4]
##   Date       offset
##   <chr>       <dbl>
## 1 2014/08/28  0.122
## 2 2014/08/29  0.241
## 3 2015/03/11  0.549
## 4 2017/12/15  0.479
OK! Now apply this offset to the rest of the data set. The punch line: it has to be done by Date!
corrected_data<-inner_join(samples_raw, adjust_offset) %>%
                      mutate(d18Ocor=d18O-offset)
## Joining, by = "Date"
corrected_data
## # A tibble: 39 x 8
##    Date       Time     Analysis Sample   d18O d18Osd offset d18Ocor
##    <chr>      <chr>    <chr>    <chr>   <dbl>  <dbl>  <dbl>   <dbl>
##  1 2014/08/28 18:25:23 20206    WO-36    6.21  0.004  0.122    6.09
##  2 2014/08/28 18:41:21 20207    WO-39    6.25  0.036  0.122    6.13
##  3 2014/08/28 18:56:21 20208    WO-48    5.47  0.04   0.122    5.34
##  4 2014/08/28 19:10:49 20209    14ML-04  9.77  0.019  0.122    9.65
##  5 2014/08/28 19:56:32 20210    WO-11    5.08  0.041  0.122    4.95
##  6 2014/08/28 20:11:50 20211    WO-7     5.60  0.031  0.122    5.48
##  7 2014/08/29 11:29:52 20214    WO-50    6.01  0.041  0.241    5.77
##  8 2014/08/29 11:44:36 20215    WO-45    5.86  0.027  0.241    5.62
##  9 2014/08/29 12:02:40 20216    WO-49    5.79  0.038  0.241    5.55
## 10 2014/08/29 12:27:42 20217    WO-49    6.32  0.025  0.241    6.07
## # ... with 29 more rows
The inner_join merged the sample measurements with the offsets, matching them by Date.
The function mutate then creates a new column with the corrected d18O values, called d18Ocor here. That’s all we wanted.
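If you would rather not rely on the ‘Joining, by = "Date"’ message and spell out the matching column yourself, inner_join takes an explicit by argument; the result is identical:

# the same join, with the matching column stated explicitly
corrected_data <- inner_join(samples_raw, adjust_offset, by = "Date") %>%
                      mutate(d18Ocor = d18O - offset)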
It’s always good practice to look at the data plotted before and after the correction. Remember, you can always add geom_text_repel(aes(label=Sample)) from ggrepel to track down the sample name behind each data point.
p1<-ggplot(data=samples_raw, aes(Date, d18O))+
  geom_point(color='grey')+
  ggtitle("raw data")+
  theme_bw()

p2<-ggplot(data=corrected_data, aes(Date, d18Ocor))+
  geom_point()+
  ggtitle("corrected data")+
  theme_bw()

grid.arrange(p1,p2, ncol=1)

It’s time to get our data out of R and into the real world. In my experience, the easiest way is to export it to a .csv file.
write.csv(corrected_data, '2014_2017_18O_David.csv')
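One small thing to be aware of: write.csv adds a column of row numbers by default. If you don’t want that extra column in the exported file, switch it off with row.names:

# same export, but without the extra column of row numbers
write.csv(corrected_data, '2014_2017_18O_David.csv', row.names = FALSE)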
