Home:ALL Converter>Improve code efficiency

Improve code efficiency

Ask Time:2015-10-30T23:39:54         Author:user2943039

Json Formatter

I've been working on a code that reads in all the sheets of an Excel workbook, where the first two columns in each sheet are "Date" and "Time", and the next two columns are either "Level" and "Temperature, or "LEVEL" and "TEMPERATURE". The code works, but I am working on improving my coding clarity and efficiency, so any advice in those regards would be greatly appreciated.

My function 1) reads in the data to a list of dataframes, 2) gets rid of any NA columns that were accidentally read in, 3) combines "Date" and "Time" to "DateTime" for each dataframe, 4) rounds "DateTime" to the nearest 5 minutes for each dataframe, 5) replaces "Date" and "Time" in each dataframe with "DateTime". I started getting more comfortable with lapply, but am wondering if I can improve the code efficiency at all instead of have so many lines with lapply.

library(readxl)
library(plyr)

  read_excel_allsheets <- function(filename) {
  sheets <- readxl::excel_sheets(filename)
  data <- lapply(sheets, function(X) readxl::read_excel(filename, sheet = X))
  names(data) <- sheets
  clean <- lapply(data, function(y) y[, colSums(is.na(y)) == 0])
  date <- lapply(clean, "[[", 1)
  time <- lapply(clean, "[[", 2)
  time <- lapply(time, function(z) format(z, format = "%H:%M"))
  datetime <- Map(paste, date, time)
  datetime <- lapply(datetime, function(a) as.POSIXct(a, format = "%Y-%m-%d %H:%M"))
  rounded <- lapply(datetime, function(b) as.POSIXlt(round(as.numeric(b)/(5*60))*(5*60),origin='1970-01-01'))
  addDateTime <- mapply(cbind, clean, "DateTime" = rounded, SIMPLIFY = F)
  final <- lapply(addDateTime, function(z) z[!(names(z) %in% c("Date", "Time"))])
  return(final)
}

Next, I would like to plot all of my data. So, I 1) run my code for a file, 2) combine the list of dataframes into one dataframe while maintaining an "ID" for each dataframe as a column, 3) combine the lowercase and uppercase versions of the variable columns, 4) add two new columns that split the "ID". Each ID is something like B1CC or B2CO, where I want to split the "ID" like so: "B1" and "CC". Now I can use ggplot very easily.

mysheets <- read_excel_allsheets(filename)
df = ldply(mysheets)
df$Temp <- rowSums(df[, c("Temperature", "TEMPERATURE")], na.rm = T)
df$Lev <- rowSums(df[, c("Level", "LEVEL")], na.rm = T)
df <- df[!names(df) %in% c("Level", "LEVEL", "Temperature", "TEMPERATURE")]

df$exp <- gsub("^[[:alnum:]]{2}", "\\1",df$.id)
df$plot <- gsub("[[:alnum:]]{2}$", "\\1", df$.id)

Here are the data for the first two dataframes, but there are over 50 of them, and each is relatively big, and there are many files to read. Therefore, I'm looking to improve efficiency (in terms of time to run) where I can. Any help or advice is greatly appreciated!

dput(head(x[[1]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121912, -2209121612, 
-2209121312, -2209121012, -2209120712, -2209120412), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), Level = c(106.9038, 106.9059, 106.89, 
106.9121, 106.8522, 106.8813), Temperature = c(6.176, 6.173, 
6.172, 6.168, 6.166, 6.165)), .Names = c("Date", "Time", "Level", 
"Temperature"), row.names = c(NA, 6L), class = c("tbl_df", "tbl", 
"data.frame"))

dput(head(x[[2]]))
structure(list(Date = structure(c(1305504000, 1305504000, 1305504000, 
1305504000, 1305504000, 1305504000), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Time = structure(c(-2209121988, -2209121688, 
-2209121388, -2209121088, -2209120788, -2209120488), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), LEVEL = c(117.5149, 117.511, 117.5031, 
117.5272, 117.4523, 117.4524), TEMPERATURE = c(5.661, 5.651, 
5.645, 5.644, 5.644, 5.645), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_), `NA` = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("Date", "Time", "LEVEL", 
"TEMPERATURE", NA, NA, NA, NA, NA), row.names = c(NA, 6L), class =    
c("tbl_df", "tbl", "data.frame"))

Author:user2943039,eproduced under the CC 4.0 BY-SA copyright license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/33439746/improve-code-efficiency
yy