How To Clean Up Data With R


Working with data almost always means working with bad data: the wrong format, missing values, inconsistent units of measure, typos, you name it. I've heard it said many times that data scientists, the high priests of machine learning and statistics, spend a majority of their time cleaning up bad data sets.

We software developers can reach for all sorts of general-purpose programming languages to solve pretty much any data transformation problem. The average business person, however, has Excel, elbow grease, and maybe some macros if they're crafty. As the volume of data in our world continues to explode, the ability of business people to deal with it independently becomes increasingly valuable.

At CodeMash I saw a great talk by Matthew Renze about "Exploratory Data Analysis with R" (he also has a corresponding Pluralsight course on the topic which I highly recommend). The whole time I was watching this talk I was thinking "all business people should learn this right now!"

R is a special-purpose programming language for statistics. It has lots of great capabilities and libraries for fancy things like linear regression, data visualization, machine learning, k-means clustering and so on. It also happens to be pretty great for basic data manipulation tasks. Here's a simple example based on Renze's talk...

The Bad Data

Let's say we work at a large mobile carrier and have some sales data to clean up:

Bad Data

The problems are...

- Some rows are missing values entirely.
- The Battery column stores its numbers as text with an " hrs" suffix, so you can't do math on it.
- The Revenue column mixes units of measure: some values are in thousands ("K"), others in millions ("M").
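To make that concrete, here's a hypothetical sample of what phones.bad.csv might look like. The Revenue and Battery formats are inferred from the script below; the Phone column name and all of the row values are made up for illustration.

Phone,Revenue,Battery
Alpha 5,$1.2M,22 hrs
Beta X,$800K,18 hrs
Gamma 9,,
Delta 2,$2.4M,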

The Script

Here's how you can solve all these problems with a basic R script.

# set working directory
setwd("C:/code/R/blog")

# load the data, treating empty strings as missing values (NA)
phones <- read.csv(
  file = "phones.bad.csv",
  header = TRUE,
  na.strings = c("")
)

# PROBLEM: some rows are missing values entirely
# drop any row containing an NA
phones <- na.omit(phones)

# PROBLEM: the Battery column has an " hrs" suffix, which is bad for math
# fix it by converting the column to a character type
battery <- as.character(phones$Battery)
# then removing the " hrs"
battery <- sub(" hrs", "", battery)
# then converting what's left back to an int
phones$Battery <- as.integer(battery)

# PROBLEM: inconsistent units of measure ($K vs. $M)
# write a function to normalize everything to millions
convertRevenueToMillions <- function(rev)
{
  stringRev <- as.character(rev)
  # strip the $, K, and M characters, leaving only the number
  revNumStrOnly <- gsub("[$KM]", "", stringRev)
  numericRev <- as.numeric(revNumStrOnly)
  if (grepl("M", rev)) {
    numericRev
  } else if (grepl("K", rev)) {
    numericRev * 0.001
  } else {
    # no recognizable unit: return NA so the problem is visible
    NA_real_
  }
}
# then apply it to every value in the column
phones$Revenue <- sapply(phones$Revenue, convertRevenueToMillions)
# let's be a good citizen and rename the Revenue column to reflect its units
names(phones)[names(phones) == "Revenue"] <- "RevenueMillions"

# ALL DONE! write the cleaned data back out to a new file
write.csv(
  phones,
  "phones.fixed.csv",
  row.names = FALSE
)
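Before celebrating, it's worth a quick sanity check with base R's standard inspection functions; something like this confirms the fixes took:

# Battery should now be an integer and RevenueMillions a numeric
str(phones)
# eyeball the first few rows
head(phones)
# every revenue value should now be on the same (millions) scale
summary(phones$RevenueMillions)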

The Output

And presto, here's your fixed data...

Fixed Data
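Using the hypothetical sample from earlier, the cleaned phones.fixed.csv would come out like this: the two incomplete rows dropped, Battery reduced to a plain integer, and revenue normalized to millions.

"Phone","RevenueMillions","Battery"
"Alpha 5",1.2,22
"Beta X",0.8,18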

Conclusion

R can be a very useful tool for cleaning up data because it's optimized for working with data in vector (array) and frame (table) structures. Reading and writing data is straightforward, as is applying functions to whole data sets. I didn't touch on any of the analytical features, which are what R is primarily known for.
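For a taste of that vector-oriented style outside this example, here's a tiny hypothetical snippet; operations apply to entire columns at once, with no explicit loops:

# vectors: one expression operates on every element
battery_hours <- c(22, 18, 30)
battery_hours / 24                 # convert each value from hours to days

# data frame columns are vectors too, so the same idea applies
specs <- data.frame(model = c("A", "B"), price = c(199, 249))
specs$price <- specs$price * 0.9   # apply a 10% discount to every row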

R is built for statisticians by statisticians, so it has a bit of a learning curve. But if you frequently find yourself fighting bad data in Excel, you should seriously consider checking it out.
