
«Best practices» Convert the incoming hot mess

The wry joke is that data science is only 10% of the sexy stuff like machine learning—the other 90% is euphemistically called "data cleaning."

Sometimes you have to open up the manhole and mix it up. ©2014, Richard Careaga, all rights reserved


«Scare quotes» explained

These are used to highlight an occurrence of modern, weak-ass, passive-aggressive CYA corporate bullshit in the tradition of "mistakes were made." I class it with other preemptive blame deflection, such as «reach out» in the sense of passing the buck.

Question presented

The wry joke is that data science is only 10% of the sexy stuff like machine learning; the other 90% is euphemistically called "data cleaning." Among the species of untamed data roaming the wild is one born of the ancient tradition of user friendliness: spreadsheet data, a weapon of mass data destruction. How to resurrect the small portion of real information from the hot mess of decoration, inconsistent formatting, cut-and-paste errors, and the like is the subject of this short take.

How to approach

The original sin is csv files that derive from spreadsheets, which are vulnerable to entry errors and untrapped logic. Using a relational database management system (RDBMS) for data entry provides an opportunity for validation, preventing at least the grosser errors such as entering the literal string $1,023.10 in a numeric field. On the other hand, beyond user training on data entry, an RDBMS needs hosting that may not fit comfortably on existing servers, and sooner or later a database administrator will be needed to keep the system running. In the meantime, a desktop-scale RDBMS may provide a quick way of separating the easily salvaged from the lost causes.
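
As a concrete illustration of that validation, here is a minimal sketch using SQLite as the desktop-scale RDBMS, through R's DBI and RSQLite packages; the table, columns, and file name are invented for the example. A CHECK constraint on type is enough to bounce the decorated string back to whoever typed it:

```r
# Minimal sketch: desktop-scale validation with SQLite via DBI/RSQLite.
# The table, column names, and file name here are hypothetical.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "intake.sqlite")

# Declare the amount column NUMERIC and add a CHECK so that a literal string
# such as "$1,023.10" is rejected at entry rather than discovered downstream.
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS receipts (
    id     INTEGER PRIMARY KEY,
    vendor TEXT NOT NULL,
    amount NUMERIC NOT NULL CHECK (typeof(amount) IN ('integer', 'real'))
  )
")

# A clean row goes in; the decorated string raises an error instead of
# silently being stored as text.
dbExecute(con, "INSERT INTO receipts (vendor, amount) VALUES (?, ?)",
          params = list("Acme", 1023.10))
try(dbExecute(con, "INSERT INTO receipts (vendor, amount) VALUES (?, ?)",
              params = list("Acme", "$1,023.10")))

dbDisconnect(con)
```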

Coming back to Earth, the obvious candidate is a combination of git and scripting.

  • Does file system data show any change to the csv as to date and file size? If not, skip (a sketch of this check follows the list).
  • What are the differences? Display the diff to decide whether to run the whole data-wrangling suite or, if the differences are few and trivial, to fix them by hand (a correction to a handful of data entry points, for example).
  • Can the carcass be thrown back to the guilty party? (That's really more of a political question.)
  • Does any statistical test planned lose an appreciable degree of power if the data is tossed?
  • Is hand curation cost-effective?
  • Version control the Rds for recovery, assuming there is some way to detect inadvertent corruption.
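
A minimal sketch of the first and last points, in R with hypothetical file names: skip the run when the csv's size and modification time are unchanged, and record an md5 fingerprint of the saved Rds so later corruption is detectable. The diff itself is most easily inspected with git diff on the committed csv.

```r
# Minimal sketch of change detection plus Rds fingerprinting.
# File names are hypothetical stand-ins.
csv_path <- "intake.csv"
log_path <- "intake_log.rds"

current  <- file.info(csv_path)[, c("size", "mtime")]
previous <- if (file.exists(log_path)) readRDS(log_path) else NULL

if (!is.null(previous) &&
    identical(previous$size, current$size) &&
    identical(previous$mtime, current$mtime)) {
  message("No change to ", csv_path, "; skipping the wrangling suite.")
} else {
  # Run the wrangling suite here, then persist the result and its fingerprint.
  wrangled <- read.csv(csv_path)          # stand-in for the real pipeline
  saveRDS(wrangled, "wrangled.rds")
  current$md5 <- unname(tools::md5sum("wrangled.rds"))
  saveRDS(current, log_path)
}

# Later, detect inadvertent corruption of the Rds before trusting it:
stopifnot(unname(tools::md5sum("wrangled.rds")) == readRDS(log_path)$md5)
```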

The most difficult approach, because it requires the least mechanical work but the most skull sweat, is to break the target object into pieces that isolate the stable and the volatile parts of the source data. Those pieces would be candidates for persistent storage in an RDBMS, which would be far less of a lift to implement than the full data-entry system described above. Keeping the same big object as the current object namespace is attractive, but the uses to be made of the big object may require that everything be included, just not necessarily all at once. That comes back to the design of the data store: arranging sub-objects by what changes often and what doesn't, and by what is usually handled together and what is usually handled in isolation. The principle is to avoid unnecessary work disaggregating a big object while easing the work of assembling the desired output from standardized parts.
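
One way to picture that split in R, as a minimal sketch with hypothetical objects and file names: keep the slow-moving reference data and the fast-moving measurements as separate Rds files, and build the big object only when a particular use calls for it.

```r
# Minimal sketch of the split-and-reassemble idea; objects and file names
# are hypothetical. Stable reference data and volatile measurements live in
# separate Rds files and are combined only on demand.
sites    <- data.frame(site_id = 1:3,
                       name    = c("North", "South", "East"))   # rarely changes
readings <- data.frame(site_id = sample(1:3, 10, replace = TRUE),
                       value   = rnorm(10))                     # changes often

saveRDS(sites,    "sites.rds")     # touched only when the reference data changes
saveRDS(readings, "readings.rds")  # rewritten on every intake run

# Assemble the "big object" on demand instead of persisting it whole.
assemble <- function() {
  merge(readRDS("readings.rds"), readRDS("sites.rds"), by = "site_id")
}
head(assemble())
```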
