«Scare quotes» explained
These are used to highlight an occurrence of modern, weak-ass, passive-aggressive CYA corporate bullshit in the tradition of "mistakes were made." I class them with other preemptive blame deflection, such as «reach out» in the sense of passing the buck.
Question presented
The wry joke is that data science is only 10% sexy stuff like machine learning; the other 90% is euphemistically called "data cleaning." Among the species of untamed data roaming the wild is spreadsheet data, a product of the ancient tradition of user friendliness and a weapon of mass data destruction. How to resurrect the tiny portion of information from the hot mess of decoration, inconsistent formatting, cut-and-paste errors and the like is the subject of this short take.
How to approach
The original sin is csv files that derive from spreadsheets, which are vulnerable to entry errors and untrapped logic. Using a relational database management system for data entry provides the opportunity for validation, at the least to prevent errors like entering the literal string $1,023.10 in a numeric field. On the other hand, in addition to user training on data entry, hosting is needed that may not fit comfortably on existing servers, and sooner or later a database administrator will be needed to keep the system running. In the meantime, a desktop-scale RDBMS may provide a quick way of separating the easily salvaged from the lost causes.
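
As a sketch of the kind of validation that comes free once data entry goes through an RDBMS, here is an example using DBI with RSQLite (assuming both packages are installed); the table and column names are hypothetical. A CHECK constraint turns the classic spreadsheet entry into a hard error:

```r
library(DBI)

# In-memory SQLite stands in for whatever desktop-scale RDBMS you pick.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table; the CHECK refuses anything that isn't stored
# as a number, so "$1,023.10" cannot sneak in as text.
dbExecute(con, "
  CREATE TABLE ledger (
    item   TEXT NOT NULL,
    amount NUMERIC CHECK (typeof(amount) IN ('integer', 'real'))
  )")

# Clean entry succeeds.
dbExecute(con, "INSERT INTO ledger VALUES ('widgets', 1023.10)")

# The spreadsheet-style entry is refused at the door.
try(dbExecute(con, "INSERT INTO ledger VALUES ('widgets', '$1,023.10')"))

dbDisconnect(con)
```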
Coming back to Earth, the obvious candidate is a combination of git
and scripting.
- Does the file system show any change to the csv in date or file size? If not, skip it (see the sketch after this list).
- What are the differences? Display the diff to decide whether to rerun the whole data-wrangling suite or whether the differences are few and trivial enough to fix by hand (a correction to a handful of data entry points, for example).
- Can the carcass be thrown back to the guilty party? (That's really more of a political question.)
- Does any planned statistical test lose an appreciable degree of power if the data is tossed?
- Is hand curation cost-effective?
- Version control the Rds for recovery, assuming there is some way to detect inadvertent corruption (a sketch follows the list).
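
A minimal sketch of the first two checks, assuming the csv lives in a git working tree with git on the PATH; the file names are hypothetical:

```r
# Hypothetical file names; adjust to taste.
raw   <- "data/survey.csv"
state <- "data/survey.state.rds"

# Step 1: has the file changed at all? Compare current size and
# modification time against values stashed on the last run.
info <- file.info(raw)[, c("size", "mtime")]
if (file.exists(state) && identical(readRDS(state), info)) {
  message("No change to ", raw, "; skipping.")
} else {
  # Step 2: eyeball the diff to decide whether to rerun the whole
  # wrangling suite or just patch a few entries by hand.
  system2("git", c("diff", "--stat", "--", raw))
  system2("git", c("diff", "--", raw))
  saveRDS(info, state)
}
```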
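For the last point, one way to make inadvertent corruption detectable is to store a checksum beside each Rds and verify it on read. This sketch uses tools::md5sum() from base R; the function names are mine, not from any package:

```r
# Save an object together with a checksum so silent corruption
# of the .rds can be detected before the object is trusted.
save_checked <- function(object, path) {
  saveRDS(object, path)
  writeLines(unname(tools::md5sum(path)), paste0(path, ".md5"))
}

# Refuse to return an object whose file no longer matches its checksum.
read_checked <- function(path) {
  expected <- readLines(paste0(path, ".md5"))
  actual   <- unname(tools::md5sum(path))
  if (!identical(actual, expected)) {
    stop("Checksum mismatch for ", path, "; restore from version control.")
  }
  readRDS(path)
}
```

Both the .rds and its .md5 companion go under version control, so an intact copy is always recoverable.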
The most difficult approach, because it requires the least mechanical work but the most skull sweat, is to break the target object into pieces, isolating the stable parts of the source data from the volatile ones. Those pieces would be candidates for persistent storage in an RDBMS, which would not be as big a deal to implement as the full-blown data entry system described above. Keeping one big object as the current namespace is attractive, but although the uses to be made of it may require that everything be included, they rarely require everything all at once. That comes back to the design of the data store: arranging sub-objects according to what changes often and what doesn't, and what is usually handled together and what is usually handled in isolation. The principle is to avoid unnecessary work disaggregating a big object while easing the work of assembling the desired output from standardized parts.
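
As a sketch of the idea, assuming the big object is a list with a slow-changing lookup table and fast-changing measurements (all names here are hypothetical):

```r
# Persist stable and volatile parts separately, then assemble on demand.
store <- function(big, dir = "store") {
  dir.create(dir, showWarnings = FALSE)
  saveRDS(big$lookup,       file.path(dir, "lookup.rds"))        # rarely changes
  saveRDS(big$measurements, file.path(dir, "measurements.rds"))  # changes often
}

# Reassemble only the parts a given analysis actually needs.
assemble <- function(parts, dir = "store") {
  setNames(lapply(parts, function(p) {
    readRDS(file.path(dir, paste0(p, ".rds")))
  }), parts)
}

# Usage: the volatile piece alone, or the whole thing.
# m   <- assemble("measurements")
# big <- assemble(c("lookup", "measurements"))
```

The split mirrors the stable/volatile boundary, so a routine refresh touches one small file instead of resaving the whole object.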