Data Metaphors

Archive for May 2014

Data is often referred to as the new oil, but a lot of that data is hard to get at.

What did we do once all the easy coal deposits had been mined, and all the “easy oil” had been pumped?  We went for the hard stuff.  The oil in shale.  We fracked it.

We are moving from the age of data mining to the age of data fracking.

Legacy and even current approaches to data sourcing, transformation, and integration were designed for easy data.  Hard data is not easily extracted, is not exposed through clean APIs, doesn’t come with nice conformed dimensions, and is laced with hard-to-mask PII.

Fracking hard data like medical and genomic records at scale could lead to huge advances.  Fracking all government, court, and crime data could allow a new wave of citizen data-scientists to test theories on the largest sets and streams of data available.  Fracking mobile phone location data, and the sensor streams from the internet of things could drive vast new understandings of the workings of technical, economic, social, and ecological systems.

There is already an active ecosystem of startups working on the four main aspects of data fracking: 1) Sourcing, 2) Conforming, 3) Masking, and 4) Understanding.


The first, Sourcing, is the simplest.  Companies like SyncSort are bridging between legacy mainframe and modern Hadoop systems.  Cloudera’s Enterprise Data Hub vision, combined with open source Hadoop stack components such as Flume, Sqoop, and Spark Streaming, makes sourcing and transporting data easier than ever.  And that data more often comes with self-describing elements of schema.  XML and JSON provide nicely fielded and nested data structures.  Avro and Thrift add even more defined schema with each packet of data.  Of the four parts of the data fracking ecosystem, this one is almost taking care of itself—when was the last time you saw a file with fixed-width fields?
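To make the contrast concrete, here is a minimal sketch (with made-up field layouts and values) of why self-describing formats ease sourcing: a fixed-width record is unreadable without out-of-band knowledge of the layout, while the same record as JSON carries its own field names.

```python
import json

# A fixed-width record carries no schema -- the reader must already
# know the layout: name(10) + dob(8) + visits(3).  (Hypothetical record.)
fixed = "JONES     19740312021"
name = fixed[:10].strip()
dob = fixed[10:18]
visits = int(fixed[18:21])

# The same record as JSON is self-describing: field names travel
# with the data, so any downstream consumer can parse it.
record = json.loads('{"name": "JONES", "dob": "1974-03-12", "visits": 21}')
```

Formats like Avro and Thrift go a step further, attaching a typed, versionable schema to the data itself.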


The second aspect, Conforming, is a bit harder.  Traditional data warehousing practice requires lengthy manual source system analysis, as well as a detailed understanding of the reports that users need.  This would traditionally turn into a data warehouse model and a set of ETL mapping definitions that would be hand-coded by Informatica programmers.  And it would usually take 12 months end-to-end.  Fortunately, the Hadoop revolution and the EDH approach make this easier, encouraging us to simply land our data into HDFS and figure out what can be joined to what later, and only when needed.
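The "land first, conform later" idea is often called schema-on-read. A minimal sketch, with hypothetical event data standing in for files landed in HDFS: raw, heterogeneous records are stored as-is, and structure is imposed only at query time, for only the fields the question needs.

```python
import json

# Raw, heterogeneous events landed without any up-front modeling.
# (Hypothetical data; in practice these would be files in HDFS.)
raw_landed = [
    '{"user": "u1", "event": "click", "page": "/home"}',
    '{"user": "u2", "event": "purchase", "amount": 19.99}',
    '{"user": "u1", "event": "purchase", "amount": 5.00}',
]

# Schema-on-read: parse and "conform" only when a question is asked,
# and only the fields that question needs.
events = [json.loads(line) for line in raw_landed]
total = sum(e["amount"] for e in events if e["event"] == "purchase")
```

The trade-off is that the conforming work is deferred, not eliminated — which is exactly the gap the automation startups below are targeting.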

That worked while Hadoop jobs were hand-coded and no one minded if MapReduce jobs ran for a few hours.  But now people are starting to look at the Hadoop ecosystem to replace standard data warehousing and BI activities, so they are wondering how to conform dimensions and map data to schemas.

Even Ralph Kimball, the “father of data warehousing,” has started to realize that a different approach is needed, and has begun to articulate a new approach to conformed and slowly changing dimensions in his recent webinar series with Cloudera.  And a new set of startups is focusing on automating this process.  Trifacta and Paxata are working to re-invent ETL, turning data landed in Hadoop into something more ready for analytics.  And Tamr, founded by Andy Palmer, who also co-founded Vertica, just announced a $16 million venture round to use machine learning and other advanced approaches to perform automated data conforming.


The third aspect of data fracking is Masking.  Most of the general analytics we want to perform against large amounts of data work just fine without being able to tie any individual data point back to a specific person.  Masking theoretically solves this problem by de-identifying data.  De-identifying replaces PII with benign, meaningless surrogate keys.  This was traditionally a straightforward process: simply replace an SSN with a GUID, and retain a mapping table in a secure place.  But an incident at Netflix a few years ago showed how even masked data can be tied back to individuals if you have enough pieces of the puzzle at your disposal.  For example, if you know my census block and my year of birth, you can identify me with that information alone.  Zip code and date of birth turns out to be a pretty unique combination as well.
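The traditional approach — swap the direct identifier for a surrogate key and keep the mapping table locked away — can be sketched as follows. The records and field names here are hypothetical; `ssn` stands in for any direct identifier.

```python
import uuid

# Hypothetical patient records with a direct identifier (ssn)
# and quasi-identifiers (zip, dob).
patients = [
    {"ssn": "123-45-6789", "zip": "02139", "dob": "1974-03-12", "dx": "J45"},
    {"ssn": "987-65-4321", "zip": "02139", "dob": "1981-07-02", "dx": "E11"},
]

mapping = {}  # surrogate-key mapping, stored separately in a secure place

def mask(record):
    """Replace the direct identifier with a meaningless surrogate key."""
    surrogate = mapping.setdefault(record["ssn"], str(uuid.uuid4()))
    masked = dict(record)
    masked["ssn"] = surrogate
    return masked

masked = [mask(r) for r in patients]
# Note the weakness: the quasi-identifiers (zip, dob) survive masking,
# which is exactly why masked data can still be re-identified by
# joining on them against outside datasets.
```

This is the kind of masking the Netflix incident showed to be insufficient on its own.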

More secure masking requires making the data fuzzy and aggregated.  If you roll up enough people together and average them out, the data becomes harder to tie to any single person.  But this also can destroy the ability to perform correlation analysis at a detailed level: you can’t accurately correlate two datasets that have both been aggregated, any more than you can separate the flour from the eggs after the cookie batter has been mixed.
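A minimal sketch of that trade-off, using hypothetical per-person records: rolling individuals up to the zip-code level and averaging them out means no output row describes a single person, but the per-person relationship between the fields is gone from the aggregates.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-person records.
rows = [
    {"zip": "02139", "age": 40, "spend": 120.0},
    {"zip": "02139", "age": 33, "spend": 80.0},
    {"zip": "94105", "age": 51, "spend": 200.0},
    {"zip": "94105", "age": 48, "spend": 150.0},
]

# Roll individuals up to the zip level and average them out.
by_zip = defaultdict(list)
for r in rows:
    by_zip[r["zip"]].append(r)

aggregated = {
    z: {
        "n": len(rs),
        "avg_age": mean(r["age"] for r in rs),
        "avg_spend": mean(r["spend"] for r in rs),
    }
    for z, rs in by_zip.items()
}
# No aggregated row describes one person, but the person-level
# age/spend correlation can no longer be computed from these averages.
```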

So these issues with masking and aggregation mean that the most valuable data analytics still need to be done on extremely raw data—data that can generally be tied back to an actual person if you try hard enough.  Analytics at a granular level at scale is one of the more interesting parts of the data fracking technology ecosystem that remains unexplored.


The last step of data fracking is to actually have a human try to draw conclusions from data.  This is the area that we are working on at Zoomdata.

We are trying to build interfaces and analytics on top of data that are easy for normal business people to understand and interact with.  There is an ecosystem of new startups bringing new approaches to data visualization and data understanding, which—combined with the Hadoop and NoSQL revolutions and the data fracking ecosystem players—will allow humans to understand big, complex, and challenging data in entirely new and valuable ways.

So forget data mining; that was so 1990s.  The easy data has been mined already, and what’s left is messy, scattered across 25 different places, lacking conformed dimensions, and littered with PII.  But this data represents huge possibilities… the answers to most diseases may be at our fingertips, if we can just find a secure and acceptable way to frack everyone’s medical records.  So it’s time to start fracking!