Data Metaphors

Data is often referred to as the new oil, but a lot of that data is hard to get at.

What did we do once all the easy coal deposits had been mined and all the “easy oil” had been pumped?  We went for the hard stuff.  The oil in shale.  We fracked it.

We are moving from the age of data mining to the age of data fracking.

Legacy and even current approaches to data sourcing, transformation, and integration were designed for easy data.  Hard data is not easily extracted, is not exposed through clean APIs, doesn’t come with nice conformed dimensions, and is laced with hard-to-mask PII.

Fracking hard data like medical and genomic records at scale could lead to huge advances.  Fracking all government, court, and crime data could allow a new wave of citizen data scientists to test theories on the largest sets and streams of data available.  Fracking mobile phone location data and the sensor streams from the internet of things could drive vast new understandings of the workings of technical, economic, social, and ecological systems.

There is already an active ecosystem of startups working on the four main aspects of data fracking: 1) Sourcing, 2) Conforming, 3) Masking, and 4) Understanding.

Sourcing 

The first, Sourcing, is the simplest.  Companies like SyncSort are bridging between legacy mainframe and modern Hadoop systems.  Cloudera’s Enterprise Data Hub vision, combined with open source Hadoop stack components such as Flume, Sqoop, and Spark Streaming, makes sourcing and transporting data easier than ever.  And that data increasingly comes with self-describing elements of schema.  XML and JSON provide nicely fielded and nested data structures.  Avro and Thrift add even more defined schema with each packet of data.  Of the four parts of the data fracking ecosystem, this one is almost taking care of itself.  When was the last time you saw a file with fixed-width fields?
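
As a rough illustration (the record layout and field names below are made up, not from any real system), a fixed-width record is meaningless without an external layout spec, while a self-describing JSON record carries its own fields and nesting:

import json

# Hypothetical records: the fixed-width line means nothing without a spec
# document, while the JSON record describes its own fields and nesting.
fixed_width_record = "JONES     0000123420140801"

self_describing_record = json.loads("""
{
  "customer": {"last_name": "JONES", "id": 1234},
  "order":    {"amount_cents": 123400, "date": "2014-08-01"}
}
""")

print(self_describing_record["customer"]["last_name"])  # no spec document needed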

Conforming 

The second aspect, Conforming, is a bit harder.  Traditional data warehousing practice requires lengthy manual source system analysis, as well as a detailed understanding of the reports that users need.  This would traditionally turn into a data warehouse model and a set of ETL mapping definitions that would be hand-coded by Informatica programmers.  And it would usually take 12 months end-to-end.  Fortunately, the Hadoop revolution and the EDH approach make this easier, encouraging us to simply land our data in HDFS and figure out what can be joined to what later, and only when needed.
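
Here is a minimal schema-on-read sketch of that idea in PySpark; the HDFS paths and column names are hypothetical, and the point is simply that nothing gets conformed until a question is actually asked:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("land-then-conform").getOrCreate()

# Land raw exports into HDFS as-is; no upfront warehouse model or ETL mappings.
orders  = spark.read.json("hdfs:///landing/orders/")       # hypothetical path
clients = spark.read.json("hdfs:///landing/crm_export/")   # hypothetical path

# The "conforming" happens at query time, only for the join this report needs.
report = (orders.join(clients, orders.client_id == clients.id, "left")
                .groupBy(clients.region)
                .sum("amount"))
report.show()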

That worked while Hadoop jobs were hand-coded and no one minded if MapReduce jobs ran for a few hours.  But now people are starting to look to the Hadoop ecosystem to replace standard data warehousing and BI activities, so they start wondering how to conform dimensions and map data to schemas.

Even Ralph Kimball, the “father of data warehousing,” has started to realize that a different approach is needed, and has begun to articulate a new approach to conformed and slowly changing dimensions in his recent webinar series with Cloudera.  And a new set of startups is focusing on automating this process.  Trifacta and Paxata are working to reinvent ETL, turning data that is landed in Hadoop into something more ready for analytics.  And Tamr, founded by Andy Palmer, who also co-founded Vertica, just announced a $16 million venture round to use machine learning and other advanced approaches to perform automated data conforming.
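
To make “conforming” concrete, here is a deliberately naive toy sketch (the customer names are made up, and this is not how Tamr or the others actually do it): reconcile the same real-world entity across two source systems onto one conformed key.  The simple rule catches one duplicate and misses the other, which is exactly why machine learning is being applied to the problem.

# Toy "conforming" sketch: hypothetical customer names from two systems,
# matched with a deliberately naive normalization rule.
def normalize(name):
    return "".join(ch for ch in name.lower() if ch.isalnum())

crm_customers = ["Acme, Inc.", "Globex Corp"]
erp_customers = ["ACME INC", "Globex Corporation"]

conformed = {}
for source in (crm_customers, erp_customers):
    for raw in source:
        conformed.setdefault(normalize(raw), []).append(raw)

# "Acme" collapses onto one conformed key, but "Globex" does not: simple rules
# only get you so far, which is where machine-learning approaches come in.
print(conformed)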

Masking 

The third aspect of data fracking is masking.  General analytics that we want to perform against large amounts of data can work just fine without being able to tie any individual data point back to a specific person.  Masking theoretically solves this problem by de-identifying data.  De-identifying replaces PII with benign, meaningless surrogate keys.  This was traditionally a straightforward process: simply replace an SSN with a GUID, and retain a mapping table in a secure place.  But an incident at Netflix a few years ago showed how even masked data can be tied back to individuals if you have enough pieces of the puzzle at your disposal.  For example, if you know my census block and my year of birth, you can identify me with that information alone.  Zip code and date of birth turn out to be a pretty unique combination as well.
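
A minimal sketch of that traditional masking approach (the field names and values are made up): swap the SSN for a GUID and keep the mapping in a locked-down store.  Note that quasi-identifiers like zip code and date of birth pass straight through, which is exactly the re-identification risk described above.

import uuid

ssn_to_guid = {}  # in practice the mapping table lives in a separate, secured system

def mask(record):
    # Replace the direct identifier with a meaningless surrogate key.
    surrogate = ssn_to_guid.setdefault(record["ssn"], str(uuid.uuid4()))
    return dict(record, ssn=surrogate)

# Zip code and date of birth survive masking untouched, which is the kind of
# quasi-identifier combination that makes re-identification possible.
print(mask({"ssn": "123-45-6789", "zip": "22182", "dob": "1975-03-14"}))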

More secure masking requires making the data fuzzy and aggregated.  If you roll up enough people together and average them out, the data becomes harder to tie to any single person.  But this also can destroy the ability to perform correlation analysis at a very detailed level: you can’t accurately correlate two datasets that have both been aggregated, any more than you can separate the flour from the eggs after a cookie batter has been mixed.
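
As a toy example of the trade-off (the numbers are made up), rolling individuals up to a coarse group and averaging makes re-identification harder, but the individual rows you would need for fine-grained correlation are gone:

from collections import defaultdict
from statistics import mean

# Hypothetical individual-level records.
rows = [
    {"zip3": "221", "age_band": "30-39", "cost": 1200},
    {"zip3": "221", "age_band": "30-39", "cost": 3400},
    {"zip3": "221", "age_band": "30-39", "cost": 900},
]

groups = defaultdict(list)
for r in rows:
    groups[(r["zip3"], r["age_band"])].append(r["cost"])

# Only the group-level average survives.  Nothing here ties back to an
# individual, but nothing can be correlated at the individual level either.
print({k: round(mean(v), 2) for k, v in groups.items()})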

So these issues with masking and aggregation mean that the most valuable data analytics still need to be done on extremely raw data that can generally be tied to an actual person if you try hard enough.  Analytics at a granular level, at scale, is one of the more interesting and still largely unexplored parts of the data fracking technology ecosystem.

Understanding

The last step of data fracking is to actually have a human try to draw conclusions from data.  This is the area that we are working on at Zoomdata.

We are trying to build interfaces and analytics on top of data that are easy for normal business people to understand and interact with.  There is an ecosystem of new startups bringing new approaches to data visualization and data understanding, which, combined with the Hadoop and NoSQL revolutions and the data fracking ecosystem players, will allow humans to understand big, complex, and challenging data in entirely new and valuable ways.

So forget data mining; that was so 1990s.  The easy data has been mined already, and what’s left is messy, in 25 different places, lacking conformed dimensions, and littered with PII.  But this data represents huge possibilities… the answers to most diseases may be at our fingertips, if we can just find a secure and acceptable way to frack everyone’s medical records.  So it’s time to start fracking!

Art has always been a means of communication, reflection, persuasion, storytelling, and interpretation.  Artists use their imagination to create visual representations of the world, both literal and abstract.  The earliest art, prehistoric cave paintings, used simple representations, such as a cross to represent a man, and a triangle for a woman.

As the caveman’s charcoal and manganese oxide gave way to modern paint, and more recently to digital bits, visual art has become more refined.

[Image: Cave Painting]

But paint has always represented color, at its most primary just red, yellow, and blue.  From those ternary states, artists have created great works of art.  Data, at its binary core, can also be used as paint.  A new breed of artist is arising: data artists.

Data artists use data as paint to construct imaginative representations of the world in their own way.  Instead of reconstructing representations of what they see with their eyes, data artists create new ways for our eyes to see the massive flows of data that are constantly around us, yet invisible.

Like traditional art, the work of data artists can be literal or abstract.  It can try to communicate a particular point of view, or not, and can be left open to the viewer’s interpretation.  At its simplest, data art brings data into the light, letting us see facts, flows, and patterns that were previously invisible.  Instead of recreating views of things we could have observed ourselves, data artists create views of things that cannot be seen directly.  Out of darkness comes light.

Data art is in its infancy.  Just as cavepeople drew stick figures of people and buffalo, we can create pie charts in Excel.  The more talented can use tools like the open-source d3.js data visualization library to code up something novel, but even then it takes a unique combination of expert coding and design skills.

[Image: Data art, created in d3.js by the New York Times]

Most visual art is static: it doesn’t change over time and is not interactive.  This is not due to a lack of artistic creativity, but an artifact of the nature of paint.  Paint dries.

Data doesn’t dry.  It continuously flows.  Any static view of it is almost useless, and any non-interactive view is unsatisfying.  Today’s data artists can spend weeks or months creating static visuals and infographics that are out of date the minute they are published, and that, even while fresh, don’t give the viewer any way to dig in and explore.

Tomorrow’s data artists need new tools to use their data paint to create interactive, fresh views of otherwise invisible data.  At my startup Zoomdata, we are creating these tools.  We hope to make the creation of interactive data art easier and more accessible.  While we may not all become a data Picasso, we can move beyond being data cavemen.