Musings on Strata + Hadoop World NYC 2015
Recently my train rolled into New York’s Penn Station just ahead of the Strata + Hadoop World 2015 Conference. Working for EMC, I have the pleasure of interacting with many big data thought leaders inside and outside of the company. At Strata, I was able to see what other vendors are doing in the big data space.
I took full advantage of the event by attending discussions, workshops, and social events. I also made time to peruse the expo floor, talk with vendors, and just take it all in.
During the week at Strata I experienced the proverbial “refreshing drink from a fire hose.”
Big Data Needs to be More Conscious
I sat in on a lecture given by Evan Selinger, Professor of Philosophy at Rochester Institute of Technology, and Jules Polonetsky, Executive Director of the Future of Privacy Forum, a Washington think tank seeking to advance responsible data practices. The gist of the lecture was that companies need to think about how advanced analytics can have an impact outside the context of its intended purpose, and how that impact can have a damaging ripple effect.
The now ubiquitous “Target Predicts Teen’s Pregnancy” story was showcased to demonstrate that there are out-of-context consequences to algorithms and the insights they surface. Target Corporation, in its efforts to drive sales in the pre-natal market, leveraged machine learning algorithms to identify a subset of female customers who were most likely pregnant.
One of the women identified was a young lady in high school, still living at home with her parents. The father, seeing coupons for baby supplies addressed to his daughter, went to his local Target and confronted a manager. He accused the manager of encouraging his daughter to get pregnant! The store manager was caught off guard; he knew nothing about this marketing campaign being run out of corporate. And ironically, the father had no idea his daughter actually WAS pregnant.
Organizations, in their pursuit of Big Data “gold,” get overly entrenched in the technology, people and process to achieve a competitive advantage. This in and of itself is not a bad thing; however, should all found gold be used to achieve business outcomes?
Data Science for Dummies
As I explored the expo floor, I found companies promoting products that promise to remove the complexities of data science. I did not have the opportunity to sit with these vendors and ascertain their secret sauce; however, having twenty years of experience as both a data and analytics architect, I do know that analytics, be it traditional descriptive or forward-looking predictive and prescriptive, is not easy for a host of reasons.
This work is difficult, and the difficulty is mostly a function of messy, dirty data. Further, there is that little thing called “deep organizational knowledge” that makes it possible for people to identify the right questions that need to be answered when trying to solve a business problem with data.
There is no one model in predictive analytics that solves it all. In fact, several models are often used in combination. Which model to use is a function of data, context, and desired outcome. Making that choice requires a skillset that is in the wheelhouse of a data scientist.
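To make "several models used in combination" concrete, here is a minimal sketch of one common way to combine them, majority voting. Everything in it is hypothetical and invented for illustration: the three rule-based "models," the customer fields, and the thresholds. Real models would be trained on data rather than hand-written.

```python
from collections import Counter

# Three deliberately simple, hypothetical "models". Each looks at one signal
# in a customer record and votes on whether the customer is likely to leave.
def by_tenure(customer):
    return customer["months_active"] < 6

def by_usage(customer):
    return customer["logins_last_30d"] < 3

def by_support(customer):
    return customer["open_tickets"] >= 2

MODELS = [by_tenure, by_usage, by_support]

def majority_vote(customer, models=MODELS):
    """Return the prediction that most of the models agree on."""
    votes = Counter(model(customer) for model in models)
    return votes.most_common(1)[0][0]

customer = {"months_active": 4, "logins_last_30d": 10, "open_tickets": 3}
print(majority_vote(customer))  # two of the three models vote True
```

Even in this toy form, picking the signals and thresholds takes knowledge of the data and the business context, which is the part a tool cannot easily supply.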
I understand the motivation to be the first-to-market with a software platform that effectively puts the power of data science into the hands of a non-data scientist; however, seeing that most organizations today don’t even understand the role of a data scientist, I think these software products and their claims of “no more data scientists” most likely require more refining.
Structured Query Language (SQL) is Highly Resilient
There is a concerted effort to abstract the complexities of emerging persistence platforms, such as HDFS and NoSQL, and to layer the lingua franca of data, SQL, on top of them.
SQL is the de facto tool to access and manipulate data in relational database management systems (RDBMS). First offered commercially by Relational Software, Inc. (now Oracle) in 1979, SQL is a programming language that is relatively easy to learn and serves as the basis to interact with data from many sources. This makes it popular with people who work with data but aren’t involved in programming.
Products like Hive and HAWQ make it easy to query data in HDFS using SQL syntax that is familiar to many. Apache Drill is going one step further by accommodating polyglot sources such as HDFS, MongoDB, HBase, and public cloud providers. Its next release will support relational databases. With Drill’s architecture, traditional reporting interfaces can access data in these disparate repositories just as if they were an RDBMS.
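The idea of one SQL statement spanning separate repositories can be sketched with nothing more than Python's built-in sqlite3 module. This is only an analogy, not Drill's actual syntax or API: two attached SQLite databases stand in for disparate stores, and the table names and rows are made up for illustration.

```python
import sqlite3

# Two separate in-memory databases stand in for two different data stores.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS clickstream")

conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE clickstream.visits (customer_id INTEGER, page TEXT)")

conn.executemany(
    "INSERT INTO main.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
conn.executemany(
    "INSERT INTO clickstream.visits VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home")],
)

# One familiar SQL statement joins data that lives in both stores.
rows = conn.execute(
    """
    SELECT c.name, COUNT(*) AS visits
    FROM main.customers AS c
    JOIN clickstream.visits AS v ON v.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()

print(rows)  # [('Ada', 2), ('Grace', 1)]
```

The person writing the query only needs to know SQL and the table names; where each table physically lives is hidden behind the engine, which is the appeal of the tools described above.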
In the face of all these emerging technologies, SQL continues to prove its resiliency, and I don’t expect it to go away anytime soon.