It's as much a Science

Any book on Data Visualization, including Edward Tufte’s great Visual Explanations, invariably covers the story of the English physician John Snow and his work during the days when England was battling the cholera epidemic. Since it happened during the mid-19th century, when many data visualization techniques were still at a primitive stage, the story has gained acclaim for how effectively Snow used cartography to plot the data, narrate a vivid story that changed how people viewed the problem and, more importantly, influence a decision.

I would rather use the John Snow story to illustrate another important tenet of Data Science that is often missed – that it is Science after all!

To set some context, in the Britain of the 1850s the cholera epidemic killed thousands of people. It was widely believed that such diseases were spread through the air, and the dominant explanation at the time was the miasma theory. Remember, the germ theory as proposed by Louis Pasteur was not yet established.

Snow was an eminent physician, and he soon got involved in the cholera epidemic, as the sudden and serious outbreak, especially in the city of London, was shrouded in mystery. With a true scientific mindset and his deep expertise as a physician, he was particularly intrigued by a few contradictions.

If inhaling noxious air caused cholera, why didn’t other people in the neighborhood get sick?
Some people who stayed in a lodge on that street got sick while others didn’t.
If cholera really spread through the air, he expected the victims to show lung damage. Yet their lungs were fine, and the damage was in their digestive systems.
None of the workers in a nearby brewery were affected at all (note that the workers quenched their thirst with a few pints of beer instead of water).

Science always thrives in someone with an open mind who has the guts to go against conventional wisdom. A deep understanding of their field, coupled with an insatiable passion, drives their skepticism towards the norm and makes them look at paths that others missed. But this is easier said than done. As Gary Klein notes in Seeing What Others Don’t:

When people contradict the prevailing wisdom, even professional prominence won’t protect them.

Snow started collecting data about the victims and plotted the deaths on a map of the Soho district of London. It soon became evident that many of the deaths were clustered around the Broad Street water pump. He also used statistics to establish the link between the quality of the water sources and the cholera deaths. It turned out that the water for the pump was polluted by sewage, which in turn caused the epidemic. There were various anomalies that seemed to go against his finding, each of which he was able to explain objectively on a case-by-case basis. But his study was convincing enough for the local council to disable the well pump by removing its handle.

[Figure. Source: British Library.]
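
To make the clustering idea concrete, here is a minimal Python sketch that assigns each death to its nearest pump and counts the totals. The coordinates and the names "Pump B" and "Pump C" are invented purely for illustration and are not Snow’s data, and straight-line distance is a simplifying assumption (walking distance along the streets would be closer to reality).

```python
from collections import Counter
from math import hypot

# Hypothetical grid coordinates, invented for illustration only.
# These are NOT Snow's actual measurements.
pumps = {
    "Broad Street": (0.0, 0.0),
    "Pump B": (4.0, 1.0),   # hypothetical neighbouring pump
    "Pump C": (-3.0, 3.0),  # hypothetical neighbouring pump
}
deaths = [
    (0.5, 0.2), (-0.4, 0.8), (1.0, -0.5),
    (0.2, -1.1), (3.5, 1.2), (-0.9, 0.3),
]


def nearest_pump(point, pumps):
    """Return the name of the pump closest to `point` by straight-line distance."""
    x, y = point
    return min(
        pumps,
        key=lambda name: hypot(x - pumps[name][0], y - pumps[name][1]),
    )


def deaths_per_pump(deaths, pumps):
    """Count how many deaths fall nearest to each pump."""
    return Counter(nearest_pump(d, pumps) for d in deaths)


counts = deaths_per_pump(deaths, pumps)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

With made-up points like these, one pump dominates the count, which is the kind of pattern a map makes visible at a glance. A count alone only points to where to look; it took Snow’s knowledge as a physician to explain why.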

In summary, Snow did what a typical scientific researcher would do.

He started with a strong WHY – to dispel the mystery around the spread of cholera and to find a remedy.
He had a strong sense of curiosity, triggered by all the anomalies around the deaths and the prevailing theories.
He started with a hypothesis, which in his case went against the common belief in the miasma theory.
He collected data relevant to his study.
He took an innovative approach in picking an analytical method (cartography, in his case) to suit the problem at hand – studying the correlation between cholera cases and location.

So, how would you describe John Snow – a physician, data munger, statistician, or visualization expert? What are the traits that made him so good at gaining purposeful insights that resulted in a meaningful outcome?

Hmmm!

The original title I had in mind for this post was really: Can the Data Science function ever be outsourced?

The literal meaning of the term outsourcing is contracting work out to a third-party supplier of goods or services. The term is used primarily in the IT services industry, where businesses outsource their IT functions to other players, the primary driver being affordability in terms of cost and talent.

VERY IMPORTANT: For the sake of this post, I want to move away from the above definition and define outsourcing as letting someone other than the primary expert in the area of interest do the job.

The shortage of talent in niche Data Science skills and the market pressure to show quick wins in new technology areas are forcing many firms to resort to outsourcing. No wonder major research firms, too, recommend outsourcing as a key strategy for Data Science.

Despite all the hype and the heavy investments made in Big Data and Data Science, many firms still struggle with the technology, and many adopters of Data Science continue to paint a gloomy picture of their Data Strategy and of their inability to leverage data effectively for business outcomes. Perhaps many organizations are riding through the ‘trough of disillusionment’, as Gartner’s hype cycle calls it.

As with any problem, the diagnosis has to start with the WHY.

Contrary to mainstream opinion, I have always believed that Data Science effectiveness is not primarily constrained by technology or talent. It is all about mindset, culture and strategy.

The fallacy starts right from our understanding of the name itself. It is, after all, Science, and Webster’s dictionary defines Science as:

knowledge about or study of the world, based on facts learned through experiments and observation.

The four key terms that stand out in this definition, and that are central to the success of any Data Science effort, are Knowledge, Facts, Experiments and Observation.

Coming back to our topic on outsourcing, which aspects of the above definition of Science can you really outsource?

Have you heard of any scientific research that was outsourced and turned out to be a grand success? Does any outsourcing firm run a successful vertical for scientific research? Do Oxford, Cambridge, CERN or MIT outsource any of their research efforts? Would any modern data journalist outsource their story?

Why then do we apply a different line of thought to Data Science in businesses and expect different results? Perhaps what gets outsourced is not really Data Science, or perhaps this explains the lack of success of the majority of organizations with their Data Science efforts.

In my view, what really scares the industry, and hence gets outsourced, is mainly the technology part of Data Science: some of the Big Data technology areas and some portion of the visualization or analytics side.

Unfortunately, technology is only one piece of the Data Science, or Sense Making, puzzle. What gets lost in this weak definition are some of the key facets of Science, namely knowledge of the subject and the tacit skills required for a scientific mindset, such as Perseverance, Curiosity, Creative Desperation and the like. You cannot outsource these skills. You can outsource Hadoop, Cassandra, R or Python skills, but those are just means to the end goal.

Staying on the topic of outsourcing, and leaving aside third-party vendors, even the composition of the Data Science team within an organization is critical to its success. It cannot be a bunch of statisticians and technology geeks working in silos, with no business context and a hazy mandate.

Just as in scientific research, you need a deep understanding and knowledge of the subject or the business, and a clear purpose, which in turn helps you develop the tacit skills of patiently running experiments, framing and working on hypotheses, and persevering with passion for the end result.

If I form a simple equation for Data Science talent as below, which element of the equation is tough to find, and which one is a good candidate for outsourcing?

Data Science = f (Subject Matter Expertise, Data, Technology, Scientific Mindset)

This should also tell you why some firms continue to see extreme success with Data Products while many others, without a strong WHY, continue to struggle. Many of the new age companies have been successful with their data strategy because everyone starts from zero, from the ground up, everyone understands what they are up to from a business perspective, and the culture of data and sense making gets ingrained right from Day 1.

That may not be the case with many of the mainstream firms that envy the new age companies and struggle to replicate their success. They continue to juggle their legacy and strategic priorities, often taking shortcuts for quick wins.

As a remedy, legacy firms could adopt a strategy built around these areas.

A belief in a data-driven culture that resonates across the organization and is not limited to producing scorecards, but extends to deriving and measuring outcomes purely based on data products.
Strong integration of the business teams with the Data Science teams, not just limited to partnerships. The business experts should believe in and drive the Data Strategy and be very much ‘part’ of the Data Science efforts for sustained success.
Build talent in-house by re-training some of the highly motivated staff from both Business and Technology and moving them into Data Science. You cannot afford to leave your success to a few quants with zero business context. I have always believed that technology, especially in this age, is far easier to learn than a scientific mindset or deep expertise in a subject.

Going back to the John Snow story, how successful would the end result have been if he had outsourced his research to someone else?

If we were to think of the Data Science talent problem above as a linear equation, with knowledge levels and tacit thinking skills as the constraints on the different variables, which constraint do you feel is the most significant, and the scarcest, when it comes to maximizing the effectiveness of the Data Science output? And which variable do you feel is easy to source from outside?

There lies your answer to improving Data Science team effectiveness.