Data has a language. It speaks!

Let's start by looking at what a language is. Language is a complex system of communication that humans create and use across regions and countries to convey ideas, emotions and thoughts. It is a system that helps us surface the complex connections within our brains to the outside world, in either written or spoken form. Imagine a world without language: it is a reality that most modern humans would find virtually impossible to fathom. The human species has been able to survive, persevere and constantly evolve with the changing world because of its ability to use this complex system of language to communicate and create a shared knowledge base.

However, having a language is not enough; you have to have a common one. Put two experts in a room to solve a problem that can only be solved by their combined skills and that neither could solve alone. Unless they share a common language and use it to communicate, they are unlikely to make any progress on the problem at hand. If I were writing this post in Hindi, I would guess that many readers would not understand a single word of it. Having made a case for what language is and why it matters, I want to talk about how I believe each dataset has one too.

Pick up a dataset, any dataset, and there will be a story it tells. Even the absence of data is data, and it may tell us a story we did not expect at all. If we understand the language data speaks, we can unearth and communicate those stories faster and better than anyone else. We can understand the complex (and not so complex) connections a dataset may hold. In a nutshell, this is what we do as Data Scientists. We use tools like statistics and computer science to understand the hidden connections in datasets, be it big data or small data.

Let’s look at where it all begins. There are two traditional ways, and a more modern one that treats the dataset as a black box (I will get to that one later). The first traditional way is where a Data Scientist, or a Researcher, or whatever we choose to call these data enthusiasts, starts with a hypothesis, say, ‘bad weather increases Netflix viewership’. Next, the researcher goes chasing the supporting data: in this case, a timeline of Netflix viewership and the weather for the region in question. Finding that data is not always as straightforward as it may seem in this example. The second traditional way is one in which the researcher already has a dataset in hand but no initial hypothesis. In this case, they perform some initial exploration (which does not have to be long or complex) to develop a sense of what the dataset is trying to say, essentially to come up with a hypothesis. At this stage both ways converge: you have a dataset and you have a hypothesis. The next step is to cleanse and process the data so that it is easy to work with. Once the dataset is tamable, we apply statistical techniques to find connections and understand the stories it is trying to tell us.
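To make those last two steps a little more concrete, here is a minimal sketch in Python of what "cleanse, then apply a statistical technique" could look like for the weather and viewership hypothesis. Everything in it is an assumption for illustration: the numbers are made up, the names `raw_records`, `clean` and `pearson` are hypothetical, and the Pearson correlation is just one of many techniques you might reach for.

```python
from math import sqrt

# Made-up daily records for illustration only: (rainfall in mm, viewing hours).
# None marks a measurement that is missing.
raw_records = [
    (0.0, 2.1),
    (5.0, 2.8),
    (None, 3.0),
    (12.0, 3.9),
    (20.0, None),
    (8.0, 3.1),
    (1.0, 2.0),
]


def clean(records):
    """Cleansing step: drop rows where either value is missing."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete rows")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


rows = clean(raw_records)
r = pearson(rows)
print(f"{len(rows)} usable rows, correlation r = {r:.2f}")
```

A value of `r` close to 1 would be consistent with the hypothesis, but it would not prove it: a correlation on its own says nothing about cause, and dropping incomplete rows is only the simplest possible way to cleanse a dataset.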

