Data Science: A multi-disciplinary field in which scientific methodology is applied to data analysis in order to find useful insights and promote evidence-based decision making.

First blog post - what is this all about?

3rd February 2018

Author: Trevor Simmons

It is going to be predominantly about data science and the skills, techniques and tools required to practise it. Data science is an emerging multi-disciplinary field that provides a wide subject area with a wealth of topics to discuss.

Strangely enough, it is only recently that I have started to call myself a data scientist, despite having completed my first data science project in 2012. I have a background in programming, and I called myself a programmer when I landed my first programming job, so why the hesitation to take on a new title?

The multi-disciplinary nature of data science means that it is a very broad field, and it is unlikely that any single person could know the field in its entirety. Coupled with its rapid evolution, this makes it difficult to define what a data scientist actually is, let alone to know whether you are one or not. Shortly after I started to plan this blog I read Edwin Thoen’s blog post Curb your imposterism, start meta-learning, where he discusses data science and its relationship to imposter syndrome: the feeling, common among high-achieving individuals, of being a fraud despite evidence to the contrary.

Imposter Syndrome (Source: David Whittaker)

Imposter syndrome is quite common in academia, which can be elitist and highly competitive, but for data science the problem may be the sheer volume of material: numerous data science articles appear on the web each day, some recommending between ten and twenty must-read books along with a plethora of techniques. It felt like an avalanche of information that constantly expanded my reading list. By the time I finished one book or article, it had been replaced by two or more. I kept telling myself that I would learn just one more technique and then I would be ready, only to read about another, and so on ad infinitum. It is reminiscent of the legal profession, where no single person can know the entire law; it is too vast. Rather, new lawyers are taught how to research the law, and this echoes the practice Edwin Thoen calls meta-learning.

This seems to be the best option: ground oneself in the basics and then learn what is personally interesting, or what is needed for a particular task. It is something I have found myself doing many times during my career, where I have sometimes needed to pick up the required technologies during the course of a project and learn on my feet. For example, I came to learn PHP because I was unable to do what I needed in Perl, after which I switched that entire project to PHP. The same was true of my first data science project, which I did not know in advance was going to be a data science project; it only became apparent during my research that machine learning would be the best solution. Throughout my career, different technologies have come and gone, but the skills developed using them are what remain, and these are transferable.

For the basic data science skills, here is a tongue-in-cheek rundown of the range of tasks a data scientist should be able to do:

Joel Grus' slide on data science tasks (Source: Joel Grus via Jenny Bryan)

As you can see, I have cheekily crossed out two of them in the irreverent spirit of the original. This is not because I cannot do them, but because I disagree with them. The list was intended to be humorous, so it should not be taken too literally. Hacking a p-value would be fraudulent, and it is not something anyone should do. More important is to know what p-values are, how they can be hacked, why that is bad, and perhaps why they should not be used at all. As for coding on a whiteboard, this is a practice that gained popularity in US coding interviews. The original idea was to prompt conversation about a programmer’s thought process, but from what I have read it produces too many false negatives and has been misapplied. I have interviewed coders, and I personally think there are better ways of achieving the same result, although it would be beyond the scope of this post to elaborate on them here.
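
To make the p-hacking point a little more concrete, here is a minimal sketch in Python (not taken from any real analysis; the number of tests and sample sizes are arbitrary choices for illustration) showing one common form of it: running many tests on pure noise and reporting only the best-looking result.

```python
# Minimal illustration of p-hacking via multiple comparisons.
# Both groups in every test are drawn from the same distribution,
# so any "significant" difference is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests = 20        # number of comparisons tried
sample_size = 30    # observations per group

p_values = []
for _ in range(n_tests):
    a = rng.normal(loc=0.0, scale=1.0, size=sample_size)
    b = rng.normal(loc=0.0, scale=1.0, size=sample_size)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)

print(f"Smallest p-value out of {n_tests} tests: {min(p_values):.3f}")

# With 20 independent tests at the 0.05 level, the chance of at least
# one "significant" result is roughly 1 - 0.95**20, i.e. about 64%,
# even though there is no real effect anywhere.
print(f"Chance of at least one false positive: {1 - 0.95**n_tests:.2f}")
```

Cherry-picking that smallest p-value and presenting it as if it were the only test run is exactly the kind of thing the crossed-out item jokes about.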

As I progress with writing this blog I will cover some of the basics, as well as any new techniques that I learn along the way, with a view to covering the skills used in everyday data science with examples of their usage. My hope is to share what I have learned with readers ranging from beginners up to my current level, and to test my knowledge by explaining it to others. There will be code examples, which I will offer in R or Python, whichever is applicable, but I do not want to concentrate too much on coding. Too many articles do this and give the impression that data science is all about coding, when there is a whole lot more to it. As an aside, I could argue R vs Python, but in practice I use both, with one being preferable to the other depending on the task at hand.

A bit about my background: I have extensive experience as a software developer, but after the previously mentioned data science project in 2012 I decided to move my career in a different direction, so I went back to university to study an MSc in Intelligent Systems (Artificial Intelligence) at the University of Sussex, for which I received a distinction. It was not a data science course as such; at the time I applied, the university did not offer one, but it covered a lot of the techniques used in data science, which I have been expanding upon since completion.

More about all of this in future blog posts.