Monday, January 31, 2011

Data Analysis with Open Source Tools

As a longtime data wrangler who serves many masters, as one must in this role, I have been looking for a book that talks about the real-life challenges of the job. I would love some practical advice on how to do my job better without driving myself completely crazy.
I found at least some of that in Philipp K. Janert’s book Data Analysis with Open Source Tools. I am not the right audience for the math in the book, and in my experience, translating something that technical for executive management would be extremely challenging, if not impossible. Often there are no serious math nerds on the team who understand the concerns of the business well enough to bring their numerical and computational skills to bear on them effectively (e.g., to produce three action items that improve customer engagement by 15% in the next 90 days).
More often than not, it falls on the rest of us who straddle the technical and business worlds to divine (or help divine) something of value from the many cesspools of enterprise data. To be successful, we need to know how to make the most of what little we have in terms of clean data, repeatable processes, momentum for improving them, and a common understanding of data across the enterprise.
In the preface and introduction of his book, Janert advocates using as little statistics as possible: go with the most commonsense way to analyze the data set, and get a feel for it just by looking at it. Slice and dice it many ways, and run some charts and numbers to see if there is an interesting story buried there somewhere. This has been my approach almost 90% of the time, and I was excited to see it endorsed by the author. I have used what the math yielded as a way to prove or disprove my story. While far from perfect, the method has helped point clients in the right direction and remedy issues that would otherwise have gone undiscovered.
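For what it is worth, here is a minimal sketch of what I mean by slicing and dicing, using nothing but the Python standard library. It is my own illustration, not something taken from the book: the order records, the field names and the `slice_by` helper are all made up for the example.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical order records; the fields and values are invented for illustration.
orders = [
    {"region": "East", "channel": "web", "amount": 120.0},
    {"region": "East", "channel": "store", "amount": 80.0},
    {"region": "West", "channel": "web", "amount": 200.0},
    {"region": "West", "channel": "web", "amount": 40.0},
    {"region": "West", "channel": "store", "amount": 60.0},
]


def slice_by(records, key, value="amount"):
    """Group records by one field and summarize another with simple statistics."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[value])
    return {
        name: {"count": len(vals), "mean": mean(vals), "median": median(vals)}
        for name, vals in groups.items()
    }


# Look at the same data from two angles before reaching for heavier statistics.
by_region = slice_by(orders, "region")
by_channel = slice_by(orders, "channel")

for label, summary in (("region", by_region), ("channel", by_channel)):
    print(label, summary)
```

In this made-up data, both regions have the same average order, but the West median is much lower, which is the kind of small oddity that tells me where to dig next.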
Later in the book, the author brings up a very important point. Getting data to be good enough is often feasible, but getting it to be truly high quality may be an impossible task. If the success of a project hinges completely on the data being better than good enough, it may be wiser not to take on the project at all. This is excellent advice that I will remember to pass on to clients who are bent on cleaning the Augean stables in their quest for business intelligence nirvana.
I would definitely refer to this book again if my job ever required me to do the math on data instead of analyzing it with the far less rigorous techniques that most shops are content to use. However, I will continue to look for a cookbook for the analyst who has to work within the constraints of time, poor data quality, and a lack of cohesive processes at the sources of the data. Ideally, that book will have case studies, problem scenarios, and real-life solutions that folks like me can relate to and apply in our own jobs.
