October 23, 2015
If you have been in the analytics profession for a couple of years, you have had your fair share of discussions about the right tool for a given job. You may have a favourite, or you may be a master of many. If you are new to analytics, you may be learning one now, sometimes driven by interest but often by what your organization supports or works with. Whatever your experience in the world of data science, you have probably faced the challenge and frustration of keeping pace with new and upcoming tools and software.
As this field grows and new tools and technologies keep emerging (and the emergence of Big Data hasn’t helped matters), one must weigh the effort of mastering yet another language or tool, or yet another machine learning theory. Let’s take a look around and see what’s out there and what each tool is good for.
SAS is one of the most common tools for data processing and model development. When the analytics function started emerging in the financial services sector a couple of decades ago, SAS became a common choice because of its simplicity and its extensive support and documentation. SAS comes in handy both for step-by-step data processing and for automated scripting. All is well, except that SAS wasn’t and isn’t cheap. SAS also has limited visualization capabilities and almost no support for parallelization.
R is perhaps the first choice of most data scientists and modelers because it was designed for this very purpose. R has robust support for machine learning algorithms, and its library base keeps growing. There is hardly any discipline, be it biotechnology or geospatial processing, for which a ready package is not available in R. R is also fairly easy to learn and has good tutorials and support available, though, being a free product, its support is more often found in forums and examples than in formal documentation. R is also great for data visualization and plotting, which is almost always necessary for data analysis.
Python isn’t new, per se, but Python for analytics is a recent phenomenon. Another free language, Python has great capabilities for general-purpose programming. Python works well for web scraping, text processing, file manipulation, and simple or complex visualizations. Python is advancing, but is not yet on par with R, in dealing with structured data and analytical models. Python also doesn’t support data visualization in as much detail as R does.
SQL (MySQL, in this case) isn’t generally considered a data analysis language since it has no support for model training, but it is great for data manipulation and summarization. SQL is also fairly easy to learn and is often used as a pre-processing tool for other model development environments.
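To make the pre-processing role concrete, here is a minimal sketch in Python using the standard-library sqlite3 module (SQLite rather than MySQL, so it runs without a server). The transactions table, its columns, and the values are made up for illustration; the point is only the shape of the summarization step.

```python
import sqlite3

# In-memory database; table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20.0), (1, 35.5), (2, 10.0), (3, 5.0), (3, 7.5), (3, 12.5)],
)

# Collapse raw transactions to one row per customer, the kind of
# summary table that is then handed to a modeling tool.
summary = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)    AS n_transactions,
           SUM(amount) AS total_amount,
           AVG(amount) AS avg_amount
    FROM transactions
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Each tuple in summary is one customer with a transaction count, total, and average, which could then be exported to R, SAS, or Python for model training.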
Of course, there are many more tools for general-purpose data analysis, let alone specialized tools for specific applications. Compiled languages like C++ and Java are still the go-to choice when system integration and production run-time speed are important. Other high-level languages like Ruby and Perl also support data handling. MATLAB comes in handy for data processing, visualization and optimization, but perhaps not for model development. By the way, when I say that a tool doesn’t readily support model development, I mean that it doesn’t have native support for training, say, a logistic regression model. That doesn’t prevent anyone from manually coding gradient descent on the logistic cost function.
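As a rough idea of what coding it manually involves, the sketch below fits a logistic regression by batch gradient descent using only the Python standard library. The function names, learning rate, iteration count, and toy data are all assumptions chosen for illustration; a real project would normally rely on a tested library instead.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, lr=0.5, n_iter=5000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the averaged gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(X, w, b):
    """Return 0/1 labels using a 0.5 probability threshold."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]

# Toy data (made up): one feature, label 1 for the larger values.
X = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic(X, y)
print(predict(X, w, b))
```

The whole model is a loop that nudges the weights against the gradient of the cost, which is why a lack of native support is an inconvenience rather than a blocker.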
While general-purpose analysis tools occupy the work days of most of us, none of us should be immune to market trends and the future of analytics. Hadoop started off the Big Data revolution, but thankfully, over time many technologies have come along to abstract away the complexity of manually writing mappers and reducers and to provide a general wrapper for working with Big Data. Pig, Hive, and Google BigQuery provide SQL-like environments for handling large tables, while Spark provides general-purpose data processing and analytic modeling functionality. Storm is currently considered the most suitable for handling streaming data, and MongoDB, Vertica and Couchbase provide advanced data storage solutions.
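For readers who have never written a mapper and reducer, the following is a single-process sketch of the idea in plain Python, using the classic word-count example. It is not the Hadoop API: the function names mapper, shuffle, and reducer and the sample lines are illustrative assumptions, and a real cluster would distribute each phase across machines.

```python
from collections import defaultdict

def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

# Sample input lines (made up for illustration).
lines = ["big data is big", "data tools keep emerging"]

pairs = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a SQL-like layer such as Hive, the same result is a one-line GROUP BY query, which is the abstraction those tools provide.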
|Tool||SAS||SQL||R||Python|
|Cost||Paid||MySQL has paid editions, but free options like MySQL Community Edition and SQLite are available||Free||Free|
|Learning Difficulty||Low – Medium||Low||Medium – High||Medium – High|
If you feel confused by this laundry list of tools, I understand. Every situation is different, of course, but R and Python are both high-level languages and generally simple to pick up (yes, they take time to master), so you should be comfortable working in either. If you are looking for a first language to pick up for machine learning, I recommend R. You can venture into Python if your work goes beyond structured data. Otherwise, invest in learning open-source technologies and tools, since the world is generally moving away from enterprise software products.
All views are the author’s personal opinion based on past experience. Some of the tools were last used a couple of years ago, and their functionality may have changed since. This post is not intended to be the authoritative word on the merits or demerits of each tool, but a general indicative guide.