Data Science


Data science is the practice of sifting through large volumes of data and analyzing many variables to reach conclusions, make predictions or find novel solutions that were not previously known. It draws on several other disciplines, including mathematics, information science, computer science and analytics, combined with strong business acumen. Data scientists are highly valued for these skills and for their creativity.

The two programming languages most preferred by data scientists are Python and R. R has a home-court advantage in data visualization and statistical tasks. The latest TIOBE index (i.e., April 2018) reflects the increasing popularity of R, which is currently ranked 12th.

R was designed by Ross Ihaka and Robert Gentleman and is continually improved by the R Development Core Team. Any R programmer can contribute to the existing list of open source R packages. R uses lexical scoping, and its functionality can be extended almost without limit through these packages. Out of roughly 15,000 R packages, the top ten packages for data science are:

  1. Data scientists sift through big data to clean and reshape it. This is known as data wrangling. The golden package for both data wrangling and analysis is dplyr. dplyr is the successor to plyr and is widely preferred for its easy-to-use functions such as filter(), arrange(), select(), rename(), mutate(), transmute(), summarize() and sample_n(). The beauty of dplyr is heightened by the %>% operator, which pipes the result of one function into the next.
  2. The next package which further assists in data wrangling is data.table. data.table is an enhanced version of data frames and is not limited to just subsetting rows or variables. One main advantage of data.table is that grouping with by retains the original order of the groups. The keyby argument sorts the groups in ascending order if required. Chaining expressions is also possible here.
  3. The tibble package provides an evolved form of data frames that abandons obsolete features. Tibbles have been designed with performance in mind. The package creates data frames, converts other objects into tibbles and is strict about subsetting: it does not allow partial matching of column names.
  4. The next step data scientists undertake is to represent this data pictorially. This is known as data visualization. The two best R packages for this are ggplot2 and plotly. ggplot2 is used for static graphs and produces print-quality graphics. It supports advanced features and fine-grained customisation, and it is also a good choice if you want a quick exploratory view.
  5. plotly is inherently interactive, offers a choice of fonts and is also available for Python, MATLAB, Excel, etc. Another plus point is that a ggplot2 graph can be converted to plotly with the ggplotly() function.
  6. Tables are another way to visualize data, and the gmodels package allows you to create contingency tables swiftly with its CrossTable() function.
  7. Since R is not the only language used for statistical purposes, the foreign package becomes crucial for importing data stored in Minitab, S, SAS, SPSS, Stata and similar formats. The foreign package reads this outside data into R objects.
  8. For programmers who know both R and C++, using the Rcpp package should pose no problem at all. This seamless back-and-forth between R and C++ makes writing new code easy and accelerates the integration of independent libraries.
  9. The next R package recommended to data scientists for importing and exporting data is googlesheets. This package saves the trouble of analysing and filtering the data, exporting it and then reopening it in Google Sheets. It provides a direct route between the two and connects with your Google account. After you authorise it to access your Google Drive, you can seamlessly load data from Google, amend data and create new Sheets.
  10. Finally, the devtools package is an elegant way to create your own packages. You can use dplyr and ggplot2 functions in your new package, provided you declare these two packages as imports. devtools allows you to create new functions, document them and your package, and finally install it using the devtools::install() command.
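
As a rough illustration of the data wrangling described in item 1, the sketch below chains a few dplyr functions with the %>% operator. It assumes dplyr is installed and uses R's built-in mtcars data set purely as stand-in data; the kpl column, the 100-horsepower cut-off and the summary_by_cyl name are illustrative choices, not something from the packages themselves.

```r
# Assumes dplyr is installed: install.packages("dplyr")
library(dplyr)

# mtcars ships with R and is used here only as example data.
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                  # keep cars with more than 100 horsepower
  mutate(kpl = mpg * 0.4251) %>%        # convert miles per gallon to km per litre
  group_by(cyl) %>%                     # one group per cylinder count
  summarize(n = n(),                    # number of cars in each group
            mean_kpl = mean(kpl)) %>%   # average fuel economy in each group
  arrange(desc(mean_kpl))               # most economical group first

print(summary_by_cyl)
```

Each step receives the result of the previous one, so the pipeline reads top to bottom in the order the operations are applied.
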
This blog post was written by Nirmal Patel. The opinions expressed in this post are his personal views.