Workshop for beginners in R: advantages, installation and packages

5 min reading

Development / 16 October 2015

General-purpose languages are the most widely used by programmers all over the world because they allow the possibility of developing projects with different profiles. R is not one of these. It is a syntax that is used for handling large volumes of data, normally of a statistical kind, and with a single objective: to display them so they can be understood. Even so, it has become increasingly popular.

Why has R become such an interesting option?

1. Robust language.

R is a language with a steep learning curve, but which is very robust and effective for handling statistical data. This is an object-oriented language, very similar to the syntaxes C and C++. For developers who are expert in these languages, it may be simple. R is also a programming language that's constantly evolving, and for which there is ample documentation available.

2. It's easy to prepare the data.

When a programmer handles very high volumes of data, a large part of his or her time is spent preparing the information for display before any conclusions can be drawn from it. With R, this preparation is relatively simple, largely because it automates a number of processes by programming scripts.

3. R works with any file type.

R is very flexible, and can work with data from all file types: a .txt, un .csv, a JSON or an EXCEL…

4. It manages large volumes of data.

R is a language that allows the implementation of additional packages, which gives it an enormous capacity for data management. In large volume projects, scalability is a key element.

5. It's open source and free.

If you're a programmer and you want to start using R, this is an important point. There are no limitations. The code is in the repository of any collaborative development platform like GitHub, or on developers' forums like Stackoverflow. There are additional libraries and packages to develop special projects, whose code can be modified according to your liking in order to implement new functionalities. And all for free.

6. Visualizations: Who's going to understand if there are no graphics?

If there's anything that defines R, it's its capacity for displaying complex information in a simple way. All developers have libraries specializing in data visualization to create graphics.

How can I begin to use R to work with data?

To install R in your computer, go to the CRAN project (Comprehensive R Archive Network) and download the .exe for Linux, Mac OS X or Windows. Once you've done that, follow the usual process for any program or web application installed in an operating system:

– Run the .exe file. If this is the first time you're installing R in your computer, you'll have no problem. If you have an earlier version of R installed in your computer, the best idea is to uninstall it first.

– After choosing the language, the user goes through the installation process on a series of different screens.

– You need to choose a folder to store R.

– There are two types of installation: user (standard) or full.

CRAN has a full list of R packages in alphabetical order. You'll find source code and documentation. Each package also includes information on its functionalities.

Like any other syntax, a developer must use an integrated development environment (IDE) to program a project. In the case of R, one of the most commonly used is RStudio. This IDE has a console and a text editor, with some features like syntax highlighting, error debugger, code auto-complete, advanced work directory manager and more.

What are the best libraries for R?

1. Data handling

– plyr: This R package allows operations to be made in the subgroups of a large dataset. It has different functions for operating with these data: ddply, daply, dlply, adply and ldply. How can you begin to work with it?

– Installation: use the function install.packages (“plyr”).

– Load the library: use the command library (“plyr”).

– How to apply operations within a dataset included within the plyr package in R: We can use statistical data on baseball players data (baseball) and begin to operate with them to find out –for example– a player's minimum and maximum games per year:

ddply(baseball, .(id), function(x) c(years=nrow(x), minimum=min(x$g), maximum=max(x$g)))

This process of installation and running packages in R is the same for all the packages. install.packages function to install the package and the library function to load it.

– reshape2: This R package transforms data between wide and long formats. It's based on two key functions: melt and cast. Melt takes wide-format data and melts it into long-format data. The cast function does the same in reverse.

Wide-format data have one column for each variable. The long format has one column for each type of variable and another for the values of these variables. With the reshape2 package functions you can easily transform a dataset from one format to the other. Practical case.

Other R packages for data handling: lubridate and stringr.

2. Data visualization

– ggplot2: This is a package that provides R with everything it needs to create graphics in an accessible way. It's a really powerful library: It allows all types of visualizations (bars, points, lines, areas and more); and it has a system of coordinates for making map graphics and scales. ggplot2 can be combined with other libraries to create data trends, for example.

ggplot2 is not part of R's standard installation package, so you need to download it and specifically include it. Once downloaded, there are two functions for installing and executing:

– install.packages (“ggplot2”) function to install it.

– And we use the library function (“ggplot2”) to load it.

The R package ggplot2 depends on other packages that must also be downloaded and installed such as ‘itertools’, ‘iterators’, ‘reshape’, ‘proto’, ‘plyr’, ‘RColorBrewer’, ’digest’ and ‘colorspace’.

– rgl: Package for the creation of 3D graphics in real-time. It uses an OpenGL rendering backend (Open Graphics Library, a standard specification that defines a set of functions for writing data visualization applications).

3. Automatic learning

– randomforest: This is classification method that uses decision trees (a standard machine learning technique) applied to large volumes of data. It can be used for both supervised and unsupervised learning.

This method has some important features:

– It is efficient in large databases.

– It enables the management of thousands of input variables without the need to delete any variables.

– The decision trees created in one project can be used for other datasets in the future.

– It has a function for estimating missing data. This allows the method to maintain its accuracy, even when there is no access to certain data.

– randomForestSRC: Method of random forests for survival, regression and classification.

A random forest for survival is a mathematical tool that allows a large number of variables to be weighted in order to generate a predictive model of survival, and is normally used within the healthcare sector.

Classification trees are used when the response or target variable is categorical. In contrast, regression trees are used when the response variable is numerical. This is the difference between a classification and a prediction model.

Among the other R packages widely used by data scientists:

– Gradient boosting: gbm and xgboost.

– Support vector machines: e1071, LiblineaR and kernlab.

– Regression with regularization: glmnet.

– Generalized additive models: gam.

– Clustering: cluster.

Follow us on @BBVAAPIMarket

Name	Owner	Duration	Description
gobp.lang	BBVA	1 month	Language preference
aceptarCookies	BBVA	1 year	Configuration Accepted Cookies
_abck	BBVA	1 year	Helps protect against malicious website attacks
bm_sz	BBVA	4 hours	Helps protect against malicious website attacks
ADRUM_BTs	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BT1	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BTa	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BT	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
xt_0d95e	Salesforce Marketing Cloud	Session	Remember user preferences (if any)
__s9744cdb192d044faa1bf201d29fafd1e	Salesforce Marketing Cloud	Session	Remember user preferences (if any)
wpml_browser_redirect_test	WPML	Session	Text translation in the portal
wp-wpml_current_language	WPML	24 hours	Text translation in the portal

Name	Owner	Duration	Description
AMCV_***	Adobe Analytics	Session	Unique Visitor IDs used in Cloud Marketing solutions
AMCVS_***	Adobe Analytics	2 years	Unique Visitor IDs used in Cloud Marketing solutions
demdex (safari)	Adobe Analytics	180 days	Create and store unique and persistent identifiers
sessionID	Adobe Analytics	Session	Launch's internal cookie used to identify the user
gpv_URL	Adobe Analytics	Session	Adobe Analytics plugin: getPreviousValue Capture the value of a certain variable in the following page view, in this case the prop1
gpv_level1	Adobe Analytics	Session	Cookie used to store the DataLayer levl1 of the previous page.
gpv_pageIntent	Adobe Analytics	Session	Cookie used to store the pageIntent of the previous page.
gpv_pageName	Adobe Analytics	Session	Cookie used to store the pagename of the previous page.
aocs	Adobe Analytics	Session	Cookie that stores the first values collected at the beginning of a process.
TTC	Adobe Analytics	Session	Cookie used to store the time between the App Page Visit event and the App Completed event.
TTCL	Adobe Analytics	Session	Cookie used to store the time between the LogIn event and App Completed.
s_cc	Adobe Analytics	Session	Determine if cookies are active
s_hc	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_ht	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_nr	Adobe Analytics	2 years	Determine the number of user visits
s_ppv	Adobe Analytics	Permanent	Adobe Analytics plugin: getPercentPageViewed Determine what percentage of the page a user views
s_sq	Adobe Analytics	Session	ClickMap/ActivityMap features
s_tp	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_visit	Adobe Analytics	2 years	Cookie used by Adobe to know when a session has been started.

Name	Owner	Duration	Description
OT2	VersaTag	90 days	VersaTag Cookie used to store a user id and the number of user visits.
u2	VersaTag	90 days	VersaTag Cookie where the user ID is stored
TargetingInfo 2	MediaMind	1 year	Cookie that serves to assign a unique random number that generates MediaMind.

Name	Owner	Duration	Description
mbox	Adobe Target	9 days	Cookie used by Adobe Target to test user experience customization.

Workshop for beginners in R: advantages, installation and packages

It may interest you

APIs in selling: the final push

APIs are everywhere, but… what about their documentation?

Tools to measure the success and effectiveness of your API