The data scientist’s toolbox

Data Science stands today as a multidisciplinary profession in which knowledge from various areas overlaps, in a profile more typical of the Renaissance than of this super-specialized 21st century.

Given the scarcity of formal training in this field, data scientists are forced to collect dispersed knowledge and tools to optimally develop their skills.

What follows is intended as a basic guide, by no means exhaustive, to some useful resources available for each facet of these professionals' work.

Data management

Part of the work of the data scientist is to capture, clean up and store information in a format suitable for processing and analysis.

The most usual scenario is to access a copy of the data source for a one-time or periodic capture.

You will need to know SQL to access data stored in relational databases. Every database ships with a console for running SQL queries, although most people prefer a graphical environment that shows tables, fields and indexes. Two of the most popular data management tools are Toad, proprietary software for Microsoft’s platform, and TOra, which is open-source and cross-platform.

Once the data is extracted, we can store it in plain text files that we upload to our working environment for machine learning, or load it into a tool such as SQLite.

SQLite is a lightweight relational database with no external dependencies that does not need to be installed on a server. Moving a database is as easy as copying a single file. In our case, we can process the information without concurrent or multi-user access to the source data, which suits the characteristics of SQLite perfectly.

The languages we use for our algorithms can connect to SQLite (Python through sqlite3, and R through RSQLite), so we can choose to import the data before preprocessing or to do part of the preprocessing in the database itself, which helps avoid problems once the number of records grows large.
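As a minimal sketch of doing part of the preprocessing in the database itself, the following uses Python's standard sqlite3 module. The table name, column names and sample rows are hypothetical, and an in-memory database stands in for the single-file database described above.

```python
import csv
import io
import sqlite3

# Hypothetical extract; in practice this would be a plain-text file on disk.
RAW_CSV = """customer_id,amount
1,10.50
2,
1,4.50
3,20.00
"""


def load_into_sqlite(conn, text):
    """Create a table and bulk-insert the rows of a CSV extract."""
    conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        # Empty amounts are stored as NULL so they can be filtered in SQL.
        (int(r["customer_id"]), float(r["amount"]) if r["amount"] else None)
        for r in reader
    ]
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def clean_totals(conn):
    """Drop missing amounts and total per customer inside the database."""
    cur = conn.execute(
        "SELECT customer_id, SUM(amount) FROM purchases "
        "WHERE amount IS NOT NULL GROUP BY customer_id ORDER BY customer_id"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # passing a file path would persist the data
n_rows = load_into_sqlite(conn, RAW_CSV)
totals = clean_totals(conn)
print(n_rows, totals)
```

Only the cleaned, aggregated rows reach Python, so the raw records never need to fit in the memory of the analysis session.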

An alternative for bulk data capture is a tool that covers the full ETL cycle (Extraction, Transformation and Load), e.g. RapidMiner, Knime or Pentaho. With these we can graphically define the data acquisition and cleansing cycles using connectors.

If we have guaranteed access to the data source during preprocessing, we can use an ODBC or JDBC connection (RODBC and RJDBC in R; pyodbc, mxODBC and SQLAlchemy in Python) and let the database engine perform the joins (JOIN) and aggregations (GROUP BY), importing only the results afterwards.
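The idea of pushing joins and aggregations down to the engine can be sketched as follows. This is an illustration only: sqlite3 is used as a stand-in for an ODBC connection so that the example is self-contained, and the tables, columns and values are hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database standing in for the real source."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'north'), (2, 'south'), (3, 'north');
        INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);
        """
    )
    return conn


def revenue_by_region(conn):
    """Let the engine do the JOIN and GROUP BY; fetch only the summary."""
    query = """
        SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(query).fetchall()


summary = revenue_by_region(build_demo_db())
print(summary)
```

With a real server the query text would stay essentially the same; only the connection object would change.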

For processing outside the database, pandas (a Python library) and data.table (an R package) are our first choice. data.table works around one of R’s weaknesses, memory management, by performing vectorized operations and grouping by reference without temporarily duplicating objects.

A third scenario is accessing information that is generated in real time and transmitted in formats such as XML or JSON. These are known as incremental learning projects, and they include recommendation systems, online advertising and high-frequency trading.

For this we use tools like XML or jsonlite (R packages), or xml and json (Python modules). With them we capture the stream, make our predictions, send them back in the same format, and update our model once the source system later provides the results actually observed.
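The capture, predict, reply and update loop described above might look roughly like the sketch below. The message layout ("type", "id", "x", "observed") is an assumption made for illustration, and the one-feature linear model updated by stochastic gradient descent is a deliberately simple placeholder for a real incremental learner.

```python
import json


class OnlineLinearModel:
    """One-feature linear model updated by stochastic gradient descent."""

    def __init__(self, learning_rate=0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, observed):
        # Gradient step on the squared error of a single observation.
        error = self.predict(x) - observed
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def handle_message(model, raw):
    """Parse one JSON message; reply with a prediction or absorb feedback."""
    msg = json.loads(raw)
    if msg["type"] == "request":
        reply = {"id": msg["id"], "prediction": model.predict(msg["x"])}
        return json.dumps(reply)  # answer in the same format we received
    if msg["type"] == "feedback":
        model.update(msg["x"], msg["observed"])
        return None
    raise ValueError("unknown message type")


model = OnlineLinearModel()
stream = [  # hypothetical messages; a real source would be a socket or queue
    '{"type": "request", "id": 1, "x": 2.0}',
    '{"type": "feedback", "id": 1, "x": 2.0, "observed": 4.0}',
    '{"type": "request", "id": 2, "x": 2.0}',
]
replies = [handle_message(model, raw) for raw in stream]
print(replies)
```

The second request receives a different prediction from the first because the feedback message arrived in between, which is the essence of incremental learning.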

Analysis

Although Business Intelligence, Data Warehousing and Machine Learning are all part of Data Science, the last of these is the one that requires the greatest number of specific utilities.

Hence, our toolbox will need to include R and Python, the programming languages most widely used in machine learning.

For Python we highlight the scikit-learn suite, which covers almost all techniques, except perhaps neural networks. For these there are several interesting alternatives, such as Caffe and Pylearn2. The latter is based on Theano, a Python library that allows symbolic definitions and transparent use of GPUs.

Some of the most used packages for R:

– Gradient boosting: gbm and xgboost.

– Random forests for classification and regression: randomForest and randomForestSRC.

– Support vector machines: e1071, LiblineaR and kernlab.

– Regularized regression (Ridge, Lasso and ElasticNet): glmnet.

– Generalized additive models: gam.

– Clustering: cluster.

If we need to modify an R package, we will need C++ and some utilities to rebuild it: Rtools, an environment for building R packages on Windows, and devtools, which simplifies the processes related to package development.

There are also some general purpose tools that will make our life easier in R:

– data.table: Fast reading of text files; creation, modification and deletion of columns by reference; joins by a common key or group; and data summaries.

– foreach: Execution of parallel processes against a previously defined backend, using utilities such as doMC or doParallel.

– bigmemory: Management of massive matrices in R and sharing of information across multiple sessions or parallel analyses.

– caret: Model comparison, control of data partitions (splitting, bootstrapping, subsampling) and parameter tuning (grid search).

– Matrix: Management of sparse matrices and transformation of categorical variables to binary (one-hot encoding) using the sparse.model.matrix function.
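To illustrate what one-hot encoding into a sparse structure means, here is a small pure-Python sketch. It is not the R Matrix package: the function names are hypothetical, and unlike sparse.model.matrix, which builds a model matrix from a formula and may apply contrasts, this sketch simply emits one indicator column per category.

```python
def one_hot_sparse(values):
    """Return (categories, coordinates) for a one-hot encoding.

    Only the positions of the ones are stored, as (row, column) pairs,
    which is the idea behind a sparse matrix.
    """
    categories = sorted(set(values))
    column_of = {cat: j for j, cat in enumerate(categories)}
    coordinates = [(i, column_of[v]) for i, v in enumerate(values)]
    return categories, coordinates


def to_dense(coordinates, n_rows, n_cols):
    """Expand the coordinate list into a full 0/1 matrix, for inspection."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j in coordinates:
        dense[i][j] = 1
    return dense


colors = ["red", "green", "red", "blue"]  # hypothetical categorical variable
categories, coords = one_hot_sparse(colors)
dense = to_dense(coords, len(colors), len(categories))
print(categories, coords)
```

The sparse form stores one pair per row regardless of how many categories exist, whereas the dense form grows with rows times categories.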

Distributed environments deserve a special mention. If we have dealt with data from a large institution or company, we will probably have worked with the so-called Hadoop ecosystem. Hadoop combines a distributed file system (HDFS) with a programming model (MapReduce) that allows information to be processed in parallel.
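The MapReduce model can be sketched on a single machine with the classic word-count example. This is only an illustration of the map, shuffle and reduce phases in plain Python with hypothetical function names; it does not use Hadoop or distribute any work.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (or block) would be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big models", "data tools"])
print(counts)
```

Because each map call sees only its own document and each reduce call sees only one key, both phases can run in parallel, which is what makes the model suitable for distributed processing.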

Among the machine learning tools compatible with Hadoop we find:

– Vowpal Wabbit: Online learning methods based on gradient descent.

– Mahout: A suite of algorithms that includes recommendation systems, clustering, logistic regression and random forests.

– h2o: Perhaps the fastest-growing tool, with a large number of parallelizable algorithms. It can be run from a graphical environment or from R or Python.

The data scientist should also keep abreast of the generational shift from Hadoop to Spark.

Spark has several advantages over Hadoop for processing information and running algorithms. The main one is speed: it can be up to 100 times faster because, unlike Hadoop, it works in memory and only writes to disk when necessary.

Spark can run independently or may coexist as a component of Hadoop, allowing migration to be planned in a non-traumatic way. You can, for example, use HBase as a database, even though Cassandra is emerging as a storage solution thanks to its redundancy and scalability.

As a sign of the changing landscape, Mahout has been working since last year on its integration with Spark, distancing itself from MapReduce and Hadoop, while H2O.ai has launched Sparkling Water, a version of its h2o suite on Spark.

Visualization

Finally, a brief reference to the presentation of results.

The most popular tools for R are clearly lattice and ggplot2, and for Python, Matplotlib. But if we need professional presentations embedded in web environments, the best choice is certainly D3.js.

Among the integrated Business Intelligence environments with a clear approach to presentations we should highlight the well known Tableau, and as alternatives for graphical exploration of data, Birst and Necto.
