Apache Spark 2.0: what's new in the latest release

Development / 22 July 2016

This is not the first time we have written about Apache Spark at BBVAOpen4U. We explained its main features when it was still an emerging technology, and we reported on its success when it broke into the market for Big Data applications as one of the fastest, most powerful and most scalable options. The launch of version 2.0 confirms its warm reception among developers and the possibilities it offers companies that use big data to gain a competitive advantage.

Apache Spark is an open source distributed computing platform with a very active community. It is faster and cheaper to implement and maintain than its predecessors in the Hadoop ecosystem, such as MapReduce; it is unified; it has an interactive console that is very convenient for developers; and it has a powerful API for working with data. Its appeal is that it gives data engineers and data scientists a single tool for machine learning, graph processing, data streaming and interactive querying in real time, with the scalability needed to respond to demand.

Apache Spark 2.0 comes with some interesting features that make it an even more powerful tool for working with Big Data:

Apache Spark 2.0, faster than Spark 1.x and MapReduce

When Apache Spark arrived, speed stood out as one of the essential advantages of the new platform, because Spark works in memory rather than on disk. Caching the data makes working with it faster and more efficient. This applies not only to the original data, but also to the results of later transformations. When the system needs the data, it does not have to go back to disk; it simply reads it from memory. According to the project's own figures, Apache Spark can run up to 100 times faster than Apache Hadoop MapReduce in memory and up to 10 times faster on disk.
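
The effect of caching can be illustrated with a small sketch in plain Python. It is not Spark's API: the `LazyDataset` class and its `map`, `cache` and `collect` methods are hypothetical names, chosen only to mimic the idea that a transformation is recomputed from the source on every use unless its result is kept in memory.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the rows
        self._cached = None       # rows kept in memory once computed
        self._use_cache = False

    def map(self, fn):
        # Build a new dataset whose rows are derived from this one.
        return LazyDataset(lambda: [fn(row) for row in self.collect()])

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached   # served from memory
        rows = self._compute()
        if self._use_cache:
            self._cached = rows
        return rows


reads = {"count": 0}


def read_from_disk():
    # Stands in for an expensive disk read and counts how often it runs.
    reads["count"] += 1
    return [1, 2, 3, 4]


source = LazyDataset(read_from_disk)
squares = source.map(lambda x: x * x).cache()

first = squares.collect()    # triggers the simulated disk read
second = squares.collect()   # answered from memory, no second read
```

In this sketch the simulated disk read runs only once, even though the transformed data is requested twice. Without the call to `cache()`, every `collect()` would go back to the source.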

Apache Spark 2.0 doesn’t stop there. The new version increases data processing speed further, both by making better use of the CPU's own cache and by generating code at run time. It is estimated that, for some operations, the new version can be between 5 and 10 times faster than Spark 1.x.

The APIs are unified in a single API

Spark's great added value is the real-time processing and analysis of big data. What the developer community began to ask of Spark's maintainers was a step forward: combining real-time data processing with other kinds of analysis (batch processing on the one hand and interactive querying on the other). Spark’s second version therefore offers an API that lets developers build applications that combine real-time, interactive and batch components.

To work with the integrated Apache Spark 2.0 API, the development team sets up data storage with ETL (extract, transform, load) processes. This lets developers run web analytics through interactive queries within a specific session or, for example, apply machine learning by training models on historical sample data and then updating them with more recent information.

The DataFrame and Dataset APIs are unified in a single library, which makes it easier for developers to learn the programming elements they need, especially in Java and Scala. The typed Dataset API is not available in Python or R, because these dynamically typed languages cannot provide its compile-time type checks; in those languages DataFrame remains the main interface.

Structured data streaming

This unified API includes the new high-level Structured Streaming API, built on top of the Spark SQL engine. It allows queries to be run interactively or in batch (on static data in a database) or on a stream (querying data in real time as it flows between the source and the database, to prepare reports or monitor specific information, for example).

The idea is that developers can build “continuous applications”: applications that need a streaming engine but also integrate it with batch processing, interactive querying, external storage systems and changes in business logic. In other words, Apache Spark 2.0 makes it easier to program multi-purpose applications without having to use several different programming models. This is an advantage when working with third-party systems or services such as MySQL or Amazon S3 (Simple Storage Service).
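
The principle of writing the logic once and running it on both static and streaming data can be sketched in plain Python. This is only an illustration of the idea, not Spark's Structured Streaming API: the `update_counts` function, the sample `events` and the micro-batch size of two are all assumptions made for the example.

```python
from collections import Counter


def update_counts(state, rows):
    """Fold a chunk of (page, clicks) rows into the running totals."""
    for page, clicks in rows:
        state[page] += clicks
    return state


# Hypothetical click events: (page, number of clicks).
events = [("home", 1), ("cart", 2), ("home", 3), ("pay", 1), ("cart", 1)]

# Batch: the whole static dataset is processed in one go.
batch_result = update_counts(Counter(), events)

# Streaming: the same logic is applied to each micro-batch as it arrives,
# and the running result can be inspected after every step.
stream_result = Counter()
snapshots = []
for start in range(0, len(events), 2):
    micro_batch = events[start:start + 2]
    update_counts(stream_result, micro_batch)
    snapshots.append(dict(stream_result))
```

Once every micro-batch has been processed, the streaming totals match the batch totals, while the `snapshots` list shows the intermediate results that a report or monitor could have read along the way.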

Spark as a compiler

Spark's maintainers have always been concerned with increasing its speed, even though it is already a very fast technology. The demand comes from the community itself, as expressed in the periodic surveys carried out to improve the project. Spark 2.0 can be up to 10 times faster than its predecessor, Spark 1.6, because its developers have stripped out non-essential work.

According to its maintainers, most of a data engine's CPU cycles are spent on wasted work such as making virtual function calls or reading and writing intermediate data to CPU cache or memory. Cutting out those unnecessary CPU cycles is a big step.

Spark 2.0 is built on the second generation of the Tungsten engine, which borrows principles from modern compilers and MPP (massively parallel processing) databases. How does it do this? It generates code at run time that keeps intermediate data in CPU registers and eliminates virtual function calls.
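
The difference between the two execution styles can be sketched in plain Python. This is a toy model, not Spark's implementation: the `Scan`, `Filter` and `Sum` classes and the generated `fused` function are hypothetical, and Python gives no control over CPU registers. The sketch only shows the structural change: a chain of operators that passes every row through several dynamically dispatched method calls, versus a single loop generated at run time that keeps the intermediate total in one local variable.

```python
class Operator:
    """Base class for the interpreted plan; each operator pulls rows from its child."""

    def next_rows(self):
        raise NotImplementedError


class Scan(Operator):
    def __init__(self, rows):
        self.rows = rows

    def next_rows(self):
        yield from self.rows


class Filter(Operator):
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def next_rows(self):
        # One dynamically dispatched call per operator, per row.
        for row in self.child.next_rows():
            if self.predicate(row):
                yield row


class Sum(Operator):
    def __init__(self, child):
        self.child = child

    def next_rows(self):
        total = 0
        for row in self.child.next_rows():
            total += row
        yield total


def interpreted_sum_of_evens(rows):
    plan = Sum(Filter(Scan(rows), lambda x: x % 2 == 0))
    return next(plan.next_rows())


def generate_fused_sum_of_evens():
    # Generate the source of one tight loop at run time and compile it,
    # so the whole pipeline runs without any per-row operator calls.
    source = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for x in rows:\n"
        "        if x % 2 == 0:\n"
        "            total += x\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["fused"]


fused_sum_of_evens = generate_fused_sum_of_evens()

data = list(range(10))
interpreted = interpreted_sum_of_evens(data)
fused = fused_sum_of_evens(data)
```

Both versions compute the same sum of the even numbers in `data`; only the shape of the executed code differs. In Spark 2.0 this collapsing of a query into a single generated function is known as whole-stage code generation.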


