Spark, The next big data star technology?

In the world of Big Data, new technology enriches the Hadoop ecosystem almost every month. However, this can also confuse new starters as there so many types of software, and it is not possible to remember them all. Nevertheless, over the last few months Spark has neither been forgotten nor lost strength, and is now a Big Data star technology.

At Pragsis we have been testing this new Hadoop component for some time. However, last week I attended Cloudera's three-day course (to be included in Pragsis training range soon), and got to know this company in more depth. As a result, my opinion of Spark is now even more positive.

Spark will bring a change to the world of Big Data
Spark offers many advantages when compared to MapReduce-Hadoop:

– Big Data “in-memory”
Spark's most visible facet. Let's forget about SAP-Hana and other ("expensive") proprietary solutions. Spark performs parallel jobs totally in memory, hence greatly reducing processing times. This is especially relevant if we talk about iterative processes such as Machine Learning processes. Below you can see the benchmark found at https://spark.apache.org/, which compares Spark's with Hadoop-MapReduce's performance.

Working in memory does not mean that we need to buy servers with terabytes of RAM; and this represents a step away from Hadoop's “commodity hardware”. If some data cannot be stored in memory, Spark keeps working and uses the hard disk to store the data that are unnecessary at any given time. Programmers can also set priorities, and specify which data must always be in memory.

– More flexible computing schema vs. MapReduce
I have always disliked MapReduce's little flexibility in terms of creating data flows. You can only create a “Map -> Shuffle -> Reduce” schema. Sometimes it would be interesting to create a “Map -> Shuffle -> Reduce -> Shuffle -> Reduce” schema but, with MapReduce, you need to create a “job 1(Map -> Shuffle -> Reduce) ; job 2 (Map -> Shuffle -> Reduce)” flow, which forces you to include a Map "identity" phase with no purpose other than wasting time. In order to optimize this, the Tez framework was created to replace “MapShuffleReduce” with a directed acyclic graph (DAG) flow of execution:

Spark is also aware of this limitation and, as with Tez, suggests DAG-based workflows. Processing times are further reduced.

– Real-time streaming and batch unification
This is another very important issue for Big Data. Attempts to create mechanisms that provide Hadoop with a real-time component have been made for years (framework initially oriented toward batch processes). There have been several proposals (HBase, Impala, Flume, etc.) but they only partially meet this need for real time and stream processing. Then, we had the famous “lambda” and “kappa”. Spark simplifies all this by allowing work both in batch and real-time stream modes. The same framework for two worlds.

– A very flexible way of working
Spark provides an API for Java, and Python and Scala: no esoteric languages that are only compatible with one technology (such as Pig Latin). Python and Scala can also be used in scripting mode: no time is wasted with iterations such as "edit, build and run program". You only have to run the commands in a Spark interpreter to be able to use the data in our cluster. This makes “Data Discovery” much easier (see last section of this article).

Additionally, you can write a whole application on Python, Java or Scala (this is the best language in terms of performance and use of every Spark feature), build it and run it in the cluster.

– A much more complete ecosystem
As with MapReduce, Spark acts as a base framework on which a growing number of increasingly advanced applications is developed.Currently there is:- Spark SQL: explore data with SQL language- Spark Streaming: mentioned above – MLib: libraries for Machine Learning- GraphX: for graph computing

I guess this software sounds quite similar to Hive, Mahout and Giraph on Hadoop. Given how often Spark is adopted and its appeal to many key companies in the Big Data (Yahoo, Facebook, etc.) industry, I have no doubt that this ecosystem will grow in the next few months.

For these reasons, it's not surprising that many people see Spark as the future of Hadoop. “MapReduce has no advantage over Spark” was often repeated during Cloudera's course.

Spark is not stable and is not ready for production yet

Spark seems promising. But this sentence by Cloudera needs to be put into perspective. In fact, Cloudera's staff went on to say that Spark is still a young software without the same stability as Hadoop-MapReduce. For example, during the exercises the students found a bug in Spark, and the course trainer opened an incident straight away. This had never happened before…

Another important detail: Cloudera did confirm that some of its clients are using Spark but none of them are using it in production mode.

Spark progresses very fast: in one year it went from version 0.7 to version 1.0. This is really good because it means that a lot of improvements are added, and the project is very much alive. However, so many version changes also reflect some lack of maturity and, consequently, stability.

Another example is the sudden diversity of SQL solutions on Hadoop:

– First, there was Hive.

– Cloudera launched Impala to provide "real time" on Hadoop.

– Hortonworks developed Stinger to improve Hive and compete against Impala.

– Shark was developed over Spark as a “Hive in memory”.

– Databricks stopped supporting Shark and decided to refactor everything in a new project: SparkSQL.

– There is a new Hive initiative built on Spark. It seems Cloudera is relatively supportive of this new project, which shows Hortonworks that promoting Impala was a mistake.

This tangle of SQL solutions on Hadoop is not surprising for this type of very innovative technologies. In fact, I think that such a high number of projects encourages competitiveness and is a good thing…at first. Later, as with any software, there must be consolidation with only two (or maybe three) key SQL solutions left. So there is no reason to rush.

In short, Spark is a technology with a lot of potential. But at present I would not recommend that you put a lot of effort into it, or that you try to develop very advanced applications and move solutions to production.

What should we do now? Testing and data discovery

Spark is not sufficiently mature for a production environment. Does this mean that we shouldn't use it now? Of course not.

Firstly, testing this new technology is very important for many technology companies. Since Spark will replace MapReduce, any Big Data company that wants to retain its leading position cannot ignore Spark. Pragsis is every aware of this. For this reason, we have been testing this technology for months (without major developments; see previous section).

Secondly, I do believe that this technology is a very important addition to “Data Discovery”. Spark is a very valuable tool in terms of exploring data interactively for several reasons: it can be used in scripting mode; it allows extraction of information from HDFS; it has very good scalable processing capacity; and it includes a very advanced API linked to machine learning libraries. In fact, Spark is a lot more valuable than Pig, which was once regarded as a possibility for data discovery.

Additionally, Data Discovery is a very interactive process that is often found at the beginning of some projects. Consequently, we are far from an "in production" deployment. And, to this extent, Spark's lack of stability is no problem at all.

It may interest you

What is an API, types of APIs and how they work

An API is a very useful mechanism that connects two pieces of software equipment to exchange messages or data in a standard format such as XML or JSON. Thus, it becomes an instrument that can be used to search for revenue, open the doors to talent or innovate and automate processes.

APIs , Banking as a service , Desarrollo de negocio , Digital transformation / 18 December 2023
What is business process automation, and what is it used for?

APIs can be a great support when automating business processes Companies, often with a focus on SMEs, spend too many man-hours on time-consuming business processes, thereby making mistakes that a machine would never make. How can business process automation (BPA) help these companies? Is it possible to make use of APIs for BPA? What is […]

APIs , Banking as a service , Digital transformation / 07 September 2023
How is ecommerce developing in Spain?

Ecommerce has continued to grow steadily in Spain, except during the pandemic, which has already been overcome in terms of online shopping. Ecommerce has been making inroads among the Spanish for over two decades. In 2000, it was a marginal and niche activity. Now it is almost universal. Almost all Spaniards with internet access shop online […]

Digital ecosystem , Digital transformation , e-Commerce , Payment gateway , Payments / 27 December 2022

Name	Owner	Duration	Description
gobp.lang	BBVA	1 month	Language preference
aceptarCookies	BBVA	1 year	Configuration Accepted Cookies
_abck	BBVA	1 year	Helps protect against malicious website attacks
bm_sz	BBVA	4 hours	Helps protect against malicious website attacks
ADRUM_BTs	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BT1	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BTa	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BT	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
xt_0d95e	Salesforce Marketing Cloud	Session	Remember user preferences (if any)
__s9744cdb192d044faa1bf201d29fafd1e	Salesforce Marketing Cloud	Session	Remember user preferences (if any)
wpml_browser_redirect_test	WPML	Session	Text translation in the portal
wp-wpml_current_language	WPML	24 hours	Text translation in the portal

Name	Owner	Duration	Description
AMCV_***	Adobe Analytics	Session	Unique Visitor IDs used in Cloud Marketing solutions
AMCVS_***	Adobe Analytics	2 years	Unique Visitor IDs used in Cloud Marketing solutions
demdex (safari)	Adobe Analytics	180 days	Create and store unique and persistent identifiers
sessionID	Adobe Analytics	Session	Launch's internal cookie used to identify the user
gpv_URL	Adobe Analytics	Session	Adobe Analytics plugin: getPreviousValue Capture the value of a certain variable in the following page view, in this case the prop1
gpv_level1	Adobe Analytics	Session	Cookie used to store the DataLayer levl1 of the previous page.
gpv_pageIntent	Adobe Analytics	Session	Cookie used to store the pageIntent of the previous page.
gpv_pageName	Adobe Analytics	Session	Cookie used to store the pagename of the previous page.
aocs	Adobe Analytics	Session	Cookie that stores the first values collected at the beginning of a process.
TTC	Adobe Analytics	Session	Cookie used to store the time between the App Page Visit event and the App Completed event.
TTCL	Adobe Analytics	Session	Cookie used to store the time between the LogIn event and App Completed.
s_cc	Adobe Analytics	Session	Determine if cookies are active
s_hc	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_ht	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_nr	Adobe Analytics	2 years	Determine the number of user visits
s_ppv	Adobe Analytics	Permanent	Adobe Analytics plugin: getPercentPageViewed Determine what percentage of the page a user views
s_sq	Adobe Analytics	Session	ClickMap/ActivityMap features
s_tp	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_visit	Adobe Analytics	2 years	Cookie used by Adobe to know when a session has been started.

Name	Owner	Duration	Description
OT2	VersaTag	90 days	VersaTag Cookie used to store a user id and the number of user visits.
u2	VersaTag	90 days	VersaTag Cookie where the user ID is stored
TargetingInfo 2	MediaMind	1 year	Cookie that serves to assign a unique random number that generates MediaMind.

Name	Owner	Duration	Description
mbox	Adobe Target	9 days	Cookie used by Adobe Target to test user experience customization.

Spark, The next big data star technology?

It may interest you

What is an API, types of APIs and how they work

What is business process automation, and what is it used for?

How is ecommerce developing in Spain?