Spark, a framework that is all the rage

2 min read
11 February 2016


At the BBVA Innovation Center on Plaza Santa Bárbara in Madrid, Jorge López-Malla, Big Data Architect at Stratio, explained why so many great things have been said about Spark for some time, a framework he had no access to until he started working for his present company. Stratio is looking for developers, as its Human Resources representatives pointed out during the presentation.

The event, which promised answers to key questions for development in the Apache framework (how to combine Spark SQL processes with others launched from Spark Core, or how to apply MLlib algorithms to real-time logic), began like a recent history lesson.

The concept of Big Data was born in 2003, with a Google paper on distributed file processing, but we had to wait until 2006 for the Yahoo! team to launch Hadoop, which ended up forming the basis on which operations with Big Data would take shape.

The problem, according to López-Malla, is that Hadoop emerged in response to a type of problem different from those faced today by developers who work with distributed file processing. Technology has changed in ten years, but so have the market and the demand for software.

Flink (also open source) and Spark emerged in response to today’s problems and, according to López-Malla, the latter “is not the future, but the present of Big Data”. The ground-breaking feature of Spark is its processing speed. It is therefore “an evolution of Hadoop and its paradigm”, but with the advantage of offering performance 10 to 100 times greater than Hadoop’s MapReduce.

Everything is based on RDDs (Resilient Distributed Datasets), “distributed collections of data” designed to be processed in partitions. These partitions are independent of each other, so the workflow on each one can continue without interruption, regardless of what happens in the others.
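The partition model described above can be sketched in plain Python. This is a conceptual illustration, not Spark’s actual API: the helper names (`split_into_partitions`, `process_partition`) are hypothetical, and a thread pool stands in for the cluster nodes that Spark would use to process each partition independently.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a dataset into roughly equal, independent partitions."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_partition(partition):
    """Work applied to one partition; no knowledge of the others is needed."""
    return [x * x for x in partition]

if __name__ == "__main__":
    data = list(range(10))
    partitions = split_into_partitions(data, 3)
    # Each partition is processed in isolation, as Spark would do
    # on separate executors across a cluster.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(process_partition, partitions))
    flat = [y for part in results for y in part]
    print(flat)
```

Because no partition waits on another, a failure or delay in one partition does not stall the rest, which is the property the talk highlighted.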

While Hadoop’s core improved over time, its modules did not benefit from those improvements. Spark changes this radically: programmers now work against a single API shared by all of its modules.

Three of Spark’s most popular modules were presented during the afternoon at the BBVA Innovation Center: Spark SQL (for querying structured data with the SQL language or an API), Spark Streaming (for managing data in near real time instead of in large batches) and MLlib, which provides Spark with machine-learning functionality.
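Spark Streaming’s “real time instead of batches” approach actually works by slicing the incoming stream into small micro-batches. The sketch below illustrates that idea in plain Python; the function names are hypothetical and no Spark API is involved.

```python
def micro_batch_stream(events, batch_size):
    """Group an incoming event stream into small 'micro-batches',
    the model Spark Streaming uses to approximate real-time processing."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any trailing, incomplete batch
        yield batch

def word_count(batch):
    """A classic per-batch computation: count words in each micro-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

if __name__ == "__main__":
    stream = ["error in node", "node ok", "error again"]
    for batch in micro_batch_stream(stream, 2):
        print(word_count(batch))
```

Each micro-batch is processed with the same logic you would apply to a full batch job, which is why Spark can reuse one API across batch and streaming workloads.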

The full presentation by Jorge López-Malla, with visuals and worked examples, is available on the BBVA Innovation Center’s YouTube channel, where the video embedded below can be found along with many others.

Follow us on @BBVAAPIMarket
