BBVA API Market
This is not the first time we have spoken about Apache Spark at BBVAOpen4U. We explained its main features when it was still an emerging technology, and we confirmed its success when it elbowed its way into the market for Big Data applications as the fastest, most powerful, most scalable and most sustainable option. The launch of version 2.0 confirms the warm reception it has had from the developer community and the enormous possibilities it offers companies that use big data to gain a competitive advantage.
Apache Spark is an open source distributed computing platform with a very active community. It is faster and cheaper to implement and maintain than its predecessors in the Hadoop environment, such as MapReduce; it is unified; it features an interactive console that is very convenient for developers; and it has a powerful API for working with data. We consider it the best option on the market because it gives data engineers and data scientists a single tool that covers machine learning, graph computation, data streaming and interactive query processing in real time, with the scalability needed to keep up with demand.
Apache Spark version 2.0 comes with some interesting features that make it an even more powerful tool for working with Big Data:
The arrival of Apache Spark revealed speed as one of the essential advantages of the new platform, because Spark works in memory rather than on disk. Caching the data makes interaction with it faster and more efficient. This applies not only to the original data, but also to the results of subsequent transformations of that data. When the system needs the data, it does not have to read it from disk; it simply goes to the cache. It is estimated that Apache Spark is 100 times faster in memory and 10 times faster on disk than Apache Hadoop MapReduce.
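The effect of caching a transformed dataset can be illustrated with a minimal sketch in plain Python. This is not Spark's API: `LazyDataset`, `read_from_disk` and the `disk_reads` counter are hypothetical names introduced only to show the idea that a cached result is computed once and then served from memory.

```python
class LazyDataset:
    """Toy lazily evaluated dataset; a hypothetical stand-in, not Spark's API."""

    def __init__(self, compute):
        self._compute = compute        # recipe that produces the rows on demand
        self._cache_enabled = False
        self._cached = None            # in-memory copy, filled on first use

    def map(self, fn):
        # A transformation only records work; nothing runs until collect().
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        # Ask for the result to be kept in memory after the first computation.
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached        # served from memory, no recomputation
        rows = list(self._compute())
        if self._cache_enabled:
            self._cached = rows
        return rows


disk_reads = {"count": 0}


def read_from_disk():
    # Simulated slow source; counts how often it is actually read.
    disk_reads["count"] += 1
    return range(1, 6)


source = LazyDataset(read_from_disk)
squared = source.map(lambda x: x * x).cache()

total = sum(squared.collect())      # first action: reads the source, fills the cache
largest = max(squared.collect())    # second action: answered from memory
```

In this sketch the simulated disk is read only once even though two actions run on `squared`. Without the call to `cache()`, every action would trigger a new read of the source.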
Apache Spark 2.0 doesn’t stop there. The new version increases data processing speed even further, relying both on caching (integrated cache memory, in this case) and on code generation at runtime. It is estimated that the new version can be between 5 and 10 times faster than version 1.0 and the releases that followed it.
The spectacular added value of Apache Spark is real-time processing and analysis of big data. What the developer community began to demand from Spark's maintainers was a leap forward that combined real-time data processing with other types of analysis (batch processing on the one hand and interactive querying on the other). So Spark’s second version has an API that lets developers build applications that combine real-time, interactive and batch components.
To work with this integrated Apache Spark 2.0 API, the development team must configure data storage with ETL (extract, transform, load) functions. This gives developers web analytics via interactive queries within a specific session or, for example, the option of applying machine learning to build effective models by training on older sample data and then incorporating more recent information.
The DataFrame and Dataset APIs are unified in a single library to make it easier for developers to learn the necessary programming elements, especially in two languages: Java and Scala. The typed Dataset API is not available in Python or R because those languages are dynamically typed; there, the DataFrame remains the main programming interface.
This unified API includes the new high-level Structured Streaming API, built on top of the Spark SQL engine. This is the feature that allows the same interactive queries to run in batch mode (on static data in a database) or in streaming mode (querying the data in real time as it flows from the source to the database, to prepare reports or monitor specific information, for example).
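The idea of one query serving both batch and streaming data can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not Spark's Structured Streaming API: the `count_by_page` function, the `events` sample and the micro-batch split are all hypothetical.

```python
from collections import Counter


def count_by_page(events, state=None):
    """Count page views; `state` carries the running result of earlier batches."""
    state = Counter() if state is None else state
    state.update(event["page"] for event in events)
    return state


# Hypothetical sample data standing in for a table of page-view events.
events = [
    {"page": "/home"},
    {"page": "/pricing"},
    {"page": "/home"},
    {"page": "/docs"},
    {"page": "/home"},
]

# Batch mode: the query runs once over the complete, static dataset.
batch_result = count_by_page(events)

# Streaming mode: the same query is applied to each micro-batch as it arrives,
# and the running state is updated incrementally.
stream_result = None
for micro_batch in (events[:2], events[2:4], events[4:]):
    stream_result = count_by_page(micro_batch, stream_result)
```

Once the stream has delivered every event, `stream_result` matches `batch_result`. The developer writes the aggregation logic once and decides separately whether it runs over static data or over data in flight.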
The idea is that developers can program “continuous applications”; in other words, applications that require a streaming engine but also need to integrate it with batch processing, interactive querying, external storage systems or changes in business logic. We could say that Apache Spark 2.0 makes it easier for developers to program multi-purpose applications without the need to use several different programming models. This is particularly useful when working with third-party systems or providers, such as MySQL or Amazon S3 (Simple Storage Service).
The maintainers of the Spark project have always been concerned with increasing its speed, even though it is already a tremendously fast technology. The reason behind this requirement is demand from the community itself, expressed in the periodic surveys held to improve the project. Spark 2.0 is up to 10 times faster than its predecessor, Spark 1.6, because its developers have stripped out non-essential work.
According to its maintainers, most of a data engine's CPU cycles are spent on wasteful work such as making virtual function calls or reading and writing intermediate data to the CPU cache or memory. Optimizing the engine to avoid these unnecessary CPU cycles is a big step.
Spark 2.0 is based on the second generation of the Tungsten engine, which borrows the principles that govern modern compilers and MPP (massively parallel processing) databases. How does it do it? It keeps intermediate data in CPU registers and eliminates calls to virtual functions.
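The difference between an operator-by-operator engine and runtime code generation can be sketched in plain Python. This is a simplified illustration, not Tungsten's implementation: the operator classes, `interpreted_sum` and `generate_fused_sum` are hypothetical, and Python cannot control CPU registers, so a local variable stands in for a register here.

```python
class Scan:
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        yield from self.rows


class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):      # one indirect call per row
                yield row


class Project:
    def __init__(self, child, fn):
        self.child, self.fn = child, fn

    def __iter__(self):
        for row in self.child:
            yield self.fn(row)           # another indirect call per row


def interpreted_sum(rows):
    # Operator-at-a-time plan: every row is handed from operator to operator.
    plan = Project(Filter(Scan(rows), lambda x: x % 2 == 0), lambda x: x * 10)
    return sum(plan)


def generate_fused_sum(predicate_src, projection_src):
    # Build the source of a single tight loop for this specific query,
    # then compile it at runtime. The running total lives in one local variable.
    source = (
        "def fused(rows):\n"
        "    total = 0\n"
        "    for x in rows:\n"
        f"        if {predicate_src}:\n"
        f"            total += {projection_src}\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)
    return namespace["fused"]


fused_sum = generate_fused_sum("x % 2 == 0", "x * 10")
rows = list(range(10))
```

Both `interpreted_sum(rows)` and `fused_sum(rows)` compute the same answer. The generated version does it in a single loop with no per-row hand-offs between operators, which is the kind of overhead the paragraph above describes removing.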