Data extraction tools for beginners and professionals

To work with data, it is essential to have data. Sometimes that information is structured and on other occasions it is unstructured. Nowadays there are many tools or processes through which a developer can extract data from complex formats such as PDF or one or more websites, which is known as web scraping. The goal is to have the data to be able to view and understand.

Web scraping could be defined as the technique by which a team of developers is able to scrape or free data from websites run by governments, state institutions and organizations to access private or public data that may be published or distributed in open format. The problem is that the format of most interesting data is not reusable and it is opaque such as a PDF for example.

In order to access and distribute this information, there are a lot of tools or processes through the use of programming languages. This is a guide to using the main data extraction methods.

Web scraping tools

● ImportHTML formula

As part of Google applications, the grand search engine has developed its own Excel called Google Spreadsheet. This tool provides almost all the features offered by Microsoft Excel, but also has some added functionality through content that is indexed to the internet by the search engine: RSS feed reader, website updates and data extraction.

All this is possible through the use of formulas like ImportFeed, ImportHTML and ImportXML. The second of these allows any user to extract data from tables or lists in an orderly fashion from any website. One of the elements of the formula type varies depending on whether it is a table or a list. Here are two practical examples:

=ImportHTML(“url página web”, “table”, 2)

=ImportHTML(“url página web”, “list”, 2)

By including any of these formulas in the first cell of Google Spreadsheet, it possible to extract the second table or list of the URL that the user adds within double quotes. It is very simple.

● Table Capture

Table Capture is an extension for the Chrome browser, which provides a user with data on a website with little difficulty. It extracts the information contained in an HTML table of a website to any data processing format such as Google Spreadsheet, Excel or CSV. This is similar to the ImportHTML formula.

● ScraperWiki

ScraperWiki is the perfect tool for extracting data arranged in tables in a PDF. All you need to do is load the file and export it. If the PDF has multiple pages and numerous tables, ScraperWiki provides a preview of all the pages and the various tables and the ability to download the data in an orderly way and separately.

With ScraperWiki you can also clean the data before it is exported to a Microsoft Excel file. This is useful as it makes things much easier when clean data is added to a visualization tool.

● Tabula

Tabula is a desktop application for Windows, Mac OSX and Linux computers that provides developers and researchers with a simple method to extract data from a PDF to a CSV or Microsoft Excel file for editing and viewing. Tabula is a tool that is widely used in data journalism.

The steps to use Tabula are as follows:

– Load a PDF with the data table you want to export.

– Select the table with all the information.

– Select the option ‘Preview and data extraction’. Tabula scrapes the data in the table and provides the user with a preview of the information extracted for it to be checked.

– Click ‘Export’.

– The data are exported to a Microsoft Excel file, or a LibreOffice file if you do not have Microsoft Office.

Tabula is an open source project available on GitHub.

● Import.io

Import.io is a free online tool, but there is also a fee-based version for companies. This aids structured extraction of data and downloading in CSV format or generating an API with the information. API data are updated as data is modified in the source environment.

Import.io has a desktop application that can be downloaded by any user to their Windows, Mac OSX or Linux computer. In this application, Import.io offers several methods for extracting very different data: information contained in a URL, information in HTML or XML, images, numerical values, maps, i.e. everything.

Extracting data with Python

In BBVAOpen4U we have seen what Python is and how it works when developing digital projects or using libraries for data visualization, but this is the first time one of its most interesting and professional features has been mentioned: extracting unstructured data. There are also many libraries in this language for data access.

● BeautifulSoup

BeautifulSoup is a Python library used to easily extract specific data from a web page in HTML without much programming. It is technically called parsing HTML. One of the advantages of this library in Python is that all of the output documents of the data extraction are created in UTF-8, which is quite interesting because the typical problem of encoding is completely solved.

Another powerful feature of BeautifulSoup is that it uses Python analyzers such as lxml or html5lib, which makes it possible to crawl tree-structured websites. These enable you to go through each ‘room’ of a website, open it, extract your information and print it.

An example, extracting all links of a site such as Reddit:

from bs4 import BeautifulSoup
import urllib2

redditFile = urllib2.urlopen(“http://www.reddit.com”)
redditHtml = redditFile.read()
redditFile.close()

soup = BeautifulSoup(redditHtml)
redditAll = soup.find_all(“a”)
for links in soup.find_all(‘a’):
print (links.get(‘href’))

● Python Mechanize

Mechanize is a virtual browser that manages to track a website with Python programming language. It is based on the urllib module.

● Scrapy

Scrapy is an open code development framework for data extraction with Python. This framework allows developers to program spiders used to track and extract specific information from one or several websites at once. The mechanism used is called selectors; however, you can also use libraries in Python such as BeautifulSoup or lxml.

Follow us on @BBVAAPIMarketA

It may interest you

Buy Now Pay Later B2B: What is it and how can it benefit your company?

In the dynamic world of payments, a new star has emerged in recent years: Buy Now Pay Later (BNPL), i.e. short-term financing that allows you to buy now and pay later. This model allows businesses to purchase goods or services and pay for them in installments, often interest-free, making it an attractive alternative to credit […]

04 November 2024
BBVA and Vecttor (Cabify): an innovative arrangement for driver cash management

BBVA and Vecttor, Cabify’s subsidiary engaged in managing vehicles with drivers, have entered into an alliance that saves time and provides security to the company and its drivers. The collaboration allows drivers to deposit cash collections at any BBVA ATM and Vecttor to automatically reconcile this activity from their accounts with those in the company’s […]

Treasury / 23 October 2024
BBVA Best Bank in World for its Open Banking Offer, according to Global Finance

BBVA has been recognized by Global Finance as the bank with the best global open banking offer for companies. This award comes on top of 12 other recognitions the magazine has bestowed on the company, such as the best bank for corporate clients and the one recognizing its AI factory as one of the best […]

Treasury / 21 October 2024

Name	Owner	Duration	Description
gobp.lang	BBVA	1 month	Language preference
aceptarCookies	BBVA	1 year	Configuration Accepted Cookies
_abck	BBVA	1 year	Helps protect against malicious website attacks
bm_sz	BBVA	4 hours	Helps protect against malicious website attacks
ADRUM_BTs	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BT1	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BTa	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
ADRUM_BT	Salesforce Marketing Cloud	Session	Required for monitoring of the service, inherent to SFMC
xt_0d95e	Salesforce Marketing Cloud	Session	Remember user preferences (if any)
__s9744cdb192d044faa1bf201d29fafd1e	Salesforce Marketing Cloud	Session	Remember user preferences (if any)
wpml_browser_redirect_test	WPML	Session	Text translation in the portal
wp-wpml_current_language	WPML	24 hours	Text translation in the portal

Name	Owner	Duration	Description
AMCV_***	Adobe Analytics	Session	Unique Visitor IDs used in Cloud Marketing solutions
AMCVS_***	Adobe Analytics	2 years	Unique Visitor IDs used in Cloud Marketing solutions
demdex (safari)	Adobe Analytics	180 days	Create and store unique and persistent identifiers
sessionID	Adobe Analytics	Session	Launch's internal cookie used to identify the user
gpv_URL	Adobe Analytics	Session	Adobe Analytics plugin: getPreviousValue Capture the value of a certain variable in the following page view, in this case the prop1
gpv_level1	Adobe Analytics	Session	Cookie used to store the DataLayer levl1 of the previous page.
gpv_pageIntent	Adobe Analytics	Session	Cookie used to store the pageIntent of the previous page.
gpv_pageName	Adobe Analytics	Session	Cookie used to store the pagename of the previous page.
aocs	Adobe Analytics	Session	Cookie that stores the first values collected at the beginning of a process.
TTC	Adobe Analytics	Session	Cookie used to store the time between the App Page Visit event and the App Completed event.
TTCL	Adobe Analytics	Session	Cookie used to store the time between the LogIn event and App Completed.
s_cc	Adobe Analytics	Session	Determine if cookies are active
s_hc	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_ht	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_nr	Adobe Analytics	2 years	Determine the number of user visits
s_ppv	Adobe Analytics	Permanent	Adobe Analytics plugin: getPercentPageViewed Determine what percentage of the page a user views
s_sq	Adobe Analytics	Session	ClickMap/ActivityMap features
s_tp	Adobe Analytics	Session	Cookie used by Adobe for analytical purposes
s_visit	Adobe Analytics	2 years	Cookie used by Adobe to know when a session has been started.

Name	Owner	Duration	Description
OT2	VersaTag	90 days	VersaTag Cookie used to store a user id and the number of user visits.
u2	VersaTag	90 days	VersaTag Cookie where the user ID is stored
TargetingInfo 2	MediaMind	1 year	Cookie that serves to assign a unique random number that generates MediaMind.

Name	Owner	Duration	Description
mbox	Adobe Target	9 days	Cookie used by Adobe Target to test user experience customization.

Data extraction tools for beginners and professionals

Web scraping tools

Extracting data with Python

It may interest you

Buy Now Pay Later B2B: What is it and how can it benefit your company?

BBVA and Vecttor (Cabify): an innovative arrangement for driver cash management

BBVA Best Bank in World for its Open Banking Offer, according to Global Finance