Data engineering is the set of operations that creates the interfaces and mechanisms for the flow of and access to information. In their book Fundamentals of Data Engineering, Joe Reis and Matt Housley define data engineering as the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases such as analysis and machine learning.
It is a highly specialised field in which data engineers are responsible for maintaining data, whether it arrives in real time or in batches, while ensuring that it remains available and usable by others.
Because of the sheer volume of data in today’s world, where organisations process data from many different sources, we arrive at the concept of big data, a term used to refer to extremely large datasets.
Consider an organisation like a bank whose data is stored in different sources such as Oracle (for core banking), MySQL (for CRM), internet data (JSON), XML files, and so on. The process of pulling all this data out of its different sources into a single location so it can be analysed as a whole is known as Extraction, and the destination where the data is consolidated is known as a DATA WAREHOUSE (e.g., Teradata).
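As a minimal sketch of the extraction step, the example below pulls rows from one source system over JDBC and dumps them into a staging file that could then be loaded into the warehouse. The connection string, credentials, table, and output path are hypothetical placeholders, and a real pipeline would normally use a dedicated ETL tool rather than hand-written code.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SimpleExtractor {
    public static void main(String[] args) throws Exception {
        // Hypothetical source: the bank's CRM database on MySQL
        // (assumes the MySQL JDBC driver is on the classpath).
        String url = "jdbc:mysql://crm-db.example.com:3306/crm";

        try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT customer_id, name, email FROM customers");
             PrintWriter out = new PrintWriter("/staging/customers.csv")) {

            // Dump each row as CSV into a staging area for the warehouse load.
            while (rs.next()) {
                out.println(rs.getLong("customer_id") + ","
                        + rs.getString("name") + ","
                        + rs.getString("email"));
            }
        }
    }
}
```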
Some of the challenges of data warehousing include:
· It is very expensive.
· It is not real-time; extraction and loading are typically run as batch jobs, preferably during non-working hours.
· There is a limit to the amount of data that can be extracted at once.
Because of how expensive data warehouses can be, many organisations have moved their data infrastructure to other platforms such as Hadoop.
MapReduce is the default programming framework on top of Hadoop and is used to process and analyse the data stored there.
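To illustrate the programming model, the classic word-count job below uses the standard Hadoop MapReduce Java API: the mapper emits (word, 1) pairs and the reducer sums them. The input and output paths are passed in as HDFS directories when the job is submitted.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```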
Hadoop is a batch-processing system, so it can be slow for interactive workloads. What organisations commonly do is keep readily in-demand data in a real-time data warehouse like Teradata and push less frequently used data into Hadoop.
The Hadoop framework also comprises what we call the Hadoop cluster, where there is a master system and a number of slave systems used to store data. These slave systems can be added on the fly as and when needed to create additional storage, a concept known as SCALING OUT. The storage master is called the NameNode, while the storage slaves are known as DataNodes.
The Hadoop architecture comprises three major components:
· HDFS – Hadoop Distributed File System (handles storage)
· YARN – Yet Another Resource Negotiator (handles resource management)
· MapReduce – (handles processing)
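As a small illustration of the HDFS storage layer described above, the sketch below uses the Hadoop FileSystem Java API to write a file into the cluster. The NameNode address and file path are hypothetical; the client asks the NameNode for metadata, while the file's blocks are actually stored and replicated on the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; block storage is handled by the DataNodes.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/landing/hello.txt"))) {
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```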
Hadoop Components:
Hive – invented by Facebook to allow users to interact with MapReduce through a SQL-like language (HiveQL) while the underlying execution is still Java-based. It is one of the most widely used components in big data and Hadoop. It is a query engine, not a database.
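As a sketch of how Hive is typically queried from application code, the example below runs a HiveQL aggregation through the Hive JDBC driver against a HiveServer2 endpoint. The host, credentials, and the accounts table are hypothetical, and the hive-jdbc driver would need to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint and default database.
        String url = "jdbc:hive2://hive-server.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL, but Hive translates it into jobs
             // that run over data stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT account_type, COUNT(*) AS num_accounts "
                   + "FROM accounts GROUP BY account_type")) {

            while (rs.next()) {
                System.out.println(rs.getString("account_type") + ": " + rs.getLong("num_accounts"));
            }
        }
    }
}
```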
Pig – invented by Yahoo. Its scripting language is known as Pig Latin, and scripts are compiled into MapReduce jobs under the hood.
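A minimal Pig Latin sketch, driven here from Java through Pig's PigServer API: the classic word count. The input path, output directory, and aliases are hypothetical, and the same script could equally be typed into the Grunt shell.

```java
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // "local" mode runs the script on the local machine; on a cluster
        // this would typically be "mapreduce". Paths below are hypothetical.
        PigServer pig = new PigServer("local");

        pig.registerQuery("lines  = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grpd   = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;");

        // Writes the results out; Pig compiles the script into MapReduce jobs.
        pig.store("counts", "wordcount-output");
    }
}
```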
Sqoop – provides direct access to relational databases (RDBMS), importing data from them into Hadoop and exporting data from Hadoop back into an RDBMS. It is commonly used in data pipelines.
Oozie – a workflow scheduler invented by Yahoo. It can be used to schedule and chain Hadoop jobs.
HBase – the Hadoop database: a distributed, column-oriented store modelled on Google's Bigtable and famously used by Facebook for its Messages platform.
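A minimal sketch of HBase's row-key access pattern using the standard HBase Java client. The ZooKeeper quorum and the customers table (with a profile column family) are hypothetical, and the table would need to exist before this runs.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum used by the HBase cluster.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("cust-001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("cust-001")));
            byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```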
The following platforms also provide Hadoop distributions and managed big-data services for users:
· Cloudera
· Hortonworks – HDP sandbox
(Cloudera and Hortonworks have now merged)
· Amazon – EMR
· Microsoft – HDInsight
· IBM – BigInsights
· Google – Dataproc
· Databricks – Databricks sandbox