We are going to publish a series of tutorials on the following topics, one by one:
- BigData
- Hadoop
- Hadoop Ecosystem
- Cloud
- Amazon Web Services
- Google Cloud Platform
- Microsoft Azure
- BigData with Cloud
- Spring Cloud – Cloud Foundry
- Spring Hadoop Module
- Spark With Hadoop
We will start with BigData basics, move on to Hadoop and then Cloud, and finally discuss how to use BigData solutions with Cloud platforms. Along the way we will look at the BigData and Cloud platform solutions available in the current market, such as Amazon Web Services (AWS), Google Cloud Platform, Microsoft Azure, IBM Bluemix, Pivotal Cloud Foundry, Yahoo Cloud Platform etc.
Finally, we will discuss how to develop applications using the Spring Cloud and Spring Hadoop modules. BigData and Cloud are both really big subjects, so it may take some time to cover all these concepts in detail with real-world examples. Please bear with us.
This first post in the series covers BigData basics.
Post’s Table Of Contents
- Introduction
- BigData Introduction
- What is BigData
- BigData Characteristics
- Why Data is Important
- Why Big Data is so Important
- BigData: Data Formats
- BigData Advantages
- BigData Solutions
- BigData Use Cases
BigData Introduction
We are now living in the Big Data era.
A few years ago, systems, organizations and applications worked almost entirely with structured data (that is, data in the form of rows and columns). It was easy to use relational databases (RDBMS) and older tools to store, manage, process and report on this data.
Recently, however, the nature of data has changed. Systems, organizations and applications now generate huge amounts of data in a variety of formats at a very fast rate.
That means data is no longer just simple structured data in rows and columns. Much of it is raw data without any fixed format. It is very difficult, and sometimes not possible, to store, manage, process, analyze and report on this kind of data using older technologies, traditional relational databases and tools.
So how do we solve this problem? This is where BigData solutions come into the picture.
Big Data solutions are designed to handle exactly these problems.
Let us start by understanding what BigData is and how important it is in our lives.
What is BigData
There is no single straightforward definition of BigData. However, we will try to answer this question in a few different ways.
In simple words, Big Data is a technique for solving data problems that cannot be solved using traditional databases and tools.
Put another way, BigData does not just mean a huge amount of data. It means a huge amount of data that is generated at a very fast rate in different formats.
Big Data is a technique to “store, process, manage, analyze and report on” a huge amount of varied data, at the required speed and within the required time, to allow real-time analysis and reaction.
BigData is data that has the following three characteristics:
- Extremely Large Volumes of Data
- Extremely High Velocity of Data
- Extremely Wide Variety of Data
BigData Characteristics
The following three are known as the “BigData Characteristics”.
- Volume:
Volume means “how much data is generated”. Nowadays, organizations, people and systems generate or receive vast amounts of data, from terabytes (TB) to petabytes (PB) to exabytes (EB) and more.
- Velocity:
Velocity means “how fast data is produced”. Nowadays, organizations, people and systems generate huge amounts of data at a very fast rate.
- Variety:
Variety means “different forms of data”. Nowadays, organizations, people and systems generate huge amounts of data at a very fast rate in different formats. We will discuss the different data formats in detail soon.
BigData refers to the 3V (VVV) paradigm:
The three “Vs” paradigm (Volume, Velocity, Variety) of Big Data was defined by Doug Laney in 2001.
If our organization’s data fits this 3Vs paradigm, we have a BigData problem, and we should use a BigData solution to solve it.
The 3Vs paradigm alone is not enough to get good value from our BigData. There is another V (the 4th V), which matters for every BigData problem.
4th V : Veracity
Veracity means “the quality, correctness or accuracy of the captured data”. Of the 4Vs, it is the most important one for any BigData solution, because without correct data there is no use in storing large amounts of data at a fast rate and in different formats. The data should deliver correct business value.
So this 4th V answers the following questions:
How accurate is that data in predicting business value?
Do the results of a big data analysis actually make sense?
BigData 4Vs In Simple Terminology:
V(Volume) : The Amount of Data
V(Variety) : The Number of Types of Data
V(Velocity) : The Speed of Data Generation and Processing
V(Veracity) : The Correctness of Data
Why Data is Important
We are living in the Data Era, or Information Era. Data is the most important factor for all organizations, for the following reasons or benefits:
- Data is useful in decision making
- It reveals customer preferences, so that organizations can improve their business
- It provides the right information for the business
- By analyzing data, we can optimize our systems
- More data, more analysis, more results, more profits
- Data is effective in improving business value
- Data analysis provides information about customer likes and dislikes
And more.
Why BigData is so Important
Nowadays, Big Data is very important for organizations and companies from medium to large size, because it enables them to gather, store, manage and manipulate “Extremely Large Amounts of Data, Extremely High Velocity of Data and Extremely Wide Variety of Data”:
- At the right speed
- At the right time
- To get the required Business Value
By following this Big Data 4Vs paradigm, organizations can get many benefits by answering “What, Who, When, Where, How” kinds of questions:
- What business decisions need to be made?
- What insight can we derive from the information?
- How accurate is that data in predicting business value?
- Who could benefit from the information that we are capturing?
- When do they need to know in order to make a more informed decision?
- How to improve our business value?
- How to improve our profits?
- Where do we have more Profits?
BigData: Data Formats
In the BigData 3V paradigm, one V refers to Variety, which means generating or receiving data in different formats.
In the Data Era, people, systems, devices and organizations generate or receive the following types of data formats.
- Structured Data
Structured Data means data that is in the form of rows and columns, so it is very easy to store in relational databases.
In simple words, anything that can be stored in the form of rows and columns is Structured Data.
For Example:- Relational database data (online subscriptions, transactional data etc).
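To make “rows and columns” concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table name, column names and values are hypothetical and made up purely for illustration; they do not come from any real system.

```python
import sqlite3

# In-memory database; the "subscriptions" table and its values are
# hypothetical, used only to show what rows and columns look like.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriptions ("
    "id INTEGER PRIMARY KEY, customer TEXT, plan TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO subscriptions (customer, plan, amount) VALUES (?, ?, ?)",
    [("alice", "monthly", 9.99), ("bob", "yearly", 99.0)],
)

# Every row has the same columns, so querying is straightforward.
rows = conn.execute(
    "SELECT customer, plan, amount FROM subscriptions ORDER BY id"
).fetchall()
for row in rows:
    print(row)
conn.close()
```

Because every record has exactly the same fields, a fixed schema like this is all that is needed to store and query it.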
- Semi-Structured Data
Semi-Structured Data means data that is formatted in some way, but not in the form of rows and columns. It is possible to store it in relational databases, but it is somewhat complex to manage and gives poor performance.
For Example:-
- Log Files
In log files, columns are separated by “whitespace” characters (characters used to align things horizontally or vertically, for instance a space, a tab or a newline).
Observe the following JBoss Server log file:
09:20:01,054 INFO [org.jboss.modules] (main) JBoss Modules version 1.3
09:20:01,652 INFO [org.jboss.as.process.Host Controller.status] (main) JBAS012017: Starting process 'Host Controller'
09:20:05,079 INFO [org.jboss.as.process.Server: myserver.status] (ProcessController-threads - 10) JBAS012017: Starting process 'Server: myserver'
17:01:58,833 INFO [org.jboss.as.process] (Shutdown thread) JBAS012016: Shutting down process controller
17:02:03,408 INFO [org.jboss.as.process.Host Controller.status] (Shutdown thread) JBAS012018: Stopping process 'Host Controller'
17:02:15,246 INFO [org.jboss.as.process.Server: myserver.status] (ProcessController-threads - 9) JBAS012018: Stopping process 'Server: myserver'
17:03:02,990 INFO [org.jboss.as.process.Server:myserver.status] (reaper for Server: myserver) JBAS012010: Process 'Server: myserver' finished with an exit status of 0
17:03:13,170 INFO [org.jboss.as.process.Host Controller.status] (reaper for Host Controller) JBAS012010: Process 'Host Controller' finished with an exit status of 0
17:03:13,195 INFO [org.jboss.as.process] (Shutdown thread) JBAS012015: All processes finished; exiting
If we observe the log file above, the first column (the timestamp) is separated from the second column (the logging level) by whitespace. It is semi-formatted text, not fully formatted text.
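As a rough sketch of what working with such semi-structured text involves, the Python snippet below pulls the columns out of one of the JBoss log lines shown above. The regular expression and the field names (time, level, category, thread, message) are our own assumptions about the layout, not an official JBoss log format definition. Note that splitting on whitespace alone would not be enough here, because the bracketed category itself contains spaces.

```python
import re

# Assumed layout: time, level, [category], (thread), free-text message.
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<category>[^\]]+)\]\s+"
    r"\((?P<thread>[^)]*)\)\s+"
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None

sample = (
    "09:20:01,652 INFO [org.jboss.as.process.Host Controller.status] "
    "(main) JBAS012017: Starting process 'Host Controller'"
)
record = parse_log_line(sample)
print(record["time"], record["level"], record["category"])
print(record["message"])
```

Lines that do not follow the assumed layout simply return None, which is typical of semi-structured data: the structure is there, but we have to recover it ourselves.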
- XML Documents
XML documents are also semi-formatted: XML start and end tags mark up the data, but it is not arranged in rows and columns.
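As an illustration, the sketch below parses a small XML document with Python’s standard xml.etree.ElementTree module. The document itself (the employees, name and dept tags and their values) is hypothetical and invented for this example. It shows why XML counts as semi-structured: the tags give the data a shape, but one record can leave out a field that another record has.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, made up for illustration only.
xml_text = """
<employees>
  <employee id="1">
    <name>Alice</name>
    <dept>Sales</dept>
  </employee>
  <employee id="2">
    <name>Bob</name>
  </employee>
</employees>
""".strip()

root = ET.fromstring(xml_text)
records = [
    {
        "id": emp.get("id"),
        "name": emp.findtext("name"),
        "dept": emp.findtext("dept"),  # None when the <dept> tag is absent
    }
    for emp in root.findall("employee")
]
for rec in records:
    print(rec)
```

The second employee has no dept tag, so that field comes back as None instead of a value, something a fixed table of rows and columns would have to handle explicitly.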
- Un-Structured Data
Un-Structured Data means data that is not formatted in any way. It does not fit naturally into the rows and columns of a relational database.
For Example:- Audio files, videos, text typed by call centre executives, photos, sensor data, web data, mobile data, GPS data, social media data etc are Un-Structured Data.
If we open an image file (for instance, a JPEG file) in a text editor, we see only binary data, which has no readable format at all.
Nowadays, people, machines, devices, organizations and the Internet generate Multi-Structured Data, meaning a combination of Structured, Semi-Structured and Un-Structured Data. It is very difficult to store and manage this kind of data using traditional technologies, databases and tools.
This is where Big Data solutions solve the problem in an efficient and cost-effective way.
BigData Advantages
If we use BigData solutions to store, manage, process and report our Data, we will get the following benefits:
- Store Data of all types and sizes at low cost
- Efficiently store, process and manage our data
- Provides a cost-effective way to manage our data
- Provides Better Performance Solutions
- Provides Highly Scalable Solutions
- Produces Right Business Value
- Increase Productivity
- Increase Profits
BigData Solutions
The following is a list of the most popular BigData solutions available in the market.
- Apache Hadoop BigData Solution
- Amazon Web Services (AWS) BigData Solutions
- Google Cloud BigData Solutions
- Microsoft BigData Solutions
- Cloudera BigData Solutions
- IBM BigData Solutions
- Oracle BigData Solutions
BigData Use Cases
Most organizations are using or moving to BigData, so it is not possible to list all of them here.
We will mention only some popular organizations that are using and benefiting from Big Data solutions.
- Facebook
Facebook is one of the most popular social networking websites. Worldwide, around 1000 million users use the Facebook application. It collects around 500 TB (terabytes) of data per day from user subscriptions, likes, posts, relationship information, audio, videos, pictures etc.
- Google
Google also uses its BigData Cloud Platform to manage data for its applications such as Gmail, Google+, Google Search Engine, YouTube etc.
- Aadhaar India
In India, UIDAI (Unique Identification Authority of India) manages all Aadhaar card information. It also uses BigData solutions to manage that huge amount of data.
- RedBus
RedBus is India’s largest online bus ticket and hotel booking organization. It also uses BigData solutions to manage its huge amount of data at a very high traffic rate.
- eBay and Amazon
Two world-famous online shopping giants, eBay and Amazon, also use BigData solutions to manage their customer data, product information etc.
- Airline Industry
A lot of airlines (for example, British Airways, Singapore Airlines etc.) today use BigData solutions to store and manage their aircraft and customer information.
- Yahoo
Yahoo also uses its BigData Cloud Platform solutions to manage data for its applications such as Yahoo Mail, Yahoo Search Engine, Flickr etc.
- Safari Books Online
Safari Books Online is an online subscription service that gives individuals and organizations access to its online books, tutorials and videos.
- New York Stock Exchange
The New York Stock Exchange is one of the most famous stock exchanges in the world. It generates about 5 TB (terabytes) of data per day.
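To get a feel for these volumes, the short Python sketch below turns the daily figures quoted in this section (about 5 TB per day for the New York Stock Exchange and around 500 TB per day for Facebook) into yearly totals. It assumes decimal units (1 PB = 1000 TB) and a constant daily rate over 365 days; both are simplifying assumptions of ours, and the function name is hypothetical.

```python
# Assumption: decimal (SI) units, so 1 PB = 1000 TB.
TB_PER_PB = 1000

def yearly_volume_pb(tb_per_day, days=365):
    """Convert a constant daily volume in TB into a yearly volume in PB."""
    return tb_per_day * days / TB_PER_PB

nyse_pb = yearly_volume_pb(5)        # daily figure quoted for NYSE
facebook_pb = yearly_volume_pb(500)  # daily figure quoted for Facebook

print(f"NYSE: about {nyse_pb:.1f} PB per year")
print(f"Facebook: about {facebook_pb:.1f} PB per year")
```

Even the smaller of the two figures adds up to more than a petabyte per year, which is the scale the Volume characteristic refers to.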
That’s all about the BigData introduction. We will discuss some more BigData concepts and Hadoop basics in my coming posts.
Please drop me a comment if you like my post or have any issues/suggestions.