For many dynamic, transaction-heavy businesses such as banking and retail, there is a huge need to store gigabyte- and petabyte-scale databases. Moreover, the data-analysis needs of such organisations keep growing along with the business. This calls for planning, deploying and managing petascale databases.
These petascale databases are far too large to hold all their data in a single DB file, so the files are distributed, stored and managed using a petascale distributed file system. These technologies are already used by search engines and social networking sites to analyse people's interactions and each person's probable network, which is how those sites come up with friend suggestions and help us locate our old school friends. That is the power of the analytics these companies run today. Banks have started similar projects so that they do not miss an opportunity to serve high-net-worth clients because of an inability to parse their transaction data efficiently.
These petascale databases come with their own jargon and paradigms, and they are not very close to SQL. Here, avoiding data redundancy is valued less than data availability. In other words, the same data is stored across multiple nodes so that it can still be retrieved even if some nodes are down.
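This trade of redundancy for availability can be sketched in a few lines of Python. The following is a toy illustration, not an HDFS API; the node names and the replication factor of 3 are made-up assumptions:

```python
# Toy sketch of replicated block storage: each block is written to
# several nodes, so a read still succeeds when some nodes are down.
REPLICATION = 3

def write_block(block_id, data, nodes, storage):
    """Store `data` on the first REPLICATION nodes."""
    for node in nodes[:REPLICATION]:
        storage.setdefault(node, {})[block_id] = data

def read_block(block_id, live_nodes, storage):
    """Return the block from any live node that holds a replica."""
    for node in live_nodes:
        if block_id in storage.get(node, {}):
            return storage[node][block_id]
    raise IOError("all replicas unreachable")

storage = {}
nodes = ["node-a", "node-b", "node-c", "node-d"]
write_block("blk_0001", b"trade records", nodes, storage)

# Even with node-a down, the data is still readable from a replica.
print(read_block("blk_0001", ["node-b", "node-c"], storage))  # b'trade records'
```

Real HDFS does essentially this at block granularity, with a name node tracking which data nodes hold each replica.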
So the whole architecture looks roughly like this.
Here is an ideal topology of the systems at work, where each system interacts with each component for a specific purpose. In the diagram above we have multiple nodes connected to each other; these nodes are low-cost machines on which the data resides. To handle such scenarios we use Apache Hadoop. Hadoop is a software framework that supports data-intensive distributed applications under a free license, enabling applications to work with thousands of nodes and petabytes of data. Its key advantage is that it performs computation on the node where the data resides, which is a big improvement over the current architecture.
Most popular architectures today store data in a database and perform operations on application servers that reside on a different machine or in a different network. This means the data is first migrated from one box to another, and only then are the operations performed on it. With the architectural platform described above, we instead take the operations to the machine where the data resides. "Single instruction, multiple data" style operations can then be performed on that data, and once the data is analysed it is pushed to the results-fetching part.
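The "take the operation to the data" idea can be sketched as a toy map-reduce in plain Python: each node applies the same function to its own local partition, and only the small per-node results travel over the wire. This is a simplified illustration, not the Hadoop API; the node names and trade records are made up:

```python
from collections import Counter
from functools import reduce

# Each node holds its own partition of the data; only the aggregated
# per-node counts (not the raw records) leave the node.
node_partitions = {
    "node-1": ["buy", "sell", "buy"],
    "node-2": ["sell", "sell", "buy"],
}

def map_on_node(records):
    """Runs where the data lives: the same instruction on every partition."""
    return Counter(records)

def reduce_counts(a, b):
    """Merge the small per-node results into a global answer."""
    return a + b

per_node = [map_on_node(p) for p in node_partitions.values()]
total = reduce(reduce_counts, per_node)
print(total["buy"], total["sell"])  # 3 3
```

In real Hadoop the map step is scheduled onto the data nodes themselves, and the shuffle phase moves only the intermediate key-value pairs to the reducers.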
Explaining The Nodes:
1. How does data move into HDFS (the Hadoop Distributed File System)? Once we understand this, we can push more and more information in for analysis. Note that the data is stored in multiple locations across the network. HBase is the system responsible for pushing data into this distributed file system. For example, suppose the organisation has executed 12 million trades, of which 5 need to be re-punched because the system found they were either punched wrongly or not entered at all. Here HBase would be used to insert that data into the Hadoop file system.
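The correction workflow above can be sketched with HBase's row-key/column-family data model. The following is a toy in-memory illustration of that model, not the real HBase client API; the table layout, row key and column names are assumptions:

```python
# Toy sketch of the HBase data model: a table maps a row key to
# "family:qualifier" cells.  Re-punching a trade is simply a put
# against that trade's row key.
table = {}

def put(table, row_key, cells):
    """Insert or overwrite cells ('family:qualifier' -> value) for a row."""
    table.setdefault(row_key, {}).update(cells)

def get(table, row_key):
    """Return all cells stored for a row key."""
    return table.get(row_key, {})

# One of the 5 mis-punched trades, re-inserted with corrected fields.
put(table, "trade#0000042", {
    "details:symbol": "ACME",
    "details:qty": "100",
    "audit:status": "repunched",
})
print(get(table, "trade#0000042")["audit:status"])  # repunched
```

The real HBase stores these rows sorted by key in region files on HDFS, so the put ultimately lands on the distributed file system.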
2. The nodes shown as dark circles in the centre represent desktop farms on which the calculations are performed. These machines run some variant of Linux, or a virtualised Linux OS inside their host system.
3. Once the analysis is done on the data residing in Hadoop, the results can be fetched via a system known as Hive. Hive is a query system that sits over the various Hadoop nodes, retrieves information from them and returns it as a query result. However, its querying mechanism is not quite SQL.
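A Hive query looks SQL-like but is executed as map-reduce jobs over the nodes. The sketch below shows in plain Python what a query such as `SELECT symbol, SUM(qty) FROM trades GROUP BY symbol` would compute over rows pulled from the nodes; it is a toy illustration, and the table and column names are assumptions:

```python
from collections import defaultdict

# Rows as pulled back from the various Hadoop nodes.
trades = [
    {"symbol": "ACME", "qty": 100},
    {"symbol": "ACME", "qty": 50},
    {"symbol": "GLOBEX", "qty": 75},
]

# Equivalent of: SELECT symbol, SUM(qty) FROM trades GROUP BY symbol
totals = defaultdict(int)
for row in trades:
    totals[row["symbol"]] += row["qty"]

print(dict(totals))  # {'ACME': 150, 'GLOBEX': 75}
```

In Hive proper, the GROUP BY becomes the shuffle key of a map-reduce job, which is why the query language (HiveQL) behaves a little differently from conventional SQL.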
4. Once the data is retrievable from Hive, we can run the ETL step; Pentaho or any other tool of the organisation's choice can be used for this purpose. We can then push this map-reduced data from Hadoop into either a cube or a star schema, from which other tools can be used for analysis.
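The last step, loading reduced output into a star schema, can be sketched as splitting each record into a dimension lookup plus a fact row. This is a toy illustration of the idea, not Pentaho output; the schema and field names are made up:

```python
# Toy ETL step: load map-reduced rows into a star schema with one
# dimension table (symbols) and one fact table (trade totals).
reduced = [("ACME", 150), ("GLOBEX", 75)]

dim_symbol = {}   # symbol -> surrogate key
fact_trades = []  # (symbol_key, total_qty)

for symbol, total in reduced:
    # Reuse the surrogate key if the dimension row already exists.
    key = dim_symbol.setdefault(symbol, len(dim_symbol) + 1)
    fact_trades.append((key, total))

print(dim_symbol)   # {'ACME': 1, 'GLOBEX': 2}
print(fact_trades)  # [(1, 150), (2, 75)]
```

Once the facts and dimensions are in place, conventional OLAP and reporting tools can query them without ever touching the Hadoop cluster.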