PetaData and Mega Scale Data

For many kinds of dynamic, transactional businesses such as banking and retail, there is a huge need to store giga- and petabyte-scale databases. Moreover, the data analysis needs of such organizations keep increasing as the business grows. This calls for planning, deploying, and managing petascale databases.

These petascale databases are far too large to hold all their data inside a single DB file, so the files are distributed, stored, and managed using a petascale distributed file system. These technologies are currently used by search engines and social networking sites to analyse people's interactions and a person's probable network, which is how these sites pop up with suggestions and help us locate our old school friends. That is the power of the analytics these companies run today. Banks have started similar projects so that they do not miss an opportunity to address high-net-worth clients due to an inability to parse their transaction data efficiently.

These petascale databases come with their own jargon and paradigms, and they are not very close to SQL. Here, avoiding data redundancy is valued less than data availability: in other words, the same data is stored across multiple nodes so that it can still be retrieved even if some nodes are down.
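To make the redundancy idea concrete, here is a minimal pure-Python sketch (not real HDFS code; the node names and the round-robin placement rule are invented for illustration) of keeping several copies of a block so a read still succeeds when one node is down:

```python
# Sketch: replicate each block across several nodes for availability.
REPLICATION_FACTOR = 3

def place_block(block_id, nodes, rf=REPLICATION_FACTOR):
    """Pick rf nodes (round-robin on the block id) to hold a copy."""
    return [nodes[(block_id + i) % len(nodes)] for i in range(rf)]

def read_block(placement, down_nodes):
    """Return the first live replica, or None if every copy is lost."""
    for node in placement:
        if node not in down_nodes:
            return node
    return None

nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_block(7, nodes)                  # 3 copies of block 7
survivor = read_block(placement, down_nodes={placement[0]})
print(survivor)  # a live replica is still found
```

Losing a node costs nothing but a retry against the next replica, which is exactly why redundancy is embraced rather than avoided here.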

So the whole architecture looks roughly like this.



Here is an ideal topology of systems at work, where each component interacts with the others for a specific purpose. In the diagram above we have multiple nodes connected to each other. These nodes are low-cost machines on which the data resides. To handle such scenarios we use Hadoop, a technology from Apache. Hadoop is a software framework that supports data-intensive distributed applications under a free license; it enables applications to work with thousands of nodes and petabytes of data. The advantage of Hadoop is that it performs calculations and operations on the node where the data resides, which is a big advantage over current architectures.

Most popular architectures in use today store data in a DB and perform operations on application servers that reside either on a different machine or in a different network. This means the data is first migrated from one box to another, and only then are the operations performed on it. With the architectural platform described above, we instead take the operations to the machine where the data resides. "Single instruction, multiple data" style operations can be performed on this data. Once the data is analysed, it is pushed to the results-fetching stage.
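The data-locality idea above can be sketched in a few lines of Python. The node names and shard contents are invented: each "node" sums only its own local shard (the map step), and only the small partial results cross the network to be combined (the reduce step):

```python
# Sketch: move the computation to the data, not the data to the computation.
shards = {
    "node-a": [120, 310, 95],
    "node-b": [400, 15],
    "node-c": [250, 60, 75],
}

def map_on_node(local_data):
    # Runs where the data lives: each node sums only its own shard.
    return sum(local_data)

partials = {node: map_on_node(data) for node, data in shards.items()}

# Only the tiny partial results travel over the network for the reduce step.
total = sum(partials.values())
print(total)  # 1325
```

Compare this with shipping all nine raw values to a central application server: the network carries three numbers instead of the whole dataset.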

Explaining The Nodes:

1. How does the data move inside HDFS [the Hadoop Distributed File System]? Once we understand this, we can push more and more information in for analysis. Note: the data is stored in multiple locations in the network. Hadoop's HBase is the system responsible for pushing data inside this distributed file system. For example, suppose the organization has done 12 million trades, of which 5 need to be re-punched because the system found they were either punched wrongly or never entered. Here, HBase would be used to insert that data into the Hadoop file system.
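An HBase-style insert addresses a row by key and stores values under column families. Here is a hedged pure-Python imitation of that data model (the table, family, and column names are made up; real HBase code would go through its client API) applied to the re-punched-trades example:

```python
# Sketch of an HBase-style keyed insert: row key -> column family -> column.
table = {}

def put(row_key, family, column, value):
    """Imitate an HBase Put: write one cell addressed by key/family/column."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

# Re-punch the 5 trades flagged as wrongly entered or missing.
flagged_trades = ["T-001", "T-002", "T-003", "T-004", "T-005"]
for trade_id in flagged_trades:
    put(trade_id, "trade", "status", "repunched")

print(len(table))  # 5 rows written
```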

2. The nodes shown as dark circles in the centre represent desktop farms on which the calculations are performed. These run either some variant of Linux or a virtualized Linux OS inside them.

3. Once the analysis on the data residing in Hadoop is done, the results can be fetched via a system known as Hive. Hive is a query system that retrieves information from the various Hadoop nodes and pulls it out as a query result. However, the querying mechanism is not quite like SQL.
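What a Hive-style query does, in effect, is fan the work out to the nodes holding the data and then merge the per-node results. A hedged sketch (node names, the table layout, and the commented query are invented for illustration):

```python
# Sketch: fan a GROUP BY out to the data nodes, then merge partial results.
# Roughly: SELECT product_type, SUM(amount) FROM sales GROUP BY product_type
node_rows = {
    "node-a": [("FMCG", 120), ("menswear", 80)],
    "node-b": [("FMCG", 60)],
    "node-c": [("menswear", 40), ("FMCG", 20)],
}

def sum_by_key(rows):
    """Per-node partial aggregation."""
    out = {}
    for key, value in rows:
        out[key] = out.get(key, 0) + value
    return out

merged = {}
for rows in node_rows.values():           # merge the partials into one result
    for key, value in sum_by_key(rows).items():
        merged[key] = merged.get(key, 0) + value

print(merged)  # {'FMCG': 200, 'menswear': 120}
```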

4. Once the data is retrievable from Hive, we can run the ETL; Pentaho or any other tool of the organisation's choice can be used for this purpose. We can then push this data from the Hadoop map-reduce output into either a cube or a star schema, from where other tools can be used for analysis.
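The last hop of step 4 can be sketched as loading the reduced (already aggregated) output into a tiny star schema: one dimension table of product types with surrogate keys, and one fact table keyed on them. The figures and names are invented:

```python
# Sketch: load reduced map-reduce output into a minimal star schema.
reduced_output = {"FMCG": 200, "menswear": 120}   # from the analysis step

dim_product = {}   # product_type -> surrogate key
fact_sales = []    # (product_key, amount)

for product_type, amount in sorted(reduced_output.items()):
    key = dim_product.setdefault(product_type, len(dim_product) + 1)
    fact_sales.append((key, amount))

print(dim_product)  # {'FMCG': 1, 'menswear': 2}
print(fact_sales)   # [(1, 200), (2, 120)]
```

From here, cube or reporting tools only ever touch the dimension and fact tables, never the raw Hadoop data.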

Business Intelligence

Business intelligence is a practice available at VibrantWorx, where we manage cost and deliver scalable, advanced solutions for organizations interested in adopting versatile, open-architecture-based solutions.

Here is an excerpt from Gartner's report (Magic Quadrant for Business Intelligence Platforms) that openly endorses the Pentaho platform as a very comprehensive and feature-rich solution, especially considering its open-source foundation.



Gartner’s Review of products in BI space (Pentaho’s Competitors)

“Pentaho, after just four years in existence, has put together a comprehensive open-source BI platform that includes data integration and data mining capabilities. In 2008, Pentaho was noticeably more aggressive, openly competing against traditional BI platform vendors. Like Jaspersoft, Pentaho is affordable and also offers a subscription-based model that avoids an initial large payment for the software license. Some of the significant features Pentaho introduced in 2008 include an automatic table designer that analyses relational schemas and data patterns, performs a cost-benefit analysis of aggregation at different levels, and generates and populates those aggregate tables. Despite a handful of large customers, Pentaho reference survey respondents more frequently indicated that they had more departmental deployments (versus enterprise wide) and smaller data volumes compared with the other vendors.”

Pentaho is also being adopted and implemented by various firms; the Aberdeen Group, for example, has selected Pentaho for its business intelligence purposes, which is a big win for Pentaho. See the report on how Aberdeen Group has integrated Pentaho into its product portfolio.

Pentaho is very suitable as a platform because it is platform-independent, which means it can run on any hardware that has a JRE installed. On top of platform independence, Pentaho also gives you the freedom to decouple from changes in the underlying database. For example, your product may change in the future, and you want the BI to be independent of the underlying database.




It is generally a great worry for the CTO of an organization how the data warehouse will be impacted when the underlying schema changes in subsequent releases. With the metadata architecture on which the design is built, the CTO can rest easy: there are two non-physical layers on which the business intelligence component is built.
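The metadata idea can be sketched as a mapping layer: reports address logical field names, and the layer translates them to physical tables and columns, so a schema rename in a later release only touches the mapping, never the reports. All table and column names below are invented for illustration:

```python
# Sketch: a logical-to-physical metadata layer that insulates reports
# from schema changes in the underlying database.
metadata = {
    "customer_name": ("clients", "cust_nm"),   # logical -> (table, column)
    "trade_amount":  ("trades", "amt"),
}

def physical(logical_field):
    """Resolve a logical report field to its physical table.column."""
    table, column = metadata[logical_field]
    return f"{table}.{column}"

# A report asks for logical fields only:
select_list = [physical(f) for f in ("customer_name", "trade_amount")]
print(select_list)  # ['clients.cust_nm', 'trades.amt']

# Schema change in a new release: rename the column, update only the map.
metadata["trade_amount"] = ("trades", "trade_amt")
print(physical("trade_amount"))  # 'trades.trade_amt'
```

The report definition never changed; only the one mapping entry did.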

What are the steps for undertaking a Business Intelligence project, and how does the Pentaho BI suite of products help in achieving this task?

Each Business Intelligence project has an ETL process that pushes data from the existing database to another database using scripts, either in real time [embedded] or as an end-of-day (EOD) process. The EOD process pushes the existing real-time data into a staging area; from there the data is filtered and loaded into a star schema, where the facts are calculated and stored for a particular set of dimensions.
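An EOD batch of that shape can be sketched in three steps: copy the day's rows to staging, filter out bad records, then calculate one fact per dimension value. The rows and field names are invented for illustration:

```python
# Sketch of an EOD batch: real-time rows -> staging -> filter -> facts.
realtime_rows = [
    {"product": "FMCG", "amount": 120, "valid": True},
    {"product": "menswear", "amount": 80, "valid": True},
    {"product": "FMCG", "amount": -1, "valid": False},  # bad punch, dropped
]

staging = [dict(r) for r in realtime_rows]          # EOD copy into staging
clean = [r for r in staging if r["valid"]]          # filter step

facts = {}                                          # one fact per dimension value
for row in clean:
    facts[row["product"]] = facts.get(row["product"], 0) + row["amount"]

print(facts)  # {'FMCG': 120, 'menswear': 80}
```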

So how does the Pentaho product stack up for ETL? Pentaho has Spoon, an editor in which the user can create extraction and transformation rules in a standalone application. This editor generates the desired result and transforms the data from various data sources into the staging DB; it can extract from various sources, namely flat files, Excel files, and various databases, and perform actions over these. This transformation is performed by the Pentaho Data Integration layer.

Once the data is secured, it is mined along the various dimensions of the management view; this step contains the analysis logic, which then populates the fact tables of the star schema.



In the example above, the fact table stores the analysis numbers for each combination of dimensions. For example, if we want to see the numbers from the products perspective, the join is on products and the numbers are stored in the fact table. We then have predefined dimensions on product types; say, for example, we have $X revenue from FMCG and $Y from men's wear. The fact sales numbers are readily calculated because these products are preconfigured. The other dimensions work in a similar way.
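Reading the star schema back is just a join from the fact table to the dimension table. A hedged sketch with invented keys and amounts:

```python
# Sketch: join fact rows to the product dimension to get revenue per type.
dim_product = {1: "FMCG", 2: "menswear"}      # surrogate key -> product type
fact_sales = [(1, 200), (2, 120), (1, 50)]    # (product_key, amount)

revenue = {}
for product_key, amount in fact_sales:
    product_type = dim_product[product_key]   # the join on the product key
    revenue[product_type] = revenue.get(product_type, 0) + amount

print(revenue)  # {'FMCG': 250, 'menswear': 120}
```

Swapping `dim_product` for a region or time dimension gives the same query from a different perspective, which is the point of the star layout.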

Now we have the numbers with us. To present these numbers in a dashboard, we need many features on those dashboards, and Pentaho provides lots of goodies for dashboard preparation. In the dashboard, the user selects and aggregates based on these dimensions and sees how the facts change for a given combination of dimensions.

Pentaho's dashboard is rendered using GWT/JPivot/Google Maps/servlets, and these reports are secured using the administration module. This module is responsible for rendering, slicing/dicing, and built-in drill-down. Companies are free to enhance this functionality and add features to the existing utilities that come along with Pentaho.
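Slicing and drill-down, stripped of the rendering layer, amount to filtering the fact rows on one dimension and then re-aggregating along another. A hedged sketch with invented dimensions and figures:

```python
# Sketch: slice (fix one dimension) and drill down (break down by another).
facts = [
    {"region": "north", "product": "FMCG", "amount": 200},
    {"region": "north", "product": "menswear", "amount": 120},
    {"region": "south", "product": "FMCG", "amount": 90},
]

def slice_by(rows, **fixed):
    """Keep only rows matching the fixed dimension values."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]

def drill_down(rows, dimension):
    """Re-aggregate the amounts along one dimension."""
    out = {}
    for r in rows:
        out[r[dimension]] = out.get(r[dimension], 0) + r["amount"]
    return out

north = slice_by(facts, region="north")       # slice: region = north
print(drill_down(north, "product"))           # {'FMCG': 200, 'menswear': 120}
```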



Management can also be shown a Google Map representing the sales figures each unit is bringing in. This helps top management focus and drill down into each location.