Saturday 5 January 2013

Visualization Tool for HDFS blocks/chunks

From the time I started working on Hadoop, I have always felt the need for a tool to visualize the blocks of an HDFS file or directory. Unfortunately, I could not find any such tool in the open source Hadoop distributions. Here is a simple online tool that visualizes the output of the hadoop fsck command in graphical form. To use the tool, just run the following command and capture the output in a text file. The tool uses this text file to visualize the nodes, the blocks and the total size on each slave node.

The following command collects the block details for the HDFS folder '/in' and creates the file fsck.txt. Please check my previous blog for details.

hadoop fsck /in -files -blocks -locations -racks > fsck.txt

Click 'Choose File', select the fsck output file and voila! You see the HDFS chunks in graphical form without installing any software.

The following snapshot shows the sample output:


Just get started with the tool:


Help: Click 'Choose File' and select an fsck output file stored on your system to visualize the HDFS blocks or chunks across your Hadoop cluster. You can use the 'Sample Data' button to see the output for a sample file, or press 'Show Sample Data' to see the contents of the sample file.



Note that I have done only limited testing of the tool, mostly with Hadoop 1.x.x and Chrome.

Please share your comments to enhance the tool, and let me know if you see any issues. Also, as all the processing happens in your browser, try to use it with a single HDFS file or a directory with a limited number of blocks/files.

How to find the blocks/chunks of a file in HDFS?

We know that HDFS stores a file by splitting it into multiple blocks or chunks based on the configured block size property. The blocks of a file are replicated for fault tolerance based on the replication factor configuration.
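For reference, a minimal hdfs-site.xml sketch with the two properties involved (the values below are only illustrative; the property names are the ones used in Hadoop 1.x):

 <!-- hdfs-site.xml (Hadoop 1.x property names; values are illustrative) -->
 <property>
   <name>dfs.block.size</name>
   <value>67108864</value> <!-- 64 MB, the 1.x default block size -->
 </property>
 <property>
   <name>dfs.replication</name>
   <value>3</value> <!-- default replication factor -->
 </property>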

But how can we find the locations of these blocks and the nodes on which they are replicated? This is useful during maintenance activities, when we need to shut down a few data nodes, and also for data locality, to know the exact location of the data in the cluster.

The following hadoop fsck command can be used to find the blocks of a file and their locations:

 $ hadoop fsck /test -files -blocks -locations -racks  
 FSCK started by hadoop from /10.58.127.50 for path /test at Sun Jan 06 01:04:00 IST 2013  
 /test <dir>  
 /test/README.txt 1366 bytes, 1 block(s): OK  
 0. blk_606611195878801492_2473688 len=1366 repl=3 [/default-rack/10.58.127.59:50010, /default-rack/10.58.127.57:50010, /default-rack/10.58.127.51:50010]  
 Status: HEALTHY  
  Total size: 1366 B  
  Total dirs: 1  
  Total files: 1  
  Total blocks (validated): 1 (avg. block size 1366 B)  
  Minimally replicated blocks: 1 (100.0 %)  
  Over-replicated blocks: 0 (0.0 %)  
  Under-replicated blocks: 0 (0.0 %)  
  Mis-replicated blocks: 0 (0.0 %)  
  Default replication factor: 3  
  Average block replication: 3.0  
  Corrupt blocks: 0  
  Missing replicas: 0 (0.0 %)  
  Number of data-nodes: 9  
  Number of racks: 1  
 FSCK ended at Sun Jan 06 01:04:00 IST 2013 in 0 milliseconds  
 The filesystem under path '/test' is HEALTHY  

The above output shows that in the /test folder the file README.txt has only one block with replication factor 3, and it also shows the IP addresses of the slave nodes that hold the replicas.

Sunday 30 December 2012

Installing and configuring Ganglia on RHEL

Recently I installed and configured Ganglia to monitor a Hadoop cluster; here are the details to install and set up the Ganglia monitoring server. As the Ganglia RHEL RPMs have a lot of dependencies, it is easier to install them using yum, which needs internet access on the server.

The Ganglia monitoring system has the following components:

  • gmond: The Ganglia monitoring daemon that needs to be installed on all the monitored nodes
  • gmetad: The Ganglia meta daemon that runs on the master node and collects the data from all the monitoring daemons (gmond). The gmetad component uses a Round Robin Database (rrdtool) to store the data.
  • Web front-end server: The web front end runs on the httpd server, is written in PHP, and shows the latest stats and trends in graph form for the various measured parameters.

The following snapshot shows the high-level interaction of the above components:


Installation on RHEL

To install the Ganglia gmond on the monitored nodes, download the following RPMs from here and install them using the following commands.

 rpm -ivh libconfuse-2.7-4.el6.x86_64.rpm  
 rpm -ivh libganglia-3.4.0-1.x86_64.rpm  
 rpm -ivh ganglia-gmond-3.4.0-1.x86_64.rpm  
 service gmond start  

Installing gmetad on the master node using yum needs internet access on the server; if you are behind a proxy/firewall, please run the following commands first.
 export https_proxy=<proxy server>:<port>  
 export http_proxy=<proxy server>:<port>  
 export proxy=<proxy server>:<port>  

Also install the latest EPEL repository on the RHEL server before running the following command to install the Ganglia meta server and web server.
 yum install ganglia ganglia-gmetad ganglia-web  

The above might fail due to some RPM dependencies; these dependent components (e.g. rrdtool, dejavu-fonts) need to be downloaded manually from the internet and installed.

Configuration of Ganglia

After installing the components, configure gmond and gmetad using their configuration files, /etc/ganglia/gmond.conf and /etc/ganglia/gmetad.conf respectively. The default configuration uses multicast UDP messages and can be used without changes if your network supports multicast.
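As a rough sketch of the typical changes (the cluster name "hadoop-cluster" and the host "master-node" below are only illustrative; 8649 is the default gmond port):

 # /etc/ganglia/gmond.conf, on every monitored node: usually only the
 # cluster name needs to change; the default multicast channel can stay
 cluster {
   name = "hadoop-cluster"
   owner = "unspecified"
   latlong = "unspecified"
   url = "unspecified"
 }

 # /etc/ganglia/gmetad.conf, on the master node: point the data source
 # at one or more gmond nodes (port 8649 is the gmond default)
 data_source "hadoop-cluster" master-node:8649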

After the configuration, restart the services using the following commands:

 service gmond restart  
 service gmetad restart  
 service httpd restart  

and open the URL http://<IP of httpd server>/ganglia in your browser to start monitoring your servers :).

You might run into the following issues:

  • The httpd server may fail to start because of the mod_python component; install mod_python on the httpd server if it is not already installed
  • URL access issues due to the permission settings in /etc/httpd/conf.d/ganglia.conf; set the proper configuration, or use "Allow from all" during the testing phase.
In the next blog I will describe how to monitor the Hadoop components' statistics using the Ganglia server.

Sunday 9 September 2012

Liferay Multiple Instance Portlets - How to display different contents

Liferay Portal server provides a mechanism to create multiple instances of a portlet. It is a common requirement to display different content/data in each instance of the portlet.

For example, the following snapshot shows a portlet with three instances, each displaying different data. The first displays the used memory of the JVM, the next shows the free memory, and the last shows the current sales.


As the multiple instances of the portlet are similar but display different data, it makes sense to develop only a single portlet and create multiple instances of it. This blog explains one way to achieve this.

First create a new "Liferay Project" using the eclipse wizard. Select "Allow Multiple Instances" to enable multiple instances of the portlet, and also select "Edit Mode" under the Portlet Modes option, as we want to configure each portlet instance from its preferences. This allows the administrator to configure the portlet instances and specify the content to display in each instance; users without the "configuration" privilege cannot change the settings and only view the configured instances.
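If you want to verify what the wizard generates, the relevant settings boil down to roughly the following (the portlet name below is illustrative): the portlet is marked instanceable in liferay-portlet.xml, and the edit mode is declared in portlet.xml.

 <!-- liferay-portlet.xml -->
 <portlet>
      <portlet-name>MultiInstancePortlet</portlet-name>
      <instanceable>true</instanceable>
 </portlet>

 <!-- portlet.xml -->
 <supports>
      <mime-type>text/html</mime-type>
      <portlet-mode>view</portlet-mode>
      <portlet-mode>edit</portlet-mode>
 </supports>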

The Preferences option allows the administrator to specify the content to view in the instance. Choose the Preferences option and..


.. the following form is displayed, where you provide the content to show in this instance. Provide the name as follows and save it.


PortletPreferences is used in the edit.jsp page to store the preferences of this instance of the portlet. Please see the following edit.jsp source code for details.

   
 <%@ taglib uri="http://java.sun.com/portlet_2_0" prefix="portlet"%>  
 <%@ taglib uri="http://liferay.com/tld/aui" prefix="aui"%>  
 <%@ page import="javax.portlet.PortletPreferences"%>  
   
 <portlet:defineObjects />  
   
 Please provide the name of the portlet to view in  
 <b>MultiInstancePortlet</b>  
 .  
 <%  
      PortletPreferences prefs = renderRequest.getPreferences();  
      String content = renderRequest  
                .getParameter("<portlet:namespace/>content");  
      if (content != null) {  
           prefs.setValue("content", content);  
           prefs.store();  
 %>  
 <p>Saved Successfully!!</p>  
 <%  
      } else {  
           content = (String) prefs.getValue("content", "");  
      }  
 %>  
 <portlet:renderURL var="editURL">  
 </portlet:renderURL>  
 <aui:form action="<%=editURL%>" method="post">  
      <aui:input label="Content" name="<portlet:namespace/>content"  
           type="text" value="<%=content%>" />  
      <aui:button type="submit" />  
 </aui:form>  
   

The view.jsp is similar to the previous blog for Ajax processing using jQuery.
   
 <%@ taglib uri="http://java.sun.com/portlet_2_0" prefix="portlet"%>  
   
 <portlet:defineObjects />  
 <portlet:resourceURL var="resourceURL">  
 </portlet:resourceURL>  
 <script type="text/javascript">  
 function <portlet:namespace/>Data(){  
      $.post("<%=resourceURL%>", function(result) {  
                if (result == 'NOT_CONFIGURED') {  
                     $('#<portlet:namespace/>header')  
                               .text('Portlet not configured!');  
                     return;  
                }  
                var vals = result.split(',');  
                $('#<portlet:namespace/>container').css('width', vals[1] + '%');  
                $('#<portlet:namespace/>value').text(vals[1] + '%');  
                $('#<portlet:namespace/>header').text(vals[0]);  
           });//post  
      }  
      $(document).ready(function() {  
           setInterval("<portlet:namespace/>Data()", 1000);  
      });  
 </script>  
   
 <div id="<portlet:namespace/>header"  
      style="color: orange; font-size: 18px;">JVM Used Memory -</div>  
 <br />  
 <h4 id="<portlet:namespace/>value">--</h4>  
 <div style="border: 1px solid Blue; width: 100%; height: 5px;">  
      <div id="<portlet:namespace/>container"  
           style="background-color: SkyBlue; width: 0px; height: 5px;"></div>  
 </div>  
   

Finally, for the backend processing, the following MultiInstancePortlet.java class implements the serveResource method invoked for the POST request. It checks the value of "content" in the PortletPreferences of the instance and sends back the data for that particular content type.

   
 package com.test;  
   
 import java.io.IOException;  
 import java.io.PrintWriter;  
   
 import javax.portlet.PortletException;  
 import javax.portlet.PortletPreferences;  
 import javax.portlet.ResourceRequest;  
 import javax.portlet.ResourceResponse;  
   
 import com.liferay.util.bridges.mvc.MVCPortlet;  
   
 /**  
  * Portlet implementation class MultiInstancePortlet  
  */  
 public class MultiInstancePortlet extends MVCPortlet {  
      @Override  
      public void serveResource(ResourceRequest resourceRequest,  
                ResourceResponse resourceResponse) throws IOException,  
                PortletException {  
           System.out.println("serveResource..");  
           long freemem = Runtime.getRuntime().freeMemory();  
           long totalmem = Runtime.getRuntime().totalMemory();  
           long usedmem = totalmem - freemem;  
           usedmem = usedmem / (1024 * 1024);// MB  
           freemem = freemem / (1024 * 1024);//MB  
           totalmem = totalmem / (1024 * 1024);// MB  
   
           PortletPreferences prefs = resourceRequest.getPreferences();  
           String content = prefs.getValue("content", "");  
   
           PrintWriter writer = resourceResponse.getWriter();  
           if(content.equals("JVM Used Memory")){  
                long percentage = usedmem * 100 / totalmem;  
                writer.print(content+ " - "+usedmem + "/" + totalmem + " MB," + percentage);  
           }else if(content.equals("JVM Free Memory")){  
                long percentage = freemem * 100 / totalmem;  
                writer.print(content+ " - "+freemem + "/" + totalmem + " MB," + percentage);  
           }else if(content.equals("Current Sales")){  
                long peakSale = 1000;  
                long currentSale = (long) (peakSale * Math.random());  
                long percentage = currentSale * 100 / peakSale;  
                writer.print(content+ " - "+currentSale + "/" + peakSale + " Units," + percentage);  
           }else if(content.equals("")) {  
                writer.print("NOT_CONFIGURED");  
           }  
           writer.close();  
      }  
 }  
   

Please note that, for simplicity, this example uses a very basic mechanism for matching the content type; it could instead be made configurable, for example in a database.

Writing an Ajax Portlet using Liferay

In this blog, I walk through the steps to create a simple Ajax portlet using Liferay Portal server. If you have not installed the Liferay server yet, please follow the steps of the "Installation Guide" and the "Getting Started Tutorial" for the SDK setup and portlet creation using eclipse.

http://www.liferay.com/community/wiki/-/wiki/Main/Liferay+IDE+Installation+Guide
http://www.liferay.com/community/wiki/-/wiki/Main/Liferay+IDE+Getting+Started+Tutorial

The following snapshot shows a simple Ajax portlet that displays the current memory usage of the JVM and updates the values every second. This is the portlet we will develop in this blog.


With the Ajax invocation, the complete page is not refreshed; only a part of it is updated based on the content received from the server. The following code gets the used memory/total memory of the current (Liferay) JVM, along with the percentage value to display in the HTML DIV element.

 package com.test;  
   
 import java.io.IOException;  
 import java.io.PrintWriter;  
   
 import javax.portlet.PortletException;  
 import javax.portlet.ResourceRequest;  
 import javax.portlet.ResourceResponse;  
   
 import com.liferay.util.bridges.mvc.MVCPortlet;  
   
 /**  
  * Portlet implementation class AjaxPortlet  
  */  
 public class AjaxPortlet extends MVCPortlet {
      @Override  
      public void serveResource(ResourceRequest resourceRequest,  
                ResourceResponse resourceResponse) throws IOException,  
                PortletException {  
           System.out.println("serveResource..");  
           long freemem = Runtime.getRuntime().freeMemory();  
           long totalmem = Runtime.getRuntime().totalMemory();  
           long usedmem = totalmem - freemem;  
           long percentage = usedmem*100/totalmem;  
           usedmem = usedmem/(1024*1024);//MB  
           totalmem = totalmem/(1024*1024);//MB  
             
           PrintWriter writer = resourceResponse.getWriter();  
           writer.print(usedmem+"/"+totalmem+" MB,"+percentage);  
           writer.close();  
      }  
 }

For the front-end processing, we define the resourceURL, invoke it using the HTTP POST method and update the page with the values received from the server. I am using the jQuery framework for the Ajax invocation, so add the jQuery library to the project and reference it in liferay-portlet.xml. The complete source code of the view.jsp file:
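For example, assuming jQuery is copied into the portlet's js folder (the path and file name below are illustrative), the entry in liferay-portlet.xml looks roughly like this:

 <!-- liferay-portlet.xml: load jQuery along with the portlet -->
 <portlet>
      <portlet-name>AjaxPortlet</portlet-name>
      <footer-portlet-javascript>/js/jquery.min.js</footer-portlet-javascript>
 </portlet>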

 <%--  
 /**  
 * Copyright (c) 2000-2010 Liferay, Inc. All rights reserved.  
 *  
 * This library is free software; you can redistribute it and/or modify it under  
 * the terms of the GNU Lesser General Public License as published by the Free  
 * Software Foundation; either version 2.1 of the License, or (at your option)  
 * any later version.  
 *  
 * This library is distributed in the hope that it will be useful, but WITHOUT  
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS  
 * FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more  
 * details.  
 */  
 --%>  
   
 <%@ taglib uri="http://java.sun.com/portlet_2_0" prefix="portlet"%>  
   
 <portlet:defineObjects />  
 <portlet:resourceURL var="resourceURL">  
 </portlet:resourceURL>  
 <script type="text/javascript">  
 function <portlet:namespace/>Data(){  
      $.post("<%=resourceURL%>", function(result) {  
                var vals=result.split(',');  
                $('#<portlet:namespace/>container').css('width', vals[1]+'%');  
                $('#<portlet:namespace/>value').text(vals[1]+'%');  
                $('#<portlet:namespace/>header').text('JVM Used Memory - '+vals[0]);  
           });//post  
      }  
      $(document).ready(function() {  
           setInterval("<portlet:namespace/>Data()", 1000);  
      });  
 </script>  
   
 <div id="<portlet:namespace/>header" style="color:orange; font-size: 18px;">  
      JVM Used Memory -   
 </div>  
 <br/>  
 <h4 id="<portlet:namespace/>value">--</h4>  
 <div style="border: 2px solid Blue; width: 100%; height: 5px;">  
      <div id="<portlet:namespace/>container"  
           style="background-color: SkyBlue; width: 0px; height: 5px;"></div>  
 </div>  
   


The setInterval call triggers the POST invocation every second to keep the displayed memory usage up to date.

Saturday 11 August 2012

Performance Tip - Select only required columns from HBase

HBase is a column-oriented database and allows you to select only the required columns or column families using the Scan object. For a query on specific columns, add only those columns to the scan object to get better performance. Performance improves because less data needs to be read from the disks, and it can therefore be read faster.

To select a specific column or column family, use the following methods of the Scan class.

Scan scan = new Scan();
//Set the specific column name of a column family
//scan.addColumn(Bytes.toBytes(<family>), Bytes.toBytes(<qualifier>));
scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("firstName"));

//Or set the complete column family
//scan.addFamily(Bytes.toBytes(<family>));
//scan.addFamily(Bytes.toBytes("data"));

Also select only the required rows by using the setStartRow and setStopRow methods of the Scan object, as shown in the sketch below.
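A minimal sketch extending the scan above (the row keys are only illustrative; the start row is inclusive and the stop row is exclusive):

//Restrict the scan to the required rows in addition to the required columns
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("firstName"));
scan.setStartRow(Bytes.toBytes("emp_0000100"));
scan.setStopRow(Bytes.toBytes("emp_0000200"));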

By selecting only the required columns, processing the 1 billion records from my previous blog could be much faster.

How to calculate the Compression in HBase?

Compression in HBase can really reduce the storage requirements, as the row key, column family and column names repeat for each column of a record. There are many compression algorithms, including LZO, GZIP and SNAPPY; the choice depends on the compression level and the performance of the algorithm. Please see the following link to set up and configure compression in HBase.

http://hbase.apache.org/book/compression.html

After setting up compression, the following formula can be used to get an approximate percentage of the size saved by the compression:

100 - StoreFileSize *100 / ((avgKeyLen+avgValueLen+8)*entries)

StoreFileSize: the size of the store file on HDFS
avgKeyLen, avgValueLen and entries: values reported by the HFile tool

Use the following command to list the regions of the table.
hadoop fs -lsr /hbase/<table name>

Now choose a larger store file, as it contains more records, and note its size.
hadoop fs -lsr /hbase/<table name>/<region>/data/<store file>
-rw-r--r-- 1 prafkum supergroup 25855018 2012-08-11 06:24 /hbase/employee/9c21ea1c93d5c8f96f84c888bbcf23e7/data/880ababfda4643f59956b90da2dc3d3f


Also run the HFile tool on the store file; please see my previous blog for details.
hbase org.apache.hadoop.hbase.io.hfile.HFile -v -m -f /hbase/<table name>/<region>/data/<store file>

avgKeyLen=29,
avgValueLen=4,
entries=1146564

Using the above formula to calculate the compression ratio:

100 - 25855018*100/((29+4+8)*1146564) = 45% compression
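
The same calculation expressed as a small Java helper, in case you want to script it (a quick sketch; the numbers plugged in are the ones from the example above):

//Approximate compression percentage, per the formula above:
//100 - StoreFileSize * 100 / ((avgKeyLen + avgValueLen + 8) * entries)
public class CompressionEstimate {
    static double compressionPercentage(long storeFileSize, long avgKeyLen,
            long avgValueLen, long entries) {
        double uncompressed = (avgKeyLen + avgValueLen + 8) * entries;
        return 100 - storeFileSize * 100 / uncompressed;
    }

    public static void main(String[] args) {
        //store file size 25855018 bytes, avgKeyLen=29, avgValueLen=4, entries=1146564
        System.out.printf("%.0f%% compression%n",
                compressionPercentage(25855018L, 29, 4, 1146564)); //prints ~45%
    }
}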

Please note that the compression computation is not exact, as the store file also has a header along with the keys and values of the columns, and the compression may differ across store files. But of course it gives you a good idea of the compression ratio.