Saturday 5 January 2013

Visualization Tool for HDFS blocks/chunks

From the time I started working on Hadoop, I have felt the need for a tool to visualize the blocks of an HDFS file or directory. Unfortunately, I could not find any such tool in the open-source Hadoop distributions. Here is a simple online tool that visualizes the output of the hadoop fsck command in graphical form. To use the tool, just run the following command and capture the output in a text file. The tool uses this text file to visualize the nodes, the blocks, and the total size stored on each slave node.

The following command collects the block details for the HDFS folder '/in' and creates the file fsck.txt. Please check my previous blog post for details.

hadoop fsck /in -files -blocks -locations -racks > fsck.txt

Just click 'Choose File', select the fsck output file, and voila! You see the HDFS chunks in graphical form without installing any software.

The following snapshot shows the sample output:

[Screenshot: sample visualization of HDFS blocks across the slave nodes]

Just get started with the embedded tool below:

[Embedded tool: HDFS Blocks Visualization, with a 'Max Blocks' setting and Choose File, Sample Data and Show Sample Data buttons]

Help: Click Choose File and select an fsck output file stored on your system to visualize the HDFS blocks or chunks across your Hadoop cluster. You can use the Sample Data button to see the visualization for a sample file, or press Show Sample Data to see the contents of the sample file.

Note that I have done only limited testing of the tool, mostly with Hadoop 1.x.x and Chrome.

Please share your comments on how to enhance the tool, and let me know if you see any issues. Also, since all the processing happens in your browser, try to use it with a single HDFS file or a directory with a limited number of blocks/files.
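
For the curious, the aggregation behind the visualization is straightforward. Below is a minimal Java sketch of the same idea: it scans an fsck output file, matches the per-block lines, and tallies the block count and total bytes stored on each slave node. This is my own illustration, not the tool's actual code (the tool runs as JavaScript in your browser), and the regular expression is an assumption based on the fsck output format shown in the next section.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// My own sketch, not the tool's code: tally block count and total bytes
// per datanode from an fsck output file such as fsck.txt.
public class FsckSummary {
    public static void main(String[] args) throws Exception {
        // Per-block lines look like:
        //   0. blk_606611195878801492_2473688 len=1366 repl=3 [/default-rack/10.58.127.59:50010, ...]
        Pattern blockLine = Pattern.compile("blk_\\S+\\s+len=(\\d+)\\s+repl=\\d+\\s+\\[(.*)\\]");
        Map<String, Long> bytesPerNode = new HashMap<String, Long>();
        Map<String, Integer> blocksPerNode = new HashMap<String, Integer>();

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = blockLine.matcher(line);
            if (!m.find()) continue;
            long len = Long.parseLong(m.group(1));
            // Each entry in the bracket list is one replica: /rack/ip:port
            for (String replica : m.group(2).split(",\\s*")) {
                String node = replica.substring(replica.lastIndexOf('/') + 1);
                Long bytes = bytesPerNode.get(node);
                bytesPerNode.put(node, (bytes == null ? 0L : bytes.longValue()) + len);
                Integer count = blocksPerNode.get(node);
                blocksPerNode.put(node, (count == null ? 0 : count.intValue()) + 1);
            }
        }
        in.close();

        for (Map.Entry<String, Long> e : bytesPerNode.entrySet()) {
            System.out.println(e.getKey() + ": " + blocksPerNode.get(e.getKey())
                    + " block(s), " + e.getValue() + " bytes");
        }
    }
}

Running it as 'java FsckSummary fsck.txt' prints one summary line per datanode, which is essentially the data the tool plots.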

How to find the blocks/chunks of a file in HDFS?

We know that HDFS stores a file by splitting it into multiple blocks or chunks based on the configured block size property. The blocks of a file are replicated for fault tolerance according to the configured replication factor.
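
As a quick worked example (the numbers below are illustrative, assuming the Hadoop 1.x default block size of 64 MB and a replication factor of 3): a 200 MB file is split into four blocks, three full ones and one 8 MB partial block, and 12 block replicas are stored across the cluster.

// Back-of-the-envelope block math; the file size, block size and
// replication factor below are illustrative values, not cluster settings.
public class BlockMath {
    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;   // a 200 MB file
        long blockSize = 64L * 1024 * 1024;   // 64 MB, the Hadoop 1.x default
        int replication = 3;                  // default replication factor

        long blocks = (fileSize + blockSize - 1) / blockSize;   // ceiling division
        System.out.println("Blocks: " + blocks);                        // 4
        System.out.println("Replicas stored: " + blocks * replication); // 12
        System.out.println("Last block holds: "
                + (fileSize % blockSize) / (1024 * 1024) + " MB");      // 8 MB
    }
}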

But how can we find the locations of these blocks and the nodes on which they are replicated? This can be needed during maintenance activities, when we have to shut down a few data nodes, and also for data locality, to know the exact location of the data in the cluster.

The following hadoop fsck command can be used to find the blocks of a file and their locations:

 $ hadoop fsck /test -files -blocks -locations -racks  
 FSCK started by hadoop from /10.58.127.50 for path /test at Sun Jan 06 01:04:00 IST 2013  
 /test <dir>  
 /test/README.txt 1366 bytes, 1 block(s): OK  
 0. blk_606611195878801492_2473688 len=1366 repl=3 [/default-rack/10.58.127.59:50010, /default-rack/10.58.127.57:50010, /default-rack/10.58.127.51:50010]  
 Status: HEALTHY  
  Total size: 1366 B  
  Total dirs: 1  
  Total files: 1  
  Total blocks (validated): 1 (avg. block size 1366 B)  
  Minimally replicated blocks: 1 (100.0 %)  
  Over-replicated blocks: 0 (0.0 %)  
  Under-replicated blocks: 0 (0.0 %)  
  Mis-replicated blocks: 0 (0.0 %)  
  Default replication factor: 3  
  Average block replication: 3.0  
  Corrupt blocks: 0  
  Missing replicas: 0 (0.0 %)  
  Number of data-nodes: 9  
  Number of racks: 1  
 FSCK ended at Sun Jan 06 01:04:00 IST 2013 in 0 milliseconds  
 The filesystem under path '/test' is HEALTHY  

The above output shows that the file README.txt in the /test folder has only one block with a replication factor of 3, and it also shows the IP addresses of the slave nodes holding the replicas.
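
If you need the same block information programmatically instead of parsing the fsck output, the Hadoop FileSystem API exposes it directly through getFileBlockLocations. Here is a minimal sketch; the path /test/README.txt is just the example file from above, and the Configuration is assumed to pick up your cluster settings from the classpath.

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list the blocks of an HDFS file through the FileSystem
// API instead of parsing fsck output. The path is the example file from
// above; the Configuration is assumed to find core-site.xml and
// hdfs-site.xml on the classpath.
public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/test/README.txt"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i
                    + " offset=" + blocks[i].getOffset()
                    + " len=" + blocks[i].getLength()
                    + " hosts=" + Arrays.toString(blocks[i].getHosts()));
        }
        fs.close();
    }
}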