K-Means Clustering

K-Means Clustering using Apache Mahout

Download the sample data set with the following command:
$ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
Upload the file to HDFS:
Make a directory in Hadoop: $ hadoop fs -mkdir testdata
Check the Mahout home variable: $ echo $MAHOUT_HOME
In this setup the command should return /usr/lib/mahout/bin.
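
The run step later calls $MAHOUT_HOME/mahout directly, so it may be worth confirming that an executable named mahout really sits at that path. The function below is a minimal sketch; the name check_mahout_home is hypothetical and not part of Mahout.

```shell
# check_mahout_home: print "ok" if $MAHOUT_HOME/mahout exists and is
# executable, otherwise print "missing".
# Hypothetical helper for this walkthrough, not a Mahout command.
check_mahout_home() {
    if [ -x "${MAHOUT_HOME:-}/mahout" ]; then
        echo ok
    else
        echo missing
    fi
}
```

Calling check_mahout_home after the echo step gives a quick yes/no answer. If it prints "missing", MAHOUT_HOME probably points at the installation root rather than its bin directory, or is not set at all.
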
Now put the data file into the Hadoop file system (both $ symbols are correct: the first is the shell prompt, the second begins the variable name):
$ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata
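
A truncated download is easy to miss, so a quick shape check on the local file before the upload can save a failed job. The sketch below assumes the file is plain text with whitespace-separated numbers; the UCI description of this data set gives 600 rows of 60 values each, which is worth verifying against your copy. The name check_shape is hypothetical.

```shell
# check_shape FILE: print "<rows> <min_columns> <max_columns>" for a
# whitespace-separated text file, ignoring blank lines.
# Hypothetical helper, not part of Hadoop or Mahout.
check_shape() {
    awk '
        NF > 0 {
            rows++
            if (min == "" || NF < min) min = NF
            if (NF > max) max = NF
        }
        END { print rows + 0, min + 0, max + 0 }
    ' "$1"
}
```

Run it as check_shape /PATH/TO/synthetic_control.data. If the minimum and maximum column counts differ, or the row count is lower than expected, the download is probably incomplete.
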
Run the K-Means clustering algorithm on the test data (synthetic control data):
$ $MAHOUT_HOME/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
The job runs MapReduce passes over the data and groups the points into clusters.
Check the output directory in HDFS: $ hadoop fs -ls output
The output looks like this:

Found 14 items
-rwxr-xr-x 1 unmesha unmesha 194 2014-11-04 09:06 output/_policy
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusteredPoints
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-0
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-1
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-10-final
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-2
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-3
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-4
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-5
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-6
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-7
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-8
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-9
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/data
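
The directory ending in -final holds the result of the last iteration. Its number depends on how many iterations the job actually ran, so it may not be 10 on every run. As a small sketch, the function below picks that path out of the listing; the name final_cluster_dir is hypothetical, and it assumes the path is the last field on each line, as in the listing above.

```shell
# final_cluster_dir: read `hadoop fs -ls` output on stdin and print the
# path whose name matches clusters-<number>-final.
# Hypothetical helper, not part of Hadoop or Mahout.
final_cluster_dir() {
    awk '$NF ~ /clusters-[0-9]+-final$/ { print $NF }'
}
```

Usage: hadoop fs -ls output | final_cluster_dir. For the listing above this prints output/clusters-10-final.
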

The final output, i.e. the clusters-10-final directory, is not human readable.
So we will use the Apache Mahout utility named "clusterdump" to convert it into readable text.

Make a new directory in HDFS: $ hadoop fs -mkdir kmeansoutput
Copy the output from HDFS to the local file system (-get copies out of HDFS, not into it): $ hadoop fs -get output kmeansoutput
Run the following command: $ mahout clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output kmeansoutput/clusteranalyze.txt
This generates a human-readable output file.

Thank You :)
Anshul Shrivastava

Reference Link : https://mahout.apache.org/users/clustering/k-means-commandline.html