K-Means Clustering using Apache Mahout
- Download the sample data set with the following command:
- $ wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
- Upload the file to HDFS:
- Make a directory in Hadoop: $ hadoop fs -mkdir testdata
- Check the Mahout home variable: $ echo $MAHOUT_HOME
- The above command should return /usr/lib/mahout/bin
- Now put the data file into the Hadoop file system (the two $ symbols are correct: the first is the shell prompt, the second begins the variable name):
- $ $HADOOP_HOME/bin/hadoop fs -put /PATH/TO/synthetic_control.data testdata
- Run the K-Means clustering algorithm on the test data (synthetic control data):
- $ $MAHOUT_HOME/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
- This runs MapReduce jobs over the data and groups it into clusters.
- Check the output in HDFS: $ hadoop fs -ls output
- The output looks like this:
Found 14 items
-rwxr-xr-x 1 unmesha unmesha 194 2014-11-04 09:06 output/_policy
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusteredPoints
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-0
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-1
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-10-final
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-2
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-3
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-4
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-5
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-6
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-7
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-8
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/clusters-9
drwxrwxr-x - unmesha unmesha 4096 2014-11-04 09:06 output/data
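The steps so far can be collected into one script. The following is only a minimal sketch, not a tested deployment: the `run` and `kmeans_pipeline` helpers and the `DRY_RUN` switch are hypothetical names introduced here, and it assumes `wget`, `hadoop`, and Mahout are set up as described above. With `DRY_RUN=1` (the default in this sketch) each command is printed instead of executed, so the sequence can be reviewed first.

```shell
#!/bin/sh
# Sketch: download, upload, and clustering steps in one place.
# DRY_RUN=1 (default, an assumption of this sketch) prints commands only;
# set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
DATA_URL="http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data"
DATA_FILE="synthetic_control.data"
# Falls back to the path this tutorial expects if MAHOUT_HOME is unset.
MAHOUT_BIN="${MAHOUT_HOME:-/usr/lib/mahout/bin}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: the tutorial's steps in order.
kmeans_pipeline() {
    run wget "$DATA_URL"
    run hadoop fs -mkdir testdata
    run hadoop fs -put "$DATA_FILE" testdata
    run "$MAHOUT_BIN/mahout" org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
    run hadoop fs -ls output
}

kmeans_pipeline
```

Saving this as, for example, kmeans.sh and running `DRY_RUN=0 sh kmeans.sh` would execute the commands for real. The script does not stop on errors, so a failed download or upload should be checked by hand.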
The output, i.e. the contents of the clusters-10-final directory, is not human-readable.
So we will use the Apache Mahout utility named "clusterdump" to make the output readable.
- Make a new directory in HDFS: $ hadoop fs -mkdir kmeansoutput
- Copy the output data from HDFS to the local file system: $ hadoop fs -get output kmeansoutput
- Run the following command: $ mahout clusterdump --input output/clusters-10-final --pointsDir output/clusteredPoints --output kmeansoutput/clusteranalyze.txt
- This generates a human-readable output file.
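Once clusterdump has finished, it can help to confirm that the text file exists and to look at its first lines. The helpers below are a small sketch for that purpose. `preview_dump` and `count_matching` are hypothetical names, not part of Mahout or Hadoop, and the sketch makes no assumption about the dump's layout, which may differ between Mahout versions.

```shell
#!/bin/sh
# Sketch: sanity-check a clusterdump text file on the local file system.
# Helper names are hypothetical, not part of Mahout or Hadoop.

# Print the first N lines (default 20) of the file.
# Returns 1 with a message on stderr if the file is missing or empty.
preview_dump() {
    file="$1"
    lines="${2:-20}"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    head -n "$lines" "$file"
}

# Count the lines of file $1 that match the grep pattern $2.
# The pattern is chosen by the caller after inspecting the file.
# Note: grep -c prints 0 but exits non-zero when nothing matches.
count_matching() {
    grep -c -- "$2" "$1"
}
```

For example, `preview_dump kmeansoutput/clusteranalyze.txt 5` would show the first five lines. After seeing how cluster header lines begin in your Mahout version, `count_matching` can be given a matching pattern to count the clusters.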
Thank You :)
Anshul Shrivastava
Reference Link : https://mahout.apache.org/users/clustering/k-means-commandline.html