Sunday, 22 December 2013

GraphLab and GraphChi : Overview



GraphLab is an example of a Pregel-inspired system. It does not use the BSP model. Instead, it uses an asynchronous model suitable for computations with sparse data dependencies. GraphLab is written in C++ and was originally tailored for machine learning tasks. GraphLab provides an Update function which is analogous to the Map function in MapReduce. This function can read and update the sets of program state in a controlled fashion. It also provides an equivalent of the Reduce function the Sync
function which enables the simultaneous execution of reductions as well as computations.Look at current rank of your neighbors. If changed ask neighbors to update.Think Like a vertex.Update function automatically parallelise for you.

GraphLab1 :About Possibility.Asynchronous execution,Very Intutive. but Not Very Scalable.Did't work for web graph.High level of Communication. Split vertices across machine.
GraphLab2 : Powergraph. About Scalablity. GAS decomposition.Gather information from neighbors in a data parallel way.Local Mapreduce.Apply those value to Vertices.Scatter values around neighbors.Split edges across machine.Very High Performnce.Loss of usablity.Very Rigid Abstraction.
GraphLab3:WARP SYSTEM. About Usablity.Much more expressive.No problem with order GAS or SAG.WARP outperform GRPHLAB2.Access neighborhood through parallel iterators.Huge scale machine learning available to all.

GraphChi is a spinoff of the GraphLab project.
Unlike other systems which need to store the whole graph they are processing in memory, GraphChi is able to process a graph from persistent storage such as an SSD or hard drive. GraphChi uses a method called Parallel Sliding Window (PSW) to process very large graphs from disk. Using an asynchronous model of execution, GraphChi can execute various data mining and graph algorithms on a single computer. GraphChi is available for C++ and Java.Problem is Random access.PSW minimizes random access.


In one of the project I did last term we quantified and analyzed the performance and scalablity of Pregel clones and related systems. We compared various open source implementations of Pregel Giraph,GPS,GraphLab,GraphChi on basis of their runtime,Network I/O,Memory,CPU utilization etc to execute various algorithms like PageRank,Shortest Path ,etc.
I created a 1,4,8 nodes Amazon Ec2 (Amazon Elastic Comute Cloud) instances.We used EC2 large instances (2 EC2 Compute Units,3.75 GiB of RAM and moderate network perfomance) running Ubuntu 12.04.
We analyzed the runtime ,memory foorprint for all these algorithms based on data set of varying sizes ranging from 8 vertices ,70000 vertices, 1million vertices,10 million vertices,30 million vertices . The size of files varies from 10 kb to 200MB.
For Memory footprint we compared the peak memory usage of master node during execution of algorithm for each system.
The datasets were taken from
http://snap.stanford.edu/data/
http://algs4.cs.princeton.edu/44sp/
The datasets were converted to form required by the graphs.

To run algorithm on GraphLab following steps were followed
http://graphlab.org/projects/tutorials.html

We included GraphChi in our comparision because it is sufficiently different from other systems.GraphChi is capable of running very large graphs on a single  machine.To our surprise we found that there was a significant improvement in Runtime when we compared the graphs from HAMA GIRAPH graphChi,GPS and GraphLab. Though performance of the systems like GraphLab and GPS can be further imporved with few little tweeks like splitting the graph into smaller chunks or increasing no of pts for GPS for algorithms with larger supersteps.




This figures shows us the BarChart for various datasets on different systems . GraphLab showed signigicant lower runtimes as compared to Giraph.


A link to the blog is also posted at
http://bickson.blogspot.ca/2014/03/university-of-waterloo-evaluates.html



Tuesday, 19 November 2013

Install Hadoop steps - Both single and Multiple machine...


Hadoop Installation for Windows..


Steps :
A. Prequisites

1. Install Putty. http://www.chiark.greenend.org.uk/~sgtatham/putty/
2.Get Ec2 Instance.
3.Go to EC2 instance and check the public DNS name.
4.Get Private key(.pem)

B.Convert .pem to .ppk

1. Start PuTTYgen.
2.Under Type of key to generate, select SSH-2 RSA.



3.Click Load. By default, PuTTYgen displays only files with the extension .ppk. To locate your .pem file, select the option to display files of all types. Generate PPK from Pem file you obtained from ec2.





 4a. Click Load if already saved Config

 
4b.Click SSh to browse PPK key.




5.Install JAVA (Hadoop is written in Java)
sudo apt-get install openjdk-7-jdk

6.Install OpenSSH for SSH
sudo apt-get install openssh-server

7.Create Hadoop user. So that you can run your program
sudo addgroup hadoop // creating Hadoop Group
sudo adduser --ingroup hadoop hduser // adding user to the group
sudo adduser hduser sudo

8. Generating Keys
$ ssh-keygen -t rsa -P ''
...
Your identification has been saved in /home/hduser/.ssh/id_rsa.

...
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost

9.Disabling IPV6

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

9. Installing Hadoop

Cd /usr/local/ 

sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.203.0/hadoop-0.20.203.0rc1.tar.gz

//Unzip the tar

sudo tar xzf hadoop-0.20.203.0rc1.tar.gz

//move 
sudo mv hadoop-0.20.203.0rc1 hadoop

//set ownership   
sudo chown -R hduser:hadoop hadoop


10.Update .bashrc with the proper path

cd ~
add java_home
    
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
11.cd /usr/local/hadoop/etc/hadoop
vi hadoop-env.sh
JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
12.Create tmp for hdfs 

$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
$ sudo chmod 750 /app/hadoop/tmp

13. Configure Hadoop - edi core-site.xml
cd /usr/local/hadoop/etc/hadoop

or 
cd vi /usr/local/hadoop/cong/core-site.xml
vi core-site.xml

For version <2 
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property> 

For version >2
 Bcz of Yarn

<property>
   <name>fs.default.name</name>
   <value>hdfs://localhost:9000</value>
</property
13. update for version < 2
mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property> 

for version >2
<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>
<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

14. Set Replication Factor 
file conf/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>


15. Format nAmenode
hadoop namenode -format

16.Start-all.sh
depecated for v >2 but will work Invokes both dfs and mapred-site.xml
 ----- Till here single node ---

For multinode.. 
17.Everything is working fine..

18.Create few more instances of EC2 clusters in same way
Update vi etc/hosts from home directory
add IP address and name which you will use
ex...
10.23.45.67 master10.11.11.11 slave1
10.12.12.12 slave2

19.Copy pub key of master to the list of authorized key of client 
vi /home/hduser/.ssh/id_rsa.pub
add the content to 
vi /home/hduser/.ssh/id_rsa/authorized_keys of slave1

20.you can ssh to both master and slave

21.once you type ssh slave you will be taken to slave machine 
exit it from maser

22.On master edit conf/master 
master

23. On master edit conf/slaves  
add the following

master
slave1 
slave2

23. Update core-site.xml ALL MACHINE
 <property>
  <name>fs.default.name</name>
  <value>hdfs://IPADDRESS OF MASTER HERE:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

23. update mapred-site.xml for all machine
<property>
  <name>mapred.job.tracker</name>
  <value>masterIP ADDRESS:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property> 

24.set  replication update hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property> 

25.format name node 

26.start-all.sh


This is what i got from reading from various blogs.. I have installed Hadoop many times 
and feel like I should write the steps  
Its the mix of both single and multinode cluster

I used http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/#java-io-ioexception-incompatible-namespaceids
but modified in many ways..