Wednesday 7 September 2016

Setting up Spark on a multi-machine cluster

  1. Spark works on a master-slave model. 
  2. You need a master machine and one or more slave machines. The master and a slave can run on the same machine too. 
  3. For testing purposes you can bring up the whole cluster on a single machine; the approach is the same. 
  4. On each machine, check whether Java is installed.

    JAVA install/update 
  5. If you are using a new machine, e.g. a fresh EC2 instance, it won't come preinstalled with Java.  
  6. You can set up Java by following the steps below.
  7.  First update the package index:
    1. sudo apt-get update
    2. Check the version: java -version 
    3. If Java is not found → install the JRE and JDK:
      1. sudo apt-get install default-jre
      2. sudo apt-get install default-jdk
  8. Check the version again: java -version 
  9. This might install Java 1.6. 
  10. sudo apt-get install openjdk-7-jre
  11. sudo apt-get install openjdk-7-jdk
  12. Issue: java may still point to version 6. 
  13. Follow the steps below to switch the default (a consolidated sketch follows this list):
    1. update-java-alternatives -l
    2. sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 (use the name of the newer version shown by the previous command)
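
    A minimal end-to-end sketch of the Java steps above, assuming an Ubuntu machine and that update-java-alternatives -l lists java-1.7.0-openjdk-amd64 (the exact name may differ on your machine):

      sudo apt-get update                                          # refresh the package index
      java -version                                                # check the current version
      sudo apt-get install openjdk-7-jre openjdk-7-jdk             # install Java 7
      update-java-alternatives -l                                  # list the installed Java versions
      sudo update-java-alternatives -s java-1.7.0-openjdk-amd64    # switch the default to Java 7
      java -version                                                # should now report 1.7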

    Instance Setup 
  14. Log in to your dev box or EC2 instances. 
  15. If running on EC2, do the following:
    1. Create an AWS account, generate a key pair, and download the private key to one of your machines. Steps for generating key pairs are available online. 
    2. Set two environment variables for your AWS credentials: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
    3. Go to the Spark directory on your local machine and run the spark-ec2 script from the ec2 directory.
    4. ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a launch my-spark-cluster
    5. By default this launches a two-instance cluster (one master and one slave) of large instances; a fuller launch sketch follows this list. 
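
    A sketch of the EC2 launch, assuming a key pair named awskey and showing the credentials plus the optional --slaves and --instance-type flags (the instance type here is only an example):

      export AWS_ACCESS_KEY_ID=...        # your AWS access key
      export AWS_SECRET_ACCESS_KEY=...    # your AWS secret key
      cd <your spark directory>/ec2       # the ec2 directory inside the Spark download
      ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem \
          --region=us-west-1 --zone=us-west-1a \
          --slaves=1 --instance-type=m3.large \
          launch my-spark-cluster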

    Set Up Spark 
  16. Download Spark → select the latest Hadoop version and follow the download link.  
  17. On a dev box you can download it from the command line (see the sketch after this list): 
    1. wget <spark download link> 
    2. Extract it: tar -xzf <spark tarball>
  18. Do this on every machine you want to run Spark on.
  19. Set the environment variable SPARK_HOME to point to the folder just above the bin directory; in the normal case this is the unzipped Spark folder.
  20. Add it to ~/.bashrc and apply it with: source ~/.bashrc
  21. Decide which machine will be the master and which will be the slaves.
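
    A sketch of the download, extraction and SPARK_HOME setup; the URL and version below are placeholders, use the link from the Spark download page:

      wget http://<mirror>/spark-1.6.2-bin-hadoop2.6.tgz          # placeholder URL and version
      tar -xzf spark-1.6.2-bin-hadoop2.6.tgz
      echo 'export SPARK_HOME=$HOME/spark-1.6.2-bin-hadoop2.6' >> ~/.bashrc
      echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
      source ~/.bashrc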


    Master Machine 
  22. Log in to the master machine and follow the steps below.
  23. Edit the hosts file on the master:
    1. vi /etc/hosts
    2. Map each IP address to its machine name, one entry per line (an example follows this list): 
    3. IP-address machine1
       IP-address machine2 
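
    For example, /etc/hosts might end up with entries like these (the IPs and names are made up, use your own machines):

      192.168.0.101   spark-master
      192.168.0.102   slave1
      192.168.0.103   slave2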
       
  24. The next few steps are changes inside the Spark distribution:
    1. The slaves template is located in the conf directory.
    2. There is a file called slaves.template; copy its contents into a new file called slaves in the same directory.
    3. List the names of all the slaves in this file, one per line (a sketch follows this list): 
      1. slave 1
      2. slave 2
      3. slave 3
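
    A sketch of creating the slaves file from the template (the hostnames are examples):

      cd $SPARK_HOME/conf
      cp slaves.template slaves
      # edit slaves so it lists one slave hostname per line, for example:
      #   slave1
      #   slave2
      #   slave3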
  25. Go to the Spark sbin directory and start the master: 
    1. cd $SPARK_HOME/sbin
    2. ./start-master.sh
  26. This starts the master; its web UI runs on port 8080 and can be accessed via the master's IP address or machine name.
  27. For the current case, go to any web browser and type: http://machinename:8080
  28. This is required because you need the Spark master address. The page shows the master URL at the top, e.g. spark://machinename:7077 
  29. Copy it. 
  30. Go to the conf directory and create spark-defaults.conf from spark-defaults.conf.template.
  31. Add the master's URL to the file (a sketch follows this step): 
  32. spark.master spark://machinename:7077 (the host part can also be an IP address or a fully qualified name)
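
    A sketch of creating spark-defaults.conf from its template and adding the master URL copied from the web UI:

      cd $SPARK_HOME/conf
      cp spark-defaults.conf.template spark-defaults.conf
      # append the master URL you copied from http://machinename:8080
      echo "spark.master spark://machinename:7077" >> spark-defaults.conf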
  33. Go to conf/spark-env.sh (create it from conf/spark-env.sh.template if it does not exist yet).
  34. Add the following lines to it (a sketch follows this list): 
  35. export SPARK_WORKER_MEMORY=1g → how much memory you want to give each worker 
    export SPARK_WORKER_INSTANCES=2 → number of worker instances per machine
    export SPARK_MASTER_IP=spark-2 → name of the Spark master
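
    A sketch of creating spark-env.sh from its template and appending the settings above (spark-2 stands for the master's hostname):

      cd $SPARK_HOME/conf
      cp spark-env.sh.template spark-env.sh
      echo "export SPARK_WORKER_MEMORY=1g" >> spark-env.sh
      echo "export SPARK_WORKER_INSTANCES=2" >> spark-env.sh
      echo "export SPARK_MASTER_IP=spark-2" >> spark-env.sh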

    Slave Machine 
  36. Go to each slave machine and run /sbin/start-slave.sh, passing it the master URL (a sketch follows this list).
  37. Run jps on both the slaves and the master to verify that the Worker and Master processes are up.
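
    A sketch of starting a worker on a slave and verifying the daemons, using the master URL copied earlier:

      # on each slave machine
      $SPARK_HOME/sbin/start-slave.sh spark://machinename:7077
      # on the master and on each slave, verify the running JVM processes
      jps    # the master should list a Master process, each slave a Worker process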

Note : You need passwordless SSH access from the master to the slaves. Copy the master's public key (~/.ssh/id_rsa.pub) into ~/.ssh/authorized_keys on each slave and save. A sketch follows. 
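
A sketch of setting up passwordless SSH from the master to a slave (the ubuntu user and the slave1 hostname are assumptions, adjust to your machines):

  # on the master: generate a key pair if you do not already have one
  ssh-keygen -t rsa
  # copy the public key into the slave's ~/.ssh/authorized_keys
  ssh-copy-id ubuntu@slave1
  # or manually: cat ~/.ssh/id_rsa.pub on the master and paste it into ~/.ssh/authorized_keys on the slave
  # test it
  ssh ubuntu@slave1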

Very Useful links
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
https://www.codementor.io/spark/tutorial/spark-python-rdd-basics