Wednesday 7 September 2016

Setting up Spark on a multi-machine cluster

  1. Spark works on a master-slave model.
  2. A master machine and one or more slave machines are required. The master and a slave can also run on the same machine.
  3. For testing purposes you can create the whole cluster on a single machine; the approach is similar.
  4. On each machine, check whether Java is installed.

    Java install/update
  5. A new machine, such as a fresh EC2 instance, won't come with Java preinstalled.
  6. You can set up Java with the steps below.
  7. First update the package index:
    1. sudo apt-get update
    2. Check the version: java -version
    3. If Java is not found, install the JRE and JDK:
      1. sudo apt-get install default-jre
      2. sudo apt-get install default-jdk
  8. Check the version again: java -version
  9. This might install Java 1.6. If so, install OpenJDK 7 with the next two commands.
  10. sudo apt-get install openjdk-7-jre
  11. sudo apt-get install openjdk-7-jdk
  12. Issue: java still points to version 6.
  13. Follow the steps below to switch:
    1. update-java-alternatives -l
    2. sudo update-java-alternatives -s java-1.7.0-openjdk-amd64 (use the name of the newer version as listed by the previous command)

    Instance Setup
  14. Log in to the dev box or EC2 instances.
  15. If running on EC2, do the following:
    1. Create an AWS account, generate a private/public key pair, and download the key to one of the machines. Instructions for generating key pairs are available online.
    2. Create two environment variables, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
    3. Go to the Spark directory on your local machine and run the spark-ec2 script from the ec2 directory.
    4. ./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a launch my-spark-cluster
    5. This will create a cluster of two large instances.

    Set Up Spark
  16. Download Spark: select the build for the latest Hadoop version and follow the link.
  17. On the dev box you can download it with:
    1. wget <Spark download link>
    2. Extract the archive with tar -xzf <downloaded Spark archive>
  18. Do this on every machine you want to run Spark on.
  19. Set the environment variable SPARK_HOME to the folder that contains the bin directory; normally this is the extracted folder.
  20. Add the export to ~/.bashrc and apply it with source ~/.bashrc
  21. Decide which machine will be the master and which will be slaves.


    Master Machine 
  22. Log in to the master machine and follow the steps below.
  23. Edit the hosts file on the master:
    1. vi /etc/hosts
    2. Map each IP address to a machine name:
    3. IP Machine1
      IP Machine2
       
  24. The next few steps are changes to the Spark configuration:
    1. The slaves template is located in the conf directory.
    2. The file is called slaves.template; create a new file called slaves by copying its content.
    3. Paste the names of all slaves in this file 
      1. slave 1
      2. slave 2
      3. slave 3
  25. Go to the Spark sbin directory:
    1. cd ./sbin/
    2. Run ./start-master.sh
  26. This starts the master; its web UI is served on port 8080 and can be reached by IP address or machine name.
  27. In this case, open any web browser and go to http://machinename:8080
  28. This is needed to get the Spark master address. The page shows the URL at the top: spark://machinename:7077
  29. Copy it.
  30. Go to conf/spark-defaults.conf (create it by copying conf/spark-defaults.conf.template if it does not exist).
  31. Add the master's URL to the file:
  32. spark.master spark://machinename:7077 (the host part can also be an IP address or a longer host name)
  33. Go to conf/spark-env.sh (create it by copying conf/spark-env.sh.template if it does not exist).
  34. Add the following lines to it:
  35. export SPARK_WORKER_MEMORY=1g → Memory to give to each worker
    export SPARK_WORKER_INSTANCES=2 → Number of workers
    export SPARK_MASTER_IP=spark-2 → Name of the Spark master
    Slave
  36. On each slave machine, run sbin/start-slave.sh with the master URL copied in step 29, e.g. ./sbin/start-slave.sh spark://machinename:7077
  37. Run jps on both the slaves and the master to check that the Master and Worker processes are running.

Note: You need passwordless SSH access from the master to the slaves. Copy the master's public key (id_rsa.pub) into the authorized_keys file on each slave and save.
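
A quick way to confirm that the master is reachable from another machine is to test its ports. The following is only a minimal sketch using the Python standard library. It assumes the default ports from the steps above (7077 for the master, 8080 for the web UI); the function names are made up for illustration, and "machinename" is the same placeholder host used above.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


def check_master(host, ports=(7077, 8080)):
    """Map each port to whether it accepted a connection."""
    return {port: is_port_open(host, port) for port in ports}


# Example (placeholder host name, replace with your master):
# print(check_master("machinename"))
```

If both ports report open from a slave machine, the hosts file entries and the master process are most likely set up correctly.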

Very Useful links
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/
https://www.codementor.io/spark/tutorial/spark-python-rdd-basics


Tuesday 24 November 2015

Data Analytics with Amazon Machine Learning - Premier League Data

Dataset Source : Soccer 
Use Case and Dataset
The most important part of any machine learning application is getting the right dataset. As I am very interested in soccer, I thought of using previous Premier League seasons to predict the winner of this season.
I agree that predicting a winner is debatable and hard because of the many factors involved, but I just want to try my hand at the new machine learning framework released by Amazon a few days back, called “AWS Machine Learning”.
Steps :
1. Download the last 5 years of Premier League data from the website (2010-2015).
2. Data from 2010-2014 will be the training data.
3. Data from 2015 will be the test data.
4. Extract the following information from the dataset:

HomeTeam = Home Team
AwayTeam = Away Team
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals
FTR = Full Time Result (H=Home Win, D=Draw, A=Away Win)
B365H = Bet365 home win odds
B365D = Bet365 draw odds
B365A = Bet365 away win odds
The reason I include the Bet365 odds is that I think betting odds say a lot about a team's current performance, which the match result alone doesn't.
5. Combine all 5 files of the dataset into one.
file1 <- read.csv("~/My Received Files/2010.csv", header = TRUE, stringsAsFactors = FALSE)
file2 <- read.csv("~/My Received Files/2011.csv", header = TRUE, stringsAsFactors = FALSE)
file3 <- read.csv("~/My Received Files/2012.csv", header = TRUE, stringsAsFactors = FALSE)
file4 <- read.csv("~/My Received Files/2013.csv", header = TRUE, stringsAsFactors = FALSE)
file5 <- read.csv("~/My Received Files/2014.csv", header = TRUE, stringsAsFactors = FALSE)

f1f2 <- rbind(file1, file2)
f4f5 <- rbind(file4, file5)
f1clean <- f1f2[,c(3,4,5,6,24,25,26,7)]
f2clean <- f4f5[,c(3,4,5,6,24,25,26,7)]
f3clean <- file3[,c(3,4,5,6,23,24,25,7)]
f2combine <- rbind(f1clean, f2clean)
finalfile <- rbind(f3clean, f2combine)

rbind function: combines vectors, matrices or data frames by rows.

6. Save the dataframe to CSV file
write.csv(finalfile ,"~/My Received Files/test.csv")
7. Get the unique team names, sorted in alphabetical order.
teams = sort(unique(c(as.character(finalfile$HomeTeam), as.character(finalfile$AwayTeam))))
8. Create a table to store the win and points record of each team. For that we create a data frame called finaltable with the columns below, all initialised to 0.
finaltable = data.frame(Team = teams,
                     Games = 0, Win = 0, Draw = 0, Loss = 0, PercentWins=0,
                     HomeGames = 0, HomeWin = 0, HomeDraw = 0, HomeLoss = 0, percentHomeWins =0,
                     AwayGames = 0, AwayWin = 0, AwayDraw = 0, AwayLoss = 0, percentAwayWins =0)
9. Store the number of matches played by each team in the table above. table() counts the matches per team and as.numeric() converts the counts to plain numbers.
finaltable$HomeGames = as.numeric(table(finalfile$HomeTeam))
finaltable$AwayGames = as.numeric(table(finalfile$AwayTeam))
10. Fill the other columns based on FTR.
Take the HomeTeam column of the dataset, keep only the rows where FTR is H, and count them per team to get the home wins. Do the same with D and A for home draws and home losses, and likewise with the AwayTeam column for the away record.

finaltable$HomeWin =
 as.numeric(table(finalfile$HomeTeam[finalfile$FTR == "H"]))

finaltable$HomeDraw =
 as.numeric(table(finalfile$HomeTeam[finalfile$FTR == "D"]))

finaltable$HomeLoss =
 as.numeric(table(finalfile$HomeTeam[finalfile$FTR == "A"]))

finaltable$AwayWin =
 as.numeric(table(finalfile$AwayTeam[finalfile$FTR == "A"]))
finaltable$AwayDraw =
 as.numeric(table(finalfile$AwayTeam[finalfile$FTR == "D"]))
finaltable$AwayLoss =
 as.numeric(table(finalfile$AwayTeam[finalfile$FTR == "H"]))
11. Calculate total games, wins, draws and losses by adding the home and away totals.
finaltable$Games = finaltable$HomeGames + finaltable$AwayGames
finaltable$Win = finaltable$HomeWin + finaltable$AwayWin
finaltable$Draw = finaltable$HomeDraw + finaltable$AwayDraw
finaltable$Loss = finaltable$HomeLoss + finaltable$AwayLoss
12. Calculate percent wins for total, home and away wins.
finaltable$PercentWins = floor((finaltable$Win / finaltable$Games)*100)
finaltable$percentHomeWins = (finaltable$HomeWin / finaltable$HomeGames)*100
finaltable$percentAwayWins = (finaltable$AwayWin / finaltable$AwayGames)*100
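
For readers who want to check the tallying logic outside R, here is a minimal sketch of steps 9 to 12 in plain Python. It is not the code used for this post: the function name and the sample fixtures are made up for illustration, and it only tracks the overall record, not the separate home and away columns.

```python
from collections import defaultdict


def tally_results(matches):
    """Build a per-team record from (home_team, away_team, ftr) tuples.

    ftr is the full-time result code from the dataset:
    "H" = home win, "D" = draw, "A" = away win.
    """
    table = defaultdict(lambda: {"Games": 0, "Win": 0, "Draw": 0, "Loss": 0})
    # Outcome for (home side, away side) under each FTR code.
    outcome = {"H": ("Win", "Loss"), "D": ("Draw", "Draw"), "A": ("Loss", "Win")}
    for home, away, ftr in matches:
        home_key, away_key = outcome[ftr]
        table[home]["Games"] += 1
        table[home][home_key] += 1
        table[away]["Games"] += 1
        table[away][away_key] += 1
    for record in table.values():
        # Integer division floors the percentage, like floor() in step 12.
        record["PercentWins"] = 100 * record["Win"] // record["Games"]
    return dict(table)


# Made-up fixtures, for illustration only.
sample = [
    ("Arsenal", "Chelsea", "H"),
    ("Chelsea", "Arsenal", "D"),
    ("Arsenal", "Everton", "A"),
]
records = tally_results(sample)
print(records["Arsenal"])
```

Counting match by match like this also gives every team an entry for every outcome, including outcomes that never occurred.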
13. Simple graph of the top 7 teams
finaltable = finaltable[order(-finaltable$PercentWins),]
graphtable = finaltable[1:7,]
graphdata = finaltable[,c(1,5,10,15)]

barplot(graphtable$PercentWins, main="Top Teams in EPL by percent wins", xlab="Teams",
       ylab="PercentWins", names.arg=c("ManCity","United","Chelsea","Arsenal","TTham","Liverpool","Everton"),
       border="blue",col=rainbow(7))

Man City has the highest win percentage over the last 5 years, followed closely by Man United and Chelsea.

14. Add the data to S3. Since AWS ML uses S3 for data storage, you need to upload your data to S3. Create a bucket in S3 and upload the dataset created above to it. This is our training data, and all predictions will be based on it. The dataset is in CSV format, as required by AWS ML.

15. Open the AWS console: https://console.aws.amazon.com/machinelearning/
The next step is to create an ML model and link the S3 dataset.
Create a new datasource. You can specify the data location as either S3 or Redshift.



Sunday 19 July 2015

Data Analytics with R - World Bank Refugee Population data

Data Source :  World Bank Data
Problem : To observe the distribution of refugees across the globe over the past two decades.

Data Cleaning : Removed all countries with fewer than 100 refugees. Removed all columns with no relevant data, such as Indicator Name and Indicator Code.

Step 1: 
1. Install R
2. IDE : RStudio
3. Online Editor : Data Joy.

4. Loading dataset in R:
df <- read.csv("data.csv", stringsAsFactors=FALSE)
This creates an object named df. A CSV file stores each row as delimiter-separated values; the delimiter is usually a comma, but there can be others as well. The first row contains the header, in this case “Country Name”, “Country Code” and the refugee population for the years 1990-2013. Setting stringsAsFactors to FALSE prevents R from converting strings to factors (a categorical type). It was TRUE by default in R versions before 4.0.


5. To check all the column names, type in the console:

  • str(df)
This function compactly displays the structure of an R object. All the column headers are displayed with their data types.

The 2 character columns are Country Name and Country Code. The rest hold the number of people who sought refuge in the given country in each year.

6. To access a column of a data frame,
you can use:

table(df$Country.Name)
To get the header names use:
print(names(df))
The table command returns the values in the column Country.Name together with the count of each value. Since each country appears only once, every count is 1.
Afghanistan     Albania     Algeria
          1           1           1
To get the proportions one can use:
prop.table(table(df$Country.Name))
Though this is not required for this data.
7. Create a new column Category in the data frame
df$Category  <- df$Country.Code
8. To get the upper limit of our data, we need the maximum number of refugees in a given column. For the year 2013 the maximum value can be extracted using:
max(df$X2013, na.rm = TRUE)
9. For some countries in the dataset the values are either not available or empty, resulting in a lot of “NA” in the data. Let's convert all NA cells to a numerical value of 1 (~0).
df[is.na(df)] <- 1
In R, values are assigned using the ‘<-’ operator; this sets every NA cell to 1.

10. Since the values in the dataset vary widely, from 1 to 2712888, I decided to group them into categories. The idea is to create a bucket distribution. Each bucket has a fixed capacity, in this case 10000.


Bucket 1 : 1-10000 (all entries between 1 and 10000 fall in this bucket)
Bucket 2 : 10001-20000
and so on.

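11. Now we loop through every column of the data frame and replace each value with the bucket it falls into.

for (i in names(df)) {
   # skip the name, code and category columns
   if ((i != colnames(df)[1]) && (i != colnames(df)[2]) && (i != colnames(df)[18])) {
      sq <- seq(0, 3000000, 10000)               # 301 cut points, i.e. 300 intervals
      qr <- cut(df[,i], sq, labels = c(1:300))   # factor holding the bucket labels
      df[,i] <- as.numeric(qr)                   # convert the factor to bucket numbers
   }
}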
This excludes the first column (Country Name), the second column (Country Code) and the last column, Category, which we created earlier.
R provides a function called cut: cut divides the range of x into intervals and codes each value in x according to the interval it falls into.
cut(x, breaks, labels = NULL, ...)
x : a numeric vector which is to be converted to a factor by cutting.

breaks : either a numeric vector of two or more unique cut points, or a single number giving the number of intervals.
labels : labels for the levels of the resulting category.
An intermediate factor vector is created for each column, and the column is then updated with it. Since this is a factor vector, the value is cast to numeric in the next line. If the number of labels doesn't match the number of intervals defined by the cut points, an error is thrown saying that the lengths do not match.
No. of buckets (labels) = Data Max Value / Capacity of bucket, rounded up. Here the upper limit of 3000000 gives 3000000 / 10000 = 300.
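
To make the bucketing rule concrete, here is a minimal sketch of the same arithmetic in plain Python. It is an illustration, not part of the R workflow: the function name is made up, and it assumes intervals closed on the right, which is how cut() behaves with its default right = TRUE.

```python
import math


def bucket_index(value, capacity=10000, max_value=3000000):
    """Return the 1-based bucket for value, or None if it lies outside (0, max_value].

    Intervals are closed on the right, so bucket 1 covers 1..10000
    and bucket 2 covers 10001..20000.
    """
    if value is None or value <= 0 or value > max_value:
        return None  # cut() would give NA for these values
    # Ceiling division: 10000 -> 1, 10001 -> 2.
    return math.ceil(value / capacity)


print(bucket_index(1), bucket_index(10000), bucket_index(10001))
```

With the maximum value in the data, 2712888, this gives bucket 272, comfortably inside the 300 buckets defined by the cut points.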
12. Convert your data to long format, as needed by ggplot2.
ggplot2 is a graph plotting library for R.
reshape2 is a data transformation library.

library(reshape2)
df.molten <- melt(df, value.name="Count", variable.name="Year", na.rm=TRUE)
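
What "long format" means can be sketched in a few lines of plain Python. This is only an illustration of the idea, not a reimplementation of reshape2: the function, the column names and the numbers are assumptions made up for the example, and the output is ordered row by row, whereas melt orders it column by column.

```python
def melt(rows, id_vars, variable_name="Year", value_name="Count"):
    """Turn wide rows (one column per year) into long records (one row per year)."""
    long_rows = []
    for row in rows:
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column in id_vars or value is None:
                continue  # skip id columns; drop missing values, like na.rm = TRUE
            long_rows.append({**ids, variable_name: column, value_name: value})
    return long_rows


# Made-up numbers, for illustration only.
wide = [
    {"Country.Name": "A", "Category": "AAA", "X2012": 3, "X2013": 5},
    {"Country.Name": "B", "Category": "BBB", "X2012": None, "X2013": 2},
]
long = melt(wide, id_vars=("Country.Name", "Category"))
for record in long:
    print(record)
```

Each year column becomes its own record, which is the shape a plotting library needs to put Year on one axis and Count on the other.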
13. Plot the graph with ggplot2, one panel per category.
library(ggplot2)

# one bar per year, with facet_wrap drawing a separate panel for each Category
ggplot(df.molten, aes(x = Year, y = Count)) + geom_bar(stat = "identity") + facet_wrap(~ Category)



14. Some useful information retrieved from the data:
a) The number of refugees is increasing every year.
b) There has been a huge rise in the number of refugees in European countries in the last few years.
c) Jordan, Pakistan, Iran and Germany have the most refugees.
d) The number of refugees in Sweden is increasing at an alarming pace, but less so than in the last few years.
e) Most countries are fairly constant in the number of refugees they admit, especially the Czech Republic, Greece, India and China.
f) Among European countries, Germany (country code DEU) has the highest number of refugees.
g) Refugees are very unevenly distributed across Europe: some countries have fewer than 1000, while others have very high numbers.
h) The United States also has a large refugee population, second only to Germany (excluding Middle Eastern states).
i) There has been a sudden rise in the number of refugees, especially in Gaza, Syria, Canada and Britain.
j) Iran, Zimbabwe, Saudi Arabia and Ghana have seen a big decline in the past few years.
k) The number of refugees in Saudi Arabia, UAE, Russia and Qatar is alarmingly low.
l) The number of refugees in Europe is rising. Germany, France, the United Kingdom, Sweden and Turkey lead in number of refugees.

Wednesday 24 June 2015

Playing with Tumblr API

1. Install oauth2
2. Install pytumblr
3. Register an application
https://www.tumblr.com/oauth/apps
4. Get your Tumblr OAuth credentials; you will receive them once you create the app

5. Enter your credentials in the following code in a Python file
import pytumblr

client = pytumblr.TumblrRestClient(
    '<consumer_key>',
    '<consumer_secret>',
    '<oauth_token>',
    '<oauth_secret>',
)
pytumblr is a library through which you can make calls to the Tumblr API.

6. Code to get all blogs you are following
off = 0
while True:
    my_dict = client.following(offset=off)
    res = my_dict['blogs']
    if not res:
        break  # an empty page means there are no more blogs
    for rs in res:
        print(rs['name'] + "...." + rs['title'])
    off += 20  # move on to the next page of 20
7. Number of posts liked for each blog
off = 0
like_dict = {}
while True:
    my_dict = client.blog_likes('conflatedthought.tumblr.com', offset=off)
    res = my_dict['liked_posts']
    if not res:
        break  # an empty page means there are no more liked posts
    for rs in res:
        # count one like for the blog the post came from
        if rs['blog_name'] in like_dict:
            like_dict[rs['blog_name']] += 1
        else:
            like_dict[rs['blog_name']] = 1
    off += 20
for the_key, the_value in like_dict.items():
    print(the_key, 'corresponds to', the_value)
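
Both loops page through results 20 at a time. A minimal sketch of that pattern as a reusable generator that stops at the first empty page is shown here. The helper names and the fake data are made up for illustration, and the stand-in fetch function replaces the real API call so the example runs without credentials.

```python
from collections import Counter


def paginate(fetch_page, page_size=20):
    """Yield items from fetch_page(offset) until it returns an empty page."""
    offset = 0
    while True:
        page = fetch_page(offset)
        if not page:
            return
        yield from page
        offset += page_size


def count_likes_by_blog(fetch_page):
    """Count liked posts per source blog."""
    return Counter(post["blog_name"] for post in paginate(fetch_page))


# Stand-in for the API: made-up posts served 20 at a time.
FAKE_POSTS = [{"blog_name": "staff"}] * 25 + [{"blog_name": "engineering"}] * 3


def fake_fetch(offset):
    return FAKE_POSTS[offset:offset + 20]


likes = count_likes_by_blog(fake_fetch)
for blog, count in likes.items():
    print(blog, "corresponds to", count)
```

With the real client from step 5, fetch_page could be lambda off: client.blog_likes('conflatedthought.tumblr.com', offset=off)['liked_posts'].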
8. Sample output for the code in step 6
sportspage....Sports Page
themobilemovement....The Mobile Movement
adidasfootball....adidas Football
instagram-engineering....Instagram Engineering
soccerdotcom....SOCCER.COM
sony....Sony on Tumblr
yahoolabs....Yahoo Labs
taylorswift....Taylor Swift
beyonce....Beyoncé | I Am
itscalledfutbol....Did someone say "futbol"?
futbolarte....Futbol Arte
fcyahoo....FC Yahoo
yahooscreen....Yahoo Screen
yahoo....Yahoo
engineering....Tumblr Engineering
yahoodevelopers....Yahoo Developer Network
mongodb....The MongoDB Community Blog
yahooeng....Yahoo Engineering
marissamayr....Marissa's Tumblr
staff....Tumblr Staff

whoagurt....Whoagurt
narendra-modi....Narendra Modi
nytvideo....New York Times Video
bonjovi-is-my-life....Bon Jovi♥ Is My Life
etsy....Etsy
game-of-thrones....You win or you die.
seinfeld....Seinfeld
itunes....iTunes
teamindiacricket....Team India
gameofthrones....Game of Thrones: Cast A Large Shadow
forzaibra....Forza Ibra