#Big Data, #Hadoop, #Spark, #Machine Learning, #NLP, #Data Mining
Friday, 17 January 2014
Blum-Floyd-Pratt-Rivest-Tarjan algorithm
Algorithm to find kth largest element from an unsorted array in linear time.
Step 1: If the number of elements in an array is large .Divide the array into arrays each with 5 elements. n/5 arrays.
Step2 : Compute Median of these 5 elements.
This you can do by sorting each group.
as there are 5 elements in Each group sorting takes 5log5. For n/5 groups in total time taken is , n/5*5log5 ~ o(n)
Step 3: Collect Median of all in array .
Step 4: Recursively compute median of this array.
Step5: Use this Median as Pivot in QuickSelect
Median > half of these n/5 group medians. So median > n/10 , also each median in respective group is greater then half element so in total median is greater then
3n/10 element. So in total median is greater then 3n/10 and lesser then 3n/10 element
So worst case of algorithm will obey T(n) O(n)+ T(n/5)+ T(7n/10).
Resulting in tn = O(n)
IF N IS LARGE ENOUGH DEVIDE THE LIST INTO N/5 LISTS OF 5 ELEMENTS
public void floydAlgorithm(int []list_of_elements,int k){
//for n(array length) large enough in the range of 30 elements
m = n/5; //no of lists
int meA[] = new int[m];
for(int i=1 to m){
meA[i] = Median_of_5_Lists(5i-4...5i);
MOM = floydAlgorithm(meA,m/2);
r = Partition(list_of_elements,MOM);
if(k<r) return floydAlgorithm(A[1...r-1],k);
else if(k>r)return floydAlgorithm(A[r+1..n],k-r);
else return mom;
}
}
Sunday, 12 January 2014
Supervised and Unsupervised Machine Learning
What is supervised Machine Learning ?
Take input as data and generates model that predicts response to new data.
The data can be :
1)Classification : Samples belonging from different classes the algorithm learns from already labeled data to predict the class of unlabled data. It can have values 0 and 1 .An example of classification problem will be to predict weather tumor is Malignant or benign .
2)Regression : If desired output has one or more contiues value the task is called regression.Miles per gallon for a car.
What is unsupervised Machine Learning ?
Input data consist of set of vectors with out any corresponding target value.Goal is to determine set of similar examples within set of data. Or to determine distribution of data.called density estimation or reducing the dimensionality of data.
* taken from http://bioinformatics.oxfordjournals.org/content/24/6/783/F1.expansion.html
Steps for Supervised machine Learning :
1.Prepare data
2.Choose ALgorithm
3.Fit Model.
4.Validation Model
5.Use it for predication
Various learning algorithm :
Classification Trees
Regression Trees
Discriminant Analysis (classification)
K-Nearest Neighbors (classification)
Naive Bayes (classification)
Classification or Regression Ensembles
Classification or Regression Ensembles in Parallel
Take input as data and generates model that predicts response to new data.
The data can be :
1)Classification : Samples belonging from different classes the algorithm learns from already labeled data to predict the class of unlabled data. It can have values 0 and 1 .An example of classification problem will be to predict weather tumor is Malignant or benign .
2)Regression : If desired output has one or more contiues value the task is called regression.Miles per gallon for a car.
What is unsupervised Machine Learning ?
Input data consist of set of vectors with out any corresponding target value.Goal is to determine set of similar examples within set of data. Or to determine distribution of data.called density estimation or reducing the dimensionality of data.
* taken from http://bioinformatics.oxfordjournals.org/content/24/6/783/F1.expansion.html
Steps for Supervised machine Learning :
1.Prepare data
2.Choose ALgorithm
3.Fit Model.
4.Validation Model
5.Use it for predication
Various learning algorithm :
Classification Trees
Regression Trees
Discriminant Analysis (classification)
K-Nearest Neighbors (classification)
Naive Bayes (classification)
Classification or Regression Ensembles
Classification or Regression Ensembles in Parallel
Saturday, 11 January 2014
Probabilistically Bounded Staleness for Practical Partial Quorums : Summery
I recently gave a presentation on paper by PeterBailis and his friends at University of California Berkley
http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
Unlike other papers does not propose a consistency model rather the paper talks about a probabilistic model to predict the staleness of data . They introduce term called Probabilistic bounded staleness to provide bounds on staleness of data wrt both version and clock time. I found the paper pretty interesting and I feel that because of the move more and more towards eventual consistency the question "How Eventual and How Consistent " makes more sense as it will help the users to choose the best available read and write quorum configuration for there system. Or in words it provides a lens for analyzing, improving, and predicting the behavior of existing, widely deployed systems.
Quorum System : A quorum system is a collection of sets where all pairs of sets intersect.
A (strict) quorum System Q* over a universe U is a set system over U such that for every
Q1, Q2 Є Q, Q1 П Q0 ≠ Ф Each Q Є Q is called a quorum.
N – No of nodes that store Replica
W – No of Replicas acknowledge the receipt of the update before the update completes
R – the number of replicas that are contacted when a data object is accessed through a read operation
For Strong Consistency : R+W > N (Read ,Write set Overlap)
For Eventual Consistency : R+W =< N
* Probabilistic Quorum Systems : Dahlia Malkhi
Probabilistic Quorum Systems
Relax Quorum Intersection Property
Hold with High Probability 1 – ε (ε small Number)
Scaling no of replicas increase probability
A quorum system obeys PBS k-staleness consistency with probability 1 –psk where psk is the probability of non-intersection with one of the last k independent quorums.
N = 3, R = W = 1
Probability of returning a version
Within 2 versions is 0.5
Within 3 versions is 0.703
Within 5 versions > 0.868
Within 10 versions > 0.98
t-visibility is the probability that a read operation, starting t seconds after a write commits will observe the latest value of data.
t- inconsistency window.
Assumption : reads occur instantaneously
and writes commit immediately after W replicas there is no delay acknowledging the write to the coordinating node.
A quorum system obeys PBS <k-t> -staleness consistency if, with probability 1- pskt, at least one value in any read quorum will be within k versions of the latest committed version when the read begins, provided the read begins t units of time after the previous k versions commit.
* http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
Unlike other papers does not propose a consistency model rather the paper talks about a probabilistic model to predict the staleness of data . They introduce term called Probabilistic bounded staleness to provide bounds on staleness of data wrt both version and clock time. I found the paper pretty interesting and I feel that because of the move more and more towards eventual consistency the question "How Eventual and How Consistent " makes more sense as it will help the users to choose the best available read and write quorum configuration for there system. Or in words it provides a lens for analyzing, improving, and predicting the behavior of existing, widely deployed systems.
Quorum System : A quorum system is a collection of sets where all pairs of sets intersect.
A (strict) quorum System Q* over a universe U is a set system over U such that for every
Q1, Q2 Є Q, Q1 П Q0 ≠ Ф Each Q Є Q is called a quorum.
N – No of nodes that store Replica
W – No of Replicas acknowledge the receipt of the update before the update completes
R – the number of replicas that are contacted when a data object is accessed through a read operation
For Strong Consistency : R+W > N (Read ,Write set Overlap)
For Eventual Consistency : R+W =< N
* Probabilistic Quorum Systems : Dahlia Malkhi
Probabilistic Quorum Systems
Relax Quorum Intersection Property
Hold with High Probability 1 – ε (ε small Number)
Scaling no of replicas increase probability
N = 3, R = W = 1
Probability of returning a version
Within 2 versions is 0.5
Within 3 versions is 0.703
Within 5 versions > 0.868
Within 10 versions > 0.98
t-visibility is the probability that a read operation, starting t seconds after a write commits will observe the latest value of data.
t- inconsistency window.
Assumption : reads occur instantaneously
and writes commit immediately after W replicas there is no delay acknowledging the write to the coordinating node.
A quorum system obeys PBS <k-t> -staleness consistency if, with probability 1- pskt, at least one value in any read quorum will be within k versions of the latest committed version when the read begins, provided the read begins t units of time after the previous k versions commit.
* http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
Subscribe to:
Posts (Atom)