Wednesday, 26 March 2014

Python and NLP

I recently worked on a project titled "Recommending Similar Defects on Apache Hadoop". It is a recommendation system that finds defects similar to a given one and then predicts an effort estimate for each defect.
Steps:
1) Extract XML/Excel data from the Apache Hadoop issue tracker.
https://issues.apache.org/jira/browse/HADOOP
2) Convert the extracted data into CSV for persistent storage.
3) Extract the required columns.


Python code:

import csv
import re

def col_selector(table, column_key):
    return [row[column_key] for row in table]

with open("Data/next.csv","r") as csvfile:
    reader = csv.DictReader(csvfile, delimiter=",")
    table = [row for row in reader]
    foo_col = col_selector(table, "Summary")
    bar_col = col_selector(table, "Description")

The above example extracts two columns, Summary and Description, from the Apache Hadoop issue tracker CSV file. Your program must import Python's standard csv module:
http://docs.python.org/2/library/csv.html

4) From these columns we generate a set of words specific to Hadoop.
We apply several NLP techniques to the summary and description to produce these words.

5) There are 5 steps in Natural Language Processing:
1. Tokenizing
2. Stemming
3. Stop Word Removal
4. Vector Space Representation
5. Similarity Measures

Step 1: Tokenizing
The tokenization process breaks a stream of text characters into words, phrases, symbols or other meaningful elements called tokens. Before indexing, we filter out all common English stopwords. I obtained a list of around 800 stopwords online:
K. Bouge. Stop Word List.
https://sites.google.com/site/kevinbouge/stopwords-lists
The list contained articles, pronouns, verbs etc. I filtered all of those words out of the extracted text. After reviewing the list, we felt that a stopword list for the Hadoop database has to be built separately, as numbers and symbols also have to be filtered out.
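
A minimal sketch of tokenizing plus stopword filtering, using only the standard library. The STOPWORDS set here is a tiny hypothetical stand-in for the roughly 800-word list, and the function names are my own for illustration. Because the pattern keeps only letters, numbers and symbols drop out as well.

```python
import re

# Hypothetical, tiny stop word set for illustration only;
# the list used in the project had around 800 entries.
STOPWORDS = {"a", "an", "the", "is", "in", "when", "with", "up", "set"}

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stopwords]

summary = "In HDFS HA setup, FileSystem.getUri returns hdfs://"
tokens = remove_stopwords(tokenize(summary))
print(tokens)
```

Note that splitting on non-letters also breaks identifiers like FileSystem.getUri into two tokens; a Hadoop-specific tokenizer might want to keep them together.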


Step 2: Stemming
Stemming tries to identify a base form for each word in the text. Words that carry the same information can appear in different grammatical forms, depending on how the author of the report wrote them down. This phase removes affixes and other components from each token produced by tokenization so that only the stem of each word remains. For stemming, we used a Python implementation of the Porter stemmer (PorterStemmer). We passed it the stream of extracted words. Words like caller, called and calling, whose stem is call, were filtered, and only one word, call, was kept in the final list. I filtered around 1200 words this way.
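
To show the idea of collapsing word forms onto one stem, here is a deliberately simplified suffix stripper. It is NOT the Porter algorithm used in the project, just a standard-library stand-in; the suffix list and function names are assumptions made for this example. In practice you would call a real stemmer such as NLTK's PorterStemmer instead of naive_stem.

```python
# Simplified stand-in for a real stemmer (not the Porter algorithm).
# The suffix list is an assumption chosen to cover the example words.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def unique_stems(words):
    """Stem every word and drop duplicate stems, preserving order."""
    return list(dict.fromkeys(naive_stem(w.lower()) for w in words))

stems = unique_stems(["caller", "called", "calling", "call"])
print(stems)
```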


Step 3: Stop Word Removal
First phase: synonym removal. Synonyms were replaced by one common word. I used WordNet through NLTK to do this.
Second phase: spell checking. The list was compared against a list of misspelled words.

Step 4: Vector Space Representation
After the first 3 steps I had around 5500 words. These words were used to identify tags. Each defect, with its tags, was then represented in a vector space model, using the general method provided by scikit-learn.

Step 5: Similarity Measures
Calculated the cosine similarity between two defect vectors.
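
The last two steps can be sketched in plain Python. This is an illustration under assumptions, not the project code: it uses raw tag counts as the vector (the project used scikit's vectorizer), and the tag lists for the two defects are made up.

```python
import math
from collections import Counter

def to_vector(tags):
    """Represent a defect as a sparse vector of tag counts."""
    return Counter(tags)

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical tag lists for two defects.
defect_a = to_vector(["hdfs", "ha", "geturi", "host"])
defect_b = to_vector(["hdfs", "ha", "namenode", "failover"])
score = cosine_similarity(defect_a, defect_b)
print(score)
```

A score near 1 means the two defects share most of their tags, and a score of 0 means they share none, so ranking defects by this score gives the recommendation list.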
 

Sunday, 2 March 2014

The Curious Case of Leonardo DiCaprio's Oscar: Sentiment Analysis

I was very excited last night for the Oscars, as Leonardo DiCaprio was among the final Best Actor nominees. He has done some brilliant movies in the past and he is a great actor, but I was not confident that this movie would get him the award, as I felt he has done much better work in other films. Still, fingers were crossed for a brilliant actor like Leonardo. I was curious to see how Twitter was reacting to the Oscars, so I did sentiment analysis on tweets to see what people's view of Leonardo was just before the ceremony: how many of them wanted him to win, and how many felt that Leonardo was not the right person for the Oscar and some other actor should win it.
Sentiment Analysis on tweets gave me interesting results.
Steps:
1. Extract tweets with the hashtag on Leonardo
2. Generate CSV of tweets
3. Extract required information
4. Natural Language Processing: tokenizing, stemming etc.
5. Classify them as positive, negative or neutral
6. Apply Naive Bayes.

Positive Tweets
RT @FindingSquishy_: If #Leonardo Di Caprio wins an Oscar tonight, Tumblr will probably break
if #Leonardo di Caprio doesn't win an oscar I am going to scream
RT @Mohammed_Meho: #Leonardo Di Caprio better win an Oscar tonight.
RT @Miralemcc: #The Wolf of the Wall Street and# Leonardo di Caprio for #Oscars2014



Negative Tweet
#Leonardo Di Caprio doesn't deserve and never has deserved an oscar. Deal with it

.............................................

STEP 1: Scraping tweets for the required tag. This can be done using the Twitter API, or you can use online sites that search tweets and extract the search results from them. There are many sites that give you sentiment analysis results directly, like the NCSU project:
http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
and the Stanford project:
Sentiment140
http://www.sentiment140.com/
But I chose TwitterSeeker, which just gives you search results without sentiments, because I wanted to do the sentiment analysis myself.
TwitterSeeker generates an Excel sheet with all the tweet information.

You can filter the results by selecting English as the language. In the image I applied no filter.
The generated Excel file has the user name, time of posting, tweet text and many other optional fields. In this case I am only concerned with the tweet.

STEP 2: Generate CSV of Tweets
I used a CSV file as the input to the ML algorithms. CSV is the Comma Separated Values format, in which columns are separated by a delimiter. After getting the Excel file from TwitterSeeker, I converted it into a CSV file.

STEP 3: Extract Required Information
This is the step where your knowledge of data mining comes into use. At present I am only concerned with one column, the tweet. A tweet is generally of the form
Username @User #tag Link
which can vary randomly.
I removed all the unnecessary words from it: all usernames, tags and links.
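
A rough sketch of this cleaning step with regular expressions. The patterns and the function name are my assumptions for illustration, and dropping the RT retweet marker is an extra I added on top of the usernames, tags and links mentioned above.

```python
import re

def clean_tweet(tweet):
    """Remove links, the RT marker, @usernames and #tags from a tweet."""
    tweet = re.sub(r"https?://\S+", " ", tweet)  # links (done first)
    tweet = re.sub(r"\bRT\b", " ", tweet)        # retweet marker (assumption)
    tweet = re.sub(r"@\w+:?", " ", tweet)        # usernames, with trailing colon
    tweet = re.sub(r"#\w+", " ", tweet)          # hashtags
    return " ".join(tweet.split())               # collapse leftover whitespace

raw = "RT @Mohammed_Meho: #Leonardo Di Caprio better win an Oscar tonight."
print(clean_tweet(raw))
```

Removing the hashtag entirely also removes the word inside it (here "Leonardo"); if that word matters for the analysis, strip only the # character instead.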


#updated every day.

STEP 4: Tag Generation
Get tags for all tweets.

STEP 5: Sentiment Analysis
For sentiment analysis I am using the ANEW dataset from the University of Florida.
Our dictionary dataset was composed of 3 main components:

Valence: the pleasantness of the stimulus
Arousal: the intensity of the provoked emotion
Dominance: the degree of control exerted by the stimulus


We decided to use the arousal ratings to estimate the polarity of a tweet. The following steps were followed:

  • Generate tags for each tweet.
  • For each word i in the tweet that exists in the arousal dictionary, extract the mean and standard deviation of valence, arousal and dominance.
  • Count the number of tags for each tweet. If there are zero or one, ignore the tweet, because there is too little information to estimate sentiment.
  • To calculate the overall mean and standard deviation for each tweet, numerically average the means and standard deviations of its n tags.
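
The averaging described in the bullets above can be sketched as follows. The RATINGS table holds placeholder numbers invented for this example, NOT real ANEW values, and its layout is an assumption. I also assume the "zero or one tags" rule counts only the tags that are found in the dictionary.

```python
from statistics import mean

# Placeholder ratings invented for illustration; NOT real ANEW values.
# Each word maps to a (mean, standard deviation) pair per dimension.
RATINGS = {
    "win":    {"valence": (8.0, 1.5), "arousal": (7.0, 2.0), "dominance": (7.0, 2.0)},
    "scream": {"valence": (3.0, 2.0), "arousal": (7.0, 2.0), "dominance": (4.0, 2.0)},
}
DIMENSIONS = ("valence", "arousal", "dominance")

def score_tweet(tags, ratings=RATINGS, min_tags=2):
    """Average the per-tag means and standard deviations of a tweet.

    Returns None when fewer than min_tags tags have ratings.
    """
    rated = [ratings[t] for t in tags if t in ratings]
    if len(rated) < min_tags:
        return None  # too little information to estimate sentiment
    return {
        dim: (mean(r[dim][0] for r in rated), mean(r[dim][1] for r in rated))
        for dim in DIMENSIONS
    }

result = score_tweet(["win", "scream", "tonight"])
print(result)
```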
STEP 6: From the list of tweets that I collected, this is what I got.








Saturday, 1 March 2014

Keyword Analysis on Apache Hadoop Issue Tracker

In my recent project titled "Recommending similar defects and Effort Estimate for Apache Hadoop Issue Tracker", I wrote Python code to extract the most used Hadoop-specific keywords in the issue tracker, after removing irrelevant words and stop words from the list.
I am classifying them into various classes like HDFS, Hadoop, Error, MINING, DataNode etc. Some of the words found in the list are posted here.
Click for the word list



The list has approximately 4700 words.
Duplicate words were removed from the list.
The list covers the first 200 defects from Hadoop Common and Hadoop HDFS.
Both the Summary and Description of each defect were analyzed, and words were selected based on their usefulness for defect analysis.
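
As a rough idea of how the most used keywords can be pulled out, here is a small standard-library sketch. It is not the project code: the stop word set is a tiny hypothetical stand-in for the combined list, and the two sample texts are the summary and the first sentence of the description of the example defect.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word set; the real list combined several lists.
STOPWORDS = {"in", "when", "is", "with", "up", "set", "the", "a"}

def keyword_counts(texts, stopwords=STOPWORDS):
    """Count non-stopword tokens across all summaries and descriptions."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

texts = [
    "In HDFS HA setup, FileSystem.getUri returns hdfs://",
    "When hdfs is set up with HA enable, FileSystem.getUri returns hdfs://",
]
counts = keyword_counts(texts)
print(counts.most_common(3))
```

Because a Counter stores each word once, taking its keys also gives the de-duplicated word list.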


The stop word list was prepared by combining various lists available online, like FoxStoplist.txt and stopwords-lists.
I believe this list might be useful for someone working on language processing for issue-related and Hadoop-specific words.
Defect Example : HDFS : 6001  
Description : When hdfs is set up with HA enable, FileSystem.getUri returns hdfs:// Here dfs.nameservices is defined when HA is enabled.
In documentation: This is probably ok or even intended. But a caller may further process the URI, for example, call URI.getHost(). This will return the 'mycluster', which is not a valid host anywhere.
Summary : In HDFS HA setup, FileSystem.getUri returns hdfs://  


Keywords : #Hdfs #dfs.nameservices # FileSystem #getUri #Nameservices #host #URIL #HA #returns