I recently wrote python code to extract most used hadoop specific keywords in the issue tracker after removing irrelevant words and stop words from the list.
I am classifying them into various classes like HDFS Hadoop Error MINING DataNode etc Some of the words that is found on the list are posted.
Click for the word list
List has approximately 4700 words .
Duplicate words were removed from the list.
The list analyzed first 200 defects from Hadoop Commons and Hadoop HDFS.
Both Summary and Description of the defects were analyzed and were selected based on their need on the defect analysis.
The stop word list is prepared by combining various lists available online like FoxStoplist.txt
stopwords-lists
I believe that this list might be useful for someone working on Language Processing for Issue related words and Hadoop Specific words.
Defect Example : HDFS : 6001
Description : When hdfs is set up with HA enable, FileSystem.getUri returns hdfs://
In documentation: This is probably ok or even intended. But a caller may further process the URI, for example, call URI.getHost(). This will return the 'mycluster', which is not a valid host anywhere.
Summary : In HDFS HA setup, FileSystem.getUri returns hdfs://
Keywords : #Hdfs #dfs.nameservices # FileSystem #getUri #Nameservices #host #URIL #HA #returns
Link doesn't work for the tech words. Thanks for the post .
ReplyDelete