Tuesday 19 September 2017

Text analysis using NLTK

                         Text analysis using NLTK commands


NLTK programming forms an integral part of text analysis.

Steps are:

a) Install NLTK from the command line: pip install nltk
b) Then, in the Python interpreter, import the sample texts using the command given below (the book collection must first be downloaded once, e.g. with nltk.download('book')):
>>> from nltk.book import *
Output:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Some of the text analysis commands are:
>>> set(text1) //prints every distinct token of the text, in no particular order//
Output: {'ash', 'islands', 'extinct', 'Ordinaire', 'Terrible', 'mutually', 'lengthen', 'since', 'nick', 'stature', 'DISSECTION', 'boat', 'Unconsciously', 'contenting', 'Plum', 'Humane', 'membranes', 'necessitated', 'Dorchester', 'Unappalled', 'sufficiently', 'invunerable', 'touchy', 'Bad',........
>>> sorted(text1)  //prints all tokens of the text in alphabetical order//
Output: [..., 'quiver', 'quivered', 'quivering', 'quivers', 'quoggy', 'quohogs', 'quoin', 'quoins', 'quote', 'quoted', 'raal', 'rabble', 'rabid', 'race', 'raced', 'races', 'racing', 'rack', 'racket', 'radiance', 'radiant', 'radiates', 'radiating', 'radical', 'rafted', 'rafters', 'rafts', 'rag',......

>>> len(text1)  //gives the total number of tokens in text1//
260819

>>> len(set(text1))  //gives the number of distinct tokens (the vocabulary size) of text1//
19317
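
Dividing the two gives the lexical diversity of the text; a quick sketch:

>>> len(text1) / len(set(text1)) //roughly 13.5: each distinct token is used about 13 times on average//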

>>> text1.collocations()  //lists pairs of words that occur together unusually often (bigram collocations)//
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand

>>> text2.count('wrong')  //counts the occurrences of a word in the whole text//
22
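
A raw count is easier to interpret as a percentage of all tokens; a quick sketch:

>>> 100 * text2.count('wrong') / len(text2) //the share of tokens that are 'wrong', in percent//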

>>> text2.concordance('right') //shows every occurrence of the word, each in its surrounding line of context//
Displaying 25 of 32 matches:
ttendants . No one could dispute her right to come ; the house was her husband
ded at the time . Had he been in his right senses , he could not have thought o
 own expenses ." " I believe you are right , my love ; it will be better that t
wood , " I believe you are perfectly right . My father certainly could mean not
hich in general direct him perfectly right ." Marianne was afraid of offending
mmonly moderate , as to leave her no right of objection on either point ; and ,
 one else . Every thing he did , was right . Every thing he said , was clever .
our conjectures may be , you have no right to repeat them ." " I never had any
s . Mrs . Jennings sat on Elinor ' s right hand ; and they had not been long se

>>> text2.similar('good') //displays words that appear in similar contexts to 'good'//
large short long young great much comfortable kind quiet pretty
charming in respectable as to house one that time thing

>>> text4.index('the') //finds the index of the first occurrence of the word 'the'//
4
>>> text5[:50] //prints the first 50 tokens of text5//
['now', 'im', 'left', 'with', 'this', 'gay', 'name', ':P', 'PART', 'hey', 'everyone', 'ah', 'well', 'NICK', ':', 'U7', 'U7', 'is', 'a', 'gay', 'name', '.', '.', 'ACTION', 'gives', 'U121', 'a', 'golf', 'clap', '.', ':)', 'JOIN', 'hi', 'U59', '26', '/', 'm', '/', 'ky', 'women', 'that', 'are', 'nice', 'please', 'pm', 'me', 'JOIN', 'PART', 'there', 'ya']

>>> text5[1200:1268] //prints tokens from index 1200 up to, but not including, index 1268//
['U116', 'PART', 'U7', 'PART', 'there', 'is', 'not', '!', 'heyy', 'U148', 'i', 'hate', 'you', '.', 'boys', 'are', 'naughtier', 'U92', '.', 'JOIN', 'bye', 'U148', 'Hmm', 'you', 'I', 'hate', 'you', 'say', '..', 'Guess', 'what', 'PART', 'i', 'hate', 'you', 'U121', 'fuck', 'your', 'ugly', 'JOIN', 'if', 'i', 'had', 'a', 'daughter', 'she', 'would', 'regret', 'me', 'bein', 'her', 'dad', 'huh', '?', 'Hmm', 'PART', 'What', '?', 'aw', 'U115', 'whys', 'that', 'deep', 'inside', 'U121', 'wants', 'what', 'she']

>>> text6[-60:-20] //prints tokens counted from the end of the text, from index -60 up to -20//
['s', 'an', 'offensive', 'weapon', ',', 'that', 'is', '.', 'OFFICER', '#', '2', ':', 'Come', 'on', '.', 'Back', 'with', "'", 'em', '.', 'Back', '.', 'Right', '.', 'Come', 'along', '.', 'INSPECTOR', ':', 'Everything', '?', '[', 'squeak', ']', 'OFFICER', '#', '1', ':', 'All', 'right']

>>> text7[1234] //the token at index 1234 in text7//
'the'
>>> text9[13900]
'.'
>>> ' '.join(['Raj', 'Krish', 'Arnie', 'Suze']) //joining a list of words into a single string//
'Raj Krish Arnie Suze'

>>> 'All that goes well ends well'.split() //splitting a line into a list of words//
['All', 'that', 'goes', 'well', 'ends', 'well']

>>> 'Are'+' '+'you'+' '+'feeling'+' '+'well'+'?' //concatenating strings//
'Are you feeling well?'

Finding specific words in text using NLTK and regex

                         Sentiments in texts using Regular Expressions

Sentiments are expressed as words in the texts.

The following steps use NLTK together with regular expressions to find sentiment-bearing words in the text.
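
The steps below assume that re and FreqDist are in scope; a minimal setup sketch (FreqDist may already be available after from nltk.book import *):

>>> import re
>>> from nltk import FreqDist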

Step 1:

>>> f=FreqDist(w for w in set(text1) if re.search('^ash.*ed$', w) or re.search('^dis.*', w) or re.search('^sh[oa][ck].*', w) or re.search('^neg.*', w))
>>> sorted(f)
['dis', 'disable', 'disabled', 'disadvantage', 'disaffection', 'disagreeable', 'disappearance', 'disappeared', 'disappearing', 'disappears', 'disappointed', 'disaster', 'disasters', 'disastrous', 'disbands', 'disbelief', 'discerned', 'discernible', 'discernment', 'discerns', 'discharge', 'discharged', 'discharges', 'discharging', 'disciple', 'disciples', 'discipline', 'disclosed', 'disclosures', 'discolour', 'discoloured', 'discomforts', 'disconnected', 'discount', 'discourse', 'discourseth', 'discoursing', 'discover', 'discovered', 'discoverer', 'discoverers', 'discoveries', 'discovering', 'discovery', 'discreditably', 'discreet', 'discreetly', 'discretion', 'discriminating', 'discrimination', 'disdain', 'disdained', 'disease', 'disembowelled', 'disembowelments', 'disencumber', 'disengaged', 'disentangling', 'disgorge', 'disguise', 'disguisement', 'disguises', 'disgust', 'disgusted', 'dish', 'disheartening', 'dishes', 'dishonour', 'disincline', 'disinfecting', 'disintegrate', 'disinterested', 'disinterred', 'disjointedly', 'disks', 'dislike', 'dislocated', 'dislocation', 'dislodged', 'dismal', 'dismally', 'dismantled', 'dismasted', 'dismasting', 'dismay', 'dismember', 'dismembered', 'dismemberer', 'dismembering', 'dismemberment', 'dismissal', 'dismissed', 'disobedience', 'disobey', 'disobeying', 'disorder', 'disordered', 'disorderliness', 'disorderly', 'disorders', 'disparagement', 'dispel', 'dispensed', 'dispenses', 'dispersed', 'dispirited', 'dispirits', 'displaced', 'display', 'displayed', 'displays', 'disport', 'disposed', 'disposing', 'disposition', 'disproved', 'dispute', 'disputes', 'disputing', 'disquietude', 'disrated', 'disreputable', 'dissatisfaction', 'dissect', 'dissemble', 'dissembling', 'dissent', 'dissertations', 'dissimilar', 'dissociated', 'dissolutions', 'dissolve', 'dissolved', 'distance', 'distances', 'distant', 'distantly', 'distended', 'distension', 'distilled', 'distinct', 'distinction', 'distinctions', 'distinctive', 'distinctly', 'distinguish', 'distinguished', 'distinguishing', 'distortions', 'distracted', 'distraction', 'distress', 'distressed', 'distributed', 'district', 'districts', 'distrust', 'distrusted', 'distrustful', 'distrusting', 'disturb', 'disturbing', 'negations', 'negative', 'negatived', 'negatively', 'neglect', 'neglected', 'negro', 'negroes', 'shake', 'shaken', 'shakes', 'shaking', 'shock', 'shocked', 'shocking', 'shocks']

Step 2:

>>> f=FreqDist(w for w in set(text1) if re.search('^ash.*ed$', w) or re.search('^dis[achglomt][gsproueba].*', w) or re.search('^dis[mo][ab].*', w) or re.search('^sh[oa][ck].*', w) or re.search('^neg[al].*', w))
>>> sorted(f)
['disable', 'disabled', 'disagreeable', 'disappearance', 'disappeared', 'disappearing', 'disappears', 'disappointed', 'disaster', 'disasters', 'disastrous', 'discerned', 'discernible', 'discernment', 'discerns', 'discolour', 'discoloured', 'discomforts', 'disconnected', 'discount', 'discourse', 'discourseth', 'discoursing', 'discover', 'discovered', 'discoverer', 'discoverers', 'discoveries', 'discovering', 'discovery', 'discreditably', 'discreet', 'discreetly', 'discretion', 'discriminating', 'discrimination', 'disgorge', 'disguise', 'disguisement', 'disguises', 'disgust', 'disgusted', 'disheartening', 'dishes', 'dishonour', 'dislocated', 'dislocation', 'dislodged', 'dismal', 'dismally', 'dismantled', 'dismasted', 'dismasting', 'dismay', 'dismember', 'dismembered', 'dismemberer', 'dismembering', 'dismemberment', 'disobedience', 'disobey', 'disobeying', 'disorder', 'disordered', 'disorderliness', 'disorderly', 'disorders', 'distance', 'distances', 'distant', 'distantly', 'distended', 'distension', 'distortions', 'distracted', 'distraction', 'distress', 'distressed', 'distributed', 'district', 'districts', 'distrust', 'distrusted', 'distrustful', 'distrusting', 'disturb', 'disturbing', 'negations', 'negative', 'negatived', 'negatively', 'neglect', 'neglected', 'shake', 'shaken', 'shakes', 'shaking', 'shock', 'shocked', 'shocking', 'shocks']

Step 3:

>>> f=FreqDist(w for w in set(text1) if re.search('^ash.*ed$', w) or re.search('^dis[acom][gsproueba].*', w) or re.search('^distr[ue].*', w) or re.search('^dis[ghmo][abou].*', w) or re.search('^sh[oa][ck].*', w) or re.search('^neg[al].*', w))
>>> sorted(f)
['disable', 'disabled', 'disagreeable', 'disappearance', 'disappeared', 'disappearing', 'disappears', 'disappointed', 'disaster', 'disasters', 'disastrous', 'discerned', 'discernible', 'discernment', 'discerns', 'discolour', 'discoloured', 'discomforts', 'disconnected', 'discount', 'discourse', 'discourseth', 'discoursing', 'discover', 'discovered', 'discoverer', 'discoverers', 'discoveries', 'discovering', 'discovery', 'discreditably', 'discreet', 'discreetly', 'discretion', 'discriminating', 'discrimination', 'disgorge', 'disguise', 'disguisement', 'disguises', 'disgust', 'disgusted', 'dishonour', 'dismal', 'dismally', 'dismantled', 'dismasted', 'dismasting', 'dismay', 'dismember', 'dismembered', 'dismemberer', 'dismembering', 'dismemberment', 'disobedience', 'disobey', 'disobeying', 'disorder', 'disordered', 'disorderliness', 'disorderly', 'disorders', 'distress', 'distressed', 'distrust', 'distrusted', 'distrustful', 'distrusting', 'negations', 'negative', 'negatived', 'negatively', 'neglect', 'neglected', 'shake', 'shaken', 'shakes', 'shaking', 'shock', 'shocked', 'shocking', 'shocks']

Step 4:

>>> f=FreqDist(w for w in set(text1) if re.search('^ash.*ed$', w) or re.search('^disa[gp]*(?!p)(?!e)(?!o).*', w) or re.search('^distr[ue].*', w) or re.search('^dis[ghmo][abou].*', w) or re.search('^sh[oa][ck].*', w) or re.search('^neg[al].*(?!o).*', w))
>>> sorted(f)
['disable', 'disabled', 'disadvantage', 'disaffection', 'disagreeable', 'disaster', 'disasters', 'disastrous', 'disgorge', 'disguise', 'disguisement', 'disguises', 'disgust', 'disgusted', 'dishonour', 'dismal', 'dismally', 'dismantled', 'dismasted', 'dismasting', 'dismay', 'disobedience', 'disobey', 'disobeying', 'distress', 'distressed', 'distrust', 'distrusted', 'distrustful', 'distrusting', 'negations', 'negative', 'negatived', 'negatively', 'neglect', 'neglected', 'shake', 'shaken', 'shakes', 'shaking', 'shock', 'shocked', 'shocking', 'shocks']
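
The final set of patterns can be bundled into a small helper for reuse on other texts; a sketch (the name find_sentiment_words is ours, not part of the original session):

>>> patterns = ['^ash.*ed$', '^disa[gp]*(?!p)(?!e)(?!o).*', '^distr[ue].*', '^dis[ghmo][abou].*', '^sh[oa][ck].*', '^neg[al].*']
>>> def find_sentiment_words(text):
...     return sorted(w for w in set(text) if any(re.search(p, w) for p in patterns))
...
>>> find_sentiment_words(text1)[:5]
['disable', 'disabled', 'disadvantage', 'disaffection', 'disagreeable']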

Project 1: Finding words in a webpage and other details

                                     Pruning words from a webpage

Finding sentiments in a webpage or online portal is important for judging the reach of the webpage as well as its popularity among customers.
We will use a combination of regular expressions (regex) to sort words from the webpage content.

Steps followed are as follows:
Start by looking at the HTML source of the webpage.

Step 1: Print all the necessary webpage content as text and store it somewhere as a text file (sketches of fetching the page and saving the result are shown below).
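
The extraction below assumes a variable page holding the raw HTML of the site. A minimal sketch of how it might be fetched (the URL is our assumption, based on the titles that follow; some sites may also require a custom User-Agent header):

>>> import re
>>> from urllib.request import urlopen
>>> page = urlopen('https://timesofindia.indiatimes.com/').read() //raw HTML bytes; str(page) below searches their printable form//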

>>> res = re.findall(r'title=[\'"](.*?)[\'"]', str(page)) //extracts the value of every title='...' or title="..." attribute on the page//
>>> print(res)
['The Times of India', 'Videos', 'City', 'India', 'World', 'Business', 'Tech', 'Cricket', 'Sports', 'Entertainment', 'TV', 'Life & Style', 'Photos', 'Travel', 'Live TV', 'TIMES NEWS - RADIO', 'Modi Government', 'Yoga Day', 'GST', 'Elections 2017', 'Delhi MCD 2017', 'Brandwire', 'Yearender 2016', 'Good Governance', 'Harvey survivers return home', 'Narendra Modi poses with Aung San Suu Kyi', 'Largest wildfire in Los Angeles history', 'PM Modi attends BRICS Summit 2017', 'Journey to Mecca', 'Texans refuse to leave pets behind', 'Biden works to help kids cope with flood trauma', 'CTE: How repeated head blows affect the brain', 'Richmond mulling fate of confederate monuments', 'Trump unveils rough outline of tax cut package', 'Irma lashes Puerto Rico, leaves Barbuda devastated', 'Texas: Woman in handcuffs steals police car', '', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil', 'Honeypreet Insan', 'Rahul Gandhi', 'Blue Whale Challenge', 'India China standoff', 'Yogi Adityanath', 'Arvind Kejriwal', 'Nitish Kumar', 'AIADMK news', 'Narendra Modi', 'Post', 'Facebook', 'Google', 'Email', '{{:user.points}} Points', '{{:name}}', '{{:name}}', '{{:name}}', '{{:abstime}}', 'Follow {{:user.name}} {{:user.follower_text}}', 'Toggle Replies', 'Toggle Replies', 'Up Vote', 'Down Vote', 'Mark as offensive', 'Already marked as offensive', 'Wordsmith', 'Man inappropriately touches women, held', 'Man inappropriately touches women, held', 'Parineeti reveals how she injured her foot', 'Parineeti reveals how she injured her foot', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Cop arrested for taking bribe of Rs 5000 in UP\\', 'Cop arrested for taking bribe of Rs 5000 in UP\\', '7 things you may not know about Miss Multinational PHL 2017 Sophia Senoron', '7 things you may not know about Miss Multinational PHL 2017 Sophia Senoron', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'Shilpa Shetty spotted on a dinner date with hubby Raj Kundra', 'Shilpa Shetty spotted on a dinner date with hubby Raj Kundra', 'Sanjay Dutt and Maanayata spotted on a dinner outing with their twins Shahraan and Iqra', 'Sanjay Dutt and Maanayata spotted on a dinner outing with their twins Shahraan and Iqra', 'Shahid Kapoor celebrates Mira Rajput\\', 'Shahid Kapoor celebrates Mira Rajput\\', 'Billions of dead trees force US fire crews to shift tactics', 'Billions of dead trees force US fire crews to shift tactics', 'Trump hails Kuwait\\', 'Trump hails Kuwait\\', 'Yamaha Fascino Miss Diva 2017 final auditions: Bikini Round', 'Yamaha Fascino Miss Diva 2017 final auditions: Bikini Round', 'Unveiling the Yamaha Fascino Miss Diva 2017', 'Unveiling the Yamaha Fascino Miss Diva 2017', 'First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions', 'First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions', 'TOI Sports\\', 'TOI Sports\\', 'Mumbai: Month on, Siddhi Sai residents get new homes', 'Mumbai: Month on, Siddhi Sai residents get new homes', 'Mumbai: Three women held with stolen cellphones worth Rs 3 lakh', 'Mumbai: Three women held with stolen cellphones worth Rs 3 lakh', 'Meda Meeda Abbayi Telugu Movie Review', 'Chicken 
Masala Recipe: How to Make Chicken Masala', 'Chicken Biryani Recipe: How to make Chicken Biryani', 'Advertisement', 'Advertisement', 'Advertisement', 'Mithali Raj', 'Hurricane Imra', 'PU Election 2017', '1993 Mumbai blast', 'Virat Kohli', 'Abu Salem', 'Shaktipunj express', 'Gauri Lankesh', 'Mumbai blasts verdict', 'Tarun Tejpal', 'Ind vs SL T20', 'Cricket News', 'Bombay Blast Verdict Video', 'Shaktipunj Express Video', 'Rajdhani Express Accident Video', 'Pregnancy Calculator', 'BMI Calculator', 'Ovulation Calculator', 'How to Get Pregnant', 'Stop Hair loss', 'Blue Whale Game', 'xXx: Xander Cage', 'Weight Loss Tips', 'Bigg Boss Tamil', 'Sunny Leone Photos', 'Telugu Movie News', 'Meda Meeda Abbayi review', 'Oonchi Hai Building 2.0', 'Parineeti Chopra', 'Baadshaho box office', 'Video', 'Daddy review', 'Yuddham Sharanam review', 'Jhavi Kapoor', 'New movies', 'Sunny Leone Songs', '~name~', '~name~', '~name~', '~name~', '~name~']
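
Step 1 also calls for storing the extracted titles as a text file; a minimal sketch (the filename titles.txt is an assumption):

>>> with open('titles.txt', 'w', encoding='utf-8') as out:
...     out.write('\n'.join(res))
...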

Step 2: Finding particular words in the saved or selected text content

>>> for w in res: //we wish to find 'trees' in the article titles, so each title is searched for the substring 'trees'//
...   f = re.search('trees', w)
...   print(f)
...
None
None
None
......
<_sre.SRE_Match object; span=(17, 22), match='trees'>
<_sre.SRE_Match object; span=(17, 22), match='trees'>
<_sre.SRE_Match object; span=(17, 22), match='trees'>
<_sre.SRE_Match object; span=(17, 22), match='trees'>
None
None
......
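
Printing None for every non-matching title is noisy; a list comprehension keeps only the matches (a quick sketch):

>>> [w for w in res if re.search('trees', w)] //just the titles that matched above//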

Other useful commands are:
a) To find the length of each distinct title in the list

>>> [len(w) for w in set(res)]
[0, 23, 13, 7, 18, 41, 37, 74, 13, 12, 20, 9, 15, 6, 45, 16, 16, 47, 12, 15, 27, 42, 6, 39, 46, 14, 23, 9, 49, 62, 28, 16, 46, 8, 19, 5, 21, 5, 20, 4, 17, 51, 15, 16, 6, 12, 9, 47, 68, 64, 18, 74, 19, 17, 5, 18, 4, 14, 13, 50, 59, 16, 14, 23, 11, 8, 17, 7, 45, 11, 9, 14, 5, 20, 17, 60, 8, 11, 16, 14, 16, 63, 12, 87, 26, 43, 12, 3, 33, 52, 20, 6, 11, 13, 43, 24, 34, 31, 59, 18, 2, 13, 15, 10, 6, 24, 6, 4, 39, 9, 20, 57, 12, 15, 7, 15, 14, 12, 14, 37, 68]

b) To find the word(s) which end with 'ful'

>>> sorted(w for w in set(res) if w.endswith('ful'))
[]

c) To find the word(s) which end with 'l'
>>> sorted(w for w in set(res) if w.endswith('l'))
['Arvind Kejriwal', 'Bigg Boss Tamil', 'Email', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil', 'Tarun Tejpal', 'Travel']

>>> res.collocations() //fails: collocations() is a method of nltk.Text, not of a plain Python list//
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'collocations'
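
The method can be recovered by wrapping the tokens in nltk.Text; a sketch (the titles are split into words first, since collocations are computed over word tokens):

>>> import nltk
>>> tokens = [tok for title in res for tok in title.split()]
>>> nltk.Text(tokens).collocations()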

>>> sorted(w for w in set(res) if w.startswith('a')) //empty because the titles are capitalised//
[]
>>> sorted(w for w in set(res) if w.startswith('A'))
['AIADMK news', 'Abu Salem', 'Advertisement', 'Already marked as offensive', 'Arvind Kejriwal']

>>> sorted(w for w in set(res) if w.startswith('B'))
['BMI Calculator', 'Baadshaho box office', 'Biden works to help kids cope with flood trauma', 'Bigg Boss Tamil', 'Billions of dead trees force US fire crews to shift tactics', 'Blue Whale Challenge', 'Blue Whale Game', 'Bombay Blast Verdict Video', 'Brandwire', 'Business']

>>> sorted(w for w in set(res) if w.startswith('C'))
['CTE: How repeated head blows affect the brain', 'Chicken Biryani Recipe: How to make Chicken Biryani', 'Chicken Masala Recipe: How to Make Chicken Masala', 'City', 'Cop arrested for taking bribe of Rs 5000 in UP\\', 'Cricket', 'Cricket News']

>>> sorted(w for w in set(res) if w.startswith('D'))
['Daddy review', 'Delhi MCD 2017', 'Down Vote']

>>> sorted(w for w in set(res) if w.startswith('E'))
['Elections 2017', 'Email', 'Entertainment']

>>> sorted(w for w in set(res) if w.startswith('F'))
['Facebook', 'First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions', 'Follow {{:user.name}} {{:user.follower_text}}', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil']

d) Print the first 80 'words'/'titles' in the content
>>> res[:80]
['The Times of India', 'Videos', 'City', 'India', 'World', 'Business', 'Tech', 'Cricket', 'Sports', 'Entertainment', 'TV', 'Life &amp; Style', 'Photos', 'Travel', 'Live TV', 'TIMES NEWS - RADIO', 'Modi Government', 'Yoga Day', 'GST', 'Elections 2017', 'Delhi MCD 2017', 'Brandwire', 'Yearender 2016', 'Good Governance', 'Harvey survivers return home', 'Narendra Modi poses with Aung San Suu Kyi', 'Largest wildfire in Los Angeles history', 'PM Modi attends BRICS Summit 2017', 'Journey to Mecca', 'Texans refuse to leave pets behind', 'Biden works to help kids cope with flood trauma', 'CTE: How repeated head blows affect the brain', 'Richmond mulling fate of confederate monuments', 'Trump unveils rough outline of tax cut package', 'Irma lashes Puerto Rico, leaves Barbuda devastated', 'Texas: Woman in handcuffs steals police car', '', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil', 'Honeypreet Insan', 'Rahul Gandhi', 'Blue Whale Challenge', 'India China standoff', 'Yogi Adityanath', 'Arvind Kejriwal', 'Nitish Kumar', 'AIADMK news', 'Narendra Modi', 'Post', 'Facebook', 'Google', 'Email', '{{:user.points}} Points', '{{:name}}', '{{:name}}', '{{:name}}', '{{:abstime}}', 'Follow {{:user.name}} {{:user.follower_text}}', 'Toggle Replies', 'Toggle Replies', 'Up Vote', 'Down Vote', 'Mark as offensive', 'Already marked as offensive', 'Wordsmith', 'Man inappropriately touches women, held', 'Man inappropriately touches women, held', 'Parineeti reveals how she injured her foot', 'Parineeti reveals how she injured her foot', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Cop arrested for taking bribe of Rs 5000 in UP\\', 'Cop arrested for taking bribe of Rs 5000 in UP\\', '7 things you may not know about Miss Multinational PHL 2017 Sophia Senoron', '7 things you may not know about Miss Multinational PHL 2017 Sophia Senoron', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'Shilpa Shetty spotted on a dinner date with hubby Raj Kundra']

NLTK Programming-Statistics (Part 2)

                                              NLTK Programming Statistics

Some of the statistical programming commands are as follows:

>>> f = [w for w in text1 if len(w) > 14 and w.startswith('a')] //words longer than 14 characters that start with 'a'//
>>> print(f)
['apprehensiveness', 'authoritatively', 'apprehensiveness', 'archiepiscopacy', 'amphitheatrical', 'apprehensiveness', 'apprehensiveness']

>>> set(f)
{'apprehensiveness', 'amphitheatrical', 'archiepiscopacy', 'authoritatively'}

>>> f.most_common(5) //fails: most_common() is a FreqDist method and f here is a plain list//
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'most_common'
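
Wrapping the list in a FreqDist restores the method; a quick sketch:

>>> FreqDist(f).most_common(1)
[('apprehensiveness', 4)]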

>>> f = [w for w in text3 if len(w) > 12 and w.endswith('ed')] //words longer than 12 characters that end with 'ed'//
>>> set(f)
{'uncircumcised'}

>>> f = [w for w in text3 if len(w) > 12 and w.istitle] //note: w.istitle (without parentheses) is always truthy, so this effectively filters on length alone//
>>> set(f)
{'plenteousness', 'womenservants', 'uncircumcised', 'EleloheIsrael', 'Zaphnathpaaneah', 'righteousness', 'interpretations', 'sheepshearers', 'interpretation', 'threshingfloor', 'Jegarsahadutha'}

>>> f = [w for w in text3 if 15 >= len(w) > 12 and w.islower()]
>>> set(f)
{'plenteousness', 'womenservants', 'uncircumcised', 'righteousness', 'interpretations', 'sheepshearers', 'interpretation', 'threshingfloor'}

>>> f= FreqDist(len(w) for w in text2 if len(w) > 14)
>>> print(f)
<FreqDist with 3 samples and 29 outcomes>
>>> set(f)
{16, 17, 15}
>>> f.plot()
The resulting plot (not reproduced here) depicts the count of words at each of these lengths.

>>> f.most_common(20)
[(15, 24), (17, 3), (16, 2)]

>>> f= FreqDist(w for w in text2 if len(w) > 14)
>>> print(f)
<FreqDist with 20 samples and 29 outcomes>
>>> set(f)
{'disinterestedness', 'conscientiously', 'disrespectfully', 'congratulations', 'misconstruction', 'representations', 'proportionately', 'companionableness', 'enfranchisement', 'unobtrusiveness', 'inquisitiveness', 'dissatisfaction', 'inconsiderately', 'disappointments', 'acknowledgments', 'disqualifications', 'instantaneously', 'connoisseurship', 'misapprehension', 'incomprehensible'}

>>> f.most_common(20) //the top 20 most common words by number of repetitions//
[('misapprehension', 4), ('disappointments', 3), ('acknowledgments', 3), ('incomprehensible', 2), ('congratulations', 2), ('conscientiously', 1), ('representations', 1), ('inconsiderately', 1), ('instantaneously', 1), ('inquisitiveness', 1), ('dissatisfaction', 1), ('disinterestedness', 1), ('companionableness', 1), ('proportionately', 1), ('disqualifications', 1), ('connoisseurship', 1), ('misconstruction', 1), ('enfranchisement', 1), ('disrespectfully', 1), ('unobtrusiveness', 1)]

>>> f.max() //the most frequent sample in the distribution//
'misapprehension'

>>> f.tabulate() //tabulates the words against their repetition counts//
  misapprehension   disappointments   acknowledgments  incomprehensible   congratulations   conscientiously   representations   inconsiderately   instantaneously   inquisitiveness   dissatisfaction disinterestedness companionableness   proportionately disqualifications   connoisseurship   misconstruction   enfranchisement   disrespectfully   unobtrusiveness
                4                 3                 3                 2                 2                 1                 1                 1                 1                 1                 1                 1                 1                 1                 1                 1                 1                 1                 1                 1
>>> f.plot()
>>> f.plot(cumulative=True) //the same plot with cumulative counts//

Comparison of Naive Bayes, Semi-Supervised and ensemble models

 Comparison of Naive Bayes, Semi-Supervised & Ensemble methods on output

Naive Bayes, decision tree, ensemble and semi-supervised classifiers are fitted to the same data here so their outputs can be compared.

Python program:
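
This section assumes the same iris setup used in the sections below; a condensed sketch, including the imports that the classifier definitions need:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib.colors import ListedColormap
>>> from sklearn import datasets, naive_bayes, tree
>>> iris = datasets.load_iris()
>>> x = iris.data[:, :2]
>>> y = iris.target
>>> h = .02
>>> cmap_bold = ListedColormap(['firebrick', 'lime', 'blue'])
>>> cmap_light = ListedColormap(['pink', 'lightgreen', 'paleturquoise'])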
>>> clf101=naive_bayes.BernoulliNB()
>>> clf102=tree.DecisionTreeClassifier()
>>> clf103=tree.ExtraTreeClassifier()

>>> from sklearn import ensemble
>>> clf104=ensemble.ExtraTreesClassifier()
>>> clf105=naive_bayes.GaussianNB()

>>> from sklearn import semi_supervised
>>> clf106=semi_supervised.LabelPropagation()
>>> clf107=semi_supervised.LabelSpreading()

>>> for clf in [clf101, clf102, clf103, clf104, clf105, clf106]: //clf107 (LabelSpreading) is defined above but not plotted//
...     clf.fit(x, y)
...     x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...     y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...     xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...     z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     z = z.reshape(xx.shape)
...     plt.figure()
...     plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...     plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...     plt.xlim(xx.min(), xx.max())
...     plt.ylim(yy.min(), yy.max())
...     plt.title("(clf='%s')" %(clf))
...

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

ExtraTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, random_state=None,
          splitter='random')

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

GaussianNB(priors=None)

LabelPropagation(alpha=None, gamma=20, kernel='rbf', max_iter=1000, n_jobs=1,
         n_neighbors=7, tol=0.001)

(One decision-surface figure per classifier; the plots are not reproduced here.)

Comparison of Linear models

                                Linear Models output accuracy

Linear models are tested here to compare the outputs they produce on the same data.

Python program:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib.colors import ListedColormap
>>> from sklearn import neighbors, datasets
>>> n_neighbors = 24
>>> iris = datasets.load_iris()
>>> x = iris.data[:, :2]
>>> y = iris.target
>>> h = .02
>>> cmap_bold = ListedColormap(['firebrick', 'lime', 'blue'])


>>> cmap_light = ListedColormap(['pink', 'lightgreen', 'paleturquoise'])

//Defining different linear models//
>>> from sklearn import linear_model
>>> clf86 = linear_model.SGDClassifier()
>>> clf87 = linear_model.SGDRegressor()
>>> clf88 = linear_model.PassiveAggressiveRegressor()
>>> clf89 = linear_model.LinearRegression()
>>> clf90 = linear_model.Ridge()
>>> clf91 = linear_model.RidgeCV()
>>> clf92 = linear_model.Lasso()
>>> clf93 = linear_model.LassoLars()
>>> clf94 = linear_model.ElasticNet()

//Calling and plotting them//
>>> for clf in [clf86, clf87, clf88, clf89, clf90, clf91, clf92, clf93, clf94]:
...     clf.fit(x, y)
...     x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...     y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...     xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...     z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     z = z.reshape(xx.shape)
...     plt.figure()
...     plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...     plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...     plt.xlim(xx.min(), xx.max())
...     plt.ylim(yy.min(), yy.max())
...     plt.title("Regressor (clf='%s')" %(clf))
...

Output:

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

SGDRegressor(alpha=0.0001, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', max_iter=5, n_iter=None, penalty='l2',
       power_t=0.25, random_state=None, shuffle=True, tol=None, verbose=0,
       warm_start=False)

PassiveAggressiveRegressor(C=1.0, average=False, epsilon=0.1,
              fit_intercept=True, loss='epsilon_insensitive', max_iter=5,
              n_iter=None, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

RidgeCV(alphas=(0.1, 1.0, 10.0), cv=None, fit_intercept=True, gcv_mode=None,
    normalize=False, scoring=None, store_cv_values=False)

Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

LassoLars(alpha=1.0, copy_X=True, eps=2.2204460492503131e-16,
     fit_intercept=True, fit_path=True, max_iter=500, normalize=True,
     positive=False, precompute='auto', verbose=False)

ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

Comparison of different Neighbor Classifiers & Regressors

                                               Neighbors Techniques

Neighbors-based classifiers and regressors are tested here to compare the outputs they produce on the same data.

Python program:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib.colors import ListedColormap
>>> from sklearn import neighbors, datasets
>>> n_neighbors = 24
>>> iris = datasets.load_iris()
>>> x = iris.data[:, :2]
>>> y = iris.target
>>> h = .02
>>> cmap_bold = ListedColormap(['firebrick', 'lime', 'blue'])


>>> cmap_light = ListedColormap(['pink', 'lightgreen', 'paleturquoise'])

//Defining different neighbors and discriminant analysis techniques//
>>> from sklearn import neighbors, tree
>>> clf14 = neighbors.NearestNeighbors()
>>> clf15 = neighbors.KNeighborsClassifier()
>>> clf16 = neighbors.RadiusNeighborsClassifier()
>>> clf17 = neighbors.RadiusNeighborsRegressor()
>>> clf18 = neighbors.KNeighborsRegressor()
>>> clf19 = neighbors.NearestCentroid()
>>> from sklearn import discriminant_analysis
>>> clf20 = discriminant_analysis.LinearDiscriminantAnalysis()
>>> clf21 = discriminant_analysis.QuadraticDiscriminantAnalysis()

//Plotting techniques (clf14 and clf16 are skipped: NearestNeighbors has no predict method, and RadiusNeighborsClassifier can fail where a grid point has no neighbours within the radius)//
>>> for clf in [clf15, clf17, clf18, clf19, clf20, clf21]:
...     clf.fit(x, y)
...     x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...     y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...     xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...     z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     z = z.reshape(xx.shape)
...     plt.figure()
...     plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...     plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...     plt.xlim(xx.min(), xx.max())
...     plt.ylim(yy.min(), yy.max())
...     plt.title("Regressor (clf='%s')" %(clf))
...

Output:

(One decision-surface figure per technique; the plots are not reproduced here.)

Comparison of different clustering techniques

                                                 Clustering Techniques

Clustering techniques are tested here to compare the cluster regions they produce on the same data.

Python program:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib.colors import ListedColormap
>>> from sklearn import neighbors, datasets
>>> n_neighbors = 24
>>> iris = datasets.load_iris()
>>> x = iris.data[:, :2]
>>> y = iris.target
>>> h = .02
>>> cmap_bold = ListedColormap(['firebrick', 'lime', 'blue'])
>>> cmap_light = ListedColormap(['pink', 'lightgreen', 'paleturquoise'])

//Defining different clustering techniques//
>>> from sklearn import cluster
>>> clf51 = cluster.KMeans()
>>> clf52 = cluster.AffinityPropagation()
>>> clf53 = cluster.MeanShift()
>>> clf54 = cluster.SpectralClustering()
>>> clf55 = cluster.AgglomerativeClustering()
>>> clf56 = cluster.DBSCAN()
>>> clf57 = cluster.Birch()
>>> from sklearn import mixture
>>> clf58 = mixture.GaussianMixture()

//Plotting the output (clf54, clf55 and clf56 are skipped because SpectralClustering, AgglomerativeClustering and DBSCAN have no predict method)//

>>> for clf in [clf51, clf52, clf53, clf57, clf58]:
...     clf.fit(x, y)
...     x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...     y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...     xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...     z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     z = z.reshape(xx.shape)
...     plt.figure()
...     plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...     plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...     plt.xlim(xx.min(), xx.max())
...     plt.ylim(yy.min(), yy.max())
...     plt.title("Regressor (clf='%s')" %(clf))
...

Output:

(One cluster-assignment figure per technique; the plots are not reproduced here.)

Gaussian Process Regressor: Effect of parameters

                Gaussian Process Regressor-Effect of parameters on output 

The Gaussian Process Regressor is tested here to see how its parameters affect the output.


Python program:


>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from matplotlib.colors import ListedColormap
>>> from sklearn import neighbors, datasets, gaussian_process
>>> n_neighbors = 24
>>> iris = datasets.load_iris()
>>> x = iris.data[:, :2]
>>> y = iris.target
>>> h = .02
>>> cmap_bold = ListedColormap(['firebrick', 'lime', 'blue'])
>>> cmap_light = ListedColormap(['pink', 'lightgreen', 'paleturquoise'])

//Plotting the analysis//
a) Effect of alpha:

>>> for alpha in [0.0001, 0.0005, 0.001, 0.05, 0.1, 0.2, 0.5]:
...         clf = gaussian_process.GaussianProcessRegressor(alpha=alpha)
...         clf.fit(x, y)
...         x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...         y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...         xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...         z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...         z = z.reshape(xx.shape)
...         plt.figure()
...         plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...         plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...         plt.xlim(xx.min(), xx.max())
...         plt.ylim(yy.min(), yy.max())
...         plt.title("GaussianProcessRegressor (alpha='%s')" %(alpha))
...

(One figure per alpha value; the plots are not reproduced here.)

>>> for alpha in [0.75, 1, 2, 5, 15, 25, 50, 100, 250, 500, 1000]:
...         clf = gaussian_process.GaussianProcessRegressor(alpha=alpha)
...         clf.fit(x, y)
...         x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...         y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...         xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...         z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...         z = z.reshape(xx.shape)
...         plt.figure()
...         plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...         plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...         plt.xlim(xx.min(), xx.max())
...         plt.ylim(yy.min(), yy.max())
...         plt.title("GaussianProcessRegressor (alpha='%s')" %(alpha))
...
(Plots for the larger alpha values are not reproduced here.)

b) Effect of number of optimizer restarts (n_restarts_optimizer):

>>> for n_restarts_optimizer in [1, 2, 5, 25, 50, 250, 750, 2500, 5000]:
...         clf = gaussian_process.GaussianProcessRegressor(n_restarts_optimizer=n_restarts_optimizer)
...         clf.fit(x, y)
...         x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...         y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...         xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...         z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...         z = z.reshape(xx.shape)
...         plt.figure()
...         plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...         plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...         plt.xlim(xx.min(), xx.max())
...         plt.ylim(yy.min(), yy.max())
...         plt.title("GaussianProcessRegressor (n_restarts_optimizer='%s')" %(n_restarts_optimizer))
...
(One figure per n_restarts_optimizer value; the plots are not reproduced here.)

c) Effect of normalization of y (normalize_y):

>>> for normalize_y in [False, True]:
...         clf = gaussian_process.GaussianProcessRegressor(normalize_y=normalize_y)
...         clf.fit(x, y)
...         x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...         y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...         xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...         z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...         z = z.reshape(xx.shape)
...         plt.figure()
...         plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...         plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...         plt.xlim(xx.min(), xx.max())
...         plt.ylim(yy.min(), yy.max())
...         plt.title("GaussianProcessRegressor (normalize_y='%s')" %(normalize_y))
...

No effect on output.

d) Effect of random state (random_state):

>>> for random_state in [1, 2, 5, 25, 100, 250]:
...         clf = gaussian_process.GaussianProcessRegressor(random_state=random_state)
...         clf.fit(x, y)
...         x_min, x_max = x[:, 0].min() -1, x[:, 0].max() +1
...         y_min, y_max = x[:, 1].min() -1, x[:, 1].max() +1
...         xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
...         z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...         z = z.reshape(xx.shape)
...         plt.figure()
...         plt.pcolormesh(xx, yy, z, cmap=cmap_light)
...         plt.scatter(x[:, 0], x[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=24)
...         plt.xlim(xx.min(), xx.max())
...         plt.ylim(yy.min(), yy.max())
...         plt.title("GaussianProcessRegressor (random_state='%s')" %(random_state))
...


No effect on output.