Tuesday, 19 September 2017

Project 1: Finding words in a webpage and other details

Pruning words from a webpage

Finding the sentiment in a webpage or online portal is important for judging the reach of the webpage as well as its popularity among customers.
We will use regular expressions (regex) to pick words out of the webpage content.

The steps are as follows:
Start by looking at the HTML source of the webpage.

Step 1: Print all the necessary webpage content as text and store it as a text file.

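>>> import re
>>> # Capture the text of every title attribute, i.e. whatever sits between the quotes after 'title='.
>>> # 'page' is assumed to hold the page source downloaded earlier.
>>> res = re.findall(r'title=[\'"](.*?)[\'"]', str(page))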
>>> print(res)
['The Times of India', 'Videos', 'City', 'India', 'World', 'Business', 'Tech', 'Cricket', 'Sports', 'Entertainment', 'TV', 'Life & Style', 'Photos', 'Travel', 'Live TV', 'TIMES NEWS - RADIO', 'Modi Government', 'Yoga Day', 'GST', 'Elections 2017', 'Delhi MCD 2017', 'Brandwire', 'Yearender 2016', 'Good Governance', 'Harvey survivers return home', 'Narendra Modi poses with Aung San Suu Kyi', 'Largest wildfire in Los Angeles history', 'PM Modi attends BRICS Summit 2017', 'Journey to Mecca', 'Texans refuse to leave pets behind', 'Biden works to help kids cope with flood trauma', 'CTE: How repeated head blows affect the brain', 'Richmond mulling fate of confederate monuments', 'Trump unveils rough outline of tax cut package', 'Irma lashes Puerto Rico, leaves Barbuda devastated', 'Texas: Woman in handcuffs steals police car', '', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil', 'Honeypreet Insan', 'Rahul Gandhi', 'Blue Whale Challenge', 'India China standoff', 'Yogi Adityanath', 'Arvind Kejriwal', 'Nitish Kumar', 'AIADMK news', 'Narendra Modi', 'Post', 'Facebook', 'Google', 'Email', '{{:user.points}} Points', '{{:name}}', '{{:name}}', '{{:name}}', '{{:abstime}}', 'Follow {{:user.name}} {{:user.follower_text}}', 'Toggle Replies', 'Toggle Replies', 'Up Vote', 'Down Vote', 'Mark as offensive', 'Already marked as offensive', 'Wordsmith', 'Man inappropriately touches women, held', 'Man inappropriately touches women, held', 'Parineeti reveals how she injured her foot', 'Parineeti reveals how she injured her foot', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Cop arrested for taking bribe of Rs 5000 in UP\\', 'Cop arrested for taking bribe of Rs 5000 in UP\\', '7 things you may not know about Miss Multinational PHL 2017 Sophia Senoron', '7 things you may not know about Miss Multinational 
PHL 2017 Sophia Senoron', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'Shilpa Shetty spotted on a dinner date with hubby Raj Kundra', 'Shilpa Shetty spotted on a dinner date with hubby Raj Kundra', 'Sanjay Dutt and Maanayata spotted on a dinner outing with their twins Shahraan and Iqra', 'Sanjay Dutt and Maanayata spotted on a dinner outing with their twins Shahraan and Iqra', 'Shahid Kapoor celebrates Mira Rajput\\', 'Shahid Kapoor celebrates Mira Rajput\\', 'Billions of dead trees force US fire crews to shift tactics', 'Billions of dead trees force US fire crews to shift tactics', 'Trump hails Kuwait\\', 'Trump hails Kuwait\\', 'Yamaha Fascino Miss Diva 2017 final auditions: Bikini Round', 'Yamaha Fascino Miss Diva 2017 final auditions: Bikini Round', 'Unveiling the Yamaha Fascino Miss Diva 2017', 'Unveiling the Yamaha Fascino Miss Diva 2017', 'First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions', 'First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions', 'TOI Sports\\', 'TOI Sports\\', 'Mumbai: Month on, Siddhi Sai residents get new homes', 'Mumbai: Month on, Siddhi Sai residents get new homes', 'Mumbai: Three women held with stolen cellphones worth Rs 3 lakh', 'Mumbai: Three women held with stolen cellphones worth Rs 3 lakh', 'Meda Meeda Abbayi Telugu Movie Review', 'Chicken Masala Recipe: How to Make Chicken Masala', 'Chicken Biryani Recipe: How to make Chicken Biryani', 'Advertisement', 'Advertisement', 'Advertisement', 'Mithali Raj', 'Hurricane Imra', 'PU Election 2017', '1993 Mumbai blast', 'Virat Kohli', 'Abu Salem', 'Shaktipunj express', 'Gauri Lankesh', 'Mumbai blasts verdict', 'Tarun Tejpal', 'Ind vs SL T20', 'Cricket News', 'Bombay Blast Verdict Video', 'Shaktipunj Express Video', 'Rajdhani Express 
Accident Video', 'Pregnancy Calculator', 'BMI Calculator', 'Ovulation Calculator', 'How to Get Pregnant', 'Stop Hair loss', 'Blue Whale Game', 'xXx: Xander Cage', 'Weight Loss Tips', 'Bigg Boss Tamil', 'Sunny Leone Photos', 'Telugu Movie News', 'Meda Meeda Abbayi review', 'Oonchi Hai Building 2.0', 'Parineeti Chopra', 'Baadshaho box office', 'Video', 'Daddy review', 'Yuddham Sharanam review', 'Jhavi Kapoor', 'New movies', 'Sunny Leone Songs', '~name~', '~name~', '~name~', '~name~', '~name~']
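
A few titles in the output above, such as 'Trump hails Kuwait\\', appear to stop at an apostrophe, most likely because the pattern accepts either kind of quote as the closing delimiter. The following is a minimal sketch of a stricter extraction, assuming the page source is already available as a string. The names sample_html, extract_titles and save_titles are hypothetical, and the sample markup is invented for illustration.

```python
import html
import re

# Hypothetical markup standing in for the real page source.
sample_html = (
    '<a title="Life &amp; Style">a</a>'
    '<a title="Today\'s headlines">b</a>'
    "<a title='Videos'>c</a>"
    '<a title="">d</a>'
)

# (["']) remembers the opening quote and \1 demands the same one to close,
# so an apostrophe inside a double-quoted title does not end the match.
TITLE_RE = re.compile(r'''title=(["'])(.*?)\1''')


def extract_titles(source):
    """Return the non-empty title values, HTML entities decoded, in page order."""
    values = (html.unescape(m.group(2)) for m in TITLE_RE.finditer(source))
    return [v for v in values if v]


def save_titles(titles, path):
    """Write one title per line so later steps can reload the text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(titles))


titles = extract_titles(sample_html)
print(titles)
```

Decoding entities with html.unescape is a design choice here: it turns 'Life &amp; Style' into 'Life & Style' so that later word searches see the text a reader would see.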

Step 2: Finding particular words in the saved or selected text content

>>> # We wish to find 'trees' in the article titles, so every title is searched for the substring 'trees'.
>>> for w in res:
...   f = re.search('trees', w)
...   print(f)
...
None
None
None
......
<_sre.SRE_Match object; span=(17, 22), match='trees'>
<_sre.SRE_Match object; span=(17, 22), match='trees'>
None
None
......
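
Printing every result leaves the matches buried among the None lines. As a rough sketch, the search can instead collect only the titles that match, together with the position of the match. The helper name titles_containing is hypothetical, and whole-word, case-insensitive matching is an assumption about what is wanted; the sample titles are taken from the output of Step 1.

```python
import re


def titles_containing(titles, word):
    """Return (title, span) pairs for titles that contain `word` as a whole word."""
    # re.escape guards against regex metacharacters in the search word;
    # \b on both sides restricts the match to whole words.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for title in titles:
        match = pattern.search(title)
        if match:
            hits.append((title, match.span()))
    return hits


sample = [
    "Largest wildfire in Los Angeles history",
    "Billions of dead trees force US fire crews to shift tactics",
    "Journey to Mecca",
]
hits = titles_containing(sample, "trees")
print(hits)
```

The span reported for the matching title is (17, 22), the same position shown in the match objects above.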

Other useful commands, which need only built-in Python rather than NLTK itself, are:
a) To find the length (in characters) of each distinct title in the text

>>> [len(w) for w in set(res)]
[0, 23, 13, 7, 18, 41, 37, 74, 13, 12, 20, 9, 15, 6, 45, 16, 16, 47, 12, 15, 27, 42, 6, 39, 46, 14, 23, 9, 49, 62, 28, 16, 46, 8, 19, 5, 21, 5, 20, 4, 17, 51, 15, 16, 6, 12, 9, 47, 68, 64, 18, 74, 19, 17, 5, 18, 4, 14, 13, 50, 59, 16, 14, 23, 11, 8, 17, 7, 45, 11, 9, 14, 5, 20, 17, 60, 8, 11, 16, 14, 16, 63, 12, 87, 26, 43, 12, 3, 33, 52, 20, 6, 11, 13, 43, 24, 34, 31, 59, 18, 2, 13, 15, 10, 6, 24, 6, 4, 39, 9, 20, 57, 12, 15, 7, 15, 14, 12, 14, 37, 68]
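
Because a set has no fixed order, the bare list of numbers cannot be traced back to individual titles. One possible variation, sketched here with the hypothetical helper title_lengths and a few titles from Step 1, keeps each length next to its title and sorts the longest first.

```python
def title_lengths(titles):
    """Return (length, title) pairs for the distinct titles, longest first."""
    distinct = set(titles)
    # Sort by descending length, then alphabetically to make ties deterministic.
    return sorted(((len(t), t) for t in distinct), key=lambda p: (-p[0], p[1]))


sample = ["GST", "Videos", "Yoga Day", "Videos", ""]
lengths = title_lengths(sample)
print(lengths)
```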

b) To find the title(s) which end with 'ful'

>>> sorted(w for w in set(res) if w.endswith('ful'))
[]

c) To find the title(s) which end with 'l'

>>> sorted(w for w in set(res) if w.endswith('l'))
['Arvind Kejriwal', 'Bigg Boss Tamil', 'Email', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil', 'Tarun Tejpal', 'Travel']

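Note that collocations() is a method of NLTK's Text object, so calling it on a plain Python list fails:
>>> res.collocations()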
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'collocations'
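
With NLTK, the usual route would be to tokenize the titles and wrap the tokens in nltk.Text before calling collocations(). As a rough stand-in that needs no extra library, the sketch below simply counts adjacent word pairs. This is plain bigram counting, not NLTK's collocation scoring, and the helper name bigram_counts is hypothetical; the sample titles are taken from the output of Step 1.

```python
import re
from collections import Counter


def bigram_counts(titles):
    """Count adjacent lowercase word pairs, never pairing words across titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9]+", title.lower())
        counts.update(zip(words, words[1:]))
    return counts


sample = [
    "Yamaha Fascino Miss Diva 2017 final auditions: Bikini Round",
    "Unveiling the Yamaha Fascino Miss Diva 2017",
    "First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions",
]
counts = bigram_counts(sample)
print(counts.most_common(4))
```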

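Matching with startswith() is case-sensitive: no title starts with a lowercase 'a', but several start with 'A'. The same check works for any other initial letter:
>>> sorted(w for w in set(res) if w.startswith('a'))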
[]
>>> sorted(w for w in set(res) if w.startswith('A'))
['AIADMK news', 'Abu Salem', 'Advertisement', 'Already marked as offensive', 'Arvind Kejriwal']

>>> sorted(w for w in set(res) if w.startswith('B'))
['BMI Calculator', 'Baadshaho box office', 'Biden works to help kids cope with flood trauma', 'Bigg Boss Tamil', 'Billions of dead trees force US fire crews to shift tactics', 'Blue Whale Challenge', 'Blue Whale Game', 'Bombay Blast Verdict Video', 'Brandwire', 'Business']

>>> sorted(w for w in set(res) if w.startswith('C'))
['CTE: How repeated head blows affect the brain', 'Chicken Biryani Recipe: How to make Chicken Biryani', 'Chicken Masala Recipe: How to Make Chicken Masala', 'City', 'Cop arrested for taking bribe of Rs 5000 in UP\\', 'Cricket', 'Cricket News']

>>> sorted(w for w in set(res) if w.startswith('D'))
['Daddy review', 'Delhi MCD 2017', 'Down Vote']

>>> sorted(w for w in set(res) if w.startswith('E'))
['Elections 2017', 'Email', 'Entertainment']

>>> sorted(w for w in set(res) if w.startswith('F'))
['Facebook', 'First Glimpse at Yamaha Fascino Miss Diva 2017 Final Auditions', 'Follow {{:user.name}} {{:user.follower_text}}', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil']
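
Repeating the startswith() call for each letter can be avoided by grouping the titles in a single pass. The following is a minimal sketch under the assumption that grouping should ignore case; group_by_initial is a hypothetical helper, and the sample titles come from the output of Step 1.

```python
from collections import defaultdict


def group_by_initial(titles):
    """Map each uppercase first character to its sorted, distinct titles."""
    groups = defaultdict(list)
    for title in sorted(set(titles)):
        if title:  # skip the empty string, which has no first character
            groups[title[0].upper()].append(title)
    return dict(groups)


sample = ["City", "Cricket", "Down Vote", "Daddy review", "Email", "",
          "AIADMK news", "Abu Salem", "City"]
groups = group_by_initial(sample)
print(groups["A"])
```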

d) To print the first 80 entries ('titles') in the content
>>> res[:80]
['The Times of India', 'Videos', 'City', 'India', 'World', 'Business', 'Tech', 'Cricket', 'Sports', 'Entertainment', 'TV', 'Life &amp; Style', 'Photos', 'Travel', 'Live TV', 'TIMES NEWS - RADIO', 'Modi Government', 'Yoga Day', 'GST', 'Elections 2017', 'Delhi MCD 2017', 'Brandwire', 'Yearender 2016', 'Good Governance', 'Harvey survivers return home', 'Narendra Modi poses with Aung San Suu Kyi', 'Largest wildfire in Los Angeles history', 'PM Modi attends BRICS Summit 2017', 'Journey to Mecca', 'Texans refuse to leave pets behind', 'Biden works to help kids cope with flood trauma', 'CTE: How repeated head blows affect the brain', 'Richmond mulling fate of confederate monuments', 'Trump unveils rough outline of tax cut package', 'Irma lashes Puerto Rico, leaves Barbuda devastated', 'Texas: Woman in handcuffs steals police car', '', 'For first time, Pakistan admits LeT, Jaish are based on its soil', 'For the first time, Pakistan admits Jaish, LeT are based on its soil', 'Honeypreet Insan', 'Rahul Gandhi', 'Blue Whale Challenge', 'India China standoff', 'Yogi Adityanath', 'Arvind Kejriwal', 'Nitish Kumar', 'AIADMK news', 'Narendra Modi', 'Post', 'Facebook', 'Google', 'Email', '{{:user.points}} Points', '{{:name}}', '{{:name}}', '{{:name}}', '{{:abstime}}', 'Follow {{:user.name}} {{:user.follower_text}}', 'Toggle Replies', 'Toggle Replies', 'Up Vote', 'Down Vote', 'Mark as offensive', 'Already marked as offensive', 'Wordsmith', 'Man inappropriately touches women, held', 'Man inappropriately touches women, held', 'Parineeti reveals how she injured her foot', 'Parineeti reveals how she injured her foot', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Mitali Sonawane: When I was 5 years old burn incident happened to me', 'Cop arrested for taking bribe of Rs 5000 in UP\\', 'Cop arrested for taking bribe of Rs 5000 in UP\\', '7 things you may not know about Miss Multinational PHL 2017 Sophia Senoron', '7 things you may not know about Miss 
Multinational PHL 2017 Sophia Senoron', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'Lara Dutta: I do want to see somebody who is not scared to express herself', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'NC chief Farooq Abdullah opposes crackdown on separatists', 'Shilpa Shetty spotted on a dinner date with hubby Raj Kundra']
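
The slice above still contains repeated titles, an empty string and template placeholders such as '{{:name}}'. A possible clean-up step is sketched below, assuming that anything containing '{{...}}' or of the form '~name~' is a placeholder rather than real content; clean_titles and PLACEHOLDER_RE are hypothetical names.

```python
import re

# Assumed placeholder shapes: '{{...}}' anywhere in the title, or '~word~'.
PLACEHOLDER_RE = re.compile(r"\{\{.*?\}\}|~\w+~")


def clean_titles(titles):
    """Drop empty and placeholder entries, then remove repeats, keeping order."""
    kept = [t for t in titles if t and not PLACEHOLDER_RE.search(t)]
    # dict.fromkeys keeps the first occurrence of each title in original order.
    return list(dict.fromkeys(kept))


sample = ["Toggle Replies", "Toggle Replies", "{{:name}}", "~name~", "Up Vote",
          "", "Follow {{:user.name}} {{:user.follower_text}}", "Up Vote", "Wordsmith"]
cleaned = clean_titles(sample)
print(cleaned)
```

Unlike set(res), this keeps the titles in the order they appear on the page.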
