loops - Python: if 'word' is in link, print link; else if '2ndword' is in link, print that one

- March 15, 2013

I have made a Python spider that gets all the links on a given site and prints out the one that contains 'impressum'. I wanted to add an elif branch that prints out the link that contains 'kontakt' if the one with 'impressum' is not found among the links. My code looks like this:

    import urllib
    import re
    import mechanize
    from bs4 import BeautifulSoup
    import urlparse
    import cookielib
    from urlparse import urlsplit
    from publicsuffix import PublicSuffixList

    url = "http://www.zahnarztpraxis-uwe-krause.de"

    # Set up a mechanize browser with a cookie jar and a browser-like user agent
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.set_handle_redirect(True)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    page = br.open(url, timeout=5)

    htmlcontent = page.read()
    soup = BeautifulSoup(htmlcontent)

    newurlarray = []

    # Walk over all links whose text does not contain "IMG"
    for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
        newurl = urlparse.urljoin(link.base_url, link.url)
        if newurl not in newurlarray:
            newurlarray.append(newurl)
            #print newurl
            if 'impressum' in newurl:
                print newurl
            elif 'impressum' not in newurl and 'kontakt' in newurl:
                print newurl

Despite the if/elif, I'm getting both links in the console:

    http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html
    http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html

But what I actually need is the second one ('kontakt') only if 'impressum' is not found.

What am I doing wrong?

You see both links because they occur in separate iterations of the for loop. A single if block looks at a single URL, and the elif only makes sure that a single URL isn't printed twice in case it contains both "impressum" and "kontakt". It doesn't prevent more links from being printed in later iterations.
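The effect can be reproduced without any network access. The following is a minimal sketch, assuming Python 3 and a hard-coded list standing in for the URLs that the mechanize loop produces; the function name `printed_by_per_link_check` is made up for illustration. It collects what a per-link if/elif would print:

```python
def printed_by_per_link_check(urls):
    """Return the URLs a per-link if/elif would print, in order."""
    printed = []
    for url in urls:
        # Each pass through the loop sees exactly one URL and knows
        # nothing about the URLs checked before or after it.
        if 'impressum' in url:
            printed.append(url)
        elif 'kontakt' in url:
            printed.append(url)
    return printed


# Stand-in for the links found on the page (the two URLs from the question's output)
sample_urls = [
    "http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html",
    "http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html",
]

for url in printed_by_per_link_check(sample_urls):
    print(url)
```

Both URLs come back, because the 'kontakt' link is handled in an iteration where the 'impressum' link has not been seen yet.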

To achieve what you want, you first have to loop over all the links and decide after the loop which one to print, since you want to give precedence to "impressum" in any case. You can only know whether there is an "impressum" link after you've seen all the links:

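    urls = set()
    contact_keys = ["impressum", "kontakt"]  # in order of precedence
    found_contact_urls = {}
    for link in ...:
        new_url = ...
        urls.add(new_url)
        # Remember the URL under the first key it matches
        for key in contact_keys:
            if key in new_url:
                found_contact_urls[key] = new_url
                break
    # After all links have been seen, print the match with the highest precedence
    for key in contact_keys:
        if key in found_contact_urls:
            print found_contact_urls[key]
            break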

This code also allows you to add further fall-back strings to the list contact_keys.
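Filled in with concrete data, the approach could look like the sketch below. It assumes Python 3 and a plain list of URLs in place of the mechanize loop; the helper name `pick_contact_url` and the `index.html` sample URL are hypothetical and not part of the original code.

```python
def pick_contact_url(urls, contact_keys=("impressum", "kontakt")):
    """Return the URL matching the highest-priority key, or None if nothing matches."""
    found_contact_urls = {}
    for url in urls:
        for key in contact_keys:
            if key in url:
                # Store the URL under the first key it matches; a later URL
                # matching the same key overwrites this entry.
                found_contact_urls[key] = url
                break
    # Decide only after every URL has been seen; earlier keys take precedence.
    for key in contact_keys:
        if key in found_contact_urls:
            return found_contact_urls[key]
    return None


base = "http://www.zahnarztpraxis-uwe-krause.de/pages/"
both = [base + "kontakt.html", base + "impressum.html"]
only_kontakt = [base + "index.html", base + "kontakt.html"]  # hypothetical page list

print(pick_contact_url(both))          # the impressum link wins
print(pick_contact_url(only_kontakt))  # falls back to the kontakt link
```

Returning the chosen URL instead of printing it inside the loop keeps the decision in one place and makes the fall-back order easy to check.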
