loops - Python, if 'word' is in link print link else if '2ndword' in link print that one


So I have made a Python spider which gets all links on a given site and then prints out the one which contains 'impressum' in itself. Now I wanted to add an elif branch which prints out the link that contains 'kontakt' in itself if the one with 'impressum' is not found among the links. My code looks like this:

    import urllib
    import re
    import mechanize
    from bs4 import BeautifulSoup
    import urlparse
    import cookielib
    from urlparse import urlsplit
    from publicsuffix import PublicSuffixList

    url = "http://www.zahnarztpraxis-uwe-krause.de"

    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.set_handle_redirect(True)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    page = br.open(url, timeout=5)

    htmlcontent = page.read()
    soup = BeautifulSoup(htmlcontent)

    newurlarray = []

    # Skip image links, then collect each distinct absolute URL once
    for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
        newurl = urlparse.urljoin(link.base_url, link.url)
        if newurl not in newurlarray:
            newurlarray.append(newurl)
            #print newurl
            if 'impressum' in newurl:
                print newurl
            elif 'impressum' not in newurl and 'kontakt' in newurl:
                print newurl

And despite the if/elif, I'm getting both links in the console:

    http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html
    http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html

But what I actually need is the second one, 'kontakt', only if 'impressum' is not found.

What am I doing wrong?

You see both links because they occur in separate iterations of the for loop. A single if block only looks at a single URL, and the elif only makes sure that a single URL isn't printed twice in case it contains both "impressum" and "kontakt". It doesn't prevent more links from being printed in later iterations.
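To see this without any network access, here is a minimal sketch that replays the same if/elif over a plain list. The list `links` and the helper `per_iteration_matches` are hypothetical stand-ins for what the spider collects; they are not part of the original code.

```python
# Hypothetical stand-in for the URLs the spider collects, in page order.
links = [
    "http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html",
    "http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html",
]


def per_iteration_matches(urls):
    """Mimic the original if/elif: every URL is tested on its own."""
    matched = []
    for url in urls:
        if 'impressum' in url:
            matched.append(url)
        elif 'kontakt' in url:
            # Reached whenever *this* URL lacks 'impressum',
            # regardless of what other URLs contain.
            matched.append(url)
    return matched


for url in per_iteration_matches(links):
    print(url)
```

Both URLs come back, because the elif only compares branches for one URL at a time and knows nothing about the other iterations.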

To achieve what you want, you first have to loop over all links and decide after the loop which one to print, since you want to give precedence to "impressum" in any case. You can only know whether there is an "impressum" link after you've seen all links:

    urls = set()
    contact_keys = ["impressum", "kontakt"]  # in order of precedence
    found_contact_urls = {}
    for link in ...:
        new_url = ...
        urls.add(new_url)
        # remember the URL under the first key it matches
        for key in contact_keys:
            if key in new_url:
                found_contact_urls[key] = new_url
                break
    # after the loop: print only the highest-priority match
    for key in contact_keys:
        if key in found_contact_urls:
            print found_contact_urls[key]
            break

This code also allows you to add further fall-back strings to the list contact_keys.
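As a rough illustration of that, the same selection logic can be wrapped in a function and tried on a plain list of URLs. The function name `pick_contact_url`, the sample list `links`, and the extra fall-back key "contact" are assumptions made for this sketch, not part of the original spider.

```python
def pick_contact_url(urls, contact_keys):
    """Return the URL matching the highest-priority key, or None.

    contact_keys is ordered by precedence: earlier keys win.
    """
    found = {}
    for url in urls:
        for key in contact_keys:
            if key in url:
                # As in the snippet above, a later URL for the same key
                # overwrites an earlier one.
                found[key] = url
                break
    # Decide only after every URL has been seen.
    for key in contact_keys:
        if key in found:
            return found[key]
    return None


# Hypothetical sample data standing in for the spider's links.
links = [
    "http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html",
    "http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html",
]

print(pick_contact_url(links, ["impressum", "kontakt"]))
print(pick_contact_url(links[:1], ["impressum", "kontakt", "contact"]))
```

The first call picks the "impressum" link even though "kontakt" appears earlier in the list. The second call, which only sees the "kontakt" link, falls back to it.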

