loops - Python: if 'word' is in a link, print the link; else if '2ndword' is in a link, print that one
I have made a Python spider that gets all the links on a given site and prints out the one that contains 'impressum'. I wanted to add an elif branch that prints out the link that contains 'kontakt' if the one with 'impressum' is not found among the links. My code looks like this:
    import urllib
    import re
    import mechanize
    from bs4 import BeautifulSoup
    import urlparse
    import cookielib
    from urlparse import urlsplit
    from publicsuffix import PublicSuffixList

    url = "http://www.zahnarztpraxis-uwe-krause.de"

    # Set up the mechanize browser
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.set_handle_equiv(False)
    br.set_handle_redirect(True)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    page = br.open(url, timeout=5)

    htmlcontent = page.read()
    soup = BeautifulSoup(htmlcontent)

    newurlarray = []

    # Collect every link once and print the ones that match
    for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
        newurl = urlparse.urljoin(link.base_url, link.url)
        if newurl not in newurlarray:
            newurlarray.append(newurl)
            #print newurl
            if 'impressum' in newurl:
                print newurl
            elif 'impressum' not in newurl and 'kontakt' in newurl:
                print newurl
Despite the if/elif, I'm getting both links in the console:
    http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html
    http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html
But in the real situation I need the second one ('kontakt') only if 'impressum' is not found.
What am I doing wrong?
You see both links because they occur in separate iterations of the for loop. A single if block looks at a single URL, and the elif only makes sure that a single URL isn't printed twice in case it contains both "impressum" and "kontakt". It doesn't prevent more links from being printed in later iterations.
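To make the per-iteration behaviour concrete, here is a minimal sketch that replaces the spider with a hard-coded list. The names links and printed are assumptions for illustration only, and the two URLs are the ones from the question's output. print is called with a single argument so the snippet behaves the same under Python 2 and Python 3.

```python
# Hypothetical stand-in for the links the spider collects.
links = [
    "http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html",
    "http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html",
]

printed = []
for url in links:
    # The if/elif is evaluated anew for every URL, so it cannot
    # know whether some other URL contains 'impressum'.
    if 'impressum' in url:
        printed.append(url)
    elif 'kontakt' in url:
        printed.append(url)

for url in printed:
    print(url)
```

Both URLs end up in printed, because each one satisfies one of the two branches in its own iteration.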
To achieve what you want, you first have to loop over all the links and decide only after the loop which one to print, since you want to give precedence to "impressum" in any case. You can only know whether there is an "impressum" link after you've seen all the links:
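    urls = set()
    contact_keys = ["impressum", "kontakt"]  # in order of precedence
    found_contact_urls = {}

    for link in ...:
        new_url = ...
        urls.add(new_url)
        # Remember the URL under the first key it matches
        for key in contact_keys:
            if key in new_url:
                found_contact_urls[key] = new_url
                break

    # After the loop: print only the match with the highest precedence
    for key in contact_keys:
        if key in found_contact_urls:
            print found_contact_urls[key]
            break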
This code also allows you to add further fall-back strings to the list contact_keys.
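As a self-contained way to try the idea without the spider, the same logic can be wrapped in a function and fed a plain list of URLs. This is only a sketch: the function name pick_contact_url and the sample list links are hypothetical and not part of the original code. As in the loop above, a later match for the same key overwrites an earlier one.

```python
def pick_contact_url(urls, contact_keys=("impressum", "kontakt")):
    """Return the URL matching the highest-precedence key, or None."""
    found = {}
    for url in urls:
        for key in contact_keys:
            if key in url:
                # A later URL with the same key overwrites an earlier one.
                found[key] = url
                break
    # contact_keys is ordered by precedence, so the first hit wins.
    for key in contact_keys:
        if key in found:
            return found[key]
    return None


# Hypothetical sample data, taken from the output in the question.
links = [
    "http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html",
    "http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html",
]

print(pick_contact_url(links))      # the impressum link
print(pick_contact_url(links[:1]))  # only kontakt is available
```

Returning the URL instead of printing it inside the function keeps the selection logic easy to test separately from the crawling code.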