python - re.match does not restrict URLs

- February 15, 2010

I have school URLs in a table on a wiki page that lead to pages with more information. Bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I return only those that do not contain the aforementioned string. Why isn't re.match() working?

URL:

districts_page = 'https://en.wikipedia.org/wiki/list_of_school_districts_in_alabama'

Function:

import re

import requests
from bs4 import BeautifulSoup

def url_check(url):
    all_urls = []

    # proxies is defined elsewhere in the asker's script
    r = requests.get(url, proxies=proxies)
    html_source = r.text
    soup = BeautifulSoup(html_source)

    for link in soup.find_all('a'):
        if type(link.get('title')) == str:
            if re.match(link.get('title'), '(page does not exist)') == None:
                all_urls.append(link.get('href'))
            else:
                pass

    return all_urls

This does not address fixing the problem with re.match, but it may be a valid approach for you without using regex:

for link in soup.find_all('a'):
    title = link.get('title')
    if title:
        if 'page does not exist' not in title:
            all_urls.append(link.get('href'))
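
As for the re.match part of the question, three things in the posted call are likely to blame. re.match takes the pattern first and the string second, so the arguments are swapped. re.match only tests the start of the string, while the phrase sits at the end of the title. And unescaped parentheses are regex grouping characters, not literal text. A minimal sketch of the difference follows. It assumes the title text reads '(page does not exist)' as on Wikipedia red links; the sample titles and the is_missing helper are made up for illustration.

```python
import re

# Hypothetical title values of the kind found on Wikipedia links.
titles = [
    "Example County School District (page does not exist)",
    "Example City Schools",
]

PHRASE = "(page does not exist)"

def is_missing(title):
    # re.escape makes the parentheses literal;
    # re.search scans the whole string instead of only its start.
    return re.search(re.escape(PHRASE), title) is not None

# Arguments in the question's order: the title is treated as the
# pattern and the phrase as the string to test, so nothing matches.
swapped = re.match(titles[0], PHRASE)

# Correct order, but re.match still only tests the start of the title.
anchored = re.match(re.escape(PHRASE), titles[0])

# Keep only the titles that do not carry the red-link phrase.
good_titles = [t for t in titles if not is_missing(t)]
print(swapped, anchored, good_titles)
```

In url_check, the same test would replace the re.match line, for example `if not is_missing(link.get('title')):`. For a fixed phrase like this one, the plain `in` check above does the same job without regex.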
