python - re.match does not restrict URLs


I have school URLs in a table on a wiki page that lead to pages with more information. The bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I only return those that do not contain the aforementioned string. Why isn't re.match() working?

URL:

districts_page = 'https://en.wikipedia.org/wiki/list_of_school_districts_in_alabama' 

Function:

    import re
    import requests
    from bs4 import BeautifulSoup

    def url_check(url):
        all_urls = []
        r = requests.get(url, proxies=proxies)
        html_source = r.text
        soup = BeautifulSoup(html_source)
        for link in soup.find_all('a'):
            if type(link.get('title')) == str:
                if re.match(link.get('title'), '(page does not exist)') == None:
                    all_urls.append(link.get('href'))
                else:
                    pass
        return all_urls

This does not address fixing the problem with re.match, but it may be a valid approach without using regex:

    for link in soup.find_all('a'):
        title = link.get('title')
        if title:
            if 'page does not exist' not in title:
                all_urls.append(link.get('href'))
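
As for why re.match() itself fails, two things seem to be at play. re.match(pattern, string) takes the pattern first and the string second, and it only matches at the beginning of the string, whereas the phrase sits at the end of the title. Below is a minimal sketch of a regex version, assuming titles shaped like the hypothetical samples shown. The names titles, MISSING and is_missing are illustrative and do not come from the original post.

```python
import re

# Hypothetical title values, modelled on how red links are labelled.
titles = [
    "Autauga County School System",
    "Example School District (page does not exist)",
]

bad_title = titles[1]

# Arguments swapped, as in the question: the title is used as the pattern,
# so it is tested against the short literal string and finds nothing.
swapped = re.match(bad_title, "(page does not exist)")

# Correct order, but re.match only tests the start of the title,
# and the phrase is at the end, so this finds nothing either.
anchored = re.match(r"\(page does not exist\)", bad_title)

# Parentheses are escaped so they match literally instead of forming a group.
MISSING = re.compile(r"\(page does not exist\)")


def is_missing(title):
    """Return True if the title marks a page that does not exist."""
    # re.search scans the whole string rather than only its start.
    return MISSING.search(title) is not None


good_titles = [t for t in titles if not is_missing(t)]
print(good_titles)
```

Inside url_check, the same test could stand in for the re.match line, for example `if not is_missing(link.get('title')):`.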
