python - re.match does not restrict URLs
I have school URLs in a table on a wiki page that lead to a page with information. The bad URLs are colored red and contain the phrase 'page does not exist' inside the 'title' attribute. I am trying to use re.match() to filter the URLs so that I only return those that do not contain the aforementioned string. Why isn't re.match() working?
URL:
districts_page = 'https://en.wikipedia.org/wiki/list_of_school_districts_in_alabama'
Function:
    import re
    import requests
    from bs4 import BeautifulSoup

    def url_check(url):
        all_urls = []
        # proxies is defined elsewhere in the asker's script
        r = requests.get(url, proxies=proxies)
        html_source = r.text
        soup = BeautifulSoup(html_source)
        for link in soup.find_all('a'):
            if type(link.get('title')) == str:
                if re.match(link.get('title'), '(page does not exist)') == None:
                    all_urls.append(link.get('href'))
                else:
                    pass
        return all_urls
This does not address fixing the problem with re.match, but it may be a valid approach without using regex:
    for link in soup.find_all('a'):
        title = link.get('title')
        if title:
            if 'page does not exist' not in title:
                all_urls.append(link.get('href'))
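As for the re.match() part of the question, three things in the posted call seem likely to get in the way: re.match() takes the pattern first and the string second, it only tests for a match at the start of the string, and unescaped parentheses are read as a regex group rather than literal characters. Below is a minimal sketch of a regex-based check under those assumptions. The helper name is_missing_page and the sample (title, href) pairs are hypothetical stand-ins for the values returned by link.get('title') and link.get('href'), and the title text '(page does not exist)' is assumed to be what the wiki puts on red links.

```python
import re

# re.escape makes the parentheses literal instead of a capture group.
MISSING_PAGE = re.compile(re.escape('(page does not exist)'))


def is_missing_page(title):
    """Return True if the title text marks a page that does not exist."""
    # search() scans the whole string; match() would only test position 0.
    return MISSING_PAGE.search(title) is not None


# Hypothetical (title, href) pairs standing in for scraped <a> tags.
sample_links = [
    ('Autauga County School System', '/wiki/Autauga_County_School_System'),
    ('Example School District (page does not exist)',
     '/w/index.php?title=Example_School_District&action=edit&redlink=1'),
    (None, '/wiki/Main_Page'),  # link without a title attribute
]

# Keep only links that have a string title and are not red links,
# mirroring the type check in the question's function.
all_urls = [
    href
    for title, href in sample_links
    if isinstance(title, str) and not is_missing_page(title)
]
print(all_urls)
```

Inside the original loop, the same test would replace the re.match() line, for example `if not is_missing_page(link.get('title')):`. For a fixed phrase like this one, the plain `in` check from the answer above is arguably simpler and does the same job.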