Get links to all articles from site with python
Hi, everybody!
In this post I will tell you about one small Python script. Recently I needed to get links to all the articles from my blog. After some Google searching I found a few pointers for this task. I used Python with the BeautifulSoup package for HTML parsing, the re module for regular expressions, and urllib3 for HTTP requests.
Here it is, my script. I will be happy if you find it useful.
from bs4 import BeautifulSoup
import urllib3
import re

links = []
site = "https://alextech18.blogspot.com"
# The pattern omits the scheme because Blogspot, for example,
# can serve links with both http and https.
pattern = "://alextech18.blogspot.com"

def getLinks(url, pattern, html_only=True):
    http = urllib3.PoolManager()
    html_page = http.request('GET', url)
    soup = BeautifulSoup(html_page.data, features="html.parser")
    for link in soup.findAll('a', attrs={'href': re.compile(pattern)}):
        if link.get('href') not in links:
            if html_only:
                # Keep only article pages, which end with ".html" on Blogspot.
                if ".html" == link.get('href')[-5:]:
                    links.append(link.get('href'))
            else:
                links.append(link.get('href'))
    print(len(links))
    return links

def main():
    getLinks(site, pattern)
    print("----------Links from main page-----------")
    print(links)
    # Appending to links inside getLinks while iterating over it
    # means newly discovered pages are also visited.
    for link in links:
        print("----------LINK-----------")
        print(link)
        getLinks(link, pattern)
    print(links)

if __name__ == "__main__":
    main()
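To see why the pattern leaves out the scheme, here is a minimal sketch using only the re module (the URLs below are made up for illustration): it reproduces the script's two filters, the scheme-less pattern match and the ".html" suffix check.

import re

# Same pattern as in the script: no "http"/"https" prefix, so both schemes match.
pattern = re.compile("://alextech18.blogspot.com")

urls = [
    "https://alextech18.blogspot.com/2019/01/post.html",    # matches, article page
    "http://alextech18.blogspot.com/2019/01/other.html",    # http also matches
    "https://alextech18.blogspot.com/feeds/posts/default",  # matches, but not .html
    "https://example.com/page.html",                        # different site, no match
]

# Reproduce the script's filtering: pattern match plus the ".html" suffix check.
articles = [u for u in urls if pattern.search(u) and u[-5:] == ".html"]
print(articles)

Only the two article URLs survive; the feed URL and the foreign link are dropped, which is exactly what the script does for each anchor tag on a page.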
For your convenience, all the code is available on GitHub: https://github.com/A1esandr/alextech/tree/master/src/get_links