Get links to all articles from a site with Python

Hi, everybody!

In this post I will tell you about a small Python script. Recently I needed to get links to all the articles from my blog. After some Googling I found a few pointers for this task. I used Python with the BeautifulSoup package for HTML parsing, plus the re module and urllib3 for the HTTP requests.

Here it is, my script. I will be happy if you find it useful.

from bs4 import BeautifulSoup
import urllib3
import re

links = []
site = "https://alextech18.blogspot.com"
pattern = "://alextech18.blogspot.com"

def getLinks(url, pattern, html_only=True):
    http = urllib3.PoolManager()
    html_page = http.request('GET', url)

    soup = BeautifulSoup(html_page.data, features="html.parser")

    # Match on a scheme-less pattern, because Blogspot, for example,
    # can serve links with both http and https
    for link in soup.find_all('a', attrs={'href': re.compile(pattern)}):
        href = link.get('href')
        if href not in links:
            if html_only:
                # Keep only article pages, which end with .html
                if href.endswith(".html"):
                    links.append(href)
            else:
                links.append(href)

    print(len(links))

    return links

def main():
    getLinks(site, pattern)
    print("----------Links from main page-----------")
    print(links)

    # Visit each collected page to pick up links missed on the main page
    for link in links:
        print("----------LINK-----------")
        print(link)
        getLinks(link, pattern)

    print(links)

if __name__ == "__main__":
    main()
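The link-filtering step can be tried out without any network access by parsing an inline HTML snippet. Below is a minimal sketch of that idea; the sample links and the `found` variable are made up for illustration, and only bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup
import re

# Hypothetical sample markup mimicking a Blogspot page
html = """
<a href="https://alextech18.blogspot.com/2020/01/post.html">Post</a>
<a href="http://alextech18.blogspot.com/p/about.html">About</a>
<a href="https://other.example.com/page.html">Other</a>
<a href="https://alextech18.blogspot.com/feeds/posts">Feed</a>
"""

soup = BeautifulSoup(html, features="html.parser")
pattern = "://alextech18.blogspot.com"

# The scheme-less pattern matches both http and https links to the blog;
# the .html check then drops non-article pages such as feeds
found = [a.get('href')
         for a in soup.find_all('a', attrs={'href': re.compile(pattern)})
         if a.get('href').endswith(".html")]
print(found)
```

Running this prints the two blog article links and skips the external page and the feed URL.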

For your convenience, all the code is available on GitHub: https://github.com/A1esandr/alextech/tree/master/src/get_links
