Simple Web Crawler using Python


Here, I am writing a simple Python program that crawls a website and collects all of its anchor links. The work is split between two functions, masterCrawl(url) and get_sub_links(item_href).

__author__ = 'sureshkumarmukhiya'
# Program to crawl http://kagbeni.com/ and print all the links from that website

import requests
from bs4 import BeautifulSoup


def masterCrawl(url):
    # Fetch the page and parse its HTML
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    # Loop over every anchor tag on the page
    for link in soup.findAll('a'):
        href = link.get('href')
        title = link.string

        print(title)
        print(href)

        # Some anchors have no href attribute at all; skip them
        if href is None:
            continue

        # Follow absolute links and print the links found on those pages
        if href.startswith('http'):
            get_sub_links(href)


def get_sub_links(item_href):
    # Fetch the linked page and parse its HTML
    source_code = requests.get(item_href)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')

    # Print the href of every anchor found on the sub-page
    for link in soup.findAll('a'):
        href = link.get('href')
        print(href)


masterCrawl('http://kagbeni.com/')
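
One thing to note about the program above: relative links such as '/contact/' are printed but never followed, since they do not start with 'http'. Below is a minimal sketch, using urljoin from the standard library, of how such links could be resolved against the page URL before following them; the function name crawl_with_relative_links is my own and not part of the original program.

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl_with_relative_links(url):
    # Hypothetical variant of masterCrawl that resolves relative hrefs
    # against the page URL before printing or following them
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for link in soup.findAll('a'):
        href = link.get('href')
        if href is None:
            continue
        # urljoin('http://kagbeni.com/', '/contact/') -> 'http://kagbeni.com/contact/'
        full_url = urljoin(url, href)
        print(full_url)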

I have used a few modules here, requests and BeautifulSoup (from the bs4 package), both of which can be installed with pip.
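
As a quick illustration of what BeautifulSoup does with the HTML that requests fetches, here is a small standalone sketch of parsing an HTML snippet (in the crawler, that snippet comes from requests.get(url).text). The HTML string below is made up purely for the example.

from bs4 import BeautifulSoup

# Parse a tiny HTML snippet and read the anchor's href and text
html = '<a href="http://kagbeni.com/">Kagbeni</a>'
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')
print(link.get('href'))   # prints: http://kagbeni.com/
print(link.string)        # prints: Kagbeni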

By dr.code.skm

I am a backend developer with a passion for web application development using the latest technologies like Laravel, PHP7, React, ECMAScript 6, and WordPress. I prefer spending time analyzing big data with Apache Spark. Apart from that, I do photography.