A Guide to Extracting All Links on a Website Using Python

by Kalebu Jordan October 26th, 2020
In this tutorial, you’re going to learn how to extract all links from a given website or URL using BeautifulSoup and requests.

If you’re new to web scraping, I recommend starting with a beginner tutorial on web scraping and returning to this one once you are comfortable with the basics.

We will use the requests library to get the raw HTML page from the website and then we are going to use BeautifulSoup to extract all the links from the HTML page.

Requirements

To follow along with this tutorial, you need the requests and Beautiful Soup libraries installed.

Installation

$ pip install requests
$ pip install beautifulsoup4
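
Once both packages are installed, the parsing step can be tried on its own, without any network access. The sketch below is only an illustration: the HTML snippet (SAMPLE_HTML) and the helper name (hrefs_from_html) are hypothetical and not part of the original tutorial.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/about">About</a>
  <a href="/contact">Contact</a>
  <a name="top">Anchor without href</a>
</body></html>
"""


def hrefs_from_html(html):
    """Return the href value of every <a> tag that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    # href=True skips anchors that carry no href attribute at all
    return [a['href'] for a in soup.find_all('a', href=True)]


print(hrefs_from_html(SAMPLE_HTML))
# prints ['https://example.com/about', '/contact']
```

Working on a literal string like this makes it easy to see what Beautiful Soup returns before a real website is involved.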

The script below prompts you for a website URL, uses requests to send a GET request for the HTML page, and then uses BeautifulSoup to extract every link (<a>) tag from the HTML.

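import requests
from bs4 import BeautifulSoup


def extract_all_links(site):
    """Return the href of every <a> tag found on the page at `site`."""
    # Download the page; fail fast on timeouts and HTTP error statuses
    response = requests.get(site, timeout=10)
    response.raise_for_status()
    # Parse the HTML and keep only anchor tags that carry an href
    anchors = BeautifulSoup(response.text, 'html.parser').find_all('a', href=True)
    return [anchor['href'] for anchor in anchors]


if __name__ == '__main__':
    site_link = input('Enter URL of the site: ')
    all_links = extract_all_links(site_link)
    print(all_links)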

Output

kalebu@kalebu-PC:~/$ python3 link_spider.py
Enter URL of the site: https://kalebujordan.com/
['#main-content', 'mailto://kalebjordan.kj@gmail.com',
'https://web.facebook.com/kalebu.jordan', 'https://twitter.com/j_kalebu',
'https://kalebujordan.com/', ...]
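
The sample output above mixes in-page fragments (#main-content), mailto addresses, and absolute URLs. If only crawlable web links are wanted, one possible cleanup step is sketched below using the standard library's urllib.parse. The function name normalize_links and the sample values are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href or href.startswith('#'):
            continue  # skip empty values and in-page fragments
        absolute = urljoin(base_url, href)  # turns '/about' into a full URL
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # skip mailto:, tel:, javascript: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical href values similar in shape to the output above
sample = ['#main-content', 'mailto:someone@example.com', '/about',
          'https://example.com/about', None]
print(normalize_links('https://example.com/', sample))
# prints ['https://example.com/about']
```

The list returned by extract_all_links could be passed in as hrefs, with the URL entered at the prompt as base_url.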

I hope you found this useful. Feel free to share it with your fellow developers.

Previously published here: https://kalebujordan.com/learn-how-to-extract-all-links-from-a-website-in-python/