This Repo required for Asac labs class 2
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
For web scraping we are going to use the very popular Python library called BeautifulSoup.
pip install beautifulsoup4
or
poetry add beautifulsoup4
The first step in scraping is to select the website you wish to scrape data from and inspect it.
To inspect a website right click anywhere on the page and choose ‘Inspect Element’ / ‘View Page Source’
Now we can begin parsing the webpage and searching for the specific elements we need using BeautifulSoup. For connecting to the website and getting the HTML we will use Python’s urllib. Let us import the required libraries-
from urllib.request import urlopen
from bs4 import BeautifulSoup
3a. Get the url-
url = "https://www.bbc.com/sport/football/46897172"
3b. Connecting to the website-
try:
page = urlopen(url)
except:
print("Error opening the URL")
3c. Create a BeautifulSoup object for parsing-
soup = BeautifulSoup(page, 'html.parser')
We now use BeautifulSoup’s soup.find() method to search for the tag
content = soup.find('div', {"class": "story-body sp-story-body gel- body-copy"})
We now iterate through content to find all the <p> (paragraph) tags in it to get the entire body of the article.
article = ''
for i in content.findAll('p'):
article = article + ' ' + i.text
We can save the information we scraped in a .txt or .csv file.
with open('scraped_text.txt', 'w') as file:
file.write(article)
Make the crawling slower, do not slam the server, treat websites nicely