Web Scraping Using Python

Table of Contents:-

  1. Introduction.

  2. Prerequisites.

  3. Setting up the environment.

  4. Understanding the basics of web scraping.

  5. Ethical Considerations in Web Scraping.

  6. Use cases of web scraping.

  7. Conclusion.

Introduction:-

Web scraping is the process of collecting data from websites. Instead of manually copying and pasting data, web scraping allows you to collect large amounts of data programmatically and in a structured format. Many websites provide APIs that let you access their data in a structured format, but some websites do not; in such cases you can extract the data by web scraping.

Python is commonly used for web scraping because of its beginner-friendly, clean and simple syntax. Python libraries such as Beautiful Soup, Scrapy and Selenium make web scraping easy, efficient and powerful.

Prerequisites:-

To get started with web scraping, you need some software and libraries installed:

  1. Download and install Python.

  2. Use any IDE or editor such as PyCharm, Visual Studio Code or Jupyter Notebook.

  3. Install the libraries beautifulsoup4, pandas and requests.

Basic knowledge of Python is essential.

Setting up the environment:-

Download and install Python, then verify the installation and install the required libraries from the command line:

python --version   # check that Python is installed and on your PATH
pip install requests beautifulsoup4 pandas   # install the libraries used in this tutorial
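
To confirm that everything installed correctly, a quick sanity check (a minimal sketch, assuming the packages above installed without errors) is to import the libraries and print their versions:

# confirm the libraries are importable and print their versions
import requests
import bs4
import pandas as pd

print("requests:", requests.__version__)
print("beautifulsoup4:", bs4.__version__)
print("pandas:", pd.__version__)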

Understanding the basics of Web Scraping:-

Once you have installed all the requirements, you are ready to scrape data from websites. Web scraping works by sending an HTTP request to the web page you want to scrape. After the page responds, its HTML is parsed using a library like BeautifulSoup. Once parsing is done, you can extract the required data using HTML tags and attributes. A proper understanding of the HTML structure of a page is necessary to extract the required data efficiently.

Importing Libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

Sending HTTP requests

url = "https://yourwebpage.com"
response = requests.get(url)

Sometimes you might get an error with status code 403. A 403 Forbidden error occurs when the server denies the request. This usually happens because the server implements mechanisms to restrict automated scripts and bots. You can often solve this problem by including headers that make the request look like it is coming from a legitimate browser.

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",}
response=requests.get('https://yourwebpage.com/data?page=1',headers=headers)

By setting appropriate headers and following ethical practices, you can minimize the chances of encountering a 403 error while scraping websites.
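
If you are going to request several pages from the same site, one option (a sketch using the requests library, with a placeholder URL) is to attach the headers to a requests.Session once, so every request reuses them, and to check the status code before parsing:

import requests

# one session that sends the same headers with every request
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

response = session.get("https://yourwebpage.com/data?page=1")  # placeholder URL
if response.status_code != 200:
    print("Request failed with status code", response.status_code)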

Parsing the HTML

soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

Inspect the page and use HTML tags, classes and IDs to target the specific data you need.

Right-click on the webpage → Inspect.
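
For example, assuming the page has a div with the ID main-content, headings with the class titles, and ordinary links inside that div (placeholder names, not from a real page), you can target elements by tag, class or ID:

# target a single element by its ID
main = soup.find("div", id="main-content")

# target elements by tag and class (class_ is used because class is a Python keyword)
headings = soup.find_all("h1", class_="titles")

# CSS selectors also work: '#' selects IDs and '.' selects classes
links = soup.select("div#main-content a")
for link in links:
    print(link.get("href"))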

Extracting data

titles = soup.find_all('h1', class_="titles")
for title in titles:
    print(title.text)

If you have to extract titles from 30 different pages of the same website and want to build a list of titles, you can loop over the pages, scrape each one and append its titles to your list.

titles_list = []
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
for j in range(1, 31):
    url = 'https://yourwebpage.com/data?page={}'.format(j)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
        titles = soup.find_all("h1", class_="titles")
        for title in titles:
            titles_list.append(title.text.strip())
    else:
        print(f"Failed to fetch page {j}: status code {response.status_code}")

# save the collected titles to a CSV file
df = pd.DataFrame(titles_list, columns=['title'])
df.to_csv("titles.csv", index=False)

Ethical considerations in web scraping:-

  • Respect the website's policies. Many websites have a robots.txt file that specifies which pages can and cannot be accessed by automated bots.

  • Scraping too frequently or fetching large amounts of data in a short period can overload the server, potentially affecting the website's functionality for other users, so space your requests out (see the sketch after this list).

  • Scraping personal or sensitive information (e.g., contact details, medical records) without consent is unethical and often illegal.
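
A simple way to follow the first two points (a sketch using Python's built-in urllib.robotparser, with a placeholder URL and an arbitrary delay) is to check robots.txt before scraping and to pause between requests:

import time
from urllib.robotparser import RobotFileParser

# read the site's robots.txt and check whether bots may fetch the path
parser = RobotFileParser("https://yourwebpage.com/robots.txt")  # placeholder URL
parser.read()

if parser.can_fetch("*", "https://yourwebpage.com/data"):
    for page in range(1, 4):
        # ... fetch and parse the page here ...
        time.sleep(2)  # pause so the server is not flooded with requests
else:
    print("robots.txt disallows scraping this path")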

Where feasible, use APIs provided by the website instead of scraping. APIs are designed for structured data access and are more likely to comply with the website's policies.

Use cases and applications of web scraping:-

  • Market research and competitive analysis.

  • E-commerce price monitoring.

  • Collecting news articles, blog posts, or social media updates to curate content for dashboards or apps.

  • Gathering stock prices, financial reports, or market news for investment decisions and predictions.

  • Gathering data on movies, shows, music, or gaming trends for recommendation systems.

  • Scraping posts, hashtags, or user activity for trend analysis and sentiment tracking.

Web scraping has a vast range of applications, empowering businesses, researchers, and developers with actionable data.

Conclusion:-

Web scraping is a powerful tool for extracting valuable data from websites, helping us gain insights and make informed decisions. From tracking market trends to building personalized recommendations, its applications are vast. However, it’s essential to approach web scraping responsibly by respecting website policies, avoiding overloading servers, and complying with legal and ethical guidelines. With the right tools and practices, web scraping can unlock endless possibilities for businesses, researchers, and developers.