Web scraping is the process of extracting data from websites and parsing it. In other words, it's a technique for extracting unstructured data and storing that data either in a local file or in a database. Collecting this data by hand involves a huge amount of manual work and consumes a lot of time; web scraping can save programmers many hours.
The basic steps involved in web scraping are:
Loading the document (HTML content)
Parsing the document
Extraction
Transformation
Beautiful Soup is a Python web scraping library that allows us to parse and scrape HTML and XML pages. You can search, navigate, and modify data using a parser. It’s versatile and saves a lot of time. In this article we will learn how to scrape data using Beautiful Soup.
Beautiful Soup can be installed using the pip command. You can also try pip3 if pip is not working.
pip install requests
pip install beautifulsoup4
The requests module is used for getting HTML content.
The next step is to inspect the website that you want to scrape. Start by opening the site in a browser. Go through the structure of the site and find the part of the page you want to scrape. Next, open the developer tools by going to More tools > Developer tools. Finally, open the Elements tab in the developer tools.
Next, get the HTML content from the web page. We use the requests module for this task. We call the get() function, passing the URL of the webpage as an argument, as shown below:
#import requests library
import requests

#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"

result = requests.get(url_link).text
print(result)
In the above code, we issue an HTTP GET request to the specified URL and store the response in a Python object. The .text attribute gives us the HTML of the page as a string, which we then print.
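Note that requests does not raise an error on its own when a page fails to load; if you want to fail fast on a bad response, the requests API provides raise_for_status(). A minimal sketch using the same URL:

import requests

url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
response = requests.get(url_link)

#raise an HTTPError if the server returned a 4xx/5xx status
response.raise_for_status()

#safe to read the body now
result = response.text
print(len(result), "characters received")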
Step 4: Parsing an HTML Page with Beautiful Soup
Now that we have the HTML content in a document, the next step is to parse and process the data. To do so, we import the library, create an instance of the BeautifulSoup class, and process the data.
from bs4 import BeautifulSoup
#import requests library
import requests

#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"

result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")
print(doc.prettify())
The prettify() function prints the HTML content in a nested form that is easy to read and helps us identify the tags we need.
There are two methods to find tags: find() and find_all().
find(): This method returns the first matching element.
find_all(): This method returns a list of all matching elements.
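As a quick comparison, here is a minimal sketch run against the doc object parsed above:

#find() returns the first <h2> tag (or None if there is no match)
first_h2 = doc.find("h2")
print(first_h2)

#find_all() returns a list of every <h2> tag in the document
all_h2 = doc.find_all("h2")
print(len(all_h2), "h2 tags found")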
Many elements of an HTML page carry an ID attribute, and an ID value is unique within the page. Let us now try to find an element by using the value of its ID attribute. For example, I am looking for the element whose ID is "content", as shown below:
res = doc.find(id = "content")
print(res)
We can also search by class name within the above res. For example, let us extract the h1 element with the class name "firstHeading":
<h1 class="firstHeading" id="firstHeading">List of states and territories of the United States</h1>
heading = res.find(class_ = "firstHeading")
print(heading)
<h1 class="firstHeading" id="firstHeading">List of states and territories of the United States</h1>
If we want to get only the text from the above heading tag, we can do so as follows:
print(heading.text)
List of states and territories of the United States
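The .text attribute is a shortcut for the get_text() method; if you need more control, get_text() accepts options such as strip=True (a standard Beautiful Soup argument) that trims surrounding whitespace:

#equivalent to heading.text, with surrounding whitespace stripped
print(heading.get_text(strip=True))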
We can also search within an element we have already found. For example, let us try to find all the h2 tags inside the element with id="content".
res = doc.find(id = "content")
for ele in res.find_all("h2"):
    print(ele)
<h2 id="mw-toc-heading">Contents</h2>
…(the remaining <h2> headings on the page follow)
For example, if you want to search for the text "California", you can do it by using the code below.
res = doc.find_all(text = "California")
print(res)
['California', 'California', 'California']
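A note on the text argument: in Beautiful Soup 4.4 and later it is also available under the name string, which behaves the same way here:

#string= is the newer name for the text= argument
res = doc.find_all(string = "California")
print(res)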
You can also pass a list to the find_all() function and Beautiful Soup will find all the elements that match any item in that list.
For example, the below code will find all the <a>, <p> and <div> tags in the document.
res = doc.find_all(["a", "p", "div"])
If you pass in a regular expression, Beautiful Soup will filter using that regular expression. We will have to import the re module, as shown below.
import re

for s in doc.find_all(text = re.compile("1788")):
    print(s)
Jan 9, 1788
Jan 2, 1788
Apr 28, 1788
Feb 6, 1788
Jun 21, 1788
Jul 26, 1788
May 23, 1788
Jun 25, 1788
Further, if you want to get only a limited number of results, you can use the limit argument.
for s in doc.find_all(text = re.compile("1788"), limit = 2):
    print(s)
Jan 9, 1788
Jan 2, 1788
Beautiful Soup has a .select() method that allows us to filter using CSS selectors.
print(doc.select("title"))
You can also select tags beneath other tags:
print(doc.select("html head title"))
To select elements by class, you can use either of the following:

print(doc.select(".vector-menu-content"))

or

print(doc.select("[class~=vector-menu-content]"))
print(doc.select("#p-logo"))
<div id="p-logo" role="banner">
 <a class="mw-wiki-logo" href="/wiki/Main_Page" title="Visit the main page"></a>
</div>
print(doc.select("div#mw-panel"))
print(doc.select("footer[role]"))
print(doc.select("a[href]"))
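select() returns a list of matching tags, and each tag's attributes can be read like dictionary keys. As a small sketch, the loop below prints the targets of the first five links on the page (the slice size is arbitrary). There is also select_one(), which returns just the first match, much like find():

#print the href attribute of the first five links
for link in doc.select("a[href]")[:5]:
    print(link["href"])

#select_one() returns only the first matching element
print(doc.select_one("title"))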
In this exercise, we will use a webpage from Wikipedia (https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States) as an example. This page contains a table listing the U.S. states, their populations, and other details. We will try to get the names of the states and the population column of the table.
The initial step is to identify the text or area of the webpage that is to be scraped. To find that, simply select the area of the page, right-click, and then click Inspect.
You can see that the element I am looking for is in the table with the class name "wikitable sortable plainrowheaders", and it is the string of an <a> tag nested inside a <th> tag.
Let us now write the code to fetch the data.
Import the essential libraries.
from bs4 import BeautifulSoup
#import requests library
import requests
In the next step, we issue a GET request, passing the URL of the webpage that is to be parsed. Then we create a Beautiful Soup object with "html.parser".
#the website URL
url_link = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States"
result = requests.get(url_link).text
doc = BeautifulSoup(result, "html.parser")
Then, we use the BeautifulSoup object created above and collect the required table data by using the class name:
my_table = doc.find("table", class_="wikitable sortable plainrowheaders")
We then extract all the <th> tags in our table and finally get the text inside the <a> tags.
th_tags = my_table.find_all('th')
names = []
for elem in th_tags:
    #finding the <a> tags
    a_links = elem.find_all("a")
    #getting the text inside each <a> tag
    for i in a_links:
        names.append(i.string)
print(names)
['postal abbreviation', '[13]', '[C]', '[15]', '[16]', '[16]', '[16]', None, '[17]', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', '[D]', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', '[D]', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', '[D]', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', '[D]', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
In the above result, you can observe that the state names start at index 9, and there are also a few '[D]' strings in between. We will prepare the final list by removing the unwanted strings.
final_list = names[9:]
states = []
for name in final_list:
    if len(name) > 3:
        states.append(name)
print(states)
And here is the result:
['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
In a similar way, we will now try to scrape the population column from the same table. When I inspect a column element, I can see that the values are contained inside <div> tags.
The code goes as follows:
divs = my_table.find_all("div")
pop = []
for i in divs:
    pop.append(i.string)
print(pop)
['5,024,279', '7', '733,391', '1', '7,151,502', '9', '3,011,524', '4', '39,538,223', '53', '5,773,714', '7', '3,605,944', '5', '989,948', '1', '21,538,187', '27', '10,711,908', '14', '1,455,271', '2', '1,839,106', '2', '12,812,508', '18', '6,785,528', '9', '3,190,369', '4', '2,937,880', '4', '4,505,836', '6', '4,657,757', '6', '1,362,359', '2', '6,177,224', '8', '7,029,917', '9', '10,077,331', '14', '5,706,494', '8', '2,961,279', '4', '6,154,913', '8', '1,084,225', '1', '1,961,504', '3', '3,104,614', '4', '1,377,529', '2', '9,288,994', '12', '2,117,522', '3', '20,201,249', '27', '10,439,388', '13', '779,094', '1', '11,799,448', '16', '3,959,353', '5', '4,237,256', '5', '13,002,700', '18', '1,097,379', '2', '5,118,425', '7', '886,667', '1', '6,910,840', '9', '29,145,505', '36', '3,271,616', '4', '643,077', '1', '8,631,393', '11', '7,705,281', '10', '1,793,716', '3', '5,893,718', '8', '576,851', '1']
We will now remove the unwanted strings in between (the short numeric values come from another column of the table).
pop_final = []
for i in pop:
    if len(i) > 3:
        pop_final.append(i)
print(pop_final)
And here is the final result:
['5,024,279', '733,391', '7,151,502', '3,011,524', '39,538,223', '5,773,714', '3,605,944', '989,948', '21,538,187', '10,711,908', '1,455,271', '1,839,106', '12,812,508', '6,785,528', '3,190,369', '2,937,880', '4,505,836', '4,657,757', '1,362,359', '6,177,224', '7,029,917', '10,077,331', '5,706,494', '2,961,279', '6,154,913', '1,084,225', '1,961,504', '3,104,614', '1,377,529', '9,288,994', '2,117,522', '20,201,249', '10,439,388', '779,094', '11,799,448', '3,959,353', '4,237,256', '13,002,700', '1,097,379', '5,118,425', '886,667', '6,910,840', '29,145,505', '3,271,616', '643,077', '8,631,393', '7,705,281', '1,793,716', '5,893,718', '576,851']
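Before combining the two lists, it is worth a quick sanity check that they are the same length, since the DataFrame columns we build next must line up (each list should hold 50 entries, one per state):

#both lists must be the same length to form DataFrame columns
print(len(states), len(pop_final))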
Finally, we combine the two lists into a pandas DataFrame:

import pandas as pd

df = pd.DataFrame()
df['state'] = states
df['population'] = pop_final
print(df)
state population
0 Alabama 5,024,279
1 Alaska 733,391
2 Arizona 7,151,502
3 Arkansas 3,011,524
4 California 39,538,223
5 Colorado 5,773,714
6 Connecticut 3,605,944
7 Delaware 989,948
8 Florida 21,538,187
9 Georgia 10,711,908
10 Hawaii 1,455,271
11 Idaho 1,839,106
12 Illinois 12,812,508
13 Indiana 6,785,528
14 Iowa 3,190,369
15 Kansas 2,937,880
16 Kentucky 4,505,836
17 Louisiana 4,657,757
18 Maine 1,362,359
19 Maryland 6,177,224
20 Massachusetts 7,029,917
21 Michigan 10,077,331
22 Minnesota 5,706,494
23 Mississippi 2,961,279
24 Missouri 6,154,913
25 Montana 1,084,225
26 Nebraska 1,961,504
27 Nevada 3,104,614
28 New Hampshire 1,377,529
29 New Jersey 9,288,994
30 New Mexico 2,117,522
31 New York 20,201,249
32 North Carolina 10,439,388
33 North Dakota 779,094
34 Ohio 11,799,448
35 Oklahoma 3,959,353
36 Oregon 4,237,256
37 Pennsylvania 13,002,700
38 Rhode Island 1,097,379
39 South Carolina 5,118,425
40 South Dakota 886,667
41 Tennessee 6,910,840
42 Texas 29,145,505
43 Utah 3,271,616
44 Vermont 643,077
45 Virginia 8,631,393
46 Washington 7,705,281
47 West Virginia 1,793,716
48 Wisconsin 5,893,718
49 Wyoming 576,851
We then write the data frame to the CSV file using the line of code below.
df.to_csv('us_info.csv')
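By default, to_csv() also writes the DataFrame's numeric index as the first column of the file; pass index=False (a standard pandas option) if you do not want it, and you can read the file back as a quick check:

#write the table without the index column
df.to_csv('us_info.csv', index=False)

#read it back to verify the contents
print(pd.read_csv('us_info.csv').head())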
Beautiful Soup is easy to learn and beginner-friendly. In this article, we covered the basics of web scraping using Beautiful Soup and worked through a sample project to better understand the concepts. In short, the requests library allows you to fetch static HTML content from the Internet, and the Beautiful Soup package allows you to parse that HTML. However, there are many more advanced, interesting concepts to explore on this topic. You can find the documentation here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/