Hello, readers! In this article, we will learn how to scrape Google search results using BeautifulSoup in Python, and in doing so explore one of the most interesting concepts in Python: scraping a website.
So, let us begin!
What is Web Scraping?
At times, when we surf the web, we come across data that we believe will be useful in the future, and we copy and paste it to the clipboard each time.
Now, let us analyze another scenario.
We often need data to analyze the behavior of certain factors in a data model. So we begin building a dataset from scratch by copy-pasting the data.
This is where Web Scraping, or Web Crawling, comes into the picture.
Web Scraping automates the repetitive task of copying and pasting data from websites. With web scraping, we can crawl through websites, then save and present the data we need in a customized format.
Let us now understand the working of Web Scraping in the next section.
How Does Web Scraping Work?
Let us try to understand the functioning of Web Scraping through the below steps:
- First, we write a piece of code that sends a request to the server of the website we want to crawl, asking for the information we want to scrape.
- Like a browser, the code lets us download the source code of the webpage.
- Then, instead of rendering the page the way a browser does, we filter the source by its HTML tags and scrape only the information we need, in a customized manner.
With this, we can load the source code of a webpage quickly and in a customized way, as the short sketch below illustrates.
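For instance, a minimal sketch of these three steps might look like the following (assuming the requests and bs4 packages are installed; example.com is just a placeholder URL):

import requests
from bs4 import BeautifulSoup

# Steps 1 and 2: request the page and download its source code
page = requests.get("https://example.com")

# Step 3: filter the source by HTML tags instead of rendering it
soup = BeautifulSoup(page.text, "html.parser")
for heading in soup.find_all("h1"):
    print(heading.text)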
Let us now try to implement Web Scraping in the upcoming section.
Bulk Scraping APIs
If you are looking to build a service by scraping searches in bulk, chances are high that Google will block you because of the unusually high number of requests. In that case, an online API such as Zenserp is a big help.
Zenserp performs searches through various IPs and proxies and allows you to focus on your logic rather than on infrastructure. It also makes your job easier by supporting image search, shopping search, reverse image search, trends, etc. You can try it out here: just fire off any search and look at the JSON response.
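As a rough sketch, a call to such an API is just an HTTP request that returns JSON. The endpoint and parameter names below are assumptions based on Zenserp's v2 search API and should be verified against the official documentation; YOUR_API_KEY is a placeholder:

import requests

# Hypothetical Zenserp-style search call; check the current docs
# for the exact endpoint and parameter names.
params = {"q": "python", "apikey": "YOUR_API_KEY"}
response = requests.get("https://app.zenserp.com/api/v2/search", params=params)
print(response.json())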
Implementing steps to Scrape Google Search results using BeautifulSoup
Here, we will use BeautifulSoup to scrape Google Search results.
BeautifulSoup is a Python library that enables us to crawl through websites and parse XML and HTML documents, webpages, etc. A tiny example below shows the idea.
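To get a feel for the library before pointing it at Google, here is a small self-contained example that parses an HTML string with Python's built-in parser:

from bs4 import BeautifulSoup

html = "<html><body><h3>First result</h3><h3>Second result</h3></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag in the document
for tag in soup.find_all("h3"):
    print(tag.text)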
Scrape Google Search Results for a Customized Search
Example 1:
import requests
from bs4 import BeautifulSoup
import random

text = "python"
url = "https://google.com/search?q=" + text

A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
     )

Agent = A[random.randrange(len(A))]
headers = {'user-agent': Agent}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

for info in soup.find_all('h3'):
    print(info.text)
    print('#######')
Line-by-line explanation of the above code:
1. Importing the necessary libraries: In order to use BeautifulSoup for scraping, we import the library with the code below:
from bs4 import BeautifulSoup
Further, we need the Python requests library to download the webpage. The requests module sends a GET request to the server, which lets us download the HTML content of the required webpage.
import requests
2. Setting the URL: We need to provide the URL, i.e., the domain where we want our information to be searched and scraped. Here, we have provided the URL of Google and appended the text 'python', so that results are scraped for the query text='python'.
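Note that a query containing spaces or special characters should be URL-encoded before being appended. A small sketch using the standard library (not part of the original script):

from urllib.parse import quote_plus

text = "python web scraping"
url = "https://google.com/search?q=" + quote_plus(text)
print(url)  # https://google.com/search?q=python+web+scraping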
3. Setting the User-Agent: We need to specify a User-Agent header, which lets the server identify the system, application, and browser from which the request is made, as shown below:
A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
     )
4. Sending the request: requests.get(url, headers=headers) sends the request to the web server so as to download the requested HTML content of the webpage or search results.
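In practice it is worth guarding this call. A slightly more defensive variant (my own addition, not part of the original script) adds a timeout and raises an error on a bad HTTP status:

# timeout avoids hanging forever; raise_for_status() raises
# requests.HTTPError for 4xx/5xx responses (e.g., when Google blocks us)
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()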
5. Creating a BeautifulSoup object: We parse the downloaded data with the 'lxml' parser. The lxml package must be installed for the below code to work.
soup = BeautifulSoup(r.text, 'lxml')
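If lxml is not available (it can be installed with pip install lxml), Python's built-in parser is a workable fallback:

# html.parser ships with Python, so no extra installation is needed
soup = BeautifulSoup(r.text, 'html.parser')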
6. Extracting the results: Further, we use soup.find_all('h3') to scrape and display all the Header 3 (<h3>) content of the result page for text='python'; a small extension of this step appears after the output below.
Output:
Welcome to Python.org
#######
Downloads
#######
Documentation
#######
Python For Beginners
#######
Python 3.8.5
#######
Tutorial
#######
Python Software Foundation
#######
Python (programming language) - Wikipedia
#######
Python Tutorial - W3Schools
#######
Introduction to Python - W3Schools
#######
Python Tutorial - Tutorialspoint
#######
Learn Python - Free Interactive Python Tutorial
#######
Learn Python 2 | Codecademy
#######
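As a small extension of step 6, we can collect the titles instead of only printing them, for example to save them for later analysis (my own addition to the script):

# gather all <h3> texts into a list
titles = [info.text for info in soup.find_all('h3')]

# write the scraped titles to a file, one per line
with open('results.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))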
Scrape Search Results from a Particular Webpage
In this example, we scrape the values of particular HTML tags from a website, as shown below:
Example 2:
import requests
from bs4 import BeautifulSoup
import random

url = "https://www.askpython.com/python/examples/python-predict-function"

A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
     )

Agent = A[random.randrange(len(A))]
headers = {'user-agent': Agent}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')

# scrape the <title> tag of the page
title = soup.find('title')
print("Title of the webpage--\n")
print(title.string)

# scrape every hyperlink inside <div> tags of class "site"
search = soup.find_all('div', class_="site")
print("Hyperlink in the div of class-site--\n")
for h in search:
    print(h.a.get('href'))
Here, we have scraped the title tag value and all the a href values present in the div tag with class value 'site'. The class value differs from website to website, according to the structure of the page's code; a fallback for unknown structures is sketched after the output below.
Output:
Title of the webpage--

Python predict() function - All you need to know! - AskPython
Hyperlink in the div of class-site--

https://www.askpython.com/
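As noted above, the class value differs for each website. When the page structure is unknown, a common fallback is to match anchor tags directly rather than a site-specific div class; a minimal sketch:

# href=True keeps only <a> tags that actually carry an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])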
Conclusion
With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.
For more such posts related to Python, stay tuned, and till then, Happy Learning!! 🙂