The Python urllib module allows us to access URL data programmatically.
Python urllib
- We can use Python urllib to get website content in a Python program.
- We can also use it to call REST web services.
- We can make GET and POST HTTP requests.
- This module allows us to make HTTP as well as HTTPS requests.
- We can send request headers and also get information about response headers.
Python urllib GET example
Let’s start with a simple example where we read the content of the Wikipedia home page.
```python
import urllib.request

response = urllib.request.urlopen('https://www.wikipedia.org')
print(response.read())
```
The response read() method returns the data as bytes. The above code will print the HTML data returned by the Wikipedia home page. It will not be in a human-readable format, but we can use an HTML parser to extract useful information from it.
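As a rough illustration of that last point, the sketch below decodes the bytes into a string and pulls out the page title with the standard library's html.parser module. The TitleParser class and the sample_html bytes are made up for this example, and UTF-8 is an assumption; a real page may use another charset or need a more robust parser.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical helper that collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.in_title:
            self.title += data


# Stand-in for response.read(); assumed to be UTF-8 encoded
sample_html = b'<html><head><title>Wikipedia</title></head><body></body></html>'

parser = TitleParser()
parser.feed(sample_html.decode('utf-8'))
print(parser.title)
```

With a live response, we would pass response.read() in place of sample_html.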
Let’s see what happens when we try to run the above program for JournalDev.
```python
import urllib.request

response = urllib.request.urlopen('https://www.journaldev.com')
print(response.read())
```
We will get the error message below.
```
/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/pankaj/Documents/PycharmProjects/BasicPython/urllib/urllib_example.py
Traceback (most recent call last):
  File "/Users/pankaj/Documents/PycharmProjects/BasicPython/urllib/urllib_example.py", line 3, in <module>
    response = urllib.request.urlopen('https://www.journaldev.com')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
```
This happens because my server does not allow programmatic access to the website data, since it is meant for browsers that can parse HTML. We can usually overcome this error by sending a User-Agent header in the request. Let’s look at the modified program for this.
```python
import urllib.request

# Request with header data to send the User-Agent header
url = "https://www.journaldev.com"
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'

request = urllib.request.Request(url, headers=headers)
resp = urllib.request.urlopen(request)
print(resp.read())
```
We create the request headers as a dictionary and then send them with the request. The above program will print the HTML data received from the JournalDev home page.
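If a server still rejects a request, urlopen raises urllib.error.HTTPError, as in the traceback earlier. The following is a minimal sketch of catching that error instead of letting the program crash. The fetch function, its opener parameter and forbidden_opener are hypothetical names invented for this example; forbidden_opener only simulates a 403 response so the code can be tried without a network connection.

```python
import io
import urllib.error
import urllib.request
from email.message import Message


def fetch(url, opener=urllib.request.urlopen):
    """Return (status, body bytes) on success or (code, reason) on an HTTP error.

    `fetch` and `opener` are hypothetical names used only in this sketch.
    """
    try:
        with opener(url) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code and reason phrase of the failed request
        return err.code, err.reason


def forbidden_opener(url):
    # Simulates a server that rejects the request, so no network is needed
    raise urllib.error.HTTPError(url, 403, 'Forbidden', Message(), io.BytesIO(b''))


print(fetch('https://www.journaldev.com', opener=forbidden_opener))
```

Calling fetch(url) without the opener argument uses the real urlopen function.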
Python urllib REST Example
REST web services are accessed over the HTTP protocol, so we can easily access them using the urllib module. I have a simple JSON-based demo REST web service running on my local machine, created using JSON Server. It’s a great Node module for running dummy JSON REST web services for testing purposes.
```python
import urllib.request

response = urllib.request.urlopen('https://localhost:3000/employees')
print(response.read())
```

[Image: Python urllib - Python 3 urllib With Examples]
Notice the console output is printing JSON data.
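Since read() returns bytes, we usually want to turn the JSON text into Python objects before using it. Here is a small sketch with the standard json module. The raw_body bytes and the employee records in it are made-up sample data standing in for response.read(), not the actual output of the demo service.

```python
import json

# Stand-in for response.read(); the employee records are made-up sample data
raw_body = b'[{"id": 1, "name": "Lisa", "salary": 2000}, {"id": 2, "name": "David", "salary": 9988}]'

# Decode the bytes (assuming UTF-8) and parse the JSON text into a list of dicts
employees = json.loads(raw_body.decode('utf-8'))

for employee in employees:
    print(employee['name'], employee['salary'])
```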
We can get the response headers by calling the info() function on the response object. This returns a dictionary-like object, so we can also extract specific header data from the response.
```python
import urllib.request

response = urllib.request.urlopen('https://localhost:3000/employees')
print(response.info())

print('Response Content Type is =', response.info()["content-type"])
```
Output:
```
X-Powered-By: Express
Vary: Origin, Accept-Encoding
Access-Control-Allow-Credentials: true
Cache-Control: no-cache
Pragma: no-cache
Expires: -1
X-Content-Type-Options: nosniff
Content-Type: application/json; charset=utf-8
Content-Length: 260
ETag: W/"104-LQla2Z3Cx7OedNGjbuVMiKaVNXk"
Date: Wed, 09 May 2018 19:26:20 GMT
Connection: close


Response Content Type is = application/json; charset=utf-8
```
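For HTTP and HTTPS URLs, the object returned by info() is an http.client.HTTPMessage, which offers a few conveniences beyond plain key lookup. The sketch below builds such an object by hand, with two header values copied from the output above, so it can run without the demo service. The X-Missing header name is invented to show the default value.

```python
from http.client import HTTPMessage

# Build a header object by hand so the example runs without a server
headers = HTTPMessage()
headers['Content-Type'] = 'application/json; charset=utf-8'
headers['Content-Length'] = '260'

print(headers['content-type'])           # header lookup ignores case
print(headers.get('X-Missing', 'n/a'))   # default value for an absent header
print(headers.get_content_type())        # media type without parameters
print(headers.get_content_charset())     # charset parameter, if present
```

The charset value is handy when decoding the bytes returned by read().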
Python urllib POST
Let’s look at an example of a POST method call.
```python
import urllib.request
import urllib.parse

post_url = 'https://localhost:3000/employees'

# Note: this dictionary is never passed to urlopen() below,
# so it has no effect on the request that is sent
headers = {}
headers['Content-Type'] = 'application/json'

# POST request encoded data
post_data = urllib.parse.urlencode({'name': 'David', 'salary': '9988'}).encode('ascii')

# Automatically calls POST method because request has data
post_response = urllib.request.urlopen(url=post_url, data=post_data)
print(post_response.read())
```
When we call the urlopen function, if the request has data then it automatically uses the POST HTTP method. The image below shows the output of the above POST call for my demo service.
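To see this method selection without sending anything, we can build Request objects and ask them which method they would use. This is only a sketch: it reuses the demo URL from the example above, encodes the payload as JSON instead of form data (an assumption about what the service accepts), and passes the Content-Type header through the Request object.

```python
import json
import urllib.request

post_url = 'https://localhost:3000/employees'  # demo URL from the example above
payload = json.dumps({'name': 'David', 'salary': '9988'}).encode('utf-8')

# Without data, the request defaults to GET
get_request = urllib.request.Request(post_url)

# With data, the request defaults to POST; headers travel with the Request object
post_request = urllib.request.Request(
    post_url,
    data=payload,
    headers={'Content-Type': 'application/json'},
)

print(get_request.get_method())
print(post_request.get_method())
# Request stores header names capitalized, hence 'Content-type' here
print(post_request.get_header('Content-type'))
```

Passing post_request to urllib.request.urlopen would then send the JSON body with the POST method.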
Reference: API Doc