Python lxml is the most feature-rich and easy-to-use library for processing XML and HTML data. Python scripts are written to perform many tasks like Web scraping and parsing XML. In this lesson, we will study about python lxml library and how we can use it to parse XML data and perform web scraping as well.

Python lxml library

Python lxml is an easy to use and feature rich library to process and parse XML and HTML documents. lxml is really nice API as it provides literally everything to process these 2 types of data. The two main points which make lxml stand out are:

  • Ease of use: It has very easy syntax than any other library present
  • Performance: Processing even large XML files takes very less time

Python lxml install

We can start using lxml by installing it as a python package using pip tool:

Once we are done with installing this tool, we can get started with simple examples.

Creating HTML Elements

With lxml, we can create HTML elements as well. The elements can also be calles as the Nodes. Let’s create basic structure of an HTML page using just the library:

When we run this script, we can see the HTML elements being formed:
python-lxml-making-html-1
We can see HTML elements or nodes being made. The pretty_print parameter helps to print indented version of HTML document.

These HTML elements are basically a list. We can access this list normally:

And this will just print head as that is the tag present right inside html tag. We can also print all elements inside the root tag:

This will print all tags:

python-lxml-making-html-1

Checking validity of HTML Elements

With iselement() function, we can even check if given element is a valid HTML element:

We just used the last script we wrote. This will give a simple output:

python-lxml-check-valid-html

Using attributes with HTML Elements

We can add metadata to each HTML element we construct by adding attributes to the elements we make:

When we run this, we see:
python-lxml-check-valid-html
We can now access these attributes as:

Value is printed to the console:
python-lxml-check-valid-html
Note that is the attribute doesn’t exist for given HTML element, we will get None as output.

We can also set attributes for an HTML element as:

When we print the value, we get the expected results:

python-lxml-check-valid-html

Sub-Elements with values

Sub-elements we constructed above were empty and that is no fun! Let’s make some sub-elements and put some values in it using lxml library.

This looks like some healthy data. Let’s see the output:

python-lxml-check-valid-html

Feeding RAW XML for Serialisation

We can provide RAW XML data directly to etree and parse it as well as it completely understands what is passed to it.

Let’s see the output:
python-lxml-serialisation
If you want the data to include the root XML tag declaration, even that is possible:

Let’s see the output now:

python-lxml-serialisation-xml

Python lxml etree parse() function

The parse() function can be used to parse from files and file-like objects:

Let’s see the output now:

python-lxml-parse-function

Python lxml etree fromstring() function

The fromstring() function can be used to parse Strings:

Let’s see the output now:

python-lxml-check-valid-html

Python lxml etree XML() function

The fromstring() function can be used to write XML literals directly into the source:

Let’s see the output now:

python-lxml-check-valid-html

Reference: LXML Documentation.

By admin

Leave a Reply