In this tutorial, we’ll implement web scraping in an Android application. We’ll scrape the JournalDev home page to collect all the words listed on it, using the Retrofit library to fetch the page.

Android Retrofit Converters

We’ve covered Retrofit extensively in earlier tutorials.

Most of the time, we have used Gson to serialise and deserialise JSON responses.
For this, we added the Gson converter (GsonConverterFactory) to our Retrofit builder.

There are cases where you just need plain text as the response body of a network call.
In such cases, instead of the Gson converter, we need to use the scalars converter (ScalarsConverterFactory).

To use the scalars converter, add the converter-scalars dependency (com.squareup.retrofit2:converter-scalars) alongside the Retrofit and OkHttp dependencies in build.gradle:

To add the scalars converter to the Retrofit builder, pass ScalarsConverterFactory.create() to addConverterFactory():

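We can add multiple converters to the builder as well, but the order matters because Retrofit uses the first converter that is compatible with the requested type. The Gson converter accepts every type, so the scalars converter should be added before it.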
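To see why the order matters, here is a simplified plain-Java model of a "first compatible converter wins" lookup. It is only a sketch and not Retrofit’s actual code: the Factory interface, SCALARS_LIKE, JSON_LIKE and convert are hypothetical names made up for illustration.

```java
import java.util.List;
import java.util.function.Function;

// Simplified, hypothetical model of converter lookup; this is NOT Retrofit's code.
class ConverterOrderSketch {

    // A factory returns a converter for a type it supports, or null otherwise.
    interface Factory {
        Function<String, Object> converterFor(Class<?> type);
    }

    // Supports only String, like a scalars-style converter.
    static final Factory SCALARS_LIKE = type -> {
        if (type != String.class) {
            return null;
        }
        return body -> "scalars:" + body;
    };

    // Never declines a type, like a JSON converter that accepts everything.
    static final Factory JSON_LIKE = type -> body -> "json:" + body;

    // Walks the factories in registration order and uses the first non-null converter.
    static Object convert(List<Factory> factories, Class<?> type, String body) {
        for (Factory factory : factories) {
            Function<String, Object> converter = factory.converterFor(type);
            if (converter != null) {
                return converter.apply(body);
            }
        }
        throw new IllegalArgumentException("No converter for " + type);
    }
}
```

With SCALARS_LIKE registered first, a String request is handled by it and every other type falls through to JSON_LIKE. With JSON_LIKE first, the scalars-style factory is never reached.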

If we don’t want to use the scalars converter, we can use the RequestBody and ResponseBody classes from OkHttp as the types instead.
ResponseBody lets us receive any type of response data through response.body() in the enqueue callback, and RequestBody lets us send any type of payload. The only disadvantage is that you need to create the RequestBody object yourself.

Web pages are written in HTML, so we’ll use the Jsoup library to parse them.

In the following section, we’ll use the scalars converter to receive the web page requested through Retrofit as a String. We’ll then extract all the words from its text and show each word with its count in a RecyclerView.

We’ll also add a filter function that filters the words by their count. We’ll use a HashMap to store the word/count pairs and sort it by value.
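As a rough illustration of the filtering idea, the sketch below keeps only the words whose count reaches a minimum threshold. It assumes that "filter by count" means a minimum-count cutoff; the class name WordCountFilter and the method filterByMinCount are hypothetical and not taken from the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the tutorial's project code.
class WordCountFilter {

    // Keeps only the entries whose count is at least minCount,
    // preserving the iteration order of the input map.
    static Map<String, Integer> filterByMinCount(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> filtered = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }
}
```

Returning a new map leaves the original word counts untouched, so the full list can be restored when the filter is cleared.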

Project Structure


The dependencies in build.gradle are:


The code for the activity_main.xml is defined below:

The code for the class is given below:

The "." path specifies no relative path, so the request is sent to the base URL itself.

The code for the is given below:

The following code parses the HTML string:

Inside createHashMap, we remove all special characters and leave numeric tokens out of the HashMap.
sortByValueDesc uses a Comparator to compare the values and sort the HashMap entries in descending order.
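The two steps can be sketched in plain Java as follows. This is a minimal sketch rather than the project’s actual code: only the method names createHashMap and sortByValueDesc come from the text above, while the signatures, the class name WordFrequencySketch, the lower-casing of words and the exact regular expressions are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; signatures and normalisation rules are assumptions.
class WordFrequencySketch {

    // Counts words after stripping special characters and skipping purely numeric tokens.
    static Map<String, Integer> createHashMap(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            // Assumption: a "special character" is anything that is not an ASCII letter or digit.
            String word = token.replaceAll("[^A-Za-z0-9]", "").toLowerCase(Locale.ROOT);
            if (word.isEmpty() || word.matches("\\d+")) {
                continue; // nothing left after cleaning, or the token is a number
            }
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }

    // Copies the entries into a list, sorts them by value in descending order,
    // and returns a LinkedHashMap so that the sorted order is preserved.
    static Map<String, Integer> sortByValueDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}
```

A plain HashMap has no defined iteration order, which is why the sorted entries are copied into a LinkedHashMap before being handed to the adapter.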

The code for the list_item_words.xml which contains the layout for RecyclerView rows is given below:

The code for the class is given below:

The output of the above application is given below:

Figure: Android Retrofit web scraping output

The output above shows every word present on the JournalDev home page at the time of writing this tutorial, along with its frequency.

This brings an end to this tutorial. You can download the project from the link below:

By admin
