We live in an information age where large volumes of data abound and the ability to extract meaningful information from data is a key differentiator for success. Fields such as analytics, data mining and data science are devoted to the study of data. In this article we will look at an essential, simple and powerful tool in the data scientist’s toolbox – the regular expression, or regex for short. We will learn about regexes and how to use them in Python scripts to process textual data.
Text is one of the basic forms of data, and humans use text to communicate and express themselves in web pages, blog posts, documents, Twitter/RSS feeds, etc. This is where regular expressions are handy and powerful. Be it filtering data from web pages, data analytics or text mining, regular expressions are a preferred tool for these tasks. They make text processing tasks, such as those in natural language processing (NLP), simpler, thereby reducing the effort, time and errors that are bound to occur while writing manual scripts.
In this article, we will understand what regular expressions are and how they can be used in Python. Next, we will walk through the usage and applications of commonly used regular expressions.
By the end of the article, you will learn how you can leverage the power of regular expressions to automate your day-to-day text processing tasks.
What is a Regular Expression?
A regular expression (RE or regex) is a sequence of characters which describes textual patterns. Using regular expressions we can match input data for certain patterns (aka searching), extract matching strings (filtering, splitting) as well as replace occurrences of patterns with substitutions, all with a minimum amount of code.
Most programming languages have built-in support for defining and operating with regular expressions. Perl, Python and Java are some notable programming languages with first-class support for regular expressions. The standard library functions in such programming languages provide highly performant, robust and (almost) bug-free implementations of the regular expression operations (searching, filtering, etc.), which make it easy to rapidly produce high-quality applications that process text efficiently.
Getting started with Python Regular expressions
Python provides a built-in module called `re` to deal with regular expressions. To import Python’s `re` module, use:
import re
The `re` module provides a set of functions to perform common operations using regular expressions.
Searching for Patterns in a String
One of the most common tasks in text processing is to check whether a string contains a certain pattern. For instance, you may want to perform an operation on a string only if it contains a number. Or, you may want to validate a password by ensuring it contains numbers and special characters. The `match` and `search` operations of RE provide this capability.
Python offers two primitive operations based on regular expressions: the `re.match()` function checks for a pattern match at the beginning of the string, whereas `re.search()` checks for a pattern match anywhere in the string. Let’s have a look at how these functions can be used:
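The two checks mentioned above (does a string contain a number, and does a password contain digits and special characters) can be sketched as follows. This is a minimal illustration, not a recommended password policy: the helper names and the rule itself are assumptions made for the example, and `\d` (a digit) and `[^A-Za-z0-9]` (anything that is not a letter or digit) are explained in the building-blocks section further down.

```python
import re

def contains_digit(s):
    """Return True if s contains at least one digit anywhere."""
    return re.search(r"\d", s) is not None

def looks_like_valid_password(s):
    """Hypothetical rule: at least one digit and one non-alphanumeric character."""
    has_digit = re.search(r"\d", s) is not None
    has_special = re.search(r"[^A-Za-z0-9]", s) is not None
    return has_digit and has_special

print(contains_digit("Engine No. 2"))             # True
print(contains_digit("Analytical Engine"))        # False
print(looks_like_valid_password("babbage1837"))   # False: no special character
print(looks_like_valid_password("babbage#1837"))  # True
```

Note that `re.search()` is used here rather than `re.match()`, because the digit or special character may appear anywhere in the string.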
The `re.match()` function
The `re.match()` function checks if the RE matches at the beginning of the string. For example, initialise a variable `text` with some text, as follows:
text = ['Charles Babbage is regarded as the father of computing.', 'Regular expressions are used in search engines.']
Let’s write a simple regular expression that matches a string of any length containing anything as long as it starts with the letter C:
regex = r"C.*"
For now, let’s not worry about how the declaration above is interpreted and assume that the above statement creates a variable called regex that matches strings starting with C.
We can test if the strings in text match the regex as shown below:
for line in text:
    ans = re.match(regex, line)
    print(type(ans))
    if ans:
        print(ans.group(0))
Go ahead and run that code in a Python session.
[Screenshot: Regex Match Search Example 1]
The first string matches this regex, since it starts with the character “C”, whereas the second string starts with the character “R” and does not match the regex. The `match` function returns a match object (displayed as `re.Match` in recent Python 3 releases and as `_sre.SRE_Match` in older ones) if a match is found, else it returns `None`.
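Since `re.match()` may return `None`, calling `group()` on the result without checking it first raises an `AttributeError`. A small sketch of the usual guard is shown below; `matched_text` is a hypothetical helper introduced only for this example.

```python
import re

lines = [
    "Charles Babbage is regarded as the father of computing.",
    "Regular expressions are used in search engines.",
]

def matched_text(pattern, line):
    """Return the matched text, or None when the pattern does not match at the start."""
    m = re.match(pattern, line)
    return m.group(0) if m is not None else None

results = [matched_text(r"C.*", line) for line in lines]
print(results)
# ['Charles Babbage is regarded as the father of computing.', None]
```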
In Python, regular expressions are usually specified as raw string literals. A raw string literal has the prefix `r` immediately followed by the string literal in quotes. Unlike in normal string literals, Python does not interpret the backslash `\` as an escape character inside raw string literals. This is important since backslash sequences have a different meaning in regular expression syntax than in standard Python string literals. More on this later.
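To see what the `r` prefix changes, the short sketch below compares a normal and a raw literal. The sample sentence is made up for the illustration.

```python
import re

normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by the letter n
print(len(normal), len(raw))   # 1 2

# Without the r prefix, every regex backslash has to be doubled.
print("\\d+" == r"\d+")        # True

numbers = re.findall(r"\d+", "10 engines and 20 gears")
print(numbers)                 # ['10', '20']
```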
Once a match is found, we can get the part of the string that matched the pattern using the `group()` method on the returned match object. We can get the entire matching string by passing 0 as the argument.
ans.group(0)
Sample Output:
Charles Babbage is regarded as the father of computing.
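Besides `group()`, a match object also reports where the match was found. The sketch below uses the literal pattern `Charles` so that the positions are easy to verify by hand; `start()`, `end()` and `span()` are standard match-object methods.

```python
import re

line = "Charles Babbage is regarded as the father of computing."
m = re.match(r"Charles", line)

if m:
    print(m.group(0))          # Charles
    print(m.group())           # group() with no argument is the same as group(0)
    print(m.start(), m.end())  # 0 7
    print(m.span())            # (0, 7)
```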
Building blocks of regular expressions
In this section we will look at the elements that make up a regex and how regexes are built. A regex contains groups and each group contains various specifiers such as character classes, repeaters, identifiers etc. Specifiers are strings that match particular types of pattern and have their own format for describing the desired pattern. Let’s look at the common specifiers:
Identifiers
An identifier matches a subset of characters, e.g., lowercase letters, numeric digits or whitespace. Regex provides a list of handy identifiers to match different subsets. Some frequently used identifiers are:
- \d = matches digits (numeric characters) in a string
- \D = matches anything but a digit
- \s = matches whitespace (e.g., space, TAB, newline)
- \S = matches anything but whitespace
- \w = matches letters, digits and the underscore
- \W = matches anything but a letter, digit or underscore
- \b = matches a word boundary, i.e., the position between a word character and a non-word character such as a space, hyphen or colon
- . = matches any character, except for a new line. Hence, it is called the wildcard operator. Thus, “.*” will match any character, any number of times.
Note: In the above regex example and all others in this section we omit the leading `r` from the regex string literal for the sake of readability. Any literal given here should be declared as a raw string literal when used in Python code.
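The identifiers listed above can be tried out quickly with `re.findall()` and `re.sub()`, both of which are covered later in this article. The sample strings below are made up for the illustration.

```python
import re

sample = "Engine No. 2 weighs 5 tons"

print(re.findall(r"\d", sample))    # ['2', '5']
print(re.sub(r"\D", "", sample))    # 25
print(len(re.findall(r"\s", sample)))  # 5 (the five spaces)
print(re.findall(r"\S+", sample))   # ['Engine', 'No.', '2', 'weighs', '5', 'tons']
print(re.findall(r"\w+", sample))   # ['Engine', 'No', '2', 'weighs', '5', 'tons']
print(re.findall(r"\W", "a-b c"))   # ['-', ' ']

# \b is zero-width: "to" inside "photos" is not at a word boundary.
print(re.findall(r"\bto\w*", "tons of photos to sort"))  # ['tons', 'to']

# The wildcard does not match the newline character.
print(re.findall(r".", "a\nb"))     # ['a', 'b']
```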
Repeaters
A repeater specifies how many times the preceding element may occur. Below are some commonly used repeaters.
The `*` symbol
The asterisk operator indicates 0 or more repetitions of the preceding element, as many as possible. ‘ab*’ will match ‘a’, ‘ab’, ‘abb’ or ‘a’ followed by any number of b’s.
The `+` symbol
The plus operator indicates 1 or more repetitions of the preceding element, as many as possible. ‘ab+’ will match ‘ab’, ‘abb’ or ‘a’ followed by any larger number of b’s; it will not match ‘a’ on its own.
The `?` symbol
This symbol specifies that the preceding element occurs at most once, i.e., it may or may not be present in the string to be matched. For example, ‘ab?’ will match ‘a’ and ‘ab’.
The `{n}` curly braces
The curly braces specify that the preceding element is to be matched exactly n times. b{4} will match a run of exactly four ‘b’ characters, no more and no fewer.
The symbols *,+,? and {} are called repeaters, as they specify the number of times the preceding element is repeated.
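The behaviour of the repeaters can be checked with a small sketch. It uses `re.fullmatch()`, which succeeds only if the whole string matches the pattern, so each candidate string is tested in its entirety. The helper `fully_matches` and the candidate strings are chosen for this example only.

```python
import re

def fully_matches(pattern, s):
    """True when the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

candidates = ("a", "ab", "abb", "abbb")
table = {
    pattern: [s for s in candidates if fully_matches(pattern, s)]
    for pattern in (r"ab*", r"ab+", r"ab?", r"ab{2}")
}

for pattern, matched in table.items():
    print(pattern, matched)
# ab* ['a', 'ab', 'abb', 'abbb']
# ab+ ['ab', 'abb', 'abbb']
# ab? ['a', 'ab']
# ab{2} ['abb']

# "As many as possible": the repeater consumes all three b's before stopping.
greedy = re.match(r"ab*", "abbbc").group(0)
print(greedy)  # abbb
```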
Miscellaneous specifiers
The `[]` square brackets
The square brackets match any single character enclosed within them. For example, [aeiou] will match any of the lowercase vowels, while [a-z] will match any single character from a to z (case-sensitive). This is also called a character class.
The `|`
The vertical bar is used to separate alternatives. photo|foto matches either “photo” or “foto”.
The `^` symbol
The caret symbol specifies the position for the match, at the start of the string, except when used inside square brackets. For example, “^I” will match a string starting with “I” but will not match strings that don’t have “I” at the beginning. This is essentially the difference between the `re.match` function and the `re.search` function: by default, `re.search` with a pattern that begins with “^” behaves like `re.match`.
When used as the first character inside a character class it inverts the matching character set for the character class. For example, “[^aeiou]” will match any character other than a, e, i, o or u.
The `$` symbol
The dollar symbol specifies the position for a match, at end of the string.
The `()` parentheses
Parentheses are used for grouping different symbols of an RE so that they act as a single block. ([a-z]\d+) will match a lowercase letter from a to z followed by one or more digits. The whole match is treated as a group and can be extracted from the string. More on this later.
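The miscellaneous specifiers can be exercised in the same way. The sketch below is only an illustration with made-up sample strings; in the last example the letter and the digits are put in two separate groups so that each part can be extracted on its own.

```python
import re

print(re.findall(r"[aeiou]", "regex"))             # ['e', 'e']
print(re.findall(r"[^aeiou]", "regex"))            # ['r', 'g', 'x']
print(re.findall(r"photo|foto", "photo, foto"))    # ['photo', 'foto']
print(bool(re.search(r"^I", "I like regex")))      # True
print(bool(re.search(r"^I", "Do I?")))             # False
print(bool(re.search(r"regex$", "I like regex")))  # True

# group(0) is the whole match; group(1) and group(2) are the parenthesised parts.
m = re.search(r"([a-z])(\d+)", "seat b42")
print(m.group(0), m.group(1), m.group(2))          # b42 b 42
```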
Typical use-cases for Python Regular Expressions
Now that we have discussed the building blocks of regular expressions, let’s do some hands-on regex writing.
The `re.match()` function revisited
It is possible to match letters, both uppercase and lowercase, using the match function. In the snippet below, `str` is assumed to hold a sentence beginning with the word “The”.
ans = re.match(r"[a-zA-Z]+", str)
print(ans.group(0))
The above regex matches the first word in the string, provided the string starts with a letter. The `+` operator specifies that at least one letter must be present.
Sample Output:
The
As you see, the regex matches the first word found in the string. After the word “The”, there is a space, which is not a letter, so matching stops there and the function returns only the word it has matched so far. Now suppose a string starts with a number. In this case, the `match()` function returns `None`, even though the string has letters following the number. For example,
str = "1837 was the year when Charles Babbage invented the Analytical Engine"
ans = re.match(r"[a-zA-Z]+", str)
print(type(ans))
The call above returns `None`, because the `match()` function only checks for a match at the beginning of the string. Although the string contains letters, they are preceded by a number, so `match()` finds nothing. This problem can be avoided using the `search()` function.
The `re.search()` function
The `search()` function looks for a specified pattern in a string, similar to the `match()` function. The difference is that `search()` scans the whole string and returns the first location where the pattern matches, instead of checking only the beginning of the string. Let’s try the same example using the `search()` function.
str = "1837 was the year when Charles Babbage invented the Analytical Engine"
ans = re.search(r"[a-zA-Z]+", str)
print(ans.group(0))
Sample Output:
was
This is because the `search()` function scans the whole string: although the string does not start with a letter, the pattern matches further along, and the first such match is returned.
Matching strings from start and from end
We can use regex to find if a string starts with a particular pattern using the caret operator `^`. Similarly, the dollar operator `$` is used to check if a string ends with a given pattern. Let’s write a regex to understand this:
str = "1837 was the year when Charles Babbage invented the Analytical Engine"
if re.search(r"^1837", str):
    print("The string starts with a number")
else:
    print("The string does not start with a number")
Sample Output:
The string starts with a number
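The example above only exercises the caret. A short sketch of the dollar operator, and of both anchors combined, could look like this (the variable name `s` is chosen for the example):

```python
import re

s = "1837 was the year when Charles Babbage invented the Analytical Engine"

print(bool(re.search(r"^\d+", s)))     # True: the string starts with digits
print(bool(re.search(r"Engine$", s)))  # True: the string ends with "Engine"
print(bool(re.search(r"^Engine", s)))  # False: "Engine" is not at the start

# Both anchors together: a four-digit number first, "Engine" last.
print(bool(re.search(r"^\d{4}\b.*Engine$", s)))  # True
```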
The `re.sub()` function
We have explored using regex to find a pattern in a string. Let’s move ahead and see how to substitute text in a string. For this, we use the `sub()` function. The `sub()` function searches for a particular pattern in a string and replaces every occurrence with new text.
str = "Analytical Engine was invented in the year 1837"
ans = re.sub(r"Analytical Engine", "Electric Telegraph", str)
print(ans)
As you see, the first parameter of the `sub()` function is the regex that searches for a pattern to substitute. The second parameter contains the new text you wish to substitute for the old one. The third parameter is the string on which the “sub” operation is performed.
Sample Output:
Electric Telegraph was invented in the year 1837
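`sub()` also accepts an optional `count` argument that limits how many replacements are made, and the related `subn()` function additionally reports how many substitutions took place. The sketch below illustrates both on the same sentence; the replacement texts are arbitrary.

```python
import re

s = "Analytical Engine was invented in the year 1837"

# Replace only the first two runs of whitespace.
limited = re.sub(r"\s+", "_", s, count=2)
print(limited)   # Analytical_Engine_was invented in the year 1837

# subn() returns a (new_string, number_of_substitutions) tuple.
masked, n = re.subn(r"\d", "#", s)
print(masked, n)  # Analytical Engine was invented in the year #### 4

# When the pattern does not occur, the string is returned unchanged.
print(re.sub(r"Telegraph", "Engine", s) == s)  # True
```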
Writing Regexes with identifiers
Let’s understand how to use a regex containing identifiers, with an example. To remove digits from a string, we use the regex below:
str = "Charles Babbage invented the Analytical Engine in the year 1837"
ans = re.sub(r"\d", "", str)
print(ans)
The above script locates digits in a string using the identifier `\d` and replaces each of them with an empty string.
Sample Output:
Charles Babbage invented the Analytical Engine in the year
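Other identifiers work with `sub()` in the same way. The sketch below, using made-up input strings, collapses runs of whitespace with `\s+`, keeps only the digits with `\D`, and removes the digits together with the space before them so that no trailing space is left behind.

```python
import re

messy = "Charles   Babbage \t invented\nthe Analytical Engine"
clean = re.sub(r"\s+", " ", messy)
print(clean)        # Charles Babbage invented the Analytical Engine

digits_only = re.sub(r"\D", "", "in the year 1837")
print(digits_only)  # 1837

s = "Charles Babbage invented the Analytical Engine in the year 1837"
no_year = re.sub(r"\s*\d+", "", s)
print(no_year)      # Charles Babbage invented the Analytical Engine in the year
```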
Splitting a string
The `re` package provides the `split()` function to split strings. This function returns a list of split tokens. For example, the following “split” call splits a string of words wherever a comma is found:
str = "Charles Babbage was considered to be the father of computing, after his invention of the Analytical Engine, in 1837"
ans = re.split(r",", str)
print(ans)
Sample Output:
['Charles Babbage was considered to be the father of computing', ' after his invention of the Analytical Engine', ' in 1837']
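Splitting on a bare comma leaves the whitespace after each comma attached to the next token. One way to avoid this, sketched below, is to make the separator pattern absorb that whitespace with `,\s*`. The optional `maxsplit` argument of `re.split()` limits the number of splits.

```python
import re

s = ("Charles Babbage was considered to be the father of computing, "
     "after his invention of the Analytical Engine, in 1837")

tokens = re.split(r",\s*", s)
print(tokens)
# ['Charles Babbage was considered to be the father of computing',
#  'after his invention of the Analytical Engine', 'in 1837']

first_split = re.split(r",\s*", s, maxsplit=1)
print(first_split)
# ['Charles Babbage was considered to be the father of computing',
#  'after his invention of the Analytical Engine, in 1837']
```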
The `re.findall()` function
The `findall()` function returns a list that contains all the non-overlapping matches of the pattern in a string.
Let’s write a script that uses the `findall()` function to find the domains in a list of email IDs:
result = re.findall(r'@\w+\.\w+', '[email protected], [email protected], [email protected]')
print(result)
Sample Output:
['@gmail.com', '@yahoo.in', '@samskitchen.com']
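The addresses in the listing above appear in redacted form (`[email protected]`), so here is a runnable sketch of the same idea. The local parts (`alice`, `bob`, `sam`) are hypothetical placeholders; only the domains are taken from the sample output. The second call shows that when the pattern contains groups, `findall()` returns the groups as tuples instead of the whole match.

```python
import re

# Hypothetical addresses: only the domains come from the sample output.
emails = "alice@gmail.com, bob@yahoo.in, sam@samskitchen.com"

domains = re.findall(r"@\w+\.\w+", emails)
print(domains)  # ['@gmail.com', '@yahoo.in', '@samskitchen.com']

parts = re.findall(r"@(\w+)\.(\w+)", emails)
print(parts)    # [('gmail', 'com'), ('yahoo', 'in'), ('samskitchen', 'com')]
```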
Conclusion
In this article, we understood what regular expressions are and how they can be built from their fundamental building blocks. We also looked at the `re` module in Python and its functions for leveraging regular expressions. Regular expressions are a simple yet powerful tool in text processing and we hope you enjoyed learning about them as much as we did building this article. Where could you use regex in your work or hobby projects? Leave a comment below.