Regex in Python Using The re Module

What is Regex?

Regex, better known as regular expressions, lets you build complex patterns for finding substrings inside strings.

Python has other ways to find a substring, but regex can handle far more complicated patterns than a plain substring search.

In Python, the “re” module is used to work with regular expressions.

Installation

The “re” module is part of the Python standard library, so we don’t need to install it ourselves.

Note:

“All the modules and libraries that ship with Python make up the Python standard library.”

Getting Started

We will start with the basic implementation of regular expressions, then move on to some more complex data extraction techniques.

We will cover the following functions, discussing how to use them, what their drawbacks are, and when one is a better choice than another.

  • re.match()
  • re.search()
  • re.findall()

To use the re module, we first need to import it using the “import” keyword followed by the module name, i.e., import re.

The rest of this article assumes that you have run this import statement before using the module’s functionality.

re.match()

The re.match() function takes three parameters.

  • Pattern: the substring (or pattern) that you are searching for in the main string.
  • String: the main string that we are going to extract the pattern from.
  • Flags (optional): for example, re.IGNORECASE, which makes the match case-insensitive. The pattern will then match irrespective of letter case, i.e., uppercase or lowercase (a or A).

Syntax:

Implementation

Code:

Output:


If the pattern is found in mainString, this function returns a Match object. Printing it shows the span of the match (its starting and ending index) and the matched text itself.

However, if the pattern is not found, the re.match() function will return None.
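As a minimal sketch of both outcomes, the snippet below calls re.match() once with re.IGNORECASE and once without. The sample sentence and the variable names (main_string, found, missing) are made up for illustration and are not taken from the original listing.

```python
import re

# Hypothetical sample text; main_string plays the role of mainString.
main_string = "Python makes regular expressions approachable"

# The pattern sits at the very start of main_string and the flag ignores
# letter case, so re.match() returns a Match object.
found = re.match("python", main_string, re.IGNORECASE)
print(found)

# Without re.IGNORECASE, "python" does not equal "Python", so we get None.
missing = re.match("python", main_string)
print(missing)
```

Because a failed match returns None, it is usually worth checking the result before calling any method on it.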

We can also access the pattern’s index using the span() method:

Output: 

Let’s verify the span by slicing mainString with it.
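The two steps can be sketched together as follows. The sample sentence is again an assumption; the point is only that slicing the string with the values from span() gives back the matched text.

```python
import re

# Hypothetical sample text.
main_string = "Python makes regular expressions approachable"
found = re.match("Python", main_string)

# span() returns a (start, end) tuple; end is one past the last matched index.
start, end = found.span()
print(found.span())            # (0, 6)

# Slicing with the span returns exactly the text that matched.
print(main_string[start:end])  # Python
```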

Output: 

Drawback

re.match() only matches if the pattern is present at the beginning of mainString. If the pattern appears anywhere else, such as in the middle, re.match() returns None.
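A short sketch of this limitation, using a made-up sentence in which the pattern appears at the end rather than the start:

```python
import re

# Hypothetical sample text: "Python" is present, but not at the beginning.
main_string = "I am learning Python"

result = re.match("Python", main_string)
print(result)  # None, because re.match() only looks at the start
```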

Output:

To cover this drawback we can use re.search().

re.search()

The syntax of the re.search() function is the same as re.match(), but it can find the pattern anywhere in the string.

Syntax:

Implementation

Code: 

Output: 


The return type of re.search() is the same as that of re.match().
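To illustrate, here is a small sketch that runs re.search() on a sentence where the pattern is not at the start. The sentence is an assumption chosen for this example.

```python
import re

# Hypothetical sample text: the pattern sits at the end of the string.
main_string = "I am learning Python"

# re.search() scans the whole string, so the match is found.
result = re.search("python", main_string, re.IGNORECASE)
print(result)
print(result.span())   # (14, 20)
print(result.group())  # Python
```

As with re.match(), the result is None when nothing matches.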

Drawback

The search and match functions only report the first occurrence in mainString. What if the pattern occurs multiple times in the string? To handle this we can use the re.findall() function.

re.findall()

Instead of returning a match object, re.findall() returns a list of all the pattern occurrences in the string.

Syntax:

Implementation

Let’s have a look at how this can be implemented.

Code:

Output:

Explanation

In the code above we used an English quote as our mainString, then searched it for the pattern ‘now’ using the re.findall() function.

Next, we checked whether the list is empty using the bool() function, and printed the number of matches in the list using a format string.

Note: 

“If the list passed to bool() is empty, it returns False.”
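The original listing is not reproduced here, so the following is only a sketch of the steps described above. The sentence is a made-up stand-in for the quote, and the names main_string and occurrences are assumptions.

```python
import re

# Hypothetical sample text standing in for the quote.
main_string = "Start now, learn now, and keep practicing now."

# findall() returns a list with every non-overlapping occurrence.
occurrences = re.findall("now", main_string)

# bool() is False for an empty list and True otherwise.
if bool(occurrences):
    print(f"Found {len(occurrences)} occurrences of 'now'")
else:
    print("No occurrences found")
```

In everyday Python the check is often written simply as `if occurrences:`, which behaves the same way.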


Meta Characters and Their Usage

So far we have only covered the basic implementation of regex. Next we are going to learn about metacharacters, which are the backbone of regular expressions.

“Metacharacters are special characters that describe the pattern to be searched in the String. Each metacharacter has a special meaning.”

The table below lists the metacharacters and their usage. It’s okay if the core concept doesn’t click right away; the examples that follow should make it clearer:

Metacharacter    Purpose

^                Matches the start of the string.

$                Matches the end of the string.

[ ]              Specifies a set or range of characters or digits, e.g., [0-9], [A-Z].

{ }              Specifies how many times the preceding element must repeat, e.g., [1-5]{3}.

( )              Groups part of a pattern so that it is matched as a single unit.

+                Matches the preceding character if it occurs 1 or more times.

*                Matches the preceding character if it occurs 0 or more times. E.g., “jam*” matches “ja”, “jam” and “jamm”, so it also finds a match at the start of both “james” and “jamaica”.

?                Matches the preceding character if it occurs 0 or 1 time.

|                Matches either alternative; in other words, it works as an OR operator. E.g., James (Charles|Bond) matches either James Bond or James Charles.

\                Escape character.

. (period)       Matches any single character except a newline, including digits and special characters like #, %, &, etc.
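The behaviour of several metacharacters can be checked with a few one-line experiments. This is a rough sketch; the test strings are invented, and re.fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
print(bool(re.search(r"^James", "James Bond")))  # True
print(bool(re.search(r"Bond$", "James Bond")))   # True

# | inside a group ( ) acts as OR between the alternatives.
names = re.compile(r"James (Charles|Bond)")
print(bool(names.fullmatch("James Charles")))    # True
print(bool(names.fullmatch("James Smith")))      # False

# ? allows 0 or 1, * allows 0 or more, + requires 1 or more
# occurrences of the preceding character.
print(bool(re.fullmatch(r"colou?r", "color")))   # True
print(bool(re.fullmatch(r"jam*", "ja")))         # True
print(bool(re.fullmatch(r"jam+", "ja")))         # False

# [ ] with { }: exactly three characters, each in the range 1-5.
print(bool(re.fullmatch(r"[1-5]{3}", "135")))    # True
print(bool(re.fullmatch(r"[1-5]{3}", "137")))    # False
```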

Some of the character classes are given below, which we are going to use with metacharacters up next.

\d    Matches a single digit, e.g., “\d = 4” or “\d\d\d = 321”.

\w    Matches a single word character (a letter, digit or underscore), e.g., “\w = c” or “\w\w\w\w\w = James”.


Non-Optimized Example

Code: 

Output:

In the example above we have used \w and \d, where \w\w\w\w\w matches james, followed by \d\d\d matching 321.
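A sketch of what such a non-optimized pattern could look like is given below. The input string "james321" is inferred from the explanation above; the variable names are assumptions.

```python
import re

# Sample username inferred from the explanation: five letters, three digits.
username = "james321"

# One \w per character and one \d per digit, written out in full.
long_pattern = r"\w\w\w\w\w\d\d\d"

result = re.match(long_pattern, username)
print(result.group())  # james321
```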

Note:

The r before the string marks the pattern as a raw string, so Python leaves the backslashes untouched instead of treating them as escape sequences, and passes them to the regex engine as written.
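The difference a raw string makes can be seen by comparing string lengths. This is a small illustrative sketch and does not come from the original listing.

```python
import re

# In a normal string, \n is a single newline character.
print(len("\n"))   # 1

# In a raw string, the backslash and the n are kept as two characters.
print(len(r"\n"))  # 2

# A raw pattern equals the same pattern with every backslash doubled.
print(r"\d{3}" == "\\d{3}")  # True

# The regex engine receives \d{3} intact and finds the three digits.
digits = re.findall(r"\d{3}", "abc123")
print(digits)  # ['123']
```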

There is a shorter way to write the same pattern using metacharacters:

Output: 

Here we have used \w{5}, which matches 5 consecutive word characters, followed by \d{3}, which matches 3 consecutive digits.
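As a sketch, the shorter pattern can be compared with the written-out version on the same input. The string "james321" and the variable names are assumptions for illustration.

```python
import re

username = "james321"  # sample input assumed for illustration

# {5} and {3} replace the repeated \w and \d tokens.
short_pattern = r"\w{5}\d{3}"
long_pattern = r"\w\w\w\w\w\d\d\d"

short_result = re.match(short_pattern, username)
long_result = re.match(long_pattern, username)

# Both patterns describe the same text, so both match the same span.
print(short_result.group())                      # james321
print(short_result.span() == long_result.span())  # True
```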

Let’s have a look at the bigger picture.

Code:

This will return every username matching the pattern r'\w+\d{3}'.

Here \w+ means one or more word characters, followed by exactly 3 digits, i.e., \d{3}.
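The original data set is not available, so the sketch below runs the pattern on a made-up line of usernames. Only names that end in at least three digits are returned.

```python
import re

# Hypothetical sample data, not the article's original input.
text = "Users online: james321, anna007, bob12, charles999"

# \w+ takes one or more word characters, \d{3} then requires three digits.
usernames = re.findall(r"\w+\d{3}", text)
print(usernames)  # ['james321', 'anna007', 'charles999']
```

Here "bob12" is skipped because it ends in only two digits.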

Output:

Gmail Extraction

Before searching for Gmail addresses, first break the address pattern down into parts: Name + @gmail.com.

The Name part can contain characters or digits. The @gmail.com part is compulsory and is present in every Gmail address.
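One possible pattern for this breakdown is sketched below. The addresses are invented, and the pattern follows the simplified description above (word characters followed by @gmail.com); it is not a full email validator.

```python
import re

# Hypothetical sample text with made-up addresses.
text = "Contact james321@gmail.com or anna@yahoo.com; backup: bond007@gmail.com"

# \w+ covers the name part; the dot is escaped so it matches a literal ".".
gmail_pattern = r"\w+@gmail\.com"

addresses = re.findall(gmail_pattern, text)
print(addresses)  # ['james321@gmail.com', 'bond007@gmail.com']
```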

Code:

Output:

There are multiple ways to write a pattern for the same text, so there is no need to memorize patterns. Focus on the usage of metacharacters and how they work together.

Phone Number Extractions

I have generated a random list of contact numbers and we are going to validate the number (11 digits only) using regex.

Code:  

Output:

In the example above we have used the pattern ‘\+[0-9]{11}’.

Here \+ matches a literal + sign, because the backslash escapes it.

[0-9]{11} matches exactly 11 digits, each from 0 to 9.

Note:

  • Any metacharacter placed after a backslash (\) is escaped and treated as a literal, not as a metacharacter.
  • In the example above, we used \+, which means search for + in the string rather than apply the metacharacter “+”. Similarly, “\$, \*, \?” mean search for $, * and ? in the string respectively.
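A sketch of this validation is shown below. The contact numbers are invented, and re.fullmatch() is an assumption made here so that a number with extra digits is rejected rather than partially matched.

```python
import re

# Hypothetical contact numbers, not the article's original list.
contacts = ["+12345678901", "+1234567890", "12345678901", "+123456789012"]

# \+ is a literal plus sign; [0-9]{11} is exactly 11 digits.
phone_pattern = r"\+[0-9]{11}"

# fullmatch() requires the whole string to fit the pattern.
valid = [number for number in contacts if re.fullmatch(phone_pattern, number)]
print(valid)  # ['+12345678901']
```

The other three entries fail because they have 10 digits, no leading +, and 12 digits respectively.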

Where to Use Regex?

After all this learning, you may wonder where to use regex in real projects. Here are some of the use cases.

  • Email and Password validation.
  • Web-scraping.
  • Credit card validation.
  • Validating date format.
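As one illustration of the last use case, the sketch below checks that a string has the shape YYYY-MM-DD. The format and the helper name looks_like_iso_date are assumptions; the pattern checks only the layout of digits and dashes, not whether the date exists on the calendar.

```python
import re

# Four digits, a dash, two digits, a dash, two digits.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def looks_like_iso_date(text):
    """Return True if text has the YYYY-MM-DD shape (hypothetical helper)."""
    return DATE_PATTERN.fullmatch(text) is not None

print(looks_like_iso_date("2024-01-31"))  # True
print(looks_like_iso_date("31/01/2024"))  # False
print(looks_like_iso_date("2024-1-31"))   # False
```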
