RegEx

Today we are talking about RegEx, which stands for Regular Expression.

These are text strings that describe a pattern and allow you to extract information by matching, searching and replacing.

The history of regular expressions dates back to the 1950s, when they were defined by the American mathematician Stephen Cole Kleene. Regular expressions were then first implemented by the computer scientist Kenneth Lane Thompson in the editor QED between 1966 and 1967.

We can use regular expressions to find text that matches a given pattern and optionally replace those matches with new text.
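As a minimal sketch of this idea in Python, using the standard re module: the sample sentence and the pattern below are illustrative assumptions of ours, not part of any standard.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Invoice 2021-0042 was paid; invoice 2021-0107 is still open."

# Pattern: four digits, a hyphen, four digits.
pattern = re.compile(r"\d{4}-\d{4}")

# Find every piece of text that matches the pattern.
matches = pattern.findall(text)
print(matches)

# Optionally replace each match with new text.
redacted = pattern.sub("[NUMBER]", text)
print(redacted)
```

The same compiled pattern serves both steps: first it locates the matches, then it replaces them.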

To be more precise, imagine a text from which we want to extract some parts, check that they are well formed (matching), search for specific content, and perform search-and-replace operations.

As the examples further on show, there is no single universal syntax. The best-known RegEx syntaxes are those of POSIX systems and of the Perl language.

Regexes are often considered very complex, as the following comic shows:

Credits xkcd - https://xkcd.com/1171/

For example, imagine that we want to check the syntax of an email address. The format of email addresses is defined by an Internet standard, RFC 5322, so a valid address must conform to the criteria stated there.

A simplified RegEx for this check, which does not cover every case allowed by the standard, is the following:

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$

With this expression you can verify, for example, that an email address is in the format:

username@domain.tld

or in the format

name.surname@domain.tld

Let us break down the syntax of the expression above:

  • the character ^ (circumflex accent) is an “anchor”: it requires the match to begin at the start of the string;
  • the first part, namely [a-zA-Z0-9_.+-], is a character class for the username: it accepts letters from “a” to “z”, lowercase or uppercase, digits, the underscore, the dot, the plus sign and the dash;
  • the sign + after the class means one or more occurrences of those characters; the symbol @ then follows literally;
  • the second part, [a-zA-Z0-9-]+\., refers to the domain: one or more letters, digits or dashes, followed by a literal dot (the backslash removes the dot’s special meaning of “any character”);
  • the final part, [a-zA-Z0-9-.]+, describes the TLD: one or more letters, digits, dashes or dots, so that multi-part endings such as co.uk are also accepted;
  • the character $ is also an “anchor” (like the symbol ^) and requires the match to end at the end of the string.
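To see the expression at work, here is a minimal sketch in Python using the standard re module. The helper name looks_like_email and the sample addresses are our own assumptions, chosen only for illustration; the pattern is the one discussed above.

```python
import re

# The pattern from the text, compiled once for reuse.
EMAIL = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")

def looks_like_email(address):
    """Return True if the address matches the simplified pattern."""
    return EMAIL.match(address) is not None

# Two well-formed samples and two malformed ones (hypothetical data).
candidates = [
    "username@domain.tld",
    "name.surname@domain.tld",
    "no-at-sign.domain.tld",
    "user@domain",
]
for candidate in candidates:
    print(candidate, looks_like_email(candidate))
```

The first two samples pass. The third fails because it has no @, and the fourth because the domain has no dot followed by a TLD.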

There is more than one way to write a RegEx that checks an email address. Below are other examples that pursue the same objective, each with slightly different rules.

^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$

or

^([a-z0-9_\.\+-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$
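The three expressions are not strictly equivalent, and a quick experiment makes the differences visible. The following is only a sketch in Python: the labels “first”, “second” and “third” and the sample inputs are our own assumptions.

```python
import re

# The three patterns quoted in the text; the labels are ours.
PATTERNS = {
    "first": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$",
    "second": r"^([a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})*$",
    "third": r"^([a-z0-9_\.\+-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$",
}

def check(address):
    """Map each pattern label to True/False for the given address."""
    return {name: re.match(rx, address) is not None
            for name, rx in PATTERNS.items()}

# A lowercase address, a mixed-case address and an empty string.
samples = ["username@domain.tld", "Name.Surname@Domain.TLD", ""]
for sample in samples:
    print(repr(sample), check(sample))
```

All three accept the lowercase address. The third rejects the mixed-case one because its character classes list only lowercase letters, and the second also accepts an empty string because its group is followed by *, which allows zero repetitions.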

Verifying an email address should not be treated lightly, because there can be several reasons for carrying out the check, from the accuracy of the syntax to security requirements.

RegEx editors

Regexes must respect a syntax, and several free tools on the Internet let you test them. Here are some of them:

  1. Regular Expressions 101
  2. RegEx Testing from Dan’s Tools
  3. RegExr
  4. Cooding.Tools
  5. pythex

There are also RegEx editors that you can install and use as apps on your own machine.

In computer science, regexes are essential and are used, for example, in searches within SQL databases, in Google Analytics, in the Apache web server, in tax code calculation, in processing HTML code, and so on.

grep and RegEx

After regexes came grep (global regular expression print), a standalone utility that is part of the command set of UNIX-based operating systems.

grep is specified by the Portable Operating System Interface (POSIX), a family of international standards defined by the IEEE for Unix-like operating systems, including macOS, and published as the ISO/IEC 9945 standard.

Those familiar with the Linux and macOS operating systems know grep's functions and potential very well.

To see the syntax, it is enough to call up the help with the command grep --help. Many resources on the Internet also explain very well how to use grep.

In summary, grep allows you to search for a specific string inside one or more files.

For example, by typing the command:

grep text filename

we start a search for what we have indicated as “text” inside the file that, for simplicity, we have called “filename”.

The syntax of grep allows us to extend the search to several files, as in the following example:

grep text filename1 filename2 filename3

When we are in a folder, we can use the command

grep text *

to search for “text” in all the files in that folder.

We can also search the folder and all the folders inside it with the command:

grep -r text *
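To make clearer what such a search does, here is a rough sketch in Python. It is not how grep is implemented, and Python's pattern syntax differs in places from grep's. The function names grep_file and grep_tree, and the throwaway files created for the demo, are our own assumptions.

```python
import re
import tempfile
from pathlib import Path

def grep_file(pattern, path):
    """Return (line number, line) pairs for the lines of path that match."""
    regex = re.compile(pattern)
    hits = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                hits.append((number, line.rstrip("\n")))
    return hits

def grep_tree(pattern, folder):
    """Search every file under folder, including its subfolders."""
    results = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            hits = grep_file(pattern, path)
            if hits:
                # Key each result by its path relative to the start folder.
                results[path.relative_to(folder).as_posix()] = hits
    return results

# Demo on a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as folder:
    root = Path(folder)
    (root / "sub").mkdir()
    (root / "notes.txt").write_text("first line\nsome text here\n", encoding="utf-8")
    (root / "sub" / "deep.txt").write_text("no match\nmore text\n", encoding="utf-8")
    found = grep_tree(r"text", root)

print(found)
```

Here grep_file roughly corresponds to a search in a single file, and grep_tree to the recursive search across subfolders.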

We won't go on listing the various grep options, because you can find them in the help or in the numerous online resources mentioned above.

In conclusion, grep is a valuable resource for those who need to perform searches within files on Linux and macOS systems.

Differences between RegEx and search or find functions

It might seem that regexes do the same job as the “search” and “find” functions.

In reality, they are different things.

When we look for one specific, literal string, we use “find” and “search” functions; if instead we want to know whether the text contains anything that matches a given pattern, we have to use regular expressions (RegEx).
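The difference can be sketched in a few lines of Python. The sample sentence and the date format are assumptions of ours, used only for illustration.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "The hearing was moved from 12/03/2021 to 09/04/2021."

# A plain "find" looks for one exact string and reports where it starts.
position = text.find("12/03/2021")
print(position)

# A regex describes a shape (two digits, slash, two digits, slash,
# four digits) and finds every date, whatever its actual value.
dates = re.findall(r"\d{2}/\d{2}/\d{4}", text)
print(dates)
```

With “find” we must already know the date we are looking for; with the regex we only need to know what a date looks like.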

These concepts appear, for example, in Python, whose re module offers several RegEx functions, including:

  • re.match()
  • re.search()
  • re.findall()
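As a brief illustration only, the following sketch shows how the three functions behave on the same input. The sample text and the pattern are our own assumptions.

```python
import re

# Hypothetical sample text and pattern, used only for illustration.
text = "Contact: alice@example.org or bob@example.org"
pattern = r"[a-z]+@example\.org"

# re.match only succeeds if the pattern matches at the very start.
at_start = re.match(pattern, text)

# re.search scans the whole string and returns the first match.
first = re.search(pattern, text)

# re.findall returns every non-overlapping match as a list of strings.
everything = re.findall(pattern, text)

print(at_start)
print(first.group())
print(everything)
```

Because the text begins with “Contact:”, re.match finds nothing, while re.search returns the first address and re.findall returns both.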

This is not the right place to delve further into such technical topics.

Can RegEx be helpful in the legal field?

RegEx may seem to be of interest only to those who work in IT and computer science.

Our opinion is different: since IT and computer science are ubiquitous in both personal and business environments, we should pay attention to productivity in order to get the most out of our workflows. Put like that, it sounds simple. In reality, we think that each of us wants to get the best results, both in terms of time spent and of the final outcome, and that this is not an expectation or goal reserved for users considered “Pro”.

In our humble opinion, this is not about classifying users as “Pro” or “basic” (or any other terminology), but about using computer science to improve the efficiency of those who carry out certain activities, whether in their personal or working lives.

We consider the use of RegEx in the “legal” field to be important and relevant. In that domain, for example, RegEx lets us extract information from a file, or find the text that matches a given pattern and optionally replace those matches with new text.
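As a small sketch of what this could look like in Python: the excerpt, the citation style “art. N” and the email shape used below are all assumptions of ours, not a general rule for legal texts.

```python
import re

# Hypothetical excerpt; the citation style "art. N" is an assumption.
clause = ("The controller relies on art. 6 and art. 13 of the Regulation; "
          "contact mario.rossi@example.org for details.")

# Extract every article number cited in the text.
articles = re.findall(r"\bart\.\s*(\d+)", clause, flags=re.IGNORECASE)
print(articles)

# Replace anything shaped like an email address with a placeholder.
redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[REDACTED]", clause)
print(redacted)
```

The first call extracts information (the article numbers); the second finds text matching a pattern and replaces it.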

Imagine having to work on a medium-length or long text to check specific data or to replace some of it; or imagine having to extract data from a CSV file.
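A CSV extraction of this kind might be sketched as follows in Python. The column names, the rows and the reference format “RG number/year” are invented for illustration; in practice the data would be read from a real file.

```python
import csv
import io
import re

# Hypothetical CSV content; in practice this would come from a file.
raw = io.StringIO(
    "name,reference\n"
    "Rossi,RG 1234/2020\n"
    "Bianchi,pending\n"
    "Verdi,RG 87/2021\n"
)

# Assumed reference shape: "RG", a number, a slash, a four-digit year.
reference = re.compile(r"^RG (\d+)/(\d{4})$")

extracted = []
for row in csv.DictReader(raw):
    match = reference.match(row["reference"])
    if match:
        # Keep the name plus the number and year as integers.
        extracted.append((row["name"], int(match.group(1)), int(match.group(2))))

print(extracted)
```

Rows whose reference does not have the expected shape, such as “pending”, are simply skipped.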

Someone might object that they can do this simply by using apps like Microsoft Word or Excel.
That may be so, but in our opinion it would neither give the best result nor achieve the accuracy of RegEx, especially when the search targets a particular pattern.

The advantage of plain text and Markdown is unquestionable.

Only with plain text or Markdown files can you achieve the best results in RegEx searches. Plain text can be used in other apps and then “treated” with formatting; the reverse can hardly happen unless you use a format converter like Pandoc, and only if the conversion is possible.

In conclusion, we believe that RegEx is valid and valuable in the legal field too. Its applications are many and cannot simply be listed, not least because everyone may have a different purpose.

For those approaching RegEx for the first time, we suggest simply trying it out, perhaps with a powerful editor that helps get the most out of it.

In our view, logic and mathematics are not unrelated to law and contribute to realising common goals.

