
Created by Zed A. Shaw Updated 2024-02-17 04:54:36
 

Exercise 31: Regular Expressions

A regular expression (regex) is a succinct way to encode how a sequence of characters should be matched in a string. Regexes are normally thought of as "scary" but, as you know, anything wrapped in fear is usually just taught wrong. The reality is that regular expressions are a set of about eight symbols that tell a computer how to match a pattern. Used simply, they are easy to understand. Where people run into trouble is trying to use incredibly complex regular expressions where an actual parser would be better. Once you understand these eight symbols and the limitations of regular expressions, you'll see they aren't scary at all.

I'm going to have you do some more memorization to prime your brain for the discussion. The important symbols to memorize are:

  • ^: Anchor beginning of the string. This will match only if the match starts right at the beginning.
  • $: Anchor end of the string. This will match only if it goes to the end.
  • .: Any one char. Accept any single character input.
  • ?: Optional previous. The previous part of the regex is optional, so A? means an optional "A" character.
  • *: 0 or more previous. Take the previous part of the regex and accept it repeated any number of times, or skip over it. A* will accept "AAAAAAA", and it will also accept "BQEFT" since zero A characters is still a match.
  • +: 1 or more previous. Same as * but it only accepts if there is at least 1 of the previous part. A+ will accept "AAAAAAA" but not "BQEFT".
  • [X-Y]: Class (range) of chars from X-Y. Accepts any of the characters listed in the range from X to Y. Using [A-Z] is all capital English letters. There are \ shortcuts (backslash sequences such as \d for digits) for many common character ranges you can use instead of this.
  • (): Capture this part of the regular expression for later. Many regular expression libraries are used to also replace, extract, or alter text. A capture will take the part of the regex inside the (), and save it for later use. Many libraries then let you reference these captures. If you did ([A-Z]+) that would capture 1 or more capital English letters.
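
To make the eight symbols concrete, here is a minimal sketch that runs each one against a string it should accept and, where possible, one it should reject. The sample strings and the names EXAMPLES and check_examples are made up for illustration, and re.search is just one of several re functions you could use here.

```python
import re

# Each entry: (pattern, text, whether re.search should find a match).
# The sample strings are made up for illustration.
EXAMPLES = [
    (r"^A", "ABC", True),           # ^ anchors the match to the beginning
    (r"^A", "BAC", False),
    (r"C$", "ABC", True),           # $ anchors the match to the end
    (r"C$", "ACB", False),
    (r"A.C", "AxC", True),          # . accepts any one character
    (r"A.C", "AC", False),
    (r"^AB?C$", "AC", True),        # ? makes the previous "B" optional
    (r"^AB?C$", "ABC", True),
    (r"^A*$", "AAAAAAA", True),     # * accepts zero or more "A"
    (r"A*", "BQEFT", True),         # zero "A" characters still matches
    (r"A+", "AAAAAAA", True),       # + needs at least one "A"
    (r"A+", "BQEFT", False),
    (r"^[A-Z]+$", "HELLO", True),   # [A-Z] is any capital English letter
    (r"^[A-Z]+$", "Hello", False),
]


def check_examples(examples=EXAMPLES):
    """Return (pattern, text, matched) with what re.search actually found."""
    return [(p, t, re.search(p, t) is not None) for p, t, _ in examples]


# () captures part of the match so it can be pulled out afterwards.
capture = re.search(r"([A-Z]+)", "abc HELLO def")
captured_text = capture.group(1)

if __name__ == "__main__":
    for pattern, text, matched in check_examples():
        print(f"{pattern!r:12} {text!r:10} -> {matched}")
    print("captured:", captured_text)
```

Notice that A* "accepts" BQEFT only because matching zero characters counts as a match. That is why * is usually combined with anchors or other symbols.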

The Python re library lists many more symbols, but most of them are modifiers of these eight or extra features not commonly found in regular expression libraries. You'll start by creating flash cards for these eight, focusing on the short phrase that follows each symbol (anchor end, optional previous) so you can recall them quickly and explain what they do.

Once you've memorized these symbols, take the following regular expressions, translate them to English, and use the Python re library to try the listed strings or any other strings you can think of.

  • ".*BC?$": helloBC, helloB, helloA, helloBCX
  • "[A-Za-z][0-9]+": A1232344, abc1234, 12345, b493034
  • "^[0-9]?a*b?.$": 0aaaax, aaab9, 9x, 88aabb, 9zzzz
  • "A+B+C+[xyz]*": AAAABBCCCCCCxyxyz, ABBBBCCCxxxx, ABABABxxxx

Once you've translated them, try them out in the Python shell with the re module.
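
One way to do that is sketched below. It assumes re.match is the function you want, and try_regex is a hypothetical helper name rather than anything from the re module. It only prints what Python finds, so write down your English translation and your guess for each string before you run it.

```python
import re

# The patterns and sample strings come from the list above.
# try_regex is a hypothetical helper name, not part of the re module.
EXERCISES = {
    r".*BC?$": ["helloBC", "helloB", "helloA", "helloBCX"],
    r"[A-Za-z][0-9]+": ["A1232344", "abc1234", "12345", "b493034"],
    r"^[0-9]?a*b?.$": ["0aaaax", "aaab9", "9x", "88aabb", "9zzzz"],
    r"A+B+C+[xyz]*": ["AAAABBCCCCCCxyxyz", "ABBBBCCCxxxx", "ABABABxxxx"],
}


def try_regex(pattern, text):
    """Return the matched text, or None if the pattern does not match.

    re.match only looks for a match starting at the first character;
    swap in re.search to allow a match anywhere in the string.
    """
    found = re.match(pattern, text)
    return found.group(0) if found else None


if __name__ == "__main__":
    for pattern, samples in EXERCISES.items():
        print(pattern)
        for text in samples:
            print(f"    {text!r:22} -> {try_regex(pattern, text)!r}")
```

You can also type the same calls one at a time at the >>> prompt, for example import re followed by re.match(r".*BC?$", "helloB"), and inspect what comes back. If a result surprises you, try the same string with re.search and work out why the two differ.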
