Regex 2021.4.4

Finding domain names

Sometimes you need to find phone numbers, e-mail addresses, or domain names in a large text document.

Take the following text as an example:

<div class="reflist" style="list-style-type: decimal;"><ol class="references"><li id="cite_note-1"><span class="mw-cite-backlink"><b>^ (http://www.askoxford.com/concise_oed/train?view=uk). <i>(definition – Compact OED)</i>. Oxford University Press<span class="reference-accessdate">. Retrieved 2008-03-18</span>.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.atitle=Train+%28noun%29&rft.genre=article&rft_id=http%3A%2F%2Fwww.askoxford.com%2Fconcise_oed%2Ftrain%3Fview%3Duk&rft.jtitle=%28definition+%E2%80%93+Compact+OED%29&rft.pub=Oxford+University+Press&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal" class="Z3988"><span style="display:none;"> </span></span></span></li><li id="cite_note-2"><span class="mw-cite-backlink"><b>^</b></span> <span class="reference-text"><span class="citation book">Atchison, Topeka and Santa Fe Railway (1948). <i>Rules: Operating Department</i>. p. 7.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.au=Atchison%2C+Topeka+and+Santa+Fe+Railway&rft.aulast=Atchison%2C+Topeka+and+Santa+Fe+Railway&rft.btitle=Rules%3A+Operating+Department&rft.date=1948&rft.genre=book&rft.pages=7&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook" class="Z3988"><span style="display:none;"> </span></span></span></li><li id="cite_note-3"><span class="mw-cite-backlink"><b>^ (http://www.hydrogencarsnow.com/blog2/index.php/hydrogen-vehicles/i-hear-the-hydrogen-train-a-comin-its-rolling-round-the-bend/)</span></li><li id="cite_note-4"><span class="mw-cite-backlink"><b>^ (http://www.bnsf.com/media/news/articles/2008/01/2008-01-09a.html)</span></li><li id="cite_note-5"><span class="mw-cite-backlink"><b>^</b></span> <span class="reference-text"><span class="citation book">Central Japan Railway (2006). <i>Central Japan Railway Data Book 2006</i>. p. 
16.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.au=Central+Japan+Railway&rft.aulast=Central+Japan+Railway&rft.btitle=Central+Japan+Railway+Data+Book+2006&rft.date=2006&rft.genre=book&rft.pages=16&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook" class="Z3988"><span style="display:none;"> </span></span></span></li><li id="cite_note-6"><span class="mw-cite-backlink"><b>^ (http://web.archive.org/web/20080620033027/http://www.mrvc.indianrail.gov.in/overview.htm). _Official webpage of Mumbai Railway Vikas Corporation_. Archived from (http://www.mrvc.indianrail.gov.in/overview.htm) on 2008-06-20<span class="reference-accessdate">. Retrieved 2008-12-11</span>.</span><span title="ctx_ver=Z39.88-2004&rfr_id=info%3Asid%2Fen.wikipedia.org%3ATrain&rft.atitle=Overview+Of+the+existing+Mumbai+Suburban+Railway&rft.genre=article&rft_id=http%3A%2F%2Fwww.mrvc.indianrail.gov.in%2Foverview.htm&rft.jtitle=Official+webpage+of+Mumbai+Railway+Vikas+Corporation&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal" class="Z3988"><span style="display:none;"> </span></span></span></li></ol></div>

The task is to find all the primary domains in the text.

How can this be done?

import re

# string holds the HTML text shown above
match = re.findall(r'http(s:|:)\/\/(www.|ww2.|)([0-9a-zA-Z.-]*\.\w{2,3})', string)
for elem in match:
    print(elem)

Output:

(':', 'www.', 'askoxford.com')
(':', 'www.', 'hydrogencarsnow.com')
(':', 'www.', 'bnsf.com')
(':', '', 'web.archive.org')
(':', 'www.', 'mrvc.indianrail.gov.in')
(':', 'www.', 'mrvc.indianrail.gov.in')

Here | is the OR operator, and findall returns tuples holding the text matched by each pattern inside ().

Search and Replace


One of the most important re methods that use regular expressions is sub.

Syntax

re.sub(pattern, repl, string, count=0, flags=0)

This method replaces all occurrences of the RE pattern in string with repl, substituting all occurrences unless count is provided. This method returns the modified string.

Example

#!/usr/bin/python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)

When the above code is executed, it produces the following result −

Phone Num :  2004-959-559
Phone Num :  2004959559
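In current Python versions the optional limit on the number of replacements is the count argument of re.sub. A minimal sketch of its effect, using a made-up phone string:

```python
import re

# Hypothetical sample string, chosen for illustration
phone = "2004-959-559"

# count=1 replaces only the first hyphen; the default count=0 replaces all
first_only = re.sub(r"-", "", phone, count=1)
all_hyphens = re.sub(r"-", "", phone)

print(first_only)   # 2004959-559
print(all_hyphens)  # 2004959559
```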

re.search


The re.search() method takes a regular expression pattern and a string and
searches for that pattern within the string.

The syntax is re.search(pattern, string).

where:

pattern
the regular expression to be matched.

string
the string that is searched for the pattern; the match may occur anywhere in it.

It searches for the first occurrence of the RE pattern within string, with optional flags.

If the search is successful, search() returns a match object or None otherwise.

Therefore, the search is usually immediately followed by an if-statement to test
if the search succeeded.

It is common to use the ‘r’ at the start of the pattern string, that designates
a python “raw” string which passes through backslashes without change which is
very handy for regular expressions.

This example searches for the pattern ‘word:’ followed by a 3 letter word.

The code match = re.search(pat, str) stores the search result in a variable
named “match”.

Then the if-statement tests the match, if true the search succeeded and
match.group() is the matching text (e.g. ‘word:cat’).

If the match is false, the search did not succeed, and there is no matching text.
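The search-then-test idiom described above can be sketched as follows. The sample text is an assumption made for illustration; the pattern is 'word:' followed by three word characters.

```python
import re

text = "an example word:cat!!"  # sample input, assumed for illustration

# 'word:' followed by exactly three word characters
match = re.search(r"word:\w\w\w", text)

if match:
    print("found", match.group())  # found word:cat
else:
    print("did not find")
```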

In this example the | operator is used to search for any one of several alternative patterns.

The output of the script is:

Python FOUND
Perl FOUND
PHP FOUND
C++ NOT FOUND
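The script itself is not shown here. One possible script that prints exactly these four lines is sketched below; the list of names and the pattern are assumptions, not the original code.

```python
import re

languages = ["Python", "Perl", "PHP", "C++"]

# Assumed pattern: a name that starts with B or P, or ends with i or H
pattern = r"^B|^P|i$|H$"

results = []
for lang in languages:
    status = "FOUND" if re.search(pattern, lang) else "NOT FOUND"
    results.append((lang, status))
    print(lang, status)
```

Each alternative separated by | carries its own anchor, so a name is reported as found as soon as any one of them matches.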

Sets


A set is a set of characters inside a pair of square brackets with a special meaning:

Set	Description
[arn]	Returns a match where one of the specified characters (a, r, or n) is present
[a-n]	Returns a match for any lower case character, alphabetically between a and n
[^arn]	Returns a match for any character EXCEPT a, r, and n
[0123]	Returns a match where any of the specified digits (0, 1, 2, or 3) is present
[0-9]	Returns a match for any digit between 0 and 9
[0-5][0-9]	Returns a match for any two-digit number from 00 to 59
[a-zA-Z]	Returns a match for any character alphabetically between a and z, lower case OR upper case
[+]	In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
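A few sets in action, as a minimal sketch. The sample strings and the specific sets are chosen here for illustration.

```python
import re

text = "rain 59"  # sample input, assumed for illustration

in_set = re.findall(r"[arn]", text)            # ['r', 'a', 'n']
not_in_set = re.findall(r"[^arn]", text)       # ['i', ' ', '5', '9']
digits = re.findall(r"[0-9]", text)            # ['5', '9']
two_digits = re.findall(r"[0-5][0-9]", text)   # ['59']
literal_plus = re.findall(r"[+]", "1+1")       # ['+']

print(in_set, not_in_set, digits, two_digits, literal_plus)
```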

The trouble with backslashes

Backslashes complicate life a little in the world of Python regular expressions. Regular expressions use the backslash to introduce special forms, or to match a character literally instead of invoking its special meaning. Take the dollar sign $: if we do not put a backslash in front of it, we simply create an anchor. The problem arises because Python uses the backslash for the same purpose in string literals.

Suppose you need to find a string like this: "\python". To search for it with a regular expression you have to escape the backslash, and because Python string literals also use the backslash, each of those backslashes has to be escaped again, so you end up with the search pattern "\\\\python" (without the quotes). Fortunately, Python supports raw strings, written by putting the letter r in front of the string. So the pattern becomes more readable as r"\\python". If you need to find something that contains a backslash, make sure you use raw strings, otherwise you may match something quite different from what you intended.
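A short sketch of the two spellings side by side. The sample text is an assumption; both patterns describe one literal backslash followed by python.

```python
import re

text = r"look for \python here"  # the text contains one literal backslash

# Ordinary string: the backslash is escaped once for Python and once for the regex
plain = re.search("\\\\python", text)
# Raw string: only the regex-level escaping remains
raw = re.search(r"\\python", text)

print(plain.group())  # \python
print(raw.group())    # \python
```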

Matching characters

When you need to find a character in a string, in most cases you can simply use that character or string. So to check for the word "dog", we use the letters in dog. There are, of course, certain characters that are reserved by regular expressions. They are known as metacharacters. Below is the complete list of metacharacters supported by Python regular expressions:

. ^ $ * + ? { } [ ] \ | ( )

Let's look at how they work. The main pair of metacharacters you will run into is the square brackets: [ and ]. They are used to create a "character class", which is a set of characters that you want to match. You can list the characters individually, like this: [xyz]. This matches any one of the characters listed in the brackets. You can also use a dash to express a range of characters: [a-g]. In this example we match one of the letters in the range a through g. To actually perform a search, we need to add a starting character and an ending character to look for. To make this easier we can use the asterisk. Instead of matching a literal *, this character tells the regular expression that the preceding item may be matched zero or more times. An example makes this clearer:

a[b-f]*f

This pattern says that we are looking for the letter a, then zero or more letters from our class [b-f], and that the match must end with f. Let's use this expression in Python:

import re

text = 'abcdfghijk'

# search needs both the pattern and the text to scan
parser = re.search('a[b-f]*f', text)
print(parser.group())  # 'abcdf'

In short, this expression scans the whole string passed to it, in this case abcdfghijk. It finds our letter "a" at the start of the search. Then, because the character class is followed by an asterisk, the expression reads the rest of the string to see whether it matches. If it does not, the expression backtracks one character at a time, trying to find a match. All the magic happens when we call the search function of the re module. If no match is found, we get None. Otherwise we get a Match object. To see what the match looks like, call its group method. There is another repetition metacharacter similar to *. It is +, which matches one or more times. At first glance the difference from *, which matches zero or more times, seems minor, but + requires at least one occurrence of the item it follows.

The last two repetition metacharacters work a little differently. Consider the question mark ?, which is used like this: "co-?op". It matches both "coop" and "co-op". The final repetition metacharacter is {a,b}, where a and b are decimal integers. It means there must be at least a repetitions and at most b. You can try something like this:

xb{1,4}z

This is a very simple example, but it says that we match xbz, xbbz, xbbbz, and xbbbbz, but not xz, because it contains no "b".
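The two patterns above, xb{1,4}z and co-?op, can be checked with a short sketch. The candidate lists are made up for illustration, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["xz", "xbz", "xbbbbz", "xbbbbbz"]

# {1,4}: at least one and at most four repetitions of b
matched = [s for s in candidates if re.fullmatch(r"xb{1,4}z", s)]
print(matched)  # ['xbz', 'xbbbbz']

# ?: the hyphen may appear zero times or once
optional = [s for s in ["coop", "co-op", "co--op"] if re.fullmatch(r"co-?op", s)]
print(optional)  # ['coop', 'co-op']
```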

The next metacharacter is ^. It lets us match characters that are not listed in our class; in other words, it complements the class. This works only when ^ is placed as the first character inside the class. Anywhere else inside the class it matches a literal ^. An example: [^a]. This expression matches any character except "a". Outside a class, ^ is used as an anchor that matches at the start of the string.

There is a corresponding anchor for the end of the string: $. We have spent a lot of time introducing the concepts behind regular expressions. The following sections go into more detailed code examples.
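The different roles of ^ and the $ anchor can be sketched like this. The sample strings are assumptions chosen for illustration.

```python
import re

text = "cat bat\nrat"

no_a = re.findall(r"[^a]", "banana")   # complemented class: ['b', 'n', 'n']
starts = re.search(r"^cat", text)      # ^ as an anchor: start of the string
ends = re.search(r"rat$", text)        # $ as an anchor: end of the string
literal = re.search(r"\^", "2^3")      # escaped ^ matches a literal caret

print(no_a, starts.span(), ends.group(), literal.span())
```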

Compiling

The re module lets you "compile" an expression that you search for frequently. This turns the expression into a pattern object (shown as SRE_Pattern in older Python versions and as re.Pattern in current ones). You can then use this object in your search calls. Let's take code in the style of the previous example and change it to use compilation:

import re

text = "The ants go marching one by one"

strings = ['the', 'one']

for string in strings:
    regex = re.compile(string)
    match = re.search(regex, text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        text_pos = match.span()
        print(text[match.start():match.end()])
    else:
        print('Did not find "{}"'.format(string))

Note that here we create a pattern object by calling compile on each string in our list and assign the result to the variable regex. We then pass that object to the search function.

The rest of the code stays the same. The main reason to compile is to keep the expression for reuse later in your code. Compile also accepts flags, which can be used to enable various special features. We will look at them next. Note: the re module automatically caches the patterns it compiles, so if you do not use regular expressions heavily in your code, you do not have to store the compiled object in a variable.
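As a small sketch of compiling with a flag, the pattern below is compiled once with re.IGNORECASE and then reused. The choice of flag is an illustration, not something the example above uses.

```python
import re

text = "The ants go marching one by one"

# re.IGNORECASE lets the lowercase pattern 'the' match 'The'
regex = re.compile(r"the", re.IGNORECASE)

match = regex.search(text)
print(match.group())        # The
print(regex.findall(text))  # ['The']

# A compiled pattern can be reused for several calls
one = re.compile(r"one")
one_count = len(one.findall(text))
print(one_count)            # 2
```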

PAN numbers

In India, PAN numbers are used for tax identification, much as SSNs are in the USA. The basic validity rule for a PAN is that all letters must be uppercase and the characters must appear in the following order:

<char><char><char><char><char><digit><digit><digit><digit><char>

Question:

Is 'ABcDE1234L' a valid PAN?

How would you solve this without regular expressions? Probably with a for loop and an index running over the string. With regular expressions it is simpler:

import re

match = re.search(r'[A-Z]{5}[0-9]{4}[A-Z]', 'ABcDE1234L')
if match:
    print(True)
else:
    print(False)

Output:

False
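One caveat: re.search also accepts a valid PAN embedded in a longer string. A stricter check, sketched here with a hypothetical helper named is_valid_pan, uses fullmatch so that the whole string must have the layout of five uppercase letters, four digits, and one uppercase letter.

```python
import re

# Five uppercase letters, four digits, one uppercase letter
PAN_PATTERN = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")

def is_valid_pan(candidate):
    """Return True if the whole string has the PAN layout (hypothetical helper)."""
    return PAN_PATTERN.fullmatch(candidate) is not None

print(is_valid_pan("ABcDE1234L"))    # False: lowercase c
print(is_valid_pan("ABCDE1234L"))    # True
print(is_valid_pan("xABCDE1234Lx"))  # False: extra characters around it
```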

Using the Python re module

Let's look at some common uses of Python's re module. It is a built-in module, so there is nothing to install.

1.1) re.search()

re.search(pat, str) is used to search for the sequence pat, which is a regular expression, inside str (the search string), and returns a match if the pattern is found.

Let's look at an example:

import re

str = 'This is a sample text which we use to search a pattern within the text.'

pat = r'text'

match = re.search(pat, str)

if match is None:
    print('Pattern not found')
else:
    print('Pattern found!')
    print('Match object', match)

Output

Pattern found!
Match object <re.Match object; span=(17, 21), match='text'>

As you can see, the output shows that the pattern does match: we searched for the simple word text in str, and span gives the indices of the match. That is, str[17] through str[20] matched, which corresponds to the substring text, as expected. But this gives only the first match.

1.2) re.findall()

To get a list of all matched strings, we use re.findall(pat, str), which returns a list of all the matching strings (the list may be empty).

>>> matches = re.findall(pat, str)
>>> print(matches)

This is an extremely powerful function for extracting patterns, and it can be used on anything, for example to search inside files.

import re
with open('text.txt', 'r') as f:
    matches = re.findall(r'pattern', f.read())
print(matches)

Zero-width metacharacters in regular expressions

Some of the remaining metacharacters are zero-width assertions. They do not cause the regular expression engine to advance through the string. They consume no characters at all; they simply succeed or fail depending on whether a given condition holds at the current position.

For example, \b is an assertion that the current position is at a word boundary. The position is not changed by \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can be matched there an infinite number of times.

- Metacharacter |:

The | metacharacter denotes alternation, or the "or" operator. If A and B are regular expressions, then A|B matches any string that matches either A or B. | has very low precedence, so that it works sensibly when you alternate multi-character strings. Crow|Servo matches either Crow or Servo, not Cro, then a w or an S, then ervo.

To match a literal |, use \| or enclose it in a character class, as in [|].

- Metacharacter ^:

The ^ metacharacter matches at the beginning of lines. Unless the MULTILINE flag is set, it matches only at the beginning of the string. In MULTILINE mode it also matches immediately after each newline within the string.

For example, to match the word From only at the beginning of a line, use the pattern ^From.

>>> print(re.search('^From', 'From Here to Eternity'))  
# <_sre.SRE_Match object; span=(0, 4), match='From'>
>>> print(re.search('^From', 'Reciting From Memory'))
# None

- Metacharacter $:

The $ metacharacter matches at the end of a line, which is defined as either the end of the string or any location followed by a newline character.

>>> print(re.search('}$', '{block}'))  
# <_sre.SRE_Match object; span=(6, 7), match='}'>
>>> print(re.search('}$', '{block} '))
# None
>>> print(re.search('}$', '{block}\n'))  
# <_sre.SRE_Match object; span=(6, 7), match='}'>

To match a literal $, use \$ or enclose it in a character class, as in [$].

- Metacharacter \A:

The \A metacharacter matches only at the start of the string. When MULTILINE mode is off, \A and ^ are effectively the same. In MULTILINE mode they differ: \A still matches only at the beginning of the string, while ^ can match at any location inside the string that follows a newline character.

- Metacharacter \Z:

Matches only at the end of the string.
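The difference between the line anchors ^ and $ and the string anchors \A and \Z shows up only in MULTILINE mode. A minimal sketch with a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# In MULTILINE mode ^ and $ match at every line; \A and \Z still match once
caret = re.findall(r"^\w+", text, re.MULTILINE)         # ['first', 'second']
only_start = re.findall(r"\A\w+", text, re.MULTILINE)   # ['first']
dollar = re.findall(r"\w+$", text, re.MULTILINE)        # ['line', 'line']
only_end = re.findall(r"\w+\Z", text, re.MULTILINE)     # ['line']

print(caret, only_start, dollar, only_end)
```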

- Metacharacter \b:

The \b metacharacter denotes a word boundary. It is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character.

The following example matches class only when it is a complete word. It does not match when class is contained inside another word.

>>> p = re.compile(r'\bclass\b')
>>> print(p.search('no class at all'))  
# <_sre.SRE_Match object; span=(3, 8), match='class'>
>>> print(p.search('the declassified algorithm'))
# None
>>> print(p.search('one subclass is'))
# None

There are two subtleties to remember when using this special sequence. First, this is the worst collision between Python's string literals and regular expression sequences. In Python's string literals, \b is the backspace character, ASCII value 8. If you do not use raw strings, Python converts \b to a backspace and your regular expression will not match as you expect. The following example looks the same as the previous RegExp but omits the 'r' in front of the pattern string.

>>> p = re.compile('\bclass\b')
>>> print(p.search('no class at all'))
# None
>>> print(p.search('\b' + 'class' + '\b'))  
# <_sre.SRE_Match object; span=(0, 7), match='\x08class\x08'>

Second, inside a character class, where this assertion has no use, \b represents the backspace character, for compatibility with Python's string literals.

- Metacharacter \B:

The \B metacharacter is the opposite of \b. It matches only when the current position is not at a word boundary.
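A brief sketch contrasting \b and \B; the sample sentence is assembled here for illustration.

```python
import re

text = "no class at all, one subclass is declassified"

# \b on both sides: class only as a complete word
whole_word = re.findall(r"\bclass\b", text)
# \B in front: class only when a word character comes right before it
inside_word = re.findall(r"\Bclass", text)

print(whole_word)   # ['class']
print(inside_word)  # ['class', 'class']
```

The two inside_word hits come from subclass and declassified; the standalone word is skipped because a boundary sits in front of it.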



Elements that define a regular expression?

Anchors	Description
\A	restricts the match to the start of string
\Z	restricts the match to the end of string
^	restricts the match to the start of line
$	restricts the match to the end of line
\n	newline character is used as line separator
re.MULTILINE or re.M	flag to treat input as multiline string
\b	restricts the match to the start/end of words
	word characters: alphabets, digits, underscore
\B	matches wherever \b doesn't match

^, $ and \ are metacharacters in the above table, as these characters have special meaning. Prefix a \ character to remove the special meaning and match such characters literally. For example, \^ will match a ^ character instead of acting as an anchor.

Feature	Description
|	multiple RE combined as conditional OR
	each alternative can have independent anchors
(pat)	group pattern(s), also a capturing group
a(b|c)d	is same as abd|acd
(?:pat)	non-capturing group
(?P<name>pat)	named capture group
.	Match any character except the newline character
[]	Character class, matches one character among many

Greedy Quantifiers	Description
*	Match zero or more times
+	Match one or more times
?	Match zero or one times
{m,n}	Match m to n times (inclusive)
{m,}	Match at least m times
{,n}	Match up to n times (including 0 times)
{n}	Match exactly n times
pat1.*pat2	any number of characters between pat1 and pat2
pat1.*pat2|pat2.*pat1	match both pat1 and pat2 in any order

Greedy here means that the above quantifiers will match as much as possible while still honoring the overall RE. Appending a ? to greedy quantifiers makes them non-greedy, i.e. they match as little as possible. Quantifiers can be applied to literal characters, groups, backreferences and character classes.
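The effect of appending ? to a quantifier can be sketched on a made-up snippet of markup:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string
greedy = re.findall(r"<.*>", text)
# Non-greedy: .*? stops at the first > it can
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```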

Character class	Description
[aeiou]	Match any vowel
[^aeiou]	^ inverts selection, so this matches any character other than a vowel
[a-f]	- defines a range, so this matches any of abcdef characters
\d	Match a digit, same as [0-9]
\D	Match non-digit, same as [^0-9] or [^\d]
\w	Match word character, same as [a-zA-Z0-9_]
\W	Match non-word character, same as [^a-zA-Z0-9_] or [^\w]
\s	Match whitespace character, same as [\ \t\n\r\f\v]
\S	Match non-whitespace character, same as [^\ \t\n\r\f\v] or [^\s]

Lookarounds	Description
lookarounds	custom assertions, zero-width like anchors
(?!pat)	negative lookahead assertion
(?<!pat)	negative lookbehind assertion
(?=pat)	positive lookahead assertion
(?<=pat)	positive lookbehind assertion
(?!pat1)(?=pat2)	multiple assertions can be specified in any order
	as they mark a matching location without consuming characters
((?!pat).)*	Negate a grouping, similar to negated character class
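Lookarounds are easiest to see on a small example. The sketch below uses a made-up string; each assertion checks what is next to the match without including it in the result.

```python
import re

text = "price: $100, cost: $25, count: 7"

# Positive lookbehind: digits preceded by a dollar sign
dollars = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: digits not preceded by $ or by another digit
plain = re.findall(r"(?<![\$\d])\d+", text)
# Positive lookahead: words followed by a colon
labels = re.findall(r"\w+(?=:)", text)

print(dollars)  # ['100', '25']
print(plain)    # ['7']
print(labels)   # ['price', 'cost', 'count']
```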

Flags	Description
re.IGNORECASE or re.I	flag to ignore case
re.DOTALL or re.S	allow . metacharacter to match newline character
flag1|flag2	multiple flags can be combined using | operator
re.MULTILINE or re.M	allow ^ and $ anchors to match line wise
re.VERBOSE or re.X	allows to use literal whitespaces for aligning purposes
	and to add comments after the # character
	escape spaces and # if needed as part of actual RE
re.ASCII or re.A	match only ASCII characters for \b, \w, \d, \s
	and their opposites, applicable only for Unicode patterns
re.LOCALE or re.L	use locale settings for byte patterns and 8-bit locales
(?#comment)	another way to add comments, not a flag
(?flags:pat)	inline flags only for this pat, overrides flags argument
	flags is i for re.I, s for re.S, etc, except L for re.L
(?-flags:pat)	negate flags only for this pat
(?flags-flags:pat)	apply and negate particular flags only for this pat
(?flags)	apply flags for whole RE, can be used only at start of RE
	anchors if any, should be specified after (?flags)
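A sketch of three of these features together: re.VERBOSE with comments, flags combined with |, and an inline flag scoped to one group. The date pattern and the sample strings are assumptions chosen for illustration.

```python
import re

# re.VERBOSE allows whitespace and # comments inside the pattern;
# several flags are combined with the | operator
date_re = re.compile(
    r"""
    (\d{4})  # year
    -
    (\d{2})  # month
    -
    (\d{2})  # day
    """,
    re.VERBOSE | re.IGNORECASE,
)

match = date_re.search("Retrieved 2008-03-18.")
print(match.groups())  # ('2008', '03', '18')

# Inline flag scoped to one group: only 'python' ignores case
scoped = re.findall(r"(?i:python) rocks", "PYTHON rocks, python ROCKS")
print(scoped)  # ['PYTHON rocks']
```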

Matched portion	Description
re.Match object	details like matched portions, location, etc
m[0] or m.group(0)	entire matched portion of re.Match object m
m[n] or m.group(n)	matched portion of nth capture group
m.groups()	tuple of all the capture groups’ matched portions
m.span()	start and end+1 index of entire matched portion
	pass a number to get span of that particular capture group
	can also use m.start() and m.end()
\N	backreference, gives matched portion of Nth capture group
	applies to both search and replacement sections
	possible values: \1, \2 up to \99 provided no more digits
\g<N>	backreference, gives matched portion of Nth capture group
	possible values: \g<0>, \g<1>, etc (not limited to 99)
	\g<0> refers to entire matched portion
(?P<name>pat)	named capture group
	refer as 'name' in re.Match object
	refer as (?P=name) in search section
	refer as \g<name> in replacement section
groupdict	method applied on a re.Match object
	gives named capture group portions as a dict
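The match-object accessors and the two backreference forms can be sketched with the standard re module. The group names and sample strings here are assumptions.

```python
import re

m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "Retrieved 2008-03-18")

print(m[0])           # 2008-03
print(m.group(1))     # 2008
print(m["month"])     # 03
print(m.groups())     # ('2008', '03')
print(m.span())       # (10, 17)
print(m.groupdict())  # {'year': '2008', 'month': '03'}

# \1 in the search section refers back to what group 1 matched
doubled = re.findall(r"\b(\w+) \1\b", "one one two three three")
# \g<name> in the replacement section inserts a named group
swapped = re.sub(r"(?P<a>\w+)-(?P<b>\w+)", r"\g<b>-\g<a>", "left-right")

print(doubled)  # ['one', 'three']
print(swapped)  # right-left
```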

Notes on named capture groups

All capture groups have a group number, starting from 1.

Groups with the same group name will have the same group number, and groups with a different group name will have a different group number.

The same name can be used by more than one group, with later captures ‘overwriting’ earlier captures. All of the captures of the group will be available from the captures method of the match object (a feature of the third-party regex module, not of the standard re module).

Group numbers will be reused across different branches of a branch reset (also specific to the regex module), e.g. (?|(first)|(second)) has only group 1. If capture groups have different group names then they will, of course, have different group numbers, e.g. (?|(?P<foo>first)|(?P<bar>second)) has group 1 (“foo”) and group 2 (“bar”).

In the regex (\s+)(?|(?P<foo>[A-Z]+)|(\w+) (?P<foo>[0-9]+) there are 2 groups:

(\s+) is group 1.
(?P<foo>[A-Z]+) is group 2, also called “foo”.
(\w+) is group 2 because of the branch reset.
(?P<foo>[0-9]+) is group 2 because it’s called “foo”.


Regular Expression Patterns

Except for control characters, (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a control character by preceding it with a backslash.

Following table lists the regular expression syntax that is available in Python −

Sr.No.	Pattern & Description
1	^ Matches beginning of line.
2	$ Matches end of line.
3	. Matches any single character except newline. Using the s (DOTALL) option allows it to match newline as well.
4	[...] Matches any single character in brackets.
5	[^...] Matches any single character not in brackets.
6	re* Matches 0 or more occurrences of preceding expression.
7	re+ Matches 1 or more occurrences of preceding expression.
8	re? Matches 0 or 1 occurrence of preceding expression.
9	re{n} Matches exactly n occurrences of preceding expression.
10	re{n,} Matches n or more occurrences of preceding expression.
11	re{n,m} Matches at least n and at most m occurrences of preceding expression.
12	a|b Matches either a or b.
13	(re) Groups regular expressions and remembers matched text.
14	(?imx) Turns on the i, m, or x options for the whole expression; place it at the start of the pattern.
15	(?-imx) Turns off the i, m, or x options in some other regex flavors. Python's re accepts option removal only in the scoped form shown in row 18.
16	(?:re) Groups regular expressions without remembering matched text.
17	(?imx:re) Temporarily toggles on i, m, or x options within parentheses.
18	(?-imx:re) Temporarily toggles off i, m, or x options within parentheses.
19	(?#…) Comment.
20	(?=re) Positive lookahead: asserts that re matches at this position without consuming characters.
21	(?!re) Negative lookahead: asserts that re does not match at this position without consuming characters.
22	(?>re) Matches independent pattern without backtracking (atomic group; supported by re from Python 3.11).
23	\w Matches word characters.
24	\W Matches nonword characters.
25	\s Matches whitespace. Equivalent to [ \t\n\r\f\v] for ASCII text.
26	\S Matches nonwhitespace.
27	\d Matches digits. Equivalent to [0-9] for ASCII text.
28	\D Matches nondigits.
29	\A Matches beginning of string.
30	\Z Matches only at the end of the string.
31	\z Matches end of string in other regex flavors; in Python's re, use \Z.
32	\G Matches point where last match finished (not supported by Python's re; available in the third-party regex module).
33	\b Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
34	\B Matches nonword boundaries.
35	\n, \t, etc. Matches newlines, carriage returns, tabs, etc.
36	\1…\9 Matches nth grouped subexpression.
37	\10 Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.