- Denoted by `[]`
- `[ae]` -> matches a single character, either an `a` or an `e` (one character, not several):
  - e.g. `gr[ae]y` will match `grey` and `gray`
- `[a-z]` -> a hyphen specifies a range of characters:
  - e.g. `[0-9]` matches a single digit
  - e.g. `[a-f]` matches a single char in a-f
- `[a-z0-9A-F]` -> several ranges can be combined in one class
- `q[^x]` -> a caret right after the opening bracket negates the character class:
  - e.g. `q[^x]` matches `qu` in `question`
  - the caret must be written at the front of the class/set; anywhere else it is a literal `^`
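A minimal sketch of the character-class rules, using Python's built-in `re` module (the sample strings are made up for illustration):

```python
import re

# [ae] consumes exactly one character: 'a' or 'e'.
spellings = re.findall(r"gr[ae]y", "grey and gray, but not graey")
print(spellings)  # ['grey', 'gray']

# Several ranges combined in one class: any single hex digit.
hex_digits = re.findall(r"[0-9a-fA-F]", "0x1F")
print(hex_digits)  # ['0', '1', 'F']

# Negated class: 'q' followed by any one character that is not 'x'.
negated = re.findall(r"q[^x]", "question")
print(negated)  # ['qu']

# A negated class still has to consume a character, so a trailing 'q' fails.
trailing = re.search(r"q[^x]", "Iraq")
print(trailing)  # None
```

The last case is easy to trip over: a negated class means "one character that is not in the set", not "no character at all".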
Shorthand Character Classes
- `\d` matches a single digit == `[0-9]`
- `\w` matches a "word character" (alphanumeric characters plus underscore)
- `\s` matches a whitespace character (includes tabs and line breaks)
- exactly which characters each shorthand class covers depends on the software (regex flavor)
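To see both the shorthands and the "depends on the software" caveat in action, here is a small sketch in Python. It assumes Python 3's `re` module, where `\d` on a `str` pattern also matches non-ASCII decimal digits unless `re.ASCII` is set. Other flavors may behave differently.

```python
import re

text = "id_42\tok\n"
digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)
spaces = re.findall(r"\s", text)
print(digits, words, spaces)  # ['4', '2'] ['id_42', 'ok'] ['\t', '\n']

# Flavor dependence: U+0663 is ARABIC-INDIC DIGIT THREE.
arabic_three = "\u0663"
unicode_hit = re.findall(r"\d", arabic_three)                  # matched by default
ascii_hit = re.findall(r"\d", arabic_three, flags=re.ASCII)    # restricted to [0-9]
print(len(unicode_hit), len(ascii_hit))  # 1 0
```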
Non-Printable Characters
- `\t` == tab
- `\r` == carriage return
- `\n` == new line
- `\a` == bell
- `\e` == escape
- `\f` == form feed
- `\v` == vertical tab
- `\r\n` == Windows new line
- Unicode:
  - `\uFFFF` or `\x{FFFF}`
Dot
- `.` matches any single char except a line break (depends on software and mode: in single-line mode it also matches line breaks)
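A quick sketch of the mode dependence, assuming Python's `re` module, where single-line mode is spelled `re.DOTALL`:

```python
import re

text = "ab\ncd"

# Default: the dot stops at the line break, so each line matches separately.
default_mode = re.findall(r".+", text)
print(default_mode)  # ['ab', 'cd']

# Single-line mode (re.DOTALL): the dot also matches '\n'.
single_line = re.findall(r".+", text, flags=re.DOTALL)
print(single_line)  # ['ab\ncd']
```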
Anchors
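- `^` matches at the start of the string, or after a line break in multiline mode: `^g` for `gary` will match `g`
- `$` matches at the end of the string, or before a line break in multiline mode: `y$` for `gary` will match `y`
- `\b` matches at a position that is called a "word boundary". This match is zero-length: `\b5\b` will match only the standalone `5` in `5 555 55`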
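The anchor examples can be checked with a short sketch in Python (the `gary\nmary` string is an added illustration, and `re.MULTILINE` is Python's name for multiline mode):

```python
import re

start = re.findall(r"^g", "gary")
end = re.findall(r"y$", "gary")
print(start, end)  # ['g'] ['y']

# Without MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^\w", "gary\nmary")
each_line = re.findall(r"^\w", "gary\nmary", flags=re.MULTILINE)
print(first_only, each_line)  # ['g'] ['g', 'm']

# \b is zero-length: only the standalone 5 has a boundary on both sides.
positions = [m.start() for m in re.finditer(r"\b5\b", "5 555 55")]
print(positions)  # [0]
```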
Alternation
- `|` separates alternatives: `cat|dog` matches `cat` and `dog` in `About cats and dogs`
- `(|)` can group alternatives:
  - in `cat food or dog food`, `(cat|dog) food` will match both `cat food` and `dog food`
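A small sketch of why the grouping matters, in Python. Note that `re.findall` returns only the group contents when the pattern has a capturing group, so this uses `finditer` and `group(0)` to get the whole match:

```python
import re

plain = re.findall(r"cat|dog", "About cats and dogs")
print(plain)  # ['cat', 'dog']

sentence = "cat food or dog food"

# Without parentheses the alternatives are 'cat' and 'dog food'.
ungrouped = re.findall(r"cat|dog food", sentence)
print(ungrouped)  # ['cat', 'dog food']

# With parentheses, ' food' applies to both alternatives.
grouped = [m.group(0) for m in re.finditer(r"(cat|dog) food", sentence)]
print(grouped)  # ['cat food', 'dog food']
```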
Repetition
- `?` means the preceding token in the regular expression is optional:
  - for `carl and carol`, `caro?l` will match both `carl` and `carol`
- `*` means the preceding token in the regular expression appears 0 or more times:
  - for `carl, carol and carool`, `caro*l` will match all of `carl`, `carol` and `carool`
- `+` means the preceding token in the regular expression appears 1 or more times:
  - for `carl, carol and carool`, `caro+l` will match both `carol` and `carool`
  - use case: matching HTML tags without any attributes:
    - `<[A-Za-z0-9]+>` is a bad implementation, because `<1>` will be matched
    - `<[A-Za-z][A-Za-z0-9]*>` is a good implementation
- `{}` specifies a specific amount of repetition:
  - `\b[1-9][0-9]{3}\b` matches a number between 1000 and 9999
  - `\b[1-9][0-9]{2,4}\b` matches a number between 100 and 99999
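The quantifier examples above, run as a sketch in Python (the numeric test strings and the `<h1>` tag are added for illustration):

```python
import re

names = "carl, carol and carool"
optional = re.findall(r"caro?l", names)
zero_or_more = re.findall(r"caro*l", names)
one_or_more = re.findall(r"caro+l", names)
print(optional)      # ['carl', 'carol']
print(zero_or_more)  # ['carl', 'carol', 'carool']
print(one_or_more)   # ['carol', 'carool']

# HTML tag names: requiring a leading letter rejects <1>.
tags = "<1> <b> <h1>"
loose = re.findall(r"<[A-Za-z0-9]+>", tags)
strict = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", tags)
print(loose)   # ['<1>', '<b>', '<h1>']
print(strict)  # ['<b>', '<h1>']

# Counted repetition, fenced by word boundaries.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", "999 1000 9999 10000")
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", "99 100 99999 100000")
print(four_digits)    # ['1000', '9999']
print(three_to_five)  # ['100', '99999']
```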
Greedy/Lazy Quantifiers ✨
- repetition operators (quantifiers) are greedy by default, they expand as far as possible:
  - `<.+>` matches `<html> apsdpdapdapjaid </html>` in `blablabla <html> apsdpdapdapjaid </html> balabla`
- adding `?` after a quantifier makes it lazy, it expands only as far as needed to satisfy the remainder of the regex:
  - `<.+?>` matches `<html>` and `</html>` in `blablabla <html> apsdpdapdapjaid </html> balabla`
- another solution, use negation:
  - `<[^<>]+>` will do the same job
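A sketch comparing the three patterns in Python, assuming the sample string ends with a closing `</html>` tag:

```python
import re

text = "blablabla <html> apsdpdapdapjaid </html> balabla"

# Greedy: .+ runs to the last '>' it can still reach.
greedy = re.findall(r"<.+>", text)
print(greedy)  # ['<html> apsdpdapdapjaid </html>']

# Lazy: .+? stops at the first '>' that lets the regex finish.
lazy = re.findall(r"<.+?>", text)
print(lazy)  # ['<html>', '</html>']

# Negated class: never crosses a '<' or '>', so no backtracking is needed.
negated = re.findall(r"<[^<>]+>", text)
print(negated)  # ['<html>', '</html>']
```

The negated-class version is usually the more predictable choice, since it cannot run past a bracket in the first place.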
Backreferences:
- `\1` refers to the first capturing group: `([abc])=\1` can match `a=a`, `b=b`, `c=c`
- `\2`, `\3`, and so on refer to the capturing group with that index
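A short Python sketch of numbered backreferences (the `a=b` and `abba` inputs are added for illustration):

```python
import re

# \1 must repeat exactly what group 1 captured, so 'a=b' is rejected.
pairs = [m.group(0) for m in re.finditer(r"([abc])=\1", "a=a b=b c=c a=b")]
print(pairs)  # ['a=a', 'b=b', 'c=c']

# Two groups referenced in reverse order: a four-letter palindrome.
palindrome = re.search(r"(\w)(\w)\2\1", "abba")
print(palindrome.group(0))  # abba
```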
Named groups:
- use `(?<groupname>)` to declare a named group and `\k<groupname>` to refer back to a previously declared group
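A sketch of the same idea in Python. One caveat: Python's `re` module spells named groups differently from the syntax above, using `(?P<name>...)` to declare and `(?P=name)` to refer back. The group name `letter` is made up for this example.

```python
import re

# Python spelling of (?<letter>[abc])=\k<letter>
pattern = re.compile(r"(?P<letter>[abc])=(?P=letter)")

match = pattern.search("x=y, b=b")
print(match.group(0))         # b=b
print(match.group("letter"))  # b

# The back-reference must repeat the captured text, so this fails.
no_match = pattern.search("a=b")
print(no_match)  # None
```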
Lookaround:
- `q(?=u)` is positive lookahead:
  - it matches the `q` in `question`, but not in `Iraq`.
- `q(?!u)` is negative lookahead:
  - it matches the `q` in `Iraq` but not in `questions`
- `(?<=a)b` is positive lookbehind:
  - it matches the `b` in `abc`
- `(?<!a)b` is negative lookbehind:
  - it doesn't match the `b` in `abc`, but matches the `b` in `cbc`
- Best example: how to include a whole directory except `./node_modules/`:
  - `^(?!.*node_modules).*\.js$`
    - `^` means start of the line
    - `(?!.*node_modules)` -> negative lookahead from the start of the line: rejects every file that has `node_modules` anywhere in its directory string
    - `.*\.js$` -> all js files (the dot is escaped so it matches a literal `.`, and `$` pins the extension to the end)
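The lookaround examples can be tried with a sketch like this in Python. The file paths are hypothetical, and the filter pattern escapes the dot and anchors the extension with `$`:

```python
import re

# Lookahead: the 'u' is checked but not consumed, so the match is just 'q'.
ahead = re.search(r"q(?=u)", "question")
print(ahead.group(0), ahead.span())  # q (0, 1)
ahead_miss = re.search(r"q(?=u)", "Iraq")          # None
not_ahead = re.search(r"q(?!u)", "Iraq")           # matches the final 'q'
not_ahead_miss = re.search(r"q(?!u)", "questions")  # None

# Lookbehind: checks the character before the current position.
behind = re.search(r"(?<=a)b", "abc")            # matches 'b' at index 1
not_behind_miss = re.search(r"(?<!a)b", "abc")   # None
not_behind = re.search(r"(?<!a)b", "cbc")        # matches 'b' at index 1

# Keep .js files whose path does not mention node_modules.
paths = ["./src/index.js", "./node_modules/lib/index.js", "./README.md"]
js_filter = re.compile(r"^(?!.*node_modules).*\.js$")
kept = [p for p in paths if js_filter.search(p)]
print(kept)  # ['./src/index.js']
```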