Regex

Page 2: Regex

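char

Indivisible (within the scope of this discussion)

For example: A, b, ",", "@", " ", Chinese characters

Most can be recognized visually; whitespace such as space, newline, and Tab is also included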

Page 3: Regex

Set<char>

A finite set of characters is called an alphabet

Usually non-empty. For example: English letters, digits, Chinese characters, and punctuation

Page 4: Regex

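String

Characters joined together in order are called a string.

Usually of finite length: it can contain 1 character, 2, …, or 0 (ε, "", the empty string)

For example: "abcdaaab"

In ES, a string is enclosed in single or double quotes.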

Page 5: Regex

Algebra of string

+ is string concatenation

empty string + another string = another string + empty string = that other string
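The identity law for the empty string can be checked directly in ECMAScript. This is a minimal sketch; the variable names are illustrative only.

```javascript
// Strings may be written with single or double quotes.
const empty = "";        // the empty string ε
const s = 'abcdaaab';

// The empty string is the identity element of concatenation.
const left = empty + s;
const right = s + empty;

// Concatenation is also associative.
const associative = ("ab" + "c") + "d" === "ab" + ("c" + "d");

console.log(left === s, right === s, associative);
```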

Page 6: Regex

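Alphabet vs string

Alphabet: a set; unordered, no duplicates; its size is its cardinality

String: a list; ordered, duplicates allowed; its size is its length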
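As a rough illustration of the difference, a built-in Set can stand in for an alphabet and an ordinary string for a list of characters. The names below are illustrative.

```javascript
// A set collapses duplicates and has no meaningful order.
const alphabet = new Set(["a", "b", "a"]);

// A string keeps order and repetition.
const str = "aba";

const cardinality = alphabet.size; // number of distinct characters
const length = str.length;         // number of positions in the string

console.log(cardinality, length);
```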

Page 7: Regex

Set<string>

A set of strings

Usually infinite, since string length is not bounded

It can also be finite, for example the empty set, or {"a","ab"}

Page 8: Regex

Algebra of Set<string>

|

Equivalent to set union; the result is still a Set<string>

Ф = the union of zero Set<string>

{"a","bc","e"} | {"a","1"} = {"a","bc","e","1"}

• Many use ∪, +, or ∨ for alternation
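The union can be sketched with the built-in Set. The helper name `union` is an assumption made for this example, not something defined in the slides.

```javascript
// Hypothetical helper: the union of any number of string sets.
function union(...sets) {
  const result = new Set();
  for (const set of sets) {
    for (const item of set) result.add(item);
  }
  return result;
}

const u = union(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
const zero = union(); // the union of zero sets is the empty set Ф

console.log([...u], zero.size);
```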

Page 9: Regex

Concatenation

Similar to the Cartesian product; the result is still a Set<string>

For example: {"a","bc","e"} {"a","1"} = {"aa","a1","bca","bc1","ea","e1"}
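A minimal sketch of this operation: every string of the first set is joined with every string of the second. The helper name `concat` is an assumption for illustration.

```javascript
// Hypothetical helper: concatenate two sets of strings pairwise.
function concat(R, S) {
  const result = new Set();
  for (const r of R) {
    for (const s of S) result.add(r + s);
  }
  return result;
}

const product = concat(new Set(["a", "bc", "e"]), new Set(["a", "1"]));

console.log([...product]);
```

Because the result is a set, two different pairs that happen to produce the same string count once, which is one reason the slide says "similar to" the Cartesian product.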

Page 10: Regex

Power

{"a","1"}² = {"a","1"} {"a","1"} = {"aa","a1","1a","11"}

{"a","1"}³ = {"a","1"} {"a","1"} {"a","1"} = {"aaa","aa1","a1a","a11","1aa","1a1","11a","111"}

Define {"a","1"}¹ = {"a","1"}

so that {"a","1"}² = {"a","1"}¹⁺¹ = {"a","1"} {"a","1"}

Define {"a","1"}⁰ = {ε}

so that {"a","1"}¹ = {"a","1"}⁰⁺¹ = {ε} {"a","1"}
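The definitions above can be sketched as repeated concatenation starting from {ε}. The helpers `concat` and `power` are assumptions for this example.

```javascript
// Hypothetical helper: pairwise concatenation of two string sets.
function concat(R, S) {
  const result = new Set();
  for (const r of R) {
    for (const s of S) result.add(r + s);
  }
  return result;
}

// Hypothetical helper: S to the n-th power, with S⁰ = {ε}.
function power(S, n) {
  let result = new Set([""]); // {ε}
  for (let i = 0; i < n; i++) result = concat(result, S);
  return result;
}

const S = new Set(["a", "1"]);
const p0 = power(S, 0);
const p1 = power(S, 1);
const p2 = power(S, 2);
const p3 = power(S, 3);

console.log([...p0], [...p1], [...p2], [...p3]);
```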

Page 11: Regex

Combining product and alternation

S{m,n} = Sᵐ | Sᵐ⁺¹ | Sᵐ⁺² | … | Sⁿ

S{m,} = Sᵐ | Sᵐ⁺¹ | Sᵐ⁺² | …

S? = S⁰ | S

S+ = S¹ | S² | S³ | …

S* = S⁰ | S¹ | S² | S³ | …

* is called the Kleene star
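These operators map directly onto ECMAScript quantifiers. The sketch below takes S = {"ab"} as an illustrative choice and anchors each pattern with ^ and $ so that the whole string must belong to the set.

```javascript
// S = {"ab"}; (?: ) groups without capturing.
const between2and3 = /^(?:ab){2,3}$/; // S{2,3} = S² | S³
const atLeast2 = /^(?:ab){2,}$/;      // S{2,}  = S² | S³ | …
const optional = /^(?:ab)?$/;         // S?     = S⁰ | S
const oneOrMore = /^(?:ab)+$/;        // S+     = S¹ | S² | …
const star = /^(?:ab)*$/;             // S*     = S⁰ | S¹ | …

console.log(between2and3.test("abab"), optional.test(""), oneOrMore.test(""));
```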

Page 12: Regex

Priority of ops

* is highest, then concatenation, then alternation

Parentheses may be omitted accordingly. For example, (ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*.
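A quick way to see the precedence rules at work is to compare the fully parenthesized pattern with the short form on a few inputs. The sample strings are arbitrary choices.

```javascript
// Both patterns are anchored so the whole string must match.
const implicit = /^(?:a|bc*)$/;
const explicit = /^(?:a|(?:b(?:c*)))$/;

const samples = ["a", "b", "bcc", "ac", "bcbc", ""];

// The two forms should accept exactly the same strings.
const agree = samples.every((s) => implicit.test(s) === explicit.test(s));

console.log(agree, samples.filter((s) => implicit.test(s)));
```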

Page 13: Regex

Regular Expression

Some Set<string> are called regular expressions (RE, RegExp, or Regex)

The following are RegExp:

{"a"} is a RegExp, for any char a in the alphabet

R|S and RS are RegExp, if R and S are both RegExp

R* is a RegExp, if R is a RegExp

So {ε} is a RegExp

If a Set<string> cannot be constructed by the process above, it is not a RegExp

Page 14: Regex

Note Ф is often included in RegExp

Page 15: Regex
Page 16: Regex

See Standard

Page 17: Regex

Empty

Empty is allowed in: [] () |
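Assuming the slide means that a class, a group, and an alternative may each be left empty, the following sketch shows how ECMAScript treats those three cases.

```javascript
const emptyGroup = /()/; // an empty group matches the empty string
const emptyAlt = /a|/;   // the empty right-hand alternative matches the empty string
const emptyClass = /[]/; // an empty class contains no character, so it never matches

const groupMatch = emptyGroup.exec("x");

console.log(groupMatch[1] === "", emptyAlt.test(""), emptyClass.test("abc"));
```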

Page 18: Regex

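Assertion

^ $

\b Word boundary: a position between a word character and a non-word character (one that is not _, [0-9], or [A-Za-z])

\B Not \b

(?=expression) positive lookahead

(?!expression) negative lookahead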
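Assertions match a position rather than consuming characters. A short sketch with arbitrary sample strings:

```javascript
// \b requires a word boundary on each side of "cat".
const wholeWord = /\bcat\b/;

// \B requires that there is no boundary before "cat".
const insideWord = /\Bcat/;

// Lookahead checks what follows without including it in the match.
const followedByBar = /foo(?=bar)/.exec("foobar");
const notFollowedByBar = /foo(?!bar)/;

console.log(wholeWord.test("a cat sat"), insideWord.test("concat"), followedByBar[0]);
```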

Page 19: Regex

quantifier

? + * {m,} {m,n}

Each of the above becomes lazy when followed by another ?
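The difference between greedy and lazy matching shows up when more than one match length is possible. The input strings here are illustrative.

```javascript
const text = "<a><b>";

// Greedy: + takes as many characters as it can.
const greedy = text.match(/<.+>/)[0];

// Lazy: +? takes as few characters as it can.
const lazy = text.match(/<.+?>/)[0];

const greedyCount = "aaa".match(/a{1,}/)[0];
const lazyCount = "aaa".match(/a{1,}?/)[0];

console.log(greedy, lazy, greedyCount, lazyCount);
```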

Page 20: Regex

capture

( ) capturing group

(?: expression) non-capturing group
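A small sketch of how the two group forms differ in the result of exec; the pattern and input are arbitrary.

```javascript
// The first group captures, the second only groups.
const match = /(\d+)-(?:\d+)/.exec("12-34");

// match[0] is the whole match; match[1] is the only captured group.
console.log(match[0], match[1], match.length);
```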

Page 21: Regex

Atom Escape

Page 22: Regex

\c followed by a lowercase or uppercase letter (a control character)

\a = a, because a has no special meaning assigned; the same holds for some other letters

\u002F (Unicode escape)

\0 (NUL)
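These escapes can be tried out as follows. One assumption to note: treating \a as a plain "a" relies on the lenient behaviour of patterns without the u flag, so this sketch deliberately uses no flags.

```javascript
// \cJ is Control-J, the line feed character (code unit 10).
const controlJ = /\cJ/.test("\n");

// \u002F is the code unit for "/".
const slash = /\u002F/.test("/");

// \0 matches the NUL character.
const nul = /\0/.test("\0");

// \a has no special meaning, so without the u flag it matches a literal "a".
const plainA = /\a/.test("a");

console.log(controlJ, slash, nul, plainA);
```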

Page 23: Regex

Character Class

[ ]

[abc] = {"a","b","c"}

[a-c] = {"a","b","c"}

[-ca] where - is literal

[ac-] where - is literal

[^a-c] = alphabet \ [a-c] (set difference)
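A brief check of these class forms, with each pattern anchored so that it must match exactly one character:

```javascript
const range = /^[a-c]$/;        // a, b, or c
const leadingDash = /^[-ca]$/;  // "-" first: literal dash, c, or a
const trailingDash = /^[ac-]$/; // "-" last: a, c, or literal dash
const negated = /^[^a-c]$/;     // any single character outside a..c

console.log(range.test("b"), leadingDash.test("-"), trailingDash.test("b"), negated.test("d"));
```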

Page 24: Regex

Escape Class

[\b] = {backspace}

[\]] = {"]"}

[\B] error

[\1] error; outside a class, \1 refers to the captured group
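The following sketch checks the two valid class escapes and one of the errors. It uses the u flag for the error case as an assumption, because engines are often lenient about such escapes in patterns without that flag.

```javascript
// Inside a class, \b means backspace (U+0008), not a word boundary.
const backspace = /[\b]/.test("\b");

// \] lets a closing bracket appear inside a class.
const bracket = /[\]]/.test("]");

// [\B] is rejected when the pattern is compiled with the u flag.
let rejected = false;
try {
  new RegExp("[\\B]", "u");
} catch (e) {
  rejected = e instanceof SyntaxError;
}

console.log(backspace, bracket, rejected);
```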

Page 25: Regex

. any char except newline

\d digit

\D not digit

\w word char

\W not word char

\s whitespace

\S not whitespace
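A few spot checks of these shorthand classes; the sample characters are arbitrary.

```javascript
const dotOnNewline = /./.test("\n"); // . does not match a newline
const digit = /^\d$/.test("7");
const notDigit = /^\D$/.test("x");
const wordChar = /^\w$/.test("_");   // underscore counts as a word char
const notWordChar = /^\W$/.test("-");
const space = /^\s$/.test("\t");     // Tab is whitespace
const notSpace = /^\S$/.test("a");

console.log(dotOnNewline, digit, notDigit, wordChar, notWordChar, space, notSpace);
```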

Page 26: Regex

Back reference

\1

\0 is <NUL>, not a back reference

\1000000000: error if there are not that many captured groups
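A back reference must match the same text that the group captured. In the sketch below the error case uses the u flag as an assumption, since patterns without it are often treated leniently.

```javascript
// \1 must repeat exactly what group 1 matched.
const doubled = /^(a)\1$/;

// The closing quote must be the same kind as the opening quote.
const quoted = /^(["'])x\1$/;

// Referring to group 2 when only one group exists is rejected with the u flag.
let rejected = false;
try {
  new RegExp("(a)\\2", "u");
} catch (e) {
  rejected = e instanceof SyntaxError;
}

console.log(doubled.test("aa"), quoted.test('"x"'), quoted.test("\"x'"), rejected);
```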

Page 27: Regex
Page 28: Regex

/ … /gim

Where g is for global, i for case insensitive, and m for multiline
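The effect of each flag can be seen on small inputs such as these:

```javascript
// g: find every match instead of only the first.
const globalCount = "aAa".match(/a/g).length;

// i: ignore case, combined here with g.
const insensitiveCount = "aAa".match(/a/gi).length;

// m: ^ and $ also match at line breaks.
const multiline = /^b/m.test("a\nb");
const singleLine = /^b/.test("a\nb");

console.log(globalCount, insensitiveCount, multiline, singleLine);
```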

Page 29: Regex

Note: // is taken as the start of a comment, so it cannot denote the empty regular expression

Use /(?:)/ instead
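The same form appears when an empty pattern is built through the constructor, as this short sketch shows.

```javascript
// An empty pattern matches at any position of any string.
const emptyLiteral = /(?:)/;

// The constructor accepts an empty pattern string; its source is reported as (?:).
const emptyConstructed = new RegExp("");

console.log(emptyLiteral.test("anything"), emptyConstructed.source);
```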

Page 30: Regex
Page 31: Regex

RegExp

RegExp is a function

It can construct regular expressions: RegExp(pattern, flags) or new RegExp(pattern, flags)

RegExp.prototype
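Both call forms produce a regular expression object, as this sketch with an arbitrary pattern shows.

```javascript
const kind = typeof RegExp; // "function"

// With or without new, the result is a RegExp instance.
const called = RegExp("a+", "g");
const constructed = new RegExp("a+", "g");

// Note the doubled backslash when the pattern is written as a string.
const digits = new RegExp("\\d+");

console.log(kind, called instanceof RegExp, constructed.source, digits.test("42"));
```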

Page 32: Regex

RegExp.prototype

constructor

exec: returns the matches as an array (or null if there is no match), ordered by the appearance of (; there is one implicit () around the whole pattern

test: returns a bool

toString: returns a string
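The ordering of the exec result by opening parenthesis is easiest to see with nested groups. The pattern below is an illustrative choice.

```javascript
// Groups are numbered by the position of their opening parenthesis.
const m = /((a)(b))c/.exec("xabc");
// m[0] is the implicit group around the whole pattern,
// m[1] is ((a)(b)), m[2] is (a), m[3] is (b).

const noMatch = /a/.exec("b");   // null when nothing matches
const found = /a/.test("cat");   // boolean
const text = /a+/g.toString();   // the pattern as source text

console.log(m[0], m[1], m[2], m[3], m.index, noMatch, found, text);
```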

Page 33: Regex

Members of RegExp instance

source

global

ignoreCase

multiline

lastIndex: integer, with attributes { [[Writable]]: true, [[Enumerable]]: false, [[Configurable]]: false }
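With the g flag, exec reads and updates lastIndex, which is why the property has to be writable. A minimal sketch:

```javascript
const re = /a/g;
const input = "aXa";

const first = re.exec(input);  // match at index 0
const afterFirst = re.lastIndex;

const second = re.exec(input); // search resumes from lastIndex
const afterSecond = re.lastIndex;

const third = re.exec(input);  // no more matches: null, and lastIndex is reset
const afterThird = re.lastIndex;

// Inspect the attributes of the lastIndex property itself.
const descriptor = Object.getOwnPropertyDescriptor(re, "lastIndex");

console.log(first.index, afterFirst, second.index, afterSecond, third, afterThird, descriptor);
```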

Page 34: Regex

Thank You!

The End

Page 35: Regex

<ZWNJ> and <ZWJ> are format-control characters that are used to make necessary distinctions when forming words or phrases in certain languages.

Page 36: Regex

The Unicode format-control characters (i.e., the characters in category "Cf" in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages).

All format control characters may be used within comments, and within string literals and regular expression literals.

In ECMAScript source text, <ZWNJ> and <ZWJ> may also be used in an identifier after the first character.

Page 37: Regex

<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. <BOM> characters are treated as white space characters (see 7.2).

Page 38: Regex

The special treatment of certain format-control characters outside of comments, string literals, and regular expression literals is summarised in Table 1.

Table 1 — Format-Control Character Usage

Code Unit Value    Name                     Formal Name    Usage
\u200C             Zero width non-joiner    <ZWNJ>         IdentifierPart
\u200D             Zero width joiner        <ZWJ>          IdentifierPart
\uFEFF             Byte Order Mark          <BOM>          Whitespace
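One consequence of the table that is visible from regular expressions: since <BOM> is treated as white space, \s matches it, whereas <ZWNJ> and <ZWJ> are not white space. A short sketch:

```javascript
const bomIsSpace = /\s/.test("\uFEFF");  // <BOM> counts as white space
const zwnjIsSpace = /\s/.test("\u200C"); // <ZWNJ> does not
const zwjIsSpace = /\s/.test("\u200D");  // <ZWJ> does not

console.log(bomIsSpace, zwnjIsSpace, zwjIsSpace);
```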