字幕表 動画を再生する
There's a parsing theme, devotees will know, and there's a Regexp theme. And our
aim, in the end, is to unite these all into one, at the top, so that you're all
totally happy - or completely unhappy - as the case may be. We have done Regular
Expressions and the last one I did was really very simple and straightforward,
and showed you absolutely nothing about what can go wrong. And some of you out
there, coming up with the usual old jokes, quite right, about: "Oh! you're implementing
using Regular Expressions. Now you've got two problems", Yeah! We'll look on in glee
as we now take the story a bit further and see what can go right and how
powerful Regular Expressions are. But also a little bit of what can go wrong
and why you've got to be careful in your use. So what can it be this marvellous new
example that consumes Life the Universe and Everything and exhibits a whole load
of new features? And you're going to be quite convinced I'm off my head.
It's Roman Numbers. Way way back when we first, here at Nottingham, got Lex and Yacc, in
UNIX Version 7 - it was in the late 70s. So, believe me, I have been ... or, back in
those days I was setting Roman Numbers as an example of what Lex - remember the
Regular-Expression-based front-end to your compiler. Does tour identifiers and
all that. I thought, all right, as an exercise, just write a Regular Expression
that will accept Roman Numbers - all of the good ones and reject all the bad ones.
Now, I can't emphasize too strongly: that rider is important! It is very easy, in a
Regular Rxpression, quite easy, to get it to recognize what you want, but will it [also]
reject all the stuff that's bad? And it is sometimes so easy to overlook [things].
I say with shame because, when I'd set that question, I scratched down a quick and
easy answer - which I now see is full of holes .... Because if you input, into my
regular expression MMII - again, looking ahead, Roman Numbers
MMII [is] 2002. Fine. But then I thought ... well I thought now, I didn't think then:
"What happens if I do MIIM? [winces at the thought]?" [Answer = ] 2002. [It] shouldn't be. There's a strict hierarchy
in the way these work. So, at this stage we don't know how much, if anything, you
remember about Roman Numerals. So, as an Extra Bits we've put out a simple
tutorial on Roman Numerals based around this example, But for those of you that
are totally happy with Roman numerals, we can just carry on with the main video now.
So, just to locate ourselves then, and do a few examples to make everybody
comfortable I'm going to write down MCDXCII. Start at the left. We've got a
thousand [M] Then we see CD and, remember, this is one of these subtraction things
CD is "a hundred less than 500" which is 400. MCD. So we're up to 1400 so far. XC?
>> Sean: OK That's 90 >> DFB: 90. II? >> Sean: 2 >> DFB: 1492. What happens in 1492 Sean, in Western European folklore?
"Columbus discovered horizons new in America (1492) >> Sean: I should have known that one
>> DFB: Yeah! right. And, of course, the year we're in at the moment, as we film all of this,
is - that's simple MMXX, 2020. And one which is dear to me, I'm sure you'll
work out why. Yes, it's got a lot of these prepended abbreviation things in there look,
MCM; 1900, XL: 10 less than 50 (40) IV: one less than 5 (4) -- 1944. 244 let's focus
Let's focus ourselves now on how would we do a Regular Expression for all of this?
In fact, let's make things very very simple and say: "All I'm gonna do at the moment
is concentrate on getting a Regular
Expression pattern that will either generate me one or two, or three, or four, or five -
[writes on paper] forgive me I'll carry on down here. Or six, seven, eight, or nine. And I won't go
any further than that. Because tens is a new adventure
further to the left. So, those numbers then, from one to nine, which I've written
know like that now. I hope you can see that if you're completely simple-minded
about this and don't want to do complex Regular Expressions you really could
write yourself a great long 'alternation' as it's called. In other words a great
big sequence of ORs. And you could do everything from one up to thousands and
just have this walloping great OR clause, that your Regular Expression parser will
probably go bananas and say: "too many alternations", or something like this.
So, no, that's not the way to do it. What one wants to do is to generalize and use,
as it were, elegant Regular Expression capabilities to shorten this a bit.
So, there we are the Roman numbers from I to IX,
All written out. First of all let's try and separate
these things into classes, and groups, that make sense. There are two weird
things which involved the subtraction rule, you know, like IV (1 less than 5)
IX (one less than 10). I think we will do those together, because they're weird in the
same way, if you like. The other things are much more regular like you go from 1
2 and 3, which is with III. And similarly, here, you get the same pattern but coming
after a V - a V on its own, or VI, VII, VIII, and so on. So, we've got three
groupings: things involving Is with or without a V in front of them;
and the two special cases IV and IX. So I would conjecture there's a wonderful
piece of Regular Expression covering so far what we've done, but something new
coming up - so do pay attention - would be this: my RE is going to be I
- now this is square 'choice' brackets [], We have seen these before.
But whereas, before, I put ranges inside them - you know like [A-Za-z]
[0-9], any one of those digits.
These aren't ranges; these are just explicit possibilities inside square
brackets. It's called a 'choice' so I've got to have an 'I', if I'm on this side of the
expression. But then it could be followed by either an X or a V, but it must be one
or other of them, and only once. There's no repetition outside; there's no
star there's no plus; no nothing. So those are the special cases taken care of,
the IV and the IX. Now we put a walloping great big OR bar down the middle. Or if
it's not a special case it's a regular straightforward case. But we've been
clever with our REs now. Oh yes! look at this. Now, again, I think this is
something we may have mentioned but probably not another piece of regular
expression notation is: " ... if you put a question mark after something it means
0 or 1 of". So, what I'm saying here is I can either have a V, or I don't need
a V. How can you not need a V? Well, what's following on here is going to be
something that really is new. This is a counted repetition in curly braces zero
- [should be 0,3]. Look at that! Bound to win an award! Let's just run through it.
The two special cases are off on that left-hand side of the OR bar. But if you choose not
to go down that route you go to here and it says: "Oh well if it's not one of
the two special cases then it's like the following: anything between 0 and 3 Is.
I mean you may not have Is at all - you can just have V on its own if you want -
preceded by an optional V, nought or one these, >> Sean: Tthat's giving you the choice of
either 1 2 3, or 5, or 6 7 8, right? >> DFB: Yeah, yeah, that's right.
But it's all to do with sometimes having zero of something. Now, hereby
hangs a tale about Regular Expressions. And those of you that keep on posting
these notices about " ... you've got multiple problems with Regular Expressions !". Well. here's
another one. It's the strange case of the 'empty
alternative', which really can gang up to get you. Because what you're saying here,
in this expression, is: " ... Is it possible for it to match nothing at all and yet
that's a valid match?" Yes it is! Because if you choose to ignore what's on the
left of the OR bar here - which you can do, it's either/or - we come here, and we say: " V,
oh! I can have 0 or 1 of those. So, let's have 0 of them". The I ? I can have
anything between 0 to 3 of those. So let's choose 0 of them. So, in other words
choosing nothing at followed by nothing is a valid match for a Roman number [according to this RE].
And if you think about it it's true. Nothing in Roman Numbers is compulsory. You can have
M's, or no M. Yyou can have C's or no C's. You know all of them.
It's all optional. But as you start getting down to the units digits you think: "I've got
to have something !" The paradox, then, about Roman Numbers
done as a Regular Expression, is: you've got all sorts of cases that yield you
nothing at all and yet you're going to be very disappointed if you don't have something!
But saying "... overall you must have something" That's not a context-free
assertion. That's a context-sensitive assertion. So we'll live with that for the
moment. And I think also what I'd like to just show you now, if I can get the
screen fired up ... and I'm using 'awk' here to demonstrate this, I'm going to add yet
another little piece of notation to all of this. I'm going to say that you can, if
you want to, insist that what you're matching must just be a valid Roman
Number, on a line on its own, with no writing or messing about on either side
of it. Just the number. Start of the number (^), end of the number ($). Finish. Right?
And you denote that by saying start of string anchor, ^ End of string anchor, $.
You're basically saying whatever pattern there is must fit between
the start of the string, on a line on its own, and the end of the string - the
newline right at the end. Now, this is the key thing about this. Those things [^$ markers]
if you put them in, are not options. You can't have zero of them. If you say
you must match the beginning of the line, you must match the beginning of the line. If you
say you must match the end of the line, you cannot ignore it. You can't have zero
of them. So, in a way, this is a kind of reality check and it's generally the
case that if you use these correctly you won't end up with null strings being a
pseudo-match. Because this thing is saying: "Well, whatever the finite [non-zero] length
string is, it's got to fit between beginning of line end of line. Here's the
'awk' script that I hacked together, to try out a few alternatives. And I decided to
go for this simple little piece first. Can I do everything from 1 to 9 in a
single piece of Regular Expression. And you'll see here this is classic 'awk'.
You, give a Regular Expression pattern, then you give an action. It's a C-like syntax,
even in 'awk'. Basically what I'm saying is; "If I match that then I print out 'Pattern
p1 matched' and so on". Now, if you look down here four of these things - because
this is just a test-bed - four of them are commented out, they will not be executed.
The one that will be executed is the one I've just been discussing with you, here,
which says: "I must match start of string then here's my Regular Expression
pattern - which I think will work from I to IX inclusive, and then it must match
end of string". So, I'm hoping that by uncommenting this that this will work.
And I'll type some good and bad Roman Numbers at it. And for every one it's happy
with, it will say 'Pattern P0 matched is' And it will echo back what it your Roman
Number [was] that you put in. So, what I want to do, now, is actually run that script, that
we've been looking at. 'awk - f ' take your instructions from a file called 'test-roman.awk'
OK, so when that runs now it should just hang there waiting for its
input of a Roman number. And remember it is going
to match that beginning of string, piece of regular expression, end of string.
>> DFB: So. which one do you want me to try first Sean? >> Sean: OK, let's go really simply and just go
for a number 5 [V]. >> DFB: V all on its own [waits] Look at that! That's a 'P0 matched is V
>> Sean: All right - a bit trickier 4 [IV] because it's got that 'minus one' sort of idea. Let's put one in that shouldn't work.
>> DFB: Go on that what do you want ? OK just put in C for Computerphile.
>> DFB: C for Computerphile? Now you might say that's a valid Roman number.
Yes, but it's it's not being covered by the bit of Regular Expression I've done, I've done a
piece of Regular Expression that literally, I hope, does I to IX and
nothing else. Let's see what happens there [with C] It echoes it back but it doesn't
do anything with it. So let's now try IX, looking good.
All right how about the ultimate test [which] is type in something utterly idiotic like HELLO
No, doesn't do anything with it, doesn't recognize it. So, we're back
showing the 'awk' script again and let me just remind you that what we've shown is
that as long as we put the anchors in at the beginning and the end of this
pattern then our minimal testing, but I think we've tried all the options, it's
recognized V it's recognized IX. IV all this stuff. That one is absolutely
fine and it rejects rubbish. I've actually put in, as a comment,
StackOverflow's assertion that this [points at screen] is the thing that's the answer to life the
universe and everything. Let's move right across to the right and the first thing
you'll see here is the piece we've just done for the units as it were, I through to IX.
But as you go leftwards can you see that the beauty of this is ...
that has handled the units part of the Roman Numbers. It's dead easy to see, therefore,
that it's like a repetition of the same idea, but with different symbols? When you
come to the tens range What are the exceptions there XC (ten less than a hundred)
XL (ten less than fifty). I's just like the I's and the X's and V's were before, look.
Similar. X[CL]. Or it is is an optional L, for 50, followed by a tens digit, zero
{0,3} of them. You can't say just {1,3} because L on its own is
perfectly valid as a Roman Number Number. So, you see that piece of pattern - similar to
the far-right one but just with a relabeling almost of the digits you're
using- that copes with the tens. Up here [points to left of screen] it's coping with the hundreds, going on to
the thousands. So, the story finishes off in here by saying: Ah! you can have CM
(a hundred less than a thousand) i.e. 900 in other words. CD for 400.
And the midpoint of the range, which everything pivots around. It was V
there, it was L here, it is going to be D here. I hope you all know that D is
500? So there we are. how about you see three very similar
pieces of pattern for the hundreds the tens and the units and right out here at
the far left, M*. And M* means you can have zero or more 1000s. So, we've
got the basis there for something that looks like it would do everything. The
only problem is these empty matches. But that's such a big problem we probably
ought to have an episode devoted to that all on its own.