構文解析の説明 - Computerphile (Parsing Explained - Computerphile)

字幕表動画を再生する

I actually think that if we're going to do parsing it might be easiest not to
dive into compilers straight away, as some of you might want me to do, but to
actually just take a simple, very simple, subset of the English language and see
how we would parse that. And if you look up in a dictionary "what does `parsing' mean"
it says, essentially, one of the definitions is: " ... breaking up sentence
structure into its component parts under the direction of a grammar." so the
So the grammar is the rules of your language. Well, I have drawn up here a language
involving furry dice and robots and people, men and women, and cats and dogs
and they can 'bite' they can 'stroke' they can ... whatever. And I thought: that little
grammar, which I'll walk you through, is capable of generating between 200 and
300 sentences and they are all valid English. But there's no way it
captures the spirit of English. The idea that your sentences have
subject-verb-object is pretty universal, I think, among what's called the
Indo-European languages. However, sometimes it's not always in the order
of subject-verb-object, although that's common. Sometimes they get twisted up
particularly in German, where there's a definite option to have the verb last.
In this example I'm pointing to here, I'm using a notation that we did cover about
three or four years ago now. We'll put out a link.
I think the episode was called Angle Brackets. And they invented a notation
that looks like this: pointy brackets. Yes, back in the very late 50s and early 60s
people who wanted to write compilers realized they had to write
down a formal specification of what were the rules in their language. And two
computer scientists actually won the Turing award - [the] ACM Turing Award - for their
notation and their development of a grammar
for the language Algol 60. And it uses angle brackets all over the place.
So, that was so-called Backus-Naur form. Those are the two computer scntists
who developed it. People really liked it. And I've used it here because it's
verbose enough to make some sense. I'll read it out to you. What I'm saying, up at
the top of the grammar - using the angle brackets for what's going to become
parts of a so-called parse tree. A is defined as then a
then an . That's the most common form of English sentence the
subject is the person, or thing, doing the action. The verb says what the action is:
'kicked' 'bit', 'stroked' ... whatever. The object says what it is doing it to. [e.g.] 'the man
kicked the robot' And it's such a powerful language! You can have either
'the' or 'a' [as] definite or indefinite article. You can have 'the robot' or
'a robot'. You can have a choice of verbs either 'bit', 'kicked' or 'stroked' - you're
going to be very expressive in this language (!) Your subject of the sentence
can either be 'dog', 'cat', 'man', 'woman' or 'robot'. And finally the tailpiece. The object
i've decided could either be any followed by a again, like 'a dog',
'the robot bites a dog' [is] perfectly possible. But just to show that you can put in
shortcuts, if you like, and put bits of structure in there that aren't parsed -
they're just globbets you can put in - I have defined the phrase: 'two furry dice'
as being a valid 'object'. Just the string "two furry dice". So the target
sentence, that ultimately we're going to try and construct from all of this, is
the [in lower case] robot stroked [writes on paper] ... running out of space here, carry on on the line below ...
"the robot stroked two furry dice" We all agree it makes sense [but] is it parsable
against that set of rules? And how would you go about showing that it was?
>> Sean: So by 'parsable' you mean "can we decode it"? >> DFB: [Yes] can we decode it? Can we assign everything to
be the right flavour of structural component according to those [grammar] rules.
You can either start with a string, which we've got here, and start trying to group
things together and work upwards. Now we all know about trees - the root is at the
top. We're computer scientists, the root is at the top; the leaves are at the bottom.
They're always upside down. Do we start with and leaves and say:"Ooh! the
robot stroked ... The robot stroked ... Looking sideways at the grammar I can
see there's a shortcut for "the robot". So I'm gonna start coming from the string
upwards and see if I end up back at the root of the tree, or shall I start at the
root of the tree and say: "I must have a subject then a verb then an object". Can I
see anything that could likely be the start of a ? That's called coming
top-down. So it's been there in computer science ever since the dawn of computer
science: the top-down approach versus the bottom-up approach.
There's no prizes for guessing that functional languages like Haskell tend,
by their very structure, to favour the top-down approach. It's messier, is the
bottom-up approach. I shall try to indicate both, OK, but let's
just this once, let's go top down rather than bottom-up.
If we're coming top-down it is telling me, is this grammar, that I must have a
then a then an . Now, for those
of you who are concerned with a mechanistics -- you can actually think
of these things, if you're coming top down, as being pushed as targets, that
you've got to satisfy, on a stack which .... We all know about stacks! This one is not
building upwards and downwards it's building from right to left. So, at the
top of my stack currently, because I'm coming top down, I say my target my
sub-target is . I must find a . But I must keep half an eye on the target
string to say: "does it really match?" If we cross-refer now, to this grammar,
for inspiration, we're trying to match a If I look at the grammar over
there it tells me that a can either be an article and a noun like
"the" and "a" , with a like 'cat', 'dog', 'robot' or you can take a shortcut. Because it
tells you here that the just could be the phrase "the robot". So let's
go for broke. I'm looking at the input string and I see
"the robot". So let's take the shortcut just to be devils, right?! I can say
straight away then, taking the shortcut, that can be the string
"the robot". That's good! Now, having satisfied , I pop it
off the top of my stack. I've done that. Next thing, to match this sentence,
you've got to be able to match the . Well, we're going along the string we've coped
with "the robot". Next word - next valid leaf word - that has to match up with being a
. I'm looking at 'stroked'. Is 'stroked' a possibility? Look at . What are the
possibilities for ? And I see 'bit' 'kicked' or 'stroked'. Great! I choose 'stroked'.
But, once again, if you can imagine the stack I then pop that off the top of the stack
-- we've done it! The stack is shrinking! all we've got left to identify
is an that matches [in] that sentence. Again, I've pulled a fast one here! In the
rules for I said you can either break it down into an and a ,
just like the , so nice having the action done to it. Or you can take a
whizzy short-cut. An allowed shortcut, for an is what you might call an
and I put it in deliberately, just to make a point: "two furry dice" Wonderful!
Simplest parse going: an can be "two furry dice". We declare success and we say
we have matched "the robot strokes two furry dice" by going .
came down into being shortcut "the robot". developed straight away
into being "stroked". shortcut to It's a very shallow tree, really.
But it's developed the target sentence: "the robot stroked two furry dice".
Now, the only thing you might say: "Well, yes, wonderful, we've parsed it. But would
it have worked if the instead of taking the shortcut. Could you have
done it with the option?" Yes, you could.
I could have basically, come here: let's write out:
and, yes, right at the beginning instead of taking the shortcut phrase
"the robot", I could have said: "Ah! but I'm going to do it as can be an
followed by a ". So, the head of the stack then would be - when we're
looking at the sentence and starting off from the beginning again - it's got to be
an to start with. is 'a' or 'the'. Oh! I'm looking at the string, I see "the".
Bingo! So, article can be "the" - no problem.
Pop that off the stack. What's the next thing I've got to look for? . Can be "robot",
which is the next thing I'm looking at? Yes, "robot" is an option, so I could have done
"robot". Now the is as before. It's got to be "stroked". And "stroked" is
there. And equally the is as before. There's only one way to develop
the string: "two furry dice" and that's to take the declared short-cutk that does it for
you. The reason for doing this rather artificial example is to say: "Oh dear!
Does this matter? We have got a sentence that makes perfect sense to us: "the robot
stroked two furry dice". But you've drawn two slightly different trees. First time
you took a shortcut subject-phrase, just for "the robot". Next time you said: "Ah! but
we could also do it by breaking it down further into a 'the' and 'robot'. Does that
matter? In this particular case, in informal spoken English, no it doesn't.f
If people understand you the fact that your grammar is technically called 'ambiguous'
- so if you say to yourself what does 'ambiguous' mean? It means there's more
than one parse tree that seems to satisfy and give the final target phrase.
Does that matter? In this case 'no'. Does it matter in general? Yes, because you might
get very different effects under certain circumstances. What I think I'll show you
is a favourite one almost. Which is to check your 4-function desk calculator.
When you start typing things into a 4-function (so-called) hand-calculator.
You're basically putting in, shall we say, integer numbers - let's simplify it.
But the burning question is something like this: that if you type in 8 * 4 * 2
You may not think of it in those terms but computer scientists
would say: "Yes ,if you've got the right grammar rules that's an example of a
"sentence". That is a - what might you call it? - a sort
of subsection of a long calculation, that you do on the right-hand side [say] of a C statement.
It's just 8 times 4 times 2. There's no problem is there?. Eight 4s are 32.
Two 32s are 64. But then you say: "Ah! but we're into tree structures now. How would
you parse that sentence, given that multiply can only multiply two things
at a time. So, are you going to arrange it so that your overall expression groups
as 8 * 4, done on the left. Then you do the multiply by 2, which is
that tree structure. So, the question is you've got a little bit of depth on your
tree structure here. It's gone down a level whereas finally you've got one [operand]
up at the top level. Do you want that sub-structure to be on the left or on the right?
Because it would be equally OK, as we've got things at the moment, for
you to draw something that's got 8 here, multiplied by here 4 * 2 with a
multiplier between. So have we got a little bits of tree structure on the
left or on the right? And does it matter? And you say no it doesn't matter. We all
know that you can't complete the expression until it's left operand
is known and that involves forcing you to do the multiply. 8 * 4 is 32.
Fine, 32. What's the other operand here? It's multiply by 2: 32 * 2 = 64.
Equally, over here, 8 x (four twos are 8; eight 8s are 64) So, you're doing it two different ways
but you're getting the same answer. And the reason that's OK
is that multiply is well-behaved [over the integers] Mathematicians will tell you it commutes
a * b is the same as b * a for integers [and * is also associative over the integers]
However, just look at the ambiguity, OK, Two different parse trees and it
didn't matter for multiply. [But] just suppose those multipliers weren't multiplies but
were divides [ / ] . Does it then matter that you do 8 / 4 first in
this one? You get your 2 and the multiply is now replaced by a divide:
2 / 2 is 1. But look at this other possibility and
just mentally substitute divides for multiplies. 8t divided by the result
of 4 / 2 ? Well 4 /2 is 2. 8 / 2 is 4.
Two completely different answers! Is it 1 or is it 4. If you're panicking take
out your calculator now and check the app. Just do 8 / 4 / 2 and
if the answer isn't 1 demand your money back!
Here we see it then. Here is an example of where ambiguity [two differing parse trees] does lead to problems,
because depending which tree you choose you get a different answer. So, there we
are then. We now know what a grammar is. We know that against a grammar you can
try and parse things like sentences but "sentences" [is] used in its widest sense.
It could be a program statement from C and just think of this: 8 * 4 * 2
has been as a part of a phrase on the right-hand side of an arithmetic statement.
And grammar is the guiding principle for the whole thing.