Web Crawler - CS101 - Udacity

[Sebastian Thrun] So what's your take on how to build a search engine?
You've built one before, right?

[Sergey Brin - Co-Founder, Google] Yes. I think the most important thing
if you're going to build a search engine
is to have a really good corpus to start out with.
In our case we used the World Wide Web, which at the time was certainly smaller than it is today.
But it was also very new and exciting.
There were all sorts of unexpected things there.

[David Evans] So the goal for the first three units of the course is to build that corpus.
And we want to build the corpus for our search engine
by crawling the web, and that's what a web crawler does.
A web crawler is a program that collects content from the web.

If you think of a web page that you see in your browser, you have a page like this.
And we'll use the Udacity site as an example web page.
It has lots of content: it has some images, it has some text.
All of this comes into your browser when you request the page.

The important thing that it has is links.
A link is something that goes to another page.
So we have a link to the frequently asked questions,
we have a link to the CS 101 page.
There are some other links on the page.
And that link may show in your browser with an underline,
or it may not, depending on how your browser is set.
But the important thing about it
is that it's a pointer to some other web page.

And those other web pages may also have links,
so we have another link on this page.
Maybe it's on my name, and you can follow it to my home page.
And all the pages that we can find with our web crawler
are found by following the links.
So it won't necessarily find every page on the web.
If we start with a good seed page,
we'll find lots of pages, though.

And what the crawler's going to do is start with one page,
find all the links on that page, follow them to find other pages,
and then on those other pages it will follow the links on those pages
to find other pages, and there will be lots more links on those pages.
And eventually we'll have a collection of lots of pages on the web.

So that's what we want to do to build a web crawler.
We want to find some way to start from one seed page,
extract the links on that page,
follow those links to other pages,
then collect the links on those other pages,
follow them, collect all that.
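
The crawl loop described here (start from a seed page, extract its links, follow them, repeat) can be sketched roughly as follows. This is only an illustration, not the course's code: `FAKE_WEB`, `get_page`, `get_links`, and `crawl` are hypothetical names, the `example.com` URLs are placeholders, and an in-memory dictionary stands in for real page requests. It also assumes links are written as `<a href="...">`.

```python
import re

# Hypothetical stand-in for the web: URL -> page text (assumption, not real data).
FAKE_WEB = {
    "http://example.com/index": '<a href="http://example.com/faq">FAQ</a> '
                                '<a href="http://example.com/cs101">CS 101</a>',
    "http://example.com/faq": '<a href="http://example.com/index">Home</a>',
    "http://example.com/cs101": '<a href="http://example.com/author">Author</a>',
    "http://example.com/author": "No links here.",
    "http://example.com/orphan": "Nothing links to this page.",
}


def get_page(url):
    """Return the text of a page, or an empty string if it is unknown."""
    return FAKE_WEB.get(url, "")


def get_links(html):
    """Return every URL that appears in an <a href="..."> tag."""
    return re.findall(r'<a href="([^"]*)"', html)


def crawl(seed):
    """Follow links outward from the seed page and return the pages found."""
    to_crawl = [seed]   # pages we still have to visit
    crawled = []        # pages already visited, in visiting order
    while to_crawl:
        url = to_crawl.pop(0)
        if url in crawled:
            continue
        crawled.append(url)
        for link in get_links(get_page(url)):
            if link not in crawled and link not in to_crawl:
                to_crawl.append(link)
    return crawled


print(crawl("http://example.com/index"))
```

The `orphan` page is never reached because no crawled page links to it, which matches the point that a crawler only finds pages reachable from its seed.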

So that sounds like a lot to do.
We're not going to do all that in this first class.
What we're going to do in this first unit is just extract a link.
So we're going to start with a bunch of text.
It's going to have a link in it with a URL.
What we want to find is that URL,
so we can request the next page.
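
As a minimal sketch of that first step, assuming the link appears in the text as `<a href="...">` with double quotes, the URL can be pulled out with plain string searching. The function name `extract_first_url` and the sample page are hypothetical, chosen only for illustration.

```python
def extract_first_url(text):
    """Return the URL of the first <a href="..."> link in text, or None."""
    marker = '<a href="'
    start = text.find(marker)
    if start == -1:
        return None                      # no link in this text
    url_start = start + len(marker)      # first character of the URL
    url_end = text.find('"', url_start)  # closing quote of the URL
    if url_end == -1:
        return None                      # malformed link: no closing quote
    return text[url_start:url_end]


# Hypothetical sample text with a placeholder URL.
page = 'Welcome! <a href="http://example.com/cs101">CS 101</a> is open.'
print(extract_first_url(page))
```

Returning `None` when no link is found lets the caller tell "no more links" apart from an empty URL.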

The goal for the second unit
is to be able to keep going.
If there are many links on one page, you will want to be able to find them all.
So what we'll do in unit two
is figure out how to keep going to extract all those links.
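
One way to "keep going" is to repeat the same search, each time starting just past the link found previously. The sketch below is an assumption about how that could look, not the course's solution; `extract_all_urls` is a hypothetical name, and links are again assumed to be written as `<a href="...">`.

```python
def extract_all_urls(text):
    """Return the URLs of all <a href="..."> links in text, in order."""
    marker = '<a href="'
    urls = []
    position = 0                              # where the next search starts
    while True:
        start = text.find(marker, position)
        if start == -1:
            break                             # no more links
        url_start = start + len(marker)
        url_end = text.find('"', url_start)
        if url_end == -1:
            break                             # malformed link: stop searching
        urls.append(text[url_start:url_end])
        position = url_end + 1                # continue after this link
    return urls


# Hypothetical sample text with placeholder URLs.
page = ('<a href="http://example.com/faq">FAQ</a> and '
        '<a href="http://example.com/cs101">CS 101</a>')
for url in extract_all_urls(page):
    print(url)
```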

In unit three, well, we want to go beyond just one page.
So by the end of unit two we can print out all the links on one page.
For unit three we want to collect all those links, so we can keep going
and end up with our crawler following them to collect many, many pages.
So by the end of unit three we'll have built a web crawler.
We'll have a way of building our corpus.

Then the remaining three units will look at how to actually respond to queries.
So in unit four we'll figure out how to give a good response.
So if you search for a keyword, you want to get a response that's a list of the pages
where that keyword appears.
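
Answering a keyword query with a list of pages can be illustrated with a small index that maps each word to the pages containing it. This is a sketch under assumptions, not the structure the course builds: `build_index` and `lookup` are hypothetical names, words are split on whitespace only, and the page texts are made up.

```python
def build_index(pages):
    """Map each lowercased word to the list of URLs whose text contains it."""
    index = {}
    for url, text in pages.items():
        for word in text.lower().split():
            urls = index.setdefault(word, [])
            if url not in urls:          # list each page once per word
                urls.append(url)
    return index


def lookup(index, keyword):
    """Return the URLs where keyword appears, or an empty list."""
    return index.get(keyword.lower(), [])


# Hypothetical corpus: URL -> page text (placeholder data).
pages = {
    "http://example.com/a": "web crawler basics",
    "http://example.com/b": "search engine basics",
}
index = build_index(pages)
print(lookup(index, "basics"))
print(lookup(index, "crawler"))
```

A Python dictionary is used here simply because it is the most direct way to express the word-to-pages mapping.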

And in unit five we'll figure out a way to do that that scales if we have a large corpus.
And then in unit six, well, we don't just want to find a list,
we want to find the best one.
So we'll figure out how to rank all the pages where that keyword appears.
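
The transcript does not say how pages will be ranked. Purely as an illustration of what "ranking" could mean, the sketch below orders matching pages by how many other pages link to them. That scoring rule is an assumption made for this example, and `rank_by_inbound_links` and the sample link graph are hypothetical.

```python
def rank_by_inbound_links(matching_urls, link_graph):
    """Order matching_urls so pages with more inbound links come first.

    link_graph maps each page URL to the list of URLs it links to.
    """
    inbound = {url: 0 for url in matching_urls}
    for source, targets in link_graph.items():
        for target in set(targets):      # count each linking page once
            if target in inbound and target != source:
                inbound[target] += 1
    # sorted() is stable, so ties keep their original order.
    return sorted(matching_urls, key=lambda url: inbound[url], reverse=True)


# Hypothetical link graph using short placeholder names instead of URLs.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
print(rank_by_inbound_links(["a", "b", "c"], graph))
```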

So we're getting a little ahead of ourselves now,
because all we're going to do for unit one
is to figure out how to extract a link from the page.
And the search engine that we'll build at the end of this
will be a functional search engine.
It will have the main components that a search engine like Google has.
It certainly won't be as powerful as Google;
we want to keep things simple.
We want to have a small amount of code to write.

And we should remember that our real goal
is not so much to build a search engine,
but to use the goal of building a search engine as a vehicle
for learning about computer science
and learning about programming,
so the things we learn by doing this
will allow us to solve lots and lots of other problems.
