Hello, world.
This is CS50 on Twitch.
My name is Colton Ogden.
And today, for the first time, we're joined by Emily Hong.
Welcome, Emily.
What are we talking about today?
Today we're gonna talk a little bit about Web scraping.
Nice.
What is web scraping?
Yeah, web scraping is basically, you have the whole World Wide Web, right?
And as a normal computer user, you can kind of, like, go on a website and read information.
There's so much information available on the Web, and web scraping is a way to kind of programmatically take that information off of those websites and do whatever you want with it, to sort of parse it programmatically, as opposed to an end user who might be going on Wikipedia or Google, interacting with things with the mouse, reading, you know, information.
So we can make a script to maybe do some analytics and sort of...
Yes.
So if you think of, like, maybe the non-tech way, or like the same concept as web scraping, maybe you're reading a website, say Wikipedia, and you're, like, copying down the words or information on paper.
And then you decide to do something with that paper. That obviously takes a really long time and isn't efficient.
Yeah.
And so with our computer science knowledge, we can do some, uh, neat little tricks to speed up that process.
Are we gonna be using Wikipedia at all, by chance?
Yes.
What?
Yeah.
Fun fact.
Also, Emily and I are from the same exact... well, not the exact same place, but the same county in California, which is amazing.
Yeah.
Let's bring it over to your computer here just so we can see that.
And hopefully... we did a little bit of cropping in advance. Actually, I didn't realize that might screw up the slide.
For example, we don't see that the image way if we're gonna stay.
Yeah, that's fine.
If you just want to keep everything off of that.
But there is an image credit there.
Although I did think that Emily originally had created the image.
Or so.
Which is it?
It's a very beautiful representation of what?
Beautiful Soup.
Yeah, exactly.
So what is Beautiful Soup?
Actually, because I'm curious.
Yeah.
So we talked a little bit about generally what web scraping is.
That's kind of, like, a general computer science concept.
But then there are different ways to, like, carry out that idea, and so today, our little, like, intro demonstration is gonna be doing web scraping, particularly with Python.
Okay. And, ah, Beautiful Soup is one particular library that someone has so graciously made and, like, allowed us to use, and they called it Beautiful Soup kind of because they visualized all of the Internet as kind of this huge unorganized soup.
An alphabet soup?
Exactly, like an alphabet soup, like on the screen.
And, um, Beautiful Soup will help us make it beautiful.
Among others, is this sort of, like, the most popular, um, I guess, web scraping library in Python?
Um, I'm actually not 100% sure.
I do know that it's one of the easier ones to use.
So maybe not as heavy duty as some other ones that you can find, but very, like, friendly too.
I'm very excited to dig in here a little bit.
It's making me a little hungry, too, but we have a few people in the chat.
Let me see.
Some folks are asking what is the difference between here and edX, "here" likely being a reference to CS50 on Twitch.
This is just a Twitch channel where we take in guests and we have from-scratch implementations of projects.
We talk about concepts.
edX is more like a traditional lecture-based setup, a lecture-based course platform. For example, CS50 is on edX.
But also the course that I taught, that Brian taught, that Jordan taught.
Those are all on edX.
That is a little bit more of a traditional setup.
This is a more collaborative teaching approach.
And ut dough 988 is asking, what's the topic of today?
Web scraping with Beautiful Soup.
Hence the area.
Hence the title slide.
There is Beautiful Soup.
J s, uh, Pop.
What is it?
J.
J s pop?
Is that a thing?
Sure does.
I'm not entirely sure what you mean by that, ex polar, uh, dreams.
Oh, maybe, like... oh, jsoup.
Um, uh, jsoup is like a Java...
Oh, it's a Java HTML parser.
Probably similar, I'd imagine.
It's very, so...
I'm not sure, I'm not familiar with it, but based on at least what I can read here, it says open source Java HTML parser,
with DOM, CSS, and jquery-like methods for easy data extraction.
Sounds to me like probably the same spirit. I have to imagine it's probably, like, a cross-language version of this idea.
Yeah.
Yeah, probably.
I'm not sure.
Yeah, um, good questions all around.
And then Shin was saying Hello.
Emily and Colton.
Awesome.
Um, why don't we dive into a little bit of alphabet...
Beautiful Soup.
Yeah.
Yeah, so.
We talked a little bit about this, but there... Sorry, we have a little bit of...
Our chat window's a little bit large.
Don't want to cut off your slides.
I'm gonna go ahead and shrink this down just a little bit, like that.
Boom.
That should be okay.
Um, generically, web scraping is the idea of using computer software, in other words code, to extract information from websites.
And that's any website that you, as a user, could actually normally access. Um, and so the idea with Beautiful Soup...
And, um, no more chat.
We're gonna hide that for a little bit and bring it back, apologies. But, um, with Beautiful Soup, one idea is making maybe unorganized websites more organized.
So, like, I know when in school, sometimes I'll look up, like, a more niche academic concept and it'll be some, like, plain HTML page with just some blocks of text.
Say you wanted to take that content and maybe, like, spruce up the look of it, make it have, like, fancy fonts or whatever, you could do that.
So that's the idea of taking unstructured data, or maybe, like, poorly structured data, and making it, like, more user friendly.
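As a rough illustration of turning a plain, loosely structured page into organized data, here is a minimal sketch. It deliberately uses Python's standard-library `html.parser` rather than Beautiful Soup so that it runs without installing anything; the page content, the `PageExtractor` class, and the `extract_page` helper are all made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical plain HTML page, standing in for a "niche academic" site.
RAW_HTML = """
<html><head><title>Notes on Graph Theory</title></head>
<body>
<h1>Graph Theory</h1>
<p>A graph is a set of vertices and edges.</p>
<p>A tree is a connected graph with no cycles.</p>
</body></html>
"""


class PageExtractor(HTMLParser):
    """Collects the title, top-level headings, and paragraphs of a page."""

    # Maps the tags we care about to keys in the result dictionary.
    WANTED = {"title": "title", "h1": "headings", "p": "paragraphs"}

    def __init__(self):
        super().__init__()
        self.data = {"title": "", "headings": [], "paragraphs": []}
        self._current = None  # the wanted tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, text):
        text = text.strip()
        if not text or self._current is None:
            return
        key = self.WANTED[self._current]
        if key == "title":
            self.data["title"] = text
        else:
            self.data[key].append(text)


def extract_page(html):
    """Parse an HTML string and return its content as a dictionary."""
    parser = PageExtractor()
    parser.feed(html)
    return parser.data


page = extract_page(RAW_HTML)
print(page["title"])
for paragraph in page["paragraphs"]:
    print("-", paragraph)
```

Once the content is in a dictionary like this, it can be re-rendered with whatever styling is wanted. A library such as Beautiful Soup wraps this kind of tag-walking in a friendlier interface.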
The other idea is, there's just a lot of information on websites, on the Internet, and that's constantly updating.
And so you can grab that information through some kind of web scraping, through the HTML page, and put it in some kind of database or spreadsheet that you can then manipulate yourself.
Sure, almost taking other people's database information and putting it into our own database.
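The "grab it from the HTML page and put it in a spreadsheet" idea could look something like the sketch below. It is only an illustration under stated assumptions: the table contents are invented, the `TableExtractor` class and `table_to_csv` helper are hypothetical names, and it uses the standard-library `html.parser` and `csv` modules instead of Beautiful Soup.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical HTML snippet; a real page would be downloaded first.
TABLE_HTML = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Bread</td><td>3.50</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collects each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, if any
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, text):
        if self._in_cell:
            self._row[-1] += text.strip()


def table_to_csv(html):
    """Return the scraped rows and the same rows rendered as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer).writerows(parser.rows)
    return parser.rows, buffer.getvalue()


rows, csv_text = table_to_csv(TABLE_HTML)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; swapping `io.StringIO()` for an open file would produce a `.csv` file that a spreadsheet program can open.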
Yeah, yeah, we'll talk a little bit about some things to consider before you scrape. But now, why web scrape?
I think everyone kind of knows there's so much information and content being produced on the Internet.
These are just some, like, I don't know, little statistics, uh, if you can imagine all the kind of stuff you can access on 1.8 billion websites.
I'm actually kinda surprised.
I almost expected there to be, like, maybe 10 times more than that.
Yeah, this might be a little outdated.
There's, uh Let's see if this website is still alive.
And also I wanna shout out, we have a couple followers.
Thio, Tasha Pen.
Well, cool game.
What?
12 and a meat Microsoft India, thanks for the follows.
Oh, wow.
Okay, so an actual live website that shows all this stuff. I'm, like, really not sure how... but it looks pretty professional.
Random number generator.
I don't know.
It seems it seems legit.
So let's see that. Wow.
146 billion emails sent just today. So we probably can't access that kind of Internet information because it's not publicly available, but that is happening on the Internet.
Wow, this is fascinating.
Okay?
Yeah.
So just some things to consider the next time you post something on Facebook.
I'm a little surprised at 800 Instagram posts per second, too. It makes you feel like that would probably be more, but let's say at least.
So, why web scrape? There's again just all kinds of stuff that happens on the Internet.
Something that's particularly interesting: you could say, like, "Oh, why don't I just, like, download the HTML page, get the information, and never have to deal with it again?"
Sure.
But there are also a lot of websites that constantly update, right? So an example of that that always comes to mind is, like, flight prices, although there are a lot of other services that centralize that information for you.
Or again, like, maybe a blog post?
Maybe there's someone that is posting regularly and you want to take their information to, I don't know, showcase somewhere else.
I don't know, that's something.
If you're a business and you want to keep track of your competitors, maybe that's a very specific task that no other company is doing.
It's kind of general: whatever information you need that would be too tedious for you to, like, do by hand.
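For pages that keep changing, such as the flight-price example, a scraper is usually run repeatedly and the results compared. The sketch below assumes two already-downloaded snapshots of the same hypothetical page; the routes, prices, `data-route` attribute, and helper names are invented for illustration, and the parsing again relies on the standard-library `html.parser` rather than Beautiful Soup.

```python
from html.parser import HTMLParser

# Two hypothetical snapshots of the same page, fetched at different times.
SNAPSHOT_MORNING = (
    '<ul><li data-route="SFO-BOS">199</li>'
    '<li data-route="LAX-JFK">250</li></ul>'
)
SNAPSHOT_EVENING = (
    '<ul><li data-route="SFO-BOS">185</li>'
    '<li data-route="LAX-JFK">250</li></ul>'
)


class PriceExtractor(HTMLParser):
    """Reads <li data-route="...">price</li> items into a dictionary."""

    def __init__(self):
        super().__init__()
        self.prices = {}
        self._route = None  # route of the <li> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._route = dict(attrs).get("data-route")

    def handle_endtag(self, tag):
        if tag == "li":
            self._route = None

    def handle_data(self, text):
        if self._route and text.strip():
            self.prices[self._route] = int(text.strip())


def scrape_prices(html):
    """Return {route: price} for one snapshot of the page."""
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices


def changed_prices(old, new):
    """Return {route: (old_price, new_price)} for routes whose price differs."""
    return {
        route: (old[route], new[route])
        for route in new
        if route in old and old[route] != new[route]
    }


changes = changed_prices(
    scrape_prices(SNAPSHOT_MORNING), scrape_prices(SNAPSHOT_EVENING)
)
print(changes)
```

In practice the two snapshots would come from fetching the live page on a schedule; that part is left out here so the example stays runnable offline.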
Maybe I want to, like, have a dashboard page which shows, like, my five top websites.
Maybe they make their information publicly available.
So have, like, a parser that puts it all together in one page.
Maybe that'd be compelling.
Yeah, you could have your own curated news feed.
Yeah, yeah, something like that.
Actually, TransitScreen?
I'm not sure if they use the same.
I think they use APIs for that.
But the same kind of idea?
Yeah.
Yeah.
Um, so, any more?
So web scraping is not the only way to collect information.
Speaking of, yeah.
Yeah.
Speaking of APIs, uh, generally, web scraping is kind of like, I would say, the hackiest way to take information, because it's not really, like, official.
You can really just do it based on the HTML page that is given to you anytime you load a website.
But before you go straight to web scraping, there are much easier ways to access information, the first of which is an API, which stands for application programming interface.
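To make the contrast between an API and scraping concrete, here is a small sketch. Both payloads are invented: `API_BODY` stands in for what a hypothetical API might return as JSON, and `HTML_BODY` for a page showing the same facts to human readers. The `WeatherExtractor` class and its CSS class names are assumptions for illustration only.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response body: already structured, so one call parses it.
API_BODY = '{"city": "Cambridge", "temperature_f": 68}'

# Hypothetical HTML page showing the same information for human readers.
HTML_BODY = '<div><h2 class="city">Cambridge</h2><span class="temp">68</span></div>'


def from_api(body):
    """With an API, the structure is handed to us directly."""
    return json.loads(body)


class WeatherExtractor(HTMLParser):
    """With scraping, we have to know where each value sits in the markup."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self._field = None  # which result key the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class == "city":
            self._field = "city"
        elif css_class == "temp":
            self._field = "temperature_f"

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, text):
        if self._field == "city":
            self.result["city"] = text.strip()
        elif self._field == "temperature_f":
            self.result["temperature_f"] = int(text)


def from_html(body):
    parser = WeatherExtractor()
    parser.feed(body)
    return parser.result


print(from_api(API_BODY) == from_html(HTML_BODY))
```

The scraping path needs noticeably more code and breaks if the site changes its markup, which is one reason an official API, when one exists, tends to be the easier route.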
I don't know if you've done a stream on APIs yet.
I don't think so.
I don't think we've touched on it.
I think in a couple streams we might have briefly mentioned what they are, but never really got into too much detail.
Yeah.
So, just generally speaking, usually larger companies or organizations will actually create their own libraries and sets of functions for users to use to