What's going on, everybody? Welcome to part two of our chatbot with Python and TensorFlow tutorial series. In this tutorial,
what we're going to be doing is beginning to build our
database, which is going to store our parent comments and their paired best reply comments.
So the reason why we really want to do something like this is, first of all,
a lot of these files are way, way too big for us to just read into RAM and then create the training files
from,
even just individual months.
But chances are, if you want to create a really
nice chatbot, you're going to want to work on many months of data,
so maybe billions of comments, which you do have at your disposal. When that's the case, we
probably want to have some sort of database. Now, for the purposes here, just to keep things simple,
we won't deal with MySQL servers or any other
big database server type of thing. I'm just going to use SQLite.
It'll help us get the job done, but feel free to use pretty much whatever you want.
Before we get too deep,
I just want to address what all our data should look like.
So I want to bring this up here.
If you downloaded the Reddit data and extracted it,
it should look something like this: you should have years,
basically 2007 to 2015. Again, BigQuery does have data all the way up to
recent, like last month, whatever that would be.
Also, if you do go that route, just know that your format is going to be totally different from ours, so you'll have to adapt
what you're doing
if you want to go that way. But possibly, if someone does share some way to officially pull from BigQuery,
I'll probably append that to the end of this tutorial series, because there are also multiple models that I'm working on with chatbots,
so I'm pretty confident that there will be some follow-up videos.
Anyway, enough of that. If you click on any of these, normally you'll just have all these compressed files,
but if you extract them, it looks like this, and then these files contain just a bunch of samples. Each sample looks
like
this.
All right, so this is just one sample. As you can see, there's a bunch of data here. It's obviously JSON,
so it's keys and values, and first of all there's a lot of wasted information here.
Just putting it into a database will severely
decrease the size of this data, right? You just have one column name,
and then all the data, so this much data becomes just this. That makes more sense.
Also, we don't need all of these. We don't need link ID, for example. We don't really need name.
We might be interested in created. We're probably not interested in when it was received. Author flair
we probably don't care about. You might,
we probably don't, and so on. Obviously we do care about things like score and ups and downs, and maybe whether they were gilded
or not. We might care about those, especially if we're trying to make some sort of very specific bot.
Same thing with the subreddit, if you wanted to create some sort of really specific
type of chatbot. For now, I want it to be fairly general.
But I care about at least score. One thing to note, though:
I'm fairly confident score is miscalculated. Downs are always zero,
so score is always off, if I recall right. I can't remember if that's truly the case, but anyway,
it's really quick to test. All I know is to take it with a grain of salt, because it's improper.
I'm pretty sure it's the case that downs are always zero,
but I can't remember whether you can take ups and score,
so that ups minus the actual score would equal the downs, or whether score also always equals ups.
But anyway, just know there's some sort of flaw there.
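As a rough sketch of the two ideas above (keeping only a few fields, and quickly testing the score question), something like this could work. The sample line is made up for illustration, and the key names (`parent_id`, `id`, `body`, `subreddit`, `created_utc`, `score`, `ups`, `downs`) are assumptions about the dump format, so check them against your own files.

```python
import json

# Made-up sample line for illustration only; real rows carry many more keys.
sample_line = json.dumps({
    "parent_id": "t1_abc123",
    "id": "def456",
    "link_id": "t3_xyz789",
    "body": "Sounds good to me.",
    "subreddit": "AskReddit",
    "created_utc": "1430438400",
    "score": 4,
    "ups": 4,
    "downs": 0,
})

# Assumed key names for the handful of fields worth keeping.
KEEP = ("parent_id", "id", "body", "subreddit", "created_utc", "score")

row = json.loads(sample_line)
kept = {key: row[key] for key in KEEP}

# Quick test of the score question: does ups - score match the reported downs?
implied_downs = row["ups"] - row["score"]

print(kept)
print("implied downs:", implied_downs, "reported downs:", row["downs"])
```

Running the same comparison over a few thousand real rows would show whether downs is always zero and whether score always equals ups.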
Anyway,
let's continue.
Working in Python now, what I'm going to go ahead and do is
start building out the code that we're going to be using here.
So let's go ahead and import sqlite3 for our database, import json,
and then from datetime import datetime.
sqlite3 is obviously for the database, json is to read that format, and datetime we're really just going to use to
output some logging information, just so we know where we are.
As you might imagine, going through these huge files can take a lot of time, so sometimes I just like to put in simple
outputs that tell us where we are at the time.
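The imports described here would look roughly like this. The `log` helper is a hypothetical name, added only to show the kind of timestamped progress output being talked about.

```python
import sqlite3  # the database
import json  # parses each comment line
from datetime import datetime  # timestamps for progress output


def log(message):
    # Hypothetical helper: prefix a message with the current time.
    line = "{}: {}".format(datetime.now(), message)
    print(line)
    return line


log("starting")
```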
Moving along, we're going to say timeframe, and I'm going to use 2015-05. So remember the format of the files
when you download them; we're going to be grabbing this one. They all have RC.
I don't know what RC stands for. It's probably not release candidate;
Reddit comments, maybe.
Anyway, they all have that same format. This is May of 2015, so
this is the one that we want.
Alternatively, you could take a list of timeframes and then iterate through them, building the database the same way I'm about to build the database.
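A minimal sketch of both options, one month or a list of months. The `RC_` prefix with an underscore and the variable names `data_file`, `timeframes`, and `data_files` are assumptions here; match them to whatever your extracted files are actually called.

```python
# Single month, in the same year-month form as the file names.
timeframe = "2015-05"
data_file = "RC_{}".format(timeframe)  # assumed file name pattern

# Alternative: several months, processed one after another.
timeframes = ["2015-03", "2015-04", "2015-05"]  # example months only
data_files = ["RC_{}".format(t) for t in timeframes]

print(data_file)
print(data_files)
```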
So
once we've done that, I'm also going to have sql_transaction.
We're going to have this because, especially when you know you're going to be working with
millions of rows,
you don't want to insert rows one by one if you don't have to. That's really inefficient.
Instead, you want to build up a big transaction and then do it all at once, and it will be
just gobs faster. So that's what we're going to use that for.
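One way the batching idea could look, as a sketch rather than the code this series ends up with: statements pile up in the `sql_transaction` list and get committed together once the list is big enough. The helper names `queue_statement` and `flush`, the batch size of 1000, and the throwaway `demo` table are all assumptions for illustration.

```python
import sqlite3

sql_transaction = []  # pending (statement, parameters) pairs


def flush(connection):
    # Run every queued statement, then commit once for the whole batch.
    with connection:  # commits on success, rolls back on error
        for sql, params in sql_transaction:
            connection.execute(sql, params)
    sql_transaction.clear()


def queue_statement(connection, sql, params=(), batch_size=1000):
    # Queue a statement and flush when the batch is full.
    sql_transaction.append((sql, params))
    if len(sql_transaction) >= batch_size:
        flush(connection)


connection = sqlite3.connect(":memory:")  # in-memory database for the demo
connection.execute("CREATE TABLE demo (n INT)")

for n in range(2500):
    queue_statement(connection, "INSERT INTO demo (n) VALUES (?)", (n,))
flush(connection)  # commit whatever is left over

row_count = connection.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
```

The final `flush` matters: without it, the last partial batch would never be written.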
Now what we're going to do is build out the connection. That's going to be sqlite3.connect.
We're going to connect to '{}.db' (you don't need the .db; that would still work, though),
then .format,
and in goes timeframe. So this will just be a database called whatever the month and year is.
Again, alternatively, what you probably could do is
something like this: for example, what we're probably going to call our table is
parent_reply or something like that.
Instead, you could actually make the database parent_reply, and then each table name could be the month or something.
To me, I don't really think the month and year is all that valuable; there's no real reason why you would separate those out.
So I'm not really going to do that, but you could if you wanted.
Anyway,
then we're going to define our cursor, so that's just connection.cursor.
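Put together, the connection and cursor might look like this. Writing the file into a temporary directory is an assumption made only to keep the example tidy; connecting to `'{}.db'.format(timeframe)` directly puts the file in the working directory.

```python
import os
import sqlite3
import tempfile

timeframe = "2015-05"

# Assumption: a temp directory, so the example leaves no file behind in your project.
db_path = os.path.join(tempfile.mkdtemp(), "{}.db".format(timeframe))

connection = sqlite3.connect(db_path)  # database named after the month and year
c = connection.cursor()  # cursor used to run statements

print(os.path.basename(db_path))
```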
Okay, now we're going to go ahead and make create_table, so define create_table,
and then this is just going to be your typical c.execute:
CREATE TABLE IF NOT
EXISTS, and the table is going to be parent_reply,
and then we're going to have all of our columns. First of all, we're going to have parent_id,
and this is going to be TEXT type, and it'll also be our primary key.
Yeah, this is going to run way off the screen. I think you can get away with a triple quote here.
We're going to find out, just
so I don't have to run everything off the screen so much.
We'll see how it goes. So, parent_id. Now we're going to need the
comment_id, and again that's going to be TEXT,
and not a primary key,
but it should be unique.
And then we're going to have parent, and parent
will also be TEXT type, and then we're going to have the comment itself, so the reply.
comment will be TEXT type.
I'd also like to go ahead and log the subreddit, simply because I do see in the future
that's going to be a useful thing to be tracking. Different subreddits have different ways of talking,
and if you want a smarter-sounding chatbot, you could go with more scientific and engineering types of subreddits. If you wanted a more...
never mind, I'm not going to get myself in trouble. We'll stop at that. Anyway, you could get different types of
chatbots. The unix time, that's just going to be an integer, and then finally we'll go ahead and take the score,
which also should be an int.
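Here is a sketch of the table as described, using a triple-quoted string so the statement doesn't run off the screen. The `TEXT` type for subreddit and the column name `unix` are assumptions (the types and names of the other columns follow the description), and an in-memory database stands in for the real file.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # stand-in for the real database file
c = connection.cursor()


def create_table():
    # One row per parent comment and its paired reply.
    c.execute("""CREATE TABLE IF NOT EXISTS parent_reply(
        parent_id TEXT PRIMARY KEY,
        comment_id TEXT UNIQUE,
        parent TEXT,
        comment TEXT,
        subreddit TEXT,
        unix INT,
        score INT)""")


create_table()

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in c.execute("PRAGMA table_info(parent_reply)")]
print(columns)
```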
Okay, so with that, we've got our query, and of course I did just run it off the screen anyway.
But yeah, with that we should create the table if it doesn't exist. So then what we do at the end here is
start our main block: if __name__ equals '__main__',
let's call create_table, so this will just create the table if it doesn't exist.
The other thing to note is that if the database doesn't exist when you attempt to connect to it, it creates the database.
That's why we didn't have to create any database; that's just SQLite.
And then finally,
this will only create the table if it doesn't exist, so it's relatively cheap to run it.
So we'll go ahead and run that.
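To see both points in one place (connecting creates the database, and re-running the create statement is harmless), a sketch like this could help. The `main` function, the temporary directory, and the `TEXT` type for subreddit are assumptions for illustration; the tutorial's script simply calls create_table under the main guard.

```python
import os
import sqlite3
import tempfile

timeframe = "2015-05"
# Assumption: a temp directory, so nothing exists before we connect.
db_path = os.path.join(tempfile.mkdtemp(), "{}.db".format(timeframe))

connection = sqlite3.connect(db_path)  # creates the database if it is missing
c = connection.cursor()


def create_table():
    c.execute("""CREATE TABLE IF NOT EXISTS parent_reply(
        parent_id TEXT PRIMARY KEY, comment_id TEXT UNIQUE, parent TEXT,
        comment TEXT, subreddit TEXT, unix INT, score INT)""")


def main():
    create_table()
    create_table()  # second call does nothing because the table already exists
    connection.commit()
    query = "SELECT name FROM sqlite_master WHERE type='table'"
    return [row[0] for row in c.execute(query)]


if __name__ == "__main__":
    print(main())
```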
That's all for now. In the next tutorial, we'll actually start working through the data.
I'm not sure if we'll be able to insert much of the data,
because there's a lot of cleaning up of the data to do, but in the next tutorial
we'll at least start
buffering through the data, cleaning it up, and getting it ready to insert into the database.
Anyway, if you have any questions, comments, concerns, whatever, feel free to leave them below. Otherwise, I will see you in the next tutorial.