[MUSIC PLAYING]
ELENA NIEDDU: I'm excited to be here
and to talk to you about the In Codice Ratio project--
a project going on at Roma Tre University--
and about how TensorFlow helped us build a model
that is able to transcribe
ancient manuscripts in the Vatican Secret Archive.
So some introduction first.
This is our team.
On the right, you can see paleographers and archivists.
And on the left, there is us, a data science team.
And that's why I think the name we chose, In Codice Ratio,
reflects us very well.
Because it's a word play between the Italian
and the Latin meaning of the word "codice."
Now, in Latin, "in codice ratio" would mean knowledge
through manuscripts.
But the word "codice" in Italian also means software code,
so it's also knowledge through software,
which is exactly what we're planning to do.
And so you might ask yourselves, what
brings paleographers and archivists and data scientists
together?
Well, they have one problem in common.
They both want to discover knowledge from big data.
We are used to thinking of big data as something
that happens on the web.
But actually, historical archives
are an endless source of historical information,
of important information, of cultural information.
And just to give you a sense of how large this information can
be, let's just compare for a second
the size of the Vatican Secret Archive
to the height of Mount Everest.
Now, if you were to take each shelf of the Vatican Secret
Archive and stack them one on top of the other,
you would get a stack about 85 kilometers tall.
That is about 10 times the height of Mount Everest.
And the content spans the centuries and the continents.
For example, there are letters coming
from China, from Europe, from Africa,
and, of course, from the Americas.
So what is our goal?
Our goal is to build tools and technology that
enable historians, archivists, and scholars
of the humanities in general to perform large-scale analysis
on historical archives.
Because right now, the process, let me tell you,
is entirely manual.
You still have to go there, consult the documents manually,
and be able to read that very challenging handwriting.
And then, if you find information
that may be linked to another collection,
you have to make that connection all by yourself.
But first, we have to face the very first challenge.
When you are dealing with web content-- for example,
if you want to extract data from the internet-- well, that's
already text.
But when we're dealing with historical documents,
what we have are often scans.
And traditional OCR is fine for printed text.
But then you get to this.
This is medieval handwriting.
It's Latin, a language nobody uses anymore.
It's a style of handwriting nobody is able to write
or read anymore, for that matter.
It's heavily abbreviated.
And still, you want to get text out of it.
So you might want to train a machine learning model.
Of course, you do.
But then, we come to the second challenge.
And that is scalability in the data set collection process.
Now, the graph you see there is on a logarithmic scale.
And it might show you something that you already
know, which is known as Zipf's law. It tells you
that there are very few words that occur a huge number of times,
and then most of the words do not occur that often.
What does that mean for us?
It means that if we want to collect data, for example, at word
level, at vocabulary level, we
have to annotate thousands of lines of text, which
means hundreds of pages, OK?
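For reference, a minimal statement of Zipf's law; the exponent
value s ≈ 1 for natural-language corpora is a standard assumption,
not a figure from the talk:

    f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1

where f(r) is the frequency of the r-th most frequent word.
In practice, this is why word-level annotation needs so many pages:
most vocabulary items appear only a handful of times, so covering
them means transcribing a very large sample.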
And similar systems do exist.
They are state-of-the-art systems.
But most of the paleographers, even when
they know of these tools, get discouraged from using them
because they say, well, it's not cost-effective for me.
Because it can take up to months, or even years, of work
on these documents just to get a transcription system that they
will maybe use once or twice--
I don't know-- whereas they would like to do it faster.
So we asked ourselves, how can we scale on this task?
And so we decided to go by easier, simpler steps.
The very first thing that we did
was to collect data for single characters.
And this enabled us to involve
not paleographers but people with much less experience.
We built a custom crowdsourcing platform
that worked pretty much like CAPTCHA solving.
What you see there is an actual screen from the platform.
So the workers were presented with an image
and with a target.
And they had to match the target and select
the matching areas inside the image.
And in this way, we were able to involve more than 500
high school students.
And in about two weeks' work, we made
more than 40,000 annotations.
So now that we had the data, we wanted to build a model.
When I started working on the project,
I was pretty much a beginner in machine learning.
And so TensorFlow helped me put into practice
what I was studying in theory.
And it was a great help that I
could rely on tutorials and on the community
and, where everything else failed, even on the source code.
So we started experimenting, and we
decided to start small first.
We didn't want overkill.
We wanted the model to fit our data exactly.
So we started small and proceeded incrementally
and, in this phase, in a constant cycle
of tuning hyperparameters and tuning the model:
choosing the best optimizer, the best weight initializers,
the number of layers and the type of layers,
and then evaluating and training again.
Then we used Keras.
It was good for us because it allowed us to keep
the code small and readable.
And this is what we settled on.
It might look trivial.
But it allowed us to get up to a 94% average accuracy
on our test characters.
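For reference, here is a minimal sketch of the kind of small Keras
convolutional classifier being described; the input size, number of
character classes, and layer sizes are placeholder assumptions for
illustration, not the project's actual architecture.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    # Placeholder shapes: small grayscale character crops and a small
    # set of character classes. These numbers are assumptions.
    NUM_CLASSES = 22
    INPUT_SHAPE = (56, 56, 1)

    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=INPUT_SHAPE),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

A network of this size keeps the code small and readable, in line
with the start-small, tune-incrementally approach described above.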
So where does this fit in the whole scheme
of the transcription system?
It's there in the middle.
And it's actually, so far, the only [INAUDIBLE] part,
but we are planning to expand.
We will see how later.
And so we have the input image.
So far, we're relying on an oversegmentation
that is a bit old-school.
But it allows
us to feed single characters or combinations of characters
into the classifier, which then produces
different candidate transcriptions that are ranked according
to a Latin language model, which we also built from publicly
available sources.
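To make the ranking step concrete, here is a hedged sketch of how
candidate transcriptions from the character classifier could be
scored with a simple character bigram language model; the smoothing,
the toy corpus, and the candidate strings are illustrative
assumptions, not the project's actual implementation.

    import math
    from collections import Counter

    def train_char_bigram_lm(corpus, alpha=1.0):
        """Fit an add-alpha smoothed character bigram model on a corpus string."""
        bigrams = Counter(zip(corpus, corpus[1:]))
        unigrams = Counter(corpus)
        vocab_size = len(set(corpus))

        def log_prob(text):
            # Sum of smoothed log P(next character | previous character).
            score = 0.0
            for a, b in zip(text, text[1:]):
                score += math.log((bigrams[(a, b)] + alpha) /
                                  (unigrams[a] + alpha * vocab_size))
            return score

        return log_prob

    # Toy Latin corpus and hypothetical classifier outputs, for illustration.
    log_prob = train_char_bigram_lm("pater noster qui es in caelis " * 100)
    candidates = ["pater", "patcr", "qater"]
    print(max(candidates, key=log_prob))  # prints "pater"

The real system would score full candidate transcriptions of a word
or line, but the idea is the same: the language model decides which
classifier output looks most like plausible Latin.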
How well do we do?
We get about 65% exact transcriptions.
And we can get up to 80% if we allow for minor spelling errors
or if the segmentation is perfect.
If we had perfect segmentation, we could get up to 80%.
We will see that the segmentation can be more challenging.
OK.
So what are our plans for the future?
We're very excited about the integration
of TensorFlow and Keras.
I described the process as being fully Keras.
But what we actually found was that sometimes some features
were lagging behind, and sometimes we
wanted to get some of the features from Keras
and some from TensorFlow.
And so we found ourselves doing lots of--
I don't know if that's your experience, as well--
but we found ourselves doing lots of going back and forth
between TensorFlow and Keras.
And now, we get the best of both worlds,
so we're very excited about that.
And so how do we plan to expand our machine learning system?
First things first, we are trying U-Nets
for semantic segmentation.
These are the same networks that achieved very good results
on medical imaging.
And we're planning to use them to get rid
of this tricky, old-school computer vision segmentation.
That would also give us classification
at the same time,
because this is semantic segmentation we're talking about.
These are some preliminary examples
that work particularly well.
Of course, there is still work that we have to do.
And then, of course, since there could still be ambiguity,
we could do error correction and then transcription.
But I think this would be, in itself,
a significant improvement.
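As a reference point for this direction, here is a minimal sketch of
a U-Net-style encoder-decoder with skip connections in Keras; the
input size, filter counts, and number of output classes are
assumptions for illustration, not the architecture the project
actually uses.

    import tensorflow as tf
    from tensorflow.keras import layers, Model

    def conv_block(x, filters):
        # Two 3x3 convolutions, as in the original U-Net design.
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        return x

    def tiny_unet(input_shape=(128, 128, 1), num_classes=2):
        inputs = layers.Input(shape=input_shape)

        # Encoder (downsampling path).
        c1 = conv_block(inputs, 16)
        p1 = layers.MaxPooling2D()(c1)
        c2 = conv_block(p1, 32)
        p2 = layers.MaxPooling2D()(c2)

        # Bottleneck.
        b = conv_block(p2, 64)

        # Decoder (upsampling path) with skip connections.
        u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
        c3 = conv_block(layers.concatenate([u2, c2]), 32)
        u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
        c4 = conv_block(layers.concatenate([u1, c1]), 16)

        # Per-pixel class probabilities: segmentation and classification
        # come out of the same forward pass.
        outputs = layers.Conv2D(num_classes, 1, activation="softmax")(c4)
        return Model(inputs, outputs)

    model = tiny_unet()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")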
And another thing we're experimenting with
is enlarging our data set.
Because we don't want to stick to characters.
We want to evolve.
We want to move to word-level, and even
sentence-level, annotations.
But still, our focus is scalability
in the data set collection.
So we want to involve paleographers
as little as possible.
So, for example, these are inputs generated by a GAN.
But we are also planning on using,
for example, a variational autoencoder
so that we can grow our data set
with little human interaction--
as little as we can.
And in the end, this would bring us to actually use
sequence models that could take full advantage
of sentence-level context, for example, and could even
solve things that we couldn't solve
with single-character classification-- for example,
abbreviations.
In this kind of text, many words occur abbreviated,
just like when you text.
In a text message, you would say "me too"
and use the number 2, or write "4U."
And that's the same with this kind of manuscript.
And that's one of the applications you could have.
Also, we are planning to use sequence models
to get to a neural language model because, so far,
we have only experimented with statistical ones.
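Here is a hedged sketch of what a character-level neural language
model could look like in Keras; the vocabulary size, embedding size,
and LSTM width are placeholder assumptions, since the talk gives no
figures for this future work.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    VOCAB_SIZE = 30    # Latin alphabet plus a few special symbols (assumed)
    EMBED_DIM = 32
    LSTM_UNITS = 128

    # Given a window of characters, predict a distribution over the next one.
    lm = models.Sequential([
        layers.Embedding(VOCAB_SIZE, EMBED_DIM),
        layers.LSTM(LSTM_UNITS),
        layers.Dense(VOCAB_SIZE, activation="softmax"),
    ])
    lm.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

Trained on aligned abbreviated and expanded text, the same kind of
sequence model could, in principle, also learn to expand
abbreviations from sentence-level context rather than from single
characters.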
And one last thing before I let you go.
I mentioned the people in the team,
but there are so many people I would like to thank
who were not on that slide.
And first of all, Simone, who should have been here,
but he couldn't make it.
He was my machine learning Jedi Master.
And then Pi School of AI and Sébastien Bratières and Lukasz
Kaiser for their amazing mentoring.
And Marica Ascione, who is the high school teacher who
actually allowed us to involve the students who
took part in the platform.
And, of course, all of the graduate
and undergraduate students who worked with us and helped us
achieve what we have achieved and what we
plan to achieve in the future.
And of course, thank you for your attention.
[APPLAUSE]
[MUSIC PLAYING]