MIHAI MARUSEAC: I am Mihai.
I've been working recently on file systems for TensorFlow.
So I'm going to talk about what TensorFlow
has in terms of file system support and what
new changes are coming in.
First, TensorFlow file system can
be used in Python like this.
So we can create the directories,
we can create files, read or write to them,
and we can see what's in a directory.
And you might say, is this similar to Python?
If you compare it with Python,
it mostly looks the same;
it's just some names that changed.
There is one difference in mkdir.
In Python, the directory must not exist,
but there is a flag that changes this.
And now that I've said that, they still look similar.
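To make the comparison concrete, here is a sketch using only the
Python standard library; the TensorFlow names in the comments
(tf.io.gfile.makedirs, tf.io.gfile.GFile, tf.io.gfile.listdir) are the
closest public counterparts, not code taken from the talk's slide.

```python
import os
import tempfile

# Standard-library versions of the operations mentioned above.
root = tempfile.mkdtemp()
path = os.path.join(root, "sub")

os.makedirs(path)                 # fails if the directory exists...
os.makedirs(path, exist_ok=True)  # ...unless exist_ok=True is passed
                                  # (the flag the speaker mentions)

# Create a file, write to it, and read it back
# (compare tf.io.gfile.GFile).
with open(os.path.join(path, "hello.txt"), "w") as f:
    f.write("hello")
with open(os.path.join(path, "hello.txt")) as f:
    data = f.read()

# See what's in a directory (compare tf.io.gfile.listdir).
listing = os.listdir(path)
print(data)     # hello
print(listing)  # ['hello.txt']
```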
You might ask yourself, why does TensorFlow need its own file
system implementation?
And one of the main reasons comes from the file systems
that TensorFlow supports, via the format of the file
path parameter.
In TensorFlow, you can pass a normal path to a file,
or you can pass something that looks like a URL--
these are usually called URIs,
uniform resource identifiers.
And they are divided into three parts--
there's the scheme part, like HTTPS, GS, S3, and so on;
the host if it's on a remote host;
and then the path, which is like a normal file
path on that system.
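That three-way split can be sketched with the standard library; the
bucket and path in this example are made up for illustration.

```python
from urllib.parse import urlparse

# Split a URI into the three parts described above:
# scheme, host, and path.
uri = "gs://my-bucket/models/saved_model.pb"
parsed = urlparse(uri)

print(parsed.scheme)  # gs
print(parsed.netloc)  # my-bucket
print(parsed.path)    # /models/saved_model.pb
```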
For TensorFlow, we support multiple schemes.
And what we have is a mapping between schemes and a file
system implementation.
For example, for file, we have a LocalFileSystem.
For GS, we have a GoogleCloudFileSystem.
For viewfs, we have Hadoop, and so on.
Keep in mind this mapping because this
is the core aspect of why TensorFlow needs to have
its own file system implementation.
However, this is not the only reason.
We have a lot of use cases in TensorFlow code.
So beside the basic file I/O that I showed
in the first example, we also need to save or load models,
we need to checkpoint, we need to dump tensors
to file for debugging, we need to parse images
and other inputs, and tf.data datasets
also touch the file systems.
All of this can be implemented in classes,
but the best way to implement them
would be in a layered approach.
You have the base layer where you
have the mapping between the scheme and the file system
implementation.
And then at the top, you have implementation
for each of these use cases.
And in this talk, I'm going to present all of these layers.
But keep in mind that these layers are only
for the purposes of this presentation.
The code grew organically,
and it is not layered the same way in the codebase.
So let's start with the high level API, which
is mostly what the users see.
It's what they want to do.
So when the user wants to load a saved model,
the user wants to open a file that
contains the saved model, read from the file, load
the tensors, and everything.
But the user would just call a single function.
That's why I'm calling this high level API.
And in this case, I am going to present some examples of this.
For example, API generation.
Whenever you are building TensorFlow,
while building the pip package,
we are creating several protocol buffer
files that contain the API symbols that TensorFlow
exports.
In order to create those files, we
are basically calling this function CreateApiDefs,
which needs to dump everything into those files.
Another example of high level API
is DebugFileIO where you can dump tensors into a directory
and then you can later review them
and debug your application.
And there are a few more, like loading saved models.
You see that loading saved model needs an export
directory to read from.
Or many others: checkpointing, checkpointing
of sliced variables across distributed replicas,
tf.data datasets, and so on.
The question is not how many APIs
are available at the high level, but what
do they have in common?
In principle, we need to write to files, read from files,
get statistics about them, create and remove directories,
and so on.
But we also need to support compression.
We also need to support buffered I/O,
like read only a part of the file,
and then later read another part of the file instead
of fitting everything in memory, and so on.
Most of these implementations come
from the next layer, which I am going
to call the convenience API.
It's similar to a middleware layer in a web application.
So it's mostly transforming from the bytes that
are on the disk to the information
that the user would use in the high level API.
Basically, 99% of the use cases are just
calling these six functions--
reading or writing a string to a file,
or reading or writing a proto, either text or binary.
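As a rough illustration of two of those helpers, here is a
plain-Python sketch; the snake_case names mirror TensorFlow's C++
ReadFileToString and WriteStringToFile, but the bodies are stand-ins,
not the real API.

```python
import os
import tempfile

def write_string_to_file(fname: str, data: str) -> None:
    """Write the whole string to the file at fname."""
    with open(fname, "w") as f:
        f.write(data)

def read_file_to_string(fname: str) -> str:
    """Read the whole file at fname into a string."""
    with open(fname) as f:
        return f.read()

path = os.path.join(tempfile.mkdtemp(), "note.txt")
write_string_to_file(path, "checkpoint-42")
contents = read_file_to_string(path)
print(contents)  # checkpoint-42
```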
However, there are other use cases,
like I mentioned before, for streaming
and for buffered and compressed I/O where we have this input
stream interface class that implements streaming
or compression.
So we have SnappyInputBuffer and ZlibInputBuffer
reading from compressed data, MemoryInputStream
and BufferedInputStream reading in a streamed fashion,
and so on.
The InputBuffer class here allows
you to read just a single int from a file,
and then you can read another int, and then a string,
and so on.
Like, you read chunks of data.
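That chunked-read pattern can be sketched in Python with struct; this
is an illustration of the idea, not the InputBuffer API itself.

```python
import io
import struct

# Build a small binary stream: a 32-bit int, then a
# length-prefixed string.
buf = io.BytesIO()
buf.write(struct.pack("<i", 7))            # a single int
payload = b"tensor"
buf.write(struct.pack("<i", len(payload))) # string length
buf.write(payload)                         # string bytes
buf.seek(0)

# Read it back chunk by chunk, InputBuffer-style.
(count,) = struct.unpack("<i", buf.read(4))
(length,) = struct.unpack("<i", buf.read(4))
name = buf.read(length).decode()
print(count, name)  # 7 tensor
```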
All of these APIs at the convenience level
are all implemented in the same way in the next layer, which
is the low level API.
And that's the one that we are mostly interested in.
Basically, this level is the one that needs to support multiple
platforms, support all the URI schemes that we currently
support, and support directory I/O (so far,
I never talked about directory operations in the higher-level
APIs).
It also has to support users getting into this level
and creating their own implementations in case
they need something that is not implemented so far.
If you remember from the beginning,
we had this file system registry where
you had the scheme of the URI and a mapping
between that scheme and the file system implementation.
This is implemented as a FileSystemRegistry
class, which is basically a dictionary.
You can add a value to the dictionary,
you can see what value is at a specific key,
or you can see all the keys that are in that dictionary.
That's all this class does.
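A minimal sketch of that dictionary-like registry, with the three
operations just mentioned; the real class is C++, so the Python
method names here are illustrative stand-ins.

```python
class FileSystemRegistry:
    """Maps a URI scheme to a file system implementation."""

    def __init__(self):
        self._registry = {}

    def register(self, scheme, filesystem):
        # Add a value to the dictionary.
        if scheme in self._registry:
            raise ValueError(f"scheme {scheme!r} already registered")
        self._registry[scheme] = filesystem

    def lookup(self, scheme):
        # See what value is at a specific key.
        return self._registry.get(scheme)

    def get_registered_schemes(self):
        # See all the keys in the dictionary.
        return sorted(self._registry)

registry = FileSystemRegistry()
registry.register("file", "LocalFileSystem")
registry.register("gs", "GoogleCloudFileSystem")
print(registry.lookup("gs"))              # GoogleCloudFileSystem
print(registry.get_registered_schemes())  # ['file', 'gs']
```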
And it is used in the next class,
the environment class--
Env-- which provides the cross-platform support.
So we have a WindowsEnv or a PosixEnv:
WindowsEnv when you are compiling
and using TensorFlow on Windows,
PosixEnv when you are using it on the other platforms.
And then there are some other Env classes for testing,
but let's ignore them for the rest of the talk.
The purpose of the Env class is to provide every low level
API that the user needs.
So, for example, we have the registration
functionality: get the file system for a file,
get all the schemes that are supported,
register a file system.
And of particular note is the static member Default,
which allows a developer to write Env::Default()
anywhere in the C++ code and get access to this class.
Basically, it's like a singleton pattern.
So if you need to register a file system somewhere
in your function and it's not registered,
you just call Env::Default() and register the file system there.
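The Env::Default() idea can be sketched as a singleton in Python; the
class and method names below are illustrative stand-ins for the C++
ones.

```python
class Env:
    """A process-wide environment, reachable from anywhere."""

    _default = None

    def __init__(self):
        self.registry = {}

    @classmethod
    def default(cls):
        # Lazily create the single shared instance.
        if cls._default is None:
            cls._default = cls()
        return cls._default

    def register_file_system(self, scheme, filesystem):
        self.registry[scheme] = filesystem

# Anywhere in the code, grab the same instance and register.
Env.default().register_file_system("ram", "RamFileSystem")
print(Env.default().registry["ram"])   # RamFileSystem
print(Env.default() is Env.default())  # True
```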
Other functionalities in Env are the actual file system
operations, such as creating files.
You see there are three types of files:
random access files, writable files,
and read-only memory files.
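A hedged sketch of those three file abstractions; the corresponding
C++ classes are RandomAccessFile, WritableFile, and
ReadOnlyMemoryRegion, and the Python bodies below are stand-ins, not
the real interfaces.

```python
class RandomAccessFile:
    """Read any byte range at any offset, in any order."""
    def __init__(self, data: bytes):
        self._data = data

    def read(self, offset: int, n: int) -> bytes:
        return self._data[offset:offset + n]

class WritableFile:
    """Append-only writes."""
    def __init__(self):
        self._parts = []

    def append(self, data: bytes) -> None:
        self._parts.append(data)

    def contents(self) -> bytes:
        return b"".join(self._parts)

class ReadOnlyMemoryRegion:
    """The whole file held read-only in memory."""
    def __init__(self, data: bytes):
        self._data = data

    def data(self) -> bytes:
        return self._data

    def length(self) -> int:
        return len(self._data)

f = RandomAccessFile(b"hello world")
print(f.read(6, 5))  # b'world'
w = WritableFile()
w.append(b"abc")
w.append(b"def")
print(w.contents())  # b'abcdef'
```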