  • MIHAI MARUSEAC: I am Mihai.

  • I've been working recently on file systems for TensorFlow.

  • So I'm going to talk about what TensorFlow

  • has in file system support and what

  • are the new changes coming in.

  • First, the TensorFlow file system API can

  • be used in Python like this.

  • So we can create the directories,

  • we can create files, read or write to them,

  • and we can see what's in a directory.

  • And you might ask, is this similar to Python?

  • And if you compare it with Python,

  • it mostly looks the same.

  • So it's just some names changed.

  • There is one difference in mkdir.

  • In Python, the directory must not exist,

  • but there is a flag that changes this.
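
  • As a minimal sketch of the plain-Python side of that comparison (standard library only, not TensorFlow code; the directory name is made up for illustration), `Path.mkdir` fails on an existing directory unless the `exist_ok` flag is passed:

```python
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "models"  # hypothetical directory name

    target.mkdir()                  # first call succeeds: the directory is new
    try:
        target.mkdir()              # second call fails: the directory exists
        raised = False
    except FileExistsError:
        raised = True

    target.mkdir(exist_ok=True)     # with the flag, an existing directory is accepted
    still_there = target.is_dir()
```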

  • Now that I've said that they look similar,

  • you might ask yourself, why does TensorFlow need its own file

  • system implementation?

  • One of the main reasons comes from the file systems

  • that TensorFlow supports, that is, from the formats of the file

  • parameter.

  • In TensorFlow, you can pass a normal path to a file,

  • or you can pass something that looks like a URL,

  • usually called a URI--

  • uniform resource identifiers.

  • And they are divided into three parts--

  • there's the scheme part, like HTTPS, GS, S3, and so on;

  • the host if it's on a remote host;

  • and then the path, which is like a normal file

  • path on that segment.
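
  • The three parts can be sketched with Python's standard `urllib.parse.urlsplit`. This is only an illustration of the split, not TensorFlow's own parser, and the bucket and file names are invented:

```python
from urllib.parse import urlsplit

def split_uri(name):
    """Return (scheme, host, path) for a plain path or a URI."""
    parts = urlsplit(name)
    return parts.scheme, parts.netloc, parts.path

# Hypothetical names, chosen only to show the three parts.
remote = split_uri("s3://example-bucket/models/saved_model.pb")
local = split_uri("/tmp/models/saved_model.pb")
```

  • A plain path comes back with an empty scheme and an empty host, which is what lets a lookup fall back to the local file system.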

  • For TensorFlow, we support multiple schemes.

  • And what we have is a mapping between schemes and a file

  • system implementation.

  • For example, for file, we have a LocalFileSystem.

  • For GS, we have a GoogleCloudFileSystem.

  • For viewfs, Hadoop, and so on.

  • Keep in mind this mapping because this

  • is the core aspect for why TensorFlow needs to have

  • its own file system implementation.

  • However, this is not the only reason.

  • We have a lot of use cases in TensorFlow code.

  • So beside the basic file I/O that I showed

  • in the first example, we also need to save or load models,

  • we need to checkpoint, we need to dump tensors

  • to file for debugging, we need to parse images

  • and other inputs, and tf.data datasets

  • are also touching the file systems.

  • All of this can be implemented in classes,

  • but the best way to implement them

  • would be in a layered approach.

  • You have the base layer where you

  • have the mapping between the scheme and the file system

  • implementation.

  • And then at the top, you have implementation

  • for each of these use cases.

  • And in this talk, I'm going to present all of these layers.

  • But keep in mind that these layers are only

  • for the purposes of this presentation.

  • They grew organically.

  • They are not layered the same way in code.

  • So let's start with the high level API, which

  • is mostly what the users see.

  • It's what they want to do.

  • So when the user wants to load a saved model,

  • the user wants to open a file that

  • contains the saved model, read from the file, load

  • inputs, tensor, and everything.

  • The user would just call a single function.

  • That's why I'm calling this high level API.

  • And in this case, I am going to present some examples of this.

  • For example, API generation.

  • Whenever you are building TensorFlow,

  • while building the pip package,

  • we are creating several protocol buffer

  • files that contain the API symbols that TensorFlow

  • exports.

  • In order to create those files, we

  • are basically calling this function CreateApiDefs,

  • which needs to dump everything into those files.

  • Another example of high level API

  • is DebugFileIO where you can dump tensors into a directory

  • and then you can later review them

  • and debug your application.

  • And there are a few more, like loading saved models.

  • You see that loading saved model needs an export

  • directory to read from.

  • Or many others-- checkpointing, checkpointing

  • of sliced variables across distributed replicas,

  • tf.data datasets, and so on.

  • The question is not how many APIs

  • are available at the high level, but what

  • do they have in common?

  • In principle, we need to write to files, read from files,

  • get statistics about them, create and remove directories,

  • and so on.

  • But we also need to support compression.

  • We also need to support buffered I/O,

  • like read only a part of the file,

  • and then later read another part of the file instead

  • of fitting everything in memory, and so on.

  • Most of these implementations come

  • from the next layer, which I am going

  • to call the convenience API.

  • It's similar to the middleware layer in a web application.

  • So it's mostly transforming from the bytes that

  • are on the disk to the information

  • that the user would use in the high level API.

  • Basically, 99% of the use cases are just

  • calling these six functions--

  • reading or writing a string to a file,

  • or reading or writing a proto-- either text or binary.

  • However, there are other use cases,

  • like I mentioned before, for streaming

  • and for buffered and compressed I/O, where we have this

  • InputStreamInterface class that implements streaming

  • or compression.

  • So the SnappyInputBuffer and ZlibInputBuffer

  • read from compressed data, the MemoryInputStream

  • and BufferedInputStream read in a streamed fashion,

  • and so on.
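
  • To make the idea of streamed, compressed reads concrete, here is a minimal sketch using Python's standard `zlib` module. It is an analogy to what such input buffers do, not TensorFlow's implementation, and the function name and chunk size are assumptions:

```python
import io
import zlib

def read_decompressed_chunks(stream, chunk_size=16):
    """Yield decompressed pieces while reading the compressed stream in small chunks."""
    decompressor = zlib.decompressobj()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        piece = decompressor.decompress(chunk)
        if piece:
            yield piece
    # Emit whatever the decompressor is still holding back.
    tail = decompressor.flush()
    if tail:
        yield tail

original = b"tensor data " * 100
compressed = io.BytesIO(zlib.compress(original))
restored = b"".join(read_decompressed_chunks(compressed))
```

  • The caller never holds the whole compressed file in memory at once, only one chunk and its decompressed output.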

  • The InputBuffer class here allows

  • you to read just a single int from a file,

  • and then you can read another int, and then a string,

  • and so on.

  • Like, you read chunks of data.
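
  • Reading one typed value after another can be sketched in Python as below. The `TypedReader` class and the length-prefixed string layout are hypothetical, chosen only to illustrate chunked reads, and do not mirror TensorFlow's on-disk formats:

```python
import io
import struct

class TypedReader:
    """Hypothetical helper: reads typed values one after another from a byte stream."""

    def __init__(self, raw):
        self._stream = io.BufferedReader(raw)

    def read_int32(self):
        # Four bytes, little-endian, signed.
        (value,) = struct.unpack("<i", self._stream.read(4))
        return value

    def read_string(self):
        # Assumed layout: an int32 length followed by that many UTF-8 bytes.
        length = self.read_int32()
        return self._stream.read(length).decode("utf-8")

payload = struct.pack("<i", 7) + struct.pack("<i", 42) + struct.pack("<i", 5) + b"hello"
reader = TypedReader(io.BytesIO(payload))
first = reader.read_int32()
second = reader.read_int32()
text = reader.read_string()
```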

  • All of these APIs at the convenience level

  • are implemented in the same way in the next layer, which

  • is the low level API.

  • And that's the one that we are mostly interested in.

  • Basically, this level is the one that needs to support multiple

  • platforms, support all the URI schemes that we currently

  • support, support directory I/O-- so far,

  • I never talked about directory operations in the higher level

  • APIs--

  • and also support users who want to get into this level

  • and create their own implementations in case

  • they need something that is not implemented so far.

  • If you remember from the beginning,

  • we had this file system registry where

  • you had the scheme of the URI and a mapping

  • between that scheme and the file system implementation.

  • This is implemented as a FileSystemRegistry

  • class, which is basically a dictionary.

  • You can add a value to the dictionary,

  • you can see what value is at a specific key,

  • or you can see all the keys that are in that dictionary.

  • That's all this class does.
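
  • The three dictionary operations can be sketched in Python as follows. The method names are assumptions made for this illustration, and plain strings stand in for real file system objects:

```python
class FileSystemRegistry:
    """Maps a URI scheme to a file system implementation."""

    def __init__(self):
        self._by_scheme = {}

    def register(self, scheme, file_system):
        # Add a value to the dictionary; refuse to silently replace one.
        if scheme in self._by_scheme:
            raise ValueError(f"scheme {scheme!r} is already registered")
        self._by_scheme[scheme] = file_system

    def lookup(self, scheme):
        # See what value is at a specific key (None if the scheme is unknown).
        return self._by_scheme.get(scheme)

    def schemes(self):
        # See all the keys that are in the dictionary.
        return sorted(self._by_scheme)

registry = FileSystemRegistry()
registry.register("file", "LocalFileSystem")        # stand-in values
registry.register("gs", "GoogleCloudFileSystem")
```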

  • And it is used in the next class,

  • in the environment class--

  • Env-- which provides the cross-platform support.

  • So we have a WindowsEnv and a PosixEnv:

  • WindowsEnv when you are compiling and

  • using TensorFlow on Windows,

  • and PosixEnv when you are using it on the other platforms.

  • And then there are some other Env classes for testing,

  • but let's ignore them for the rest of the talk.

  • The purpose of the Env class is to provide every low level

  • API that the user needs.

  • So, for example, we have the registration-- in this case,

  • getting the file system for a file,

  • getting all the schemes that are supported,

  • and registering a file system.

  • And of particular note is the static member Default,

  • which allows a developer to write anywhere in the C++ code

  • Env::Default() and get access to this class.

  • Basically, it's like a singleton pattern.

  • So if you need to register a file system somewhere

  • in your function and it's not registered,

  • you just call Env::Default()->RegisterFileSystem.
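
  • The same access pattern can be sketched in Python. This `Env` class is a hypothetical stand-in written for illustration, it is not thread-safe, and it is not a binding to the C++ class:

```python
class Env:
    """Hypothetical stand-in for a process-wide environment object."""

    _default = None

    def __init__(self):
        self._file_systems = {}

    @classmethod
    def default(cls):
        # Create the shared instance on first use, then keep returning it.
        if cls._default is None:
            cls._default = cls()
        return cls._default

    def register_file_system(self, scheme, file_system):
        self._file_systems[scheme] = file_system

    def registered_schemes(self):
        return sorted(self._file_systems)

def some_function():
    # Any code can reach the shared instance without it being passed in.
    Env.default().register_file_system("demo", object())

some_function()
same_instance = Env.default() is Env.default()
schemes = Env.default().registered_schemes()
```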

  • Other functionality in Env covers the actual file system

  • implementation, such as creating files.

  • You see there are three types of files.

  • So random access files, writable files,

  • and read-only memory regions.

  • The read-only memory regions are files

  • that are mapped in memory on a memory page,

  • and then you can just read directly from memory.

  • There are two ways to write to files.

  • Either you overwrite the entire contents,

  • or you append at the end of the file.

  • So that's why you have two constructors

  • for writable files-- the NewWritableFile

  • and the NewAppendableFile.
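
  • The difference between the two write modes is the same one Python exposes through `open` with `"w"` and `"a"`. The sketch below uses only the standard library and a made-up file name, as an analogy rather than TensorFlow code:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "log.txt")  # hypothetical file name

    with open(path, "w") as f:   # overwrite: start from an empty file
        f.write("first\n")
    with open(path, "a") as f:   # append: keep what is there, add at the end
        f.write("second\n")
    with open(path) as f:
        after_append = f.read()

    with open(path, "w") as f:   # overwrite again: earlier contents are gone
        f.write("third\n")
    with open(path) as f:
        after_overwrite = f.read()
```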

  • More functionality in Env: creating or removing

  • directories, moving files around, basically everything

  • that is a directory operation.

  • Furthermore, the next ones are determining the files

  • existing in your directory, determining all the files that

  • match a specific pattern, or getting information

  • about a specific path entry--

  • if it exists, if it is a directory, what is its size,

  • and so on.
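
  • For comparison, the same three kinds of query look like this with Python's standard library on a local disk. The file names are invented for the example, and this is not the TensorFlow API:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    for name in ("a.ckpt", "b.ckpt", "notes.txt"):
        with open(os.path.join(root, name), "w") as f:
            f.write("x" * 10)

    # Files existing in a directory.
    children = sorted(os.listdir(root))

    # Files matching a specific pattern.
    pattern = os.path.join(root, "*.ckpt")
    matches = sorted(os.path.basename(p) for p in glob.glob(pattern))

    # Information about one path entry.
    entry = os.path.join(root, "notes.txt")
    exists = os.path.exists(entry)
    is_dir = os.path.isdir(entry)
    size = os.stat(entry).st_size
```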

  • All of these are implemented by each individual file system,

  • but I'm going to get to that soon.

  • There is other functionality that Env contains,

  • but they are out of scope for this talk.

  • So Env also supports threading, supports an API for a clock,

  • getting information about your time, loading shared libraries,

  • and so on.

  • We are not concerned about these, but I just

  • wanted to mention them for completeness.

  • As I mentioned, there are three different file types

  • that we support--

  • the random access file, the writable file,

  • and the read-only memory region, and this is the current API

  • that they support.

  • The first two files have a name, and then they have operations,

  • like read/write.

  • And the memory region, since it's already mapped in memory,

  • you don't need a name for it.

  • You only need to see how long it is

  • and what is the data in there.
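
  • A read-only memory region behaves much like a read-only `mmap` in Python: once the file is mapped, all you ask for is its length and its bytes. This is a standard-library analogy with a made-up file name, not the TensorFlow class:

```python
import mmap
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "weights.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"\x01\x02\x03\x04")

    with open(path, "rb") as f:
        # Map the whole file (length 0 means "entire file") as read-only memory.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as region:
            length = len(region)      # how long it is
            data = bytes(region[:])   # what the data in there is
```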

  • I'm going to come back to these three file types later.

  • That's why I wanted to mention them here.

  • Finally, the last important class

  • is the FileSystem class, which actually

  • does the real implementation.

  • So this is where we implement how

  • to create a file, how to read from a file,

  • and everything else.

  • In order to support TensorFlow in other languages,

  • we also need to provide the C API interface that language

  • bindings can link against.

  • And this is very simple at the moment.

  • It's just providing the same functions

  • that we already saw, except they use C types

  • and they use some other markers in the signature

  • to mark that they are symbols exported in a shared library.

  • This C API interface is not complete.

  • So for example, it doesn't support random access files.

  • So it doesn't support reading from files

  • from other languages unless that language [INAUDIBLE]

  • directly over the FileSystem class

  • that I showed you in a previous slide.

  • OK.

  • This is all about the file system support that

  • currently exists in the current implementation.

  • However, there is now work to modernize the TensorFlow file

  • system support in order to reduce our complexity.

  • And when I'm speaking about complexity,

  • I am thinking about this diagram where

  • you have the FileSystem class that I showed you,

  • and all of these implementations of it.

  • So we have support for POSIX, support for Windows,

  • support for Hadoop, S3, GCS, and many others,

  • and then a lot of test file systems.

  • Each one of them