Incremental Models (Faisal El Shami)

Hello everybody.

My name is Faisal El Shami. I started my data career at TIER as a data analyst, and I slowly progressed into an analytics engineer. I work heavily with dbt, which I really love. I also focus a lot on operational analytics, a lot of stuff around fleet availability and operations out on the streets. A little fun fact about me is that I always say good morning, no matter what time it is. And there's a reference there.

So I want to go through the table of contents of the things that I'll be discussing today. First I will discuss what an incremental model is. What does that mean? Then when to use incremental models, when not to use them, some additional tips and tricks, maybe some small definitions around that, and at the end, how we use incremental models at TIER.

Let's start with the first part.

What are incremental models?

Incremental models are built as tables in your data warehouse, as opposed to being materialized as a view. dbt builds the whole table when it runs the model for the first time, and on the next runs you tell dbt what to filter for, so which rows it should transform instead of going and transforming the whole data set. This is really nice because it reduces the build time, and maybe the computational complexity or heaviness, since you're only transforming the new records.

Of course, building incremental models requires a bit of extra configuration on your side, and you should be a bit careful. Incremental models work really well with event-style data.

If you look at the table on the right, you have the source data on the left side, and you add an incremental filter, which I will show you how to do later. It defines, for example, "show me all the data that comes in between these two dates", and then dbt will go on to transform only these rows and add them to the destination table.

When do we use incremental models?

If I want to make it really simple, it's basically to save on time and to save on money and resources. All the other reasons stem from this. So if your source data is huge, or your models are taking too long to run, or maybe you have really heavy computations happening there, then incremental models might be the way.

It also helps if your historical data does not change a lot. As I mentioned before, event-style data is really good for that. And it helps if your models need to be updated frequently: if you have a lot of models scheduled in dbt to run every hour or two, then maybe using incremental models is the way to go.

As for when not to use incremental models, technically it's the opposite of what I've mentioned. If your source data is too small to be worth that extra configuration, or if your data is constantly changing and you need to update it, maybe incremental models are not the way.

If your models rely heavily on window functions, for example, then incremental models won't work out of the box. It's not that you can't apply incremental models there, but it's a bit more complex and maybe better avoided, because you have to integrate some more complex logic or some more steps along the way.

I'll start off with how to incrementalize a model.

Just to recap on the incremental filter: there is the is_incremental filter. Basically, it's a macro that evaluates to true or false.

When is it true? Once you run the normal command, dbt run for the model, the incremental filter comes into play. As you can see in the picture on the right, in lines 19 to 22, it checks that the following are true. Number one is the configuration of the model. This is basically line two: if you look at materialized, the model should be materialized as incremental, right? The second thing is that you have a destination table. So this is what the macro assesses.

When you configure the model, you can also define a unique key. The unique key looks at the newly transformed records, so the ones selected by the incremental filter, then looks back at the table you already have and checks whether they match. If they match, it updates the existing data in the table.

So it's very simple to apply incremental models. If you look here on the right, this is a very simple select statement from a source, from a table. We're done with our configuration, and at the end, in the where clause, I just added an incremental clause. This is the syntax of how you do it: where the timestamp is greater than or equal to, select the maximum timestamp from this model. "This" here refers to the already existing destination table, that is, this model, not the source data. So this is basically it.

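
The slide is not reproduced in this transcript, so the following is only a minimal sketch of the pattern just described. The source `app.events`, the model name, and the column names are hypothetical placeholders.

```sql
-- models/events_incremental.sql (hypothetical model name)
{{ config(materialized='incremental') }}

select
    event_id,
    user_id,
    event_timestamp
from {{ source('app', 'events') }}  -- hypothetical source

{% if is_incremental() %}
-- Applied on incremental runs only: keep rows at or after the newest
-- timestamp already stored in the destination table (this model).
where event_timestamp >= (select max(event_timestamp) from {{ this }})
{% endif %}
```

In dbt, `is_incremental()` is true only when the destination table already exists, the model is materialized as incremental, and the run is not a full refresh. Because the comparison is `>=`, rows that share the current maximum timestamp are selected again on the next run, which is one reason this filter is often paired with a unique key.
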
I'll go over some tips around that now, and maybe clarify some things.

Again, as I mentioned, "this" refers to the model you're actually working on, or the destination table.

If you're using a lot of window functions, you might consider having one step that basically loads the data incrementally, and then another model that runs these window functions.

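
One way to picture this two-step split is a downstream model that reads the full, incrementally loaded table and only then applies the window function. This is a sketch under assumptions: `events_incremental` is a hypothetical upstream incremental model, and the columns are placeholders.

```sql
-- models/events_numbered.sql (hypothetical downstream model)
-- Reads the whole incrementally loaded table, so the window function
-- sees every row for a user, not just the newly loaded slice.
{{ config(materialized='table') }}

select
    event_id,
    user_id,
    event_timestamp,
    row_number() over (
        partition by user_id
        order by event_timestamp
    ) as event_number_for_user
from {{ ref('events_incremental') }}
```

The expensive part (reading and cleaning the source) stays incremental, while the window function is recomputed over the complete table on every run.
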
The unique key, this is an important one. The unique key does not ensure uniqueness in certain cases. You also don't always need to add a unique key, because it can be a bit more computationally expensive.

This is the first set of tips.

If you have a lot of CTEs, you sometimes have to think about where you should put the incremental filter, because it can impact the performance of your model.

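
As an illustration of the placement question, the sketch below puts the incremental filter in the first CTE, so every later CTE only works on the new slice. The source, the `vehicles` model, and all column names are hypothetical; whether early filtering is the right choice depends on the model.

```sql
{{ config(materialized='incremental') }}

with new_events as (

    select event_id, vehicle_id, event_timestamp
    from {{ source('app', 'events') }}
    {% if is_incremental() %}
    -- Filter as early as possible so the join below touches fewer rows.
    where event_timestamp >= (select max(event_timestamp) from {{ this }})
    {% endif %}

),

enriched as (

    select
        new_events.event_id,
        new_events.event_timestamp,
        vehicles.city
    from new_events
    left join {{ ref('vehicles') }} as vehicles
        on new_events.vehicle_id = vehicles.vehicle_id

)

select * from enriched
```
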
Basically, the normal run command runs the incremental model, so there's no need to do anything else. If you have, for example, added a new column, or you want to fully refresh the whole table, so rerun the whole table, rerun the computations and so on, you can run the command you see at the bottom of the presentation: dbt run --full-refresh --select my_incremental_model.

And this is it for the tips.

So now I want to speak about the usage at TIER.

Here at TIER, we have more than 400 dbt models, of which around 22% are incremental. On the right, this is just a graph in Looker. We use the dbt API to get some stats and some data around dbt. You can see that we have different runs: some run frequently, some hourly, some at intervals longer than an hour, some at night. We have a lot of frequently run models, and a lot of them are incremental to save on time and resources, as I mentioned. We also run full refreshes on certain incremental models during the weekends.

I just want to mention something. These are some of the models that I work with.

I have a model that has more than a billion rows. It's not computationally expensive, just a lot of CASE WHENs and unions. If I run it normally, as a full refresh, it takes around 55 minutes. But now, as an incremental model, where I think I have a 24-hour period of data to refresh, it runs in around 42 seconds.

I have another model that's similar, but a bit more computationally expensive or complex, let's say. A full run takes around 1.2 hours, and it takes one minute to run incrementally.

Then I have another model that's a bit smaller than that, but it has a lot of heavy computations. It normally runs in around 25 minutes, but now, as an incremental model, it runs in around three minutes.

So you can see here, this is really good. This is saving a lot of time.

I want to discuss a few approaches we use here.

The first is the safe approach, which is the unique key. Here we define a unique key and we have the incremental statement, and we basically tell the model: hey, select everything that's after the latest timestamp, and make sure that any unique key that already exists in the destination table gets updated. This is the first, safe, straightforward approach.

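
A minimal sketch of this unique-key approach, assuming a hypothetical `event_id` key and `updated_at` column on a hypothetical source:

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select event_id, status, updated_at
from {{ source('app', 'events') }}

{% if is_incremental() %}
-- New or changed rows only; rows whose event_id already exists in the
-- destination table are updated rather than inserted a second time.
where updated_at >= (select max(updated_at) from {{ this }})
{% endif %}
```

How the update is carried out (for example a merge statement) depends on the warehouse adapter and the configured incremental strategy.
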
The second approach is the pre-hook configuration. So what are pre-hooks? As you can see in the picture on the left, there's a pre-hook at line five. It's basically a SQL statement that runs before the model runs. We applied an incremental filter there. So this is what it basically does when it runs: it starts the model, it deletes 24 hours of data, and then the incremental logic pulls another 24 hours of data to process.

This is not a really good way to do things, but sometimes you have to do it because of certain limitations you have or the way the data is.

Just to note: the first run of this will always fail. So you always have to make sure that you first run it in something like an ad hoc job, and then run the models normally.

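
The slide's exact SQL is not in the transcript, so this is only a sketch of how a delete-then-reload pre-hook could look. It assumes Snowflake date functions and hypothetical source and column names, and it cuts at a day boundary rather than a rolling 24 hours.

```sql
{{ config(
    materialized='incremental',
    pre_hook="delete from {{ this }} where event_timestamp >= dateadd(day, -1, current_date())"
) }}

select event_id, status, event_timestamp
from {{ source('app', 'events') }}

{% if is_incremental() %}
-- Reload the same window that the pre-hook just deleted.
where event_timestamp >= dateadd(day, -1, current_date())
{% endif %}
```

In this sketch the hook deletes from the destination table, which does not exist before the first build; that would be consistent with the remark that the first run fails. Note also that the hook and the model evaluate their cutoffs at slightly different moments, so a window that is not aligned to a fixed boundary could delete rows that are then not reloaded.
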
One of the last methods is incremental usage with unions. This is good if you're using window functions.

Basically, what we're doing here is, in the first CTE, we're just taking the data and applying some incremental logic, let's say based on time: take everything after the latest timestamp in the destination table, and that's it. Then in the second one, I take everything from the CTE above, but if I'm running incrementally, I union it with all the data from the existing destination table. Instead of cutting it into two models, you run everything at once. After that, you can continue with window functions and so on.

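
The union approach could be sketched as follows. The names are hypothetical, and the `unique_key` is an added assumption so that recomputed rows replace the stored ones instead of being appended again.

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

with new_events as (

    select event_id, user_id, event_timestamp
    from {{ source('app', 'events') }}
    {% if is_incremental() %}
    -- Strictly after the latest stored timestamp, so no row appears twice
    -- once the existing table is unioned in below.
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}

),

all_events as (

    select event_id, user_id, event_timestamp from new_events
    {% if is_incremental() %}
    union all
    select event_id, user_id, event_timestamp from {{ this }}
    {% endif %}

)

select
    event_id,
    user_id,
    event_timestamp,
    row_number() over (
        partition by user_id
        order by event_timestamp
    ) as event_number_for_user
from all_events
```

The window function runs over the full history on every run, so the saving comes from not rereading and retransforming the source, not from the window step itself.
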
The last method I want to talk about is what I call the elephant by the chunks. Instead of reprocessing the whole data set, you process only a set amount of data.

What you have is the incremental logic, and what you're trying to do here is just introduce a value that you define in the run command, which you can see at the bottom. It is then applied in the where clause if the run is incremental. So this is just an extra configuration that lets you load your data in chunks instead of running through all of history, because maybe that's not necessary.

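
One way this chunked loading could be set up is with dbt project variables passed on the command line. The variable names `chunk_start` and `chunk_end`, the source, and the columns are all hypothetical, and the `unique_key` is an added assumption so that reloading a chunk does not duplicate rows.

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select event_id, status, event_timestamp
from {{ source('app', 'events') }}

{% if is_incremental() %}
    {% if var('chunk_start', none) is not none and var('chunk_end', none) is not none %}
    -- Backfill mode: load only the requested chunk.
    where event_timestamp >= '{{ var("chunk_start") }}'
      and event_timestamp <  '{{ var("chunk_end") }}'
    {% else %}
    -- Normal mode: load everything after the newest stored timestamp.
    where event_timestamp > (select max(event_timestamp) from {{ this }})
    {% endif %}
{% endif %}
```

A chunk would then be loaded with something like: dbt run --select events_incremental --vars '{"chunk_start": "2021-01-01", "chunk_end": "2021-02-01"}'. The model name and dates are placeholders.
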
And I think that's it from my side. Any questions?

What are the best practices to fill historical data on incremental models? For example, when you add new columns to a model over time, especially on huge models where a full refresh would take too long?

There are different ways you can do it. If I understood correctly, what you can do is, as I mentioned, the last method in the slides. You can refresh it in chunks: start by adding the column first as an empty column, and then run the chunks, like the elephant by the chunks method that you saw. I think this is one way you can approach it.

Another way is maybe to have the historical table built one way, in a separate model, and then all the newer data in a different model. It depends on the usage and how people are using it. Maybe that's not the best way, but yeah, the elephant method is the one that I would recommend, I guess.

We used that a lot during migration. When we migrated from Redshift to Snowflake, we used it to build the tables, whole tables that we knew were going to be huge and might fail at the very end.

There's also a new thing, merge by column. I think it's a new feature where you can update just certain columns, and you can configure it. I think it's in the dbt documentation.

There's another one that you can enable to add new columns. So if your select is dynamic and it generates a new column, you can just enable that in the config and it will create the column in the incremental model for you.

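
The two options mentioned here appear to correspond to dbt's `merge_update_columns` and `on_schema_change` configs. That mapping is an inference from the description, not something stated in the talk, and the column names below are hypothetical.

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    -- On a merge, update only these columns for rows that already exist.
    merge_update_columns=['status', 'updated_at'],
    -- Add columns that newly appear in the select to the existing table.
    on_schema_change='append_new_columns'
) }}

select event_id, status, updated_at
from {{ source('app', 'events') }}

{% if is_incremental() %}
where updated_at >= (select max(updated_at) from {{ this }})
{% endif %}
```

`merge_update_columns` only applies where the adapter uses a merge strategy, and `on_schema_change` does not backfill values for rows that were loaded before the column existed, so old rows still need a chunked reload or a full refresh.
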
Why do you regularly do full refresh runs on your incremental models?

Okay. So we have different kinds of incremental models. I think the ones with only pre-hooks, for example, are good candidates for a regular full refresh. But I just want to point out that we do not fully refresh everything. We fully refresh certain models, not all models, unless it's very, very necessary. I think this is it.

We don't have to fully refresh everything, but I think the ones that use the pre-hook method are prone to data leakage in some way, at least in my experience, and it's good to refresh them every once in a while. There are also a lot of other factors, like how the source data is looking. Is it stable? And so on.

How do you use a materialized view versus an incremental model, especially in Snowflake?

We usually use views on the first raw data, and almost everything else we materialize as tables. With a view, it's just the query running against the source. It's like a snapshot. Tables might take a bit longer to run, and they're sometimes bigger, but they're computationally less expensive to query and the queries are faster. So they take longer to build, but the queries are of course faster, let's say in Looker and so on. Usually we have the first steps, the very raw tables, as views, and then everything built on top of that is usually a table.

So what I was referring to is not a standard view, but a materialized view as an object in Snowflake. You can create a materialized view that actually physically stores the data, so performance is also good, and Snowflake takes care of the easy transformations underneath. You can't use everything, but you can do easy mapping by timestamp, and all of that is possible.

And also, if I may add, if you have it in dbt, it's a little bit more transparent for the rest of your data team, right? It would depend on how big your data team is, I guess, but normally not everyone will have, let's say, admin rights to be modifying things. Having that view or that incremental model in dbt would give transparency and add version control. You can still have version control otherwise, but it would ensure that you need to use version control for deployment.

Thank you.