Dagster Data Orchestration 10 min walkthrough

Hi, my name is Sean, and I'm an engineer working on Dagster.

I get to talk to a lot of different engineering teams, and unfortunately, they all say that they're struggling.

They spend too much time babysitting production, and they don't have a chance to build new things and be proactive with stakeholders.

So why is this?

Well, unfortunately, a lot of those teams are using task-based orchestrators like Airflow, and that puts them into this vicious cycle where they can't test code out locally, so they have to push it straight to production.

But because it's hard for them to reason ahead of time about what new code will do, often pushing straight into production leads to failures and outages, and that's what ends up paging on-call and interrupting those engineers who are trying to do new work.

Unfortunately, because of those interruptions, the team is slow and often criticized for being behind, and that in turn means they're unable to pay down technical debt that would actually allow them to fix some of these problems.

So how do we get out of this vicious cycle?

We believe that Dagster is the solution, and that's because Dagster is an orchestrator built for data engineers and the entire software development lifecycle.

It allows you to think about individual assets and to take a declarative approach.

So instead of having to build one monolithic DAG that's tied to your production resources, you can write new code incrementally, and then the orchestrator will figure out when those new data assets need to be run.

If this approach sounds familiar, it's because many modern web engineers have taken this declarative approach.

In fact, the migration from Angular to React was all about adopting these benefits.

So let's see this in action.

Here's the global data asset graph for the Hooli data engineering team.

You can see they start by grabbing some data from an API.

That data is fed through a series of transformations, and eventually a daily order summary table is created.

That table is then used by the data science team to run forecasting routines and by the reporting team for KPI reporting and executive dashboards.

So what are the benefits of using assets?

Well, imagine an executive has a question about the daily order summary.

Something doesn't look quite right.

Well, in a normal orchestrator, you would have to go spelunking through all the different task logs, trying to figure out what task might have impacted that table.

Whereas in Dagster, you can immediately look at the daily order summary and see metadata about it, see the run logs associated with it, and even information like the SQL that generated the table.

This allows you to debug problems and answer that executive question really quickly, and in fact, give stakeholders the ability to self-serve questions like, when was this data set last updated?

In addition, taking an asset-first approach allows Dagster to do declarative scheduling.

So instead of having to create a single monolithic DAG or try to reason through when different cron schedules should be applied to different jobs, you can simply define new assets and encode the SLA that stakeholders have for them.

So for example, this average order asset that the marketing team relies on needs to be updated pretty frequently because it's in a KPI dashboard.

So a policy has been set that the asset should never be more than 90 minutes stale.

In contrast, the daily order summary asset only needs to be updated every day by 9 a.m.
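
The two policies just described can be sketched in plain Python. This is only an illustration of the idea, not Dagster's API: the function names and parameters are assumptions, and the "by 9 a.m." rule is read here as "once 9 a.m. has passed, the asset must have been updated today".

```python
from datetime import datetime, timedelta


def is_stale_by_lag(last_updated: datetime, now: datetime, max_lag_minutes: int) -> bool:
    """True if the asset is older than the allowed lag (e.g. 90 minutes)."""
    return now - last_updated > timedelta(minutes=max_lag_minutes)


def is_stale_by_deadline(last_updated: datetime, now: datetime, deadline_hour: int) -> bool:
    """True if today's deadline has passed and the asset was not updated today."""
    deadline = now.replace(hour=deadline_hour, minute=0, second=0, microsecond=0)
    return now >= deadline and last_updated.date() < now.date()
```

An asset with a 90 minute lag policy that was last updated 105 minutes ago would count as stale, while one updated 60 minutes ago would not.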

Dagster figures out when these assets should run, and because it's aware of all the different data assets that your team cares about and how they depend on one another, Dagster's smart enough to avoid redundant work.

So here we're seeing that the average order data set that has that SLA encoded to be up-to-date every 90 minutes needs to have itself and two other stale assets upstream updated, but everything else is already fresh enough.
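
One way to picture how redundant work is avoided is a walk over the dependency graph that selects only the stale assets, plus anything downstream of them on the way to the target. The sketch below is a simplified stand-in, not how Dagster is implemented; the function name and the asset names used with it are hypothetical.

```python
def assets_to_refresh(target: str, upstream: dict[str, list[str]], stale: set[str]) -> list[str]:
    """Return the assets that must run to refresh `target`, parents first.

    An asset must run if it is stale itself or if any of its parents must run.
    Fresh assets with fresh ancestors are skipped.
    """
    needs_run: dict[str, bool] = {}
    order: list[str] = []

    def visit(name: str) -> bool:
        if name in needs_run:
            return needs_run[name]
        # Visit every parent first so the result is in dependency order.
        parent_flags = [visit(parent) for parent in upstream.get(name, [])]
        flag = name in stale or any(parent_flags)
        needs_run[name] = flag
        if flag:
            order.append(name)
        return flag

    visit(target)
    return order
```

With a graph where `average_order` depends on `orders_augmented`, which depends on `orders_cleaned`, which depends on a fresh `orders_raw`, only the three stale assets would be selected and `orders_raw` would be left alone.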

This avoids redundant computations and expensive cloud warehouse queries.

So how are all these things built?

Let's take a look at a Dagster project.

Dagster projects are formatted as Python packages.

And within a project, we can create an asset by simply writing a new function.

Assets in Dagster can be Pandas data frames, they can be Jupyter notebooks, they can be Spark data frames, or really any arbitrary code.

So here we'll create a new function to calculate the average order size, which is an important metric for our executive team.

We'll start by writing a function and then adding Dagster's asset decorator.

Then within the function, we'll just use our regular logic to compute that KPI.

And then finally, we'll encode the SLA for what stakeholders expect as a freshness policy.
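
The three steps (write a function, decorate it, attach the freshness expectation) can be mimicked with a small stand-in decorator. To keep the example runnable without any installation, this sketch does not import Dagster: the `asset` decorator, the `ASSETS` registry, the `max_lag_minutes` parameter, and the sample rows are all assumptions made for illustration. Like Dagster's real asset decorator, the stand-in infers upstream assets from the function's parameter names.

```python
import inspect
from statistics import mean

ASSETS = {}  # hypothetical registry: asset name -> definition


def asset(max_lag_minutes=None):
    """Stand-in decorator (not Dagster's own) that registers a function as an asset."""
    def register(fn):
        ASSETS[fn.__name__] = {
            "fn": fn,
            # Upstream assets are inferred from the function's parameter names.
            "deps": list(inspect.signature(fn).parameters),
            # The stakeholder SLA travels with the asset definition.
            "max_lag_minutes": max_lag_minutes,
        }
        return fn
    return register


@asset()
def orders():
    # Hypothetical sample rows standing in for data pulled from an API.
    return [{"order_id": 1, "amount": 20.0}, {"order_id": 2, "amount": 40.0}]


@asset(max_lag_minutes=90)
def average_order(orders):
    # Regular business logic: the mean order amount.
    return mean(row["amount"] for row in orders)
```

Because the SLA is recorded next to the logic, a scheduler can read `ASSETS["average_order"]` and decide when the asset has to run, instead of someone maintaining a separate cron schedule.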

Once we have our asset created, in Dagster, we can run everything locally.

So we'll fire up a local copy of our Dagster user interface.

And here I can test that the logical code I just wrote runs.

When I run things locally, I don't have to use production resources.

So here when I run all of my code, I'm going to be using just the local file system to store intermediate results.

And the SQL that I'm writing will execute against a local DuckDB warehouse.

This allows me to execute all of my logical code really quickly and to iterate really fast without impacting or relying on production systems.

When we do make it to production, instead of using DuckDB, we'll use a Snowflake warehouse.

Instead of using our local file system, we'll use S3.

And that's all encoded and configured through Dagster's pluggable resource system.
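
The idea behind a pluggable resource system is that asset logic depends on an interface, and the environment decides which implementation is plugged in. Here is a minimal sketch under stated assumptions: the class and function names are hypothetical, `sqlite3` stands in for the local DuckDB warehouse so the example needs only the standard library, and the production class is a placeholder rather than a real Snowflake client.

```python
import sqlite3
from typing import Protocol


class Warehouse(Protocol):
    def query(self, sql: str) -> list[tuple]: ...


class LocalWarehouse:
    """Local stand-in (the talk uses DuckDB; sqlite3 keeps this stdlib-only)."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")

    def query(self, sql: str) -> list[tuple]:
        return self._conn.execute(sql).fetchall()


class ProductionWarehouse:
    """Placeholder for a Snowflake-backed implementation (not implemented here)."""

    def query(self, sql: str) -> list[tuple]:
        raise NotImplementedError("connect to the production warehouse here")


def build_resources(environment: str) -> dict[str, Warehouse]:
    # The environment name picks the implementation; asset code never changes.
    if environment == "production":
        return {"warehouse": ProductionWarehouse()}
    return {"warehouse": LocalWarehouse()}


def order_count(warehouse: Warehouse) -> int:
    # Asset logic only sees the Warehouse interface, never a concrete backend.
    rows = warehouse.query("SELECT COUNT(*) FROM (SELECT 1 UNION ALL SELECT 2)")
    return rows[0][0]
```

Calling `order_count(build_resources("local")["warehouse"])` runs the same SQL-issuing code path that production would use, without touching production systems.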

So now that I'm happy with my code locally, let's open up a pull request.

Normally, when data teams open pull requests, you can review the code, but you have to guess what that code will actually do once it's in production.

With Dagster, we create what's called a branch deployment, which is essentially an isolated copy of our entire data platform just for this pull request.

That allows my team to actually run the code and see what it's going to look like.

In this case, we're running against resources that are very similar to production.

We're using Snowflake to clone a copy of our production database that this pull request can run against.

So while we're not impacting production, we can be sure that our code is going to work with production-like systems.

So in this way, Dagster provides a staging environment for every pull request that you open.
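
The per-pull-request database clone could be set up with something like the helper below, which builds Snowflake `CREATE DATABASE ... CLONE` and `DROP DATABASE` statements. This is only a sketch of the pattern: the database names, the `ANALYTICS_PR` prefix, and the helper functions are hypothetical, and the talk does not show how the branch deployment actually issues these statements.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def branch_database_name(pull_request: int, prefix: str = "ANALYTICS_PR") -> str:
    """Name of the isolated database for one pull request (hypothetical scheme)."""
    return f"{prefix}_{pull_request}"


def clone_statement(production_db: str, pull_request: int) -> str:
    """SQL to clone production into a database used only by this pull request."""
    if not _IDENTIFIER.fullmatch(production_db):
        raise ValueError(f"unexpected database name: {production_db!r}")
    return f"CREATE OR REPLACE DATABASE {branch_database_name(pull_request)} CLONE {production_db}"


def drop_statement(pull_request: int) -> str:
    """SQL to clean up the clone once the pull request is merged or closed."""
    return f"DROP DATABASE IF EXISTS {branch_database_name(pull_request)}"
```

Running the clone statement when a pull request opens and the drop statement when it closes would give each branch its own production-like data without affecting production.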

Once you're ready to put code into production, Dagster was built with all the modern bells and whistles.

So, for example, multiple teams can collaborate together in different virtual environments and different projects.

You don't have to try to get everyone on the same version of pandas, and you still have a global asset view where those teams can depend on one another's work.

Dagster has full support for role-based access controls and single sign-on.

In fact, many of our Dagster customers allow everyone in their organization to be viewers so that they can self-serve questions from the Dagster asset catalog, like when was this data set last updated?

Dagster has a variety of settings to help ensure that the orchestrator is robust, including things like automatic op retries and run queues with different priorities.
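
To make the two robustness features concrete, here is a generic sketch of automatic retries and a priority-ordered run queue. It shows the concepts only and is not Dagster's configuration or implementation; `run_with_retries` and `RunQueue` are hypothetical names, and a real orchestrator would typically also wait between attempts.

```python
import heapq
import itertools


def run_with_retries(step, max_retries: int):
    """Call `step` until it succeeds or `max_retries` extra attempts are used up."""
    for attempt in range(max_retries + 1):
        try:
            return step()
        except Exception:
            # Re-raise only after the final allowed attempt has failed.
            if attempt == max_retries:
                raise


class RunQueue:
    """Runs with a higher priority number are dequeued first; ties are first-in, first-out."""

    def __init__(self) -> None:
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name: str, priority: int = 0) -> None:
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), name))

    def next_run(self) -> str:
        return heapq.heappop(self._heap)[2]
```

A step that fails twice with a transient error and then succeeds would complete under `max_retries=2` without anyone being paged, and an urgent KPI refresh submitted with a high priority would be dequeued ahead of a low-priority backfill.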

And finally, Dagster supports a variety of different alerting policies.

Like many orchestrators, you can alert on failure, but Dagster actually helps teams avoid alert fatigue by also allowing you to alert on SLA violations.

And that means that you're only going to get notified when data sets are outside of the SLAs that actually matter to stakeholders and not get notified based on spurious failures that are automatically recoverable.
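
The difference between alerting on failure and alerting on an SLA violation can be shown with a small check over a run history. This is a sketch of the policy as described, not Dagster's alerting API; the function name and the shape of the run records are assumptions.

```python
from datetime import datetime, timedelta


def sla_violated(runs: list[tuple[datetime, bool]], now: datetime, max_lag: timedelta) -> bool:
    """Decide whether to alert, given (finished_at, succeeded) records for one asset.

    Failed runs do not trigger an alert on their own. The alert fires only
    when the most recent successful run is older than the allowed lag.
    """
    successes = [finished_at for finished_at, succeeded in runs if succeeded]
    if not successes:
        return True  # the data has never been produced
    return now - max(successes) > max_lag
```

A run that fails at 9:00 and succeeds on retry at 9:05 would produce no alert at 9:30 under a 90 minute policy, whereas repeated failures since a 7:00 success would.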

Finally, Dagster Cloud can run in a variety of different ways.

And so, for example, you might use Kubernetes or you might use ECS or any other highly scalable compute layer.

So we hope you're excited about Dagster and ready to give it a shot.

If that's the case, we've made it really easy to get started with Dagster Cloud.

You can clone an example project and get running in no time.

Or you can start out by developing locally.

Once you're ready to run things in production, you can either host Dagster open source yourself or use Dagster Cloud, which comes with a fully serverless option as well as a hybrid computation model.

So be sure to check us out.

Find us on GitHub and give us a star.

That's the best place to keep track of recent updates like our 1.1 release.

Or join us on Slack where you can ask questions and meet other modern data engineers.

Thanks so much.
