  • [MUSIC PLAYING]

  • CATHERINE XU: Hi.

  • I'm Kat from the TensorFlow team,

  • and I'm here to talk to you about responsible AI

  • with TensorFlow.

  • I'll focus on fairness for the first half of the presentation,

  • and then my colleague, Miguel, will end

  • with privacy considerations.

  • Today, I'm here to talk about three things.

  • The first, an overview of ML fairness.

  • Why is it important?

  • Why should we care?

  • And how does it affect an ML system exactly?

  • Next, we'll walk through what I'll

  • call throughout the presentation a fairness workflow.

  • Surprisingly, this isn't too different from what

  • you're already familiar with--

  • for example, a debugging or a model evaluation workflow.

  • We'll see how fairness considerations

  • can fit into each of the discrete steps.

  • Finally, we'll introduce tools in the TensorFlow ecosystem,

  • such as Fairness Indicators that can be

  • used in the fairness workflow.

  • Fairness Indicators is a suite of tools

  • that enables easy evaluation of commonly used fairness

  • metrics for classifiers.

  • Fairness Indicators also integrates

  • well with remediation libraries to help mitigate

  • any bias you find, and it provides structure

  • to support your deployment decision

  • with features such as model comparison.

  • We must acknowledge that humans are at the center of technology

  • design, in addition to being impacted by it,

  • and humans have not always made product design decisions

  • that are in line with the needs of everyone.

  • Here's one example.

  • Quick, Draw! was developed through the Google AI

  • experiments program where people drew little pictures of shoes

  • to train a model to recognize them.

  • Most people drew shoes that look like the one on the top right,

  • so as more people interacted with the game,

  • the model stopped being able to recognize shoes

  • like the shoe on the bottom.

  • This is a social issue first, which

  • is then amplified by fundamental properties of ML--

  • aggregation and using existing patterns to make decisions.

  • Minor repercussions in a faulty shoe classification product,

  • perhaps, but let's look at another example that can

  • have more serious consequences.

  • Perspective API was released in 2017

  • to protect voices in online conversations

  • by detecting and scoring toxic speech.

  • After its initial release, users experimenting

  • with the web interface found something interesting.

  • One user tested two clearly non-toxic sentences

  • that were essentially the same, but with the identity term

  • changed from straight to gay.

  • Only the sentence using gay was perceived by the system

  • as likely to be toxic, with a classification score of 0.86.

  • This behavior not only constitutes

  • a representational harm.

  • When used in practice, such as in a content moderation system,

  • it can also lead to the systematic silencing of voices

  • from certain groups.

  • How did this happen?

  • For most of you using TensorFlow,

  • a typical machine learning workflow

  • will look something like this.

  • Human bias can enter into the system

  • at any point in the ML pipeline, from data collection

  • and handling to model training to deployment.

  • In both of the cases mentioned above,

  • bias primarily resulted from a lack of diverse training data--

  • in the first case, diverse shoe forms,

  • and in the second case, examples of comments containing gay

  • that were not toxic.

  • However, the causes and effects of bias are rarely isolated.

  • It is important to evaluate for bias at each step.

  • You define the problem the machine learning

  • system will solve.

  • You collect your data and prepare it,

  • oftentimes checking, analyzing, and validating it.

  • You build your model and train it on the data

  • you just prepared.

  • And if you're applying ML to a real world use case,

  • you'll deploy it.

  • And finally, you'll iterate and improve

  • your model, as we'll see throughout the next few slides.

  • The first question is, how can we do this?

  • The answer, as I mentioned before,

  • isn't that different from a general model quality workflow.

  • The next few slides will highlight the touch points

  • where fairness considerations are especially important.

  • Let's dive in.

  • How do you define success in your model?

  • Consider what your metrics and fairness-specific metrics

  • are actually measuring and how they relate to areas

  • of product risk and failure.

  • Similarly, the data sets you choose

  • to evaluate on should be carefully selected

  • and representative of the target population of your model

  • or product in order for the metrics to be meaningful.

  • Even if your model is performing well at this stage,

  • it's important to recognize that your work isn't done.

  • Good overall performance may obscure poor performance

  • on certain groups of data.

  • Going back to an earlier example,

  • accuracy of classification for all shoes was high,

  • but accuracy for women's shoes was unacceptably low.

  • To address this, we'll go one level deeper.

  • By slicing your data and evaluating performance

  • for each slice, you will be able to get a better

  • sense of whether your model is performing equitably

  • for a diverse set of user characteristics.

  • Based on your product use case and audience,

  • what groups are most at risk?

  • And how might these groups be represented in your data,

  • in terms of both identity attributes and proxy

  • attributes?
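As a minimal, library-agnostic sketch of what slice-level evaluation means (the arrays and group labels below are made up for illustration), you can compute a metric such as the false positive rate overall and then again for each group:

```python
import numpy as np

def false_positive_rate(labels, scores, threshold=0.5):
    """FPR = FP / (FP + TN) over the given examples."""
    preds = scores >= threshold
    negatives = labels == 0
    if negatives.sum() == 0:
        return float("nan")
    return float((preds & negatives).sum() / negatives.sum())

# Hypothetical evaluation data: a label, a model score, and a group per example.
labels = np.array([0, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.2, 0.8, 0.7, 0.4, 0.1])
groups = np.array(["young", "not_young", "young", "not_young", "young", "not_young"])

print("overall FPR:", false_positive_rate(labels, scores))
for group in np.unique(groups):
    mask = groups == group
    print(group, "FPR:", false_positive_rate(labels[mask], scores[mask]))
```

A slice whose metric diverges sharply from the overall number is the signal to dig deeper.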

  • Now you've evaluated your model.

  • Are there slices that are performing significantly worse

  • than overall or worse than other slices?

  • How do we get intuition as to why

  • these mistakes are happening?

  • As we discussed, there are many possible sources of bias

  • in a model, from the underlying training data to the model

  • and even in the evaluation mechanism itself.

  • Once the possible sources of bias have been identified,

  • data and model remediation methods

  • can be applied to mitigate the bias.

  • Finally, we will make a deployment decision.

  • How does this model compare to the current model?

  • This is a highly iterative process.

  • It's important to monitor changes

  • as they are pushed to a production setting

  • or to iterate on evaluating and remediating

  • models that aren't meeting the deployment threshold.

  • This may seem complicated, but there

  • is a suite of tools in the TensorFlow ecosystem

  • that make it easier to regularly evaluate and remediate

  • for fairness concerns.

  • Fairness Indicators is a tool available via TFX, TensorBoard,

  • Colab, and standalone model-agnostic evaluation

  • that helps automate various steps of the workflow.

  • This is an image of what the UI looks like,

  • as well as a code snippet detailing

  • how it can be included in the configuration.
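For reference, a configuration along these lines, sketched from the public Fairness Indicators documentation (the label key, slice feature, and thresholds below are placeholders, not the exact snippet shown on the slide), looks roughly like this:

```python
import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    model_specs=[tfma.ModelSpec(label_key="label")],  # placeholder label key
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name="ExampleCount"),
            # Adds false positive rate, false negative rate, and related
            # metrics at each of the listed decision thresholds.
            tfma.MetricConfig(
                class_name="FairnessIndicators",
                config='{"thresholds": [0.25, 0.5, 0.75]}'),
        ])
    ],
    slicing_specs=[
        tfma.SlicingSpec(),                                 # the overall slice
        tfma.SlicingSpec(feature_keys=["identity_group"]),  # placeholder slice feature
    ])
```

In a notebook, the resulting evaluation can then be rendered with the Fairness Indicators widget in tensorflow_model_analysis.addons.fairness.view.widget_view.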

  • Fairness Indicators offers a suite of commonly used fairness

  • metrics, such as false positive rate and false negative rate,

  • that come out of the box for developers

  • to use for model evaluation.

  • In order to ensure responsible and informed use,

  • the toolkit comes with six case studies that

  • show how Fairness Indicators can be applied across use cases,

  • problem domains, and stages of the workflow.

  • By offering visuals by slice of data,

  • as well as confidence intervals, Fairness Indicators

  • helps you figure out which slices are underperforming

  • with significance.

  • Most importantly, Fairness Indicators

  • works well with other tools in the TensorFlow ecosystem,

  • leveraging their unique capabilities

  • to create an end-to-end experience.

  • Fairness Indicators data points can easily

  • be loaded into the What If tool for a deeper analysis,

  • allowing users to test counterfactual use cases

  • and examine problematic data points in detail.
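As a rough sketch of that hand-off (the `examples` list of tf.train.Example protos and the `predict_fn` callable are assumed to exist; they are not defined in the talk), the What-If Tool can be opened in a notebook like this:

```python
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

# `examples` is assumed to be a list of tf.train.Example protos, and `predict_fn`
# a callable mapping a list of examples to the model's prediction scores.
config_builder = (
    WitConfigBuilder(examples)
    .set_custom_predict_fn(predict_fn))

WitWidget(config_builder, height=800)
```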

  • This data can also be loaded into TensorFlow Data Validation

  • to identify the effects of data distribution

  • on model performance.
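A minimal TensorFlow Data Validation pass for that kind of check (file paths here are placeholders) might look like this:

```python
import tensorflow_data_validation as tfdv

# Compute summary statistics for each data set (paths are placeholders).
train_stats = tfdv.generate_statistics_from_tfrecord(data_location="train.tfrecord")
eval_stats = tfdv.generate_statistics_from_tfrecord(data_location="eval.tfrecord")

# Compare the two distributions side by side, e.g. to spot
# under-represented groups or skew between training and evaluation data.
tfdv.visualize_statistics(
    lhs_statistics=train_stats, lhs_name="train",
    rhs_statistics=eval_stats, rhs_name="eval")
```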

  • This Dev Summit, we're launching new capabilities

  • to expand the Fairness Indicators

  • workflow with remediation, easier deployments, and more.

  • We'll first focus on what we can do

  • to improve once we've identified potential sources of bias

  • in our model.

  • As we've alluded to previously, technical approaches

  • to remediation come in two different flavors--

  • data-based and model-based.

  • Data-based remediation involves collecting data, generating

  • data, re-weighting, and rebalancing in order

  • to make sure your data set is more representative

  • of the underlying distribution.
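One simple form of re-weighting, sketched here generically rather than with any particular library, is to give each example a weight inversely proportional to its group's frequency so that under-represented groups contribute comparably to the loss:

```python
import numpy as np
import tensorflow as tf

# Hypothetical training data: features, binary labels, and a group per example.
x = np.random.rand(1000, 16).astype("float32")
y = np.random.randint(0, 2, size=1000)
groups = np.random.choice(["young", "not_young"], size=1000, p=[0.9, 0.1])

# Weight each example inversely to its group's frequency.
_, group_ids, counts = np.unique(groups, return_inverse=True, return_counts=True)
sample_weight = (len(groups) / (len(counts) * counts))[group_ids]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, sample_weight=sample_weight, epochs=1)
```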

  • However, it isn't always possible to get or to generate

  • more data, and that's why we have also investigated

  • model-based approaches.

  • One of these approaches is adversarial training,

  • in which you penalize the extent to which a sensitive attribute

  • can be predicted by the model, thus mitigating the notion

  • that the sensitive attribute affects

  • the outcome of the model.
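To make that idea concrete, here is a generic sketch of adversarial debiasing (not the specific implementation the talk refers to): a gradient-reversal layer lets an auxiliary head try to predict the sensitive attribute while pushing the shared representation to carry as little information about it as possible.

```python
import tensorflow as tf

class GradientReversal(tf.keras.layers.Layer):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""

    def __init__(self, weight=1.0, **kwargs):
        super().__init__(**kwargs)
        self.weight = weight

    def call(self, x):
        @tf.custom_gradient
        def _reverse(x):
            return tf.identity(x), lambda dy: -self.weight * dy
        return _reverse(x)

inputs = tf.keras.Input(shape=(16,))
shared = tf.keras.layers.Dense(32, activation="relu")(inputs)

# Main task head (e.g. toxicity or smile prediction).
task = tf.keras.layers.Dense(1, activation="sigmoid", name="task")(shared)

# The adversary tries to recover the sensitive attribute from the shared
# representation; the reversal layer penalizes the representation to the
# extent that it succeeds.
adversary = tf.keras.layers.Dense(
    1, activation="sigmoid", name="sensitive")(GradientReversal(weight=1.0)(shared))

model = tf.keras.Model(inputs, [task, adversary])
model.compile(
    optimizer="adam",
    loss={"task": "binary_crossentropy", "sensitive": "binary_crossentropy"})
```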

  • Another methodology is demographic-agnostic

  • remediation, an early research method

  • in which the demographic attributes don't need

  • to be specified in advance.

  • And finally, there's constraint-based optimization,

  • which we will go into in more detail over the next few slides

  • with a case study that we have released.

  • Remediation, like evaluation, must be used with care.

  • We aim to provide both the tools and the technical guidance

  • to encourage teams to use this technology responsibly.

  • CelebA is a large-scale face attributes

  • data set with more than 200,000 celebrity images,

  • each with 40 binary attribute annotations,

  • such as smiling, age, and headwear.
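If you want to follow along, CelebA is available in TensorFlow Datasets; the sketch below assumes the TFDS celeb_a schema, where each example carries an image and a dictionary of boolean attributes such as "Smiling" and "Young".

```python
import tensorflow_datasets as tfds

# Each example has an "image" plus a dict of 40 boolean "attributes".
ds = tfds.load("celeb_a", split="train")

for example in ds.take(1):
    image = example["image"]                  # uint8 image tensor
    label = example["attributes"]["Smiling"]  # target for the smile classifier
    group = example["attributes"]["Young"]    # slice used in this case study
```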

  • I want to take a moment to recognize

  • that binary attributes do not accurately

  • reflect the full diversity of real attributes

  • and are highly contingent on the annotations and annotators.

  • In this case, we are using the data set

  • to test a smile detection classifier

  • and how it works for various age groups characterized

  • as young and not young.

  • I also recognize that this is not

  • the full possible span of ages, but bear

  • with me for this example.

  • We trained an unconstrained-- and you'll find out what

  • unconstrained means--

  • tf.keras.Sequential model, then evaluated and visualized it

  • using Fairness Indicators.
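The unconstrained baseline itself is nothing exotic; a small smile classifier along these lines (the architecture here is illustrative, not the exact model from the case study) would do:

```python
import tensorflow as tf

# A small, unconstrained binary smile classifier (illustrative architecture).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # logit; sigmoid is applied via the loss below
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"])
```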

  • As you can see, the not young group has a significantly higher false

  • positive rate.

  • Well, what does this mean in practice?

  • Imagine that you're at a birthday party

  • and you're using this new smile detection

  • camera that takes a photo whenever everyone in the photo

  • frame is smiling.

  • However, you notice that in every photo,

  • your grandma isn't smiling because the camera falsely

  • detected her smiles when they weren't actually there.

  • This doesn't seem like a good product experience.

  • Can we do something about this?

  • TensorFlow Constrained Optimization

  • is a technique released by the Glass Box research team

  • here at Google.

  • And here, we incorporate it into our case study.

  • TF Constrained Optimization works by first defining

  • the subsets of interest.

  • For example, here, we look at the not young group,

  • represented by groups_tensor less than 1.

  • Next, we set the constraints on this group, such

  • that the false positive rate of this group is less than

  • or equal to 5%.

  • And then we define the optimizer and train.
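Put together, the constrained setup just described looks roughly like the sketch below, based on the TensorFlow Constrained Optimization library; it assumes `predictions`, `labels`, and `groups_tensor` are callables or tensors exposing the current batch's model outputs, labels, and group memberships.

```python
import tensorflow as tf
import tensorflow_constrained_optimization as tfco

# `predictions`, `labels`, and `groups_tensor` are assumed to provide the
# current batch's model outputs, true labels, and group memberships.
context = tfco.rate_context(predictions, labels)

# Subset of interest: the "not young" group, encoded here as groups_tensor < 1.
not_young = context.subset(lambda: groups_tensor < 1)

# Minimize the overall error rate subject to FPR(not young) <= 5%.
problem = tfco.RateMinimizationProblem(
    tfco.error_rate(context),
    [tfco.false_positive_rate(not_young) <= 0.05])

optimizer = tfco.ProxyLagrangianOptimizerV2(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    num_constraints=problem.num_constraints)
```

From there, training minimizes the problem with this optimizer over the model's weights together with the problem's and optimizer's own internal variables.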

  • As you can see here, the constrained sequential model

  • performs much better.

  • We ensured that we picked a constraint

  • where the overall rate is equalized

  • for the unconstrained and constrained model, such that we

  • know that we're actually improving the model, as opposed

  • to merely shifting the decision threshold.

  • And this applies to accuracy, as well-- making sure

  • that the accuracy and AUC have not gone down over time.
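One generic way to make that comparison fair (a sketch, not a library feature) is to pick the constrained model's decision threshold so that its overall positive rate matches the unconstrained baseline before comparing per-slice false positive rates:

```python
import numpy as np

def match_overall_rate(baseline_scores, candidate_scores, baseline_threshold=0.5):
    """Choose a threshold so both models flag the same overall fraction of examples."""
    target_rate = float((baseline_scores >= baseline_threshold).mean())
    # The (1 - target_rate) quantile of the candidate's scores yields that positive rate.
    return float(np.quantile(candidate_scores, 1.0 - target_rate))

# Hypothetical score arrays from the two models on the same evaluation set.
unconstrained_scores = np.random.rand(1000)
constrained_scores = np.random.rand(1000)

threshold = match_overall_rate(unconstrained_scores, constrained_scores)
```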

  • But as you can see, the not young FPR

  • has decreased by over 50%, which is a huge improvement.

  • You can also see that the false positive rate for young