AMIT SHARMA: Hi, all. Welcome to this session on Causality and Machine Learning as part of Frontiers in Machine Learning. I'm Amit Sharma from Microsoft Research and your host. I presume you would all agree that distinguishing correlation from causation is important. Even at Microsoft, for example, when we're deciding which product feature to ship or when we're making business decisions about marketing, causality is important. But in recent years, what we're also finding is that causality is important for building predictive machine learning models as well. Especially if you're interested in out-of-domain generalization and in models that are not brittle, you need causal reasoning to make them robust. In fact there are interesting results about adversarial robustness and privacy where causality may play a role. This is an interesting time at the intersection of causality and machine learning, and we now have a group at Microsoft that is looking at these connections; I'll post a link in the chat. For today, I thought we could all ask this question: what are the big ideas that will drive forward this conversation between causality and ML? I'm glad that today we have three really exciting talks. Our first talk is from Susan Athey, Economics of Technology Professor at Stanford. She'll talk about the challenges and solutions for decision-making in high-dimensional settings and how generative data modeling can help. In fact, when I started in causality, Susan's work was one of the first I saw that was making connections between causality and machine learning, so I'm looking forward to her talk. Next we'll have Elias Bareinboim, who will be talking about the three kinds of questions we typically want to ask about data and how two of them turn out to be causal, and much harder. He'll also talk about an interesting emerging field, causal reinforcement learning. And then finally we'll have Cheng Zhang from Microsoft Research Cambridge. She'll essentially give a recipe for how to build neural network models that are robust to adversarial attacks, and as you've guessed from the session, she'll use causal reasoning. At the end we'll have 20 minutes for open discussion; all the speakers will be live for your questions. Before we start, one quick note: all these talks are prerecorded, so if you have any questions during a talk, feel free to ask them on the hub chat itself. Our speakers are available to engage with you on the chat even while the talk is going on. With that, I'd like to hand it over to Susan. SUSAN ATHEY: Thanks so much for having me here today in this really interesting session on machine learning and causal inference. Today I'm going to talk about the application of machine learning to the problem of consumer choice. I'm going to present some results from a couple of papers I've been working on that analyze how firms can use machine learning to do counterfactual inference for questions like how should I change prices or how should I target coupons. I'll also talk a little bit about the value of different types of data for solving that problem. Doing counterfactual inference is substantially harder than prediction. There can be many data situations where it's actually impossible to estimate counterfactual quantities.
It's essential to have experimental or quasi-experimental variation in the data to separate correlation from causal effects. That is, whatever treatment we're studying needs to vary for reasons that are unrelated to other unobservables in the model. We need the treatment assignment to be as good as random after adjusting for other observables. We also need to customize machine learning optimization for estimating the causal effects and counterfactuals of interest instead of for prediction. Indeed, model selection and regularization need to be quite different if the goal is to get valid causal estimates. That's been a focus of research, including a lot of research I've done. A second big problem in estimating causal effects is statistical power. In general, historical observational data may not be informative about causal effects. If we're trying to understand the impact of changing prices, and prices always changed in the past in response to demand shocks, then we're not going to be able to learn what would happen if I changed the price at a time when there wasn't a demand shock; I won't have data on that from the past. I'll need to run an experiment, or I'm going to need to focus on just a few price changes or use statistical techniques that focus my estimation on a small part of the variation in the data. Any of those things leaves me with less statistical power than I would like. Another problem is that effect sizes are often small. Firms are usually already optimizing pretty well, so it would be surprising if making changes led to large effects, and the most obvious ideas for improving the world have often already been implemented. That's not always true, but it's common. And finally, personalization is hard. If I want to get exactly the right treatment for you, I need to observe lots of other people just like you, and I need to observe them with different values of the treatment variable that I'm interested in. That's very difficult, and often it's not possible to get the best personalized effect for someone in a small dataset; instead, I'm averaging over people who are really quite different from the person of interest. For all of these reasons, we need to be quite cautious in estimating causal effects, and we need to consider carefully which environments enable that estimation and give us enough statistical power to draw conclusions. Now I want to introduce a model that's commonly used in economics and marketing to study consumer choice. This model was introduced by Daniel McFadden in the early 1970s; he won the Nobel Prize for this work. The main crux of his work was to establish a connection between utility maximization, a theoretical model of economic behavior, and a statistical model, the multinomial logit. This modeling setup was explicitly designed for counterfactual inference. The problem he was trying to solve was what would happen if we expand BART, the public transportation system in the Bay Area: if I expand BART, how will people change their transportation choices when they have access to this new alternative? So in the basic model, an individual's utility is their mean utility, which varies by user, item, and time, plus an idiosyncratic shock.
In general, we're going to have a more specific functional model for the mean utility, and that's going to allow us to learn from seeing the same consumer over time and also to extrapolate from one consumer to another. We're going to assume that the consumer maximizes utility among items in a category when making this choice, so they choose the item i that maximizes their utility. The nice thing is that if the error has a type-one extreme value distribution and is independent across items, then we can write the probability that the user's choice at time t is equal to i in the standard multinomial logit functional form. So utility maximization, where these mus are the mean utilities, leads to multinomial logit probabilities. Data about an individual's purchases can then be used to estimate the mean utility. In particular, we write their mean utility as something that depends on the item and the user but is constant over time, so this is just their mean utility for this item, like how much they like a certain transportation choice, plus a second term which is the product of the price the user faces at time t for item i and a price-preference parameter that's specific to the user. If I have this form of preferences, and the price varies over time while the user's preference parameters stay constant, I'll be able to estimate how the user feels about prices by looking at how their choices differ across different price scenarios. And if I pool data across users, I'll then be able to understand the distribution of consumer price sensitivities as well as the distribution of user utilities for different items. In a paper with Rob Donnelly, David Blei, and Fran Ruiz, we look at how we can combine machine learning methods and modern computational methods with traditional approaches to studying consumer purchase behavior in supermarkets. The traditional approach in economics and marketing is to study one category, like paper towels, at a time. We then model consumer preferences using a small number of latent parameters. For example, we might allow a latent parameter for how much consumers care about prices, and a latent parameter for product quality. But other than that, we would typically assume that there's a small number of observable characteristics of items and some common coefficients which express how all consumers feel about those characteristics. The traditional models also assume that items are substitutes within a category, and they ignore other categories. So you might study consumer purchases of paper towels ignoring everything else in the supermarket, just throwing all that data away. In our approach, we maintain the utility maximization framework, but instead of studying one category, we study many categories in parallel: more than 100 categories and more than a thousand products at the same time. We maintain the assumption that categories are independent and that items are substitutes within categories, and we select categories where that's true, that is, categories where consumers typically only purchase one brand or one of the items. We then take the approach of a nested logit, which comes from the literature in economics and marketing, in each category, where there's a shock to an individual's need to purchase in the category at all, but then, conditional on purchasing, the idiosyncratic shocks to the consumer's utility are independent. (The basic utility and choice probabilities are written out below.)
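To make the setup concrete, here is one way to write the utility model and choice probabilities just described; the notation is mine and may differ from the slides.

```latex
% Utility of user u for item i at time t: a user-item mean utility plus a
% user-specific price term and an idiosyncratic shock.
U_{uit} = \mu_{uit} + \varepsilon_{uit},
\qquad
\mu_{uit} = \lambda_{ui} + \gamma_{u}\, p_{it}.
% With i.i.d. type-1 extreme value shocks \varepsilon_{uit}, utility
% maximization within the category gives the multinomial logit probabilities
\Pr(\text{choice}_{ut} = i)
  = \frac{\exp(\mu_{uit})}{\sum_{j} \exp(\mu_{ujt})}.
```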
So having the single shock to purchasing at all effectively introduces correlation among the purchase probabilities of the items within the category. Now, the innovation where the machine learning comes in is that we use matrix factorization for the user-item preference parameters. Instead of having, for each consumer, a thousand different latent parameters, one for each product they might consider, we use matrix factorization: there's a lower-dimensional vector of latent characteristics for the products, and consumers have a lower-dimensional vector of latent preferences over those characteristics. That allows us to improve upon estimating a hundred separate category models. We're going to learn about how much you like organic lettuce from whether you chose organic tomatoes, and we'll also learn about whether you like tomatoes at all from whether you purchased lettuce in the past. I won't have time to go through it today, but this is the layout of what we call the nested factorization model, showing the nest where the consumer decides whether to purchase in the category at all and then, conditional on purchasing, decides what to purchase, and in each case we have vectors of latent parameters describing the consumer's utility for categories and for items. One of the reasons this type of model hasn't been done in economics and marketing in the past is that the standard approach, if you were going to do a model like this, would be to use either classical methods like maximum likelihood without very many latent parameters, or Markov chain Monte Carlo Bayesian estimation, which historically had very limited scalability. What we do in our papers is use variational Bayes, where we approximate the posterior with a parameterized distribution and minimize the KL divergence to the true posterior using stochastic gradient descent. We show we can overcome a number of challenges: in particular, introducing price and time-varying covariates slows down the computation a fair bit, and the substitutability within categories leads to nonlinearities. Despite that, we're able to overcome these challenges. Once we have estimates of consumer preferences for products, as well as estimates of consumer sensitivity to price, we can try to validate our model and see how well we actually do in assessing how consumer demand changes when prices change. In our data we see many, many price changes. At the particular grocery store we have data from, prices typically change on Tuesday night, so in any particular week there may be a change in price from Tuesday to Wednesday. In order to assess how well our model does in predicting the change in demand in response to a change in price, we held out test data from weeks with price changes. In those weeks we break the price changes into different buckets by the size of the price change. We then look at the change in demand from Tuesday to Wednesday in those weeks. Finally, we break out those aggregations according to which type of consumer we have for each item. In particular, in a week where we have a change in price for a product, we can characterize consumers as very price sensitive, medium price sensitive, or not price sensitive for that specific product, and then we can compare how demand changes for each of those three groups. (A toy sketch of the factorized utilities and nested choice probabilities is included below.)
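As a rough illustration of the factorized utilities and the nested choice structure described above, here is a toy sketch; the parameterization (latent dimensions, the log-sum-exp inclusive value, an outside option with utility zero) is my own simplification, not the paper's actual model or code.

```python
import numpy as np

# Toy sketch of "nested factorization" style choice probabilities.
rng = np.random.default_rng(0)

n_items, k = 8, 3                      # items in one category, latent dimension
theta_u = rng.normal(size=k)           # user's latent preferences
beta = rng.normal(size=(n_items, k))   # items' latent characteristics
gamma_u = -1.5                         # user-specific price sensitivity (negative)
prices = rng.uniform(1.0, 4.0, size=n_items)

# Mean utilities via matrix factorization plus a price term.
mu = beta @ theta_u + gamma_u * prices

# Within-category choice probabilities, conditional on purchasing (multinomial logit).
p_item_given_buy = np.exp(mu - mu.max())
p_item_given_buy /= p_item_given_buy.sum()

# Nested structure: a single category-level shock drives whether the user buys
# in the category at all; the log-sum-exp "inclusive value" summarizes the category.
category_utility = np.log(np.exp(mu).sum())
u_outside = 0.0                                 # utility of not purchasing (assumed)
p_buy = np.exp(category_utility) / (np.exp(category_utility) + np.exp(u_outside))

p_item = p_buy * p_item_given_buy               # unconditional purchase probabilities
print(round(p_buy, 3), p_item.round(3))
```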
This figure illustrates what we find in the held-out test data. In particular, we find that the consumers we predict to be the least price sensitive in fact don't seem to respond very much when prices change, while the consumers who are most price sensitive, the most elastic as we say in economics, are the ones whose quantities change the most when prices change. Once we're confident that we have a good model of consumer preferences, we can try to do counterfactual exercises, such as evaluating what would happen if I introduced coupons and targeted them at individual consumers. We'll take a simple case where we consider only two prices: the high price, which is the typical price, and the low price, which is the discounted price. What we do is look into the data and evaluate what would happen if we sent those targeted coupons out. For each product we look at the two most common prices that were charged in the data. We then assess which consumers would be most appropriate for coupons. We might say, for example, that I want to give coupons to a third of consumers, and I can see which consumers are most price sensitive and most likely to respond to those coupons. I can then use held-out test data to assess whether my coupon strategy is actually a good one, and that will allow me to validate again whether my model has done a good job of distinguishing the more price sensitive consumers from the less price sensitive consumers. This figure illustrates that for a particular product there were two prices, the high price and the low price, charged over time. In the actual data, different users might have come to the store sometimes on a low-price day and sometimes on a high-price day, indicated by blue or red. What we then do is ask what our model would say about who should get the high price and who should get the low price. So we can counterfactually reassign, say, the top four users to high prices, indicated by these orange squares, and counterfactually reassign the fifth and sixth users to the low price, indicated by the green rectangles. Now, since the users we assigned to high saw a mix of low and high prices, I can compare how much those users purchased on high-price days and low-price days, and I can also look among the people I would counterfactually assign to low prices and see what the impact of high versus low prices is for those consumers. I can use those estimates to assess what would happen if I reassigned users according to my counterfactual policy. When I do this, I can compare what my model predicts would happen in the test set to what actually happened in the test set. What I find, somewhat surprisingly, is that what actually happens in the test set is even more advantageous for the firm than what the model predicts. In particular, our model predicts that if I reallocate prices to consumers according to what our model suggests would be optimal from a profit perspective, we can get an 8% increase in revenue. That is, instead of varying prices from high to low from day to day, we always keep them high and then target coupons to the more price sensitive consumers. If we look at what actually happened in our held-out test data, the benefits of high versus low prices, and the difference in those benefits between the high- and low-assigned consumers, are such that in the test set we would have gotten a 10 or 11% increase in profits had the prices been set in that way. (A toy sketch of this held-out evaluation follows below.)
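Here is a minimal, hypothetical sketch of the held-out coupon-targeting check described above; the column names, the one-third targeting rule, and the simple comparison of purchase rates are illustrative assumptions on synthetic stand-in data, not the paper's actual procedure.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for held-out test data: each row is one user-visit.
rng = np.random.default_rng(1)
test = pd.DataFrame({
    "user": rng.integers(0, 200, size=5000),
    "high_price_day": rng.integers(0, 2, size=5000).astype(bool),
    "bought": rng.integers(0, 2, size=5000),
})
# Price sensitivity per user, as if estimated on training data (more negative = more sensitive).
gamma_hat = pd.Series(rng.normal(-1, 0.5, size=200), name="gamma_hat")

# Target coupons (the low price) to the most price-sensitive third of users.
cutoff = gamma_hat.quantile(1 / 3)
assigned_low = gamma_hat < cutoff
test["assigned_low"] = test["user"].map(assigned_low)

# Because every user saw a mix of high- and low-price days in the test set,
# we can compare purchase rates under each price within each assigned group.
summary = test.groupby(["assigned_low", "high_price_day"])["bought"].mean().unstack()
print(summary)  # rows: counterfactual assignment, columns: price actually seen
```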
To conclude, the approach I've outlined is to try to learn the parameters of consumers' utility through revealed preference. That is, use the choices that consumers make to learn about their preferences over product characteristics and prices, and then predict their responses to alternative situations. It's important to find a dataset that's large enough and has sufficient variation in price to isolate the causal effects of prices, and also to assess the credibility of the estimation strategy. And it's important to select a counterfactual to study where there's actually enough variation in the data to be able to assess and validate whether your estimates are right. I illustrated two cases where I was able to use test set data to validate the approach: you use the training data to assess, for example, which consumers are most price sensitive, then look at the test data and see if their purchase behavior varies with price in the way your model predicts. In ongoing work, I'm trying to understand how different types of data create value for firms. In particular, if firms are using the kinds of machine learning models I've been studying and they use those estimates to do things like target coupons, we can ask how much profits go up as they get more data, and how that answer varies if it's data about lots more consumers versus, say, retaining consumer data for a longer period of time. Preliminary results are showing that retaining user data for a longer period of time, so you really get to know an individual consumer, can be especially valuable in this environment. Overall, I think there's a lot of promise in combining tools from machine learning, like matrix factorization but potentially also neural nets, with some of the traditional approaches from causal inference. Here we've put those things together: we used functional forms for demand, the concept of utility maximization, and approaches to counterfactual inference from economics and marketing, together with computational techniques from machine learning, in order to do this type of analysis at large scale. ELIAS BAREINBOIM: Hi, guys. Good afternoon. I'm glad to be here online today. Thank you for coming, and thank you to the organizers, Amit and Amber, for inviting me to speak at the event today. My name is Elias Bareinboim. I'm from the Computer Science Department and the Causal Artificial Intelligence Lab at Columbia University. Check my Twitter; I have discussions there about artificial intelligence and machine learning. Also, apologies for my voice, I'm a little bit sick, but I'm very happy to be here today. I will be talking about what I have been thinking about the foundations of artificial intelligence, how it relates to causal inference, and the notions of explainability and decision-making. Here is the outline of the talk. I'll start from the beginning, defining what a causal model is, and I will introduce three basic results that are somewhat intertwined. I usually say that if we understand them, we understand about 50% of what causal inference is about; there are many more technical results, but this is the most important conceptual part. First I'll start with structural causal models, which is the most general definition of a causal model that we know to date, due to Pearl himself. Then I'll introduce the second and third results: the second is known as the Pearl Causal Hierarchy, the PCH, which was named after him.
This is the mathematical object used by Pearl and Dana Mackenzie in The Book of Why. If you haven't read the book, I strongly recommend it; it's pretty good, since it discusses the foundations of causal inference and how they relate to the future of AI and machine learning, most prominently in the last chapter, as well as the intersection with the other sciences. This talk is partially based on a chapter we've been working on about Pearl's hierarchy and the foundations of causal inference, joint work with Juan Correa, my student at Columbia, and Duligur Ibeling and Thomas Icard, collaborators from Stanford University. This is the link to the chapter; take a look, because most of the things I'm talking about here are in there in some shape or form. Then I'll move to another result, the causal hierarchy theorem, which was proven in that chapter, settling a 20-plus-year-old open question, and which is used as one of the main building blocks. Then I'll try to connect this with machine learning, more specifically supervised learning and reinforcement learning, and how they fit with the causal hierarchy, also called the ladder of causation in the book. Then I'll talk a little bit about causal inference and cross-layer inferences, and move to the design of artificial intelligence systems with causal capabilities. I will come back to machine learning methods, in particular deep learning and RL. My goal here is to introduce the ideas, principles, and some tasks; I will not focus on implementation details. I should also mention that this is essentially the outline of the course I'm teaching this semester at Columbia, so bear with me: I'll try to give you the idea, and if you're interested in learning more, check the references or send me a message. Now, without further ado, let me introduce the idea of what a causal model, a structural causal model, is. We will take a process-based approach to causality. The idea is borrowed from physics, chemistry, sometimes economics, and other fields: we have a collection of mechanisms underlying some phenomenon that we're theorizing about. In this case, suppose you're trying to understand the effect of taking some drug on headache. Those are observable variables, and we have the corresponding mechanisms, f sub D for the variable drug and f sub H for the variable headache. Each mechanism takes as arguments a set of observables, in the case of f sub D the variable age, and unobservables, in this case U sub D. U sub D stands for all the variables in the universe, other than age, that generate variation in whether someone takes the drug. The same holds for U sub H: drug and age are the observable arguments, and U sub H stands for all the variables in the universe, other than drug and age, that determine whether someone would or would not have a headache. This is the real process; the functions f sub D and f sub H may be complicated, and they are usually not instantiated. Usually we have some coarser description, and this is the causal graph related to this collection of mechanisms. The causal graph is nothing but a partial specification of the system, in which an arrow just means that one variable participates in the mechanism of another. I'll just use X, Y, Z to make the communication easier.
For example, age participates in the mechanism f sub H, so there's an arrow from Z to Y; the same with drug, so there's an arrow from X to Y; and age also participates in f sub D. Note that in the graph we don't have the particular instantiation of the functions; we're just preserving which arguments enter which mechanism. Now, this is a process that unfolds in time, and we can sample from a process like that. This gives rise to a distribution, the observational, non-experimental distribution over the observables, P of X, Z, Y in this case. Usually when we're doing machine learning, supervised or unsupervised learning, we're playing on this side of the picture. Causality, on the other hand, is about going into the system and changing something, overriding, as we computer scientists like to say, one of the functions. Here we override the equation describing the natural way people take the drug and set drug equal to yes. I won't introduce the entire layer given the time, but this is related to the do-operator, in which you override the original mechanism, f sub D, in this case with do(X = yes). Now we no longer have the original equation; we have a constant. You could have more elaborate interventions as well, but we don't have time for that in these slides. This gives the semantics of the operation without necessarily having access to the mechanisms themselves. Here is the graphical counterpart of that. Note that f sub D no longer has age as an argument; there's just a constant, and in the graph we cut the incoming arrows into X. This is the mutilated graph. Again, if we were able to contrive reality in this way, we could sample from this process, which gives rise to the distribution called the interventional or experimental distribution, P of Z, Y given do(X = yes). I'm using the variables X, Z, Y here, but X could be any decision, Y any outcome, and Z any set of covariates or features. Now, what is the challenge here? The challenge is that in reality this upper layer, the mechanisms, is almost never observed; that's why I put it in gray. So this is one of the things we don't have in practice, or only very rarely. Another challenge is that the data we do have usually comes from the left side, from the system naturally unfolding or evolving, and we would like to understand what the effect would be if we went there and intervened on the system, of our own will or deliberately as a policymaker or decision-maker, setting this variable to yes. We have data from the left, from seeing, and we want to infer what would happen if we did something to this system. Now, we can generalize this idea and define the structural causal model; this is the definition from Pearl's Causality book, around 2000. I won't go through the definition step by step, but suffice it to say you have observables, or endogenous variables, like age, drug, or headache, and exogenous, unobserved variables, like the U sub D and U sub H we had before, and you have a collection of mechanisms, one for each of the observed variables. (A toy simulation of this drug-age-headache example, contrasting the observational and interventional distributions, is sketched below.)
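A small simulation with invented functional forms can make the contrast between sampling the natural process and sampling under do(X = yes) concrete; the probabilities and mechanisms below are purely illustrative, not anything from the talk.

```python
import numpy as np

# Toy SCM for the drug (X), age (Z), headache (Y) example.
rng = np.random.default_rng(0)
N = 100_000

def f_drug(age, u_d):
    # Invented mechanism: older people are more likely to take the drug.
    return (age + u_d > 1.0).astype(int)

def f_headache(drug, age, u_h):
    # Invented mechanism: headache depends on drug, age and everything else (u_h).
    return ((0.6 * age - 0.4 * drug + u_h) > 0.5).astype(int)

def sample(intervene_drug=None):
    age = rng.uniform(0, 1, N)             # Z
    u_d = rng.uniform(0, 1, N)             # exogenous noise for drug
    u_h = rng.uniform(0, 1, N)             # exogenous noise for headache
    if intervene_drug is None:
        drug = f_drug(age, u_d)            # natural mechanism
    else:
        drug = np.full(N, intervene_drug)  # do(X = x): override f_D, cut arrows into X
    headache = f_headache(drug, age, u_h)
    return drug, headache

# Observational (layer-one) quantity versus interventional (layer-two) quantity.
drug_obs, head_obs = sample()
print("P(headache=1 | drug=1)     =", head_obs[drug_obs == 1].mean())
drug_do, head_do = sample(intervene_drug=1)
print("P(headache=1 | do(drug=1)) =", head_do.mean())
```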
The exogenous variables, U sub D and U sub H, can be seen as something like boundary conditions in physics, summarizing the conditions outside the system, and we sprinkle probability mass over them: we have a distribution P of U over the exogenous variables. Now, we understand very well how these systems work; there's awesome work by Halpern at Cornell and by Galles and Pearl giving this type of understanding of these systems. Today we're interested in a different result, which is the following. Once we have a structural causal model M that's fixed, a particular environment the agent is set in, it induces the Pearl Causal Hierarchy, or PCH, which is called the ladder of causation in The Book of Why. Let's try to understand the PCH and its different layers. The first layer is called the associational layer, the activity of seeing: how would seeing some variable X change my belief in the variable Y; what does a symptom tell me about a disease. Syntactically it's written as P of Y given X. This layer is very closely related to machine learning, supervised and unsupervised learning. There are different types of models there: Bayes nets are one, and you have decision trees, support vector machines, deep neural networks, and different kinds of neural networks; they all live in this layer. Quite importantly, we've learned how to scale up these inferences: given X, which could be the pixels, a set of features on the order of thousands or even millions, we try to predict some label Y, for example whether the image is a cat or not. It's a classic and very hard problem, and we've been more or less mastering it, with remarkable breakthroughs in the field in the last 20 years, I should say. Now there is a qualitatively different layer, layer two, the interventional layer. It's related to the activity of doing: what if I do X; what if I take the aspirin, will my headache be cured? The counterpart in machine learning would be reinforcement learning; here you have causal Bayesian networks, Markov decision processes, partially observable MDPs, and so on. This is quite important, and I'll tell you more about that. Symbolically, you write P of Y given do(X), comma C; that's the notation. Then there is yet another qualitatively different layer, layer three, the counterfactual layer. I'll come back to this soon, but it's related to the activity of imagining: agents that have imagination, retrospection, introspection, and notions of responsibility and credit assignment. It is the layer that gave The Book of Why its name; this is the 'why' type of question: what if I had acted differently, was it the aspirin that stopped my headache? Syntactically, we have this kind of nested counterfactual. I took the drug, that is X equals x prime, an instantiation of the variable X, pardon my notational license here, and I'm cured, that is Y equals y prime. Now I can ask: would I have had the headache, that is y, the opposite of y prime, had I not taken the drug, that is x, the opposite of x prime? In the actual world I took the drug and I'm fine, and I ask: had I not taken the drug, would I still be okay, or not? (The canonical expressions for the three layers are summarized below.)
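For reference, here are the canonical queries of the three layers just described, written in standard notation:

```latex
\text{Layer 1 (associational):}\quad  P(y \mid x) \\
\text{Layer 2 (interventional):}\quad P\big(y \mid do(x),\, c\big) \\
\text{Layer 3 (counterfactual):}\quad P\big(y_{x} \mid x', y'\big)
```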
There's no exact counterpart to this in machine learning; if you have some particular instance in mind you can ask me offline, but all of these things written in the literature come from the structural causal model. Now I would like to look at what going beyond standard machine learning means; I just mentioned this layer three. Specifically, I'd like to highlight a family of inferential tasks that fall very naturally out of causal reasoning, what I call cross-layer inferences. Suppose that as input you have some data, and most of the data available today is observational, passively collected; say 99 percent of the data we have comes from layer one (don't hold me to the exact number), while 90 or 99 percent of the inferences we want to make are about doing, the interventional layer, or about counterfactuals, layer three: about policies, treatments, and decisions, just to cite a few examples. Then the question we're trying to answer is a cross-layer one: we have the data, and the inference one wants to do is how to use data collected from observations, passively, that's layer one, maybe coming from a hospital, to answer questions about interventions, that's layer two. And under what conditions can we do that? Why is this task different, why is the causal problem nontrivial, is usually a good question. The answer is that the SCM is almost never observed, with a few exceptions in fields such as physics, chemistry, and sometimes biology, in which the very target is to learn this collection of mechanisms; in general we do not observe it. In most of the fields we're interested in within AI and machine learning, there's a human in the loop, some type of interaction, and given that we cannot read minds and cannot isolate the environment in some precise way, we don't have a controlled environment and usually cannot get that kind of help. Still, the observation is that this collection of mechanisms underlying the system we're trying to understand does exist and does induce the PCH, and we still have the query and the data, the cross-layer task: given data that is only a fragment of what the SCM produces, say layer one, observational data, how can we answer a question from layer two? We have an observed phenomenon and we're trying to reach layers that are not observed, or not directly realizable; that could be layer three as well. How can we move across these layers? I like the metaphor I use in class: reality is complicated, and we only observe fragments, or shadows of fragments, of the PCH; under what conditions can we make inferences about the other layers, layer two or three? Now I'd like to talk about impossibility results for these cross-layer inferences. As usual, let me read the task: infer the causal quantity P of Y given do(X), that's layer two, from observational data, that's layer one. That's the task I just showed. Now, the effect of X on Y is not identifiable from the observed data whenever there exist collections of mechanisms, SCMs, capable of generating the same observed behavior in layer one, P of X and Y, while disagreeing with respect to the causal query. To witness, we show two models.
This is model one, and this is model two; I'll leave it for you to go home and think about, but these are simple models, and this is XOR, by the way, not X. These models generate the same observed behavior in layer one, P1 equal to P2; however, they generate different layer-two behaviors, different layer-two predictions. In this case, model one says the probability of Y given do(X = 1) is equal to one half, while model two says it is one. In other words, if all we have is layer one, what can we say about layer two? There's not enough information there to move. That's the result. I would now like to make a broader statement and generalize this idea. Again, this is joint work with Correa, Ibeling, and Icard, from the paper I mentioned earlier, where we proved the following theorem: with respect to a measure over SCMs, under some technical conditions, the subset of SCMs in which the PCH collapses has measure zero. That's the formal version; you can go home and try to parse it. Informally, for almost any SCM, in other words almost any possible environment in which your agent or your system is embedded, the PCH does not collapse: the layers of the hierarchy remain distinct. There is more knowledge in layer two than in layer one alone, and more knowledge in layer three than in layers one and two; it will essentially never happen that one layer determines the one above it. This settles an open problem stated by Pearl in Chapter 1 of The Book of Why, which says that to answer questions at layer i, for example about interventions at layer two, one needs knowledge at layer i or above. Now, the natural question you could be asking is: Elias, how are causal inferences possible after all? Does this mean we shouldn't do causal inference at all, given that one layer doesn't determine the others? Not at all. This motivates the following observation: if you know nothing about the SCM, the causal hierarchy theorem applies and you cannot move across layers; if you know a little bit about the SCM, it may be possible. What is this little bit? It's what we call structural constraints, which you can encode in a graphical model. There are different families of graphical models: you can have layer-one graphical models, layer-two graphical models, and so on, and in principle it may be possible to move across layers depending on which constraints you encode. I'd like to examine for just one minute a very popular layer-one graphical model, the Bayesian network, versus a causal Bayesian network. Not all graphical models are created equal. Take the same task from the previous theorem, where it was shown to be impossible to move from layer-one data to a layer-two statement; a small simulation of that kind of construction is sketched below.
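Here is a small simulation in the spirit of that two-model argument. The structural equations below are my own reconstruction, chosen to reproduce the one-half-versus-one answers mentioned above; the models on the slides use a related XOR-based construction.

```python
import numpy as np

# Two SCMs with the same layer-one behaviour but different layer-two answers.
rng = np.random.default_rng(0)
N = 200_000
U = rng.integers(0, 2, N)          # shared exogenous variable, Bernoulli(1/2)

def model(which, do_x=None):
    X = U.copy() if do_x is None else np.full(N, do_x)
    Y = U.copy() if which == 1 else X.copy()   # M1: Y := U ;  M2: Y := X
    return X, Y

for which in (1, 2):
    X, Y = model(which)                        # observational regime
    pxy = np.mean((X == 1) & (Y == 1))
    _, Y_do = model(which, do_x=1)             # interventional regime
    print(f"model {which}: P(X=1,Y=1) = {pxy:.3f}, "
          f"P(Y=1 | do(X=1)) = {Y_do.mean():.3f}")
# Both models produce the same observational P(X, Y), yet P(Y=1 | do(X=1))
# is 1/2 in model 1 and 1 in model 2, so layer one cannot settle layer two.
```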
Now, what if you have a Bayes net that's compatible with the data, say X pointing to Y, for whatever data we have over X and Y, and we would like to know the layer-two quantity P of Y given do(X)? If you know a little bit of causality: since there's no unobserved confounder in this graph, P of Y given do(X) is equal to P of Y given X, by ignorability or backdoor admissibility, the names we use for the absence of unobserved confounding. Now I pick another Bayes net, another layer-one object that also fits the data, but with the arrow going from Y to X, and ask what the causal effect of X on Y would be in this case; this network is still compatible with the data. It turns out that, by the semantics of intervention, the do-operator, you cut the arrow coming into X, because we are the ones controlling this variable, which gives P of Y given do(X) equal to P of Y. This highlights that the two networks give different answers: there is not enough information about the underlying SCM in a Bayesian network to allow causal inference. A layer-one object does not encode the constraints coming from the SCM, so this is not the object we're looking for. Now I'd like to consider a second object, a layer-two graphical model. You can go to the paper where we define it more formally, but the point is that it is possible to encode layer-two constraints coming from the SCM, the idea of the asymmetry of causal relations, and I'd like to focus on this now. The idea is that there are positive instances where we can do cross-layer inferences. Let's consider layer-two graphical models. The mental picture I'd like you to construct is the following. This is the space of all structural causal models. Here are the models compatible with the graph G, a layer-two graphical model. These are the SCMs compatible with P of V, that is, that could generate the observed distribution. And we ask whether the models in the intersection of these sets all give the same P of Y given do(X). What I'm saying is that there are situations where, for any two structural models encoding the unobserved nature, call them N1 and N2, such that they have the same graph G, if they generate the same P of V, the same observed distribution, then they will also generate the same causal distribution. That's the notion of identifiability, and it is achievable in some settings. Now let me try to summarize what I've said so far about the relationship between reality, that is, the underlying mechanisms we don't have access to, our model of reality, which could for example be a graphical model, and the data. We started from a well-defined world, semantically speaking, in which an SCM, a pair of mechanisms F and a distribution P of U over the exogenous variables, implies the PCH, which captures different aspects of the underlying nature, different types of behavior: layers one, two, and three. We acknowledge that the collection of mechanisms is there, but inference is limited, given that the SCM is almost never observed and, due to the causal hierarchy theorem, we have constraints on how to move across the layers. We then move to scenarios in which partial knowledge of the SCM is available, such as a causal graph, a layer-two object. Causal inference theory helps us determine whether the targeted inference is allowed. In the prior example the inference is from layer one to layer two: namely, whether the graph together with P of V, the layer-one distribution, allows us to answer P of Y given do(X). And sometimes this is not possible.
I mean, for weak models: if you have a weak model, the mental picture is that sometimes the true model generates this distribution, but there's another model with the same graph G that induces the same observational distribution yet generates a different quantity, call it P star of Y given do(X). In that situation we cannot do the inference about the higher layer with layer-one data alone. Now, I'd like to spend two minutes summarizing how reinforcement learning fits into this picture. I spent three hours last week at ICML talking about that; go to crl.causal.net if you want the details, and I'll give you the two-minute version of what happened there. This is the PCH. My comment is that typical RL is usually confined to layer two, or a subset of layer two; it usually cannot leverage data coming from layer one, or only very rarely, and it doesn't support making statements about counterfactuals, the layer-three type of questions. That's the global picture. This is the canonical picture of RL: you have an agent embedded in an environment; the agent is a collection of parameters; the agent observes some state, commits to an action, and observes a reward. There's a lot of discussion about model-based versus model-free RL. I'd like to say that the model in model-based RL as discussed in the literature today is not a causal model; it's important not to get confused, and you can ask me more about this later. The difference in the causal reinforcement learning perspective, which I spent almost three hours discussing in the tutorial, is what we leverage: officially, the collection of mechanisms we just studied, the structural causal model, becomes the model of the environment, and on the agent's side you have the graph G. The two key observations are that the environment and the agent are tied to this pair, the SCM on the environment side and the causal graph on the agent side, and that this defines different types of interactions following the PCH: observing, experimenting, and imagining are these different modes. Please check CRL.causal.ai for more details, and we can talk later about the different types of tasks that we weren't acknowledging before. I'd like to move quickly and spend 30 seconds discussing how deep learning fits into this picture. Here's the same picture I had about ten slides ago: on the left side the observational world, and on the right side the interventional world. This is about reality versus model. We have data sampled from reality, and this allows us to estimate the hat distribution, P hat, and we have results saying that the distance between the hat distribution and the original distribution keeps decreasing as we collect more data, which makes it sensible to operate in terms of the hat distribution. For sure, you can use some formalism to learn the hat distribution, including a deep network variant. The challenge is that usually we're interested in inference on the right side, and we have zero data points from the right side. I'm talking broadly here, not about reinforcement learning; reinforcement learning has its own problems. But you have zero data here, so how on earth can you learn the hat distribution there? What some people do is simply connect the DNN that was learned from the left side to the right side.
But there's nothing in this data, nor in the deep net, that takes into account the structural constraints we discussed, nor the causal hierarchy theorem. It makes no sense to simply connect them; there's something missing there. I could talk for an hour if you invite me to talk about neural nets and causal inference, but this is the picture I want to start the conversation with. I would like to conclude now, and apologies for the short time; it's a very short talk, and thanks for the opportunity. Causal inference and AI are fundamentally intertwined, and novel learning opportunities emerge when this connection is fully understood. Most of the proposals for general AI today, including deep learning and the huge discussions we're having in reinforcement learning, are orthogonal to the causal machinery currently available; they're not even touching these problems. In practice, failure to acknowledge the distinct features of causality almost always leads to poor decision-making and superficial explanations. The broader agenda we've been pursuing for almost 10 years now is developing a framework, principled algorithms, and tools for designing causally sensible AI systems, integrating the three PCH layers, observational, interventional, and counterfactual, as data, modes of reasoning, and knowledge. My strong belief is that this will lead to a natural treatment of human-like explainability, given that we are causal machines, and to rational decision-making. I would like to thank you for listening. This is joint work with the Causal AI Lab at Columbia and collaborators: thanks to Juan, Sanghack, Kai-Zhan, Judea, Andrew, Duligur, Thomas, and all the others; it's a huge effort. Thanks, and I'll be glad to take questions. CHENG ZHANG: Hello, everyone. I'm Cheng Zhang from Microsoft Research UK. Today I'm going to talk about a causal view on the robustness of neural networks. Deep learning has been very successful in many applications; however, it's also vulnerable. Let's take our favorite digit classification task, for example. Deep learning can achieve 99 percent accuracy on it, which is impressive. However, if we just shift the image a little bit, not much, with the shift range no more than 10 percent, the accuracy drops to around 85 percent, which is already not satisfying for applications. If we enlarge the shift range to 20 percent, the accuracy drops to half, which is not acceptable anymore. The plot shows that the more we shift, the worse the performance. This is not desirable, especially for such minor shifts. Okay, now we would like to be robust, so let's vertically shift the images in the training set as well. This is an adversarial-training-style setting: we add vertical shifts of up to 50 percent of the image to the training set, and you can see that the performance is much better, with about 95 percent accuracy even when we shift up to 50 percent. But have we solved the problem now? What if I didn't know that the test-time manipulation would be a vertical shift, and it turned out to be a horizontal shift? Say we use horizontal shifts in the training data and then, at test time, we test images with vertical shifts as before. The line shows the performance with vertical shift: it is actually even worse than training with clean data only. So adversarial training does not solve the robustness problem in deep learning, because it can even harm the robustness to unseen manipulated images, and we never know all possible attacks.
This is a real issue for deep learning. This is a simple task, classifying digits; what about healthcare or policymaking, where decision quality is critical? Humans, by contrast, are very good at this task. We can still recognize a digit if it has shifted a little bit or if the background changes, because we're very good at causal reasoning: we know that a shift or a background change does not change the digit itself. This modular property of causal reasoning is also sometimes referred to as independent mechanisms. The causal relationships from the previous example can be summarized in this way: the final observation is an effect of three types of causes. One is the digit number, another is the writing style and so on, and the last is the different manipulations, such as shift or rotation. The same applies to the other example: the observation of a cat is caused by whether it really is a cat, by its fur, color, and other features, and by the environment, such as the viewpoint and the background. We use Y here to denote the target of the task, Z to denote the factors that cannot be manipulated, and M to denote the factors that can be manipulated. M is the factor we'd like to be robust to. Before diving into the robustness details, let's review what a valid attack is. We have seen shifts of the digit, or background changes for the cat. Another common attack is to add a bit of noise, as we see here: with a very small amount of noise we can completely change the classification a deep learning model gives. We can rotate the image, and sometimes even add stickers. It's also been pointed out that noise can fool humans; the left image looks more like a dog than a cat to me. The question is whether this is still a valid attack: what type of change, and how much change, can we consider to still form a valid attack? We'd like to define a valid attack through a causal lens. Taking the previous example from a causal view, we can see that a valid attack is generated from an intervention on M; together with the original Y and Z it produces the manipulated data X. In general, a valid attack should not change the underlying Y, because this is the target; thus we cannot intervene on the target Y, or on the parents of Y if it has any. Z is also not allowed to be intervened on by our definition, for example the genetic features of a cat or the writing style of the digit itself. In this regard, recent adversarial attacks can be considered as specific types of interventions on M, such as adding noise or manipulating the image; in this way the underlying target is preserved. So the goal of robust deep learning is to be robust to both known and unknown manipulations. Adversarial training can help with known manipulations, but not with unknown manipulations. Our question is how to make predictions that can adapt to potentially unknown manipulations, as in the shifted-digit example. In this work we propose a model named the deep causal manipulation augmented model; we call it Deep CAMA. The idea is to create a deep learning model that is consistent with the underlying causal process. In this work, we assume that the causal relationships among the variables of interest are provided. Deep CAMA is a deep generative model, designed to be consistent with this causal structure; the structure and the notion of a valid attack are written out below.
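Schematically, and in my own notation (the mechanism f_X is of course unknown in practice), the setup just described is:

```latex
% Observation X is generated from the target, non-manipulable factors and manipulations:
X = f_X(Y, Z, M), \qquad
Y = \text{target}, \quad Z = \text{non-manipulable factors}, \quad M = \text{manipulations}.
% A valid attack intervenes on M only, leaving Y and Z untouched:
X' = f_X\big(Y, Z, m'\big) \;\; \text{under } do(M = m'),
% so the label of the manipulated X' is the same as that of X.
```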
Let's quickly recall the variational auto-encoder, a deep generative model that bridges deep learning and probabilistic modeling and has been successful in many applications; its graphical model is shown on the left. From a probabilistic modeling point of view, we can write down the model and its factorization, as shown on the right-hand side of this equation, and we learn the posterior using variational inference. In particular, we introduce a variational distribution Q and try to minimize the divergence between P and Q. We follow the standard steps and form the evidence lower bound, the ELBO, and optimize it to get the posterior estimate. Different from traditional probabilistic modeling, every link in the graphical model on the left is a deep neural network. This becomes an auto-encoder in which we try to reconstruct X through the stochastic latent variable, and it can be trained in a standard deep learning framework with the evidence lower bound we just showed as the loss. CAMA is also a deep generative model, but instead of the simple factorization on the left, our model is factorized in a causally consistent way, as you can see on the right-hand side; the model is consistent with the causal relationships we saw before. Next, let's see how we do inference. When we only have a clean dataset, meaning a dataset without any augmentation or adversarial examples, from a causal lens this is the same as do(M = clean). Translating this to the Deep CAMA model, we can use the value zero to indicate clean data: we set M to 0 and consider it observed. We only need to infer the latent variable Z in this case, and instead of conditioning only on X as in a traditional VAE encoder, in CAMA the variational distribution for Z conditions on X, Y, and M together. We follow the same procedure and form the evidence lower bound, the ELBO shown below. As M is a root node, the do operation on M can be written as conditioning. In the adversarial-training-style setting, we may have manipulated data in the training set as well, and we may not know the manipulation, so we treat M as a latent variable. In this case we need to infer both M and Z, so we have a variational distribution Q of Z and M conditioned on X and Y, and we can form the evidence lower bound accordingly. Finally, with both clean and manipulated data in the training set, the final loss combines the corresponding losses for the clean data and the manipulated data as shown before; here D is the clean subset of the data and D prime is the manipulated subset. This is the adversarial training setting using CAMA. In this way CAMA can be used either with clean data only or with clean and manipulated data together. The final neural network architecture is shown here: the encoder and decoder are on the right. The decoder network corresponds to the solid arrows in the graphical model on the left, and the encoder network corresponds to the dashed lines. The inference network helps us compute the posterior distribution over M and Z. At test time, we'd like the model to be robust to unseen manipulations, and we want to learn them at test time. We keep the part of the network representing the generative process from Y and Z to X fixed, and we fine-tune the part of the network for the new M and for how M influences X; in this way the network can learn a new, unseen manipulation. At test time the label is not known, so we don't observe Y; we therefore marginalize over Y and optimize the fine-tuning loss to adapt to the unseen manipulation. For prediction we use the Bayesian posterior over Y. (A much-simplified code sketch of this model and training procedure follows below.)
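For intuition, here is a much-simplified PyTorch-style sketch of a Deep-CAMA-like model: the generative factorization p(x | y, z, m) p(y) p(z) p(m), an amortized posterior q(z | x, y, m), a clean-data ELBO with M observed as zero, and a test-time objective that averages over candidate labels. Layer sizes, priors, and the training loop are my own choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

X_DIM, Y_CLASSES, Z_DIM, M_DIM, H = 784, 10, 32, 8, 256

class DeepCAMAlike(nn.Module):
    def __init__(self):
        super().__init__()
        # Decoder p(x | y, z, m): the mechanism from (Y, Z, M) to X.
        self.dec = nn.Sequential(
            nn.Linear(Y_CLASSES + Z_DIM + M_DIM, H), nn.ReLU(), nn.Linear(H, X_DIM))
        # Encoder q(z | x, y, m).
        self.enc_z = nn.Sequential(
            nn.Linear(X_DIM + Y_CLASSES + M_DIM, H), nn.ReLU(), nn.Linear(H, 2 * Z_DIM))

    def elbo(self, x, y_onehot, m):
        mu, logvar = self.enc_z(torch.cat([x, y_onehot, m], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization
        logits = self.dec(torch.cat([y_onehot, z, m], -1))
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(-1)    # KL(q(z|.) || N(0, I))
        return (rec - kl).mean()

model = DeepCAMAlike()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Clean-data training: do(M = clean), encoded here as m = 0 and treated as observed.
x = torch.rand(64, X_DIM)                       # stand-in batch of images in [0, 1]
y = F.one_hot(torch.randint(0, Y_CLASSES, (64,)), Y_CLASSES).float()
m_clean = torch.zeros(64, M_DIM)
loss = -model.elbo(x, y, m_clean)
opt.zero_grad(); loss.backward(); opt.step()

# Test-time adaptation (sketch): labels are unknown, so the objective averages the
# ELBO over all candidate labels; in the method described above, only the M-related
# parts would be fine-tuned while the (Y, Z) -> X mechanism stays fixed.
x_test, m_guess = torch.rand(64, X_DIM), torch.zeros(64, M_DIM)
marginal = torch.stack([
    model.elbo(x_test, F.one_hot(torch.full((64,), k, dtype=torch.long), Y_CLASSES).float(), m_guess)
    for k in range(Y_CLASSES)]).mean()
# Maximizing `marginal` with respect to the M-side parameters would implement the adaptation step.
```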
To recap, CAMA is designed in a causally consistent way; we can efficiently train the model following a procedure similar to the variational auto-encoder, and we can also fine-tune the model at test time to unseen manipulations and make predictions. Next, let's see the performance of CAMA. First, let's use only clean data, as at the start of this talk. We bring back the blue curve from the first slide, which is the regular deep neural network. Our method without fine-tuning is shown in orange, and with fine-tuning on the corresponding manipulation at test time it is shown as the green curve in figure A, where we can see a significant improvement in performance. We can also see, when fine-tuning on a different manipulation in the middle panel, that the performance does not drop, unlike the traditional neural network. This is thanks to the fact that we fix the mechanism from Y and Z to X: fine-tuning on one manipulation does not affect the robustness to other types of manipulation, which is what we want. In the middle panel the fine-tuning was done on horizontal shift and the testing was on vertical shift. Furthermore, we use different percentages of the test data for fine-tuning. We see that the more data we use for fine-tuning, the more robust the performance under the unseen manipulation. More importantly, with only around 10 percent of the data we already obtain very good performance, which means the fine-tuning procedure is data efficient. We also tested our method against popular gradient-based adversarial attacks, in particular the fast gradient sign method, shown on the left, and the projected gradient descent attack, shown on the right. The blue curve is the traditional deep learning method, which is very vulnerable. The orange one is the CAMA model without fine-tuning, and the green one is CAMA with fine-tuning. We can see that CAMA with fine-tuning is much more robust even to gradient-based attacks. The red line shows the clean test performance after fine-tuning, which means fine-tuning does not deteriorate the clean-data performance either. So the improvement in robustness of the CAMA model compared to the traditional model is significant under gradient-based attacks. In the adversarial training setting we obtain similar results to the clean-data case we saw before; see our paper for more results, which I will not repeat here. Moreover, I would like to point out that our method obtains natural disentanglement, because we model Z and M separately, and we can apply the do operation to create counterfactual examples. Figure A shows some examples that are vertically shifted in the training data. After fitting the data, we can apply the do operation, set M to zero, and generate new data, shown on the right-hand side: we can shift the images back to the centered location. Now we have shown that CAMA works well in the image classification case; how does it work in the general case, for example with many variables and more causal relationships, as in the one shown in the picture, where the target can have multiple causes and those causes can have causes of their own? Can we use CAMA in this case? The answer is yes: we have a generalized Deep CAMA for this setting. We consider the Markov blanket of the variable of interest and construct a deep neural network model that's consistent with the causal relationships.
With the target Y, we place all the variables in their corresponding locations, as ancestors A, children X, or co-parents C, consistent with the causal relationships. We introduce Z in the same way, where Z represents hidden factors that cannot be intervened on, and M represents hidden manipulations. We also extend the inference and fine-tuning methods in the same way for this generalized Deep CAMA model. For the experiment we use a dataset with a fully specified causal relationship, and we shift the children variables to create the test set. Again, the blue line is the baseline, the orange line is our method without fine-tuning, and the green one is with fine-tuning. We can see that the generalized Deep CAMA is significantly more robust. The red line shows the clean-data performance after adapting to the manipulation, and we can see that clean-data performance remains high even after the model adapts to the unseen manipulation. The same holds for gradient-based adversarial attacks. The attack can be on both the children and the co-parents while the target Y remains the same; comparing the green and orange lines to the baseline in blue, our method is significantly more robust to gradient-based attacks. Last, you may ask, what if we don't have the causal relationship? Until now we have always assumed that the causal relationship is given. In general, there are many methods for causal discovery from observational data and interventional data, so given a dataset, you can use different tools to find the causal relationships. A good review paper by Clark Glymour and colleagues from last year summarizes different types of causal discovery methods, and I have also done some research on this topic. However, to be honest, causal discovery is a challenging problem, and it may not be perfect all the time. What if the causal relationship that we use is not completely correct? There may be small errors. Here we performed experiments to study this with a dataset with many variables. The blue line is the baseline and the orange line is the case where the causal relationship is perfect. The other colored lines show different degrees of misspecification in the causal relationship. In this experiment we have ten children variables in total, and we give them different degrees of misspecified causal relationships: the green line shows two variables misspecified in the causal relationship, and the red line shows four variables misspecified. We see that with a misspecified causal relationship the performance drops compared to the ideal scenario. However, if only a small fraction is misspecified, we can still obtain more robust results than the baseline. So it is helpful to consider a causally consistent design even when we do not have the perfect causal relationship given. In the end, I would like to summarize my talk. I presented a causal view on model robustness and a causally inspired deep generative model called Deep CAMA. Our model is manipulation-aware and robust to unseen manipulations, and it can be trained efficiently with or without manipulated data. Please contact me if you have any questions. Thank you very much. AMIT SHARMA: We're back live for the panel session. One of the questions that was asked a lot during the chat was about model misspecification, and model misspecification can happen in two ways. One is that while we're thinking about the causal assumptions, we may miss something. So there could be, for example, an unobserved confounder.
And the other way could be when we build our statistical model: we might parameterize it too simply or make it too complex, and so on. So maybe this is a question for both Susan and Elias: how do you reconcile with that? Are there tools that we can use to detect which kind of error is happening, or can we somehow give some kind of confidence intervals or guarantees when we are worried that such errors may occur? So maybe, Susan, you can go first. SUSAN ATHEY: Sure. That's a great question. And it's definitely something I worry about in a lot of different aspects of my work. I think one approach is to exploit additional variation. So I guess we should start from the fact that in general, in many of these settings, these models are just identified. So there's a theorem that says that you can't detect the presence of the confounder without additional information. But sometimes we do have additional information. So if you have multiple experiments, for example, you can exploit that additional information. And so in one of my papers we do an exercise where we look at certain types of violations of our assumptions and see if we can accept or reject their presence. So, for example, one thing that we worried about was that there might be an upward trend over time in demand for a product that might coincide with an upward trend in prices. We were already using things like week effects and throwing out products that had a lot of seasonality, but still our functional form might not capture everything. And so we did these exercises called placebo tests, where you put in fake price series that are shifted forward or backward in time and then try to assess whether we actually find a treatment effect for that fake price series. And then we had 100 different categories, so we could test across those hundred categories, and we found basically a uniform distribution of test statistics for the effect of a fake price series, which helped us convince ourselves that at least these kinds of overall time trends were not a problem. But that was designed to look at a very specific type of misspecification, and in another setting there might not be an exact analog of that. Another thing that I emphasized in my talk was trying to validate the model using test data, which again was only possible because we had lots of price changes in our data. And so those types of validation exercises can also let you know when you're on the right track, because if you have mis-estimated price sensitivities, then your predictions about differences in behavior between high and low price-sensitive people in the test set won't be right. But broadly, this issue of identification, the fundamental assumptions for identification, and testing them is challenging. One of the most common mistakes I see from people in the machine learning community is thinking that, oh, well, I can just test it this way or test it that way, without realizing that in many cases there's a theorem that says even infinite data would not allow you to distinguish things. So you have to start with the humility that there are theorems saying you can't answer some of these questions directly. You need assumptions. But sometimes you can be clever and at least provide some data that supports your assumptions. So maybe I can come back to the functional forms and let Elias take a crack at the first question, because there's a completely separate answer for functional forms. Go ahead. You're muted. ELIAS BAREINBOIM: There we go.
Can you hear me? AMIT SHARMA: Yes. ELIAS BAREINBOIM: Cool. Thanks, Susan. Thanks, Amit. On model misspecification in machine learning, the first comment I would make, and this is very common, is that people try to use this training-and-testing paradigm, as I like to call it, to validate a causal model or a causal query. That makes no sense in causality. As I summarized in my talk, usually the type of data we have in the training and testing sets is layer one, that is, observational data, and we're trying to make a statement about another distribution, the experimental one. So there's no training-and-testing split in which one distribution can tell you about the other, at least not naively, or not in general. That's the first comment. The second one, which I think is the interesting scenario, as I mentioned in the chat earlier, is, before reinforcement learning, in the observational setting, trying to derive the testable implications of your causal model, conditional independence constraints, equality constraints, and other types of constraints, and use them to try to validate the model. I think this is the principled approach, using testable implications. A lot of people doing causal inference are trying to understand what these constraints are, and then you can submit them to some type of statistical test. Now, moving to reinforcement learning, the more active setting, which is quite interesting: in this setting we are already taking the decisions, we are already randomizing and controlling the environment, and the very goal of doing that, going back to Fisher perhaps 100 years ago, was to avoid the unobserved confounding that originated the question in the first place. So reinforcement learning is good for that, and if you have something wrong in the model, and I can be super critical about that, many times the effects of having it wrong will wash away. I think there is another nice idea in the reinforcement learning setting that we're pursuing, and that other people should think about: how can you use the combination of these different datasets not only for decision-making itself, but to try to validate the model and see which parts of the model are wrong. There are different types of tests, usually quite unconventional, for how to triangulate the observational and the different types of experimental distributions in order to detect the parts of the model that have problems. My last note, my last idea, is simply to do sensitivity analysis. We don't have so many methods; there are some good initial ones, but not many tailored in particular to the causal inference problem. I think that's a very good area; there is some initial work, it is very promising, and we'll talk about it when we discuss future frontiers. But for now I think more people should work on sensitivity analysis. I pass the ball back to Amit. AMIT SHARMA: Sure, yeah, right. I think it's a fundamental distinction between identification and estimation, right? And I think maybe, Susan, you can talk about the statistical misspecification. SUSAN ATHEY: So, the functional forms. Right. So in econometrics, we often look at nonparametric identification and at things like semi-parametric estimation. So you might think, for example, in these choice problems I was talking about, we had behavioral assumptions that consumers were maximizing utility.
We had identification assumptions which basically say that whether the consumer arrived at the store just before or just after the price change was as good as random, and so the price, within a period of two days, was randomly assigned to the consumer. That's the identification assumption. And then there's a functional form assumption, which is type I extreme value, which allows you to use the multinomial logit formulation. That functional form assumption is incredibly convenient because it tells you, if one product goes out of stock, how you are going to redistribute purchases across substitute products. It allows you to make these counterfactual predictions and it's very efficient. If I have one price sensitivity, I can learn it on one product and apply it to other products as well. Those types of things are incredibly efficient, and they've been shown to be incredibly useful for studying consumer choice behavior over many decades. But they are still functional form assumptions. There are also theorems that say that, in principle, you can identify choice behavior even if you don't assume the type I extreme value distribution, even if you don't assume this logit formulation. But then you need a lot of variation in prices in order to trace out what the distribution of your errors really is, and to fully uncover the joint distribution of all of the shocks to your preferences you would need lots of price variation and lots of products over a long period of time. So theoretically you can learn everything without the functional form assumptions, but in practice it's not practical, and so you're always going to be relying on some functional form assumptions in practice, even though theoretically you can identify everything nonparametrically with enough price variation. So then it comes down to sensitivity analysis. You want to check whether your results are sensitive to the various assumptions you've made, and that becomes more of a standard exercise. But I think it's really helpful to frame the exercise by first asking: is it even possible to answer these questions, and what would you need? And many problems are impossible. And just as Elias was saying, if you have a confounder in your training set, you're also going to have one in your test set, and just splitting test and train doesn't solve anything. So you have to have a theoretical reason why you think that you're going to be able to answer your question. AMIT SHARMA: Makes sense. I have a similar question for Cheng as well, in the sense that it would be great if we had a training method that is robust to all adversarial attacks, but obviously that will be difficult. There are some assumptions you're making in the structure of your causal model itself, in the Deep CAMA method. So my question to you is how sensitive it is, and what kinds of attacks can be... What will your model be robust to? But I'll also throw in a more ambitious question: is it possible to formally define the class of attacks to which a causal model may be robust? CHENG ZHANG: So I think the key here is how we can formulate the attack in a causal way. For some attacks it's very easy to formulate them in a causal way; for example, shifting is a manipulation that is just another cause of the image you're observing. But for some attacks it's trickier to formulate them in a causal way, for example gradient-based attacks, especially multi-step gradient-based attacks.
So I think it then becomes a model with cycles over time, with a causal model as the underlying model. If you can formulate it properly as a causal model and design a model that is consistent with it, then we can be robust to the attack, but not all cases are so easy, and there can be technical challenges when there are cycles over time for certain types of attacks. So in general I think it's always good to consider more causality, but how difficult it is, how many assumptions you have to make, and to what degree you risk violating those assumptions, that depends on the situation you're in. AMIT SHARMA: Yeah, that makes sense. And maybe one question I want to ask, and maybe this will be the last live question. We talked about really interesting applications of causality. Susan, you talked about the classic problem of price sensitivity in economics; Elias, you briefly talked about reinforcement learning; and Cheng about adversarial attacks. These are interesting ideas that we have seen. I wanted to ask you to look into the future a bit, maybe a few years. What are the areas or applications you're most excited about, where you think this amalgamation of causality and machine learning is poised to help and may have the biggest impact? Susan, do you want to go first? SUSAN ATHEY: That's a good question. So one thing that I'm working on a lot in my lab at Stanford is personalization of digitally provided services, education and training, which of course, for all the partners I'm working with, have had huge uptake in the COVID-19 crisis. So of course you can start to attack personalization in digital services without thinking about causality; you can build classic recommendation systems without really using a causal framework. But as you start to get deeper into this, you realize that you actually can do a fair bit better in some cases by using a causal framework. First of all, if you're using reinforcement learning, for example, I would argue that reinforcement learning is just intrinsically causal; you're running experiments, basically. But if you're trying to do reinforcement learning in a small data setting, you do want to use ideas from causal inference and also be very careful about how you're interpreting your data and how you're extrapolating. So I think this intersection of causal inference and reinforcement learning in smaller data environments, where the statistics are more important, is one area: you have to worry about the biases that come up with naive reinforcement learning, where you're creating selection biases and confounding in your own data, and if the statistics in the reinforcement learning model aren't actually factoring everything in, you can make mistakes. And more broadly, we're seeing that a lot of the companies I'm working with, ed tech and training tech, are running a lot of randomized experiments. And so we're combining historical observational data with their experiments, and you can learn some parts of the model using the historical observational data and use that to make the experimentation, as well as the analysis of the experimentation, more efficient. So I think this whole intersection of combining observational and experimental data when you're short on statistical power is another super interesting area that a lot of companies will be thinking about as they try to improve their digital services. AMIT SHARMA: Elias, what do you think? ELIAS BAREINBOIM: Amit, thanks for the question, by the way. I was trying to answer you. Thanks, Amit.
I think that in terms of applications, my general goal, the goal in the lab, is to build more general types of AI, I would say. AI that is more human-friendly, as people call it, or you can attach to it the label of rational decision-making. We have been revisiting these notions for the last five years or so, revisiting what they could mean, because if you go to AI books from 20 or 30 years ago, all of them use the same labels and they are usually not causal. I would say that I personally don't see any way of doing general AI, or more general types of AI, I should say, without taking causal inference seriously and putting it front and center. I can count on my hands how many people are doing this today, but I cannot count the number of people who are excited about it, which is pretty good; I'm excited about the excitement at the moment. So my primary suggestion is: don't go around it. Just try to understand what a causal model is, what causality is about, and then just do it. There is a bit of a learning curve, but I think this is the critical path if you want to do AI, or more general types of AI. For the two other applications we've been working on, go to the website causalAI.net. On causal reinforcement learning, as you mentioned, and as we were chatting before in the internal chat here, I just gave a three-hour tutorial at ICML that tries to explain my vision of this intersection of causality and reinforcement learning; check it out to see causal reinforcement learning and also the notions of explainability, fairness, and ethics. There are many papers and works, not technical ones, that say causality is hard, or that it is difficult to get a causal model, and so on. But it's inevitable in some way, so there's no point in postponing. If you go to court, or talk to human beings, causality is usually required in the law and in legal circles, and as humans we are causal machines. So there's no way to go around it. I'd like to see more people work on it, including Microsoft, for sure. Microsoft was the leader, by the way, in the Bayes net revolution of the early '90s, which ran through the '90s and into the early 2000s, I think, and which pushed the limits a lot; many tools today, including variational auto-encoders and so on, build on that. Still, I'd like to see much bolder steps from Microsoft. Eric Horvitz and David Heckerman, those were the two leaders. They understood it very well; they are among the developers of the theory of graphical models and Bayes nets in the late '80s, and they pushed that forward in such a good way. Now, to be clear, a Bayes net is completely different from a causal graphical model. This is my expectation for Microsoft, and I think there is huge potential. Well, that's the idea. AMIT SHARMA: Thank you, Elias. Cheng, what sort of domains or applications are you most excited about? CHENG ZHANG: I would like to second Susan and Elias; I think these are all interesting directions. I see great importance in considering causality in all areas of machine learning, in all directions: deep learning, reinforcement learning, fairness, and, I really like your work on this as well, Amit, privacy, robustness, generalization. For a lot of current problems in machine learning, if we actually consider causality, I really see it as the last magic ingredient to solve many of the drawbacks of current machine learning models.
But I'd like to bring in another angle. If you think about causality as a direction within machine learning, I think in recent years a lot of the gap has been bridged, but in the early days, as I see it, the two were a bit more separated. From the causality side, I think a lot of modern machine learning techniques can also improve causal discovery itself, because traditionally we hear about all these theorems, proofs, identifiability results and so on, and we commonly limit ourselves to simpler function classes. In recent years there have been more advances there, and I also see a lot of machine learning techniques that help with causal discovery. For example, the recent nonlinear ICA work from Aapo Hyvärinen's group bridges nonlinear ICA and variational auto-encoders, and with self-supervised learning on time series it can also help with causal discovery from observational data. I see this as a great trend. There is also recent work, for example, on using active learning for causal discovery. So I see not only causality contributing to machine learning, but also great potential for other machine learning methods to contribute to causality. AMIT SHARMA: Great. On that note, that's a wrap. Thank you again to all the speakers for taking the time to join this session, and of course thank you to the audience for coming to the Frontiers in ML event. We'll start again tomorrow at 9:00 a.m. Pacific, with a session on machine learning reliability and robustness. Thank you, all.