**October 19–21, 2020**

**9 am to 2 pm PT (Noon to 5 pm ET) Daily**

**Watch Workshop Videos:** C3.ai DTI YouTube Channel (from our homepage, scroll down to Workshops)

We are on the verge of a deep learning revolution that is leading to many disruptive technologies: from automatic speech recognition systems such as Apple Siri, to automated supermarkets such as Amazon Go, to autonomous vehicles such as Google Car. As we increasingly employ deep learning in our daily lives to support important decisions, it becomes critical to understand the predictions made by deep neural networks (DNNs). The purpose of this workshop is to share progress and foster collaboration on the analytical foundations of deep learning. Our goal is to help explain phenomena observed in practice from rigorous mathematical and statistical perspectives, and to develop new principles that help practitioners improve the design of algorithms and architectures, ultimately leading to new deep learning systems that are “correct by construction” and offer performance guarantees for properties such as robustness or fairness.

The workshop has three closely connected components: two tutorials on the first day, followed by two days of invited presentations, and a brainstorming session on the last day.

*Tutorials:* We plan to start the workshop with two tutorials on the first day. The first tutorial aims to provide a mathematical justification for properties of conventional deep networks, such as global optimality, invariance, and stability of the learned representations. The second tutorial will cover more recent developments on graph neural networks that are applicable to a broader family of data structures.

*Invited Presentations:* We will have two days of invited presentations and discussions by experts in the field. Among the diverse set of topics related to deep learning, this workshop will focus on “Principled design and interpretability” for the first day and “Guaranteed robustness and fairness” for the second day.

*Discussion and Brainstorming:* The last day of the workshop will be devoted to discussion and brainstorming on exciting open problems related to the analytical foundations of deep learning. We encourage each participant to prepare and bring a list of problems of their own and discuss them at the workshop. One outcome of the workshop is a report listing fundamental and challenging open problems for future research.

**ORGANIZERS**

Yi Ma (University of California, Berkeley) and René Vidal (Johns Hopkins University)

**SPEAKERS**

Peter Bartlett (University of California, Berkeley), Tom Goldstein (University of Maryland), Gitta Kutyniok (Ludwig-Maximilians Universität München), Yi Ma (University of California, Berkeley), Alejandro Ribeiro (University of Pennsylvania), Guillermo Sapiro (Duke University), René Vidal (Johns Hopkins University), Soledad Villar (Johns Hopkins University), Max Welling (University of Amsterdam), Bin Yu (University of California, Berkeley)

**PROGRAM** *(All times are Pacific Time)*

### Day 1: Tutorials

#### ABSTRACT

The past few years have seen a dramatic increase in the performance of recognition systems thanks to the introduction of deep networks for representation learning. However, the mathematical reasons for this success remain elusive. For example, a key issue is that the neural network training problem is non-convex, hence optimization algorithms may not return a global minimum. In addition, the regularization properties of algorithms such as dropout remain poorly understood. The first part of this tutorial will give an overview of recent work on the theory of deep learning that aims to understand how to design the network architecture, how to regularize the network weights, and how to guarantee global optimality. The second part of this tutorial will present sufficient conditions to guarantee that local minima are globally optimal and that a local descent strategy can reach a global minimum from any initialization. Such conditions apply to problems in matrix factorization, tensor factorization, and deep learning. The third part of this tutorial will present an analysis of the optimization and regularization properties of dropout in the case of matrix factorization.

#### ABSTRACT

We will develop the concept of Graph Neural Networks (GNNs), which aim to extend the success of CNNs to the processing of high-dimensional signals in non-Euclidean domains. They do so by leveraging possibly irregular signal structures described by graphs. The following topics will be covered: (1) Graph Convolutions and GNN Architectures. The key concept enabling the definition of GNNs is the graph convolutional filter. GNN architectures compose graph filters with pointwise nonlinearities. (2) Fundamental Properties of GNNs. Graph filters and GNNs are suitable architectures to process signals on graphs because of their permutation equivariance. GNNs tend to work better than graph filters because they are Lipschitz stable to deformations of the graph that describes their structure. This is a property that regular graph filters cannot have. (3) Distributed Control of Multiagent Systems. An exciting application domain for GNNs is the distributed control of large-scale multiagent systems. Applications to the control of robot swarms and wireless communication networks will be covered.
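To make topics (1) and (2) concrete, here is a minimal Python sketch that is not taken from the tutorial materials. It assumes the common polynomial form of a graph convolutional filter, y = Σₖ hₖ Sᵏ x, where S is a graph shift operator (here simply the adjacency matrix), and it composes that filter with a pointwise ReLU to form one GNN layer. The function names, the 4-node path graph, and the filter taps are all hypothetical choices made for illustration.

```python
def mat_vec(S, x):
    """Multiply the graph shift operator S (list of rows) by the signal x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]


def graph_filter(S, x, h):
    """Graph convolution: y = sum_k h[k] * S^k x."""
    y = [0.0] * len(x)
    z = list(x)                      # z holds S^k x, starting with k = 0
    for h_k in h:
        y = [y_i + h_k * z_i for y_i, z_i in zip(y, z)]
        z = mat_vec(S, z)            # advance to S^(k+1) x
    return y


def gnn_layer(S, x, h):
    """One GNN layer: graph filter followed by a pointwise ReLU."""
    return [max(0.0, v) for v in graph_filter(S, x, h)]


def relabel(S, x, perm):
    """Relabel nodes so that new node i is old node perm[i]."""
    n = len(x)
    S_p = [[S[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
    x_p = [x[perm[i]] for i in range(n)]
    return S_p, x_p


# Hypothetical example: a 4-node path graph 0-1-2-3, adjacency matrix as S.
S = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
x = [1.0, -2.0, 0.5, 3.0]            # one scalar signal value per node
h = [0.5, 1.0, -0.25]                # filter taps (assumed, not learned here)
perm = [2, 0, 3, 1]

y = gnn_layer(S, x, h)
S_p, x_p = relabel(S, x, perm)
y_p = gnn_layer(S_p, x_p, h)

# Permutation equivariance: relabeling the input relabels the output the same way.
equivariant = all(abs(y_p[i] - y[perm[i]]) < 1e-9 for i in range(len(x)))
print(equivariant)
```

The check at the end only demonstrates permutation equivariance on one example; the stability result mentioned in the abstract is a separate property that this sketch does not test.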

### Day 2: **Principled Design & Interpretability**

#### ABSTRACT

A number of powerful principles underlie much of modern physics, such as the behavior of variables and fields under symmetry transformations and the strange statistical laws of quantum mechanics. Can these principles also be used in deep learning? While this may look strange at first sight, we only need to realize that both physics and deep learning can be understood as information processing systems. In this talk, I will explain how we can apply representation theory for both global as well as local (gauge) transformations to deep learning. In the second half, I will explain how even the language of quantum mechanics can be applied to deep learning and might, with the advent of quantum computers, become a new powerful paradigm for deep learning.

#### ABSTRACT

In this talk, we provide a theoretical framework for interpreting neural network decisions by formalizing the problem in a rate-distortion framework. The solver of the associated optimization, which we coin Rate-Distortion Explanation (RDE), is then accessible to a mathematical analysis. We will discuss theoretical results as well as present numerical experiments showing that our algorithmic approach outperforms established methods, in particular, for sparse explanations of neural network decisions.

#### ABSTRACT

Predictability, computability, and stability (PCS) are three core principles for veridical data science that aims at responsible, reliable, reproducible, and transparent data analysis and decision-making. They embed the scientific principles of prediction and replication in data-driven decision-making while recognizing the central role of computation. Based on these principles, the PCS framework consists of a workflow and documentation (in R Markdown or Jupyter Notebook) for the entire data science life cycle (DSLC) from problem formulation, data collection, data cleaning to modeling and data result interpretation and conclusions. Veridical interpretability is defined as trustworthy interpretability of data results that captures reality with predictability as a minimum and is reliable through stability analysis relative to appropriate perturbations to DSLC including human judgment calls. The PCS framework provides a protocol towards veridical interpretability. Two interpretation methods, DeepTune and ACD for DNN models, will be demonstrated as case studies of PCS towards veridical interpretability. In particular, DeepTune elicits meaningful and testable (image) interpretations of DNN-based models of single neurons in the difficult primate visual cortex area V4. ACD (agglomerative contextual decomposition) provides hierarchical interpretations of DNN predictions, and is effective at diagnosing incorrect predictions and identifying dataset bias, while being largely stable to adversarial perturbations.

#### ABSTRACT

In this talk, we offer an entirely “white box” interpretation of deep (convolutional) networks. In particular, we show how modern deep architectures, linear (convolution) operators and nonlinear activations, and parameters of each layer can be derived from the principle of rate reduction (and invariance). All layers, operators, and parameters of the network are explicitly constructed via forward propagation, instead of learned via back propagation. All components of such a network have precise optimization, geometric, and statistical meaning. There are also several nice surprises from this principled approach that shed new light on fundamental relationships between forward (optimization) and backward (variation) propagation, between invariance and sparsity, and between deep networks and Fourier analysis.

### Day 3: **Guaranteed Robustness & Fairness**

#### ABSTRACT

Classical theory that guides the design of nonparametric prediction methods like deep neural networks involves a tradeoff between the fit to the training data and the complexity of the prediction rule. Deep learning seems to operate outside the regime where these results are informative, since deep networks can perform well even with a perfect fit to noisy training data. We investigate this phenomenon of ‘benign overfitting’ in the simplest setting, that of linear prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of effective rank of the data covariance. It shows that overparameterization is essential: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size. It also shows an important role for finite-dimensional data: benign overfitting occurs for a much narrower range of properties of the data distribution when the data lies in an infinite dimensional space versus when it lies in a finite dimensional space whose dimension grows faster than the sample size. We discuss implications for deep networks, for robustness to adversarial examples, and for the rich variety of possible behaviors of excess risk as a function of dimension. This is joint work with Phil Long, Gábor Lugosi, and Alex Tsigler.
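For readers unfamiliar with the term, the “minimum norm interpolating prediction rule” for linear regression can be sketched as follows. This illustration is not from the talk: it assumes an overparameterized problem (more features than samples) whose data matrix X has full row rank, and computes θ = Xᵀ(XXᵀ)⁻¹y with a small hand-written solver. The data values and function names are hypothetical, and the effective-rank characterization from the abstract is not reproduced here.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        rest = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - rest) / M[i][i]
    return z


def min_norm_interpolator(X, y):
    """Return theta = X^T (X X^T)^{-1} y, assuming X has full row rank."""
    n, d = len(X), len(X[0])
    gram = [[sum(X[i][k] * X[j][k] for k in range(d)) for j in range(n)]
            for i in range(n)]
    alpha = solve(gram, y)
    return [sum(alpha[i] * X[i][k] for i in range(n)) for k in range(d)]


def predict(X, theta):
    return [sum(x_k * t_k for x_k, t_k in zip(row, theta)) for row in X]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


# Hypothetical data: n = 2 samples, d = 4 features, so many interpolators exist.
X = [[1.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, -1.0]]
y = [1.0, 2.0]

theta = min_norm_interpolator(X, y)

# Adding any null-space direction of X gives another interpolator with larger norm.
null_direction = [-1.0, 0.0, 1.0, 1.0]               # satisfies X @ null_direction = 0
other = [t + v for t, v in zip(theta, null_direction)]

print(predict(X, theta), predict(X, other))          # both fit the training labels
print(norm(theta) < norm(other))
```

Both parameter vectors fit the training data exactly; the minimum norm rule is the one that lies in the row space of X, which is why it has the smallest Euclidean norm among all interpolators.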

#### ABSTRACT

We first formulate and formally characterize group fairness as a multi-objective optimization problem, where each sensitive group risk is a separate objective. We propose a fairness criterion where a classifier achieves minimax risk and is Pareto-efficient w.r.t. all groups, avoiding unnecessary harm, and can lead to the best zero-gap model if policy dictates so. We provide a simple optimization algorithm compatible with deep neural networks to satisfy these constraints. Since our method does not require test-time access to sensitive attributes, it can be applied to reduce worst-case classification errors between outcomes in unbalanced classification problems. We test the proposed methodology on real case studies of predicting income, ICU patient mortality, skin lesion classification, and assessing credit risk, demonstrating how our framework compares favorably to other approaches. We then extend this work to the case where the sensitive classes are not known even at training time, achieving this via a game-theoretic optimization approach. We show the implications of this for the concept of subgroup robustness. This is joint work with Natalia Martinez, Martin Bertran, Afroditi Papadaki, and Miguel Rodrigues.
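The selection criterion described above (minimax risk plus Pareto efficiency across groups) can be illustrated on a toy problem. The sketch below is a simplification and not the speakers' algorithm: it assumes a finite set of hypothetical candidate models with already-measured per-group risks, whereas the talk concerns an optimization procedure for deep networks. All names and numbers are invented for illustration.

```python
def worst_group_risk(risks):
    """Risk of the worst-off group for one model."""
    return max(risks)


def dominates(a, b):
    """True if a is no worse than b on every group and strictly better on one."""
    no_worse = all(r_a <= r_b for r_a, r_b in zip(a, b))
    better = any(r_a < r_b for r_a, r_b in zip(a, b))
    return no_worse and better


def minimax_pareto(candidates):
    """Names of models that attain the minimax risk and are not Pareto-dominated.

    candidates maps a model name to its list of per-group risks.
    """
    best = min(worst_group_risk(r) for r in candidates.values())
    selected = []
    for name, risks in candidates.items():
        if worst_group_risk(risks) != best:
            continue                                  # not minimax
        if any(dominates(other, risks) for other in candidates.values()):
            continue                                  # causes unnecessary harm
        selected.append(name)
    return sorted(selected)


# Hypothetical per-group risks for two sensitive groups.
candidates = {
    "A": [0.10, 0.30],   # good for group 0, poor for group 1
    "B": [0.20, 0.22],   # minimax, but dominated by C
    "C": [0.15, 0.22],   # minimax and Pareto-efficient
    "D": [0.22, 0.22],   # zero gap between groups, but dominated by C
}

print(minimax_pareto(candidates))
```

In this toy setting, model D closes the gap between groups only by raising the risk of group 0 without helping group 1, which is the kind of unnecessary harm the criterion is meant to avoid.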

#### ABSTRACT

Large, high-capacity deep learning models trained on large amounts of data have been shown to achieve impressive performance and generalize well. However, there is an argument for simpler models, which claims that algorithms used for decision-making cannot be fair if they cannot explain their decisions. In this talk, we approach the problem of interpretable feature selection and the topic of robustness. We first derive a linear approach to select relevant features using the lasso, and then we extend it to the context of deep learning using variational autoencoders and the famous Gumbel-softmax trick. Finally, we evaluate this method in the context of single-cell RNA sequencing data. This is joint work with Nabeel Sarwar and Bianca Dumitrascu.
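As background on the Gumbel-softmax trick named above, the following minimal sketch shows the sampling step only; it is not the speakers' feature-selection method and involves no autoencoder. It assumes the standard construction: add independent Gumbel(0, 1) noise to the log-probabilities of a categorical distribution, divide by a temperature `tau`, and apply a softmax. The probabilities, temperature, and function names are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.uniform(1e-12, 1.0 - 1e-12)   # keep u away from 0 and 1 so both logs are finite
    return -math.log(-math.log(u))


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(log_probs, tau, rng):
    """Relaxed one-hot sample: softmax((log_probs + Gumbel noise) / tau)."""
    noisy = [(lp + sample_gumbel(rng)) / tau for lp in log_probs]
    return softmax(noisy)


# Hypothetical categorical distribution over three features.
probs = [0.1, 0.2, 0.7]
log_probs = [math.log(p) for p in probs]
rng = random.Random(0)

soft = gumbel_softmax(log_probs, tau=1.0, rng=rng)    # smooth weights over features
sharp = gumbel_softmax(log_probs, tau=0.05, rng=rng)  # close to a one-hot vector
print(soft)
print(sharp)
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures give smoother weights. The largest entry of each sample falls on index i with probability `probs[i]`, which is what makes the relaxation a stand-in for discrete sampling.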

#### ABSTRACT

Evasion and poisoning attacks have been demonstrated on a range of systems, but usually in a simplified laboratory setting. In this talk, I will describe recent work on evasion attacks and present our work on dataset poisoning. I’ll explain how attacks on toy systems can be scaled up and weaponized to break industrial systems, including copyright detection systems, algorithmic trading bots, and the Google and Amazon machine learning APIs.

### Day 4: **Brainstorm and Discussion**

#### ABSTRACT

**Lead:** Edgar Dobriban (University of Pennsylvania)

**Participants:** Sebastien Bubeck (Microsoft Research), Jinghui Chen (University of California, Los Angeles), Soheil Feizi (University of Maryland), Micah Goldblum (University of Maryland), Zico Kolter (Carnegie Mellon University), Omar Montasser (Toyota Technological Institute at Chicago), Cyrus Rashtchian (University of California, San Diego), Aditi Raghunathan (Stanford University), Alex Robey (University of Pennsylvania), Chong You (University of California, Berkeley), Hongyang Zhang (Toyota Technological Institute at Chicago)

#### ABSTRACT

**Leads:** Gitta Kutyniok (Ludwig-Maximilians Universität München) and Guillermo Sapiro (Duke University)

**Participants:** Solon Barocas (Cornell University, Microsoft Research) and Ana-Andreea Stoica (Columbia University)


#### SPEAKER

Noah E. Friedkin is a Professor in the Department of Sociology at the University of California, Santa Barbara. He is an AAAS Fellow and Editor-in-Chief of the Journal of Mathematical Sociology. His research interests are in network science modeling of interpersonal influence systems. He has published two award winning books on this subject along with numerous journal publications in which modelling predictions are evaluated with data collected from experiments on human subjects. During the past six years, his work has been conducted with collaborators in the fields of engineering control theory (F. Bullo, R. Tempo, A.V. Proskurnikov) and computer science (A. Singh).

#### ABSTRACT

**Leads:** Benjamin Haeffele (Johns Hopkins University) and Chong You (University of California, Berkeley)

**Participants:** Anima Anandkumar (California Institute of Technology, Nvidia), Song Han (Massachusetts Institute of Technology), Qiang Liu (University of Texas at Austin), Tess Smidt (Lawrence Berkeley National Laboratory)

**Format of Brainstorm Sessions:**

• Collect significant open problems.

• Discuss potential technical approaches.

• Present grand intellectual/industrial challenges that we can embark on.

• Draft an outline of a report by the group.
