About Me


I am a PhD student in the Machine Learning Group at the University of Toronto and the Vector Institute. I am primarily interested in neural networks, optimization, Bayesian inference, generative models, and reinforcement learning. I am also interested in applying insights about how the brain works to improve neural networks.

I obtained my Master's degree in Computer Science at Simon Fraser University. My MSc research focused on belief change, which is an area of knowledge representation concerned with updating knowledge bases in light of new information. I developed a Python package called Equibel to make it easier for researchers to experiment with belief change in multi-agent systems.

I received my BSc in computer science from Simon Fraser University in Winter 2014, and my Master's in 2016.


As a Canadian, I enjoy salmon, maple syrup, and their combination (salmon candy)! I also like sports based on water in different physical phases: swimming, skating, and skiing. As a music-lover, I support the Vancouver Symphony Orchestra and the Vancouver Opera. I myself play the saxophone and the guitar.

A collection of my favorite books can be found on my bookshelf.


April 21, 2021 Gave a talk Introduction to Disentanglement at the Vector Endless Summer School
January 22, 2021 Understanding and mitigating exploding inverses in INNs accepted at AISTATS 2021
October-December, 2020 Instructor for Vector Institute ML Course
September-December, 2020 Mentor for UMass Amherst Team at ProjectX ML Research Competition
March 13, 2019 Presented Self-Tuning Networks at the Vector Endless Summer School
February 22, 2019 Presented Self-Tuning Networks at the Vector Research Symposium
December 20, 2018 Self-Tuning Networks accepted to ICLR 2019
September 4, 2018 Reversible Recurrent Neural Networks accepted to NIPS 2018
August 29, 2018 Attended the Vector Science Communications Workshop
July 25 - August 3, 2018 Attended Deep Learning and Reinforcement Learning Summer School
June 27, 2018 Adversarial Distillation of Bayesian Neural Network Posteriors up on arXiv
June 22, 2018 MovieGraphs featured in a UofT news article
February 19, 2018 MovieGraphs accepted to CVPR 2018
January 29, 2018 Flipout accepted to ICLR 2018
December 19, 2017 MovieGraphs preprint up on arXiv
September, 2016 Started PhD at the University of Toronto
June 22, 2016 One paper accepted to ECAI 2016
June 6, 2015 One paper accepted to LPNMR 2015
May 7, 2015 Awarded the Governor General's Silver Medal for highest graduating GPA
December 20, 2014 Finished my BSc in Computer Science at Simon Fraser University


NeurIPS 2021, ICML 2021 (Expert Reviewer & Top 10% Reviewer), AISTATS 2021, NeurIPS 2020 (Top 10% Reviewer), ICLR 2020, NeurIPS 2019 (Top 400 Reviewer), ICML 2019, EMNLP 2017, ICCV 2017.


Meta-optimization of a learning rate schedule for a toy regression problem.

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies

Paul Vicol, Luke Metz, Jascha Sohl-Dickstein

Abstract: Unrolled computation graphs arise in many scenarios, including training RNNs, tuning hyperparameters through unrolled optimization, and training learned optimizers. Current approaches to optimizing parameters in such computation graphs suffer from high variance gradients, bias, slow updates, or large memory usage. We introduce a method called Persistent Evolution Strategies (PES), which divides the computation graph into a series of truncated unrolls, and performs an evolution strategies-based update step after each unroll. PES eliminates bias from these truncations by accumulating correction terms over the entire sequence of unrolls. PES allows for rapid parameter updates, has low memory usage, is unbiased, and has reasonable variance characteristics. We experimentally demonstrate the advantages of PES compared to several other methods for gradient estimation on synthetic tasks, and show its applicability to training learned optimizers and tuning hyperparameters.

ICML 2021 (Long Talk)
Handcrafted visualization of cold-start vs warm-start solutions for overparameterized dataset distillation.

Implicit Regularization in Overparameterized Bilevel Optimization

Paul Vicol, Jonathan Lorraine, David Duvenaud, Roger Grosse

Abstract: Bilevel problems involve inner and outer parameters, each optimized for its own objective. Most prior work makes the simplifying assumption that the inner and outer objectives have unique solutions, but often in practice, at least one of them is overparameterized. In this case, there are many ways to choose among equivalent optima, which lead to different results. Building on recent studies of implicit regularization in single-level optimization, we investigate the inductive biases of different gradient-based algorithms for jointly optimizing the inner and outer parameters. We distinguish between two different solution concepts---cold-start and warm-start equilibria---and show that the converged solution or long-run behavior depends to a surprising degree on algorithmic choices such as the hypergradient approximation, which depends on higher order Hessian-vector products. We also show that with warm-start equilibria, the inner parameters can encode a surprising amount of information about the outer objective, even when the outer parameters are low-dimensional.

A visualization of the Game Ridge Rider algorithm branching at a bifurcation.

Using Bifurcations for Diversity in Differentiable Games

Jonathan Lorraine; Jack Parker-Holder, Paul Vicol, Aldo Pacchiano, Luke Metz, Tal Kachman, Jakob Foerster

Abstract: Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian (``ridges''). RR is designed for conservative gradient systems (i.e. settings involving a single loss function), where it branches at saddles -- the only relevant bifurcation points. We generalize this idea to non-conservative, multi-agent gradient systems by identifying new types of bifurcation points and proposing a method to follow eigenvectors with complex eigenvalues. We give theoretical motivation for our method -- denoted Game Ridge Rider (GRR) -- by leveraging machinery from the field of dynamical systems. Finally, we empirically motivate our method by constructing novel toy problems where we can visualize new phenomena and by finding diverse solutions in the iterated prisoners' dilemma, where existing methods often converge to a single solution mode.

Convergence rates for simultaneous complex momentum on $\min_x \max_y xy$

Complex Momentum for Learning in Games

Jonathan Lorraine, David Acuna, Paul Vicol, David Duvenaud

Abstract: We generalize gradient descent with momentum for learning in differentiable games to have complex-valued momentum. We give theoretical motivation for our method by proving convergence on bilinear zero-sum games for simultaneous and alternating updates. Our method gives real-valued parameter updates, making it a drop-in replacement for standard optimizers. We empirically demonstrate that complex-valued momentum can improve convergence in adversarial games---like generative adversarial networks---by showing we can find better solutions with an almost identical computational cost. We also show a practical generalization to a complex-valued Adam variant, which we use to train BigGAN to better inception scores on CIFAR-10.

Understanding and mitigating exploding inverses in invertible neural networks

Understanding and Mitigating Exploding Inverses in Invertible Neural Networks

Jens Behrmann*, Paul Vicol*, Kuan-Chieh Wang*, Roger Grosse, Jörn-Henrik Jacobsen

Abstract: Invertible neural networks (INNs) have been used to design generative models, implement memory-saving gradient computation, and solve inverse problems. In this work, we show that commonly-used INN architectures suffer from exploding inverses and are thus prone to becoming numerically non-invertible. Across a wide range of INN use-cases, we reveal failures including the non-applicability of the change-of-variables formula on in- and out-of-distribution (OOD) data, incorrect gradients for memory-saving backprop, and the inability to sample from normalizing flow models. We further derive bi-Lipschitz properties of atomic building blocks of common architectures. These insights into the stability of INNs then provide ways forward to remedy these failures. For tasks where local invertibility is sufficient, like memory-saving backprop, we propose a flexible and efficient regularizer. For problems where global invertibility is necessary, such as applying normalizing flows on OOD data, we show the importance of designing stable INN building blocks.

Gradient-Based Hyperparameter Optimization with the Implicit Function Theorem

Optimizing Millions of Hyperparameters by Implicit Differentiation

Jonathan Lorraine, Paul Vicol, David Duvenaud

Abstract: We propose an algorithm for inexpensive gradient-based hyperparameter optimization that combines the implicit function theorem (IFT) with efficient inverse Hessian approximations. We present results on the relationship between the IFT and differentiating through optimization, motivating our algorithm. We use the proposed approach to train modern network architectures with millions of weights and millions of hyperparameters. We learn a data-augmentation network---where every weight is a hyperparameter tuned for validation performance---that outputs augmented training examples; we learn a distilled dataset where each feature in each datapoint is a hyperparameter; and we tune millions of regularization hyperparameters. Jointly tuning weights and hyperparameters with our approach is only a few times more costly in memory and compute than standard training.

Self-Tuning Network Architecture

Self-Tuning Networks: Bilevel Optimization of Hyperparameters using Structured Best-Response Functions

Matthew MacKay*, Paul Vicol*, Jonathan Lorraine, David Duvenaud, Roger Grosse

Abstract: Hyperparameter optimization can be formulated as a bilevel optimization problem, where the optimal parameters on the training set depend on the hyperparameters. We aim to adapt regularization hyperparameters for neural networks by fitting compact approximations to the best-response function, which maps hyperparameters to optimal weights and biases. We show how to construct scalable best-response approximations for neural networks by modeling the best-response as a single network whose hidden units are gated conditionally on the regularizer. We justify this approximation by showing the exact best-response for a shallow linear network with L2-regularized Jacobian can be represented by a similar gating mechanism. We fit this model using a gradient-based hyperparameter optimization algorithm which alternates between approximating the best-response around the current hyperparameters and optimizing the hyperparameters using the approximate best-response function. Unlike other gradient-based approaches, we do not require differentiating the training loss with respect to the hyperparameters, allowing us to tune discrete hyperparameters, data augmentation hyperparameters, and dropout probabilities. Because the hyperparameters are adapted online, our approach discovers hyperparameter schedules that can outperform fixed hyperparameter values. Empirically, our approach outperforms competing hyperparameter optimization methods on large-scale deep learning problems. We call our networks, which update their own hyperparameters online during training, Self-Tuning Networks (STNs).

Reversible RNN Attention Mechanism

Reversible Recurrent Neural Networks

Matthew MacKay, Paul Vicol, Jimmy Ba, Roger Grosse

Abstract: Recurrent neural networks (RNNs) provide state-of-the-art performance in processing sequential data but are memory intensive to train, limiting the flexibility of RNN models which can be trained. Reversible RNNs---RNNs for which the hidden-to-hidden transition can be reversed---offer a path to reduce the memory requirements of training, as hidden states need not be stored and instead can be recomputed during backpropagation. We first show that perfectly reversible RNNs, which require no storage of the hidden activations, are fundamentally limited because they cannot forget information from their hidden state. We then provide a scheme for storing a small number of bits in order to allow perfect reversal with forgetting. Our method achieves comparable performance to traditional models while reducing the activation memory cost by a factor of 10--15. We extend our technique to attention-based sequence-to-sequence models, where it maintains performance while reducing activation memory cost by a factor of 5--10 in the encoder, and a factor of 10--15 in the decoder.

Adversarial Posterior Distillation Framework

Adversarial Distillation of Bayesian Neural Network Posteriors

Kuan-Chieh Wang, Paul Vicol, James Lucas, Li Gu, Roger Grosse, Richard Zemel

Abstract: Bayesian neural networks (BNNs) allow us to reason about uncertainty in a principled way. Stochastic Gradient Langevin Dynamics (SGLD) enables efficient BNN learning by drawing samples from the BNN posterior using mini-batches. However, SGLD and its extensions require storage of many copies of the model parameters, a potentially prohibitive cost, especially for large neural networks. We propose a framework, Adversarial Posterior Distillation, to distill the SGLD samples using a Generative Adversarial Network (GAN). At test-time, samples are generated by the GAN. We show that this distillation framework incurs no loss in performance on recent BNN applications including anomaly detection, active learning, and defense against adversarial attacks. By construction, our framework distills not only the Bayesian predictive distribution, but the posterior itself. This allows one to compute quantities such as the approximate model variance, which is useful in downstream tasks. To our knowledge, these are the first results applying MCMC-based BNNs to the aforementioned applications.

Flipout LSTM Variance Reduction

Flipout: Efficient Pseudo-Independent Weight Perturbations on Mini-Batches

Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, Roger Grosse

Abstract: Stochastic neural net weights are used in a variety of contexts, including regularization, Bayesian neural nets, exploration in reinforcement learning, and evolution strategies. Unfortunately, due to the large number of weights, all the examples in a mini-batch typically share the same weight perturbation, thereby limiting the variance reduction effect of large mini-batches. We introduce flipout, an efficient method for decorrelating the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example. Empirically, flipout achieves the ideal linear variance reduction for fully connected networks, convolutional networks, and RNNs. We find significant speedups in training neural networks with multiplicative Gaussian perturbations. We show that flipout is effective at regularizing LSTMs, and outperforms previous methods. Flipout also enables us to vectorize evolution strategies: in our experiments, a single GPU with flipout can handle the same throughput as at least 40 CPU cores using existing methods, equivalent to a factor-of-4 cost reduction on Amazon Web Services.


MovieGraphs: Towards Understanding Human-Centric Situations from Videos

Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler

Abstract: There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed graph-based annotations of social situations de- picted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (i.e., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with time stamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We propose a method for querying videos and text with graphs, and show that: 1) our graphs contain rich and sufficient information to summarize and localize each scene; and 2) subgraphs allow us to describe situations at an abstract level and retrieve multiple semantically relevant situations. We also propose methods for interaction understanding via ordering, and reasoning about the social scene. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents.

An example model graph

A Minimization-Based Approach to Iterated Multi-Agent Belief Change

Paul Vicol, James Delgrande, Torsten Schaub

Abstract: We investigate minimization-based approaches to iterated belief change in multi-agent systems. A network of agents is represented by an undirected graph, where propositional formulas are associated with vertices. Information is shared between vertices via a procedure where each vertex minimizes disagreement with other vertices in the graph. Each iterative approach takes into account the proximity between vertices, with the underlying assumption that information from nearby sources is given higher priority than information from more distant sources. We have identified two main approaches to iteration: in the first approach, a vertex takes into account the information at its immediate neighbours only, and information from more distant vertices is propagated via iteration; in the second approach, a vertex first takes into account information from distance-1 neighbours, then from distance-2 neighbours, and so on, in a prioritized fashion. There prove to be three distinct ways to define the second approach, so in total we have four types of iteration. We define these types formally, find relationships between them, and investigate their basic logical properties. We also implemented the approaches in a software system called Equibel.

Equibel System Architecture

An Implementation of Consistency-Based Multi-Agent Belief Change using ASP

Paul Vicol, James Delgrande, Torsten Schaub

Abstract: This paper presents an implementation of a general framework for consistency-based belief change using Answer Set Programming (ASP). We describe Equibel, a software system for working with belief change operations on arbitrary graph topologies. The system has an ASP component that performs a core maximization procedure, and a Python component that performs additional processing on the output of the ASP solver. The Python component also provides an interactive interface that allows users to create a graph, set formulas at nodes, perform belief change operations, and query the resulting graph.


Seminar Presentations

Click the links for slides!


Equibel Logo


A Python toolkit for equivalence-based belief change.

Examples flower images from all near/far classes

The Delicate Art of Flower Classification

Abstract: In this paper, we study the feasibility of identifying flowers in real-world Flickr photos. Established datasets for flower recognition contain only close-up, centered images of flowers, which are not representative of the large variety found on Flickr. We introduce a new dataset of close-up images that has greater variation in the orientation, shape, colour, and lighting than existing datasets. We also introduce a new dataset that contains both close-up and far-away images of certain species of flowers. We show that it is possible to identify fields of flowers, as well as individual flowers, in images downloaded from Flickr. This could provide a way to mine Flickr for information about nature, that could impact our understanding of the consequences of climate change.


A solution to a graph coloring problem

An Introduction to Answer Set Programming

Paul Vicol

Guest Lecture for Artificial Intelligence Survey (CMPT 310)

Abstract: Answer Set Programming (ASP) is a declarative logic programming paradigm tailored to solve NP-hard combinatorial search problems. ASP separates problem specification from solving by combining a rich modeling language with a general-purpose, high-performance solver. This allows it to solve a wide variety of search problems in a uniform way. This talk introduces ASP with a focus on practical problem solving. By looking at classic problems like graph colouring and n-queens, we see how to model problems in the language of ASP and use off-the-shelf tools to obtain solutions. Answer Set Programming is a powerful paradigm that has been used in applications ranging from robotics to music composition and genomics. This talk shows you how to get started with ASP, so that you can apply it to your own problems!

Latest from the Blog

A Glimpse of My Research in Multi-Agent Belief Change

A look at my recent work in comparing different forms of iterated belief change in multi-agent systems.

Read more!

Recommended from the Bookshelf


by Michael Sandel
Illustrates philosophical concepts with real examples from today's world. Makes it fun to learn the differences between categorical and consequentialist moral reasoning. One of my favorite books ever! You can also watch Sandel's Harvard lectures online!