Distributed systems and middleware are at the epicenter of large-scale data centers, cloud computing infrastructures, scalable web services, and data analytics. The 12th Workshop on Large Scale Distributed Systems and Middleware (LADIS) aims to bring together a select group of researchers and professionals in the field to surface their work in an engaging virtual workshop atmosphere. Join us!

LADIS 2021 will take place in conjunction with EuroSys 2021 via Zoom on Monday, April 26th, 2021, from 16:00 to 21:00 BST (11am to 4pm ET).

PC Chairs: Idit Keidar and Peter Pietzuch
General Chairs: Gregory Chockler and Ymir Vigfusson

The program is on Sched and below. Sign up for EuroSys’21 free of charge here.

You can find the YouTube live stream of LADIS’21 here.


LADIS 2021 Program

Schedule (all times are in British Summer Time).

16:00 – 16:50 Session 1: Blockchain scalability

  • Yossi Gilad (HUJI): Scaling Byzantine Agreements for Cryptocurrencies.
  • Emin Gün Sirer (Cornell/Ava Labs): The Avalanche network and the Internet of Finance being built upon it.
  • Dahlia Malkhi (Diem Association and Novi Financial): Twins — Production BFT Systems Made Robust.

16:50 – 17:00 Break
17:00 – 17:50 Keynote (Joint with PaPoC)

  • Peter Alvaro (UCSC): What not where: Sharing in a world of distributed, persistent memory

17:50 – 18:00 Break
18:00 – 19:15 Session 2: Large-scale ML & graph processing

  • Marco Canini (KAUST): Accelerated Deep Learning via Efficient, Compressed and Managed Communication.
  • Keval Vora (SFU): Efficient Large-Scale Graph Analytics.
  • Matteo Interlandi (Microsoft): Using Tensor Computations for Traditional Machine Learning Prediction Serving and Beyond.
  • Alexey Tumanov (GaTech): Unified Distributed Dataflow in Ray.

19:15 – 19:45 Open discussions/hallway conversations/break
19:45 – 20:45 Session 3: Scalable and consistent data processing and storage

  • Natacha Crooks (UC Berkeley): Obladi: Oblivious Transactions in the Cloud.
  • Mahesh Balakrishnan (Facebook): Virtual Consensus in the Delos Storage System.
  • Kyle Kingsbury (Jepsen): Everything Is Broken.

20:45 – 21:00 Closing



Detailed program

16:00 – 16:50 Session 1: Blockchain scalability

Yossi Gilad: Scaling Byzantine Agreements for Cryptocurrencies.

Algorand is a new cryptocurrency that confirms over a thousand transactions per second with a latency of a few seconds. It ensures that users never have divergent views of confirmed transactions, even if some of the users are malicious and the network is temporarily partitioned. Algorand uses a new Byzantine Agreement (BA) protocol to reach consensus among millions of participants on the next set of transactions. To scale the consensus to many participants, Algorand uses a novel mechanism based on Verifiable Random Functions that allows users to privately check whether they are selected to participate in the BA to agree on the next set of transactions, and to include proof of their selection in their network messages. In Algorand’s BA protocol, participants do not keep any private state except for their private keys, which allows Algorand to replace participants immediately after they send a message. This mitigates targeted attacks on chosen participants after their identity is revealed.
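
As a rough illustration of the private self-selection step described above, the sketch below uses a hash of a per-user secret as a stand-in for a real Verifiable Random Function and a Boolean selection check rather than Algorand's per-unit-of-currency (binomial) sortition; it is only meant to show how selection probability can be tied to stake.

```python
# A minimal sketch of VRF-based self-selection ("cryptographic sortition").
# Assumption: a hash of a per-user secret stands in for a real VRF; Algorand
# uses an actual VRF so the selection proof can be verified by other users.
import hashlib

def sortition(secret_key: bytes, seed: bytes, role: bytes,
              user_stake: int, total_stake: int, committee_size: int) -> bool:
    """Return True if this user privately selects itself for the given role."""
    # Pseudo-random value in [0, 1) derived from the user's secret and the public seed.
    digest = hashlib.sha256(secret_key + seed + role).digest()
    value = int.from_bytes(digest, "big") / float(1 << 256)
    # Selection probability grows with the user's share of the total stake.
    threshold = committee_size * user_stake / total_stake
    return value < min(threshold, 1.0)

# Example: a small stakeholder checks whether it sits on a 1,000-member
# committee for one step of the agreement protocol in round 42.
selected = sortition(b"my-secret-key", b"round-42-seed", b"ba-step-1",
                     user_stake=10, total_stake=1_000_000,
                     committee_size=1_000)
print("selected for this step:", selected)
```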

Yossi Gilad is a Harry & Abe Sherman Senior Lecturer at the Hebrew University of Jerusalem. His research focuses on designing, building, and analyzing secure and scalable protocols and networked systems. Prior to joining the Hebrew University, he was a postdoctoral researcher at MIT and Boston University.

Emin Gün Sirer: The Avalanche network and the Internet of Finance being built upon it

Before Ethereum, many would have scoffed at the very notion of decentralized finance. Now, it’s one of the most hyped areas of fintech. Ava Labs is building upon that momentum with a breakthrough in consensus protocols – Avalanche – to usher in a new era of finance defined by velocity, efficient use of capital, security against bad actors, and preservation of network value. This talk will focus on the Avalanche network and the Internet of Finance being built upon it.

Emin Gün Sirer is Co-founder and CEO at Ava Labs and a Professor of Computer Science at Cornell University, where his research focuses on operating systems, networking, and distributed systems. He is well-known for having implemented the first currency that used Proof-of-Work (PoW) to mint coins, as well as for his research on selfish mining, characterizing the scale and centralization of existing cryptocurrencies, and proposing leading protocols for on-chain and off-chain scaling. He is the Co-Director of the Initiative for Cryptocurrencies and Smart Contracts (IC3), which aims to move blockchain-based applications from whiteboards and proofs-of-concept to tomorrow’s fast and reliable financial systems.

Dahlia Malkhi: Twins — Production BFT Systems Made Robust.

Twins is a principled strategy for effectuating Byzantine attack scenarios at scale in Byzantine Fault Tolerant (BFT) systems and examining their behavior. 

Dahlia Malkhi is the Chief Technology Officer at Diem Association, Lead Maintainer of the Diem project, and Lead Researcher at Novi. She has applied and foundational research interests in broad aspects of reliability and security of distributed systems. For over two decades, she has helped drive innovation in tech, notably as: co-inventor of HotStuff, co-founder and technical co-lead of VMware blockchain, co-inventor of Flexible Paxos, the technology behind LogDevice, creator and tech lead of CorfuDB, a database-less database driving VMware’s NSX-T distributed control plane (see the Corfu github repo), and co-inventor of the FairPlay project.
Dahlia joined the Diem (Libra) team in June 2019, first as a Lead Researcher at Novi and later as Chief Technology Officer at the Diem Association. In 2014, after the closing of the Microsoft Research Silicon Valley lab, she co-founded VMware Research and was a Principal Researcher at VMware until June 2019. From 2004 to 2014, she was a principal researcher at Microsoft Research Silicon Valley, and from 1999 to 2007 she served as a tenured associate professor at the Hebrew University of Jerusalem. From 1995 to 1999, she was a senior researcher at AT&T Labs in New Jersey. Dahlia holds Ph.D., M.S., and B.S. degrees in computer science from the Hebrew University of Jerusalem.

17:00 – 17:50 Keynote (Joint with PaPoC)

Peter Alvaro: What not where: Sharing in a world of distributed, persistent memory

A world of distributed, persistent memory is on its way. Our programming models traditionally operate on short-lived data representations tied to ephemeral contexts such as processes or computers. In the limit, however, data lifetime is infinite compared to these transient actors. We discuss the implications for programming models raised by a world of large and potentially persistent distributed memories, including the need for explicit, context-free, invariant data references. We present a novel operating system that uses wisdom from both storage and distributed systems to center the programming model around data as the primary citizen, and reflect on the transformative potential of this change for infrastructure and applications of the future. We focus in particular on the landscape of data sharing and the consequences of globally-addressable persistent memory on existing consistency models and mechanisms.
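
As a toy illustration of the "what, not where" idea, the sketch below names data by an invariant identifier that a resolver maps to its current location; the GlobalObjectStore class and its methods are hypothetical and are not the programming model or operating system described in the talk.

```python
# References name data by an invariant, context-free identifier rather than a
# machine address, so they remain meaningful across processes, reboots, and nodes.
import uuid

class GlobalObjectStore:
    """Hypothetical stand-in for a distributed, persistent object space."""
    def __init__(self):
        self._objects = {}

    def create(self, payload) -> str:
        oid = str(uuid.uuid4())          # invariant identifier: says *what*, not *where*
        self._objects[oid] = payload
        return oid

    def resolve(self, oid: str):
        # Resolution decides where the bytes currently live; the reference
        # itself carries no location or process context.
        return self._objects[oid]

store = GlobalObjectStore()
node_id = store.create({"value": 7, "next": None})
head_id = store.create({"value": 3, "next": node_id})   # link by identifier, not pointer
assert store.resolve(store.resolve(head_id)["next"])["value"] == 7
```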

Peter Alvaro is an Assistant Professor of Computer Science at the University of California Santa Cruz, where he leads the Disorderly Labs research group disorderlylabs.github.io. His research focuses on using data-centric languages and analysis techniques to build and reason about data-intensive distributed systems, in order to make them scalable, predictable and robust to the failures and nondeterminism endemic to large-scale distribution. Peter earned his PhD at UC Berkeley, where he studied with Joseph M. Hellerstein. He is a recipient of the NSF CAREER Award, the Facebook Research Award, the USENIX ATC Best Presentation Award, and the UCSC Excellence in Teaching Award.

18:00 – 19:15 Session 2: Large-scale ML & graph processing

Marco Canini: Accelerated Deep Learning via Efficient, Compressed and Managed Communication

Scaling deep learning to a large cluster of workers is challenging due to high communication overheads that data-parallelism entails. This talk describes our efforts to rein in distributed deep learning’s communication bottlenecks. We describe SwitchML, the state-of-the-art in-network aggregation system for collective communication using programmable network switches. We introduce OmniReduce, an efficient streaming aggregation system that exploits sparsity to maximize effective bandwidth use. We touch on our work to develop compressed gradient communication algorithms that perform efficiently and adapt to network conditions. Lastly, we take a broad look at the challenges to accelerated decentralized training in the federated learning setting where heterogeneity is an intrinsic property of the environment.
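
As a generic illustration of compressed, sparsity-exploiting gradient communication (not the SwitchML or OmniReduce designs, which aggregate inside the network), the sketch below sends only the top-k gradient entries each step and keeps the remainder in an error-feedback residual.

```python
# Top-k gradient sparsification with residual accumulation: each step sends
# k indices and k values instead of the full gradient vector.
import numpy as np

class TopKCompressor:
    def __init__(self, num_params: int, k: int):
        self.k = k
        self.residual = np.zeros(num_params)   # error-feedback buffer

    def compress(self, grad: np.ndarray):
        corrected = grad + self.residual
        idx = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        values = corrected[idx]
        # Whatever we did not send is carried over to the next iteration.
        self.residual = corrected.copy()
        self.residual[idx] = 0.0
        return idx, values

def decompress(idx, values, num_params):
    grad = np.zeros(num_params)
    grad[idx] = values
    return grad

# Example: transmit 1% of a 10,000-parameter gradient each step.
comp = TopKCompressor(num_params=10_000, k=100)
idx, vals = comp.compress(np.random.randn(10_000))
sparse_grad = decompress(idx, vals, 10_000)
```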

Marco Canini does not know what the next big thing will be. But he’s sure that our next-gen computing and networking infrastructure must be a viable platform for it and avoid stifling innovation. Marco’s research spans a number of areas in computer systems, including distributed systems, large-scale/cloud computing and computer networking with emphasis on programmable networks. His current focus is on designing better systems support for AI/ML and providing practical implementations deployable in the real world.
Marco is an associate professor in Computer Science at KAUST. Marco obtained his Ph.D. in computer science and engineering from the University of Genoa in 2009 after spending the last year as a visiting student at the University of Cambridge. He was a postdoctoral researcher at EPFL and a senior research scientist at Deutsche Telekom Innovation Labs & TU Berlin. Before joining KAUST, he was an assistant professor at UCLouvain. He also held positions at Intel, Microsoft and Google.

Keval Vora: Efficient Large-Scale Graph Analytics

With growing interest in analyzing large graph datasets, several graph analytics solutions have been developed over the past decade. In this talk, I will briefly summarize some of our past contributions for asynchronous graph processing, and then shift gears towards two of our recent works: GraphBolt and Peregrine. GraphBolt enables efficient processing of streaming graphs while guaranteeing BSP execution semantics, and our recent work called DZiG adds sparse incremental processing in GraphBolt which scales incremental processing to high graph mutation rates. Peregrine, on the other hand, is a system to perform fast graph pattern mining tasks. It exposes a pattern-based programming model and presents novel abstractions like “anti-edges” and “anti-vertices” that enable easily expressing structural constraints in patterns. Peregrine’s pattern-aware runtime extracts the semantics of patterns, and uses them to efficiently explore matches in the data graph such that it outperforms state-of-the-art distributed and single machine graph mining systems.
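
As a toy example of the kind of structural constraint an anti-edge expresses, the sketch below enumerates wedges whose endpoints must not be connected; the representation is hypothetical and far simpler than Peregrine's pattern-based programming model.

```python
# Match "open wedges" u-v-w: edges (u,v) and (v,w) exist, and an anti-edge
# constraint forbids an edge between the endpoints u and w.
from itertools import combinations

edges = {(1, 2), (2, 3), (3, 4), (2, 4), (4, 5)}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def open_wedges(adj):
    """Yield (u, center, w) where (u, center) and (center, w) are edges
    and the anti-edge (u, w) is absent from the data graph."""
    for center, neighbors in adj.items():
        for u, w in combinations(sorted(neighbors), 2):
            if w not in adj[u]:          # anti-edge: (u, w) must not exist
                yield (u, center, w)

print(list(open_wedges(adj)))   # e.g. (1, 2, 3), (1, 2, 4), ...
```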

Keval Vora is an Assistant Professor at the School of Computing Science at Simon Fraser University. He received his Ph.D. from the Department of Computer Science and Engineering at the University of California, Riverside where he was advised by Prof. Rajiv Gupta. He also worked with Prof. Harry Xu at the University of California, Irvine. His research lies at the intersection of Large-Scale Data Systems and Parallel/Distributed Computing. His current work addresses challenges involved in processing large-scale static and dynamic graphs across different processing environments.

Matteo Interlandi: Using Tensor Computations for Traditional Machine Learning Prediction Serving and Beyond

Machine Learning (ML) adoption in the enterprise requires simpler and more efficient software infrastructure. Model scoring, the process of obtaining prediction from a trained model over new data, is a primary contributor to infrastructure complexity and cost, as models are trained once but used many times.
Hummingbird is a novel approach to model scoring, which compiles featurization operators and traditional ML models (e.g., decision trees) into a small set of tensor operations. This approach inherently reduces infrastructure complexity and directly leverages existing investments in Neural Networks’ compilers and runtimes to generate efficient computations for both CPU and hardware accelerators. Hummingbird’s performance is competitive with, and sometimes exceeds, that of hand-crafted kernels, while enabling seamless end-to-end acceleration of ML pipelines. Hummingbird is open source, part of the PyTorch Ecosystem, and someone even mentioned it as one of the top 10 Python libraries of 2020! In this talk, I will give an overview of the Hummingbird project and discuss some new directions we are currently investigating.
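
To give a flavor of the underlying idea of evaluating a tree with tensor operations, the sketch below scores a hard-coded depth-2 decision tree over a whole batch using vectorized comparisons; Hummingbird's actual compilation strategies (such as its GEMM formulation) are more general and are not reproduced here.

```python
# Evaluate every internal node's comparison for a whole batch at once, then
# index into the leaves. Assumption: a complete binary tree of depth 2.
import numpy as np

# Internal nodes 0..2: (feature index, threshold); leaves ordered left to right.
features    = np.array([0, 1, 2])
thresholds  = np.array([0.5, 0.3, 0.7])
leaf_values = np.array([10.0, 20.0, 30.0, 40.0])

def predict(X: np.ndarray) -> np.ndarray:
    # One batched comparison per internal node: shape (batch, 3) of 0/1.
    decisions = (X[:, features] > thresholds).astype(int)
    # The root's decision picks a child; that child's decision picks the leaf.
    child = decisions[:, 0]                                    # 0 = left subtree, 1 = right
    leaf_bit = np.where(child == 0, decisions[:, 1], decisions[:, 2])
    leaf_index = 2 * child + leaf_bit
    return leaf_values[leaf_index]

X = np.array([[0.9, 0.1, 0.2],
              [0.1, 0.9, 0.9]])
print(predict(X))    # [30. 20.]: whole batch scored with tensor ops
```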

Matteo Interlandi is a Senior Scientist in the Gray Systems Lab (GSL) at Microsoft, working on scalable Machine Learning systems. Before Microsoft, he was a Postdoctoral Scholar in the CS Department at the University of California, Los Angeles, working on Big Data systems. Prior to joining UCLA, he was a researcher at the Qatar Computing Research Institute, and at the Institute for Human and Machine Cognition. He obtained his PhD in Computer Science from the University of Modena and Reggio Emilia.

Alexey Tumanov: Unified Distributed Dataflow in Ray

The next generation of AI applications will continuously interact with the environment and learn from these interactions. These applications impose new and demanding systems requirements, both in terms of performance and flexibility. In this talk, we consider these requirements and present Ray, a distributed system built to address them. Ray implements a unified interface that can express both task-parallel and actor-based computations, supported by a single dynamic execution engine. To meet the performance requirements, Ray employs a distributed scheduler and a distributed, fault-tolerant store to manage the system’s control state. In our experiments, we demonstrate scaling beyond 1.8 million tasks per second and better performance than existing specialized systems for several challenging reinforcement learning applications.
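
Ray's unified interface is easiest to see in a short example: tasks and actors share the same decorator and the same futures-based calling convention (this assumes Ray is installed, e.g. via pip install ray).

```python
import ray

ray.init()

@ray.remote
def square(x):                 # a task: stateless, schedulable anywhere
    return x * x

@ray.remote
class Counter:                 # an actor: stateful, pinned to one worker
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x
        return self.total

futures = [square.remote(i) for i in range(4)]   # returns object refs immediately
counter = Counter.remote()
ref = counter.add.remote(10)

print(ray.get(futures))        # [0, 1, 4, 9]
print(ray.get(ref))            # 10
```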

Alexey Tumanov has been a tenure-track Assistant Professor in the School of Computer Science at Georgia Tech since August 2019. He completed his postdoc at the University of California, Berkeley, working with Ion Stoica and Joseph Gonzalez, and his Ph.D. at Carnegie Mellon University, advised by Gregory Ganger. At Carnegie Mellon, he was honored with the prestigious NSERC Alexander Graham Bell Canada Graduate Scholarship (NSERC CGS-D3) and partially funded by the Intel Science and Technology Centre for Cloud Computing and the Parallel Data Lab industry consortium. His research interests predominantly lie in systems support for machine learning, including resource management for distributed machine learning frameworks and applications. He is currently working on distributed systems and scheduling algorithms for soft real-time machine learning inference and for co-scheduling ML inference and online training. This builds on his work at Carnegie Mellon on modeling, designing, and developing abstractions, primitives, algorithms, and systems for a general resource management framework with support for static and dynamic heterogeneity, hard and soft placement constraints, time-varying resource capacity guarantees, and combinatorial constraints in heterogeneous resource contexts.

19:45 – 20:45 Session 3: Scalable and consistent data processing and storage

Natacha Crooks: Obladi: Oblivious Transactions in the Cloud

Cloud storage offers applications the promise of failure-free, infinite scalability. Offloading data to an untrusted party raises privacy concerns if application data is sensitive. Studies have shown that encryption alone is not sufficient and that privacy leakage often occurs through access patterns. Current techniques that defend against such attacks offer restricted APIs, no fault-tolerance and poor performance. This talk presents Obladi, the first cloud-based database that instead supports traditional ACID transactions with good performance. Obladi’s core insight lies in observing that the flexibility inherent in the definition of serializability can be leveraged to amortize the costs of guaranteeing privacy. 
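
As a much-simplified illustration of the amortization idea (and not Obladi's actual protocol), the sketch below delays and batches operations into fixed-size, padded epochs so that untrusted storage observes the same-shaped request stream regardless of the workload; all class and function names here are hypothetical.

```python
# Buffer reads/writes and release them in uniform, padded epochs: deferring
# work within serializability's limits lets one access hide among many.
import random

EPOCH_SIZE = 8

class EpochBuffer:
    def __init__(self):
        self.pending = []

    def submit(self, op):
        """Queue a read/write; callers learn results only at epoch boundaries."""
        self.pending.append(op)
        if len(self.pending) == EPOCH_SIZE:
            self.flush()

    def flush(self):
        batch = list(self.pending)
        # Pad with dummy accesses so every epoch has identical size.
        while len(batch) < EPOCH_SIZE:
            batch.append(("dummy", None))
        random.shuffle(batch)                 # hide which requests are real
        send_to_untrusted_storage(batch)      # hypothetical transport call
        self.pending.clear()

def send_to_untrusted_storage(batch):
    print(f"epoch of {len(batch)} uniform requests")

buf = EpochBuffer()
for key in ["a", "b", "c"]:
    buf.submit(("read", key))
buf.flush()    # end of epoch: 3 real requests hidden among 5 dummies
```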

Natacha Crooks is an Assistant Professor at UC Berkeley, where she works on consistency, privacy and integrity in large-scale distributed systems. She obtained her PhD from UT Austin in 2019 and is a recipient of the Dennis Ritchie Doctoral Dissertation Award.

Mahesh Balakrishnan: Virtual Consensus in the Delos Storage System

Delos is a storage system at the bottom of the Facebook stack, operating as the backing store for the Twine scheduler and processing more than two billion transactions per day. Delos has a shared log design (based on Corfu), separating the consensus protocol from the database via a shared log API. In this talk, I’ll describe the two ways in which Delos advances the state of the art for replicated systems. First, we virtualize consensus by virtualizing the shared log: as a result, the system can switch between entirely different log implementations (i.e., consensus protocols) without downtime. Second, we virtualize the replicated state machine above the shared log, splitting the logic of the database into separate, stackable layers called log-structured protocols. Virtual consensus in Delos enables safe upgrades to the consensus protocol as well as the database above it, without downtime or complex migration logic.
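
As a rough sketch of what hiding consensus behind a log interface buys, the example below places a toy "loglet" behind an abstract append/read/check-tail interface so the state machine above it never depends on the concrete implementation; the method names are illustrative, not Delos's actual API.

```python
# The database above sees only the log interface, so the loglet underneath
# (i.e., the consensus implementation) can be swapped without touching it.
from abc import ABC, abstractmethod

class Loglet(ABC):
    @abstractmethod
    def append(self, entry: bytes) -> int: ...
    @abstractmethod
    def read(self, pos: int) -> bytes: ...
    @abstractmethod
    def check_tail(self) -> int: ...

class InMemoryLoglet(Loglet):
    """Stand-in for one concrete log implementation (e.g. a Paxos-backed log)."""
    def __init__(self):
        self._log = []
    def append(self, entry):
        self._log.append(entry)
        return len(self._log) - 1
    def read(self, pos):
        return self._log[pos]
    def check_tail(self):
        return len(self._log)

class ReplicatedTable:
    """The state machine above the log never learns which loglet is active."""
    def __init__(self, log: Loglet):
        self.log, self.state, self.applied = log, {}, 0
    def put(self, k, v):
        self.log.append(f"{k}={v}".encode())
    def get(self, k):
        # Learn any new log entries before serving the read.
        for pos in range(self.applied, self.log.check_tail()):
            key, val = self.log.read(pos).decode().split("=", 1)
            self.state[key] = val
        self.applied = self.log.check_tail()
        return self.state.get(k)

table = ReplicatedTable(InMemoryLoglet())   # swapping the loglet changes no table code
table.put("color", "blue")
print(table.get("color"))                   # blue
```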

Mahesh Balakrishnan leads the Delos project at Facebook. Prior to Facebook he was at Yale University, VMware Research, and Microsoft Research Silicon Valley. His work has received best paper awards at OSDI and ASPLOS.

Kyle Kingsbury: Everything Is Broken

Distributed systems often claim to store and process our data in a safe, fault-tolerant manner. However, practical experience suggests that simple faults (like network partitions and process crashes) can lead to significant safety violations. We’ll show how Jepsen, a distributed systems testing library, combines fault injection with sophisticated property-based testing to find consistency errors in real-world databases.
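
The recipe can be sketched in a few lines (the real Jepsen is a Clojure library with far more sophisticated operation generators and checkers): run concurrent clients against a system while injecting faults, record the acknowledged history, then check a correctness property over it. Everything below is simulated and hypothetical.

```python
# Concurrent increments against a flaky counter, followed by a simple
# history check: the final value must lie between the number of acknowledged
# increments and the number of attempted increments.
import random
import threading

class FlakyCounter:
    """Simulated service whose requests sometimes time out (injected fault)."""
    def __init__(self):
        self.value, self.lock = 0, threading.Lock()
    def incr(self):
        with self.lock:
            if random.random() < 0.1:
                raise TimeoutError("partitioned")   # fault injected before applying
            self.value += 1
            return True

def worker(counter, history):
    for _ in range(100):
        try:
            counter.incr()
            history.append("ok")
        except TimeoutError:
            history.append("timeout")               # outcome unknown, not an error

counter, history = FlakyCounter(), []
threads = [threading.Thread(target=worker, args=(counter, history)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

acked, attempted = history.count("ok"), len(history)
assert acked <= counter.value <= attempted, "counter lost or invented increments"
print(f"final={counter.value}, acked={acked}, attempted={attempted}")
```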

Kyle Kingsbury, a.k.a. “Aphyr”, is a computer safety researcher working as an independent consultant. He’s the author of the Riemann monitoring system, the Clojure from the Ground Up introduction to programming, and the Jepsen series on distributed systems correctness. He grills databases in the American Midwest.