
It's interesting to compare the issues that Shapiro found with the issues that we found during the development of Rust. It turns out that we ran into many of the same issues. Since Go is being mentioned here too, I'll compare the way Go dealt with the issues as well.

(0) Objects and inheritance: Rust had support for objects (via prototype-based inheritance) in the first version of the compiler, but we found that they weren't being used. We attempted to be as minimalist and simple as possible regarding objects, and as a result we ended up with a system that didn't have enough features to be useful. It also didn't fit well with the rest of the language, so we scrapped it and added typeclasses instead. Those worked a lot better, and now most of our standard library is using typeclasses for OO-like patterns. Recently, we've found that we really do want objects, but mostly as a way to achieve code reuse, privacy, and a more direct association between a type and its methods than typeclasses alone provide. The current system that is being implemented is a nice way, in my opinion, to unify typeclasses and object-oriented programming. There are just a few concepts to learn, and it all meshes together quite nicely.

Go's interface system is quite similar to Rust's typeclass system. The main things that Rust has that Go doesn't are first-class constructors, the trait system (not yet implemented) to facilitate method reuse, and the ability to define multiple implementations of an interface per type. The things that Go has that Rust doesn't are duck typing (which is a quite good idea, and I'd like to add it to Rust as well) and downcasting (which we don't want to support because we have more type-safe mechanisms for the same thing).

(1) The compilation model: Rust uses dynamic linking pervasively, because OSes have good support for it and it helps keep binaries small. It also has strong support for separate compilation, because we want to make compile times fast. So far, so good, but, just like BitC did, we discovered that type abstraction (which in Rust you achieve with generics) doesn't mix well with separate compilation. We didn't want to have a uniform value representation like the MLs do (supporting only 31-bit ints and boxing everything else doesn't fly in a systems language), so we tried to use dynamic size calculations for all of the values. It resulted in a huge amount of complexity (we never shook out all the bugs), and it also had a large runtime performance penalty. Unlike C#, we couldn't fall back on a JIT, because Rust is an ahead-of-time-compiled language. So we moved to a "monomorphization" scheme for Rust 0.2, which is basically like C++ template instantiation, only without the overhead of reparsing all the code from scratch. Even with this scheme, you only pay for monomorphization when you use generics, you can still dynamically link all non-generic code (which is most of it), and your runtime performance is unaffected by your use of generics.

Go, of course, doesn't have generics. I don't personally believe that buys them much though; the programmer ends up working around it in a way that either involves boxing (via interface{}) or by-hand monomorphization (by duplicating the code for each type). To me, generics are just a way for the compiler to do work the programmer would end up having to do anyway.
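To make that trade-off concrete, here's a rough analogy in C (my own sketch, nothing from the Rust or Go toolchains): the void* version plays the role of boxing through interface{}, the per-type copies are by-hand monomorphization, and generics just mean the compiler generates the second form for you from a single definition.

    #include <string.h>

    /* "Boxing": one function body, but every call goes through untyped
       pointers and a runtime size, and the element type isn't checked. */
    void swap_boxed(void *a, void *b, void *tmp, size_t size) {
        memcpy(tmp, a, size);
        memcpy(a, b, size);
        memcpy(b, tmp, size);
    }

    /* "By-hand monomorphization": one copy per type, fully typed and
       inlinable -- the code a compiler with generics would emit for you. */
    void swap_int(int *a, int *b)          { int t = *a; *a = *b; *b = t; }
    void swap_double(double *a, double *b) { double t = *a; *a = *b; *b = t; }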

(2) Insufficiency of the type system regarding reference and by-reference types. It's spooky to read this, because it's precisely the problem we ran into with Rust. At the moment we have by-value and by-reference modes for parameters, and we've found that this isn't sufficiently expressive. (We also tried making the only difference between by-value and by-immutable-reference internal to the compiler, which didn't work well due to higher-order functions and interoperability with C code.) We also found that parameters really aren't the only place you want by-reference types; you really want to be able to return references and place them within other data structures. Whenever we wanted to do this, we had to fall back onto heap allocation, and that was significantly hurting our performance, especially when unique types were involved (since aliasing a unique type is impossible, you have to copy it). Profiling Rust programs showed an alarming amount of time spent in malloc and free. So we're in the process of bringing up a new regions system that I'm excited about: it's too early to say for sure, but I think we've stumbled upon a way to make regions not require a doctorate in type theory to understand. Regions allow you to have safe pointers into the stack and into the heap and pass them around as first-class values.

Go doesn't have zero-cost reference types at all; it just does simple escape analysis to allocate structures on the stack when it can and falls back on tracing GC for the rest (note that this is what Java does nowadays too). This is one of the most significant differences between Go and Rust; Go's memory model is essentially identical to that of Java, plus the ability to allocate structures inside other structures, while Rust has a much more C++-like memory model (but safe, unlike C++). This decision is based on our experience with Firefox; fine-grained control over memory use is so important that we didn't want to place our bets on pervasive use of GC.

(3) Inheritance and encapsulation: Rust still has no concept of inheritance; it's our hope that a combination of enums (datatypes like in Haskell or case classes from Scala) and traits will allow us to avoid introducing the complexity of inheritance into the language. Time will tell, of course. As for encapsulation, we thought we didn't need it, but it turns out that we really did want private fields. This we're solving with the class system, mentioned above.

Go achieves inheritance through anonymous fields. Anonymous fields are multiple inheritance in all but name, complete with the "dreaded diamond" problems of C++. We were hoping to avoid that. Go has standard privacy through "unexported fields".

(4) Instance coherence. Since you can define multiple typeclass implementations for a given type, and the caller chooses which implementation to use, you have to make sure in many contexts (for example, the hash method of a hash table) that the same implementation gets used for the same data. That's one of the reasons we introduced classes -- they tie implementations of methods to a type.

Go doesn't have this problem, because it only permits one implementation of an interface for each type, and it has to be defined in the same module as the type. Basically, Go's interfaces are much like our classes in that regard. We wanted to allow people to add extra methods to types -- for example, to add extension methods on vectors (think what Rails does to the Ruby standard library, but in a way that doesn't involve monkey patching) -- so we didn't want to force this restriction on users.

I think that one of the most important things to underscore is that we would never have found these things so early if we hadn't written the Rust compiler in Rust itself. It forces us to use the language constantly, and we quickly find pain points. I highly encourage all language designers to do the same; it's a great way to find and shake out design issues early.


OK, I got nerd-sniped here. You can actually construct the indices for the shuffle fairly easily with PEXT. Basically, you have 6 64-bit masks, each corresponding to a different bit of the index of each byte in the 64-byte vector. So for mask 0, a bit is set in the mask if its index has bit (1 << 0) set; mask 1 has the same but for bit (1 << 1); etc. The masks have a simple pattern that alternates between 1 and 0 bits every (1 << i) bits. So for 3 bits the masks would be: 10101010, 11001100, 11110000.

These masks are then extracted with PEXT for all the non-whitespace bytes. What this does is build up, bit by bit, the byte index of every non-whitespace byte, compressed down to the least-significant end, without the indices of whitespace bytes.

I wasn't actually able to run this code, since I don't have an AVX-512 machine, but I'm pretty sure it should be faster. I put the code on github if anyone wants to try: https://github.com/zwegner/toys/blob/master/avx512-remove-sp...

    const uint64_t index_masks[6] = {
        0xaaaaaaaaaaaaaaaa,
        0xcccccccccccccccc,
        0xf0f0f0f0f0f0f0f0,
        0xff00ff00ff00ff00,
        0xffff0000ffff0000,
        0xffffffff00000000,
    };
    const __m512i index_bits[6] = {
        _mm512_set1_epi8(1),
        _mm512_set1_epi8(2),
        _mm512_set1_epi8(4),
        _mm512_set1_epi8(8),
        _mm512_set1_epi8(16),
        _mm512_set1_epi8(32),
    };

  ...later, inside the loop:

    mask = ~mask;
    __m512i indices = _mm512_set1_epi8(0);
    for (size_t index = 0; index < 6; index++) {
        uint64_t m = _pext_u64(index_masks[index], mask);
        indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
    }
    output = _mm512_permutexvar_epi8(indices, input);
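
For anyone who wants to see it in context, here's my guess at a minimal driver loop around the snippet above; the space-only whitespace test, the full-width store plus popcount bookkeeping, and the function name are assumptions on my part, not part of the original code.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    // Assumed driver: strips ASCII spaces from each full 64-byte block of `in`.
    // Needs AVX-512BW, AVX-512VBMI (_mm512_permutexvar_epi8) and BMI2 (_pext_u64).
    // The output buffer must have 64 bytes of slack past the last kept byte;
    // the tail (< 64 bytes) is not handled here.
    size_t remove_spaces(const uint8_t *in, size_t n, uint8_t *out) {
        const uint64_t index_masks[6] = {       // same tables as above
            0xaaaaaaaaaaaaaaaa, 0xcccccccccccccccc, 0xf0f0f0f0f0f0f0f0,
            0xff00ff00ff00ff00, 0xffff0000ffff0000, 0xffffffff00000000,
        };
        const __m512i index_bits[6] = {
            _mm512_set1_epi8(1), _mm512_set1_epi8(2),  _mm512_set1_epi8(4),
            _mm512_set1_epi8(8), _mm512_set1_epi8(16), _mm512_set1_epi8(32),
        };
        size_t written = 0;
        for (size_t i = 0; i + 64 <= n; i += 64) {
            __m512i input = _mm512_loadu_si512(in + i);
            // One bit per byte, set where the byte is whitespace (just ' ' here).
            uint64_t mask = _mm512_cmpeq_epi8_mask(input, _mm512_set1_epi8(' '));

            mask = ~mask;                       // keep the non-whitespace bytes
            __m512i indices = _mm512_set1_epi8(0);
            for (size_t index = 0; index < 6; index++) {
                // Build each bit of the byte indices only at the kept positions.
                uint64_t m = _pext_u64(index_masks[index], mask);
                indices = _mm512_mask_add_epi8(indices, m, indices, index_bits[index]);
            }
            __m512i output = _mm512_permutexvar_epi8(indices, input);

            // Store all 64 bytes and advance by the number of kept bytes.
            _mm512_storeu_si512(out + written, output);
            written += (size_t)_mm_popcnt_u64(mask);
        }
        return written;
    }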

I'm not sure if anyone is still reading this thread, but I wrote up some thoughts on the differences between Sequoia and Legion and some of the lessons we learned. Maybe this is a little too much "inside baseball", but hopefully some will enjoy it.

Sequoia was basically an answer to the question, "what could a compiler do for you if it knew everything about your program?" And I do mean everything: in addition to your source code, Sequoia had to be given, at compile time, the sizes of every input array as well as the exact layout of the machine it was to execute on. While these restrictions were lifted to some degree later on, Sequoia basically had to be able to statically compute every task that it would execute along with the exact sizes of all inputs and outputs. And with this information it was able to do some pretty [fantastic optimizations](http://theory.stanford.edu/~aiken/publications/papers/ppopp0...). I believe, though I can't find the reference at the moment, that the compiler was able to generate (from completely hardware agnostic source code) implementations of BLAS routines competitive with Intel MKL. This was a pretty mind-blowing achievement, and to this day I'm not aware of any work that has really been able to equal it.

Requiring programs to be fully static also had a dramatic cost, one which it turned out not to be so easy to lift. In what I believe was the [final paper to be published on Sequoia](http://theory.stanford.edu/~aiken/publications/papers/ppopp1...), Mike Bauer and others showed how to extend Sequoia to support limited forms of dynamic behavior. But the solution wasn't all that satisfying and the work to get there was fairly excruciating. Instead of continuing to work with the language, Mike ended up switching gears to start a new project entirely, Legion.

As an aside, one of the other lessons learned from the Sequoia project is that research projects tend to be tied to the Ph.D. students who work on them. In the Sequoia project, there was a gap in continuity between the first and second generations of students to work on the project. This meant that second-generation students like Mike basically ended up picking up a big legacy code base on their own. This turned out to be a huge drain on productivity and was among the reasons that Sequoia was eventually abandoned.

The biggest decision made about Legion up front was that it had to be dynamic. This was a huge risk because the analysis performed by Sequoia simply could not have been performed at runtime with reasonable cost. Sequoia relied on knowing every single task that would ever execute and the exact bounds of every array input and output. This meant that Sequoia could do perfect alias analysis on these input and output arrays. Because Legion would be a dynamic system, any analysis would have to be performed during the execution of the program. There was no way we were going to do this in Legion without fundamentally changing the abstractions of the programming model, and frankly it wasn't obvious up front if this would even be possible to make tractable.

One of the central insights of Legion was to [explicitly reify the notion of data partitioning](http://legion.stanford.edu/pdfs/oopsla2013.pdf) in the programming model. Sequoia didn't need this because it simply knew the inputs and outputs to every task. Legion couldn't afford to pay the $O(N \log N)$ cost to perform this analysis at runtime. By lifting partitioning into a first-class primitive, we were able to radically reduce the runtime cost of this analysis by leveraging what the user already knows about their data partitioning. Originally this required users to declare certain properties of their partitions, such as disjointness, in order to make things efficient. Eventually we discovered a [partitioning sublanguage](http://legion.stanford.edu/pdfs/dpl2016.pdf) which allowed us to statically discover most of these important properties, and to provide many static guarantees even in the presence of data-dependent behavior.

The other big insight of Legion was what we call index tasks. Index tasks are basically collections of tasks that can be described in $O(1)$ space. This turns out to be essential if you want to scale your system to very large distributed machines. Any representation which is $O(N)$ inevitably becomes a scalability bottleneck, either because of memory-capacity constraints or because of runtime complexity. By making the index task representation $O(1)$ we were able to design an [optimization to keep the runtime analysis complexity $O(1)$ as well](http://legion.stanford.edu/pdfs/cr2017.pdf).
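
As a generic illustration of the idea (this is not Legion's actual API, just the shape of it): instead of materializing N task records, an index launch can be described by a single constant-size descriptor plus a rule that maps each point to the piece of data it touches.

    #include <stdint.h>

    /* An O(N) launch stores one record per task. An O(1) index launch stores
       one descriptor: which task to run, the index space of points, and a
       projection from a point to the piece of partitioned data it uses. */
    typedef struct {
        uint64_t task_id;                         /* which task body to run   */
        uint64_t lower, upper;                    /* index space [lower, upper) */
        uint64_t (*projection)(uint64_t point);   /* point -> data piece      */
    } index_launch_desc;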

One consequence of being a dynamic runtime system is that Legion itself isn't able to optimize the low-level code executed by the system. It seems to be a general principle that the closer you are to the hardware, the more static you need to be. Conceptually, Legion solves this problem by splitting it into two parts: the kernels, or low-level code that has to execute as efficiently as possible on the processor, and the orchestration of the parallelism that exists between those kernels. Legion focuses nearly entirely on the upper layer, leaving the lower layer to be handled by programmers. We've recovered some of this capability via the [Regent programming language](http://regent-lang.org/) which sits on top of Legion, but we haven't taken this nearly as far as Sequoia did.

Despite being a dynamic runtime system, we still put a lot of effort into making sure that it had a principled foundation. Legion is one of the few runtime systems I'm aware of that has its own [type system and operational semantics](http://legion.stanford.edu/pdfs/oopsla2013.pdf). This also made it relatively easy to develop Regent.

With Regent in some ways we've now come full circle, since we're back to having a programming language again. However, one key difference between Regent and Sequoia is what happens when the compilers run into behavior they can't analyze. In Sequoia, it's of the utmost importance that the compiler understand your program, because if the compiler can't prove that parallelism is safe, then it will serialize that part of the program. In contrast, if Regent can't determine that parallelism is safe, it will simply fall back to the runtime system. This means that in general Regent suffers from fewer performance cliffs, or places where performance can suddenly (and sometimes inexplicably) degrade.

There were some lessons we failed to learn from Sequoia. Sequoia made big use of [graph-based IRs](http://theory.stanford.edu/~aiken/publications/papers/ppopp0...) for its compiler. This was a design I copied when developing the [RDIR format](https://github.com/StanfordLegion/rdir) for Regent, a decision I now consider to be a mistake. Graph-based IRs are superficially appealing because they seem to represent the fundamental invariants that you want to reason about. That is, if two tasks are mutually unreachable in the IR, then they are provably safe to execute in parallel. However, working with the graph-based IR as a representation of the program turns out to be a huge pain. RDIR has been by far the biggest source of bugs in Regent and continues to be the largest maintenance burden. The other big motivation for a graph-based IR, that it makes certain aliasing properties of the program more explicit than a traditional CFG, is something that I now think could be done adequately well in more traditional formats.

One final note about hardware support: Sequoia made a big bet on the [Roadrunner supercomputer](https://en.wikipedia.org/wiki/IBM_Roadrunner). Roadrunner was an excellent match for Sequoia's programming model, because it used scratchpads (instead of caches) for the compute-heavy processing units. This meant that data (and even instructions!) had to be explicitly copied in and out of the memories of the individual processors, something which Sequoia excelled at. Unfortunately, Roadrunner ended up not being so representative of the machines that would follow, and the effort required to retool all the infrastructure ended up being a big drain on the project. Though it would not surprise anyone if these ideas ended up resurfacing in the future, a project like this simply can't afford not to work on contemporary hardware.

In contrast, Legion made its bet on GPUs and heterogeneous architectures more generally. So far this appears to be turning out well and has positioned us to work well on architectures such as [Summit](https://en.wikipedia.org/wiki/Summit_(supercomputer)). However, investment in technologies such as LLVM means that we're also in a better position to adapt if future technologies end up going in a different direction.


There was a whole course (notes for which are here [1]) dedicated to this topic, by Prof. Jeremy Siek.

[1] ecee.colorado.edu/ecen4553/fall09/notes.pdf


This is a great question, and it's one that isn't really made clear anywhere in the compiler books that teach it. The tl;dr is that you get lots of nice properties for free, and that it makes writing analysis algorithms much simpler with fewer edge cases.

For context, I was involved for a bit in writing this book, and I was at the conference that spawned it, and I wrote chapter 25. That said, it's been a few years and my memory is poor, so I may have some details below wrong. I'm basing some of this on a fun paper I wrote (http://paulbiggar.com/research/fit-2009.pdf) and stealing directly from my PhD thesis (http://paulbiggar.com/research/#phd-dissertation, section 3.3)

Factored use-def chain: With a def-use chain, dataflow results may be propagated directly from the assignment of a variable to all of its uses. However, a def-use chain requires an edge from each definition to each use, which may be expensive when a program has many definitions and uses of the same variable. In practice, this occurs in the presence of switch-statements. SSA factors the def-use chain over a φ-node, avoiding this pathological case.

OK, English version: in program analysis, you want to know the value of a variable, so you create a "def-use chain", which connects a definition of a variable to all of its uses. In some cases, you might have very many uses and definitions, especially of the same variable. This happens especially with switch-statements. In SSA form, each variable is only defined once, so you don't get pathological cases here.

Flow-sensitivity: A flow-insensitive algorithm performed on an SSA form is much more precise than if SSA form were not used. The flow-insensitive problem of multiple definitions to the same variable is solved by the single assignment property. This allows a flow-insensitive algorithm to approach the precision of a flow-sensitive algorithm.

English version: if you put something in SSA form, you can connect the use of a value directly to its definition (since it's only defined once, with one value). Without SSA, you might have more definitions of that variable, and so not know its exact value, type, etc., when you go to use it. Without SSA form, to get that data you'd have to analyse the flow of the program (look at loops and if statements, etc.). With SSA form, you can ignore those and get a similar level of precision.
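
A tiny made-up example showing both points at once: the phi node factors the def-use edges, and single assignment is what lets a flow-insensitive analysis stay precise.

    /* Original C: two definitions of x reach the use, so a flow-insensitive
       analysis only knows "x is 1 or 2" everywhere. */
    int f(int c) {
        int x;
        if (c) x = 1;
        else   x = 2;
        return x + 1;
    }

    /* The same function in SSA form (pseudocode, not valid C):
           x1 = 1
           x2 = 2
           x3 = phi(x1, x2)    ; one phi factors the 2-defs-by-1-use edges
           return x3 + 1
       Each name has exactly one definition, so "what is x1?" has a single
       answer without tracing control flow; only x3 needs the merge. */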

Memory usage: Without SSA form, an analysis must store information for every variable at every program point. SSA form allows a sparse analysis, where an analysis must store information only for every assignment in the program. With a unique version per assignment, the memory usage of storing the results of an analysis can be considerably lower than using bit-vector or set-based approaches.

In English: When storing the results of your analysis, you can save them much more cheaply in SSA form, from a CPU and memory perspective. Without SSA, you need a hashtable of variables, with a list of values for each. In SSA form, you only need one value, not a list of values, for each variable. You can also refer to single-assignment variables by number rather than by name, and so use simple arrays rather than hashtables to store the results.
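
A rough sketch of what that means for the data structures (my illustration, not from the chapter): without SSA you pay for a fact per variable per program point; with SSA you can index one fact per definition.

    #include <stdint.h>
    #include <stdlib.h>

    typedef uint32_t fact;   /* stand-in for a lattice value (constant, type, ...) */

    /* Without SSA (dense): one fact per variable at every program point. */
    fact *alloc_dense(size_t num_points, size_t num_vars) {
        return calloc(num_points * num_vars, sizeof(fact));
    }

    /* With SSA (sparse): one fact per definition, indexed by SSA number,
       since a name can never be overwritten after its single assignment. */
    fact *alloc_sparse(size_t num_ssa_names) {
        return calloc(num_ssa_names, sizeof(fact));
    }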


It's also much easier to decode x86 instructions when you look at them in octal instead of the hexadecimal that most tables use, since both the main opcode map and ModRM/SIB are organised in a 2-3-3 layout:

http://reocities.com/SiliconValley/heights/7052/opcode.txt

The 8080/8085/Z80 instruction sets also look much better in octal:

http://www.z80.info/decoding.htm
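
To see why the 2-3-3 layout maps so nicely onto octal, here's a small standalone helper (hypothetical, not from either link) that splits a byte into its fields:

    #include <stdint.h>
    #include <stdio.h>

    /* x86 opcode bytes and the ModRM/SIB bytes are laid out as 2-3-3 bit
       fields, which line up exactly with three octal digits. */
    void print_233(uint8_t b) {
        unsigned hi  = (b >> 6) & 0x3;   /* top 2 bits: first octal digit  */
        unsigned mid = (b >> 3) & 0x7;   /* middle 3 bits: e.g. reg/opcode */
        unsigned lo  = b & 0x7;          /* low 3 bits: e.g. r/m           */
        printf("0%03o -> %u.%u.%u\n", (unsigned)b, hi, mid, lo);
    }

    int main(void) {
        print_233(0x89);   /* MOV r/m32, r32 is octal 0211                        */
        print_233(0xC3);   /* as a ModRM byte: mod=3, reg=0 (eax), rm=3 (ebx),
                              so "89 C3" is mov ebx, eax                          */
        return 0;
    }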

