
Great article, but I'm curious why automatic reference counting (ARC) and smart pointers never seemed to really catch on outside of Objective-C and Swift:

https://en.wikipedia.org/wiki/Automatic_Reference_Counting

https://en.wikipedia.org/wiki/Smart_pointer

They almost "just work", except for circular references:

https://en.wikipedia.org/wiki/Reference_count#Dealing_with_r...

I'd like to see some real-world studies on what percentage of unclaimed memory is taken up by orphaned circular references, because my gut feeling is that it's below 10%. That really makes me question why so much work has gone into various garbage collection schemes, nearly all of which suffer from performance problems (or can't be made realtime due to nondeterministic collection costs).
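To make the failure mode concrete, here's a minimal Swift sketch (hypothetical Parent/Child types) of the kind of orphaned cycle a plain refcount can't reclaim, and the usual weak-reference fix:

    class Parent {
        var child: Child?
        deinit { print("Parent freed") }
    }

    class Child {
        // Without `weak`, Parent and Child would retain each other and neither
        // refcount could ever reach zero once the outside references go away.
        weak var parent: Parent?
        deinit { print("Child freed") }
    }

    func demo() {
        let p = Parent()
        let c = Child()
        p.child = c
        c.parent = p
    }   // both deinits fire here; drop the `weak` and neither ever would

    demo()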

Also I can't prove it, but I think a major source of pain in garbage collection is mutability, which is exacerbated by our fascination with object-oriented programming. I'd like to see a solid comparison of garbage collection overhead between functional and imperative languages.

I feel like if we put the storage overhead of the reference count aside (which becomes less relevant as time goes on), then there should be some mathematical proof for how small the time cost can get with a tracing garbage collector or the cycle collection algorithm of Bacon and Paz.



> They almost "just work", except for circular references

Well, and they're often slower since they require mutating the reference count every single time a reference is stored. You can optimize that using lazy reference counting, or by not updating refcounts for references on the stack and instead scanning the stack before a collection. But at that point... you're halfway to implementing a tracing collector.

Every refcounting implementation eventually gets "optimized" to the point that it has most of a tracing collector hiding inside of it. I think it's simpler, cleaner, and faster to just start with tracing in the first place.

There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception and they would switch to tracing if they could. They can't because they exposed deterministic finalization to the user which means now their GC strategy is a language "feature" that they're stuck with.

That being said, ref-counting is fine for other kinds of resource management where resources are pretty coarse-grained and don't have cyclic references. For example, I think Delphi uses ref-counting to manage strings, which makes a lot of sense. Many games use ref-counting for tracking resources loaded from disc, and that also works well. In both of those cases, there's nothing to trace through, and the overhead of updating refcounts is fairly low.


There's a classic paper that shows that refcounting and GC are in fact duals of each other, where GC is effectively tracing live objects and refcounting is tracing dead objects:

https://dl.acm.org/citation.cfm?id=1028982

This has a number of important corollaries. One is that they have opposite performance characteristics: GCs make creating new references fast but then create a large pause when RAM is exhausted and a collection happens, while refcounts make creating new references slow and create a large pause when large object graphs are freed all at once. Another is that nearly all high-performance memory management systems are hybrids of the two: lazy refcounting is basically like implementing a local garbage collector and then feeding the live set of that into the refcounts, while generational garbage collection is like implementing a local refcount (the write barrier) and feeding the results of that into the root set.


> There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception and they would switch to tracing if they could. They can't because they exposed deterministic finalization to the user which means now their GC strategy is a language "feature" that they're stuck with.

Reference counting and address semantics (which prohibit moving objects, so even if you work around the refcounts you can only do a non-compacting tracing GC) are deeply ingrained into CPython's C API as well, which is very widely used.


For what it's worth, Perl (5) has the same problem as CPython. Accordingly, there are not only heaps of allocation- and ref-counting-avoidance optimizations (e.g. the stack isn't refcounted) but also heaps of bugs from those optimizations. Again, the (main) stack not owning references is the biggest source of these bugs.

The property of predictable destruction is what we typically called "timely destruction", though "immediate destruction" would have been more accurate. If your language guarantees it, it tends to be great for automatically releasing resources (memory, file handles, ...) in a lexically predictable manner. Softening any such guarantee tends to make entertaining resource races magically appear in the real world. It occurs to me now that I never checked what PyPy guarantees, if anything.
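To illustrate what that guarantee buys you, here's a rough Swift sketch (ARC gives the same lexically predictable teardown; `LogFile` is a made-up type): the file is closed exactly where the last reference goes out of scope, not at some later collection.

    import Foundation

    final class LogFile {
        private let handle: FileHandle
        init?(path: String) {
            guard let h = try? FileHandle(forWritingTo: URL(fileURLWithPath: path)) else { return nil }
            handle = h
        }
        func write(_ line: String) {
            handle.write(Data((line + "\n").utf8))
        }
        deinit {
            // Runs the moment the last reference disappears:
            // lexically predictable, no collector pause involved.
            try? handle.close()
        }
    }

    do {
        let log = LogFile(path: "/tmp/example.log")
        log?.write("hello")
    }   // `log` is released -- and the file closed -- right here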

We (Perl) never really came up with a great strategy for working around this (you tended to get the worst of both worlds in most naive reimplementations of the behavior in a GC-based interpreter), and while I certainly haven't tried to work out a clean argument to that effect, I'm fairly convinced it won't fly at all.


Raku (née Perl 6) has phasers that will be called when exiting a scope. One such idiom is:

    {
        # do stuff
        my $dbh = DB.connect(...);
        # do stuff
        LEAVE .disconnect with $dbh;
        # do stuff
    }
Whenever the scope is left (this could be because of an execution error, or a `return`), the code of the LEAVE phaser will be executed. The `with $dbh` checks whether the $dbh variable contains an instantiated object. If so, it topicalizes it (sets $_ to it) and calls the disconnect method (`.disconnect` being short for `$_.disconnect`).
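For comparison, the same scope-exit idiom in Swift (with `defer`, and a made-up `DB` stub purely for illustration) would look like:

    struct DB {
        // Hypothetical stand-in for a real database client.
        static func connect() -> DB { DB() }
        func disconnect() { print("disconnected") }
    }

    func doStuff() {
        // do stuff
        let dbh = DB.connect()
        defer { dbh.disconnect() }   // runs however the scope is left,
                                     // including on errors and early returns
        // do stuff
    }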

For more complex uses, and more tuneable finalizing, there is the `FINALIZER` module in the ecosystem: https://modules.raku.org/dist/FINALIZER


Sure, but "schedule action at scope exit" is an easy (assuming your runtime already knows how to unwind the stacks) problem to solve. "Perform escape analysis and guarantee immediate destruction; efficiently without full gc nor refcounting" is not and they're not the same problem at all.


"There's a reason almost no widely used language implementation relies on ref-counting. CPython is the only real exception"

Perl is also ref-counted. Swift too.

Edit: Also Tcl, and PHP is mostly ref-counted... the "GC" just deals with circular refs.


Just for the record: Raku (née Perl 6) most definitely does not do refcounting. If you want to do serious async programming, refcounting becomes a nightmare to get right and/or a source of potential deadlocks.


Objective-C and then Swift are surely the most serious attempts at fast, widely used RC languages. Neither requires mutating the reference count on every store, and neither seems at risk of growing a tracing GC.

ObjC had some elision tricks like objc_retainAutoreleasedReturnValue, but more importantly the optimizer was taught about RC manipulation. Swift then extended this with a new ABI that minimizes unnecessary mutations.

The big advantages of this scheme are efficient COW collections and a simpler FFI (very important with Swift). More generally RC integrates better. Imagine teaching the JS GC to walk the Java heap!
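A rough sketch of the COW point: Swift's collections (and any hand-rolled COW type) just consult the refcount ARC is already maintaining to decide whether storage is shared, so copies only happen when needed. The names below are made up; the pattern is the one the standard library uses.

    final class Storage {
        var values: [Int]
        init(_ values: [Int]) { self.values = values }
    }

    struct COWBox {
        private var storage: Storage
        init(_ values: [Int]) { storage = Storage(values) }

        var values: [Int] { storage.values }

        mutating func append(_ x: Int) {
            // `isKnownUniquelyReferenced` reads the existing refcount,
            // so "do I need to copy?" is a single check.
            if !isKnownUniquelyReferenced(&storage) {
                storage = Storage(storage.values)   // copy only when shared
            }
            storage.values.append(x)
        }
    }

    let a = COWBox([1, 2, 3])
    var b = a          // cheap: just a retain, no element copy
    b.append(4)        // storage was shared, so the copy happens here
    print(a.values, b.values)   // [1, 2, 3] [1, 2, 3, 4]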


> ObjC had some elision tricks like objc_retainAutoreleasedReturnValue, but more importantly the optimizer was taught about RC manipulation.

Cough. Actually, before ARC the ownership rules really kept RC traffic very low, and you could drive it lower still if you knew a bit about how ownership worked in your code ("these will balance and the object is already owned elsewhere...").

ARC just went mad with trying to give guarantees it really had no business making, and then hoping the compiler would be able to remove them again (difficult with message sending).

This caused some amusing bugs, for example a crash in the following method:

    -someMethod:arg and:arg2
    {
        return 0;
    }
How could this possibly crash? All it could possibly do is clear register AX and return.

Well, with ARC enabled and in debug mode, it inserted code to first retain all the arguments, and then immediately release all the arguments. Apparently one of the args was a bogus pointer (framework-provided, so out of our control) and so it crashed.


So the result of your instructions that have undefined behavior turned out to be not what you wanted? That’s hardly a problem with the language.


Interesting that

    return 0 
is now considered "undefined behavior".

Whatever.


Obviously the problem is with using objects in the message which are invalid. The compiler may or may not work as you expect, and that's your problem, not the compiler's.

You can try to defend yourself by stating the call is done by some framework, but that just means the problem is still not in the language; it's in the framework.


First, the compiler and framework come from the same vendor, so it's perfectly fine to blame them. Unless you consider things like compilers and frameworks to be independent subjects capable of being blamed.

Second, whether the behavior is defined or undefined, the compiler has no business touching arguments that the code didn't tell it to touch. If it does so, it isn't fit for purpose and can be criticised as being unfit for purpose.

And the C standard is very clear and adamant that being in compliance with the standard is not, and in many ways cannot be, equivalent to being fit for purpose.

Last but not least, I think we are probably not going to agree as to whether dereferencing unused arguments is a valid response to the presence of undefined behavior. See

https://blog.metaobject.com/2018/07/a-one-word-change-to-c-s...


So efficient that Swift's RC implementation loses against all major tracing GC implementations.

https://github.com/ixy-languages/ixy-languages


That project doesn’t compare GC implementations, so it’s probably not that useful here.

Also, the Swift implementation is a bit questionable if performance is a goal. That is, why not try to remove the memory management from the inner loop? Probably the first thing to try is value types instead of reference types, which are more generally preferred anyway.
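A sketch of that suggestion, with made-up types: iterating references can generate retain/release traffic on every pass (unless the optimizer elides it), while the value-type version keeps ARC out of the loop entirely.

    // Reference type: loading elements can retain/release each one.
    final class PointRef {
        var x = 0.0, y = 0.0
    }

    // Value type: plain copies in the array's buffer, no refcounts involved.
    struct PointVal {
        var x = 0.0, y = 0.0
    }

    func sumRefs(_ points: [PointRef]) -> Double {
        var total = 0.0
        for p in points {          // ARC traffic can show up here
            total += p.x + p.y
        }
        return total
    }

    func sumVals(_ points: [PointVal]) -> Double {
        var total = 0.0
        for p in points {          // no ARC work in this loop
            total += p.x + p.y
        }
        return total
    }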


Sure it does: memory management impacts the performance of the task being achieved, which is writing a high-speed network driver.


By that measure, every project is a good measure of GC performance. Is that really a good argument?

I believe all general-purpose languages let the code allocate memory, and therefore will let you allocate memory in a way that is inefficient for your task.


The argument is that to eventually write Swift code whose performance matches the competition, one needs to work around ARC's implementation.

Hence the emphasis on value-driven programming alongside protocols in WWDC talks.


My claim was about efficient COW collections and the FFI. The point of COW collections is enabling a simpler programming model.


ARC is just reference counting, as there's nothing in reference counting that requires a runtime to do this for you (that is, just because the compiler inserts increments/decrements doesn't make it an entirely different thing).

The reason reference counting is not more common is because of its performance. Naive reference counting works in some cases (e.g. when the counts are not modified often), but doesn't work too well for most programming languages, as the reference counts will be modified frequently.

Deferred and coalesced reference counting are two techniques of improving performance, but they come at a cost: they are quite complex to implement, and require additional memory to keep track of what/when to increment/decrement. You will also end up facing similar nondeterministic behaviour and pauses, as an optimised RC collector may still need to pause the program to scan for cycles. You can handle cycles by supporting weak and strong references, but this puts the burden on the developer, and it's not always clear if cycles may appear when writing code.

Combining this with a good allocator you can achieve performance that is on par with a tracing collector (http://users.cecs.anu.edu.au/~steveb/pubs/papers/rcix-oopsla...), but it's not easy. If you are going down the path of implementing a complex collector, I think you're better off focusing on a real-time or concurrent collector.


> Also I can't prove it, but I think a major source of pain in garbage collection is mutability, which is exacerbated by our fascination with object-oriented programming. I'd like to see a solid comparison of garbage collection overhead between functional and imperative languages.

I can only speak for Haskell's GC, but it's pretty conventional. You might think that GC is simpler for Haskell without mutability, but you'd be wrong, because (1) Haskell actually gives you plenty of safe and unsafe ways to have mutability, safe ways such as the ST monad (not to be confused with the State monad), and (2) laziness effectively is mutation: evaluating a value overwrites the closure that computes it with the value itself. The lazy list is a pretty common sight in most Haskell code. So basically Haskell has a pretty conventional stop-the-world, generational GC, not unlike those of imperative languages.
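To make point (2) concrete, here's a rough thunk sketched in Swift (purely as neutral pseudocode for what a lazy runtime does): forcing it the first time overwrites the stored computation with its result, which is exactly the kind of mutation the collector has to cope with.

    // A hand-rolled thunk: roughly what lazy evaluation does under the hood.
    final class Thunk<T> {
        private enum State {
            case pending(() -> T)
            case forced(T)
        }
        private var state: State

        init(_ compute: @escaping () -> T) {
            state = .pending(compute)
        }

        func force() -> T {
            switch state {
            case .forced(let v):
                return v
            case .pending(let compute):
                let v = compute()
                state = .forced(v)   // the write the GC has to track
                return v
            }
        }
    }

    let t = Thunk { (1...1_000).reduce(0, +) }
    print(t.force())   // computes and caches
    print(t.force())   // returns the cached value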


The issue here is that while GC does take some time to perform, it's easy to make the assumption that removing the GC will reclaim the performance lost by the GC. This is a fallacy.

We can compare memory strategies like so:

Garbage Collection:

- Allocations: Very fast (often a single instruction)

- Ownership transfer: Free

- Pointer release: Zero

- Post-free: Garbage collection phase (this is where the time is spent)

Reference Counting:

- Allocations: Similar to malloc() (requires at least a binary search)

- Ownership transfer: At least one instruction when single-threaded; multi-threading requires a lock.

- Pointer release: Slow. Locking issues, memory housekeeping.

- Post-free: Zero. Good for realtime.

The first point might require some clarification. When you have a compacting garbage collector, the free memory is usually in a single block. This means that allocations merely require a single pointer to be updated. If the pointer hits the end of the free block, you trigger a GC. You don't even have to check for the end of the free block if the subsequent memory page is marked no-access.
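A sketch of that allocation path (made-up `BumpAllocator`, with an explicit capacity check standing in for the no-access-page trick): the common case is one comparison and one addition, and running out of space is where a compacting collector would trigger a GC.

    // Hypothetical bump allocator, sketching how allocation works when free
    // memory is a single contiguous block, as in a compacting GC.
    final class BumpAllocator {
        private let base: UnsafeMutableRawPointer
        private let capacity: Int
        private var offset = 0

        init(capacity: Int) {
            self.capacity = capacity
            base = UnsafeMutableRawPointer.allocate(byteCount: capacity, alignment: 16)
        }

        deinit { base.deallocate() }

        func allocate(_ size: Int) -> UnsafeMutableRawPointer? {
            let aligned = (size + 15) & ~15
            guard offset + aligned <= capacity else {
                return nil   // a real collector would trigger a GC here
            }
            let p = base + offset
            offset += aligned
            return p
        }
    }

    let heap = BumpAllocator(capacity: 1 << 20)
    let obj = heap.allocate(64)   // one comparison, one add
    print(obj != nil)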

One can spend a lot of time measuring the performance impact of all these different steps, and I am not going to try to prove that GC is always faster than refcounting, but at least it should be clear that it's not a simple matter of assuming that having no GC means that you will have no memory management overhead.


There's another difference: memory usage. For garbage collection, it's at the minimum only after a full GC (and there's a memory/performance tradeoff, in that using more memory allows doing less GC), while for reference counting, it's always at the minimum. Using more memory for the GC means less memory for other things like the disk cache (which caused a performance problem at work once, which we solved by giving the GC less memory to work with).

(Also, "ownership transfer" can be free in reference counting, since the number of references is the same before and after so there's no need to modify the reference count. What isn't free is sharing ownership, since it needs reference count manipulation.)


Indeed. Memory usage is a point I never addressed in my comment. When managing large Java applications, that is indeed one of the most important issues to be aware of. Some runtimes try to be more automatic than others, and there are book-length discussions to be had about that.

As for the second paragraph, thank you for clarifying that. I used poor terminology. I should probably have said pointer sharing, or copying.

For ownership transfer to be zero cost, the compiler has to be clever enough to figure out that the original reference isn't used after the copy. This can be handled by the compiler itself, or be enforced by the language (as is the case with Rust, as far as I understand).


Ref counting uses additional memory to store the refcounts. And possibly more for the suspected set of the cycle collector.

GC uses additional memory to amortize collections over allocations. (Zero overhead means constantly collecting; using 2x the steady-state memory size is not uncommon.)

Overall, I think you're right, refcounting comes out ahead on memory usage. But it's not completely straightforward to determine.


That depends pretty much on how much the programming language allows for value types, stack allocation, static allocation in global memory and off-heap access, even in the presence of a GC.


The reason they are not used is not the circular reference problem, it's performance. Single-threaded ARC is passable, but thread-safe ARC is really, really slow. It basically hits all the worst performance pitfalls of modern CPUs.
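A rough way to see the cost (a sketch, with `NSLock` standing in for the atomic read-modify-write a thread-safe refcount needs; real ARC uses atomics, which are cheaper than a lock but still far from free): even uncontended, the synchronized counter is dramatically slower than the plain one.

    import Foundation

    let iterations = 1_000_000
    var plain = 0
    var synced = 0
    let lock = NSLock()

    // Plain increment: what a non-thread-safe refcount update costs.
    let t0 = Date()
    for _ in 0..<iterations { plain += 1 }
    let t1 = Date()

    // Synchronized increment: a stand-in for thread-safe refcount traffic.
    for _ in 0..<iterations {
        lock.lock()
        synced += 1
        lock.unlock()
    }
    let t2 = Date()

    print("plain:  \(t1.timeIntervalSince(t0))s")
    print("synced: \(t2.timeIntervalSince(t1))s")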


Performance is really bad if you use ARC for practically everything, as with Swift. If you can deal with the simplest (tree-like) allocation patterns statically, as Rust does, and use refcounting only where it's actually needed, you can outperform tracing GC.


> Performance is really bad if you use ARC for practically everything, as with Swift.

Do you have examples where Swift suffers compared to other languages solely because of ARC?

Also, is this something that Apple might theoretically make up for by optimizing their custom CPUs for it, without changing Swift?


> Do you have examples where Swift suffers compared to other languages solely because of ARC?

Literally all of them. Every single program where the ARC system is actually widely used.

Swift otherwise generates really good code, and should be meeting/beating Java most of the time, but if the ARC system is used, performance instantly takes a massive hit. This is very visible in the small synthetic benchmarks at:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

In every benchmark where it was possible to write the program so that no references are updated in the hot loop, Swift comes out really impressive, beating Java. In benchmarks where reference updates are unavoidable, performance is more in line with Python.

> Also, is this something that Apple might theoretically make up for by optimizing their custom CPUs for it, without changing Swift?

There are various theoretical ideas that have been bounced about for decades, but so far none of those has panned out. While Apple might do that too, at this point they seem to be chasing the opposite approach:

https://github.com/apple/swift/blob/master/docs/OwnershipMan...

Essentially, nicking the ownership/lifetimes system out of Rust, to avoid updating references wherever possible. The main difference from Rust will be that the current way of doing things will still be possible and the default, but programmers can opt into more performance by decorating their types with lifetimes.

I am cautiously optimistic for this approach: I love Rust, but learning it did definitely feel like being tossed right into the deep end. Which was filled with sharks. If Swift can succeed in a softer approach of introducing lifetimes, that might help the wider introduction of the concept.


> Every single program where the ARC system is actually widely used.

Interestingly enough, the semi-automatic reference counting mechanism used previously was reasonably fast if used as-is and could be made very fast with a bit of thought.


Yes, here it always comes in last.

https://github.com/ixy-languages/ixy-languages

Since the Xerox PARC workstations and Genera Lisp Machines, all hardware-specific instructions for memory management have proven to be worse than eventually doing it in software.


How does that compare to SwiftNIO?

https://github.com/apple/swift-nio


SwiftNIO is not a high-speed network driver to start with.

As for memory management impact in overall performance, a similar set of benchmarks would be needed.


What is the difference between Clang's ARC implementation and Rust's std::rc::Rc implementation? Rust's ownership system provides the scaffolding for the Drop implementation that manages reference counting, instead of Clang doing it as part of the Swift/ObjC frontend, but otherwise they seem to be very similar.

It's just another tool in Rust's box (no pun intended) but I wouldn't say it hasn't caught on. It's used quite often, along with its multithreaded cousin std::sync::Arc.


One difference is optimization. Rust's Arc is runtime-only, while Clang's ARC is understood by the compiler and optimizer. This allows eliding balanced mutations, and also weird ABIs - functions which return a +1 object, or functions which accept a +1 object. See Arc#Optimization [1].

More generally there's an ABI. While to Rust "Arc is just another tool", reference counting is deeply pervasive on iOS and macOS. You would not attempt to pass a Rust Arc to C, and expect something useful to happen. But Clang's ARC works well with C, and Swift, and with ObjC and even with ObjC-pre-ARC. This isn't a critique of Rust, just reflecting different priorities.

1: https://clang.llvm.org/docs/AutomaticReferenceCounting.html#...


Those compiler optimizations need some work I guess.

https://github.com/ixy-languages/ixy-languages


Personally, I appreciate you posting this link once, but seeing it this many times gets tedious.


Unfortunately it is hard to give an answer where all advocates of "Swift ARC performance" are able to read it, so urban myths keep being spread.

ARC makes sense given Objective-C compatibility, as otherwise a layer similar to RCW on .NET for dealing with COM/UWP would be needed.

And Objective-C only dropped its tracing GC because it was too unstable when mixing GC-enabled frameworks with classical ones, whereas making the compiler automatically call retain/release provided better productivity without being unstable.


I understand the urge to reply in multiple places. I think it would be better to write up some more detailed reasoning, like you did here, and link to this from multiple subthreads. Better than just repeating the same ixy link with a new snide comment each time.


Makes sense.


> too unstable while mixing GC enabled frameworks with classical ones

Actually, you couldn't mix them. The GC simply never really worked properly.


True, however Apple's documentation suggested otherwise, that it was possible "with care".

But as you say, it never worked properly.


I think what you are referring to is "mixed mode" frameworks, which were able to work in both GC and RC environments. You had to do a bit of work to get that.

Which is similar but not quite the same thing. Mixing pure GC and pure RC frameworks was never possible. In fact, I think there were some flags and the (dynamic) linker would balk.


Which urban myth do you see being spread in this thread?


ARC outperforming modern tracing GCs.


Literally nobody here has said that.


I could collect statements about Swift compiler performance, how it removes retain/release calls, or how reference counting is so much better than tracing GC, but yeah, let's leave it here.


Other major users of reference counting are COM and WinRT. A major feature of WinRT projections over just using the raw C APIs is that you get reference counting implemented for you, using whatever your higher-level language uses for object lifetimes.


> and smart pointers never seemed to really catch on outside of Objective-C and Swift:

Smart pointers were a central feature of C++ even before they were added to the C++ standard in 2011.


Indeed, OWL, MFC and VCL have supported them for COM programming since its early days.


I'd agree that mutability is an issue with GC; in fact, some Lisps like Losak are deterministic simply by being entirely functional:

http://losak.sourceforge.net/

It is true that you can get cyclical refs higher up in the language even with functional code though, and at some level you need a cycle checker.

There is another type of managed memory not talked about much, and that's the model used by Composita: RAII, but with defined message interfaces between modules, such that you can deterministically allocate memory for them:

http://concurrency.ch/Content/publications/Blaeser_ETH_Diss_...

It looks to be the best way of doing deterministic, performant memory management, though you'll need a cycle checker in the language runtime.



