
The situation for Go and Java isn't comparable.

Java has expanded into the big data space in recent years and underpins Hadoop, Spark, HBase, Cassandra, etc., where heap sizes can reach 1TB. The previous GC algorithms are fine for smaller heaps (<32GB), but something was needed for bigger ones, hence the introduction of G1GC.

Go is very much just trying to get its foundational GC right.



Is there a perfect GC? Seems it would free memory as soon as it's no longer used.


There was a LISP machine that did that. It embedded the GC into the memory interface where the apps didn't even know it existed. GC activity ran concurrently with memory operations.

Best modern one might be Azul Systems' Vega Machines:

http://www.azulsystems.com/products/vega/overview

The LISP and Vega machines just try to solve the problem at its core. Works pretty well when you do that. The modern systems try to solve these hard problems on architectures, OSes, and apps that inherently suck at them. That's a difficult problem that requires constant attention by very smart people. Whole Ph.D.s' worth of effort have been spent on getting this far with GCs.


There's actually always a duality between:

   - Reference counting/mark and sweep
   - Throughput/latency
   - RAII/GC
A good read: https://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf

   We present a formulation of the two algorithms that shows that
   they are in fact duals of each other. Intuitively, the difference is that
   tracing operates on live objects, or “matter”, while reference counting
   operates on dead objects, or “anti-matter”.


Depends on the circumstances. If you freed memory as soon as it was no longer used, you could see delays when freeing a large object graph.

Also, a lot of performance can be gained by re-using frequently-allocated short-lived objects' memory, rather than freeing them without special treatment.


> If you would free memory as soon as it's no longer used, it could lead to delays when freeing a large object graph.

A GC still has those same delays when freeing a large object graph (assuming that large object graph is in the tenured generation). They're just sometimes incrementalized or done in a background thread. Your malloc implementation could do that too. If this were actually much of a problem, batched deallocation APIs that work on a background thread could be easily added to (or layered on top of!) the popular modern mallocs.

> Also, a lot of performance can be gained be re-using frequently-allocated short-lived objects' memory, rather than freeing them without special treatment.

Your malloc is already doing this, if it's any good. free is usually implemented with a free list.


> A GC still has those same delays when freeing a large object graph (assuming that large object graph is in the tenured generation). They're just sometimes incrementalized or done in a background thread. Your malloc implementation could do that too.

Of course, I was just making an argument against 'immediately deallocating is always the best strategy'. It's clear that malloc()/free() could implement the same behaviour.

> free is usually implemented with a free list.

Sorry, I was a bit vague: I was referring to moving garbage collectors, where one of the motivations was to be more performant with a lot of short-lived objects.


Yeah, but who would ever do that? If you know that the performance of freeing the entire graph is important, then group the entire thing in an arena and free the whole arena in O(1) time.


The biggest problem IMO is that GC is usually used in languages where it gets no free information about variable scope.

GC could actually be really useful in C++ or Rust for postponing the destruction of many small objects and doing it in one big sweep instead.


> GC could actually be really useful in C++ or Rust for postponing the destruction of many small objects and doing it in one big sweep instead.

I've seen this claim before, but I've never understood it. You could add delayed reclamation to your system malloc if it actually helped things: just replace free() with a function that adds to a free list but doesn't actually recycle the memory. That wouldn't require more than a couple of writes and a TLS lookup.

I suspect that the reason why mallocs don't do this is that there's little benefit compared to just recycling the memory right away. Prompt reclamation is actually really nice for cache reasons, and adding memory blocks to a free list is what free's fast path does in the first place.

Now that's not to say that GC wouldn't be useful in C++ or Rust. I think what a GC would be useful for is for lock-free data structures and long-lived structures with dynamic lifetimes, especially ones shared between threads, to eliminate reference counting traffic. But for short-lived objects, if you have any sort of mark phase, you've already lost. Fundamentally, it's really hard to beat a system that precomputes the object lifetimes at compile time—which is what manual memory management is—with a dynamic system that has to compute them at runtime.


> But for short-lived objects, if you have any sort of mark phase, you've already lost.

A GC with compaction gives you bump pointer allocation though. No need to traverse free lists. With malloc you potentially have short-lived and long-lived objects interspersed with each other, creating lots of holes/fragmentation.

> That wouldn't require more than a couple of writes and a TLS lookup.

While you can just null a reference with a single, unfenced write to something that's probably in your L1 cache already. Plus thread-local lists to avoid contention. Complexity grows quickly. It's not exactly a free lunch.


> A GC with compactation gives you bump pointer allocation though. No need to traverse free lists. With malloc you potentially have short-lived and long-lived interspersed with each other, creating lots of holes/fragmentation.

You only get bump pointer allocation in the nursery, not in the tenured generation (unless you want an inefficient tenured generation). In a manually-managed system, short-lived objects usually aren't going to be allocated on the heap at all—they're on the stack. Even for the few that are allocated on the heap, the way free lists work does a great job of keeping them in cache.

Fragmentation really isn't much of a problem anymore with modern mallocs like jemalloc and tcmalloc.

> While you can just null a reference with a single, unfenced write to something that's probably in your L1 cache already. Plus thread-local lists to avoid contention.

I have a hard time believing that the cost of a TLS lookup and a couple of writes per freed object is more expensive than a Cheney scan for objects in the nursery.


> You only get bump pointer allocation in the nursery, not in the tenured generation (unless you want an inefficient tenured generation).

That's not universally true; e.g. V8 can and does allocate pretenured objects in the old generation with bump-pointer allocation.


Surely that can't be done in all circumstances without severe memory usage issues or fragmentation problems though. You have to fill those holes eventually, if not during allocation then during compaction.


Sure, the GC eventually triggers compaction and that by design will produce a lot of contiguous free memory.


> I have a hard time believing that the cost of a TLS lookup and a couple of writes per freed object is more expensive than a Cheney scan for objects in the nursery.

This paper describes how it is possible if you are happy to throw a lot of memory at the problem:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49....

Section 3 is titled "Explicit freeing is more expensive".


As with many late 80s memory management papers, cache effects obsolete the conclusions in this one. In fact, the compiler's ability to coalesce all the stack allocations in a single procedure call into a single pair of stack manipulation instructions makes the conclusions almost meaningless in practice: inlining optimizations and modern CPUs make that cost essentially zero nowadays. I'd almost argue that the methodology in this paper, combined with 2015 assumptions, offers an effective argument against GC. Appel's memory management papers really need to be taken with a grain of salt :)


C++ and Rust don't precompute object lifetimes, they make the developer manage them. The only place I know of object lifetimes being statically analysed and computed is in compilers that do escape analysis (the JVM, Go, maybe others, though Go's escape analysis is very limited).


Do some reference chasing on region inference. There is a branch of research that was spawned by Cyclone, and papers are still being published today about region inference.

E.g. see "Safe and Efficient Hybrid Memory Management for Java."

http://dl.acm.org/citation.cfm?id=2754185&CFID=694525936&CFT...


By "precomputing object lifetimes" I just mean that the object lifetimes are determined at compile time, without runtime scans. Human assistance is still required, of course.


> GC could actually be really useful in C++

Here is a very stable one for C and C++:

http://www.hboehm.info/gc/


It might be stable, but it surely isn't fast, given it is conservative.


Yes, but it is still better than the Go GC. Both are super conservative (i.e. not moving), precise (i.e. pointer overhead), and just mark & sweep (i.e. full heap scans).

But the Boehm-Demers-Weiser GC can parallelize marking, and can save and restore sweep-phase state for lower latencies, i.e. in incremental mode. http://www.hboehm.info/gc/gcdescr.html

Better GCs, mark & compact and copying as used in more mature languages, are still on the horizon for Go (if you go for larger heaps and more speed). I'd put it at just 30% done so far. Go controls its ABI, so it can easily adopt better GCs in the future.


Oops, I mixed up the two phases. Marking is incremental (also in V8), and sweeping is parallelizable.

http://jayconrod.com/posts/55/a-tour-of-v8-garbage-collectio... gives a good overview of good GC techniques, championed in earlier lisps and functional languages.


1TB heap?! In one shard of a massively parallel system?

So you are going to have programs using 1PB of RAM across the cluster?



