Practical Garbage Collection, part 1 – Introduction

Introduction

This is the first part of a series of blog posts I intend to write, whose aim will be to explain how garbage collection works in the real world (in particular, with the JVM). I will cover some theory that I believe is necessary to understand garbage collection enough for practical purposes, but will keep it to a minimum. The motivation is that garbage collection related questions keep coming up in a variety of circumstances, including (for example) on the Cassandra mailing list. The problem when trying to help is that explaining the nuances of garbage collection is too much of an effort to do ad-hoc in a mailing list reply tailored to that specific situation, and you rarely have enough information about the situation to tell someone what their particular problem is caused by.

I hope that this guide will be something I can point to in answering these questions. I hope that it will be detailed enough to be useful, yet easy to digest and non-academic enough for a broad audience.

I very much appreciate any feedback on what I need to clarify, improve, rip out completely, etc.

Much of the information here is not specific to Java. However, in order to avoid constantly invoking generic and abstract terminology, I am going to speak in concrete terms of the Hotspot JVM wherever possible.

Why should anyone have to care about the garbage collector?

That is a good question. The perfect garbage collector would do its job without a human ever noticing that it exists. Unfortunately, there exists no known perfect (whatever perfection means) garbage collection algorithm. Further, the selection of garbage collectors practically available to most people is additionally limited to a subset of garbage collection algorithms that are in fact implemented. (Similarly, malloc is not perfect either and has its issues, with multiple implementations available with different characteristics. However, this post is not trying to contrast automatic and explicit memory management, although that is an interesting topic.)

The reality is that, as with many technical problems, there are some trade-offs involved. As a rule of thumb, if you’re using the freely available Hotspot based JVM:s (Oracle/Sun, OpenJDK), you mostly notice the garbage collector if you care about latency. If you do not, chances are the garbage collector will not be a bother – other than possibly to select a maximum heap size different from the default.

By latency, in the context of garbage collection, I mean pause times. The garbage collector needs to pause the application sometimes in order to do some of its work; this is often referred to as a stop-the-world pause (the “world” being the observable universe from the perspective of the Java application, or mutator in GC speak, because it is mutating the heap while the garbage collector is trying to collect it). It is important to note that while all practically available garbage collectors impose stop-the-world pauses on the application, the frequency and duration of these pauses vary greatly with the choice of garbage collector, garbage collector settings, and application behavior.

As we shall see, there exist garbage collection algorithms that attempt to avoid the need to ever collect the entire heap in a stop-the-world pause. The reason this is an important property is that if at any point (even if infrequently) you stop the application for a complete collection of the heap, the pause times suffered by the application scale proportionally to the heap size. This is typically the main thing you want to avoid when you care about latency. There are other concerns as well, but this is usually the big one.

Tracing vs. reference counting

You may have heard of reference counting being used (for example, CPython uses a reference counting scheme for most of its garbage collection work). I am not going to talk much about it because it is not relevant to JVM:s, except to say two things:

  • One property that reference counting garbage collection has is that an object will be known to be unreachable immediately at the point where the last reference is removed.
  • Reference counting will not detect cyclic data structures as unreachable, and has some other problems that cause it to not be the end-all be-all garbage collection choice.
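
To make these two properties concrete, here is a minimal sketch of a toy reference counting scheme. It is only an illustration under my own assumptions (the class and method names are made up, and this is not how CPython or any real runtime implements it): an acyclic chain is freed the moment the last outside reference goes away, while a two-object cycle keeps itself alive.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of reference counting. All names are hypothetical.
class RefCountDemo {
    static final class Obj {
        final String name;
        int refCount = 0;
        boolean freed = false;
        final List<Obj> fields = new ArrayList<>();
        Obj(String name) { this.name = name; }
    }

    // Adding a reference increments the target's count.
    static void addRef(Obj target) { target.refCount++; }

    // Store a reference to 'to' in one of the fields of 'from'.
    static void link(Obj from, Obj to) { from.fields.add(to); addRef(to); }

    // Removing a reference decrements the count. At zero the object is freed
    // immediately, and the references it holds are released in turn.
    static void release(Obj target) {
        target.refCount--;
        if (target.refCount == 0 && !target.freed) {
            target.freed = true;
            for (Obj child : target.fields) release(child);
            target.fields.clear();
        }
    }

    public static void main(String[] args) {
        // Acyclic case: local variable -> a -> b
        Obj a = new Obj("a"), b = new Obj("b");
        addRef(a);      // the reference held by the local variable
        link(a, b);
        release(a);     // last outside reference removed: a and b are freed at once
        System.out.println("a freed: " + a.freed + ", b freed: " + b.freed);

        // Cyclic case: local variable -> c, and c <-> d
        Obj c = new Obj("c"), d = new Obj("d");
        addRef(c);
        link(c, d);
        link(d, c);
        release(c);     // c is still referenced by d, so neither count reaches zero
        System.out.println("c freed: " + c.freed + ", d freed: " + d.freed);
    }
}
```

In the cyclic case nothing outside the cycle can reach c or d any more, yet their counts never drop to zero. That is exactly the garbage a tracing collector finds and a pure reference counting scheme misses.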

The JVM instead uses what is known as a tracing garbage collector. It is called tracing because, at least at an abstract level, the process of identifying garbage involves taking the root set (things like your local variables on your stack or global variables) and tracing a path from those objects to all objects that are directly or indirectly reachable from said root set. Once all reachable (live) objects have been identified, the objects eligible for being freed by the garbage collector have been identified by a process of elimination.

Basic stop-the-world, mark, sweep, resume

A very simple tracing garbage collector works using the following process:

  1. Pause the application completely.
  2. Mark all objects that are reachable (from the root set, see above) by tracing the object graph (i.e., following references recursively).
  3. Free all objects that were not reachable.
  4. Resume the application.
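
The four steps above can be sketched in a few lines. This is a toy model under my own assumptions, not JVM code: the "heap" is just a list of nodes, the names are hypothetical, and I use an explicit work list instead of literal recursion to avoid deep call stacks. Pause and resume are implicit because nothing else is running.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy mark-and-sweep over an explicit object graph. All names are hypothetical.
class MarkSweepDemo {
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Step 2: mark everything reachable from the root set.
    static Set<Node> mark(List<Node> roots) {
        Set<Node> marked = new HashSet<>();
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (marked.add(n)) {          // first visit: follow its references
                pending.addAll(n.refs);
            }
        }
        return marked;
    }

    // Step 3: free (here: drop from the heap list) everything that was not marked.
    static int sweep(List<Node> heap, Set<Node> marked) {
        int before = heap.size();
        heap.removeIf(n -> !marked.contains(n));
        return before - heap.size();
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);      // a -> b, reachable from the root
        c.refs.add(d);      // c <-> d, a cycle that nothing else points to
        d.refs.add(c);
        List<Node> heap = new ArrayList<>(List.of(a, b, c, d));
        List<Node> roots = List.of(a);

        int freed = sweep(heap, mark(roots));
        System.out.println("freed " + freed + " objects, " + heap.size() + " remain");
    }
}
```

Note that the unreachable cycle (c and d) is collected without any special handling, since nothing in the root set leads to it.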

In a single-threaded world, this is pretty easy to imagine: The call that is responsible for allocating a new object will either return the new object immediately, or, if the heap is full, initiate the above process to free up space, followed by completing the allocation and returning the object.

None of the JVM garbage collectors work like this. However, it is good to understand this basic form of a garbage collector, as the available garbage collectors are essentially optimizations of the above process.

The two main reasons why the JVM does not implement garbage collection like this are:

  • Every single garbage collection pause will be long enough to collect the entire heap; in other words, it has very poor latency.
  • For almost all real-world applications, it is by far not the most efficient way to perform garbage collection (it has a high CPU overhead).

Compacting vs. non-compacting garbage collection

An important distinction between garbage collectors is whether or not they are compacting. Compacting refers to moving objects around (in memory) so as to collect them in one dense region of memory, instead of being spread out sparsely over a larger region.

Real-world analogy: Consider a room full of things on the floor in random places. Taking all these things and stuffing them tightly in a corner is essentially compacting them, freeing up floor space. Another way to remember what compaction is, is to envision one of those machines that take something like a car and compact it together into a block of metal, thus taking less space than the original car by eliminating all the space occupied by air (but as someone has pointed out, while the car is destroyed, objects on the heap are not!).

By contrast a non-compacting collector never moves objects around. Once an object has been allocated in a particular location in memory, it remains there forever or until it is freed.

There are some interesting properties of both:

  • The cost of performing a compacting collection is a function of the amount of live data on the heap. If only 1% of data is live, only 1% of data needs to be compacted (copied in memory).
  • By contrast, in a non-compacting collector objects that are no longer reachable still imply book keeping overhead as their memory locations must be kept track of as being freed, to be used in future allocations.
  • In a compacting collector, allocation is usually done via a bump-the-pointer approach. You have some region of space, and maintain your current allocation pointer. If you allocate an object of n bytes, you simply bump that pointer by n (I am eliding complications like multi-threading and the optimizations that implies).
  • In a non-compacting collector, allocation involves finding where to allocate using some mechanism that is dependent on the exact mechanism used to track the availability of free memory. In order to satisfy an allocation of n bytes, a contiguous region of n bytes free space must be found. If one cannot be found (because the heap is fragmented, meaning it consists of a mixed bag of free and allocated space), the allocation will fail.
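
The difference between the two allocation styles can be sketched as follows. This is a toy model with hypothetical names: the heap is a handful of "cells", the non-compacting allocator uses a simple first-fit search (real collectors use more elaborate free lists), and threading is ignored entirely.

```java
import java.util.Arrays;

// Toy allocators over a tiny heap measured in cells. All names are hypothetical.
class AllocDemo {
    // Bump-the-pointer: free space is one contiguous region starting at 'top'.
    static final class BumpAllocator {
        private final int capacity;
        private int top = 0;
        BumpAllocator(int capacity) { this.capacity = capacity; }

        // Returns the start offset of the new object, or -1 if out of space.
        int allocate(int n) {
            if (top + n > capacity) return -1;
            int start = top;
            top += n;       // the whole allocation is just this addition
            return start;
        }
    }

    // Non-compacting: search for the first run of n contiguous free cells (n >= 1).
    static final class FirstFitAllocator {
        private final boolean[] used;
        FirstFitAllocator(int capacity) { used = new boolean[capacity]; }

        int allocate(int n) {
            int run = 0;
            for (int i = 0; i < used.length; i++) {
                run = used[i] ? 0 : run + 1;
                if (run == n) {
                    int start = i - n + 1;
                    Arrays.fill(used, start, i + 1, true);
                    return start;
                }
            }
            return -1;      // no contiguous run of n cells: fragmentation
        }

        void free(int start, int n) { Arrays.fill(used, start, start + n, false); }
    }

    public static void main(String[] args) {
        FirstFitAllocator heap = new FirstFitAllocator(8);
        int first = heap.allocate(2);
        heap.allocate(2);
        int third = heap.allocate(2);
        heap.allocate(2);
        heap.free(first, 2);
        heap.free(third, 2);
        // Four cells are free in total, but in two separate holes of two cells each.
        System.out.println("first-fit allocate(3): " + heap.allocate(3));

        BumpAllocator bump = new BumpAllocator(8);
        bump.allocate(2);
        System.out.println("bump allocate(3): " + bump.allocate(3));
    }
}
```

The first-fit allocation of three cells fails even though half the heap is free, which is the sofa problem from the analogy below in miniature.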

Real-world analogy: Consider your room again. Suppose you are a compacting collector. You can move things around on the floor freely at your leisure. When you need to make room for that big sofa in the middle of the floor, you move other things around to free up an appropriately sized chunk of space for the sofa. On the other hand, if you are a non-compacting collector, everything on the floor is nailed to it, and cannot be moved. A large sofa might not fit, despite the fact that you have plenty of floor space available – there is just no single space large enough to fit the sofa.

Generational garbage collection

Most real-world applications tend to perform a lot of allocation of short-lived objects (in other words, objects that are allocated, used for a brief period, and then no longer referenced). A generational garbage collector attempts to exploit this observation in order to be more CPU efficient (in other words, have higher throughput). (More formally, the hypothesis that most applications have this behavior is known as the weak generational hypothesis.)

It is called “generational” because objects are divided up into generations. The details will vary between collectors, but a reasonable approximation at this point is to say that objects are divided into two generations:

  • The young generation is where objects are initially allocated. In other words, all objects start off being in the young generation.
  • The old generation is where objects “graduate” to when they have spent some time in the young generation.

The reason why generational collectors are typically more efficient, is that they collect the young generation separately from the old generation. Typical behavior of an application in steady state doing allocation, is frequent short pauses as the young generation is being collected – punctuated by infrequent but longer pauses as the old generation fills up and triggers a full collection of the entire heap (old and new). If you look at a heap usage graph of a typical application, it will look similar to this:

Typical sawtooth behavior of heap usage with the throughput collector

The ongoing sawtooth look is a result of young generation garbage collections. The large dip towards the end is when the old generation became full and the JVM did a complete collection of the entire heap. The amount of heap usage at the end of that dip is a reasonable approximation of the actual live set at that point in time. (Note: This is a graph from running a stress test against a Cassandra instance configured to use the default JVM throughput collector; it does not reflect out-of-the-box behavior of Cassandra.)

Note that simply picking the “current heap usage” at an arbitrary point in time on that graph will not give you an idea of the memory usage of the application. I cannot stress that point enough. What is typically considered the memory “usage” is the live set, not the heap usage at any particular time. The heap usage is much more a function of the implementation details of the garbage collector; the only effect on heap usage from the memory usage of the application is that it provides a lower bound on the heap usage.
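
A rough way to see the difference for yourself is to sample heap usage before and after a collection. The sketch below is only an approximation under some assumptions: the class name is made up, System.gc() is merely a request that the JVM is free to ignore, and the numbers will vary between runs, collectors and JVM versions.

```java
// Rough illustration of "heap usage" versus something closer to the live set.
// The class name is hypothetical; results vary from run to run.
class HeapUsageDemo {
    // Heap currently occupied, live or not, as seen by the Runtime API.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Allocate about 64 MB of objects that become garbage right away.
        long checksum = 0;
        for (int i = 0; i < 64; i++) {
            byte[] chunk = new byte[1024 * 1024];
            chunk[0] = (byte) i;
            checksum += chunk[0];
        }
        long afterAllocation = usedHeapBytes();

        System.gc();    // only a request; the JVM may ignore it
        long afterGc = usedHeapBytes();

        System.out.println("used before allocation: " + before / 1024 + " KB");
        System.out.println("used after allocation:  " + afterAllocation / 1024 + " KB");
        System.out.println("used after GC request:  " + afterGc / 1024 + " KB");
        System.out.println("(checksum " + checksum + ")");
    }
}
```

If the collection actually runs, the last number is the closest of the three to the live set. The middle number mostly tells you how long ago the last collection happened.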

Now, back to why generational collectors are typically more efficient.

Suppose our hypothetical application is such that 90% of all objects die young; in other words, they never survive long enough to be promoted to the old generation. Further, suppose that our collection of the young generation is compacting (see previous sections) in nature. The cost of collecting the young generation is now roughly that of tracing and copying 10% of the objects it contains. The cost associated with the remaining 90% was quite small. Collection of the young generation happens when it becomes full, and is a stop-the-world pause.

The 10% of objects that survived may be promoted to the old generation immediately, or they may survive for another round or two in young generation (depending on various factors). The important overall behavior to understand however, is that objects start off in the young generation, and are promoted to the old generation as a result of surviving in the young generation.

(Astute readers may have noticed that collecting the young generation completely separately is not possible – what if an object in the old generation has a reference to an object in the new generation? This is indeed something a garbage collector must deal with; a future post will talk about this.)

The optimization is quite dependent on the size of the young generation. If the size is too large, it may be so large that the pause times associated with collecting it are a noticeable problem. If the size is too small, it may be that even objects that die young do not die quite quickly enough to still be in the young generation when they die. Recall that the young generation is collected when it becomes full; this means that the smaller it is, the more often it will be collected. Further recall that when objects survive the young generation, they get promoted to the old generation. If most objects, despite dying young, never have a chance to die in the young generation because it is too small – they will get promoted to the old generation and the optimization that the generational garbage collector is trying to make will fail, and you will take the full cost of collecting the object later on in the old generation (plus the up-front cost of having copied it from the young generation).
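
The effect of young generation size on promotion can be illustrated with a small simulation. This is a deliberately crude model built on my own assumptions, not a description of Hotspot: time is measured in allocations, every object is one unit in size, short-lived objects die a fixed number of allocations after they are created, every N-th object lives forever, and every survivor of a young collection is promoted immediately. The names are hypothetical.

```java
// Crude simulation of promotion out of the young generation. All names and
// the object lifetime model are assumptions made for illustration.
class YoungGenDemo {
    // Returns how many objects get promoted to the old generation.
    static long promotedObjects(int youngCapacity, int totalAllocations,
                                int shortLifetime, int longLivedEvery) {
        long promoted = 0;
        int youngStart = 0;   // index of the oldest object still in the young generation
        for (int now = 1; now <= totalAllocations; now++) {
            // The object with index now-1 has just been allocated.
            if (now - youngStart == youngCapacity) {
                // Young generation is full: collect it and promote the survivors.
                for (int obj = youngStart; obj < now; obj++) {
                    boolean longLived = obj % longLivedEvery == 0;
                    boolean stillAlive = longLived || obj + shortLifetime > now;
                    if (stillAlive) promoted++;
                }
                youngStart = now;
            }
        }
        return promoted;
    }

    public static void main(String[] args) {
        int allocations = 100_000;
        int shortLifetime = 50;     // short-lived objects die after 50 more allocations
        int longLivedEvery = 100;   // 1% of objects live forever

        long small = promotedObjects(20, allocations, shortLifetime, longLivedEvery);
        long large = promotedObjects(10_000, allocations, shortLifetime, longLivedEvery);
        System.out.println("young gen of 20 objects:     promoted " + small);
        System.out.println("young gen of 10000 objects:  promoted " + large);
    }
}
```

With the tiny young generation, every object is still alive when its collection happens, so everything gets promoted even though 99% of the objects die young. With the larger one, only the truly long-lived objects and the handful allocated just before each collection are promoted.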

Parallel collection

The point of having a generational collector is to optimize for throughput; in other words, the total amount of work the application gets to do in a particular amount of time. As a side-effect, most of the pauses incurred due to garbage collection also become shorter. However, no attempt is made to eliminate the periodic full collections which will imply a pause time of whatever is necessary to complete a full collection.

The throughput collector does do one thing which is worth mentioning in order to mitigate this: It is parallel, meaning it uses multiple CPU cores simultaneously to speed up garbage collection. This does lead to shorter pause times, but there is a limit to how far you can go – even in an unrealistic perfect situation of a linear speed-up (meaning, double CPU count -> half collection time) you are limited by the number of CPU cores on your system. If you are collecting a 30 GB heap, that is going to take some significant time even if you do so with 16 parallel threads.

In garbage collection parlance, the word parallel is used to refer to a collector that does work on multiple CPU cores at the same time.

Incremental collection

Incremental in a garbage collection context refers to dividing up the work that needs to be done into smaller chunks, often with the aim of pausing the application for multiple brief periods instead of a single long pause. The behavior of the generational collector described above is partially incremental in the sense that the young generation collections constitute incremental work. However, as a whole, the collection process is not incremental because of the full heap collections incurred when the old generation becomes full.

Other forms of incremental collections are possible; for example, a collector can do a tiny bit of garbage collection work for every allocation performed by the application. The concept is not tied to a particular implementation strategy.

Concurrent collection

Concurrent in a garbage collection context refers to performing garbage collection work concurrently with the application (mutator). For example, on an 8 core system, a garbage collector might keep two background threads that do garbage collection work while the application is running. This allows significant amounts of work to be done without incurring an application pause, usually at some cost of throughput and implementation complexity (for the garbage collector implementor).

Available Hotspot garbage collectors

The default choice of garbage collector in Hotspot is the throughput collector, which is a generational, parallel, compacting collector. It is entirely optimized for throughput, that is, the total amount of work achieved by the application in a given time period.

The traditional alternative for situations where latency/pause times are a concern, is the CMS collector. CMS stands for Concurrent Mark & Sweep and refers to the mechanism used by the collector. The purpose of the collector is to minimize or even eliminate long stop-the-world pauses, limiting garbage collection work to shorter stop-the-world (often parallel) pauses, in combination with longer work performed concurrently with the application. An important property of the CMS collector is that it is not compacting, and thus suffers from fragmentation concerns (more on this in a later blog post).

As of later versions of JDK 1.6 and JDK 1.7, there is a new garbage collector available which is called G1 (which stands for Garbage First). Its aim, like that of the CMS collector, is to try to mitigate or eliminate the need for long stop-the-world pauses, and it does most of its work in parallel in short stop-the-world incremental pauses, with some work also being done concurrently with the application. Contrary to CMS, G1 is a compacting collector and does not suffer from fragmentation concerns – but has other trade-offs instead (again, more on this in a later blog post).

Observing garbage collector behavior

I encourage readers to experiment with the behavior of the garbage collector. Use jconsole (comes with the JDK) or VisualVM (which produced the graph earlier on in this post) to visualize behavior on a running JVM. But, in particular, start getting familiar with garbage collection log output, by running your JVM with (updated with jbellis’ feedback – thanks!):

  • -XX:+PrintGC
  • -XX:+PrintGCDetails
  • -XX:+PrintGCDateStamps
  • -XX:+PrintGCApplicationStoppedTime
  • -XX:+PrintPromotionFailure

Also useful but verbose (meaning explained in later posts):

  • -XX:+PrintHeapAtGC
  • -XX:+PrintTenuringDistribution
  • -XX:PrintFLSStatistics=1

The output is pretty easy to read for the throughput collector. For CMS and G1, the output is more opaque to analysis without an introduction. I hope to cover this in a later update.

In the meantime, the take-away is that those options above are probably the first things you want to use whenever you suspect that you have a GC related problem. It is almost always the first thing I tell people when they start to hypothesize GC issues: have you looked at GC logs? If you have not, you are probably wasting your time speculating about GC.
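
As a complement to the logs (not a replacement for them), the JVM also exposes basic collector statistics through the standard java.lang.management API, which is where tools like jconsole get their numbers. The sketch below simply prints what the running JVM reports. The class name is my own, and which collector names show up depends on the JVM version and the collector selected.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Prints per-collector statistics for the current JVM. The class name is hypothetical.
class GcStatsDemo {
    // Sum of collection counts over all collectors that report one.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();   // -1 if the count is undefined
            if (count > 0) total += count;
        }
        return total;
    }

    public static void main(String[] args) {
        // Generate some garbage so that there is something to report.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            int[] scratch = new int[256];
            scratch[0] = i;
            checksum += scratch[0];
        }

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", accumulated time=" + gc.getCollectionTime() + " ms");
        }
        System.out.println("total collections: " + totalCollections()
                + " (checksum " + checksum + ")");
    }
}
```

These counters only give totals. The logs remain the place to look for the timing and cause of individual pauses.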

Conclusion

I have tried to produce a crash-course introduction that I hope was enlightening in and of itself, but is primarily intended as background for later posts. I welcome any feedback, particularly if things are unclear or if I am making too many assumptions. I want this series to be approachable by a broad audience as I said in the beginning, though I certainly do assume some level of expertise. But intimate garbage collection knowledge should not be required. If it is, I have failed – please let me know.


Solving the EULA problem

Everyone recognizes the situation. Some piece of software has released an update, or you are installing it for the first time, and you’re asked to accept a new EULA. The EULA is typically 5-15 pages of dense legal text. Almost everyone just tries their best to find the “accept” button as quickly as possible, sometimes giving a sigh as they realize they first have to scroll down before they can hit the accept button.

I think there is something fundamentally flawed here. (In reality this extends to more than just EULA:s and includes e.g. contracts signed for a cell phone subscription, and many other things – essentially most legal contracts intended for consumers.)

The problem

As the title indicates, I intend to suggest a solution. But first I need to be clear about what I am solving, and make a case for it being a problem to begin with. I will assume the following premises as true (and I think most readers will agree):

  • A very tiny portion of end-users actually read the EULA.
  • If a user does decide to read (not sift) it, it will take, on average, at least 10 minutes for an average-length EULA to be properly understood.
  • Most EULA:s tend to contain a lot of standard boilerplate that may not be exactly identical to other EULA:s, but is pretty much the same. Some key variation exists depending on the use-case and the desires of the author of the EULA.
  • To the extent that users at large notice the contents of EULA:s, it is typically because a change or a particular part of the EULA is brought to public attention by a blog post, news article or the like.

Given that the vast majority of users never read the EULA, it essentially still comes down to “do you trust the maker of this software to do something sensible?”. The details of the EULA are mostly irrelevant to the user. The user wants to use the software, and either explicitly or implicitly chooses to trust the maker of said software to some extent. If the software does something the user does not like (such as publicly posting information the user thought was private) the user will become angry regardless of whether the EULA happened to have a clause that directly or indirectly allowed it. The effects (whether positive or negative) are there regardless of the EULA. The individual user has very little chance to do anything about it, regardless of the EULA. A company that changes its EULA will probably do so in response to overall public pressure for PR purposes or for legal/political reasons.

Further, let us take the theoretical legal stance and just say that it is up to a user to read and accept the EULA and that’s the end of discussion. There are at least two huge problems with this:

  • If a piece of software with 100 million users gets updated and prompts for EULA acceptance, given the above premises that means that 1902 human-years are wasted on reading the EULA (100000000*10/60/24/365). Let me repeat that. A couple of thousand human years would be spent reading the EULA. Can you think of something better to do with two thousand human years? Note that no attempt is typically made to optimize this process; for example by presenting a nicely annotated view of the EULA which shows which sections have been changed relative to the last version that was accepted.
  • The argument assumes the user can realistically be expected to opt out of using the application because a part of the EULA is unclear, slightly too allowing or restrictive, etc. Most of the time, the user really isn’t interested in precise interpretation of legal paragraphs and they just want to use the software/service and have it behave decently.
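
For anyone who wants to check the arithmetic in the first point, or plug in their own numbers, here is the same calculation as a small sketch. The class and method names are mine; the inputs are the premises stated above (100 million users, 10 minutes each).

```java
// Back-of-the-envelope cost of everyone reading an EULA. Names are hypothetical.
class EulaCost {
    // Total reading time in human-years: minutes -> hours -> days -> years.
    static double humanYears(long users, double minutesPerUser) {
        double totalMinutes = users * minutesPerUser;
        return totalMinutes / 60 / 24 / 365;
    }

    public static void main(String[] args) {
        double years = humanYears(100_000_000L, 10);
        System.out.println("human-years spent reading: " + (long) Math.floor(years));
    }
}
```

Rounded down, that is the 1902 human-years quoted above. Halving the reading time to 5 minutes still leaves roughly 950.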

But, despite this, EULA:s still exist. I am not a lawyer, but as far as I can tell, this is because they actually do matter legally. In the rare case that someone does bring suit, the fine print matters. Despite the fact that the original users affected by the alleged violation of the EULA might not ever have read the EULA to begin with. When the fight turns legal, the legal text is the weapon and the battleground. It’s as if we have two different worlds – the real world, and the legal world. And there are some doors in between them. In the real world, we have to deal with EULA:s that are completely useless to us in the real world (and as an end-user one exists in the real world only), because the company behind the service or software exists in both worlds and they must protect themselves in the legal world by subjecting the real world to certain side-effects.

Rejecting non-solutions

I find that there is an overly adversarial view going on whenever EULA:s are discussed. There are plenty of people complaining about companies not caring about privacy and having EULA:s that allow all sorts of things.

Well, I think that is because just like the end-user lives in the real world and not the legal world, the same is true for the developers, marketers and monetizers on the company end. They want to do what they want to do, while maintaining some decent behavior, without legal issues getting in the way. In order to get things done, you thus want an EULA that opens up as much as possible to be flexible so as to not get into a huge legal issue as soon as some tiny change needs to be made.

Pretend you’re writing an EULA for something like iTunes or Facebook. You could go into excruciating detail about exactly how each piece of information is allowed to be used. For example, in Facebook’s case, they could specify exactly how information is presented, exactly to whom, and under what circumstances.

But what happens when you make a minor tweak to the UI? Or you introduce new privacy features? Even if they are intended to, and legitimately try to, improve privacy, now suddenly an entire EULA needs to be very carefully re-written to match. As a developer, would you want this constantly impeding your progress? As an end-user, would you really want to read new EULA:s every time some little detail is changed?

This is why I think it makes total sense that EULA:s often seem to be very much geared towards letting the company do whatever they want with you. I don’t think this is evilness (usually); I think it’s about being practical in a real world hampered by spill-over from the legal world.

Another potential solution is legislation: making it illegal to use information in certain ways, or for software to behave in certain ways. I will immediately reject this notion as unrealistic because no law will ever be flexible enough to allow true innovation and freedom while somehow preventing companies from “doing evil”. We have a public that expects everything to be free (in part I suspect because there is no usable micro payment system, but that’s another blog entry), yet any attempt to monetize content or statistics is decried as evil. Let’s say such laws were enacted a few years ago, and that these laws said that software installed on a computer must never send personal user information over the network to servers operated by a third party or the company providing the software. That would probably have made sense to many people at the time, but would make a huge part of today’s software legally impossible. Before the buzzword of “cloud computing” existed, people didn’t really get it. There are plenty of reasons to want your data available centrally and online instead of tying it to a specific physical computer. This was always the case, but before the advent of the cloud computing buzz, it was not commonly realized.

No, we need to have a free system that can grow organically and where innovation is possible and encouraged. We cannot legislate away this problem.

A quick look at open source licenses

Most open source software today is distributed under one of the well-known open source licenses. I want to mention some things about the situation, in part because some readers may not be familiar with it, and in part because even for readers that are, I want to highlight certain specific properties.

There is plenty of room for debate concerning which open source license is the better choice for a particular project. But the unique aspect of an open source project that adopts one of them, is that it does not invent its own license. Someone inquiring about the license of a piece of software may receive the response “the MIT license”, or “the Apache license”. While there is some potential for confusion (for example, if you say “The BSD license” are you talking about the 2-clause version or the 3-clause version?), in general the name of the license is enough for the person asking. There is no need to read through the license in detail, because you already know what the license entails by name.

From the point of view of the software developer, adopting a well-known established license is an easy way to get going with actually writing the software, which is often what you care about, without getting a lawyer to draft you a license. (This does not mean that all open source software projects get away without legal assistance, particularly if there are multiple contributors, but it does make the barrier of releasing something much smaller.)

For many end-users, this is probably not a very big deal. Many people probably only care about the fact that it is free-as-in-beer and generally open, rather than each particular detail. But even though it may not be perceived as a big deal, it does allow the individual end-user to make a true decision to accept the license with reasonable knowledge of what it entails, without constantly reading dense legal text.

More importantly however, look at the typical company. Companies typically have to be very careful in what software they use internally (or as part of released products), mainly for legal reasons. A big company has a lot at stake, and a foul-up with incorrect use of a piece of software can be extremely costly. The difference between the company and the typical private individual, is that the company has the motivation to actually treat the software license the way it is really supposed to be treated; they cannot do the “scroll down and accept” dance that the average person does when he or she gets an iTunes update. And that causes practical problems. If you want to use some piece of software with an arbitrary license, you are likely to have to clear it through legal. That means costs (for legal services), and inconvenience (developers having to go through a bureaucratic process to get to use software), and ultimately less use of the software if the software has a problematic license.

But a company can easily have a policy that anything under a certain set of licenses is always okay to use. Typically licenses like the 2-clause BSD license, the MIT license, the Apache license etc. fall in that category (sometimes the GPL).

Solving the EULA problem by generalizing the named license model

My proposed solution is the following.

  1. Establish an authoritative directory of contract clauses where each clause has a globally unique identifier (similarly to how you can link to a license at the Open Source Initiative).
  2. Establish a format and software to express combinations of these clauses.
  3. When constructing a contract or EULA, pick a combination of clauses that exist in this directory.
  4. If needed, add additional clauses – but then also add the clause to said authoritative directory.

Imagine the ecosystem that can be built on top of this. Imagine a database of clauses, which over time are debated and discussed and even tested in court. There can be an accumulation of interpretation and evidence of legal effectiveness and implications for the user. An end-user can decide that a certain set of clauses are acceptable, while others are not.

In this world, whenever I, as a user, am prompted with a new license agreement for iTunes (or whatever), I would (thanks to the above points) see on my screen (presented by my license handling software, not trusting iTunes):

  • Any clauses I have not previously approved as generally acceptable, very obviously highlighted, with direct links to evaluation, discussion and legal precedent (if any) for each clause, to give me the best possible chance to evaluate whether I want to accept it.
  • Any notes I have associated with certain clauses (e.g., maybe I have flagged some clauses as dubious and I want to see when a contract contains them).
  • If I so desire, the complete list of clauses if I really want to look at it.
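
To show that the user-side logic really is this simple once clauses have stable identifiers, here is a minimal sketch of what such license handling software might compute. Everything in it is hypothetical: the identifier format, the class and method names, and the idea that a contract is just an ordered list of clause identifiers are my assumptions, since no such directory exists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of user-side contract review. All names and identifiers are hypothetical.
class ClauseReview {
    // Clauses in the contract that the user has not previously approved,
    // in the order they appear in the contract.
    static List<String> needingReview(List<String> contract, Set<String> approved) {
        List<String> result = new ArrayList<>();
        for (String id : contract) {
            if (!approved.contains(id)) result.add(id);
        }
        return result;
    }

    // The user's own notes for clauses that appear in this contract.
    static Map<String, String> notesFor(List<String> contract, Map<String, String> myNotes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String id : contract) {
            String note = myNotes.get(id);
            if (note != null) result.put(id, note);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up clause identifiers standing in for entries in the directory.
        List<String> contract = List.of(
                "clause:no-warranty/1",
                "clause:usage-statistics/2",
                "clause:share-with-partners/1");
        Set<String> approved = Set.of("clause:no-warranty/1", "clause:usage-statistics/2");
        Map<String, String> myNotes = Map.of(
                "clause:share-with-partners/1", "dubious, check who the partners are");

        System.out.println("needs review: " + needingReview(contract, approved));
        System.out.println("my notes:     " + notesFor(contract, myNotes));
    }
}
```

The hard parts of the proposal are the directory and getting contracts written as combinations of independent clauses. The software on the user's side is the easy bit.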

Consider a world where this was standard practice. Not only is it so much easier for the user to make informed decisions, it is also much more difficult for companies to sneak in unreasonable changes in contracts because they will be so obviously flagged – and everyone following the direct links will be able to see whatever discussion and evaluation has happened. The user is not alone.

But it is also better for the company – hopefully there will be an ecosystem of clauses that give companies reasonable rights to be flexible and monetize in a reasonable fashion, while also being reasonable to the user. There should be no reason, or at least less reason (only when doing unique things that have not yet had time to establish an ecosystem of contract clauses), to construct clauses that are seen as questionable by the user and that may even be more allowing than they need to be.

Note that this system assumes that clauses are independent of each other. The way contracts are expressed has to actually change to make this a reality; you can no longer have subtle relations between different clauses of a contract; each has to stand on its own. That said, some context is probably going to be necessary, such as identifying different parties to the contract in a way which allows automatic treatment of it (for example, I might consider a certain clause acceptable if I am party A, but not if I am party B).

I do not pretend to have all the details worked out, but it seems very much plausible that something along these lines would be possible. I would appreciate comments, especially by lawyers, as to the feasibility of such a system.