A simple pattern that I have implemented myself, that I have seen in many places, and that often results in problems is the following.
We need a buffer to hold some data before we process it, so we allocate an array large enough to hold the data. To avoid continual garbage we keep this array and reuse it. If the data we need to buffer is too large, we throw away the current buffer and allocate a new, larger one. We might even get fancy and ensure that the new buffer is larger than the current one by a minimum percentage margin, so that we don’t suffer mass garbage from a 1, 2, 3, 4, 5, 6 attack: a sequence of requests, each slightly larger than the last, that would otherwise force a fresh allocation every single time.
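The pattern above can be sketched as follows. This is a minimal illustration in Java rather than the original .NET code; the class and constant names (`ReusableBuffer`, `GROWTH_FACTOR`) are my own, and a growth margin of 50% is an assumed example value.

```java
// Sketch of the reusable grow-only buffer pattern described above.
public class ReusableBuffer {
    private static final double GROWTH_FACTOR = 1.5; // minimum growth margin (assumed 50%)

    private byte[] buffer = new byte[1024];

    // Returns a buffer at least `needed` bytes long, reusing the old one when possible.
    public byte[] acquire(int needed) {
        if (needed > buffer.length) {
            // Grow by at least GROWTH_FACTOR so a 1, 2, 3, 4, 5, 6 style sequence
            // of requests does not force a reallocation on every call.
            int newSize = Math.max(needed, (int) (buffer.length * GROWTH_FACTOR));
            buffer = new byte[newSize]; // old array becomes garbage; new one is kept forever
        }
        return buffer;
    }

    public int capacity() { return buffer.length; }
}
```

Note the key property that causes the trouble described next: the buffer only ever grows, and is never released.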
Where the problem comes in is the spike. A single extraordinarily large piece of data is sent. In response we successfully allocate a huge buffer, perhaps taking up almost half of the available memory space. Because of our design, we never let it go. If we are multi-threaded, with multiple processing queues each holding its own buffer, a well-crafted scenario can consume almost all available memory with these buffers alone. Then, rather than failing with an OOM exception somewhere we can catch it, the runtime itself runs out of memory and kills the process.
Solution? Well, we could simply outlaw all large data sets, but with multiple processing queues the threshold has to be set very low, to the point where normal processing is no longer feasible. Another idea is a bit of a compromise: keep a high hard maximum, and a lower ‘spike threshold’. Any data set over the spike threshold is treated as an anomaly, and its buffer is not kept for reuse.
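That compromise might look something like this. Again a Java sketch with names and limits of my own choosing; the specific values of the hard maximum and spike threshold are illustrative assumptions.

```java
// Sketch of the 'spike threshold' compromise: requests above the threshold get a
// one-off buffer that is not retained, so a single huge value cannot pin memory.
public class SpikeAwareBuffer {
    private static final int MAX_SIZE = 64 * 1024 * 1024;   // hard maximum: reject outright
    private static final int SPIKE_THRESHOLD = 1024 * 1024; // above this: do not retain

    private byte[] retained = new byte[1024];

    public byte[] acquire(int needed) {
        if (needed > MAX_SIZE) {
            throw new IllegalArgumentException("data set too large: " + needed);
        }
        if (needed > SPIKE_THRESHOLD) {
            // Anomalously large: allocate a throwaway buffer, leave `retained` alone.
            return new byte[needed];
        }
        if (needed > retained.length) {
            retained = new byte[Math.max(needed, retained.length * 3 / 2)];
        }
        return retained;
    }

    public int retainedCapacity() { return retained.length; }
}
```

The spike buffer becomes garbage as soon as the caller is done with it, so a burst of huge values costs some allocation churn but never permanently pins half the heap.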
A bit more complicated is the reverse of the minimum margin of growth. If more than ‘k’ consecutive buffer uses each need less than the current size reduced by a percentage ‘p’, throw the current buffer away and allocate a smaller one. The open question in my mind is whether k should be a constant, or a number that decreases as the buffer gets larger. I suspect the latter gives the best behaviour, but some theoretical analysis of random and worst-case scenarios would be in order.
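A sketch of the shrink heuristic, under stated assumptions: k is shown as a constant (the text leaves open whether it should instead decrease with buffer size), and the values of k, p, and the resize policy are illustrative.

```java
// Sketch of the shrink heuristic: if the last K uses all needed less than
// SHRINK_RATIO of the current capacity, replace the buffer with a smaller one.
public class ShrinkingBuffer {
    private static final int K = 8;                 // consecutive small uses before shrinking
    private static final double SHRINK_RATIO = 0.5; // "small" = under half of capacity ('p')

    private byte[] buffer = new byte[1024];
    private int smallUseStreak = 0;
    private int streakMax = 0; // largest request seen during the current streak

    public byte[] acquire(int needed) {
        if (needed > buffer.length) {
            buffer = new byte[Math.max(needed, buffer.length * 3 / 2)];
            smallUseStreak = 0;
            streakMax = 0;
        } else if (needed < buffer.length * SHRINK_RATIO) {
            smallUseStreak++;
            streakMax = Math.max(streakMax, needed);
            if (smallUseStreak >= K) {
                // K small uses in a row: the big buffer is likely dead weight.
                // Shrink to the largest size actually needed recently.
                buffer = new byte[Math.max(1024, streakMax)];
                smallUseStreak = 0;
                streakMax = 0;
            }
        } else {
            smallUseStreak = 0; // a large-enough use resets the streak
            streakMax = 0;
        }
        return buffer;
    }

    public int capacity() { return buffer.length; }
}
```

Sizing the shrunken buffer to the largest recent request (rather than a fixed minimum) avoids immediately re-growing it on the next call.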
And why did I bring this topic up at all? System.Data.SqlClient. Large string values going to or from the database are internally handled using exactly this buffer of doom, one buffer per database connection. You have to be careful with large data or you end up allocating almost all available memory to database connection buffers…