
intermittent corrupt output from FSTObjectOutput #106

Closed
bradedelman opened this issue Dec 11, 2015 · 18 comments
@bradedelman

I'm seeing very intermittent issues with FSTObjectOutput when serializing hundreds of objects in a loop. I kept thinking: shouldn't this be completely deterministic? In the course of trying to debug it, I stumbled upon the use of SoftReference in FSTConfiguration, and it occurred to me that the intermittent behavior might stem from the garbage collector.

So I experimented with two things:

  1. Putting System.gc() in the loop. This makes it a lot slower, but failures still occur intermittently; of course, this call isn't guaranteed to actually do anything.
  2. Removing the use of SoftReference in FSTConfiguration and just holding direct references to the objects. With this change, everything appears to be stable. Promising! (A sketch of the caching pattern I mean follows below.)
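
Here's the kind of pattern I'm talking about; a minimal sketch with made-up names, not FST's actual code. The GC may clear a SoftReference at any time, so whether a cached object gets reused is non-deterministic:

import java.lang.ref.SoftReference;
import java.util.function.Supplier;

// Sketch of a soft-reference object cache (hypothetical, simplified).
class SoftCache<T> {
    private SoftReference<T> cached;

    T getOrCreate(Supplier<T> factory) {
        T obj = (cached != null) ? cached.get() : null;
        if (obj == null) {               // never cached, or cleared by the GC
            obj = factory.get();         // fall back to a fresh allocation
            cached = new SoftReference<>(obj);
        }
        return obj;                      // may be a reused or a fresh object
    }
}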

I noticed that you had this comment:
/**
* reuse heavy weight objects. If a FSTStream is closed, objects are returned and can be reused by new stream instances.
* the objects are held in soft references, so there should be no memory issues. FIXME: point of contention !
* @param cached
*/

That comment suggests (perhaps) some uncertainty about the use of SoftReference?

I don't know your code well enough to decide whether it's safe for me to use my modification (this is a utility that runs and exits, so I'm not concerned about leaking memory, but I am very concerned about reliability).

Have you seen problems with this? My test case can't be shared directly, but I've spent enough time on this problem that I could work toward developing a standalone example. Before I spend more time on it, though, I thought it was worth asking you.

Thanks,
Brad

@RuedigerMoeller
Owner

Thanks for the report.
No, the current version is in production in several large systems. The reuse mostly occurs if one uses the JDK-like API; I tend to use the FST-specific API most of the time (like FSTConfiguration.asObject etc.).

Also, things like this might happen due to incorrect multithreading/concurrency: you cannot write to an FSTObjectOutput from different threads.
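
For reference, typical single-threaded use of the FST-specific API looks roughly like this (sketch only; the key point is to create the FSTConfiguration once and reuse it):

import org.nustaq.serialization.FSTConfiguration;

public class FstRoundTrip {
    // FSTConfiguration is heavyweight; create it once and reuse it.
    static final FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();

    public static void main(String[] args) {
        byte[] bytes = conf.asByteArray("hello");   // serialize
        Object back = conf.asObject(bytes);         // deserialize
        System.out.println(back);                   // prints "hello"
    }
}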

Do you have a reproducing test case?

@bradedelman
Author

Our program runs on one thread only.
I will try to develop a standalone test case that I can share with you without sending you our entire project.

@bradedelman
Author

I have developed a stand-alone reproducible case. :-)

I'm not sure about my theory regarding SoftReference, but I know for sure that this program behaves unexpectedly. It's a stress test that serializes/deserializes what I believe should be the same object 100 times.

For me, the deserializer fails 0-10 times per run, usually around 3 or 4. I'm surprised that the program isn't completely deterministic; my working theory is that GC is the non-deterministic element.

To try it:

  1. Unzip the archive.
  2. cd to that directory.
  3. gradle build
  4. java -cp "build/libs/*" com.fingerprintplay.Main

You should see a few failures. If you don't, run it a few more times; it's very intermittent.
FWIW, I'm running on OS X 10.11:
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

I'm hopeful that you are able to reproduce and take a look.

Apologies in advance for any wrong theories here... I really appreciate your help.

Your library is awesome! We are using it to really speed up loading JavaScript in Rhino on Android. Our preprocessor fails sometimes, and this investigation is to determine why.

Attached repro case: reprocase.zip

Thank you,
Brad

@RuedigerMoeller
Owner

Appreciate your help very much. Will have a look as soon as possible.

@bradedelman
Author

thank you!

@bradedelman
Author

Curious: were you able to unzip and run the example? Did it reproduce the intermittent failure for you?

@RuedigerMoeller
Owner

Haven't had time for open source work; stay tuned :)

@bradedelman
Author

I've been trying to debug this but haven't been successful. I would love to chat about debugging strategies if you're free for a chat session somewhere. Thanks!

@RuedigerMoeller
Owner

Hm, maybe just add a flag disabling SoftReference caching entirely, using a public static. The current implementation AFAIK contains a fallback that does an allocation, so only the adding-to-cache step needs to be omitted.
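
Something along these lines, maybe (sketch only, untested; the names are made up, not FST's actual fields):

import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

// Sketch: a global switch that bypasses the soft-reference cache.
class ReuseCache {
    // When true, objects are never added to the cache, so every stream gets
    // a fresh allocation (the read path already falls back to allocating
    // when the cache is empty).
    public static boolean DISABLE_SOFTREF_CACHING = false;

    private final List<SoftReference<Object>> cachedObjects = new ArrayList<>();

    void returnObject(Object heavyWeightObject) {
        if (DISABLE_SOFTREF_CACHING)
            return;   // drop it instead of caching
        cachedObjects.add(new SoftReference<>(heavyWeightObject));
    }
}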

@bradedelman
Author

Still working to debug the serialization issue I'm seeing. I no longer think it is related to SoftReference; my latest theory is that there's a bug in the resize method of FSTIdentity2IdMap.

After hours of analysis of my bigger sample, here's a program that, for me, reliably reproduces a failure just by using FSTIdentity2IdMap in isolation. Note that if I make the initial size big enough that the map is never resized, the failure doesn't happen; hence my suspicion of the resize method.

I would be very grateful if you could take a look / help me understand/fix/update FSTIdentity2IdMap.

Thanks!

import java.util.ArrayList;
import java.util.HashMap;
import java.util.UUID;
import org.nustaq.serialization.util.FSTIdentity2IdMap;

public class Repro {
    public static void main(String[] args) throws Exception {
        int COUNT = 8192;

        // If the initial size is COUNT, this won't fail - the failure seems to
        // be due to FSTIdentity2IdMap resize: the more resizing, the greater
        // the chance of failure. Usually fails after 2280-2290 iterations,
        // at least for me.

        FSTIdentity2IdMap map = new FSTIdentity2IdMap(2);
        ArrayList<String> keys = new ArrayList<>();
        HashMap<String, Integer> values = new HashMap<>();

        for (int j = 0; j < COUNT; j++) {
            if (j > 2000 && (j % 100 >= 20)) { // after 2000 fresh strings, reuse existing keys 80% of the time
                int index = j % 2000;
                String key = keys.get(index);
                int value = map.putOrGet(key, j);   // key already present: should return the stored id
                int compare = values.get(key);
                if (value != compare) {
                    System.err.println("error " + j);
                    return;
                } else {
                    System.err.println("ok " + j);
                }
            } else {
                String key = UUID.randomUUID().toString();
                keys.add(key);
                map.putOrGet(key, j);               // first insert returns Integer.MIN_VALUE
                int value = map.get(key);
                values.put(key, value);
            }
        }
    }
}

@bradedelman
Author

More info:

The above failure goes away with a modification to the FSTIdentity2IdMap.rePut method so that it also handles a max-depth map that is using the linear scan list:

private void rePut(FSTIdentity2IdMap kfstObject2IntMap) {

    // ADD THIS AT THE TOP OF THE METHOD: if this node has degraded to the
    // linear scan list, copy those entries forward as well.
    if (linearScanList != null) {
        for (int i = 0; i < linearScanList.size(); i++) {
            Object key = linearScanList.get(i);
            int value = linearScanVals.get(i);
            kfstObject2IntMap.put(key, value);
        }
        return;
    }

    // ... original rePut body follows unchanged

I am now investigating why I end up with a max-depth FSTIdentity2IdMap faster than I would expect.

It's still quite odd that in my real use case, using the serializer, the failure is so intermittent... I've been able to prove that the Java System.identityHashCode results for specific instances cause problems, whereas others do not. I'm trying to connect the dots from that to this.
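
To illustrate why identity hashes make this run-to-run dependent, here's a tiny standalone demo (not from my repro case): the JVM assigns identity hash codes at runtime, so any structure keyed on them can lay out differently every run:

public class IdentityHashDemo {
    public static void main(String[] args) {
        // Identity hash codes are assigned by the JVM per object instance;
        // run this twice and the printed values (and thus any bucket
        // placement derived from them) will differ between runs.
        for (int i = 0; i < 3; i++) {
            System.out.println(System.identityHashCode(new Object()));
        }
    }
}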

@bradedelman
Copy link
Author

OK, I figured out why the max depth is intermittently reached more rapidly than intended.

putOrGetHash uses calcIndexFromHash to get an index into the key array. If the slot is already used, it tries the next two; if all three are used, it gives up and calls putOrGetNext, which increases the depth.

If, in putOrGetNext, I track what percentage of the mKeys array actually got used, I can see that it varies a lot: sometimes as low as 14%, quite often around 33%. So most of the allocated array is wasted, the map reaches max depth faster (depending on how unlucky the hash codes get), and that can then trigger the resize bug from my previous comment, where data in the linear lists isn't copied forward.

final int putOrGetNext(final int hash, final Object key, final int value, int depth) {
    if (next == null) { // new
        int newSiz = mKeys.length / 10;

        // ADD THIS TO SEE how many of the key slots are actually used:
        int usedSlots = 0;
        for (int i = 0; i < mKeys.length; i++) {
            if (mKeys[i] != null) {
                usedSlots++;
            }
        }
        float percentUsed = (float) usedSlots / (float) mKeys.length;
        // ... rest of putOrGetNext unchanged

Is HashMap really that much slower? Here's a drop-in replacement class, FSTIdentity2IdMapSimple:


import java.util.HashMap;

// Stand-in for FSTIdentity2IdMap backed by java.util.HashMap.
// Note: HashMap compares keys with equals() rather than identity, so this is
// only equivalent when equal keys are also the same instance (true for the
// unique UUID strings in the test above).
public class FSTIdentity2IdMapSimple {

    public HashMap<Object, Integer> mData = new HashMap<>();

    public FSTIdentity2IdMapSimple(int initialSize) {
        // initial size ignored; HashMap resizes itself
    }

    public int size() {
        return mData.size();
    }

    // used in place of .mKeys.length
    public int keysLength() {
        return mData.size();
    }

    public void clear() {
        mData.clear();
    }

    // returns the existing value for key, or Integer.MIN_VALUE if key was newly inserted
    final public int putOrGet(Object key, int value) {
        if (mData.containsKey(key)) {
            return mData.get(key);
        } else {
            mData.put(key, value);
            return Integer.MIN_VALUE;
        }
    }

    final public void put(Object key, int value) {
        mData.put(key, value);
    }

    final public int get(final Object key) {
        if (mData.containsKey(key)) {
            return mData.get(key);
        } else {
            return Integer.MIN_VALUE;
        }
    }
}

@bradedelman
Author

So, in short, I believe FSTIdentity2IdMap has a bug and an unintended behavior.

Your library is really great and fast, and it has been a huge plus for our project. I really appreciate your work on it, and I'm hopeful my effort here leads to improvements.

Please review and let me know if you have some "official fixes".

Thanks!

@bradedelman bradedelman changed the title intermittent corrupt output from FSTObjectOutput - seems related to use of SoftReference intermittent corrupt output from FSTObjectOutput Feb 12, 2016
@RuedigerMoeller
Owner

Thanks for your valuable help. Benchmarks have shown (some time ago) that HashMap is significantly slower because of its intermediate bucket objects; it requires more object allocation, which has subtle effects in real-world applications because of a higher rate of cache misses (both during put and lookup). It's hard to benchmark: usually a small benchmark does not display this weakness correctly, as everything stays in the CPU cache then. In addition, HashMap cannot be reused efficiently, as all buckets are thrown away on clear(), whereas open hashmaps can keep their preallocated arrays (so they stay in the CPU cache). Furthermore, the use and allocation of Integer objects instead of native ints adds even more allocation and memory usage to HashMap.

Regarding memory waste: as an object requires >16 bytes of header, a single bucket object plus key link plus "next" link is ~24 bytes, which means an open, array-based hashmap is on par even when using only every 6th slot of an array. Given that Integer objects (20000+ of them here) also need to be allocated, you can add another ~20 bytes for each entry, which increases the cost of a single HashMap entry to ~44 bytes, equal to 11 (!) array slots; so even a fill factor below 10 percent is more efficient in the case of array-based hashmaps.

As each dereference carries the risk of triggering a cache miss, a lookup into a "cold" (not in CPU cache) HashMap<Object,Integer> triggers significantly more cache misses than a lookup into an array-based hashmap using native ints (Integer objects also require a dereference just to read the int value).

Summary: the design of j.u.HashMap is one of the main causes of GC pressure, memory consumption, and performance degradation of Java data structures at scale :).
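
To make the layout difference concrete, here is a minimal sketch (simplified, not FST's actual code) of the open, array-based approach: keys and primitive int values live in two flat arrays, so a lookup touches a few adjacent slots and allocates nothing:

// Simplified sketch of an open-addressing identity map with primitive int
// values (assumes the arrays are sized so free slots always exist).
class OpenIdentityMap {
    final Object[] keys;
    final int[] vals;

    OpenIdentityMap(int capacity) {
        keys = new Object[capacity];
        vals = new int[capacity];
    }

    void put(Object key, int value) {
        int i = (System.identityHashCode(key) & 0x7fffffff) % keys.length;
        while (keys[i] != null && keys[i] != key)
            i = (i + 1) % keys.length;   // linear probe, no allocation
        keys[i] = key;
        vals[i] = value;
    }

    int get(Object key) {
        int i = (System.identityHashCode(key) & 0x7fffffff) % keys.length;
        while (keys[i] != null) {
            if (keys[i] == key) return vals[i];
            i = (i + 1) % keys.length;
        }
        return Integer.MIN_VALUE;        // same "not found" sentinel FST uses
    }
}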

Anyway, I'll try to fix this and release it this weekend if possible.


@RuedigerMoeller
Owner

Great find + fix!
The bug depends on the distribution of object IDs.
Still wondering why the map degrades to linear lookup so fast ..

@RuedigerMoeller
Owner

I checked the length of the lists and none of them exceeds 20 elements, so it seems the hash keys still distribute well.

Will release this with 2.44.

Many thanks for your patient investigation and the solution. Basically, the linear list was just a quick fix some years ago, and I obviously forgot to adapt the resize/rePut for it :).
Still strange that it hasn't hit other use cases, as FST is used in some pretty large production systems ..

@RuedigerMoeller
Owner

Unfortunately, several test cases (e.g. basicfsttest) fail with your changes. The reason is that for multidimensional arrays, the original position determination was correct; I've changed it in a way that hopefully makes both cases work now. Will release with 2.45.

@bradedelman
Author

Sorry about that, and thanks for the fix to the fix. Are there instructions on how to run the tests? I didn't pick up on that.
