Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
658 views
in Technique[技术] by (71.8m points)

garbage collection - How to handle :java.util.concurrent.TimeoutException: android.os.BinderProxy.finalize() timed out after 10 seconds errors?

We're seeing a number of TimeoutExceptions in GcWatcher.finalize, BinderProxy.finalize, and PlainSocketImpl.finalize. 90+% of them happen on Android 4.3. We're getting reports of this from Crittercism from users out in the field.

enter image description here

The error is a variation of: "com.android.internal.BinderInternal$GcWatcher.finalize() timed out after 10 seconds"

java.util.concurrent.TimeoutException: android.os.BinderProxy.finalize() timed out after 10 seconds
at android.os.BinderProxy.destroy(Native Method)
at android.os.BinderProxy.finalize(Binder.java:459)
at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:187)
at java.lang.Daemons$FinalizerDaemon.run(Daemons.java:170)
at java.lang.Thread.run(Thread.java:841)

So far we haven't had any luck reproducing the problem in house or figuring out what might have caused it.

Any ideas what can cause this? Any idea how to debug this and find out which part of the app causes this? Anything that sheds light on the issue helps.

More Stacktraces:

1   android.os.BinderProxy.destroy  
2   android.os.BinderProxy.finalize Binder.java, line 482
3   java.lang.Daemons$FinalizerDaemon.doFinalize    Daemons.java, line 187
4   java.lang.Daemons$FinalizerDaemon.run   Daemons.java, line 170
5   java.lang.Thread.run    Thread.java, line 841  

2

1   java.lang.Object.wait   
2   java.lang.Object.wait   Object.java, line 401
3   java.lang.ref.ReferenceQueue.remove ReferenceQueue.java, line 102
4   java.lang.ref.ReferenceQueue.remove ReferenceQueue.java, line 73
5   java.lang.Daemons$FinalizerDaemon.run   Daemons.java, line 170
6   java.lang.Thread.run

3

1   java.util.HashMap.newKeyIterator    HashMap.java, line 907
2   java.util.HashMap$KeySet.iterator   HashMap.java, line 913
3   java.util.HashSet.iterator  HashSet.java, line 161
4   java.util.concurrent.ThreadPoolExecutor.interruptIdleWorkers    ThreadPoolExecutor.java, line 755
5   java.util.concurrent.ThreadPoolExecutor.interruptIdleWorkers    ThreadPoolExecutor.java, line 778
6   java.util.concurrent.ThreadPoolExecutor.shutdown    ThreadPoolExecutor.java, line 1357
7   java.util.concurrent.ThreadPoolExecutor.finalize    ThreadPoolExecutor.java, line 1443
8   java.lang.Daemons$FinalizerDaemon.doFinalize    Daemons.java, line 187
9   java.lang.Daemons$FinalizerDaemon.run   Daemons.java, line 170
10  java.lang.Thread.run

4

1   com.android.internal.os.BinderInternal$GcWatcher.finalize   BinderInternal.java, line 47
2   java.lang.Daemons$FinalizerDaemon.doFinalize    Daemons.java, line 187
3   java.lang.Daemons$FinalizerDaemon.run   Daemons.java, line 170
4   java.lang.Thread.run
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Full disclosure - I'm the author of the previously mentioned talk in TLV DroidCon.

I had a chance to examine this issue across many Android applications, and discuss it with other developers who encountered it - and we all got to the same point: this issue cannot be avoided, only minimized.

I took a closer look at the default implementation of the Android Garbage collector code, to understand better why this exception is thrown and on what could be the possible causes. I even found a possible root cause during experimentation.

The root of the problem is at the point a device "Goes to Sleep" for a while - this means that the OS has decided to lower the battery consumption by stopping most User Land processes for a while, and turning Screen off, reducing CPU cycles, etc. The way this is done - is on a Linux system level where the processes are Paused mid run. This can happen at any time during normal Application execution, but it will stop at a Native system call, as the context switching is done on the kernel level. So - this is where the Dalvik GC joins the story.

The Dalvik GC code (as implemented in the Dalvik project in the AOSP site) is not a complicated piece of code. The basic way it work is covered in my DroidCon slides. What I did not cover is the basic GC loop - at the point where the collector has a list of Objects to finalize (and destroy). The loop logic at the base can be simplified like this:

  1. take starting_timestamp,
  2. remove object for list of objects to release,
  3. release object - finalize() and call native destroy() if required,
  4. take end_timestamp,
  5. calculate (end_timestamp - starting_timestamp) and compare against a hard coded timeout value of 10 seconds,
  6. if timeout has reached - throw the java.util.concurrent.TimeoutException and kill the process.

Now consider the following scenario:

Application runs along doing its thing.

This is not a user facing application, it runs in the background.

During this background operation, objects are created, used and need to be collected to release memory.

Application does not bother with a WakeLock - as this will affect the battery adversely, and seems unnecessary.

This means the Application will invoke the GC from time to time.

Normally the GC runs is completed without a hitch.

Sometimes (very rarely) the system will decide to sleep in the middle of the GC run.

This will happen if you run your application long enough, and monitor the Dalvik memory logs closely.

Now - consider the timestamp logic of the basic GC loop - it is possible for the device to start the run, take a start_stamp, and go to sleep at the destroy() native call on a system object.

When it wakes up and resumes the run, the destroy() will finish, and the next end_stamp will be the time it took the destroy() call + the sleep time.

If the sleep time was long (more than 10 seconds), the java.util.concurrent.TimeoutException will be thrown.

I have seen this in the graphs generated from the analysis python script - for Android System Applications, not just my own monitored apps.

Collect enough logs and you will eventually see it.

Bottom line:

The issue cannot be avoided - you will encounter it if your app runs in the background.

You can mitigate by taking a WakeLock, and prevent the device from sleeping, but that is a different story altogether, and a new headache, and maybe another talk in another con.

You can minimize the problem by reducing GC calls - making the scenario less likely (tips are in the slides).

I have not yet had the chance to go over the Dalvik 2 (a.k.a ART) GC code - which boasts a new Generational Compacting feature, or performed any experiments on an Android Lollipop.

Added 7/5/2015:

After reviewing the Crash reports aggregation for this crash type, it looks like these crashes from version 5.0+ of Android OS (Lollipop with ART) only account for 0.5% of this crash type. This means that the ART GC changes has reduced the frequency of these crashes.

Added 6/1/2016:

Looks like the Android project has added a lot of info on how the GC works in Dalvik 2.0 (a.k.a ART).

You can read about it here - Debugging ART Garbage Collection.

It also discusses some tools to get information on the GC behavior for your app.

Sending a SIGQUIT to your app process will essentially cause an ANR, and dump the application state to a log file for analysis.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...