Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
176 views
in Technique[技术] by (71.8m points)

c# - How to cancel Task await after a timeout period

I am using this method to instantiate a web browser programmatically, navigate to a url and return a result when the document has completed.

How would I be able to stop the Task and have GetFinalUrl() return null if the document takes more than 5 seconds to load?

I have seen many examples using a TaskFactory but I haven't been able to apply it to this code.

 private Uri GetFinalUrl(PortalMerchant portalMerchant)
    {
        SetBrowserFeatureControl();
        Uri finalUri = null;
        if (string.IsNullOrEmpty(portalMerchant.Url))
        {
            return null;
        }
        Uri trackingUrl = new Uri(portalMerchant.Url);
        var task = MessageLoopWorker.Run(DoWorkAsync, trackingUrl);
        task.Wait();
        if (!String.IsNullOrEmpty(task.Result.ToString()))
        {
            return new Uri(task.Result.ToString());
        }
        else
        {
            throw new Exception("Parsing Failed");
        }
    }

// by Noseratio - http://stackoverflow.com/users/1768303/noseratio    

static async Task<object> DoWorkAsync(object[] args)
{
    _threadCount++;
    Console.WriteLine("Thread count:" + _threadCount);
    Uri retVal = null;
    var wb = new WebBrowser();
    wb.ScriptErrorsSuppressed = true;

    TaskCompletionSource<bool> tcs = null;
    WebBrowserDocumentCompletedEventHandler documentCompletedHandler = (s, e) => tcs.TrySetResult(true);

    foreach (var url in args)
    {
        tcs = new TaskCompletionSource<bool>();
        wb.DocumentCompleted += documentCompletedHandler;
        try
        {
            wb.Navigate(url.ToString());
            await tcs.Task;
        }
        finally
        {
            wb.DocumentCompleted -= documentCompletedHandler;
        }

        retVal = wb.Url;
        wb.Dispose();
        return retVal;
    }
    return null;
}

public static class MessageLoopWorker
{
    #region Public static methods

    public static async Task<object> Run(Func<object[], Task<object>> worker, params object[] args)
    {
        var tcs = new TaskCompletionSource<object>();

        var thread = new Thread(() =>
        {
            EventHandler idleHandler = null;

            idleHandler = async (s, e) =>
            {
                // handle Application.Idle just once
                Application.Idle -= idleHandler;

                // return to the message loop
                await Task.Yield();

                // and continue asynchronously
                // propogate the result or exception
                try
                {
                    var result = await worker(args);
                    tcs.SetResult(result);
                }
                catch (Exception ex)
                {
                    tcs.SetException(ex);
                }

                // signal to exit the message loop
                // Application.Run will exit at this point
                Application.ExitThread();
            };

            // handle Application.Idle just once
            // to make sure we're inside the message loop
            // and SynchronizationContext has been correctly installed
            Application.Idle += idleHandler;
            Application.Run();
        });

        // set STA model for the new thread
        thread.SetApartmentState(ApartmentState.STA);

        // start the thread and await for the task
        thread.Start();
        try
        {
            return await tcs.Task;
        }
        finally
        {
            thread.Join();
        }
    }
    #endregion
}
Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Updated: the latest version of the WebBrowser-based console web scraper can be found on Github.

Updated: Adding a pool of WebBrowser objects for multiple parallel downloads.

Do you have an example of how to do this in a console app by any chance? Also I don't think webBrowser can be a class variable because I am running the whole thing in a parallell for each, iterating thousands of URLs

Below is an implementation of more or less generic **WebBrowser-based web scraper **, which works as console application. It's a consolidation of some of my previous WebBrowser-related efforts, including the code referenced in the question:

A few points:

  • Reusable MessageLoopApartment class is used to start and run a WinForms STA thread with its own message pump. It can be used from a console application, as below. This class exposes a TPL Task Scheduler (FromCurrentSynchronizationContext) and a set of Task.Factory.StartNew wrappers to use this task scheduler.

  • This makes async/await a great tool for running WebBrowser navigation tasks on that separate STA thread. This way, a WebBrowser object gets created, navigated and destroyed on that thread. Although, MessageLoopApartment is not tied up to WebBrowser specifically.

  • It's important to enable HTML5 rendering using Browser Feature Control, as otherwise the WebBrowser obejcts runs in IE7 emulation mode by default. That's what SetFeatureBrowserEmulation does below.

  • It may not always be possible to determine when a web page has finished rendering with 100% probability. Some pages are quite complex and use continuous AJAX updates. Yet we can get quite close, by handling DocumentCompleted event first, then polling the page's current HTML snapshot for changes and checking the WebBrowser.IsBusy property. That's what NavigateAsync does below.

  • A time-out logic is present on top of the above, in case the page rendering is never-ending (note CancellationTokenSource and CreateLinkedTokenSource).

using Microsoft.Win32;
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace Console_22239357
{
    class Program
    {
        // by Noseratio - https://stackoverflow.com/a/22262976/1768303

        // main logic
        static async Task ScrapeSitesAsync(string[] urls, CancellationToken token)
        {
            using (var apartment = new MessageLoopApartment())
            {
                // create WebBrowser inside MessageLoopApartment
                var webBrowser = apartment.Invoke(() => new WebBrowser());
                try
                {
                    foreach (var url in urls)
                    {
                        Console.WriteLine("URL:
" + url);

                        // cancel in 30s or when the main token is signalled
                        var navigationCts = CancellationTokenSource.CreateLinkedTokenSource(token);
                        navigationCts.CancelAfter((int)TimeSpan.FromSeconds(30).TotalMilliseconds);
                        var navigationToken = navigationCts.Token;

                        // run the navigation task inside MessageLoopApartment
                        string html = await apartment.Run(() =>
                            webBrowser.NavigateAsync(url, navigationToken), navigationToken);

                        Console.WriteLine("HTML:
" + html);
                    }
                }
                finally
                {
                    // dispose of WebBrowser inside MessageLoopApartment
                    apartment.Invoke(() => webBrowser.Dispose());
                }
            }
        }

        // entry point
        static void Main(string[] args)
        {
            try
            {
                WebBrowserExt.SetFeatureBrowserEmulation(); // enable HTML5

                var cts = new CancellationTokenSource((int)TimeSpan.FromMinutes(3).TotalMilliseconds);

                var task = ScrapeSitesAsync(
                    new[] { "http://example.com", "http://example.org", "http://example.net" },
                    cts.Token);

                task.Wait();

                Console.WriteLine("Press Enter to exit...");
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                while (ex is AggregateException && ex.InnerException != null)
                    ex = ex.InnerException;
                Console.WriteLine(ex.Message);
                Environment.Exit(-1);
            }
        }
    }

    /// <summary>
    /// WebBrowserExt - WebBrowser extensions
    /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
    /// </summary>
    public static class WebBrowserExt
    {
        const int POLL_DELAY = 500;

        // navigate and download 
        public static async Task<string> NavigateAsync(this WebBrowser webBrowser, string url, CancellationToken token)
        {
            // navigate and await DocumentCompleted
            var tcs = new TaskCompletionSource<bool>();
            WebBrowserDocumentCompletedEventHandler handler = (s, arg) =>
                tcs.TrySetResult(true);

            using (token.Register(() => tcs.TrySetCanceled(), useSynchronizationContext: true))
            {
                webBrowser.DocumentCompleted += handler;
                try
                {
                    webBrowser.Navigate(url);
                    await tcs.Task; // wait for DocumentCompleted
                }
                finally
                {
                    webBrowser.DocumentCompleted -= handler;
                }
            }

            // get the root element
            var documentElement = webBrowser.Document.GetElementsByTagName("html")[0];

            // poll the current HTML for changes asynchronosly
            var html = documentElement.OuterHtml;
            while (true)
            {
                // wait asynchronously, this will throw if cancellation requested
                await Task.Delay(POLL_DELAY, token);

                // continue polling if the WebBrowser is still busy
                if (webBrowser.IsBusy)
                    continue;

                var htmlNow = documentElement.OuterHtml;
                if (html == htmlNow)
                    break; // no changes detected, end the poll loop

                html = htmlNow;
            }

            // consider the page fully rendered 
            token.ThrowIfCancellationRequested();
            return html;
        }

        // enable HTML5 (assuming we're running IE10+)
        // more info: https://stackoverflow.com/a/18333982/1768303
        public static void SetFeatureBrowserEmulation()
        {
            if (System.ComponentModel.LicenseManager.UsageMode != System.ComponentModel.LicenseUsageMode.Runtime)
                return;
            var appName = System.IO.Path.GetFileName(System.Diagnostics.Process.GetCurrentProcess().MainModule.FileName);
            Registry.SetValue(@"HKEY_CURRENT_USERSoftwareMicrosoftInternet ExplorerMainFeatureControlFEATURE_BROWSER_EMULATION",
                appName, 10000, RegistryValueKind.DWord);
        }
    }

    /// <summary>
    /// MessageLoopApartment
    /// STA thread with message pump for serial execution of tasks
    /// by Noseratio - https://stackoverflow.com/a/22262976/1768303
    /// </summary>
    public class MessageLoopApartment : IDisposable
    {
        Thread _thread; // the STA thread

        TaskScheduler _taskScheduler; // the STA thread's task scheduler

        public TaskScheduler TaskScheduler { get { return _taskScheduler; } }

        /// <summary>MessageLoopApartment constructor</summary>
        public MessageLoopApartment()
        {
            var tcs = new TaskCompletionSource<TaskScheduler>();

            // start an STA thread and gets a task scheduler
            _thread = new Thread(startArg =>
            {
                EventHandler idleHandler = null;

                idleHandler = (s, e) =>
                {
                    // handle Application.Idle just once
                    Application.Idle -= idleHandler;
                    // return the task scheduler
                    tcs.SetResult(TaskScheduler.FromCurrentSynchronizationContext());
                };

                // handle Application.Idle just once
                // to make sure we're inside the message loop
                // and SynchronizationContext has been correctly installed
                Application.Idle += idleHandler;
                Application.Run();
            });

            _thread.SetApartmentState(ApartmentState.STA);
            _thread.IsBackground = true;
            _thread.Start();
            _taskScheduler = tcs.Task.Result;
        }

        /// <summary>shutdown the STA thread</summary>
        public void Dispose()
        {
            if (_taskScheduler != null)
            {
                var taskScheduler = _taskScheduler;
                _taskScheduler = null;

                // execute Application.ExitThread() on the STA thread
                Task.Factory.StartNew(
                    () => Application.ExitThread(),
                    CancellationToken.None,
                    TaskCreationOptions.None,
                    taskScheduler).Wait();

                _thread.Join();
                _thread = null;
            }
        }

        /// <summary>Task.Factory.StartNew wrappers</summary>
        public void Invoke(Action action)
        {
            Task.Factory.StartNew(action,
                CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Wait();
        }

        public TResult Invoke<TResult>(Func<TResult> action)
        {
            return Task.Factory.StartNew(action,
                CancellationToken.None, TaskCreationOptions.None, _taskScheduler).Result;
        }

        public Task Run(Action action, CancellationToken token)
        {
            return Task.Factory.StartNew(action, token, TaskCreationOptions.None, _taskScheduler);
        }

     

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...