Tuesday, May 10, 2016

Implementing a web crawler. Part 2. Using F#.

The previous article described the important aspects of crawler programming. In this article we will implement a web crawler in F#, taking these aspects into account.

First, let's implement an auxiliary module for handling links. It contains two functions that we will need later.
1. let prepareUrl url (parentUrl: string) — this function prepares a link URL before it is followed. As mentioned in the previous article, there are different kinds of URLs (absolute, relative, protocol-relative and invalid), so it is convenient to define an active pattern for these types for later use.
let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) =
    if input.StartsWith("http://") || input.StartsWith("https://") then AbsoluteUrl
    else if input.StartsWith("//") then ProtocolRelativeUrl
    else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
    else InvalidUrl
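With this pattern in place, a URL can be classified by an ordinary match expression. A minimal sketch (the describeUrl function and the URLs are purely illustrative, not part of the crawler):
let describeUrl url =
    match url with
    | AbsoluteUrl -> "absolute"
    | ProtocolRelativeUrl -> "protocol relative"
    | RelativeUrl -> "relative"
    | InvalidUrl -> "invalid"

describeUrl "https://example.com/page"   // "absolute"
describeUrl "//example.com/script.js"    // "protocol relative"
describeUrl "wiki/Main_Page"             // "relative"
describeUrl ""                           // "invalid"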
Also, to simplify the code, let's introduce partial active patterns for slash detection.
let (|StartsWithSlash|_|) (input: string) = if input.StartsWith("/") then Some StartsWithSlash else None
let (|EndsWithSlash|_|) (input: string) = if input.EndsWith("/") then Some EndsWithSlash else None
Below is the full source code of this function. Any URL that is not invalid is converted to an absolute one.
let prepareUrl url (parentUrl: string) =
    match url with
    | AbsoluteUrl -> Some url
    | ProtocolRelativeUrl -> 
        let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
        Some (protocol + url)
    | RelativeUrl -> 
        let preparedParentUrl = 
            match parentUrl with
            | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
            | _ -> parentUrl
        let preparedUrl = 
            match url with
            | StartsWithSlash -> url
            | _ -> "/" + url
        Some (preparedParentUrl + preparedUrl)
    | InvalidUrl -> None
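To see how it behaves, here are a few calls with made-up URLs and the values they produce:
prepareUrl "https://other.org/page" "https://example.com"       // Some "https://other.org/page"
prepareUrl "//cdn.example.com/app.js" "https://example.com/x"   // Some "https://cdn.example.com/app.js"
prepareUrl "wiki/Main_Page" "https://example.com/"               // Some "https://example.com/wiki/Main_Page"
prepareUrl "" "https://example.com"                              // None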
2. let getDomainName url: string — this function extracts the domain name from a link.
let getDomainName url: string = 
    let uri = new Uri(url)
    let host = uri.Host
    if host.StartsWith("www."then host.Substring(4)
    else host
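For example, with made-up URLs:
getDomainName "https://www.example.com/some/page"        // "example.com"
getDomainName "https://en.wikipedia.org/wiki/Web_crawler" // "en.wikipedia.org"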
Here is the full source code of this module.
namespace Crawler
 
open System
 
module LinksHandler =
    let (|AbsoluteUrl|RelativeUrl|ProtocolRelativeUrl|InvalidUrl|) (input: string) = 
            if input.StartsWith("http://") || input.StartsWith("https://"then AbsoluteUrl
            else if input.StartsWith("//"then ProtocolRelativeUrl
            else if not (String.IsNullOrWhiteSpace input) then RelativeUrl
            else InvalidUrl
 
    let (|StartsWithSlash|_|) (input: string) = if input.StartsWith("/") then Some StartsWithSlash else None
    let (|EndsWithSlash|_|) (input: string) = if input.EndsWith("/") then Some EndsWithSlash else None
 
    let prepareUrl url (parentUrl: string) =
        match url with
        | AbsoluteUrl -> Some url
        | ProtocolRelativeUrl -> 
            let protocol = parentUrl.Substring(0, parentUrl.IndexOf("//"))
            Some (protocol + url)
        | RelativeUrl -> 
            let preparedParentUrl = 
                match parentUrl with
                | EndsWithSlash -> parentUrl.Substring(0, parentUrl.Length - 1)
                | _ -> parentUrl
            let preparedUrl = 
                match url with
                | StartsWithSlash -> url
                | _ -> "/" + url
            Some (preparedParentUrl + preparedUrl)
        | InvalidUrl -> None
 
    let getDomainName url: string = 
        let uri = new Uri(url)
        let host = uri.Host
        if host.StartsWith("www."then host.Substring(4)
        else host
The next module contains the crawling logic. It also defines some additional types.
[<Measure>] type ms
 
type ParserOptions = interface
    abstract SleepTimeInsideDomain: int<ms> with get, set
    abstract StartUrl : string with get, set
    abstract CancellationTokenSource : CancellationTokenSource with get, set
    abstract ReturnPureText: bool with get, set
    abstract GetStartNode: HtmlDocument -> HtmlNode
    abstract HandleUrl : url: string * content: string -> unit
    abstract HandleException : ex: Exception * url: string -> unit
    abstract FilterUrl: url: string -> bool
end
The ParserOptions type is used to customize the crawler's behavior. The available options are described below.
1. SleepTimeInsideDomain — the delay between requests to the same domain. The unit-of-measure type 'ms' is used to express the value in milliseconds.
2. StartUrl — the URL crawling starts from.
3. CancellationTokenSource — used to cancel crawling. We pass it to the crawler through the options, and when we decide to stop crawling we simply call its Cancel() method.
4. ReturnPureText — depending on the kind of analysis, we may need either the full HTML or just the extracted plain text. If set to true, the returned text is stripped of HTML markup.
5. GetStartNode — useful when you crawl a specific site and want to analyze only a particular block of the page. For example, when crawling Wikipedia you may want to analyze only the content area, without headers and sidebars.
6. HandleUrl — the payload function called after a page is downloaded; it receives the URL and page content as parameters. For example, you can save the text content to a database.
7. HandleException — exception handler.
8. FilterUrl — lets you control which URLs may or may not be visited.

We will use an F# agent (MailboxProcessor) as the foundation of the crawler. It lets us manage shared state easily and efficiently: we need to track visited links, to avoid downloading the same page multiple times, and the last access time of each domain, to organize the delays between per-domain requests.
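To make the idea concrete, here is a minimal sketch, separate from the crawler itself (the deduplicatingAgent name and the URLs are just illustrative), of an agent whose state lives only in the argument of its recursive receive loop. Messages are processed one at a time, so no locking is needed:
let deduplicatingAgent =
    MailboxProcessor<string>.Start(fun inbox ->
        // The state (the set of messages seen so far) is threaded through the
        // recursion, so only the agent itself ever touches it.
        let rec loop (seen: Set<string>) =
            async {
                let! msg = inbox.Receive()
                if not (seen.Contains msg) then printfn "first time: %s" msg
                return! loop (seen.Add msg)
            }
        loop Set.empty)

deduplicatingAgent.Post "https://example.com"
deduplicatingAgent.Post "https://example.com" // already seen, nothing is printed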
Back to the crawler agent. First, when the agent receives a message, the link from the message is added to the visited links collection
visitedLinks.Add(preparedUrl) |> ignore
Second, we determine the delay, based on the link's domain and the previously recorded access time for that domain. If the domain has not been registered yet, the delay is 0, because we have not visited it before, and we register it with the current time. Otherwise we compare the current time with the domain's last access time. If the elapsed time is greater than the SleepTimeInsideDomain option value, the delay is also 0. If it is less, the delay is calculated as (last domain access time + SleepTimeInsideDomain - current time). For example, if SleepTimeInsideDomain is 3000 ms and the domain was last accessed 1200 ms ago, the new request is delayed by 1800 ms.
let (accessTimeByDomains, sleepTime) = 
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else 
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay = if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
After calculating the delay we handle the URL asynchronously, so while one URL is being processed the agent can receive and handle other messages. We use the F# Data library for HTML parsing and link extraction. As soon as the links are extracted, a processing request is posted to the message queue for each link that has not been visited yet. Only links satisfying the filter function (FilterUrl) from the parser options are handled.
do! Async.StartChild
        (async {
            if sleepTime > 0 then do! Async.Sleep sleepTime
 
            printfn "***********************************************************"
            printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
            printfn "***********************************************************"
            
            let! html = HtmlDocument.AsyncLoad(preparedUrl)
            let startNode = parserArguments.GetStartNode(html)
            parserArguments.HandleUrl(preparedUrl, if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString())
            let loadCompleteTime = DateTime.Now
            let links = html.Descendants ["a"]
                        |> Seq.choose (fun x ->
                        x.TryGetAttribute("href")
                        |> Option.map (fun a -> x.InnerText(), a.Value()))
                        |> Seq.filter (fun (text, link) -> not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
            links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
        }) |> Async.Ignore
Note that the MailboxProcessor.Start method receives a CancellationToken as a parameter; this is what allows us to cancel the crawling. On exception, the HandleException function from the parser options is called. When the agent finishes processing a message, it recurses with the updated collections of visited links and domain access times. Here is the full source code of this module.
namespace Crawler
 
open FSharp.Data
open System
open System.Threading
 
module Parser =
 
    [<Measure>] type ms
 
    type ParserOptions = interface
        abstract SleepTimeInsideDomain: int<ms> with get, set
        abstract StartUrl : string with get, set
        abstract CancellationTokenSource : CancellationTokenSource with get, set
        abstract ReturnPureText: bool with get, set
        abstract GetStartNode: HtmlDocument -> HtmlNode
        abstract HandleUrl : url: string * content: string -> unit
        abstract HandleException : ex: Exception * url: string -> unit
        abstract FilterUrl: url: string -> bool
    end
 
    let startCrawling (parserArguments: ParserOptions) =
        MailboxProcessor.Start((fun self -> 
            let rec handleUrl(visitedLinks: Set<string>, accessTimeByDomains: Map<string, DateTime>) =
                async {
                    let! (targetUrl, parentUrl) = self.Receive()
                    try
                        let url = LinksHandler.prepareUrl targetUrl parentUrl
                        
                        match url with
                        | Some preparedUrl ->
                            let currentLinkDomain = LinksHandler.getDomainName preparedUrl
                            
                            visitedLinks.Add(preparedUrl) |> ignore
 
                            let (accessTimeByDomains, sleepTime) = 
                                if not (accessTimeByDomains.ContainsKey(currentLinkDomain)) then
                                    accessTimeByDomains.Add(currentLinkDomain, DateTime.Now), 0
                                else 
                                    let lastDomainAccessTime = accessTimeByDomains.[currentLinkDomain]
                                    let minAllowedAccessTime = lastDomainAccessTime.AddMilliseconds(float parserArguments.SleepTimeInsideDomain)
                                    let now = DateTime.Now
                                    let delay = if minAllowedAccessTime > now then (minAllowedAccessTime - now).TotalMilliseconds else float 0
                                    let accessTimeByDomains = accessTimeByDomains.Remove(currentLinkDomain)
                                    accessTimeByDomains.Add(currentLinkDomain, (if minAllowedAccessTime > now then minAllowedAccessTime else now)), int delay
 
                            do! Async.StartChild
                                    (async {
                                        if sleepTime > 0 then do! Async.Sleep sleepTime
 
                                        printfn "***********************************************************"
                                        printfn "query to url %s at %s" preparedUrl (DateTime.Now.ToString())
                                        printfn "***********************************************************"
                                        
                                        let! html = HtmlDocument.AsyncLoad(preparedUrl)
                                        let startNode = parserArguments.GetStartNode(html)
                                        parserArguments.HandleUrl(preparedUrl, if parserArguments.ReturnPureText then HtmlUtils.extractText startNode else html.ToString())
                                        let loadCompleteTime = DateTime.Now
                                        let links = html.Descendants ["a"]
                                                    |> Seq.choose (fun x ->
                                                    x.TryGetAttribute("href")
                                                    |> Option.map (fun a -> x.InnerText(), a.Value()))
                                                    |> Seq.filter (fun (text, link) -> not (visitedLinks.Contains(link)) && parserArguments.FilterUrl(link))
                                        links |> Seq.iter (fun (text, link) -> self.Post (link, preparedUrl))
                                    }) |> Async.Ignore
                            return! handleUrl(visitedLinks.Add(preparedUrl), accessTimeByDomains)
                        | None -> return! handleUrl(visitedLinks, accessTimeByDomains)
                    with
                        | ex ->
                            parserArguments.HandleException (ex, targetUrl)
                            return! handleUrl(visitedLinks, accessTimeByDomains)
                }
            handleUrl (Set.empty, Map.empty)), parserArguments.CancellationTokenSource.Token).Post (parserArguments.StartUrl, "")
 
Finally, let's look at an example of the parser options. These are the options for parsing Wikipedia. Note that we parse only the content block, excluding headers and sidebars. In this example crawling stops after 100 URLs have been processed.

open Crawler
open System
open System.Threading
open Parser
open TextStore
open FSharp.Data
 
type ParserArguments() = 
    member val ProcessedUrlCount = 0 with get, set
    interface ParserOptions with
            member val SleepTimeInsideDomain = 3000<ms> with get, set
            member val StartUrl = "https://en.wikipedia.org" with get, set
            member val CancellationTokenSource = new CancellationTokenSource() with get, set
            member val ReturnPureText = true with get, set
            member this.HandleException(ex: Exception, url: string) =
                printfn "%s EXCEPTION for url %s" (ex.Message + "---" + ex.StackTrace) url
                ()
            member this.HandleUrl(url: string, content: string) = 
                if this.ProcessedUrlCount = 100 then (this :> ParserOptions).CancellationTokenSource.Cancel()
                else
                    this.ProcessedUrlCount <- this.ProcessedUrlCount + 1
                    UrlHandler.handleUrl(url, content)
            member this.FilterUrl(url: string) = 
                url.StartsWith("/w")
            member this.GetStartNode(document: HtmlDocument) =
                let contentDiv = document.Descendants(fun node -> node.HasId("content") && node.HasName("div") && node.HasAttribute("role", "main")) |> Seq.tryHead
                match contentDiv with
                | Some(div) -> div
                | None -> document.Body()
 
[<EntryPoint>]
let main argv = 
    let parserArguments = new ParserArguments()
    Parser.startCrawling parserArguments
    System.Console.ReadKey() |> ignore
    0 
