Extracting Links from an HTML document using a Script

Discussion in 'Scripting' started by JoJo, Aug 28, 2009.

  1. JoJo

    JoJo Guest

    Folks:

    I have an HTML document that is about 100 pages long. I assembled this
    document from the "Articles By This Author" section of the following
    web page:
    http://www.tigersharktrading.com/authors/23/Harry-Boxer

    Scattered throughout this document are many links to the web. The links of
    interest to me all start with the ">>" characters, as seen
    at TigerSharkTrading, then the name of the article is given as a link.

    * How can I quickly extract these links and transfer them to a new file?
    * Is there some type of script that can quickly accomplish this task?


    Thanks,
    JoJo.
     
    JoJo, Aug 28, 2009
    #1

  2. hi JoJo,

    I suggest using the "all" collection (of the document
    object).

    Let's say that your links appear in an "anchor" (A) tag.

    Then you could get your collection of anchor tags like this:

    document.all.tags("A")

    To get the tags you want, you could "walk-the-list" with
    some sort of a loop (your choice, try "For Each").

    The individual items would be addressed as:

    document.all.tags("A")(i) ' where i is your index

    And the number of items would be:

    document.all.tags("A").Length

    In your discussion, you mentioned the URLs, which
    probably appear as the "href" attribute of the "A"
    tag. My guess is that you can get the URL as:

    document.all.tags("A")(i).href
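
    The same walk-the-list idea also works outside the browser. Below is a
    minimal sketch in Python, using the standard library's html.parser
    instead of the IE "all" collection (a swapped-in technique, not what
    the post above describes). The names AnchorCollector and extract_hrefs,
    and the sample markup, are made up for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before this call,
        # so <A HREF=...> and <a href=...> are treated the same way.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip named anchors that have no href
                self.hrefs.append(href)


def extract_hrefs(html_text):
    """Return the href of each anchor found in html_text."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical sample markup, not taken from the real page.
sample = (
    '<p><A HREF="http://example.com/a">A</A> '
    '<a name="x">no href</a> <a href="/b">B</a></p>'
)
print(extract_hrefs(sample))
```

    For a saved document, you would read the file into a string first and
    pass that string to extract_hrefs.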


    cheers, jw
    ____________________________________________________________

    You got questions? WE GOT ANSWERS!!! ..(but, no guarantee
    the answers will be applicable to the questions)
     
    mr_unreliable, Aug 28, 2009
    #2


  3. As indicated by mr_unreliable, you will probably want to use the DOM
    objects to parse the document. I was just going to add that it appears
    all the links of interest are contained in SPAN objects that have a class name
    of 'title'. So, instead of grabbing 'all' anchors, you could grab all 'SPAN'
    objects and check for a className of 'title', and then do another grab
    within that object for all anchors (of which there is only one, the one you
    want).

    Something like: (warning - air code)

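    For Each sp In document.all.tags("SPAN")
        ' Only the SPANs with class "title" hold the article links
        If sp.className = "title" Then
            For Each ref In sp.all.tags("A")
                ' Save the href to a new file, etc.
                ' (AppendToFile is your own routine, not a built-in)
                AppendToFile ref.href
            Next
        End If
    Next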
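
    The SPAN filter can also be sketched without a browser. The Python
    below uses the standard library's html.parser rather than the IE DOM,
    so treat it as an illustration of the idea and not a translation of the
    air code. The class and function names and the sample markup are
    hypothetical, and the assumption that the wanted links sit inside
    <span class="title"> comes from the observation above, not from
    checking the live page.

```python
from html.parser import HTMLParser


class TitleLinkCollector(HTMLParser):
    """Collects hrefs of <a> tags nested inside <span class="title">."""

    def __init__(self):
        super().__init__()
        # One entry per currently open <span>: True if it is a title span.
        self.span_stack = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            # Unlike the exact className comparison, this also accepts
            # spans that list "title" among several classes.
            classes = (attrs.get("class") or "").split()
            self.span_stack.append("title" in classes)
        elif tag == "a" and any(self.span_stack):
            href = attrs.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "span" and self.span_stack:
            self.span_stack.pop()


def extract_title_links(html_text):
    """Return hrefs of anchors that appear inside a title span."""
    parser = TitleLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample markup, not taken from the real page.
sample = (
    '<span class="title">&gt;&gt; <a href="/articles/1">One</a></span>'
    '<a href="/about">About</a>'
    '<span class="date"><a href="/d">d</a></span>'
)
print(extract_title_links(sample))
```

    Keeping a stack of open spans means an anchor is only collected while
    at least one enclosing span has the title class, which mirrors the
    nested For Each loops.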

    Your own AppendToFile routine might as well make the file an HTML
    document, so you can load it in a browser and click on any interesting
    links....
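
    One way such an output routine might look, sketched in Python rather
    than VBScript: it writes the whole list at once instead of appending
    one link per call. The function names links_to_html and
    write_links_file, the page title, and the example URL are all
    hypothetical.

```python
import html
import os
import tempfile


def links_to_html(links, title="Extracted links"):
    """Build a small HTML page with one clickable list item per link."""
    items = "\n".join(
        # Escape each URL so characters like & and " stay valid in HTML.
        '<li><a href="{0}">{0}</a></li>'.format(html.escape(url, quote=True))
        for url in links
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>{}</title></head>\n<body>\n<ul>\n{}\n</ul>\n</body></html>\n"
    ).format(html.escape(title), items)


def write_links_file(links, path):
    """Write the link page to path, replacing any existing file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(links_to_html(links))


# Usage with a made-up URL, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "links.html")
    write_links_file(["http://example.com/?a=1&b=2"], out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read()
    print(written)
```

    In real use you would pass a path of your own choosing and open the
    resulting file in a browser.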

    Have fun!
    LFS
     
    Larry Serflaten, Aug 29, 2009
    #3