Extracting Links from an HTML document using a Script

 
 
JoJo
08-28-2009
Folks:

I have an HTML document that is about 100 pages long. I assembled it
from the "Articles By This Author" section of the following web page:
http://www.tigersharktrading.com/authors/23/Harry-Boxer

Scattered throughout this document are many links to the web. The links of
interest to me are all preceded by the ">>" characters, as seen at
TigerSharkTrading, followed by the name of the article as a link.

* How can I quickly extract these links and write them to a new file?
* Is there some type of script that can quickly accomplish this task?


Thanks,
JoJo.



 
mr_unreliable
08-28-2009


hi JoJo,

I suggest using the "all" collection (of the document
object).

Let's say that your links appear in an "anchor" (A) tag.

Then you could get your collection of anchor tags like this:

document.all.tags("A")

To get the tags you want, you could "walk-the-list" with
some sort of a loop (your choice, try "For Each").

The individual items would be addressed as:

document.all.tags("A")(i) ' where i is your index

And the number of items would be:

document.all.tags("A").Length

In your discussion, you mentioned the URLs, which probably
appear as the "href" attribute of the "A" tag. My guess is
that you can get each URL as:

document.all.tags("A")(i).href
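The collection approach above depends on Internet Explorer's document object. For a saved HTML file, the same idea (gather every anchor, count them, read each href) can be sketched in Python with the standard library's html.parser. This is only an illustration under stated assumptions: the names AnchorCollector and extract_hrefs are made up for this example, and Python is swapped in for the thread's VBScript.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href (e.g. named anchors)
                self.hrefs.append(href)


def extract_hrefs(html_text):
    """Rough equivalent of walking document.all.tags("A") and reading .href."""
    parser = AnchorCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    sample = '<p><a href="http://example.com/a">A</a> <a name="x">no href</a></p>'
    links = extract_hrefs(sample)
    print(len(links))  # counterpart of .Length
    for link in links:
        print(link)
```

Unlike the browser's href property, which returns a fully resolved URL, this returns the attribute text exactly as written in the file, so relative links stay relative.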


cheers, jw
____________________________________________________________

You got questions? WE GOT ANSWERS!!! ..(but, no guarantee
the answers will be applicable to the questions)



 
 
Larry Serflaten
08-29-2009




As indicated by mr_unreliable, you will probably want to use the DOM
objects to parse the document. I was just going to add that it appears
all the links of interest are contained in SPAN objects that have a class name
of 'title'. So, instead of grabbing 'all' anchors, you could grab all 'SPAN'
objects, check each for a className of 'title', and then do another grab
within that object for all anchors (of which there is only one, the one you
want).

Something like: (warning - air code)

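For Each sp In document.all.tags("SPAN")
    If sp.className = "title" Then
        For Each ref In sp.all.tags("A")
            ' Save the href to the new file, etc.
            AppendToFile ref.href
        Next
    End If
Next

' Hypothetical helper, not from the original post; "links.txt" is an assumed file name.
Sub AppendToFile(text)
    ' 8 = ForAppending; True = create the file if it does not exist yet
    With CreateObject("Scripting.FileSystemObject").OpenTextFile("links.txt", 8, True)
        .WriteLine text
        .Close
    End With
End Sub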
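For a saved file where no browser document object is available, the SPAN filter can be sketched in Python with the standard library's html.parser. The assumptions here are mine: TitleLinkCollector and extract_title_links are hypothetical names, Python is swapped in for the VBScript above, and the class test accepts "title" as one of several class names rather than requiring an exact match.

```python
from html.parser import HTMLParser


class TitleLinkCollector(HTMLParser):
    """Collects hrefs of <a> tags nested inside <span class="title">."""

    def __init__(self):
        super().__init__()
        self.links = []
        # One entry per currently open <span>: True if it is a "title" span.
        self._span_stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            self._span_stack.append("title" in classes)
        elif tag == "a" and any(self._span_stack):
            href = attrs.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "span" and self._span_stack:
            self._span_stack.pop()


def extract_title_links(html_text):
    """Returns the hrefs found inside title spans, in document order."""
    parser = TitleLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

The stack keeps track of nested spans, so an anchor counts only while at least one enclosing span carries the title class.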

Your own AppendToFile routine might as well make the file an HTML
document, so you can load it in a browser and click on any interesting
links.
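Writing the results out as a small HTML page could look roughly like the Python sketch below. The names build_link_page and write_link_page, the default file name links.html, and the page layout are all assumptions made for illustration, not anything specified in the thread.

```python
import html
from pathlib import Path


def build_link_page(links, title="Extracted links"):
    """Returns an HTML document with one clickable list item per URL."""
    items = "\n".join(
        # Escape each URL so characters such as & and " cannot break the markup.
        '<li><a href="{0}">{0}</a></li>'.format(html.escape(url, quote=True))
        for url in links
    )
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>{}</title></head>\n<body>\n<ul>\n{}\n</ul>\n</body></html>\n"
    ).format(html.escape(title), items)


def write_link_page(links, path="links.html"):
    """Writes the page to disk; the default file name is an assumption."""
    Path(path).write_text(build_link_page(links), encoding="utf-8")
```

Calling write_link_page with a list of URLs produces a file that can be opened in any browser, which matches the suggestion of clicking through the interesting links.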

Have fun!
LFS


 