Best method to use for webscraping, screen scraping, data mining for other bet sites

**davecon** · 23-03-2014, 11:02 AM

Hi Troy
Big subject, but as a starting point (if I have this right) the best way to go is to use the inner body text. This gives you ALL the text on the page without the HTML tags, so you can edit what you want using normal text methods. In other words, what you see on the page is what you get.
For editing the text, the best way to go is StringBuilder.

If of course you only want particular parts of a large and complicated web page (when the data sits within frames or widgets etc.), then you will have to grab all the HTML, strip out just the bits you want, create a new HTML doc and use that (easiest, I find). Forget stuff like Regex if I was you lol.
Anyway, have a go at this first to see what you get.
There is no need for a WebBrowser control on the form with the code below, but you may want to add one to display the pages while playing around.
Just use a Button and a multiline TextBox with both scrollbars and take it from there.
Here you go. The example is the Betfair results page.

Code:

    'Betfair results page used as the example URL
    Private strMyURL As String = "http://form.horseracing.betfair.com/daypage"

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        txt1.Text = GetInnerBodyText(strMyURL)
    End Sub

    'Loads the page in an off-screen WebBrowser and returns the body text without the tags
    Function GetInnerBodyText(ByVal URL As String) As String
        Dim result As String = ""

        Using wb As New WebBrowser() 'Create a new WebBrowser control (never shown)
            wb.ScriptErrorsSuppressed = True 'Set before navigating so script errors do not pop up
            wb.Navigate(URL)

            'Keep the message loop running until the page has finished loading
            Do While wb.ReadyState <> WebBrowserReadyState.Complete
                Application.DoEvents()
                Threading.Thread.Sleep(100)
            Loop

            'Document or Body can be Nothing if the page failed to load
            If wb.Document IsNot Nothing AndAlso wb.Document.Body IsNot Nothing Then
                result = wb.Document.Body.InnerText
            End If
        End Using 'Disposes the control and everything it owns, no manual clean-up or GC.Collect needed

        Return result
    End Function 'InnerBodyText DONE
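
On the StringBuilder suggestion above: a minimal sketch of the kind of clean-up it is handy for, assuming you already have the inner body text in a string. The names `CleanUpSketch` and `RemoveBlankLines` are made up for illustration and are not part of Dave's code.

```vbnet
Imports System
Imports System.Text

Module CleanUpSketch
    'Hypothetical helper: trims every line and drops the empty ones
    Function RemoveBlankLines(ByVal pageText As String) As String
        Dim sb As New StringBuilder()

        'Split on Windows and Unix line endings, skipping empty entries
        For Each line As String In pageText.Split({vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)
            Dim trimmed As String = line.Trim()
            If trimmed.Length > 0 Then sb.AppendLine(trimmed) 'Whitespace-only lines are skipped
        Next

        Return sb.ToString()
    End Function
End Module
```

With the button handler above you could then write `txt1.Text = RemoveBlankLines(GetInnerBodyText(strMyURL))`. Appending to a StringBuilder avoids creating a new string for every line, which adds up on a big page.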

Have Fun
Dave

**Guest** · 24-03-2014, 06:17 AM

Originally posted by davecon

Hi Troy Big Subject

Dave mate, you are correct, this is a huge subject and there are so many different methods for doing this. But hopefully towards the end of my post I can help someone else out.

I went back through some of my old VB6 code (from 2003) where I use MSHTML and scrape directly from a WebBrowser control to download CommSec stock quotes every minute. I have made tweaks over the past 10 years (about 3 of them) to keep the software working and it still works for me today. So I guess once you create web scraping software, you can be fairly confident that it won't break that often due to vendors changing the way they send the data to the browser.

The reason I posted the question is that I assumed that since 2003 (and now that I am using VB.NET) there would be something better and more powerful!

OK, so I had to do the googling myself, but as always it was well worth it. I have to say, looking at my old VB6 code, it's ugly: most of the data gets placed into a huge string array and I pull out what I want by providing the correct array offset to get the bid, ask, open, close, volume etc. Knowing what I know now I could probably clean my VB6 code up a little, but I won't until it breaks again. Why fix something that ain't broke?

OK so what did I find ....

You can use a number of components to get the raw HTML data or show the webpage such as

Webbrowser component on form
reference to WebBrowser, ie Dim wb As WebBrowser
reference to WebClient, ie Dim wc As WebClient
reference to HttpWebRequest and HttpWebResponse
reference to WebRequest and WebResponse

The WebBrowser component on a form is good because you get some visual feedback on your response, but it takes extra time to render the page.

The reference to the WebBrowser in code is basically the same control, just never shown on screen. Then we have WebClient, HttpWebRequest and WebRequest. From Google: HttpWebRequest gives you more control (headers, cookies, timeouts) than WebClient, which is really just a convenience wrapper around WebRequest. WebRequest is the base/parent class for HttpWebRequest. However I am pretty sure WebBrowser itself is a wrapper too (around the Internet Explorer control), i.e. simplified and supposed to be easier to use!

So my head is hurting... but I figure WebRequest and WebResponse will allow greater functionality to find some data (should I need it).
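
To make the "wrapper" point concrete, here is a rough sketch of the same download done with WebClient. The module and function names are made up for illustration, and the URL is only the placeholder used elsewhere in this thread.

```vbnet
Imports System
Imports System.Net

Module WebClientSketch
    'Hypothetical helper: downloads the raw HTML of a page in one call
    Function GetHtmlWithWebClient(ByVal address As String) As String
        Using wc As New WebClient()
            'DownloadString sends the request, reads the response and decodes it for you
            Return wc.DownloadString(address)
        End Using
    End Function

    Sub Main()
        'Placeholder address, swap in the page you actually want
        Dim html As String = GetHtmlWithWebClient("http://www.theURL.com.au")
        Console.WriteLine(html.Length)
    End Sub
End Module
```

This is shorter than the WebRequest / WebResponse / StreamReader route, but you give up some control over the request, which is the trade-off described above.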

Now to parse the webresponse....

In an earlier post I was crapping on about COM / DOM / XML / DHTML / MSHTML / object / ?????. The answer I was looking for is the DOM (Document Object Model) and MSHTML. With the DOM we create a document object (of type MSHTML or HTML) and load the HTML from the internet response into it. The document object then lets us walk through the document's internal elements. To read more on the DOM see http://www.w3.org/TR/DOM-Level-2-Core/introduction.html.

Now MSHTML is the ActiveX (COM) wrapper for HTML, so really I should probably just use the base html.HTMLDocument for the DOM, however I am using mshtml.HTMLDocument. If someone out there knows the difference/advantage of MSHTML v HTML feel free to post below. I may have a play later to see if I can spot any differences.

Also I noted when parsing you can use:

Basic string functions
Regex
XML parsers
HTML parsers

OK, so I was hoping we would not have to get too dirty, but I guess I did know in the back of my mind that there was not going to be a nice simple solution out there. And this is because every website is formatted differently! So we need to pick one of the above, or a combination, and then use a bit of trial and error.

String functions are a must, but I don't want to use my string functions on the full HTML document, only within one of the DOM document elements.

Dave's comment was to forget Regex, and although I did try it, I saw lots of comments saying leave it alone, and I will!

An XML parser sounds like the answer to everything (just the data, no formatting), but really it's pointless: hardly any website serves markup that is well-formed XML, so a strict XML parser will choke on the first unclosed tag and give you nothing. You are just wasting your time.

An HTML parser such as the HTML Agility Pack (http://htmlagilitypack.codeplex.com/) appeared to be an answer, so I installed the NuGet package and played with it. Again the HTML Agility Pack is just another wrapper, however apparently it has very good error handling. It worked well but the documentation was poor, and if I am going to spend time learning how to use it well then I may as well learn how to use the DOM with an HTML document or MSHTML document. I am keeping the HTML Agility Pack handy just in case I get stuck on one particular website.

So here is some code I used to scrape out the horse names of a race. Note: if you are going to scrape data then (as I have come to realize) you have to get your hands dirty. Get the webpage you want, right click and view source, and then look for the data you want. In my case I am chasing the horse name "DOG TAGS". I have shown some of the webpage below. Note that the horse name sits between <span> tags, i.e. <span>DOG TAGS</span>, and outside of this is an <A> tag with class name RnrNameLink, i.e.

Code:

             <tr bgcolor="#eaeaea">
                  <td width="6" height="27">*</td>
                  <td height="27">1</td>
                  <td height="27"><a class="RnrNameLink" target="_blank" href="/racing/formguide.aspx?year=2014&amp;month=3&amp;day=24&amp;meeting=CR&amp;race=5#1"><span>DOG TAGS</span></a></td>
                  <td height="27" style="text-align:right;padding-right:15px;">1.0</td>
                  <td width="100" height="27" id="1_toteOdds">
                    <table width="100%" border="0" cellspacing="0" cellpadding="0">
                      <tr bgcolor="#ebe3d1">
                        <td width="6%" height="27">*</td>
                        <td width="44%" height="27" style="text-align:right;padding-right:5px;" valign="middle" id="1_betOnWin" class="condChangedfav"><a href="#" style="text-decoration:underline;font-weight:normal;">1.4</a></td>
                        <td width="44%" height="27" style="text-align:right;padding-right:5px;"><a href="#" style="text-decoration:underline;font-weight:normal;" id="1_betOnPlace">3.9</a></td>
                        <td width="6%" height="27">*</td>
                      </tr>
                    </table>
                  </td>
                  <td width="10" height="27">*</td>
                  <td style="display: none;" id="1_betTd">

In fact all the horse details have this class RnrNameLink which must be defined in their CSS. So I can use this to quickly jump to the data I want.

The code below comes from this link http://www.dreamincode.net/forums/to...iew-just-text/

Code:

    'At the top of the code file
    Imports System.IO
    Imports System.Net

    Private Sub btnYourButton_Click(sender As Object, e As EventArgs) Handles btnYourButton.Click
        Dim strWebPage As String
        strWebPage = getHTML("http://www.theURL.com.au") 'WebRequest.Create needs the http:// scheme
        Debug.Print(textFromHtml(strWebPage))
    End Sub

    'Downloads the raw HTML of a page as a string
    Private Function getHTML(ByVal address As String) As String
        Dim WRequest As WebRequest = WebRequest.Create(address)

        'Using closes the response and the reader even if the read fails
        Using WResponse As WebResponse = WRequest.GetResponse(),
              SR As New StreamReader(WResponse.GetResponseStream())
            Return SR.ReadToEnd()
        End Using
    End Function

    'Loads the HTML into an MSHTML document and collects the text of every <A class="RnrNameLink">
    'Needs a project reference to Microsoft.mshtml
    Function textFromHtml(ByVal htmlToParse As String) As String
        'write, close, title and body are members of IHTMLDocument2, not IHTMLDocument
        Dim htmlDocument As mshtml.IHTMLDocument2 = CType(New mshtml.HTMLDocument(), mshtml.IHTMLDocument2)
        Dim sCollect As String = ""

        htmlDocument.write(htmlToParse)
        htmlDocument.close()

        Debug.Print(htmlDocument.title)

        Dim allElements As mshtml.IHTMLElementCollection = CType(htmlDocument.body.all, mshtml.IHTMLElementCollection) 'Get all elements (everything)
        Dim all_a_tags As mshtml.IHTMLElementCollection = CType(allElements.tags("a"), mshtml.IHTMLElementCollection) 'Make a collection with all the <A> tags

        For Each elem As mshtml.IHTMLElement In all_a_tags

            If elem.className = "RnrNameLink" Then 'If my A tag has the class I am looking for
                Debug.Print(elem.innerText) 'Returns DOG TAGS
                Debug.Print(elem.innerHTML) 'Returns <SPAN>DOG TAGS</SPAN>
                sCollect &= elem.innerText & vbCrLf 'Collect the horse names, one per line
            End If

        Next elem

        Return sCollect
    End Function

Note how elem.innerText strips out the inner tags (here the <SPAN>) and returns just the horse name DOG TAGS, while elem.innerHTML returns the markup inside the <A> tag, i.e. <SPAN>DOG TAGS</SPAN>. If the horse name was buried in further HTML tags, e.g. <TD> etc., then we could either call a string function to remove the tags (if they are easy to remove) or create another mshtml.IHTMLElementCollection and drill down into it.

WOW what a post! I hope this makes sense and it helps someone else.

Troy

**betdynamics** · 24-03-2014, 10:02 PM

Personally I would persevere with the HtmlAgilityPack route - it makes this sort of thing an absolute snap.
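
For anyone following along, a rough sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is installed and the page uses the same RnrNameLink markup quoted earlier in the thread. The names `HapSketch` and `GetRunnerNames` and the sample string are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module HapSketch
    'Hypothetical helper: returns the text of every <a class="RnrNameLink"> in the HTML
    Function GetRunnerNames(ByVal html As String) As List(Of String)
        Dim names As New List(Of String)

        'Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        'SelectNodes returns Nothing (not an empty collection) when nothing matches
        Dim links As HtmlAgilityPack.HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//a[@class='RnrNameLink']")

        If links IsNot Nothing Then
            For Each link As HtmlAgilityPack.HtmlNode In links
                'InnerText drops the nested <span>; DeEntitize turns &amp; etc. back into characters
                names.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(link.InnerText).Trim())
            Next
        End If

        Return names
    End Function

    Sub Main()
        'Cut-down sample of the markup quoted earlier in the thread
        Dim sample As String =
            "<table><tr><td><a class=""RnrNameLink"" href=""#""><span>DOG TAGS</span></a></td></tr></table>"

        For Each name As String In GetRunnerNames(sample)
            Console.WriteLine(name)
        Next
    End Sub
End Module
```

The XPath here matches the class attribute exactly, so it would need adjusting if a site puts more than one class on the link.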

**Guest** · 25-03-2014, 12:36 AM

Originally posted by betdynamics

Personally I would persevere with the HtmlAgilityPack route - it makes this sort of thing an absolute snap.

Thanks Betdynamics

I intend to scrape around 5 or 6 other sites, so once I finish scraping my first betting site, I will go back and do it again using the HTML Agility Pack.

Then use my preferred method for the other sites. Keep you posted.

**bnl** · 25-03-2014, 08:21 AM

I have had luck with lynx --dump.
But that is on a Linux box.
It gives you the webpage without HTML, so you can treat the page as a
text file. But as with HTML scraping, you need to know what the page looks like.
And you need another box, or a virtual box, to run Linux in.

--Björn

**Guest** · 27-03-2014, 01:31 AM

Originally posted by bnl

It gives you the webpage without HTML, so you can treat the page as a text file. But as with HTML scraping, you need to know what the page looks like.
--Björn

The problem with a big text file is that you have to do lots of hard work. If you use the DOM (say via MSHTML) then you can narrow down your searching by using the HTML element class names or ids which the vendors have kindly included.

Currently I am just scraping the odds off these sites, but eventually I want my bot to be able to place bets automatically on third party sites. Maybe I should open a thread up and we can all post our code to do such operations.

I am mainly interested in betting sites such as sportsbet, tatts, sportingbet, centrebet, pinnacle etc.
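
As a rough illustration of narrowing by id, here is a sketch that assumes a project reference to Microsoft.mshtml and uses the id 1_betOnWin from the page source quoted earlier in the thread. The names `IdLookupSketch` and `GetTextById` and the sample string are made up for illustration.

```vbnet
Imports System

Module IdLookupSketch
    'Hypothetical helper: returns the inner text of the element with the given id, or "" if not found
    Function GetTextById(ByVal htmlToParse As String, ByVal elementId As String) As String
        Dim doc2 As mshtml.IHTMLDocument2 = CType(New mshtml.HTMLDocument(), mshtml.IHTMLDocument2)
        doc2.write(htmlToParse)
        doc2.close()

        'getElementById is a member of IHTMLDocument3, so cast the same document across
        Dim doc3 As mshtml.IHTMLDocument3 = CType(doc2, mshtml.IHTMLDocument3)
        Dim elem As mshtml.IHTMLElement = doc3.getElementById(elementId)

        If elem Is Nothing Then Return "" 'No element with that id on this page
        Return If(elem.innerText, "")      'innerText can be Nothing for an empty element
    End Function

    <STAThread()>
    Sub Main()
        'Cut-down sample of the markup quoted earlier in the thread
        Dim sample As String =
            "<html><body><table><tr><td id=""1_betOnWin""><a href=""#"">1.4</a></td></tr></table></body></html>"

        'Should print the win odds held in that cell
        Console.WriteLine(GetTextById(sample, "1_betOnWin"))
    End Sub
End Module
```

Jumping straight to an id like this avoids looping over every tag, though it only helps on sites where the vendor keeps the ids stable.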
