Best method to use for webscraping, screen scraping, data mining for other bet sites

  • Guest

    #1


    First, apologies for posting on the API-NG board; it's just that this board appears to have a lot of knowledgeable respondents following it.

    So I have coded up my bot which is extracting the data I need out of BetFair and it appears to be working well.

    What I want to do now is to scrape matching odds from other betting agencies to ensure I am getting the best odds at Betfair before putting my bet on.

    I have seen plenty of VB.NET examples where you
    • use a WebClient
    • send off a URL request
    • get a response
    • parse the text


    The problem (to me) is that the parsing part is quite ugly.
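
    A minimal sketch of the four steps above, to show why the parsing step feels ugly. The names TextBetween and DownloadPage are made up for illustration, and any URL passed to DownloadPage would be a placeholder. The sample markup is a hypothetical fragment, so Main runs without a network connection.

    ```vbnet
    Imports System
    Imports System.Net

    Module ScrapeSketch

        ' Steps 1 to 3: use a WebClient, send the request, get the response as text.
        Function DownloadPage(ByVal url As String) As String
            Using wc As New WebClient()
                Return wc.DownloadString(url)
            End Using
        End Function

        ' Step 4: parse the text by hand. Returns the text between the first
        ' startMarker and the next endMarker, or "" when either marker is missing.
        Function TextBetween(ByVal source As String, ByVal startMarker As String, ByVal endMarker As String) As String
            Dim startPos As Integer = source.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase)
            If startPos < 0 Then Return ""
            startPos += startMarker.Length
            Dim endPos As Integer = source.IndexOf(endMarker, startPos, StringComparison.OrdinalIgnoreCase)
            If endPos < 0 Then Return ""
            Return source.Substring(startPos, endPos - startPos)
        End Function

        Sub Main()
            ' Hypothetical fragment standing in for a downloaded page
            Dim sample As String = "<a class=""RnrNameLink"" href=""#""><span>TWISTED MILLER</span></a>"
            Console.WriteLine(TextBetween(sample, "<span>", "</span>"))   ' prints TWISTED MILLER
        End Sub

    End Module
    ```

    This only finds the first match and knows nothing about nesting or attributes, which is exactly the sort of thing a DOM or HTML parser takes care of.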


    I remember in my VB6 days I could use something like ...
    Dim mDocument As MSHTML.HTMLDocument
    Dim A_collection As IHTMLElementCollection

    However it has been a while since I have used the above and of course now I am using VB.NET with hopefully better methods.

    From memory the above would return a collection which could be filtered by passing in different HTML or XML tag names, i.e. we could pass in "SPAN" and get the SPAN elements. Many betting sites format their data using class names such as ...

    <a class="RnrNameLink" target="_blank" href="/racing/formguide.aspx?year=2014&amp;month=3&amp;day=23&amp;meeting=NR&amp;race=1#1"><span>TWISTED MILLER</span></a>

    Therefore we could collect the <A> tags and then look for class="RnrNameLink" to help us locate the information.



    So now I am in VB.NET, I am hoping there is some type of COM / DOM / XML / DHTML / MSHTML / object / class / ???? that we can set against the webresult that does all the hard formatting and parsing work, and then I simply drill down through some class or structure to extract the information I need

    Or am I dreaming?

    Troy
  • davecon
    Junior Member
    • Dec 2010
    • 86

    #2
    Hi Troy
    Big subject, but the best way to go (if I have this right, and just as a starting point) is to use the inner body text. This gives you ALL the visible text of the page without the tags etc., so you can just edit out what you want using normal text methods. In other words, what you see in the browser is what you get.
    For editing text, the best way to go is StringBuilder.

    If of course you just want particular parts of a large and complicated web page (within frames or widgets etc.), then you will have to grab all the HTML, strip out just the bits you want, create a new HTML doc and use that (easiest, I find). Forget stuff like Regex if I were you lol
    Anyway, have a go at this first to see what you get.
    There is no need to put a WebBrowser control on the form for the code below, but you may want to add one to display the pages you want while playing around.
    Just use a Button and a multiline TextBox with both scrollbars and take it from there.
    Here you go. The example is the Betfair results page.
    Code:
        Private strMyURL As String = "http://form.horseracing.betfair.com/daypage" 'Betfair Results Page

        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

            txt1.Text = GetInnerBodyText(strMyURL)

        End Sub

        Function GetInnerBodyText(ByVal URL As String) As String

            'Create a WebBrowser in code only; Using disposes the control and its resources
            Using wb As New WebBrowser()
                wb.ScriptErrorsSuppressed = True 'Set before navigating so it applies to this page
                wb.Navigate(URL)

                'Keep the UI responsive until the page has finished loading
                Do While wb.ReadyState <> WebBrowserReadyState.Complete
                    Application.DoEvents()
                    Threading.Thread.Sleep(100)
                Loop

                'Document or Body can be Nothing if the page failed to load
                If wb.Document Is Nothing OrElse wb.Document.Body Is Nothing Then Return ""

                Return wb.Document.Body.InnerText
            End Using

        End Function 'InnerBodyText DONE
    Have Fun
    Dave
    Last edited by davecon; 23-03-2014, 01:58 PM. Reason: Better Example page


    • Guest

      #3
      Originally posted by davecon
      Hi Troy Big Subject
      Dave mate, you are correct, this is a huge subject and there are so many different methods for doing this. But hopefully towards the end of my post I can help someone else out.

      I went back through some of my old VB6 code (from 2003) where I use MSHTML and scrape directly from a WebBrowser to download CommSec stock quotes every minute. I have made tweaks (about 3) over the past 10 years to keep the software working, and it still works for me today. So I guess once you create web scraping software, you can be fairly confident it won't break that often from vendors changing the way they send the data to the browser.

      The reason I posted the question is I assumed since 2003 (and now using VB.NET) that there would be something better and more powerful!

      OK, so I had to do the googling myself, but as always it was well worth it. I have to say, looking at my old VB6 code, it's ugly, with most of the data being placed into a huge string array where I pull out what I want by providing the correct array offset to get the bid, ask, open, close, volume etc. Knowing what I know now I could probably clean my VB6 code up a little, but I won't until it breaks again. Why fix something that ain't broke?

      OK so what did I find ....

      You can use a number of components to get the raw HTML data or show the webpage such as
      • Webbrowser component on form
      • reference to WebBrowser, ie Dim wb As WebBrowser
      • reference to WebClient, ie Dim wc As WebClient
      • reference to HttpWebRequest and HttpWebResponse
      • reference to WebRequest and WebResponse



      The WebBrowser component on a form is good as you get some visual feedback on your response, but it takes extra time to display this.

      The reference to the WebBrowser in code is basically the same control without placing it on a form. Then we have WebClient, HttpWebRequest, and WebRequest. From Google: HttpWebRequest gives you more control than WebClient, which is really just a convenience wrapper around it. WebRequest is the base (parent) class of HttpWebRequest. And I am pretty sure WebBrowser itself is a wrapper too (around the Internet Explorer control), i.e. simplified and supposed to be easier to use!

      So my head is hurting .... but I figure WebRequest and WebResponse will allow greater functionality to find some data (should I need it).


      Now to parse the webresponse....

      In an earlier post I was crapping on about COM / DOM / XML / DHTML / MSHTML / object / ?????. The answer I was looking for is the DOM (Document Object Model) and MSHTML. With the DOM we create a document object (an MSHTML or HTML document) and load the web response into it. The document object then lets us walk through the document's internal elements. To read more on the DOM see http://www.w3.org/TR/DOM-Level-2-Core/introduction.html.

      Now MSHTML is the COM (ActiveX) library behind Internet Explorer's HTML engine, and as far as I can tell the .NET HtmlDocument class is a managed wrapper around it, so really I should probably just use HtmlDocument for the DOM. However, I am using mshtml.HTMLDocument. If someone out there knows the practical difference or advantage of MSHTML vs HtmlDocument, feel free to post below. I may have a play later to see if I can spot any differences.

      Also I noted when parsing you can use:
      • Basic string functions
      • Regex
      • XML parsers
      • HTML parsers


      OK, so I was hoping we would not have to get too dirty, but I guess I did know in the back of my mind there was not going to be a nice simple solution out there. And this is because every website is formatted differently! So we need to pick one of the above, or a combination, and then use a bit of trial and error.

      String functions are a must but I don't want to use my string functions on the full HTML document but only within one of the DOM document elements.

      Dave's comment was to forget Regex, and although I did try it, I saw lots of comments saying leave it alone, and I will!

      An XML parser sounds like the answer to everything: just the data, no formatting. But really it's pointless, as most websites are not well-formed XML and are never tested for compliance, so a strict XML parser will typically just throw an error on the first unclosed tag or unquoted attribute. You are wasting your time.
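
      To illustrate the XML point, here is a small sketch (the IsWellFormedXml helper and both sample strings are made up for the example) that feeds a strict XML parser one well-formed fragment and one fragment written the way real pages often are, with an unquoted attribute and an unclosed <br>.

      ```vbnet
      Imports System
      Imports System.Xml
      Imports System.Xml.Linq

      Module XmlCheckSketch

          ' True when the text parses as well-formed XML, False when the parser rejects it.
          Function IsWellFormedXml(ByVal text As String) As Boolean
              Try
                  XDocument.Parse(text)
                  Return True
              Catch ex As XmlException
                  Return False
              End Try
          End Function

          Sub Main()
              Dim strict As String = "<tr><td width=""6"">1</td></tr>"
              Dim typical As String = "<tr><td width=6>1<br></td></tr>"

              Console.WriteLine(IsWellFormedXml(strict))    ' prints True
              Console.WriteLine(IsWellFormedXml(typical))   ' prints False
          End Sub

      End Module
      ```

      A browser renders both fragments happily, which is why an HTML-aware parser is the better fit for scraping.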

      An HTML parser such as the HTML Agility Pack (http://htmlagilitypack.codeplex.com/) appeared to be an answer, and I downloaded the NuGet package and played with it. Again, the HTML Agility Pack is yet another library to learn, however apparently it has very good error handling. It worked well, but the documentation was poor, and if I am going to spend time learning how to use it well then I may as well learn how to use the DOM with an HTML document or MSHTML document. I am keeping the HTML Agility Pack handy just in case I get stuck on one particular website.
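
      For anyone who wants to try the HTML Agility Pack route, this is roughly what it looks like. It is a sketch only: it assumes the third-party HtmlAgilityPack NuGet package is referenced, the RunnerNames helper is a made-up name, and the sample string is modelled on the RnrNameLink markup from the first post.

      ```vbnet
      Imports System
      Imports System.Collections.Generic
      Imports HtmlAgilityPack   ' third-party NuGet package, not part of .NET

      Module AgilitySketch

          ' Returns the text of every <a class="RnrNameLink"> element found in the HTML.
          Function RunnerNames(ByVal html As String) As List(Of String)
              Dim names As New List(Of String)

              ' Fully qualified to avoid clashing with System.Windows.Forms.HtmlDocument
              Dim doc As New HtmlAgilityPack.HtmlDocument()
              doc.LoadHtml(html)

              ' SelectNodes returns Nothing (not an empty collection) when nothing matches
              Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@class='RnrNameLink']")
              If links Is Nothing Then Return names

              For Each link As HtmlNode In links
                  ' InnerText drops the nested <span> tags; DeEntitize decodes things like &amp;
                  names.Add(HtmlEntity.DeEntitize(link.InnerText).Trim())
              Next

              Return names
          End Function

          Sub Main()
              Dim sample As String = "<td><a class=""RnrNameLink"" href=""#""><span>TWISTED MILLER</span></a></td>"
              For Each name As String In RunnerNames(sample)
                  Console.WriteLine(name)
              Next
          End Sub

      End Module
      ```

      The XPath expression does the filtering by tag and class in one step, so there is no manual loop over every <A> tag.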

      So here is some code I used to scrape out the horse names of a race. Note: if you are going to scrape data then (as I have come to realize) you have to get your hands dirty. Get the webpage you want, right click and view source, and then look for the data you want. In my case I am chasing the horse name "DOG TAGS". I have shown some of the webpage below. Note that the horse name sits between <SPAN> </SPAN> tags, i.e. <span>DOG TAGS</span>, which in turn sit inside an <A> tag with class name RnrNameLink, i.e.

      Code:
                   <tr bgcolor="#eaeaea">
                        <td width="6" height="27">*</td>
                        <td height="27">1</td>
                        <td height="27"><a class="RnrNameLink" target="_blank" href="/racing/formguide.aspx?year=2014&amp;month=3&amp;day=24&amp;meeting=CR&amp;race=5#1"><span>DOG TAGS</span></a></td>
                        <td height="27" style="text-align:right;padding-right:15px;">1.0</td>
                        <td width="100" height="27" id="1_toteOdds">
                          <table width="100%" border="0" cellspacing="0" cellpadding="0">
                            <tr bgcolor="#ebe3d1">
                              <td width="6%" height="27">*</td>
                              <td width="44%" height="27" style="text-align:right;padding-right:5px;" valign="middle" id="1_betOnWin" class="condChangedfav"><a href="#" style="text-decoration:underline;font-weight:normal;">1.4</a></td>
                              <td width="44%" height="27" style="text-align:right;padding-right:5px;"><a href="#" style="text-decoration:underline;font-weight:normal;" id="1_betOnPlace">3.9</a></td>
                              <td width="6%" height="27">*</td>
                            </tr>
                          </table>
                        </td>
                        <td width="10" height="27">*</td>
                        <td style="display: none;" id="1_betTd">
      In fact all the horse details have this class RnrNameLink which must be defined in their CSS. So I can use this to quickly jump to the data I want.


      The code below comes from this link http://www.dreamincode.net/forums/to...iew-just-text/

      Code:
      'At the top of the file. The project also needs a reference to the Microsoft.mshtml interop assembly.
      Imports System.IO
      Imports System.Net

      Private Sub btnYourButton_Click(sender As Object, e As EventArgs) Handles btnYourButton.Click
          Dim strWebPage As String
          strWebPage = getHTML("http://www.theURL.com.au")     'WebRequest.Create needs the http:// scheme
          Debug.Print(textFromHtml(strWebPage))
      End Sub

      Private Function getHTML(ByVal address As String) As String
          Dim WRequest As WebRequest = WebRequest.Create(address)

          'Using closes the response and the reader even if the read fails
          Using WResponse As WebResponse = WRequest.GetResponse()
              Using SR As New StreamReader(WResponse.GetResponseStream())
                  Return SR.ReadToEnd()
              End Using
          End Using
      End Function

      Function textFromHtml(ByVal htmlToParse As String) As String
          'write, close, title and body are members of IHTMLDocument2
          Dim htmlDocument As mshtml.IHTMLDocument2 = DirectCast(New mshtml.HTMLDocument(), mshtml.IHTMLDocument2)
          Dim sCollect As String = ""

          htmlDocument.write(htmlToParse)
          htmlDocument.close()

          Debug.Print(htmlDocument.title)

          'Get all elements (everything)
          Dim allElements As mshtml.IHTMLElementCollection = DirectCast(htmlDocument.body.all, mshtml.IHTMLElementCollection)
          'Make a collection with all the <A> tags
          Dim all_a_tags As mshtml.IHTMLElementCollection = DirectCast(allElements.tags("a"), mshtml.IHTMLElementCollection)

          For Each elem As mshtml.IHTMLElement In all_a_tags

              If elem.className = "RnrNameLink" Then            'If my A tag has the class I am looking for
                  Debug.Print(elem.innerHTML)                   'Returns   <SPAN>DOG TAGS</SPAN>
                  Debug.Print(elem.innerText)                   'Returns   DOG TAGS
                  sCollect &= elem.innerText & vbCrLf           'Collect one horse name per line
              End If

          Next elem

          Return sCollect
      End Function

      Note how elem.innerHTML returns the markup inside the <A> tag, including the inner <SPAN> tags, while elem.innerText strips the tags and returns just the horse name DOG TAGS. If the horse name was buried in further HTML tags, e.g. <B>, <TD> etc., then either we could rely on innerText (or a string remove function) to drop the tags if they are easy to remove, or we could create another mshtml.IHTMLElementCollection and drill down into it.


      WOW what a post! I hope this makes sense and it helps someone else.

      Troy
      Last edited by Guest; 27-03-2014, 01:06 AM.


      • betdynamics
        Junior Member
        • Sep 2010
        • 534

        #4
        Personally I would persevere with the HtmlAgilityPack route - it makes this sort of thing an absolute snap.


        • Guest

          #5
          Originally posted by betdynamics
          Personally I would persevere with the HtmlAgilityPack route - it makes this sort of thing an absolute snap.
          Thanks Betdynamics

          I intend to scrape around 5 or 6 other sites, so once I finish scraping my first betting site, I will go back and do it again using the HTML Agility Pack.

          Then I will use my preferred method for the other sites. I will keep you posted.


          • bnl
            Junior Member
            • Nov 2012
            • 108

            #6
            I have had luck with lynx --dump.
            But that is on a linux box.
            It gives you the webpage without HTML, so you can treat the page as a
            text file. But as with HTML scraping, you need to know what the page looks like.
            And you need another box, or a virtual box, to run Linux in.

            --Björn


            • Guest

              #7
              Originally posted by bnl
              It gives you the webpage without html, so you can treat the page as a Textfile. But as with html scraping, you need to know how the page looks like.
              --Björn

              The problem with a big text file is that you have to do lots of hard work. If you use the DOM (say via MSHTML) then you can narrow down your searching by using the HTML element class names OR ids which the vendors have kindly included.

              Currently I am just scraping the odds off these sites, but eventually I want my bot to be able to place bets automatically on third party sites. Maybe I should open up a thread where we can all post our code for such operations. I am mainly interested in betting sites such as sportsbet, tatts, sportingbet, centrebet, pinnacle etc.
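
              As a rough sketch of the id approach: the page fragment I posted earlier carries ids such as 1_betOnWin, and the DOM can jump straight to one of them. This assumes a Windows machine and a project reference to the Microsoft.mshtml interop assembly. TextById is a made-up helper name and the sample HTML is a cut-down stand-in for the real page.

              ```vbnet
              Imports System

              Module IdLookupSketch

                  ' Loads the HTML into an MSHTML document and returns the visible text
                  ' of the element with the given id, or "" when no such element exists.
                  Function TextById(ByVal html As String, ByVal elementId As String) As String
                      Dim doc2 As mshtml.IHTMLDocument2 = DirectCast(New mshtml.HTMLDocument(), mshtml.IHTMLDocument2)
                      doc2.write(html)
                      doc2.close()

                      ' getElementById lives on the IHTMLDocument3 interface
                      Dim doc3 As mshtml.IHTMLDocument3 = DirectCast(doc2, mshtml.IHTMLDocument3)
                      Dim elem As mshtml.IHTMLElement = doc3.getElementById(elementId)
                      If elem Is Nothing Then Return ""

                      Return elem.innerText
                  End Function

                  Sub Main()
                      ' Cut-down sample based on the odds cell in the earlier page source
                      Dim sample As String = "<html><body><table><tr>" &
                          "<td id=""1_betOnWin""><a href=""#"">1.4</a></td>" &
                          "</tr></table></body></html>"
                      Console.WriteLine(TextById(sample, "1_betOnWin"))
                  End Sub

              End Module
              ```

              Compare that with hunting for the same number in a flat text dump, where there is nothing to tell the win odds apart from the place odds except their position on the line.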
              Last edited by Guest; 27-03-2014, 01:56 AM.

