Friday, March 30, 2007

Screen scraping

What is Screen Scraping?

Screen scraping means reading the contents of a web page from code. Suppose you go to yahoo.com: what you see is the interface, which includes buttons, links, images and so on. What you don't see is the target URL of each link, the file names of the images, or the method a form uses when its button is clicked, which can be POST or GET. In other words, you don't see the HTML behind the page. Screen scraping pulls the HTML of the web page, including every tag that is used to make up the page.

Why use screen scraping?

The question that comes to mind is why we would ever want the HTML of a web page. Screen scraping does not stop at pulling out the HTML; it can display it as well. In other words, you can pull the HTML from any web page and display that page inside your own page, much as you would with frames. The good thing about screen scraping is that the page is fetched on the server, so it works in every browser, whereas frames are unfortunately not supported by all of them.

Also, sometimes you go to a website that has many links labelled image1, image2, image3 and so on. To see each image you have to click its link, and the image opens enlarged in the same window or in a new one. By using screen scraping you can pull all the images from a particular web page and display them on your own page.
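
As a rough sketch of the image idea, assume the HTML of the page has already been downloaded into a string. The class name ImageScraper, the method ExtractImageUrls and the example.com addresses are made up for this illustration and are not part of the page shown later in this post. The regular expression is deliberately naive: it only handles quoted src attributes and would also pick up attributes such as data-src, so a real HTML parser would be a safer choice for anything serious.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageScraper
{
    // Naive pattern: captures the quoted src value of each <img> tag.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the image URLs found in html, made absolute against pageUrl.
    public static List<string> ExtractImageUrls(string html, string pageUrl)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> urls = new List<string>();
        foreach (Match m in ImgSrc.Matches(html))
        {
            Uri absolute;
            // Resolve relative paths such as "/a.png" against the page address.
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out absolute))
            {
                urls.Add(absolute.AbsoluteUri);
            }
        }
        return urls;
    }

    public static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        string html = "<p><img src=\"/a.png\"><IMG alt='x' SRC='http://example.com/b.jpg'></p>";
        foreach (string url in ExtractImageUrls(html, "http://example.com/gallery/"))
        {
            Console.WriteLine(url);
        }
        // prints:
        // http://example.com/a.png
        // http://example.com/b.jpg
    }
}
```

Each URL returned this way could then be written into an img tag on your own page, so the visitor sees all the pictures at once instead of clicking through every link.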



code
===================================================================
//Html part
//start html

----------------------------------------------------
<%@ Page language="c#" Codebehind="screenscrapping.aspx.cs" AutoEventWireup="false" Inherits="oops.screenscrapping" %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
<HTML>
<HEAD>
<title>screenscrapping</title>
<meta content="Microsoft Visual Studio 7.0" name=GENERATOR>
<meta content=C# name=CODE_LANGUAGE>
<meta content=JavaScript name=vs_defaultClientScript>
<meta content=http://schemas.microsoft.com/intellisense/ie5 name=vs_targetSchema>
</HEAD>
<body MS_POSITIONING="GridLayout">
<form id=screenscrapping method=post runat="server">
<h1>Screen Scrape of www.yahoo.com</h1>
<p><asp:label id=lblHTMLOutput runat="server"></asp:label>
</form>
</body>
</HTML>

//end of html----------------------------------------------------
//code-behind part
//start code behind--------------------------------------------
using System;
using System.Net;
using System.Text;

namespace oops
{
    // Namespace and class name must match Inherits="oops.screenscrapping" in the page directive.
    public class screenscrapping : System.Web.UI.Page
    {
        // Put a Label web control with this id on the page.
        protected System.Web.UI.WebControls.Label lblHTMLOutput;

        // AutoEventWireup is "false", so Page_Load has to be attached by hand.
        protected override void OnInit(EventArgs e)
        {
            this.Load += new EventHandler(this.Page_Load);
            base.OnInit(e);
        }

        private void Page_Load(object sender, System.EventArgs e)
        {
            const string strUrl = "http://yahoo.com";

            // Download the raw bytes of the page, then release the WebClient.
            byte[] reqHTML;
            using (WebClient obj = new WebClient())
            {
                reqHTML = obj.DownloadData(strUrl);
            }

            // Decode the bytes as UTF-8 text (this assumes the page is served as UTF-8)
            // and write the markup into the label so the browser renders it.
            UTF8Encoding objUTF8 = new UTF8Encoding();
            lblHTMLOutput.Text = objUTF8.GetString(reqHTML);
        }
    }
}
//end of code-behind-------------------------------------------------------------------
//end of code==================================================================

1 comment:

Mahesh said...

Can you tell us something about UTF8Encoding, i.e. what parameters it takes? And why can't we use HtmlDecode instead of this?