A web crawler (also known as a web spider or web robot) is a program or automated script that browses the internet looking for web pages to process.

Many applications, mostly search engines, crawl websites every day in order to find up-to-date data. Most web crawlers save a copy of each visited page so they can index it later. The rest crawl pages for narrower search purposes only, such as harvesting email addresses (for spam).

How does it work?

A crawler needs a starting point, which is a web address (a URL).

To browse the internet we use the HTTP network protocol, which lets us talk to web servers and download data from or upload data to them.

The crawler fetches this URL and then looks for hyperlinks (the A tag in HTML).

Then the crawler fetches those links and continues in the same way.

That is the basic idea. How we proceed from here depends entirely on the purpose of the software.

If we only want to grab emails, we search the text of each web page (including hyperlinks) for email addresses. This is the easiest type of crawler to develop.

Search engines are much more difficult to develop.

When building a search engine we need to take care of a few other things.

1. Size - Some websites are very large and contain many directories and files. Harvesting all of the data may take a lot of time.

2. Change Frequency - A website may change very often, even a few times a day. Pages can be added and deleted each day. We need to decide when to revisit each site and each page within a site.

3. How do we process the HTML output? If we build a search engine we want to understand the text rather than just treat it as plain text. We must tell the difference between a caption and a simple sentence. We must look for bold or italic text, font colors, font sizes, paragraphs and tables. This means we must know HTML very well, and we need to parse it first. What we need for this task is a tool called an "HTML to XML converter". One can be found on my website. You can find it in the resource box or look for it on the Noviway website: www.Noviway.com.

That's it for now. I hope you learned something.
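
The basic loop described in this article (start from a URL, download the page over HTTP, collect the links in its A tags, follow them, and search each page for email addresses) can be sketched in a few lines of Python. This is only a minimal illustration, not the author's software: the names crawl, fetch, LinkParser, EMAIL_RE and max_pages are made up for this example, the email pattern is deliberately simplified, and pages are assumed to be UTF-8 text.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen

# Simplified pattern (an assumption): real email syntax is more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkParser(HTMLParser):
    """Collects the href value of every A tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _ = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def fetch(url):
    """Download a page over HTTP and return its text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl from start_url; returns {url: set of emails}."""
    queue = deque([start_url])
    seen = {start_url}
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        # Search the raw page text, hyperlinks included, for addresses.
        results[url] = set(EMAIL_RE.findall(html))
        parser = LinkParser(url)
        parser.feed(html)
        for link in parser.links:
            # Follow only web links that have not been queued before.
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Calling crawl("http://www.example.com/") would visit up to max_pages pages reachable from that address. The fetch_page parameter exists so the download step can be swapped out, for example with a function that reads from a local dictionary of pages when trying the loop without network access.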
About the Author:
Eran Aharonovich
Software Programmer
Noviway - Smart Solutions
Web Crawler
HTML To XML Converter
Article Source: www.iSnare.com