INTRODUCTION
A web crawler forms an integral part of any search engine. The basic task of a crawler is to
fetch pages, parse them to get more URLs, and then fetch these URLs to get even more
URLs. In this process, the crawler can also log these pages or perform several other operations
on the fetched pages according to the requirements of the search engine. Most of these auxiliary
tasks are orthogonal to the design of the crawler itself.
The explosive growth of the web has rendered the simple task of crawling the web
non-trivial. With this rapid increase in the search space, crawling the web is becoming more
difficult day by day. But all is not lost: newer computational models are being introduced to
make resource-intensive tasks more manageable.
The price of computing is decreasing monotonically. It has now become very economical to
use several cheap computation units in a distributed fashion to achieve high throughput. The
challenge in using such a distributed model is to distribute the computation tasks
efficiently, avoiding the overheads of synchronization and of maintaining
consistency.
In this project, the architecture of a scalable, distributed web crawler has been proposed
and implemented. It has been designed to make use of cheap resources and tries to remove
some of the bottlenecks of present crawlers in a novel way. For the sake of simplicity and focus,
we worked only on the crawling part of the crawler, logging only the URLs. Other functions
can be easily integrated into the design.
1.2. OBJECTIVES OF THE PROJECT
The objective of the project is to allow the user to store a website and its
links on his system and analyze the result. The project will also help in finding the broken
links in any website. Our main objectives during the development of the project were:
Besides catering to these capabilities, our design also includes a probabilistic hybrid search
model. This is done using a probabilistic hybrid of stack and queue ADTs (Abstract Data
Types) for maintaining the pending URL lists. Details of the probabilistic hybrid model are
presented later in the project. This distributed crawler is a peer-to-peer distributed
crawler, with no central entity.
Network throughput
Processing capabilities
Database capabilities
Storage capabilities.
A database capability bottleneck is avoided by dividing the URL space into disjoint sets,
each of which is handled by a separate crawler. Each crawler parses and logs only the
URLs that lie in its URL space subset, and forwards the rest of the URLs to the corresponding
crawler entity. Each crawler has prior knowledge of the lookup table relating each
URL subset to the [IP:PORT] combination identifying the corresponding crawler threads.
CHAPTER 2
LITERATURE REVIEW
The crawler system consists of a number of crawler entities, which run on distributed sites
and interact in a peer-to-peer fashion. Each crawler entity knows its own URL subset,
as well as the mapping from each URL subset to the network address of the corresponding peer
crawler entity. Whenever a crawler entity encounters a URL from a different URL subset, it
forwards the URL to the appropriate peer crawler entity based on the URL subset to crawler
entity lookup. Each crawler entity maintains its own database, which stores only the URLs from
the URL subset assigned to that entity. The databases are disjoint and can be combined offline
when the crawling task is complete.
CRAWLER ENTITY
Each crawler entity consists of several crawler threads, a URL handling thread, a URL
packet dispatcher thread and a URL packet receiver thread. The URL set assigned to each
crawler entity is further divided into subsets, one for each crawler thread. Each crawler
thread has its own pending URL list. Each thread picks up an element from its pending URL
list, generates an HTTP fetch request, gets the page, parses this page to extract
any URLs in it and finally puts them in the job pending queue of the URL handling thread.
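The parsing step of a crawler thread can be illustrated with a minimal sketch. The class name LinkExtractor and the regular expression are assumptions made for this illustration; the report does not state how the original system parses pages, and a production crawler would more likely use a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not a class of the original system.
class LinkExtractor {
    // Matches href="..." or href='...' regardless of letter case.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the absolute http/https links found in the page, without fragments.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String link = m.group(1).trim();
            int fragment = link.indexOf('#');
            if (fragment >= 0) {
                link = link.substring(0, fragment); // "#section" does not name a new page
            }
            if (link.startsWith("http://") || link.startsWith("https://")) {
                links.add(link);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">A</a>"
                + "<A HREF='https://example.org/b#top'>B</A>"
                + "<a href=\"/relative\">C</a>";
        // Relative links are skipped in this sketch; resolving them needs the page's base URL.
        System.out.println(extractLinks(page));
    }
}
```

Each extracted link would then be placed on the job queue of the URL handling thread.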
During initialization, the URL handling thread reads the hash to [IP:PORT] mapping. It also has
a job queue. This thread gets a URL from the job queue and checks whether the URL belongs to
the URL set corresponding to the crawler entity. It does so based on the last few bits of the
hash of the domain name in the URL, in conjunction with the hash to [IP:PORT] mapping.
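A minimal sketch of this ownership test is given below. The class name UrlPartitioner, the choice of MD5 as the hash, and the restriction to a power-of-two number of peers (at most 256, since only the last byte of the digest is used) are all assumptions for illustration; the report does not name the hash function used.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical sketch of the URL-subset lookup; names and hash choice are assumptions.
class UrlPartitioner {
    private final String[] peers; // index = URL subset id, value = "IP:PORT"
    private final int mask;       // selects the last few bits of the hash

    UrlPartitioner(String[] peers) {
        if (peers.length > 256 || Integer.bitCount(peers.length) != 1) {
            throw new IllegalArgumentException("peer count must be a power of two, at most 256");
        }
        this.peers = peers.clone();
        this.mask = peers.length - 1;
    }

    // Extracts the lower-cased host name, so all pages of one domain map to one subset.
    static String domainOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("no host in URL: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Subset id = last few bits of the hash of the domain name.
    int subsetOf(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(domainOf(url).getBytes(StandardCharsets.UTF_8));
            return (digest[digest.length - 1] & 0xFF) & mask;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Address of the crawler entity responsible for this URL.
    String peerFor(String url) {
        return peers[subsetOf(url)];
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(new String[] {
                "10.0.0.1:9000", "10.0.0.2:9000", "10.0.0.3:9000", "10.0.0.4:9000"});
        System.out.println(p.peerFor("http://example.com/index.html"));
    }
}
```

A handling thread would compare peerFor(url) with its own address: on a match it keeps the URL, otherwise it hands the URL to the dispatcher.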
If the URL belongs to another entity, the thread puts the URL on the dispatcher queue and gets a
new URL from its job queue. If the URL belongs to its own set, it first checks the URL-seen
cache; if the cache test fails, it queries the URL database to check whether the URL has been
seen and, if not, puts the URL in the URL database. It then puts the new URL into the URL
pending list of one of the crawler threads.
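The two-level URL-seen test can be sketched as follows. This is an illustration under stated assumptions, not the original code: the class name UrlSeenTest is hypothetical, an in-memory HashSet stands in for the MySQL URL database, and an exact LRU cache built on LinkedHashMap replaces the approximate LRU cache of the actual system.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: cache first, database second.
class UrlSeenTest {
    private final Map<String, Boolean> cache;
    private final Set<String> database = new HashSet<>(); // stand-in for the URL database
    int databaseQueries = 0;                               // counts cache misses

    UrlSeenTest(final int cacheCapacity) {
        // Access-ordered LinkedHashMap that evicts the least recently used entry.
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > cacheCapacity;
            }
        };
    }

    // Returns true if the URL is new (and records it), false if it was seen before.
    boolean markIfNew(String url) {
        if (cache.get(url) != null) {
            return false;                  // cache hit: no database access needed
        }
        databaseQueries++;                 // cache miss: fall back to the database
        boolean isNew = database.add(url); // add() returns false if already stored
        cache.put(url, Boolean.TRUE);
        return isNew;
    }

    public static void main(String[] args) {
        UrlSeenTest seen = new UrlSeenTest(1000);
        System.out.println(seen.markIfNew("http://example.com/"));
        System.out.println(seen.markIfNew("http://example.com/"));
    }
}
```

Only URLs for which markIfNew returns true would be appended to a crawler thread's pending list.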
URLs are assigned to a crawler thread based on domain names. Each domain name is
serviced by only one thread; hence only one connection is maintained with any given
server. This ensures that the crawler does not overload a slow server.
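A minimal sketch of the per-domain thread assignment is shown below. The class name ThreadAssigner and the use of CRC32 are assumptions for illustration; CRC32 is chosen here only because it is assumed to differ from the hash that selects the crawler entity.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.CRC32;

// Hypothetical sketch: maps every domain name to exactly one crawler thread.
class ThreadAssigner {
    private final int threadCount;

    ThreadAssigner(int threadCount) {
        if (threadCount <= 0) {
            throw new IllegalArgumentException("threadCount must be positive");
        }
        this.threadCount = threadCount;
    }

    // The same domain always yields the same index, so one thread owns each server.
    int threadFor(String domain) {
        CRC32 crc = new CRC32();
        crc.update(domain.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % threadCount); // getValue() is non-negative
    }

    public static void main(String[] args) {
        ThreadAssigner assigner = new ThreadAssigner(8);
        System.out.println(assigner.threadFor("example.com"));
        System.out.println(assigner.threadFor("example.org"));
    }
}
```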
A different hash is used for distributing jobs among the crawler threads than for
determining the URL subset. The objective is to isolate the two operations so that
there is no correlation between the crawler entity a URL maps to and the thread that is
assigned to it, thus balancing the load evenly among the threads. The decision to divide the
URL space on the basis of domain names was based on the observation that a lot of pages on the
web tend to have links to pages in the same domain. Hence, if all URLs with a particular domain
name lie in the same URL subset, these URLs do not need to be forwarded to other
crawler entities. Thus this scheme provides an effective strategy to divide the crawl task
between the different peer-to-peer nodes of this distributed system. We validate this argument
in our experiments described in Section 7.
The URL dispatcher thread communicates the URLs to the corresponding crawler entity. A URL
receiver thread collects the URLs received from other crawler entities, i.e. those communicated
via the dispatcher threads of those crawler entities, and puts
them on the job queue of the URL handling thread.
THE IMPLEMENTATION
The system was implemented on the Java platform for portability reasons. MySQL was used for
the URL database. Even though Java is less efficient than languages that are
compiled to native machine code, and none of the team members were proficient with it,
we selected Java for this prototype. The reasons behind this decision were to keep the
software architecture modular, to make the system portable, and to deal with the complexity of
such a system. In retrospect this turned out to be a good decision, as we might not have been
able to complete this project in time had we implemented it in another language
such as C.
The comprehensive libraries provided with Java allowed us to concentrate our efforts on the
design of the system and the software architecture. A Java class was written for each of the
various components of the system (i.e. the different kinds of threads, database, synchronized
job queues, caches etc.).
First we wrote generic classes for the various infrastructure components of the system, like
synchronized job queues and caches. The LRUCache class implements an approximate LRU
cache based on a hash table with overlapping buckets. The JobQueue class
implements a generic synchronized job queue with an option for a probabilistic hybrid of the
stack and queue ADTs.
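One way such a job queue might look is sketched below. The class name HybridJobQueue, the stackProbability parameter, and the rule "remove from the tail with probability p, otherwise from the head" are assumptions for illustration; this section of the report does not specify how the original JobQueue mixes the two disciplines.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Hypothetical sketch of a synchronized job queue mixing stack and queue behaviour.
class HybridJobQueue<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final double stackProbability; // 0.0 = pure queue (FIFO), 1.0 = pure stack (LIFO)
    private final Random random;

    HybridJobQueue(double stackProbability, Random random) {
        if (stackProbability < 0.0 || stackProbability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.stackProbability = stackProbability;
        this.random = random;
    }

    synchronized void put(T item) {
        items.addLast(item);
        notifyAll(); // wake up consumers blocked in take()
    }

    // Blocks until an item is available.
    synchronized T take() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        return removeOne();
    }

    // Non-blocking variant: returns null when the queue is empty.
    synchronized T poll() {
        return items.isEmpty() ? null : removeOne();
    }

    synchronized int size() {
        return items.size();
    }

    // Newest item with probability stackProbability, otherwise the oldest item.
    private T removeOne() {
        return random.nextDouble() < stackProbability ? items.removeLast() : items.removeFirst();
    }

    public static void main(String[] args) {
        HybridJobQueue<String> q = new HybridJobQueue<>(0.3, new Random());
        q.put("http://example.com/1");
        q.put("http://example.com/2");
        q.put("http://example.com/3");
        System.out.println(q.poll());
    }
}
```

In this sketch, a probability of 0 gives breadth-first crawling, a probability of 1 gives depth-first crawling, and values in between interleave the two.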
The main Crawler class performs the initialization by reading the configuration files,
spawning the various threads accordingly and initializing the various job queues. It then
behaves as the Handler Thread. A class named CrawlerThread performs the operation of the
Crawler Thread. This thread simply gets a URL from its job queue and messages the URLlist class
with this URL. The URLlist class then spawns a new thread that fetches the page, parses it for
URL links and returns the list of these URLs back to the CrawlerThread.
In Java the URL fetch operation is not guaranteed to return, and in the case of a malicious web
server the whole thread can hang, waiting for the operation to complete. This is the
reason why the URLlist class spawns a new thread every time to fetch a URL. The thread is
given a certain time-out; hence if the URL fetch operation is not completed in time,
the thread stops after the time-out and normal operation is resumed. Spawning a new thread to
fetch each page does put an extra overhead on the operation but is essential for the robustness
of the system.
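The time-out mechanism can be sketched with the standard java.util.concurrent classes. This is an illustration under assumptions, not the original URLlist code: the class name TimedFetch and the fallback parameter are hypothetical, and the fetch is represented by an arbitrary Callable so that the sketch runs without network access.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run a fetch on its own thread and give up after a time-out.
class TimedFetch {
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // A fresh single-thread executor per call means one new thread per fetch.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fetch");
            t.setDaemon(true); // a hung fetch must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);               // interrupt the stuck fetch
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                   // the fetch itself failed
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // The slow task stands in for a server that never answers.
        String result = TimedFetch.<String>runWithTimeout(() -> {
            Thread.sleep(5000);
            return "page content";
        }, 200, "timed out");
        System.out.println(result);
    }
}
```

Since Java 5, URLConnection also offers setConnectTimeout and setReadTimeout, which bound the individual network operations; the separate thread additionally bounds the whole fetch-and-parse step.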
The Sender and Receiver classes implement the Sender and Receiver threads respectively.
The Receiver class opens a UDP socket at a predetermined port and waits for any packet. The
Sender class transmits the URLs via UDP packets to the appropriate remote node. Besides the
classes that form the system architecture described before, we added a Probe Thread to the
system and a Measurement class.
The relevant classes report the appropriate measurements to the Measurement class, and the
Probe Thread messages the Measurement class to output the measurements at configurable
periodic time intervals.
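The exchange of URLs between the Sender and Receiver threads can be sketched with the standard DatagramSocket API. The class name UdpUrlExchange, the one-URL-per-packet format and the 2048-byte buffer are assumptions for illustration; the report does not describe the packet layout of the original system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of sending and receiving one URL per UDP packet.
class UdpUrlExchange {
    // Sender side: wrap the URL in a datagram and send it to the peer.
    static void send(String url, InetAddress host, int port) throws IOException {
        byte[] payload = url.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length, host, port));
        }
    }

    // Receiver side: wait up to timeoutMillis for one packet and decode the URL.
    static String receive(DatagramSocket socket, int timeoutMillis) throws IOException {
        byte[] buffer = new byte[2048];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        socket.setSoTimeout(timeoutMillis);
        socket.receive(packet);
        return new String(packet.getData(), packet.getOffset(), packet.getLength(),
                StandardCharsets.UTF_8);
    }

    // Sends a URL to a receiver bound to a free loopback port and returns what arrived.
    static String roundTrip(String url) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback)) {
            send(url, loopback, receiver.getLocalPort());
            return receive(receiver, 2000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("http://example.com/page.html"));
    }
}
```

UDP gives no delivery guarantee, so a design along these lines accepts that a forwarded URL may occasionally be lost.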
In this project a group of computers is used to implement the distributed crawler. Every
node in the group has a maximum capacity for storing a number of sites. When
crawling a site, the user selects the IP address of the target machine and a shared
location. On clicking the search button, the content is downloaded to the remote
machine. The user also has the choice of saving the files to the local drive if the
remote computer is not available.
CHAPTER 3
SYSTEM ANALYSIS
Information Retrieval is the area of computer science concerned with retrieving information
about a subject from a collection of data objects. This is not the same as Data Retrieval,
which in the context of documents consists mainly in determining which documents of a
collection contain the keywords of a user query. Information Retrieval deals with satisfying a
user need. Although there was an important body of Information Retrieval techniques
published before the invention of the World Wide Web, there are unique characteristics of the
Web that made them unsuitable or insufficient.
The low cost of publishing in the "open Web" is a key part of its success, but it implies that
searching for information on the Web will always be inherently more difficult than searching
for information in traditional, closed repositories.
The typical design of search engines is a "cascade", in which a Web crawler creates a
collection which is indexed and searched. Most of the designs of search engines consider the
Web crawler as just a first stage in Web search, with little feedback from the ranking
algorithms to the crawling process. This is a cascade model, in which operations are executed
in strict order: first crawling, then indexing, and then searching. An aim of this approach is to
provide the crawler with access to all the information about the collection to guide the
crawling process effectively. This can be taken one step further, as there are tools available
for dealing with all the possible interactions between the modules of a search engine,
Existing documentation, forms, file and records
Research and site visits
Observation of the work environment
Questionnaires
Interviews and group work sessions.
Of the above-mentioned techniques, the following were used for requirements determination
in the project Web Crawler using Distributed Links.
The above documents provide information about the forms and the reports to be built
and the type of information to be stored.
3.2.2 Research and Site Visits: This is also a fact-finding technique; it means studying
the application and the problem area. In this project many industries were visited
to find answers to some common questions, such as:
Performance evaluation of various algorithms.
3.2.3 Questionnaires: It is a document prepared for a special purpose that allows the
analysts to collect information and opinions from a number of respondents.
3.2.4 Personal Interviews: There are always two roles in a personal interview. The
analyst is the interviewer, who is responsible for organizing and conducting the
interview. The other role is the interviewee, who is the end user, the manager or the
decision maker. The interviewee is asked a number of questions by the interviewer.
In this project, interviews of the company head, department heads and employees were
conducted to ascertain their expectations of the system.
Feasibility Study is an important part of the Preliminary Investigation because only feasible
projects go to development stages. A very basic feasibility study for the current project is
given below:
3.3.1 Technical Feasibility: Technical feasibility asks whether the work can be done with the
current equipment and software technology and, if new technology is required, how
likely it is that it can be developed.
In the case of this project, Windows XP/2000 is fully supported, but Windows 98 and
lower versions are not. The front-end and back-end tools for the
development of this project are also available. In this project Swing and Servlets have
been used as the front end, while MySQL is used as the back end. Both are
easily available.
Thus it can be concluded that the project is technically feasible.
3.3.2 Economic Feasibility: It deals with the economic impact of the system on the
environment in which it is used, i.e., the benefits of creating the system.
In the case of this project, the system will save the precious time spent recording the same
data again and again. The software is also designed to reduce the time and cost of the
calculation of critical data. The security provided by the software is an additional
benefit.
Thus it can be concluded that the project is economically feasible.
3.3.3 Operational Feasibility: It deals with the user friendliness of the system, i.e., will
the system be used if it is developed and implemented? Or will there be resistance
from the users?
In the case of this project, care has been taken to make the system highly user friendly so
that a person having only a little knowledge of English can handle it. In addition,
online as well as special help programs, which help in training the user, are also built.
Thus the project is operationally feasible.
3.3.4 Legal Feasibility: This type of feasibility evaluates whether our project breaks any
law or not. According to the analysis, this project does not break any laws. So, it is
legally feasible.
CHAPTER 4
SOFTWARE SPECIFICATION
SOFTWARES USED
There were many technologies available for the development of the project: for example, for
front-end development, Visual Basic 6, PowerBuilder, X-Windows, Visual Basic .NET,
Oracle Developer 2000, VC++ and JBuilder, and for the back end, Oracle, Ingres, Sybase,
SQL Plus, MySQL etc. Among these technologies, Swing and Servlets were selected as the
front-end tools and MySQL as the back end, for the following reasons.
Swing and Servlets are Java technologies developed by Sun Microsystems: Swing is a
toolkit for building graphical user interfaces, and Servlets are server-side
components for building web applications. Java is a powerful, object-oriented
programming language for developing sophisticated applications very quickly. Apart
from the primitive types, all items in Java are objects.
Swing text components can render basic HTML, and servlets generate HTML pages from
Java code, which allows the user to develop websites efficiently and effectively.
Apart from this, Java is platform independent and can run on any server that provides
a Java runtime.
Servlets can also serve AJAX requests, which enable the user to partially refresh
the web pages. The programmer can do this with the help of the predefined servlet classes.
Thus Java enables the programmer to build efficient websites.
Servlet-generated pages can use HTML, CSS and JavaScript, and Java provides a set of
predefined classes in the form of JDBC that can be used to access and update databases.
MySQL is one of the most widely used back-end tools for developing application software. It is
gaining popularity for the following reasons.
MySQL provides the following advantages for both clients and servers:
Client Advantages:
Easy to use.
Supports multiple hardware platforms.
Supports multiple software applications
Familiar to the user
Server Advantages:
Reliable
Concurrent
Sophisticated locking
Fault tolerant
That is why MySQL was selected as the back-end tool.
Apart from the above-mentioned reasons, our relevant experience with Swing, Servlets
and MySQL Server led us to select them as the front-end and back-end tools for
developing the project.
CHAPTER 5
SYSTEM SPECIFICATION
The Web Crawler project requires the following hardware and software for its successful implementation.
HARDWARE
The Web Crawler System project requires the following hardware for its successful
implementation.
SOFTWARE
CHAPTER 6
PROJECT DESCRIPTION
Based on the System Analysis described in the last few pages, a complete Software Requirement
Specification can be prepared, which is described below:
6.1 INTRODUCTION
Purpose: The purpose of the software is to provide system support to the users in
storing the web pages of a website by performing a series of crawl operations up to a
given level. These pages will be stored in a folder, and the user can reference these pages
for further study.
Scope: The software would be of great importance for a company. Although the
software is specially designed for companies, it could also be used by
any organization or institute to provide offline study of web pages.
Benefits: The project will automatically navigate through the pages, generate records,
save a lot of bandwidth, and allow offline study of web pages.
Product Description: The product is named Web Crawler. The system is going to be
developed using technologies such as Servlets, AWT, Swing and MySQL.
Product Functioning: The client will be able to store frequently visited web pages on
his local hard disk. The raw data is then verified, and finally a set of operations is
performed. For example, for the user database a new user can be added, an existing user
can be removed, or the password can be changed.
Functions of the Project: There are six major functions of the software:
a) User Verification
b) Upload Raw Data
c) Validate Data
d) Use Validate Data
e) Take Input From The User
f) Save Data Again
Users of the product: There will be five major users of the software:
d) Printer
e) Color Display Monitor
6.4. APPENDICES
CHAPTER 7
PROJECT DESCRIPTION
[Figure: diagram with the labels "Request", "Response" and "Web Sites"]
a. USERS DETAILS
b. LOGIN DETAILS
c. WEB DETAILS
d. PAGES INFO
7.2. ER DIAGRAM:
[Figure: ER diagram. Recoverable labels: User_id, Password, Name, Web Name, URL, Page Info,
Pages, Page ID, Location, address, Computer; relationships: Manages, Selects.]
7.3. DATA FLOW DIAGRAM
7.4. DATABASE DESIGN
A database is a collection of related tables, and it is the heart of any software because it
stores the most critical part, the data about the system. So proper planning needs to be done
to ensure the design of an effective database. An effective database design includes:
Normalized Tables
Data Dictionary
Constraints
NOTE:
CHAPTER 8
SNAPSHOTS
SNAPSHOT 1:
SNAPSHOT 2:
SNAPSHOT 3:
SNAPSHOT 4:
SNAPSHOT 5:
SNAPSHOT 6:
SNAPSHOT 7:
CHAPTER 9
CODING
CODING OF CRAWLER.JAVA
package coding;
import javax.swing.JOptionPane;
import java.sql.*;
setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);
addWindowListener(new java.awt.event.WindowAdapter() {
public void windowOpened(java.awt.event.WindowEvent evt) {
formWindowOpened(evt);
}
});
jButton2.setText("Exit");
jLabel4.setForeground(new java.awt.Color(255, 0, 0));
jLabel4.setText("*");
jLabel5.setForeground(new java.awt.Color(255, 0, 0));
jLabel5.setText("*");
jLabel6.setFont(new java.awt.Font("Tahoma", 1, 24)); // NOI18N
jLabel6.setForeground(new java.awt.Color(255, 0, 51));
jLabel6.setText("Distributed Web Crawler");
javax.swing.GroupLayout layout = new javax.swing.GroupLayout(getContentPane());
getContentPane().setLayout(layout);
layout.setHorizontalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(javax.swing.GroupLayout.Alignment.TRAILING,
layout.createSequentialGroup()
.addContainerGap(134, Short.MAX_VALUE)
.addComponent(jButton1, javax.swing.GroupLayout.PREFERRED_SIZE, 90,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addComponent(jButton2, javax.swing.GroupLayout.PREFERRED_SIZE, 90,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(145, 145, 145))
.addGroup(javax.swing.GroupLayout.Alignment.TRAILING,
layout.createSequentialGroup()
.addContainerGap(185, Short.MAX_VALUE)
.addComponent(jLabel1)
.addGap(189, 189, 189))
.addGroup(layout.createSequentialGroup()
.addGap(79, 79, 79)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel2)
.addComponent(jLabel3))
.addGap(57, 57, 57)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING,
false)
.addComponent(jPasswordField1)
.addComponent(jTextField1,
javax.swing.GroupLayout.PREFERRED_SIZE, 164,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel5, javax.swing.GroupLayout.PREFERRED_SIZE,
18, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel4)))
.addComponent(jLabel6))
.addContainerGap(82, Short.MAX_VALUE))
);
layout.setVerticalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(javax.swing.GroupLayout.Alignment.TRAILING,
layout.createSequentialGroup()
.addContainerGap(29, Short.MAX_VALUE)
.addComponent(jLabel6)
.addGap(18, 18, 18)
.addComponent(jLabel1)
.addGap(18, 18, 18)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel2)
.addComponent(jTextField1, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel4))
.addGap(18, 18, 18)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.TRAILING)
.addComponent(jLabel3)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jPasswordField1,
javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel5)))
.addGap(28, 28, 28)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jButton1)
.addComponent(jButton2))
.addGap(25, 25, 25))
);
pack();
}// </editor-fold>
int flag = 0;
String str = "";
if (jTextField1.getText().equals("")) {
    flag = 1;
    jLabel3.setVisible(true);
    str = "Username";
}
// JPasswordField.getText() is deprecated; getPassword() returns the characters instead.
String password = new String(jPasswordField1.getPassword());
if (password.equals("")) {
    flag = 1;
    str = str + " Password";
    str = str.trim();
}
if (flag == 0) {
    try {
        DataBaseInfo data = new DataBaseInfo();
        // Parameterized query: user input is bound, never concatenated into the SQL text.
        PreparedStatement stmt = data.conn.prepareStatement(
                "select * from admin where username=? and password=?",
                ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_UPDATABLE);
        stmt.setString(1, jTextField1.getText());
        stmt.setString(2, password);
        ResultSet rs = stmt.executeQuery();
        if (rs.next()) {
            // Copy the matched row's columns into DataBaseInfo's static fields.
            DataBaseInfo.un = rs.getString(1);
            DataBaseInfo.pwd = rs.getString(2);
            DataBaseInfo.localadd = rs.getString(3);
            DataBaseInfo.usedistributed = rs.getString(4);
            Manage_Computers obj = new Manage_Computers();
            obj.setVisible(true);
            this.dispose();
        } else {
            JOptionPane.showMessageDialog(this, "Invalid username or password !!");
        }
    } catch (Exception ex) {
        // Assumed handler: the catch clause for this try block is missing from the listing.
        java.util.logging.Logger.getLogger(admin_login.class.getName())
                .log(java.util.logging.Level.SEVERE, null, ex);
    }
} else {
    str = str + " can't be empty";
    JOptionPane.showMessageDialog(this, str);
}
java.util.logging.Logger.getLogger(admin_login.class.getName()).log(java.util.logging.Level
.SEVERE, null, ex);
} catch (InstantiationException ex) {
java.util.logging.Logger.getLogger(admin_login.class.getName()).log(java.util.logging.Level
.SEVERE, null, ex);
} catch (IllegalAccessException ex) {
java.util.logging.Logger.getLogger(admin_login.class.getName()).log(java.util.logging.Level
.SEVERE, null, ex);
} catch (javax.swing.UnsupportedLookAndFeelException ex) {
java.util.logging.Logger.getLogger(admin_login.class.getName()).log(java.util.logging.Level
.SEVERE, null, ex);
}
//</editor-fold>
CODING OF CRAWLER.JAVA
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
/*
* Manage_Nodes.java
*
* Created on Feb 11, 2015, 11:49:21 AM
*/
package coding;
import javax.swing.JOptionPane;
import java.sql.*;
/**
*
* @author DSOFT
*/
public class Manage_Computers extends javax.swing.JFrame {
jButton1.setText("Save Information");
jButton1.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton1ActionPerformed(evt);
}
});
jCheckBox1.setText("Confirm Save");
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jLabel9)
.addComponent(jLabel4)
.addComponent(jLabel3)
.addComponent(jButton1, javax.swing.GroupLayout.PREFERRED_SIZE, 188,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addComponent(jCheckBox1)
.addContainerGap())
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addComponent(jTextField3,
javax.swing.GroupLayout.PREFERRED_SIZE, 201,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addContainerGap())
.addGroup(jPanel1Layout.createSequentialGroup()
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jTextField1,
javax.swing.GroupLayout.DEFAULT_SIZE, 619, Short.MAX_VALUE)
.addComponent(jTextField2,
javax.swing.GroupLayout.Alignment.TRAILING,
javax.swing.GroupLayout.DEFAULT_SIZE, 619, Short.MAX_VALUE))
.addGap(110, 110, 110)))))
);
jPanel1Layout.setVerticalGroup(
jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addGap(43, 43, 43)
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING, false)
.addGroup(jPanel1Layout.createSequentialGroup()
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel3)
.addComponent(jTextField1,
javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)
.addComponent(jTextField2, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(jPanel1Layout.createSequentialGroup()
.addGap(45, 45, 45)
.addComponent(jLabel4)))
.addGap(18, 18, 18)
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jTextField3, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel9))
.addGap(29, 29, 29)
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jCheckBox1)
.addComponent(jButton1))
.addContainerGap(193, Short.MAX_VALUE))
);
},
new String [] {
}
));
jTable1.setAutoResizeMode(javax.swing.JTable.AUTO_RESIZE_ALL_COLUMNS);
jTable1.setRowHeight(25);
jScrollPane1.setViewportView(jTable1);
jButton3.setText("Show Computers");
jButton3.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton3ActionPerformed(evt);
}
});
.addGroup(jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jScrollPane1, javax.swing.GroupLayout.PREFERRED_SIZE,
948, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jButton3, javax.swing.GroupLayout.PREFERRED_SIZE, 151,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addContainerGap(21, Short.MAX_VALUE))
);
jPanel2Layout.setVerticalGroup(
jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel2Layout.createSequentialGroup()
.addGap(23, 23, 23)
.addComponent(jButton3)
.addGap(18, 18, 18)
.addComponent(jScrollPane1, javax.swing.GroupLayout.PREFERRED_SIZE, 288,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addContainerGap(35, Short.MAX_VALUE))
);
jButton4.setText("Delete Computer");
jButton4.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton4ActionPerformed(evt);
}
});
.addGroup(jPanel6Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel6Layout.createSequentialGroup()
.addComponent(jCheckBox2)
.addContainerGap())
.addGroup(jPanel6Layout.createSequentialGroup()
.addComponent(jLabel10)
.addGap(57, 57, 57)
.addComponent(jTextField4, javax.swing.GroupLayout.PREFERRED_SIZE,
201, javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)
.addComponent(jButton4)
.addGap(452, 452, 452))))
);
jPanel6Layout.setVerticalGroup(
jPanel6Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel6Layout.createSequentialGroup()
.addGap(43, 43, 43)
.addGroup(jPanel6Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jTextField4, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jButton4)
.addComponent(jLabel10))
.addGap(22, 22, 22)
.addComponent(jCheckBox2)
.addContainerGap(276, Short.MAX_VALUE))
);
jButton5.setText("Save Details");
jButton5.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jButton5ActionPerformed(evt);
}
});
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING, false)
.addGroup(jPanel7Layout.createSequentialGroup()
.addGap(96, 96, 96)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jCheckBox3)
.addComponent(jLabel12)
.addComponent(jLabel13)
.addComponent(jLabel15))
.addGap(43, 43, 43))
.addGroup(javax.swing.GroupLayout.Alignment.TRAILING,
jPanel7Layout.createSequentialGroup()
.addContainerGap(javax.swing.GroupLayout.DEFAULT_SIZE,
Short.MAX_VALUE)
.addComponent(jLabel11)
.addGap(63, 63, 63)))
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel7Layout.createSequentialGroup()
.addComponent(jTextField6, javax.swing.GroupLayout.PREFERRED_SIZE,
201, javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addComponent(jLabel14))
.addComponent(jTextField5, javax.swing.GroupLayout.PREFERRED_SIZE,
201, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jButton5)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.TRAILING, false)
.addComponent(jPasswordField1,
javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jTextField7,
javax.swing.GroupLayout.Alignment.LEADING,
javax.swing.GroupLayout.DEFAULT_SIZE, 201, Short.MAX_VALUE)))
.addGap(440, 440, 440))
);
jPanel7Layout.setVerticalGroup(
jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel7Layout.createSequentialGroup()
.addGap(43, 43, 43)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jTextField5, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel11))
.addGap(18, 18, 18)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jTextField6, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jLabel12)
.addComponent(jLabel14))
.addGap(18, 18, 18)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel13)
.addComponent(jTextField7, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(18, 18, 18)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel15)
.addComponent(jPasswordField1,
javax.swing.GroupLayout.PREFERRED_SIZE, javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED, 120,
Short.MAX_VALUE)
.addGroup(jPanel7Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jCheckBox3)
.addComponent(jButton5))
.addGap(64, 64, 64))
);
jMenu1.setText("Admin Panel");
jMenu1.add(jSeparator1);
jMenuItem1.setText("Show Panel");
jMenuItem1.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenuItem1ActionPerformed(evt);
}
});
jMenu1.add(jMenuItem1);
jMenuBar1.add(jMenu1);
jMenu3.setText("Web Crawler");
jMenu3.add(jSeparator2);
jMenuItem7.setText("Load Crawler");
jMenuItem7.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenuItem7ActionPerformed(evt);
}
});
jMenu3.add(jMenuItem7);
jMenuBar1.add(jMenu3);
jMenu5.setText("Search Websites");
jMenu5.add(jSeparator3);
jMenuBar1.add(jMenu5);
jMenu6.setText("Logout");
jMenu6.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenu6ActionPerformed(evt);
}
});
jMenu6.add(jSeparator4);
jMenuBar1.add(jMenu6);
jMenu2.setText("Exit");
jMenu2.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenu2ActionPerformed(evt);
}
});
jMenuBar1.add(jMenu2);
setJMenuBar(jMenuBar1);
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel6)
.addComponent(jLabel1))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addComponent(jTabbedPane1, javax.swing.GroupLayout.DEFAULT_SIZE, 415,
Short.MAX_VALUE)
.addContainerGap())
);
pack();
}// </editor-fold>
try {
if (jCheckBox1.isSelected()) {
DataBaseInfo db=new DataBaseInfo();
PreparedStatement stmt =db.conn.prepareStatement("insert into nodesinfo values(?,?,?,0)");
stmt.setString(1, jTextField1.getText());
stmt.setString(2, jTextField2.getText());
stmt.setString(3, jTextField3.getText());
stmt.executeUpdate();
JOptionPane.showMessageDialog(this, "Node successfully added to network");
} else {
JOptionPane.showMessageDialog(this, "Please confirm node entry");
}
} catch (Exception e) {
JOptionPane.showMessageDialog(this, e.getMessage());
}
}
try {
ResultSet rs = stmt.executeQuery();
ResultSetMetaData rdata = rs.getMetaData();
String[] str = {"Node IP Address", "Shared Path", "Maximum Limit", "Used Limit"};
int n =DataBaseInfo.returnColumn(rs);
int col = rdata.getColumnCount();
rs.beforeFirst();
int an = 0;
while (rs.next()) {
for (int j = 1; j <= col; j++) {
data[an][j - 1] = rs.getString(j);
}
an++;
}
jTable1.setModel(new javax.swing.table.DefaultTableModel(
data, str));
try {
try {
if(jCheckBox3.isSelected())
{
DataBaseInfo db=new DataBaseInfo();
int n=stmt.executeUpdate();
if(n==1)
{
JOptionPane.showMessageDialog(this,"Admin information successfully updated !!");
DataBaseInfo.un=jTextField7.getText();
DataBaseInfo.pwd=new String(jPasswordField1.getPassword()); // getText() is deprecated for JPasswordField
DataBaseInfo.localadd=jTextField5.getText();
DataBaseInfo.usedistributed=jTextField6.getText();
}
else
{
JOptionPane.showMessageDialog(this,"Node not found !!");
}
}
else
{
JOptionPane.showMessageDialog(this,"Please confirm record update !!");
}
} catch (Exception ex) {
JOptionPane.showMessageDialog(this, ex);
}
}
/**
* @param args the command line arguments
*/
public static void main(String args[]) {
/* Set the Nimbus look and feel */
//<editor-fold defaultstate="collapsed" desc=" Look and feel setting code (optional) ">
/* If Nimbus (introduced in Java SE 6) is not available, stay with the default look and
feel.
* For details see
http://download.oracle.com/javase/tutorial/uiswing/lookandfeel/plaf.html
*/
try {
for (javax.swing.UIManager.LookAndFeelInfo info :
javax.swing.UIManager.getInstalledLookAndFeels()) {
if ("Nimbus".equals(info.getName())) {
javax.swing.UIManager.setLookAndFeel(info.getClassName());
break;
}
}
} catch (ClassNotFoundException ex) {
java.util.logging.Logger.getLogger(Manage_Computers.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (InstantiationException ex) {
java.util.logging.Logger.getLogger(Manage_Computers.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (IllegalAccessException ex) {
java.util.logging.Logger.getLogger(Manage_Computers.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (javax.swing.UnsupportedLookAndFeelException ex) {
java.util.logging.Logger.getLogger(Manage_Computers.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
}
//</editor-fold>
CODING OF CRAWLER.JAVA
package coding;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.sql.ResultSetMetaData;
import javax.swing.JOptionPane;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
/**
*
* @author DSOFT
*/
public class mywebcrawler extends javax.swing.JFrame {
setDefaultCloseOperation(javax.swing.WindowConstants.DISPOSE_ON_CLOSE);
setTitle("Web Crawler");
addWindowListener(new java.awt.event.WindowAdapter() {
public void windowOpened(java.awt.event.WindowEvent evt) {
formWindowOpened(evt);
}
});
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jCheckBox1)
.addGroup(jPanel1Layout.createSequentialGroup()
.addComponent(jLabel1)
.addGap(72, 72, 72)
.addComponent(jComboBox1,
javax.swing.GroupLayout.PREFERRED_SIZE, 264,
javax.swing.GroupLayout.PREFERRED_SIZE)))
.addContainerGap(22, Short.MAX_VALUE))
);
jPanel1Layout.setVerticalGroup(
jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addContainerGap()
.addGroup(jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jLabel1)
.addComponent(jComboBox1, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED, 9,
Short.MAX_VALUE)
.addComponent(jCheckBox1))
);
jMenu1.setText("Admin Panel");
jMenu1.add(jSeparator1);
jMenuItem1.setText("Show Panel");
jMenuItem1.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenuItem1ActionPerformed(evt);
}
});
jMenu1.add(jMenuItem1);
jMenuBar1.add(jMenu1);
jMenu3.setText("Web Crawler");
jMenu3.add(jSeparator2);
jMenuItem7.setText("Load Crawler");
jMenuItem7.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenuItem7ActionPerformed(evt);
}
});
jMenu3.add(jMenuItem7);
jMenuBar1.add(jMenu3);
jMenu5.setText("Search Websites");
jMenu5.add(jSeparator3);
jMenuBar1.add(jMenu5);
jMenu6.setText("Logout");
jMenu6.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenu6ActionPerformed(evt);
}
});
jMenu6.add(jSeparator4);
jMenuBar1.add(jMenu6);
jMenu2.setText("Exit");
jMenu2.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
jMenu2ActionPerformed(evt);
}
});
jMenuBar1.add(jMenu2);
setJMenuBar(jMenuBar1);
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(230, 230, 230)
.addComponent(jLabel6))
.addGroup(layout.createSequentialGroup()
.addGap(196, 196, 196)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.TRAILING)
.addComponent(jPanel1, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jTextField1,
javax.swing.GroupLayout.PREFERRED_SIZE, 530,
javax.swing.GroupLayout.PREFERRED_SIZE)))
.addGroup(layout.createSequentialGroup()
.addGap(269, 269, 269)
.addComponent(jButton1, javax.swing.GroupLayout.PREFERRED_SIZE,
187, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(18, 18, 18)
.addComponent(jButton2, javax.swing.GroupLayout.PREFERRED_SIZE,
187, javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(layout.createSequentialGroup()
.addGap(130, 130, 130)
.addComponent(jProgressBar1,
javax.swing.GroupLayout.PREFERRED_SIZE, 676,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(layout.createSequentialGroup()
.addGap(152, 152, 152)
.addComponent(jLabel2, javax.swing.GroupLayout.PREFERRED_SIZE, 603,
javax.swing.GroupLayout.PREFERRED_SIZE)))
.addContainerGap(164, Short.MAX_VALUE))
);
layout.setVerticalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addGap(46, 46, 46)
.addComponent(jLabel6)
.addGap(18, 18, 18)
.addComponent(jTextField1, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(27, 27, 27)
.addComponent(jPanel1, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(39, 39, 39)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(jButton1, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(jButton2, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGap(22, 22, 22)
.addComponent(jLabel2)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addComponent(jProgressBar1, javax.swing.GroupLayout.PREFERRED_SIZE, 23,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(101, 101, 101))
);
pack();
}// </editor-fold>
int count = 0;
if (DataBaseInfo.usedistributed.toUpperCase().equals("Y")) {
IP = jComboBox1.getSelectedItem().toString();
path = "\\\\" + jComboBox1.getSelectedItem().toString() + "\\" +
loc[jComboBox1.getSelectedIndex()];
} else {
path = DataBaseInfo.localadd;
if (jCheckBox1.isSelected()) {
try {
if (rs.next()) {
path = rs.getString(1) + "\\" + rs.getString(2);
IP = rs.getString(1);
}
}
abc obj = new abc(); // where abc is the name of thread class
obj.start();
}
String[] loc;
try {
rs.beforeFirst();
int i = 0;
while (rs.next()) {
jComboBox1.addItem(rs.getString(1));
loc[i] = rs.getString(2);
i++;
}
}
this.setLocationRelativeTo(null);
}
String sql = "select * from crawledpages where URL = '" + URL + "'";
/*ResultSet rs = db.runSql(sql);
if (rs.next()) {
} else {*/
//store the URL to database to avoid parsing again
if (DataBaseInfo.usedistributed.toUpperCase().equals("Y")) {
IP = jComboBox1.getSelectedItem().toString();
JOptionPane.showMessageDialog(this,jComboBox1.getSelectedIndex());
path = "\\\\" + jComboBox1.getSelectedItem().toString() + "\\" +
loc[jComboBox1.getSelectedIndex()];
} else {
path = DataBaseInfo.localadd;
}
if (jCheckBox1.isSelected()) {
try {
if (rs1.next()) {
path = "\\\\" + rs1.getString(1) + "\\" + rs1.getString(2);
IP = rs1.getString(1);
}
path = path + "\\" +
jTextField1.getText().substring(jTextField1.getText().indexOf("//") + 2);
String webpath = jTextField1.getText().substring(jTextField1.getText().indexOf("//")
+ 2);
sql = "INSERT INTO Crawledpages values(?,?,?,?)";
PreparedStatement stmt = db.conn.prepareStatement(sql,
Statement.RETURN_GENERATED_KEYS);
stmt.setString(1, URL);
stmt.setString(2, webpath);
if (DataBaseInfo.usedistributed.toUpperCase().equals("Y")) {
stmt.setString(3, IP);
} else {
stmt.setString(3, "Local");
}
stmt.setString(4, path);
stmt.execute();
jProgressBar1.setValue(count);
//path="\\\\10.0.1.42\\anilsir\\"+jTextField1.getText().substring(jTextField1.getText().indexOf("//")+2);
if (DataBaseInfo.usedistributed.toUpperCase().equals("Y")) {
IP = jComboBox1.getSelectedItem().toString();
path = "\\\\" + jComboBox1.getSelectedItem().toString() + "\\" +
loc[jComboBox1.getSelectedIndex()];
} else {
path = DataBaseInfo.localadd;
}
if (jCheckBox1.isSelected()) {
try {
PreparedStatement stmt1 = db.conn.prepareStatement("select NodeIP, sharedfolderlocation from nodesinfo"
+ " where (no_of_sites-availablesites)= (select distinct max(no_of_sites-availablesites) from nodesinfo)",
ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_UPDATABLE);
ResultSet rs1 = stmt1.executeQuery();
if (rs1.next()) {
path = "\\\\" + rs1.getString(1) + "\\" + rs1.getString(2);
IP = rs1.getString(1);
}
}
path = path + "\\" +
jTextField1.getText().substring(jTextField1.getText().indexOf("//") + 2);
File file = new File(path);
if (!file.exists()) {
file.mkdirs();
}
BufferedWriter writer = new BufferedWriter(new FileWriter(path + "\\" +
link.attr("abs:href").substring(link.attr("abs:href").lastIndexOf('/') + 1)));
String inputLine;
while ((inputLine = in.readLine()) != null) {
try {
writer.write(inputLine);
writer.newLine(); // readLine() strips the line terminator, so restore it in the saved file
} catch (IOException e) {
e.printStackTrace();
JOptionPane.showMessageDialog(this, e);
return;
}
}
in.close();
writer.close();
count++;
}
JOptionPane.showMessageDialog(null, count + " record found and saved to database !!");
@Override
public void run() {
try {
// TODO add your handling code here:
// db.runSql2("TRUNCATE Record;");
processPage(jTextField1.getText());
} catch (Exception ex) {
JOptionPane.showMessageDialog(null, String.valueOf(ex.getMessage())); // getMessage() may be null
} finally {
JOptionPane.showMessageDialog(null, count + " record found and saved to database !!\n\nFiles saved to " + path);
jProgressBar1.setVisible(false);
}
}
}
/**
* @param args the command line arguments
*/
public static void main(String args[]) {
/* Set the Nimbus look and feel */
//<editor-fold defaultstate="collapsed" desc=" Look and feel setting code (optional) ">
/* If Nimbus (introduced in Java SE 6) is not available, stay with the default look and
feel.
* For details see
http://download.oracle.com/javase/tutorial/uiswing/lookandfeel/plaf.html
*/
try {
for (javax.swing.UIManager.LookAndFeelInfo info :
javax.swing.UIManager.getInstalledLookAndFeels()) {
if ("System".equals(info.getName())) {
javax.swing.UIManager.setLookAndFeel(info.getClassName());
break;
}
}
} catch (ClassNotFoundException ex) {
java.util.logging.Logger.getLogger(mywebcrawler.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (InstantiationException ex) {
java.util.logging.Logger.getLogger(mywebcrawler.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (IllegalAccessException ex) {
java.util.logging.Logger.getLogger(mywebcrawler.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
} catch (javax.swing.UnsupportedLookAndFeelException ex) {
java.util.logging.Logger.getLogger(mywebcrawler.class.getName()).log(java.util.logging.Level.SEVERE, null, ex);
}
//</editor-fold>
CHAPTER 10
TESTING
10.1 TESTING
Testing is a process, which reveals errors in the program. It is the major quality measure
employed during software development. During testing, the program is executed with a set of
conditions known as test cases and the output is evaluated to determine whether the program
is performing as expected.
To make sure that the system is free of errors, different levels of testing are applied at
different phases of software development.
Unit Testing is done on individual modules as they are completed and become
executable. It is confined only to the designer's requirements.
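As an illustration of unit testing at the module level, the sketch below isolates the path-derivation logic that the crawler applies to a URL (the text after "//" and the text after the last '/') into two small helper methods and checks them with plain assertions. The class and method names (UrlPaths, sitePath, fileName, check) and the "index.html" fallback are assumptions made for this example and are not part of the project source; a real test suite would more likely use a framework such as JUnit.

```java
// Hypothetical helper class; the names below are illustrative, not from the project source.
class UrlPaths {

    /** Returns everything after the "//" of the scheme, e.g. "example.com/docs". */
    static String sitePath(String url) {
        int idx = url.indexOf("//");
        return idx < 0 ? url : url.substring(idx + 2);
    }

    /** Returns the text after the last '/'; falls back to "index.html" (an assumed default). */
    static String fileName(String url) {
        String name = url.substring(url.lastIndexOf('/') + 1);
        return name.isEmpty() ? "index.html" : name;
    }

    /** Minimal assertion helper so the checks run without the -ea JVM flag. */
    static void check(boolean condition, String testName) {
        if (!condition) {
            throw new AssertionError("Test failed: " + testName);
        }
    }

    public static void main(String[] args) {
        check(sitePath("http://example.com/docs").equals("example.com/docs"), "sitePath strips scheme");
        check(sitePath("example.com").equals("example.com"), "sitePath without scheme");
        check(fileName("http://example.com/a/page.html").equals("page.html"), "fileName of page");
        check(fileName("http://example.com/a/").equals("index.html"), "fileName of directory URL");
        System.out.println("All unit tests passed");
    }
}
```

Keeping such logic in small static methods, separate from the Swing event handlers, is what makes it testable in isolation.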
Each module can be tested using the following two strategies:
In black box testing, test cases are generated as input conditions that fully exercise all
functional requirements of the program. This testing has been used to find errors in
the following categories:
In white box testing, test cases are generated from the logic of each module by drawing flow
graphs of that module, and logical decisions are tested on all the cases.
It has been used to generate the test cases in the following cases:
System testing involves in-house testing of the entire system before delivery to the user. Its aim is to
satisfy the user that the system meets all requirements of the client's specifications.
Integration testing ensures that software and subsystems work together as a whole. It tests
the interface of all the modules to make sure that the modules behave properly when
integrated together.
Acceptance testing is pre-delivery testing in which the entire system is tested at the client's site on
real-world data to find errors.
10.6. VALIDATION
The system has been tested and implemented successfully, which ensures that all the
requirements listed in the software requirement specification are fulfilled. In
case of erroneous input, corresponding error messages are displayed.
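A minimal sketch of how erroneous input could be caught before a crawl starts is given below. It validates the seed URL typed by the user and returns a message suitable for display in a dialog. The class name SeedUrlValidator, the method validate and the message texts are assumptions for illustration; the project source does not show its actual validation code.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical validator; names and messages are illustrative only.
class SeedUrlValidator {

    /** Returns null when the input is acceptable, otherwise an error message for the user. */
    static String validate(String input) {
        if (input == null || input.trim().isEmpty()) {
            return "Please enter a URL";
        }
        try {
            URI uri = new URI(input.trim());
            String scheme = uri.getScheme();
            boolean isHttp = scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
            if (!isHttp) {
                return "URL must start with http:// or https://";
            }
            if (uri.getHost() == null) {
                return "URL has no host name";
            }
            return null; // input is valid
        } catch (URISyntaxException e) {
            return "Malformed URL: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        String[] samples = {"http://example.com/index.html", "", "example.com", "ftp://example.com"};
        for (String sample : samples) {
            String error = validate(sample);
            System.out.println("'" + sample + "' -> " + (error == null ? "OK" : error));
        }
    }
}
```

In the GUI, the returned message could be passed to JOptionPane.showMessageDialog and the crawl started only when the result is null.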
COMPILING TEST
It was a good idea to do our stress testing early on, because it gave us time to fix some of the
unexpected deadlocks and stability problems that only occurred when components were
exposed to very high transaction volumes.
EXECUTION TEST
The program was loaded and executed successfully, and no execution errors were
observed.
Test Cases:
CHAPTER 11
SYSTEM IMPLEMENTATION
a. First check that the system meets the minimum requirements. If it does,
install Microsoft Windows XP SP2 or above on the system on which the program is going
to be used.
c. Next, set up the NetBeans IDE. The system is now ready for the software to be
installed.
d. Insert the project CD in the CD-ROM drive. Open NetBeans, click on the Open
menu and select the project.
e. After that, build and run the software by selecting Run from the context menu or by
pressing Alt+F6.
f. Select one notepad file with the list of numbers and perform the required sorting
comparison.
At first we need the PC and the minimum hardware and software configuration as specified
earlier. After installation any user can make use of the software.
CHAPTER 12
CONCLUSION AND FUTURE SCOPE OF STUDY
The biggest contribution of this project is the concept of distributing crawl tasks based on
disjoint subsets of the URL crawl space. We also presented a scalable, multi-threaded,
peer-to-peer distributed architecture for a WebCrawler based on this concept. Another
interesting contribution of the project is the proposed probabilistic hybrid of Depth-First
Traversal and Breadth-First Traversal, although we were unable to study its advantages or
disadvantages during this project. This traversal strategy achieves a hybrid
of the two traditional strategies without any extra book-keeping and is very easy to
implement. We also implemented the complete WebCrawler, which demonstrates all of the
above concepts.
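The probabilistic hybrid of depth-first and breadth-first traversal can be sketched with a single double-ended queue: each time a URL is requested, the frontier behaves like a stack with some probability and like a queue otherwise. The class name HybridFrontier, the parameter stackProbability and the use of java.util.Random are assumptions for this example; the report does not give the exact selection rule used in the project.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;
import java.util.Random;

// Hypothetical frontier; a sketch of the stack/queue hybrid, not the project's implementation.
class HybridFrontier {
    private final Deque<String> urls = new ArrayDeque<>();
    private final Random random;
    private final double stackProbability; // 0.0 = pure queue (BFS), 1.0 = pure stack (DFS)

    HybridFrontier(double stackProbability, Random random) {
        if (stackProbability < 0.0 || stackProbability > 1.0) {
            throw new IllegalArgumentException("stackProbability must be in [0, 1]");
        }
        this.stackProbability = stackProbability;
        this.random = random;
    }

    /** Newly discovered URLs are always appended at the tail. */
    void add(String url) {
        urls.addLast(url);
    }

    /** Takes the newest URL with probability stackProbability, otherwise the oldest one. */
    String next() {
        if (urls.isEmpty()) {
            throw new NoSuchElementException("frontier is empty");
        }
        return random.nextDouble() < stackProbability ? urls.pollLast() : urls.pollFirst();
    }

    boolean isEmpty() {
        return urls.isEmpty();
    }

    public static void main(String[] args) {
        HybridFrontier frontier = new HybridFrontier(0.5, new Random());
        frontier.add("http://example.com/a");
        frontier.add("http://example.com/b");
        frontier.add("http://example.com/c");
        while (!frontier.isEmpty()) {
            System.out.println(frontier.next()); // order varies from run to run
        }
    }
}
```

Because the only state is the deque itself, no book-keeping beyond a plain queue or stack is needed, which matches the claim made above.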
FUTURE SCOPE:
Future extensions of the project include implementing a DNS cache in the Crawler Thread
and studying the performance of the hybrid traversal strategy at various cache-hit rates. A
number of issues need to be dealt with to make this system usable in the real world. The crawler
needs to conform to the Robots Exclusion Protocol. We also need to handle partial failure: although at
present the failure of one node does not stop the other components, it would be desirable
for another node to take over the tasks of the node that failed. Dynamic reconfiguration
and dynamic load balancing would also be desirable.
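One possible way to let surviving nodes take over the work of a failed node, while keeping the URL space split into disjoint subsets, is rendezvous (highest-random-weight) hashing on the host name. This is a technique substituted here for illustration, not the assignment scheme used in the project; the class name CrawlPartitioner, the method ownerOf and the use of IP strings as node identifiers are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical partitioner using rendezvous hashing; not taken from the project source.
class CrawlPartitioner {

    /** Returns the live node with the highest score for this host. */
    static String ownerOf(String host, List<String> liveNodes) {
        if (liveNodes.isEmpty()) {
            throw new IllegalStateException("no live nodes");
        }
        String best = null;
        long bestScore = Long.MIN_VALUE;
        for (String node : liveNodes) {
            long s = score(host, node);
            if (best == null || s > bestScore) { // on a tie the earlier node in the list wins
                best = node;
                bestScore = s;
            }
        }
        return best;
    }

    /** Deterministic score for a (host, node) pair: String.hashCode spread over 64 bits. */
    private static long score(String host, String node) {
        long h = (host + "|" + node).hashCode();
        h *= 0x9E3779B97F4A7C15L;
        return h ^ (h >>> 32);
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        List<String> afterFailure = Arrays.asList("10.0.0.1", "10.0.0.2"); // 10.0.0.3 has failed
        for (String host : new String[] {"example.com", "example.org", "example.net"}) {
            System.out.println(host + ": " + ownerOf(host, all)
                    + " -> " + ownerOf(host, afterFailure));
        }
    }
}
```

Each host maps to exactly one node, so the subsets stay disjoint. When a node is removed from the live list, only the hosts it owned are reassigned; all other hosts keep their owner, which limits the amount of crawl state that has to move.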
CHAPTER 13
REFERENCES
1. Allan Heydon and Marc Najork, "Mercator: A Scalable, Extensible Web Crawler",
Compaq Systems Research Center, 130 Lytton Ave, Palo Alto, CA 94301, 2001.