Web Spider
WebSuck goes through the web pages you specify and looks for links
and data files. The links are followed, and the data files are listed
in the format of your choice (plain text or GetRight format).
The program is best suited to web galleries with large numbers of
photos. Plenty of options are available to adapt WebSuck to most
site layouts.
All command-line options can also be accessed from the GUI (new in
version 0.6b).
Just run WebSuck with one parameter: -gui
WebSuck does NOT download the files it finds. You need to pass the
resulting list to a file downloader, such as WebGet.
(You can also use the program's output with software like wget
on UNIX or GetRight on Windows.)

Screenshot: WebSuck in GUI mode
Screenshot: WebSuck in console mode
GUI Mode
Start WebSuck with -gui as one of the parameters, and the Swing
GUI will be displayed!
java -jar WebSuck.jar -gui <other arguments>
Command Line Arguments
java -jar WebSuck.jar <options> <url1> <url2> <url3> ... <urlN>
At least one URL must be specified, either on the command line or in
a URL file (see the -i option below).
Example: java -jar WebSuck.jar -l 3 -noexternal http://www.ake.nu/
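To make the depth limit in the example above more concrete, here is a minimal Java sketch. It is not WebSuck's actual source. It assumes the level counting described for -l (the start page is level 1) and treats a link as external when its host differs from the host of the page it was found on. The class name, the method names, the in-memory link map that stands in for real pages, and the example.com link are all illustrative assumptions.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: pages are simulated by a map from URL to the links found on it.
class DepthLimitSketch {

    // Returns the pages that would be parsed, in the order they were discovered.
    static List<String> crawl(Map<String, List<String>> links, String start,
                              int depthLimit, boolean noExternal) {
        Map<String, Integer> level = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        level.put(start, 1);                       // the start page has level 1
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            int next = level.get(page) + 1;        // level of pages linked from this one
            if (next > depthLimit) {
                continue;                          // following these links would exceed the limit
            }
            for (String link : links.getOrDefault(page, List.of())) {
                if (level.containsKey(link)) {
                    continue;                      // already scheduled or parsed
                }
                if (noExternal && !sameHost(page, link)) {
                    continue;                      // skip links that leave the parent's host
                }
                level.put(link, next);
                queue.add(link);
            }
        }
        return new ArrayList<>(level.keySet());
    }

    // Assumption: a plain, case-insensitive host comparison; the real check may be looser.
    static boolean sameHost(String parent, String link) {
        String a = URI.create(parent).getHost();
        String b = URI.create(link).getHost();
        return a != null && a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "http://www.ake.nu/", List.of("http://www.ake.nu/a.html", "http://example.com/x.html"),
                "http://www.ake.nu/a.html", List.of("http://www.ake.nu/b.html"),
                "http://www.ake.nu/b.html", List.of("http://www.ake.nu/c.html"));
        // Mirrors "-l 3 -noexternal" from the example command line.
        System.out.println(crawl(links, "http://www.ake.nu/", 3, true));
    }
}
```

With a depth limit of 3 and external links skipped, the sketch visits the start page, a.html, and b.html. The page c.html would be level 4 and is never reached, and the example.com link is ignored.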
Command Line Options
| Switch | Function |
| -gui | Load the GUI. Can be combined with other parameters to change the 'default' start-up settings. See also -width and -height below. |
| -h, --help, or -? | Display a list of command-line options. |
| -o file | The WebSuck results are output into this file. |
| -l # | Limits the WebSuck to a depth of #. The start page has level 1, a page linked from the start page has level 2, and so on. |
| -getright x:\path\ | Makes WebSuck output the data list in GetRight format. The files will be downloaded into "x:\path\<hostpath>". |
| -onedir | Only applies when -getright is used. Saves all files in one directory. |
| -quiet or -verbose | Sets verbosity: -quiet displays only the resulting data files; -verbose displays more information than normal mode. |
| -noimg | Makes WebSuck ignore files found in IMG tags. |
| -option | Scans OPTION tags for links and data. Doesn't add to document depth! |
| -usebody | Scans the BODY tag for a background image. |
| -nooption | Don't follow links found in OPTION tags (combo boxes). |
| -noexternal | Makes WebSuck skip links pointing outside the parent document's host. |
| -datalast | Only adds data files found in documents at the depth limit (depth = depth limit). This is perfect for image galleries: you won't get all the thumbnails, just the big files. |
| -lastlinksonlydata | Links that lead to the maximum depth are followed only if they point to data files. |
| -imglinks | Only follows links that are clickable images. Also good for thumbnail galleries. |
| -u username | Sets the HTTP username used for non-external documents. |
| -p password | Sets the HTTP password used for non-external documents. |
| -i file | Reads URLs to parse from the specified text file. |
| -v file | Reads URLs to skip from the specified text file. Useful if you WebSuck a site in multiple parts. |
| -outv file | Outputs a list of the visited (parsed) URLs. Can be used to skip links that have already been followed when running a new WebSuck. |
| -nocount word | A link containing this text will not add to the depth of the WebSuck! Great for multi-part thumbnail galleries! |
| -parseext ext1,ext2 | Files with these extensions will be parsed as HTML files. |
| -dataext ext1,ext2 | Files with these extensions will be added to the download list. |
| -width # | Changes the default width of the GUI window, in pixels. |
| -height # | Changes the default height of the GUI window, in pixels. |
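The extension lists given to -parseext and -dataext decide how each link is treated. As a rough illustration only (this is not WebSuck's code), the Java sketch below splits the comma-separated lists, compares extensions case-insensitively, and treats references from IMG tags as data, in line with the History notes for 0.666b and 0.64b. The class and method names, the OTHER category, the sample file paths, and the sample extension lists are assumptions made for the example; WebSuck's real defaults are not listed on this page.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch only; names and tie-breaking rules are assumptions.
class ExtensionSketch {

    enum Kind { PARSE, DATA, OTHER }

    // Turns a list such as "ext1,ext2" into a set of lower-cased extensions.
    static Set<String> parseList(String csv) {
        return Arrays.stream(csv.split(","))
                .map(s -> s.trim().toLowerCase(Locale.ROOT))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toSet());
    }

    // Returns the lower-cased extension of the URL's path, or "" if it has none.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath();   // path only: no query string or fragment
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        return dot > slash ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
    }

    static Kind classify(String url, boolean fromImgTag,
                         Set<String> parseExt, Set<String> dataExt) {
        if (fromImgTag) {
            return Kind.DATA;                      // IMG references are always treated as data
        }
        String ext = extensionOf(url);
        if (dataExt.contains(ext)) {
            return Kind.DATA;                      // assumption: data wins if listed in both sets
        }
        if (parseExt.contains(ext)) {
            return Kind.PARSE;
        }
        return Kind.OTHER;                         // how WebSuck handles these is not covered here
    }

    public static void main(String[] args) {
        Set<String> parseExt = parseList("html,htm");   // hypothetical -parseext value
        Set<String> dataExt = parseList("jpg,zip");     // hypothetical -dataext value
        System.out.println(classify("http://www.ake.nu/pics/Photo.JPG", false, parseExt, dataExt));
        System.out.println(classify("http://www.ake.nu/index.html?x=1.zip", false, parseExt, dataExt));
        System.out.println(classify("http://www.ake.nu/thumb.cgi", true, parseExt, dataExt));
    }
}
```

The sketch compares only the path's extension, so a query string such as "?x=1.zip" does not turn an HTML page into a data file.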
History
| Version | Changes |
| 0.76b | Retries 5 times before aborting a download when the data stream has stalled. |
| 0.75b | Removed the requirement for "window.open" in JavaScript. Now takes the first available argument as the file. |
| 0.74b | Fixed all 'external link' checking so that it only matches the 'domain' part (*.domain.com). |
| 0.73b | Fixed deadlocks when downloading certain files (e.g. empty files). |
| 0.72b | Fixed commenting of the homepage address in the plain-text output file. |
| 0.71b | New options: -u <user> -p <pass> |
| 0.70b | New option: -lastlinksonlydata. |
| 0.69c | Added support for var='data' in tags, instead of just var="data" and var=data. (Thanks to Fernando Cassia for pointing this out) |
| 0.69b | Fixed the 'sticky' vertical scroll bar. Now it should (heh) work... New option: -usebody. Gets the document background image from the BODY tag. Added some default data and parse extensions. Probably something more... |
| 0.68b | Changed the vertical scroll bar of the display window. The scroll bar now sticks to the bottom if not moved, or stays in place if moved. You can also 'stick' the scroll bar again by moving it to the bottom position. |
| 0.67b | Added file dialogs to select the different files for input and output. Made a minor change to the HTML parser; it now works with unclosed tags. Added -width and -height options, which let you set the size of the GUI window. |
| 0.666b | You can now change the extension lists controlling which files are treated as parse files and which are treated as data files. Note: References from IMG tags are always treated as data! |
| 0.665.91b | Two options were missing in the GUI. The -nooption and -noimg options have been added to the GUI! (oops!) |
| 0.665.9b | Just added an icon for the GUI window. Nice, eh? :-) |
| 0.65b | GUI options didn't affect anything if they were already set from the command line. This has been fixed so GUI options always override the command-line options! |
| 0.64b | Fixed a bug: WebSuck only recognized extensions written in lower case. The GUI layout had some minor changes as well. |
| 0.63b | Changed the thread handling to comply with the new API. Also changed the text displayed when WebSuck is run without parameters. |
| 0.62b | GUI layout improved! Now you can also combine the -gui parameter with all other parameters, to change the start-up settings of WebSuck! The -gui parameter must be the first parameter on the command line. |
| 0.6b | GUI mode! Start WebSuck with -gui as the only parameter, and a Swing GUI will be displayed, allowing for fast and easy use! |
| 0.5b | Lost notes |
| 0.442b | New option: -onedir. By default, the GetRight format now places files in a directory structure like the URL. Using this option restores the old way of saving files, placing all files in one directory. Changes: The GetRight path can be entered with or without a trailing '\'. Fixes: The HTML parser now sees unterminated links. There were some errors with depth limits on data links. |
| 0.44b | New option: -nocount. Follows links with a certain text without adding to depth. Ideal for gallery pages where some galleries are divided into several files! |
| 0.43b | Now compiled with Java 2 v1.3.0. Some minor code changes. |
| 0.42b | Can now also follow some JavaScript links, like "window.open()". Can output a list of the visited pages, for use in another WebSuck. |
| 0.4b | First public release |