OutWit Help Files

Frequently Asked Questions

General

What is OutWit Hub and when should I use it?

When you are looking for something on the Web, search engines give you lists of links to the answers. The purpose of OutWit Hub is to actually go retrieve the answers for you and save them on your disk as data files, Excel tables, lists of email addresses, collections of documents, images…

If your question has one simple answer, it will be at the top of Wikipedia or Google results and you don’t need OutWit for that. When you know, however, that it would take you 20, 50, 500 clicks to get what you want, then odds are you do need OutWit Hub:

The Hub is an all-in-one application for extracting and organizing data, images, and documents from online sources. It offers a wealth of data recognition, autonomous exploration, extraction, and export features to simplify Web research. OutWit Hub exists both as a Firefox Add-on and as a standalone application for Windows, Mac OS, and Linux.

OK, I have downloaded OutWit Hub and I am running it. Now what?

We have an open list of 1,728 first things you can do with the application but we believe the best first thing is to run the built-in tutorials from the Help menu (Help>Tutorials).

Automatic Exploration

I want OutWit Hub to browse through a series of result pages but the ‘Next in Series’ and ‘Browse’ buttons are disabled. How come?

When opening a Web page, OutWit analyzes the source code and tries to understand as much as possible about the page. The first thing it does is look for navigation links (next, previous…) and, when it finds some, the ‘Next in Series’ arrow and ‘Browse’ double arrows become active. If they are inactive, it is because OutWit did not find any additional pages. There are several workarounds that avoid clicking on all the links manually. Depending on the case, the best alternative solutions are:

- using the Dig function (with advanced settings in the pro version),
- generating the URLs to explore (see the example below),
- making a ‘self-navigating’ scraper with the #nextPage# directive,
- or grabbing the URLs you want to scrape, putting them in a directory of queries and using this directory for a new automatic exploration. (Note that for the latter, it is also possible to grab the links to the Catch in one macro and address the column of the Catch by the name you gave it in a second macro, by typing ‘catch/yourColumnName’ in the Start Page textbox.)
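
For instance, many sites include the page number directly in the URL of their result pages. In such cases, a hypothetical series of URLs to generate and send to a directory of queries could look like this:

http://www.example.com/results?page=1
http://www.example.com/results?page=2
http://www.example.com/results?page=3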

Some links are not working in the Standalone version of the Hub. What should I do?

These are links whose source code specifies target="_blank" (for example, <a href="…" target="_blank">). OutWit Hub cannot open separate popup windows, but you can open such links within the Hub. For this, check the “Open popup links in the application window” preference (Tools>Preferences>General).

Auto-Explore Functions and Fast Scraping are slower in the current version than in the previous. Why is that?

They are not, in fact. The program’s exploration functions work exactly the same way. It is possible, though, that your preference settings have changed during the upgrade. Temporization and pause settings should actually be more precise and reliable than in previous versions. You can fine-tune all this in Tools>Preferences>Time Settings. Another recent preference which may have an impact on the exploration speed is ‘Bypass Browser Cache’ in the ‘Advanced’ panel: not using the cache does slow the browsing down, so you may want to set it to ‘Never’. If, after this, you are still experiencing performance issues, consider disabling processes you may not need by right-clicking on ‘page’ in the left side bar.

The next page button functions correctly, but when I do a Browse to capture the information, the application only runs two pages and then stops. Why is that?

- Cause: the next page link is probably a JavaScript link, and it is probably the same in all pages, so the program thinks this URL has already been visited and stops the exploration.

- Solution: there is a preference (Tools>Preferences>General) just for this. Uncheck “Only visit pages once…”. Important: Do not forget to check it back afterwards, or your next Dig will probably last forever and bring back huge amounts of redundant data.

Data: Extracting, Importing, Exporting…

I would like to extract the details of all the products/events/companies in this site/directory/list of subsidiaries… Could you please advise me on how to do that?

Unfortunately, answering this is the purpose of the hundreds of features covered in the present Help, so it is difficult to do in one sentence, but the general principle is this:

Go through the standard extractors (documents, lists, tables, guess…) by clicking in the left side panel. Either one of them gives you the results you want (in which case it is just a matter of exporting the data), or you need to create a scraper for that site. In the second case, first go to one of the detail pages, build a scraper for that page in the ‘scrapers’ view, and test it on a few other pages. Then go to the list of results you need to grab and have OutWit browse through all the links and apply your new scraper. This can be done in two ways: either by actually going to each page (‘browse’ or ‘dig’, or a combination of both if you have the pro version) or by ‘Fast Scraping’ them (applying your scraper to selected URLs with right-click: Auto-Explore>Fast Scrape in any datasheet, or with ‘Fast Scrape’ in a macro).

How can I import lists of links (URLs) or other strings into OutWit Hub?

There are many different ways to do this. Here are a few:

- Put them into a text file (.txt or .csv) and open the file from the File menu. (Note that on some systems, the program may try to open .csv files with another application. In this case, just rename your file with the .txt extension.) You will find your URLs in the links view and the text in the text view.
- Drag them directly from another application to the page or queries view of the Hub.
- If they are in a local HTML file, simply open the file from the File menu and you will be able to process it with the Hub as any Web page.
- Copy the links from whatever application they are in (you can also copy HTML source code or simple text containing URLs), right-click in the page view of the Hub and choose Edit>Paste Links.

Once your links are in the Hub, you simply need to select them, right-click on one of them and select ‘Send to Queries’ to create a directory of URLs that you will then be able to use in any way you like (in a macro for instance, or doing an automatic exploration directly from the right-click menu).
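
As an illustration, a plain text file like the hypothetical urls.txt below, opened from the File menu, will list its URLs in the links view, ready to be selected and sent to a directory of queries:

http://www.example.com/companies/acme.html
http://www.example.com/companies/globex.html
http://www.example.com/companies/initech.html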

How can I import CSV or other tabulated data into OutWit Hub?

Simply open the file (.txt, .csv …) from the File menu. (Note that on some systems, the program may try to open .csv files with another application. In this case, just rename your file with the .txt extension.) If the original data was correctly tabulated, you should find the data well structured in the guess view. If the data was less structured, well, the Hub will do what it can.
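
As an illustration, a hypothetical file named products.txt containing the lines below should appear in the guess view as a three-column table (the exact field recognition may vary with the data):

name,price,city
Widget,9.90,Paris
Gadget,19.50,Lyon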

I have made a scraper which works fine on the page I want to scrape, but when I do a browse and set the ‘scraped’ view to collect the data, it grabs the data of the first page over and over again. What is happening?

You are probably trying to scrape information from AJAX pages, where the data is dynamically added to the page by JavaScript. You need to set the type of source to be used by your scraper to Dynamic. When you do, the source code of the page will be displayed on a pale yellow background. Note that you will probably have to adapt your scraper if it was created for the Original source, as the code may have changed slightly.

How can I convert a list of values into a String Generation Pattern?

If the values are in one of the Hub’s datasheets, just select them, right-click on one of them and select “Insert Rows…”. If they are in a file on your hard disk, simply import them into a directory of queries (see above) and do the same.

What is the maximum number of rows of data OutWit Hub can extract and export? After a certain number of rows, when exporting, I get a dialog telling me a script is unresponsive. What should I do?

In our tests, we have extracted and successfully exported up to 1.3 million rows (of two or three columns). Obviously, the limit varies a lot from system to system, depending on the platform, the RAM, etc. When exporting more than 50,000 or 100,000 rows, you may see such dialogs, sometimes several times in a row even after you click Continue. There is a checkbox to prevent the dialog from coming back. (Note that Excel XML export is always much more demanding than CSV or TXT.) Don’t forget that you can move your results to the Catch and save the Catch itself in a file if you need to reuse the contents or just for backup purposes (File menu). A Catch file can only be read again in OutWit Hub, but it is much faster to save than exporting the data.

The program doesn’t find all the email addresses in this Website. Why is that?

There are several ways to have OutWit look for emails in a site. The fastest is to select Fast-Search For emails>In Current Domain, either from the Navigation menu or from the popup menu you get when you right-click on the page. This method, however, doesn’t explore all pages in the site. It only looks for the most obvious pages that can be found (contacts, team, about us…). If you want to systematically explore all pages in a site, you will have to use the Dig function, within domain, at the depth level you wish.

Why doesn’t the program find contact information (phone, address…) for some of the email addresses?

First, of course, the info has to be present in the page. Then, if it is there, no technology allows for perfect semantic recognition. An address or a phone number can take so many different forms, depending on the country, on the way it is presented or on how words are abbreviated, that we can never expect to reach a 100% success rate.

Email address recognition is nearly exhaustive in OutWit; phone numbers are recognized rather well in general; physical addresses are more of a challenge: they are better recognized for the US, Canada, Australia and European countries than for the rest of the world. The program recognizes names in many cases. As for other fields, like the title, for instance, automatic recognition in unstructured data is too complex at this point and the results would not be reliable enough for us to include them unless they are clearly labeled. We are constantly improving our algorithms, so you should make sure to keep your application up-to-date.

I am observing the progress and I see that no new line is added for some pages when I am sure there is an email address or other info that should be found. Why is that?

This page (or one containing similar info) was probably visited before. Results are automatically deduplicated. This means that if an email address (or just a phone number or physical address) has already been found, the row containing this data will be updated (and no new row created) when a new occurrence is found.

User Interface

How do I make a hidden column visible in a datasheet?

In the top right corner of every datasheet in the application is a little icon depicting a table with its header: the Column Picker. If you click on this icon, a popup menu will allow you to hide or show the different columns of the datasheet. Only visible columns are moved to the Catch and exported by default (this behavior can be changed with a custom export layout).

What is the Ordinal ID?

The Ordinal column is hidden by default in all datasheets. Use the column picker (icon at the top right corner of any datasheet) to display it. The Ordinal ID is an index composed of three groups of digits separated by dots. The first number is the number of the page from which the data line was extracted (it can only be higher than 1 if the ‘empty’ checkbox is unchecked or if the datasheet is the result of a fast scrape). The second number is the position of the data block in the page (can only be more than 1 in ‘tables’, ‘lists’, ‘scraped’ and ‘news’ views). The last number is the position of the data line in the block (or in the page, if there is only one data block in the page).
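
For instance, an Ordinal of 3.2.15 would designate the 15th data line of the 2nd data block found in the 3rd page explored.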

Install

I cannot manage to enter my serial number in the Registration Dialog of OutWit Hub. The program keeps saying the key is invalid.

Your key was sent to you by email when you purchased the application. It is a series of letters and digits similar to this: 6YT3X-IU6TR-9V45E-AFS43-89U64. It must not be confused with the login password to your account on outwit.com, which was also sent to you by email (if you are missing one of these email messages, please check your spam folder).

If you are wondering whether the Hub you are using is a pro or a light version, you will find the answer in the window title. Up to now, we haven’t had a single case where a valid serial number would not work. You might be experiencing a very rare bug, but this seems very unlikely after several years. The key needs to be entered exactly as it is in the mail you received. So, either you are not typing it precisely right (in which case you should simply copy and paste the email address and the key from our original mail) or you are typing something completely different (the login to your outwit.com account, for instance?). If you have changed email addresses since you purchased your license, remember that the one to use is the one with which you originally placed your order.

I have installed OutWit Hub for Firefox (or Docs or Images) then reloaded Firefox but I don’t see the OutWit icon on my toolbar. What can I do?

Three possibilities:

1) You didn’t download the Firefox add-on but the standalone application. In this case, you just need to install the software and double-click on its icon, as you would for any other application.

2) You do have the add-on and the install worked but the icon is simply missing from the toolbar. In this case, select ‘OutWit’ in the ‘Tools’ menu, then select OutWit Hub (or the appropriate outfit) in the sub-menu. If you want to add the icon to your toolbar, right-click on the toolbar and select ‘Customize’ then drag and drop the OutWit icon onto it.

3) The add-on install failed. In this case, the most probable reason is that, even though you just downloaded the program, you do not have the latest version. The one you downloaded (probably from a third party) doesn’t work with the current version of Firefox. Download the latest version from outwit.com. Of course, every now and then, it may also be a real compatibility problem. So if the above doesn’t work or doesn’t apply, please create a support ticket on outwit.com and we’ll do our best to help.

How can I revert to OutWit Hub 3.0?

If you have upgraded to version 4 by mistake or have a problem with a feature and wish to revert to version 3, make sure your version (Hub and runner) is 4.0.4.35 or higher and type outwit:downgrade in the Hub’s address bar. (Please tell us if you believe you have discovered a problem in this version.)

Troubleshooting

On OutWit Hub For Firefox, I have been experiencing new issues recently: unresponsive scripts, timeouts, strange behaviors on pages that used to work fine… what can I do to revert to factory settings?

We are not aware of incompatibilities with other add-ons, but it can always happen: some of your Firefox preferences could also have been changed by another extension, or files may have been corrupted in your profile. You can try to create a blank profile and reinstall OutWit Hub (or other OutWit extensions) from outwit.com. This will bring you back to the initial state. Here is how to proceed on Windows:

http://kb.mozillazine.org/Creating_a_new_Firefox_profile_on_Windows

and on other platforms:

http://support.mozilla.com/kb/Managing+profiles

Can I create a new profile in OutWit Hub Standalone?

With the standalone version, the principle is almost exactly identical to the way it works in Firefox (see above paragraph).

Windows: click “Start” > “Run”, and type:
"C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe" -no-remote -ProfileManager

Macintosh: run the Terminal application and type:
/Applications/OutWit\ Hub.app/Contents/MacOS/outwit-hub -no-remote -ProfileManager

Linux: open a terminal and type:
[path to directory]/outwit-hub -no-remote -ProfileManager

If you need instructions to go further, refer to the profile manager instructions for Firefox:

http://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles

Where is my profile directory?

In OutWit Hub (Standalone or Firefox Add-on), if you type about:support in the address bar, you will get a page with important information about your system and configuration. In this page, you will find a button that will lead you to your profile directory. Among the files you will see there, the ones with .owc extensions are Catch files, and files ending with .owg are User Gear files (the User Gear is the database where all your automators are stored). You can back these files up or rename them if you plan to alter your profile.

Next in Series: Loads the next page in a series.

Active when OutWit finds a navigation link to the following page (i.e. if the current page is part of a series, like a result page for a query in a search engine).

Browse: Auto-browses through the pages of a series.

Active when OutWit finds a navigation link to the following page (i.e. if the current page is part of a series, like a result page for a query in a search engine). Right-clicking or holding down the Browse button opens a menu allowing you to limit the number of pages to explore. Escape or a second click of the button will stop the browse.

Dig: Automatically explores the links of the current page.

Active when OutWit finds links in the current page. Right-clicking or holding down the Dig button opens a menu allowing you to limit the exploration within or outside the current domain and to set the depth of the dig. Depth = 0 will browse through all the links of the page; Depth = 1 will also explore all the links of the pages visited. Escape or a second click of the button will stop the dig. (Only links matching the list of extensions set in the advanced preference panel are explored. Some link types are systematically filtered out from the exploration: log-out pages, feeds which cannot be opened by the browser, etc.)

Up to Site Home: Goes up to the home page of the current site.

Active when the current page is not the home page of a site. Goes up one level towards the top of the current site’s hierarchy.

Slideshow: Displays the images of the page as a slideshow.

Active when OutWit finds images in the current page. The slideshow can be viewed in full screen or in the page widget. If the current page is part of a series, the slideshow will go on as long as a next page is found.

Address Bar: for URLs, macros or search queries.

Here, you can type a URL to load, a query that will be forwarded to the preferred search engine, or a macro to execute.

The Standalone Application

OutWit Hub exists in two guises: a standalone application and a Firefox add-on.

Both versions are basically the same program and fulfill the same functions. There are, however, a few specificities corresponding to their nature. The ones worth noting are the following:

(If you wish to get to your OutWit files, please first read the Frequently Asked Questions, Troubleshooting section, for info on the Profile files in both the Standalone app. and the Firefox add-on.)

The standalone application can be launched from a terminal:

Windows: "C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe"
Mac OS: /Applications/OutWit\ Hub.app/Contents/MacOS/outwit-hub
Linux: run outwit-hub from the location where you unpacked the zip file.

When running the standalone application from a terminal, you can include the following command-line parameters:

-url "http://…" to load a URL after starting. Using quotes around the URL is safer, in case of special characters, especially on Windows.
-macro xxx to execute the macro corresponding to the Automator ID (AID) xxx in your profile (see the list of macros in the macro manager).

-quit-after to instruct the application to quit after executing the tasks of the command line.
-p to open the profile manager.
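
For instance, a hypothetical Windows command combining these parameters (the macro AID 12 is only an illustration; check your own IDs in the macro manager) could be:

"C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe" -url "http://www.example.com" -macro 12 -quit-after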

The standalone profile files with your automators and Catch are located by default in (replace XXX with your user name):

Windows: C:\Users\XXX\AppData\Roaming\OutWit\outwit-hub\Profiles\
Mac OS: /Users/XXX/Library/Application\ Support/OutWit/outwit-hub/Profiles
Linux: outwit-hub/Profiles.
Note: In OutWit Hub (Standalone or Firefox Add-on), if you type about:support in the address bar, you will get a page with important information about your system and configuration. In this page, you will find a button that will lead you to your profile directory. The file named User_Gear.owg contains your scrapers, macros, etc. and catch.owc contains the data you placed in your Catch. Your profile directory also contains backup folders where old versions of these files are stored. (In the Firefox Add-on version, your OutWit profile files are located within the Firefox profile.)

The Firefox Add-on allows the Hub to open new browser windows as new Firefox tabs or windows. The standalone version doesn’t have this capacity.

OutWit Hub’s Menus

The menus give access to the main features of the application.

The application menus, located at the top of the screen, are:

the File Menu
the Edit Menu
the View Menu
the Navigation Menu
the Tools Menu
the Help Menu
the Registration/Upgrade Menu

A contextual menu, the right-click popup Menu, can be used in all datasheets of the application.

The File Menu
Gives access to the file saving/loading and data export functions.

Available options may vary with the view and the license level of your product.

Open…
Opens the File Picker Dialog to select one or several files from the hard disk or a local resource. Some file types can be explored and processed by OutWit to recognize and extract content (html, htm, xhtml, xml, txt, csv, owc…). When a selected file can be processed by OutWit Hub, it will be opened directly in OutWit Hub; otherwise, it will be ignored (or, in the case of OutWit Hub for Firefox, it will be opened/processed by Firefox). When several files are selected in the Open Dialog, if some or all of them can be explored by OutWit, they will be successively browsed by the program. If one or several files are OutWit Automators or Catch files, they will be imported after confirmation by the user.
Notes about importing data:

You can open .html files of course, but also .txt, .sql, .csv… files of many different types and formats, and process them with the Hub. The guess view should do a good job recognizing the fields of tabulated files in most cases (if the file is not too exotic).
Putting a list of URLs in a .txt or .csv file and opening it in the Hub is one of the easiest ways to import links for automatic exploration and processing. They will appear in the links view, from which they can be grabbed, sent to a directory in the queries view…

Save Page As…
Same command as in any browser: saves the current page on the hard disk. The attached files and images will be saved in a folder named after the page, suffixed with “_files”.
Download Selected Files
Downloads all documents and images found in the selected rows and saves them to the current destination folder on your hard disk. (The same option can be found in the datasheet right-click menu.)
Download Selected Files in…
Downloads all documents and images found in the selected rows, opening the folder picker to let you decide where you want the files to be saved. (The same option can be found in the datasheet right-click menu.)
Load a Catch File…
Opens the File Picker to select a Catch file to open on the hard disk or a local resource.
Save Catch File as…
Saves the content of the Catch as an OutWit Catch file (.owc) to the hard disk or a local resource.
Export Catch as…
Exports the content of the Catch to a file on your hard disk, in one of the available formats (Excel, CSV, HTML, SQL).
Export Selection as…
Exports the selected data to a file on your hard disk, in one of the available formats (Excel, CSV, HTML, SQL). (The same option can be found in the datasheet right-click menu.)
Empty Catch
Deletes the contents of the Catch panel.
Manage User Gear
Allows you to export or import the User Gear database, which contains all your automators. This way, you can easily transfer your scrapers, macros… from one profile to another, or from the add-on to the standalone version.

The Edit Menu
Gives access to the application’s text and datasheet editing functions.

Available options may vary with the view and the license level of your product.

Editing Functions
The standard Cut, Copy, Paste, Duplicate and Delete functions apply to the selection. In a datasheet, they apply to rows.
Insert, delete, edit, copy and empty functions are available for cells. Columns can be inserted or deleted.
Insert Row
Inserts an empty row into the current datasheet, before the selected row.
Insert Rows
The Insert Rows function allows you to generate strings using the Query Generation Pattern format. It inserts the generated rows into the current datasheet, after the selected row.
Select All
Selects all rows of the datasheet.
Invert Selection
Deselects all selected rows of the datasheet and selects all rows that were not selected.
Select Similar
Selects all rows of the datasheet with content similar to that of the selected cell. The default threshold used for determining similarity is 40 (0 selecting only identical values and 100 selecting everything). Use the sub-menu items to increase or decrease the threshold and select more or fewer rows.
Select Identical
Selects all rows of the datasheet with content identical to that of the selected cell.
Select Different
Selects all rows of the datasheet with content different from that of the selected cell.

The View Menu

Gives access to the application’s display options.

Available options may vary with the view and the license level of your product.

Slideshow
Displays the images of the page as a slideshow.
Full Screen
Displays the page in full screen. In this mode, menus disappear. To exit the full screen mode, use the platform function key or press the escape key.
Show/Hide Catch
Displays or hides the Catch Panel at the bottom of the application interface.
Show/Hide Log
Displays or hides the Log Panel at the top of the application interface.
Show/Hide Info
Displays or hides the Info/Message Bar at the top of the application interface.
Switch View Mode
Rolls through the different display settings for the current view: Data only (spreadsheet display), Export Layout only (HTML, CSV…) or a split view with both.
Highlight Series of Links
When checked, the program will highlight links of the same group or level, to simplify automatic exploration.
Show Exploration Button
When checked, the program will display a button in the page with which you can display the main automatic exploration functions in a single click. (The exploration menu can also be displayed by right-clicking on the page.)
Windows
Lists and gives access to the windows currently open in Firefox.
Views
Lists and gives access to the Hub’s views.

The Navigation Menu
Gives access to the application’s Navigation options.

Available options may vary with the view and the license of your product.

Fast Search for Contacts and Auto-Explore Pages are also accessible using the right-click menu on the page or the Exploration Button.

Back
Goes back one page in the navigation history.
Forward
Goes forward one page in the navigation history.
Next in series
Loads the next page in a series (more info on the next page function.)

Active when OutWit finds a navigation link to the following page (i.e. if the current page is part of a series, like a result page for a query in a search engine).

Fast Search for Contacts
The program sends queries to the site(s) and searches for emails without loading the pages in the browser. Available options in this sub-menu vary with the current page and context. They include:

In Current Website
The program sends queries to the current site and searches for contacts without actually loading the pages in the browser. Not all pages are explored. OutWit Hub tries to locate the ones that are likely to include contact information.
In All Links
The program sends queries to the URLs found in the current page to search for email addresses and contact information.
In Selected Links
The program sends queries to the selected links, searching for email addresses and contact information.
In Highlighted Links
The program sends queries to the highlighted links, searching for email addresses and contact information. (Hover over the links to highlight series of links.)
In Linked Websites
The program browses through the pages of the current series of result pages (if any) and sends queries to the external URLs found (linking outside the current domain) to search for email addresses and contact information.

Auto-Explore Pages
The program actually visits and loads each page of a series or selection. Available options in this sub-menu vary with the current page and context.

Browse Selected Links
Auto-browses through the links that are selected in the current page.
Browse Highlighted Links
Auto-browses through the links that are highlighted in the current page. (Hover over the links to highlight series of links.)
Browse Series of Result Pages
Auto-browses through the pages of a series.

Active when OutWit finds a navigation link to a following page (i.e. if the current page is part of a series, like a result page for a query in a search engine). Escape or a second click on the Browse button will stop the auto-browse process. Right-clicking or clicking and holding down the Browse button shows a menu allowing you to choose the extent of the automatic browse to perform (2, 3, 5, 10 or all pages).
Dig / Browse & Dig Result Pages
Gives access to the Dig sub-menu: the Dig function is a systematic exploration of all links found in a page, in a whole site or in a series of result pages. In order not to visit hundreds of unwanted pages randomly, you can set a number of limitations. You can visit pages only if they are within the same domain as the current page, only outside the page domain, or you can visit any link found. You can also specify the depth of your exploration: depth 0 is the list of links found in the page, depth 1 also includes all the links found in each visited page, and depth 2 does the same one level below. In the Advanced Settings dialog, you can combine all these criteria and even set an additional filter with a string (or a regular expression) which must be present in the URL for the program to explore it. (Only links matching the list of extensions set in the advanced preference panel are explored in a Dig. Some link types are also systematically filtered out from the exploration: log-out pages, feeds which cannot be opened by the browser, etc.)

Reload the page
Reloads the current page.
Stop All Processes
Aborts current processes, like the loading of a page, the dig and browse functions, the execution of a macro, etc. In many instances, the escape key has the same effect. Only active when OutWit is browsing, digging, loading a page, etc.
Pause All Processes
Pausing complex processes and resuming them at a later time is not always possible. This function gives a simple solution by suspending all processing while displaying an alert and waiting for a click. Only active when OutWit is browsing, digging, loading a page, etc.
Bookmarks
Gives access to the bookmarks.
History
Gives access to the navigation history.
Workshop
Loads the ‘Workshop Page’, a blank page where you can paste and edit any textual content or data to be processed with OutWit Hub.

The Tools Menu
Gives access to additional tools and features.

Available options may vary with the view and the license of your product.

Reset All Views
Reverts the settings in the bottom panels of every view to their original values.
Clear History
Clears your browsing history. You can choose to erase everything, or specifically the history of pages you went to, your form-filling history, your cache or all your cookies.
Downloads
Opens the download panel.

Preferences
Opens the OutWit Hub’s user preference panel.
Apply Scraper
Applies the most pertinent applicable scraper to the current page.
Apply Macro
Applies a generic macro to the current page.
Error Console
Displays the error console with messages (blue), warnings (yellow) and errors (pink) that have occurred recently.

The Datasheet Right-Click Menu
In all datasheets, additional features can be accessed with a right click on the selected items.

Available options can vary with the view and the license level of your product.
Note that this menu has changed in versions 3.x and 4.x

Edit
Gives access to the Edit sub-menu, with the standard Editing functions and more.

Editing Functions
Cut, Copy and Paste functions are available for cell editing.
Copy Cell(s)
Copies the content of selected cells. Use it to get the contents of selected cells in a column as a list of values.
Edit Cell
Allows for inline editing of a cell’s content.
Replace in Cell(s)
Opens the replace dialog for replacements in the selected cells of the current column.
Replace All
Opens the replace dialog for replacements in the whole datasheet.
Rename Column
In data views, this option allows you to change the header of a dynamic column.
Empty Cell(s)
Empties the selected cells of the current column.
Duplicate
Duplicates the contents of selected rows and inserts the duplicates as new rows after the selection.

Insert
Gives access to the Insert and Split sub-menu, with cell/row/column insertion functions.

Insert Row
Inserts a new blank row after the selection.
Insert Rows
Gives access to the String Generation Panel and inserts the generated strings as new rows before the selection. This Insert Rows function allows you to generate strings using the Query Generation Pattern format.
Split First/Last Names
If the selected cell values are recognized as people’s names, this function inserts new ‘First Name’ and ‘Last Name’ columns before the selected column (if these do not already exist) and fills them with the corresponding values found in the selected cell(s). Note that, for now, only one pair of First Name/Last Name columns can exist in the datasheet.
Split Cell(s) to Rows
If the selected cell values contain a character recognized as an item separator (;,-/), this function inserts new rows below the selected rows and fills them with the split values of the selected cells, duplicating the content of the other cells of the selected rows (for instance, a cell containing ‘red; green; blue’ will produce three rows). Note that, as with all ‘intelligent’ functions, this one can sometimes have unexpected results, but it can nevertheless save you a lot of time in many repetitive tasks.
Split Cell(s) to Columns
If the selected cell values contain a character recognized as an item separator (;,-/), this function inserts new columns to the left of the selected column and fills them with the split values of the selected cells. Note that, as with all ‘intelligent’ functions, this one can sometimes have unexpected results, but it can nevertheless save you a lot of time in many repetitive tasks.
Insert Column
In data views, inserts a new blank column before the selection. This option only applies to dynamic columns.
Insert Cell(s)
In data views, inserts new blank cells before the selection. This option only applies to dynamic columns.

Delete
Gives access to the Delete sub-menu, to delete cells, rows or columns.

Delete
Deletes the selected row(s).
Delete Unselected
Deletes the row(s) that are not selected.
Delete Column
In data views, deletes the selected column. This option only applies to dynamic columns.
Delete Columns
In data views, this option gives you access to a sub-menu allowing you to delete columns containing fewer than a certain number of populated cells. This option, which only applies to dynamic columns, is very useful to clean up large scrapes where useless columns have been created by poorly populated data fields.
Delete Cell(s)
In data views, deletes the selected cells and moves left all the cells located to the right of the selected column. This option only applies to dynamic columns.
Delete Duplicates
Gives access to a sub-menu to delete cell duplicates (rows containing an identical value to the selected cell in the same column) or row duplicates (rows where all cells are identical to the cells of the selected row). It is also possible, through the same menu, to delete all cell or row duplicates of the datasheet.

Select
Gives access to the Select sub-menu, with various ways to select cells or rows.

Select All
Selects all rows of the datasheet.
Invert Selection
Deselects all selected rows of the datasheet and selects all rows that were not selected.
Select Block
In the lists, tables, scraped and news views, this function will select the whole block (list, table, scraped page or rss feed) where the selected row is located. Note: the selection is done using the second group of digits in the Ordinal ID. (Use the column picker at the top right corner of the datasheet to show the Ordinal column, if it is not visible.)
Select Similar
Selects all rows of the datasheet with content similar to that of the selected cell. The default threshold used for determining similarity is 40 (0 selecting only identical values and 100 selecting everything). Use the sub-menu items to increase or decrease the threshold and select more or fewer rows.
Select Identical
Selects all rows of the datasheet with content identical to that of the selected cell.
Select Different
Selects all rows of the datasheet with content different from that of the selected cell.
Select Duplicates
Gives access to a sub-menu to select cell duplicates (rows containing an identical value to the selected cell in the same column) or row duplicates (rows where all cells are identical to the cells of the selected row). It is also possible, through the same menu, to select all cell or row duplicates of the datasheet.

Auto-Explore
This sub-menu gives access to automation functions that you can apply to the URLs of the selected column in the selected rows of the datasheet. It gives you the capacity to explore the pages or documents and apply extractors, according to the current configuration of the application.

Browse
The program explores the links included in the current selection one after the other. During the exploration, all active extraction processes will be executed on page load, depending on the settings of the views’ bottom panel.

Dig
The program explores the links found in the pages of the current selection’s URLs. The exploration will be done within the domain of each link, with a depth of 1. During the Dig process, all active extractions will be executed on page load, according to the settings of the views’ bottom panel.

Fast Scrape
Applies a Scraper to a list of Selected URLs. When this function is invoked, XML HTTP requests are sent to all the selected URLs, to retrieve the source code of each one. The most relevant scraper is applied to it, without loading images etc. and without any other extraction being performed. All extracted data is sent to the Scraped view (which is not emptied during the process, regardless of the state of the Empty checkbox).
Fast Scrape (Include Selected Data)
Same function as ‘Fast Scrape’ above, except that the data fields included in the selection will be added to the scraped results. This saves you the work of merging back the records after the scraping, if you need to keep information from the original data.
Apply a Generic Macro
This function allows you to apply a generic macro to the selected URLs. Generic macros are simply macros for which no specific URL is set in the Start Page field.
Open URL in a New Window
In the Firefox Add-on: When the selected data contains a URL, it will be opened in a new browser window.

Download
Gives you access to the Download sub-menu. Note that a preference (in Tools>Preferences>Export) allows you to automatically rename the downloaded files.

Download Selected Files
Downloads all documents and images found in the selected rows and saves them to the current destination folder on your hard disk.
Download Selected Files in…
Downloads all documents and images found in the selected rows, opening the folder picker to let you decide where you want the files to be saved.

First Names
Gives you access to the First Names sub-menu.

The First Name Dictionary is used to enhance the recognition of contacts in Web pages. A default dictionary of a few thousand first names from around the world is already included in the program. You can add your own using these options. Note that the dictionary can be saved and loaded from the File menu.

Remember First Name
Choosing this option when a first name is selected in the datasheet will add it to your dictionary.
Forget First Name
Choosing this option when a first name is selected in the datasheet will remove it from your dictionary.

Clean Up
Gives access to the Cleaning & Normalization sub-menu.

Clean Contents
Gives access to the String Cleaning sub-menu.

To Lower Case
Converts all characters of the selected cells to lower case.
To Upper Case
Converts all characters of the selected cells to upper case.
Capitalize Words
Converts the first character of each word in the selected cells to upper case and the others to lower case.
Dust It
Cleans up the text as well as possible and capitalizes the words.
Zap It
Removes all non-alphabetical characters from the text and capitalizes the words.

Normalize All Figures / Selected Figures in Column
When this function is executed on a selection, the numerical data contained in each selected cell of the selected column (or in the whole datasheet, depending on the selected option) is reformatted and converted to the corresponding value in metric units (if a numerical value is found with a non-metric unit). Numerical values are normalized as much as possible: removing thousand separators, using dots as decimal separators, removing trailing zeros in decimals, etc. The purpose of this function is not to create a nice formatting but rather to homogenize the formats so that the values can be processed and sorted. Note: the feature is watched by dozens of unit tests in our system and works rather well. There are, however, many possible causes for misinterpretation of numbers in a text, so please do not rely on this function for processes involved in the piloting of commercial airliners, nuclear power plants, etc.
To Units: Values will be converted to meters, square meters, cubic meters, grams etc.

To k Units: Values will be converted to kilometers, square kilometers, kilograms etc.
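
As an illustration (the exact output formatting may vary), a hypothetical cell containing “3,500 ft” should become “1066.8 m” with To Units, and “1.0668 km” with To k Units.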

Send to Queries
Sends strings to a directory of Queries.

Send Cell(s) to Queries
Sends the selected cells to the chosen directory of the queries view:

New Directory: A new directory will be created with the selected items.

directoryName: The selected items will be sent to the chosen directory.

Send Link(s) to Queries
The first links found in the selected rows will be sent to the chosen directory of the queries view:

New Directory: A new directory will be created with the selected items.

directoryName: The selected items will be sent to the chosen directory.

Export Selection as…
Exports the selected data to a file on your hard disk, in one of the available formats (Excel, HTML, Text, CSV, SQL).

The Page Right-Click Menu
In the browser panel, additional features can be accessed with a right click on the page or a click on the Exploration Button.

Available options can vary with the context and the license level of your product.
Edit (right-click menu)
Gives access to the Edit sub-menu.

Copy Links
If a part of the current page was selected in the browser panel, OutWit Hub will copy the links found in the selection, otherwise all the links of the page will be copied to the clipboard.
Copy
Copies the selection to the clipboard.
Paste
Pastes the clipboard content.
Paste Text
Pastes the clipboard content as plain text.
Paste Links
Pastes the URLs found in the clipboard content. Note that if you use this function on the browser, the list of links will replace the currently displayed page.
Send Copied Links to Queries
If there are links (URLs) in your clipboard (copied from OutWit Hub or any other application), this function sends them to a new or existing directory of the queries view.
Send Highlighted Links to Queries
When you hover over a link, OutWit Hub highlights the series of links it belongs to; this function sends the whole highlighted series to a new or existing directory of the queries view.
Send Page Links to Queries
Sends all URLs found in the current page to a new or existing directory of the queries view.
Edit Page Tools
Gives access to series of functions to alter or reformat the current page (the resulting page can be used as the source for all extractions or saved to your hard disk):

Extract All Page Links: Replaces the currently displayed page with a generated HTML page containing all the links found in this page.
Outline Page: Replaces the currently displayed page with a generated outline of the original page, only keeping the section and paragraph titles and subtitles.
Indent Page: Replaces the currently displayed page with a generated outline of the original page, including the text content, indented within the outline.
Decode MIME inclusions: Replaces MIME inclusions (if any) within the currently displayed page with legible (and extractable) decoded text.

Select Similar
Selects links that belong to the same series or that are at the same hierarchical level as the selected link.
Select All
Selects all the page content.
Find
Looks for a string or regular expression in the page.

Options (right-click menu)
Allows you to disable images, plugins and/or JavaScript in order to enhance performance during large automatic explorations. (These settings are persistent between sessions. Do not forget to switch them back to revert to normal browsing.)

Fast Search for contacts & Auto-Explore Pages
Give access to a series of automatic navigation functions. (see option details in the Navigation Menu.)
Apply Scraper
Gives access to the Scraper Application sub-menu. Applies the most pertinent scraper to the current page or to the selected / highlighted links. When a scraper is applied to links with this function, the Fast Scrape mode will be used. If you do not want to use the Fast Scrape mode, use the Auto-Explore Pages function instead, after having set the scraped view to receive the data.
Apply Macro
Applies a generic macro with the current page as start page.
First Names
Gives access to the First Names sub-menu.

The First Name Dictionary is used to enhance the recognition of contacts in Web pages. A default dictionary of a few thousand first names from around the world is already included in the program. You can add your own using these options. Note that the dictionary is located in your automator database, which can be saved and loaded from the File menu.

Remember First Name
Choosing this option when a first name is selected in the page will add it to your dictionary.
Forget First Name
Choosing this option when a first name is selected in the page will remove it from your dictionary.

OutWit Hub’s Views — The Side Panel

The side panel on the left of your screen contains all available views of the application.

The different views allow you to dissect the page into its various data elements.

Some display extracted data (links, contacts, text…); others give you access to tools for performing specific extraction tasks (automators).
Items may be collapsed: to display the views they contain, click on the triangle pointing to the right (►). Some of the sections containing views (like Data) are not clickable, as they do not correspond to a view. You need to open the section, if it is collapsed, and select one of the views inside it.
Note: Some views are present in both light and pro versions, with limited or disabled features in the light version, others are only present in the pro version.

page
Displays the current web page or document analyzed in the other widgets.

links
Lists URLs found in the current page.

documents
(Pro) Lists documents found in the current page.

images
Lists images found in the current page.

emails
Lists contact info found in the current page.

data
Contains the data extraction tools.

tables
Extracts HTML table contents.

lists
Extracts HTML list contents.

guess
Tries to guess the structure of the data and extract it.

scraped
Applies the most pertinent active scraper to the page.

text
Displays the current page as simple text.

words
(Pro) Displays the vocabulary used in the page, with the frequency of each word.

news
Displays RSS news found in the current page or domain.

source
Displays the HTML source of the page.

automators
Contains the automation tools.

queries
(Pro) Allows you to create directories of URLs.

scrapers
Allows you to create and edit data scrapers.

macros
(Pro) Allows you to create and edit macros.

jobs
(Pro) Allows you to program the execution of a task.

history
Displays the navigation history, grouped by domain name.

The Page View
This is the browser: it displays the current web page or file that is being analyzed in the other views.

When in the page view, you can navigate through Web pages as you would in any Web browser. You can also open a local file or even drag a folder from your hard disk to the url bar to see (and navigate through) its content.

Exploration Button and Right-Click Menu
If active, the Exploration Button, representing a magnifying lens with an at sign (@), is located at the top left corner of the browser. When moving your cursor across the page, you will see it place itself above the series of links that the program recognizes and highlights. Automatic navigation functions are available by clicking on this button or by right-clicking directly on the page (see details in the page right-click menu).

TIP – Optimizing Performance: Right-click on the page to disable or reactivate images and plugins in OutWit’s browser. Deactivating them can make the loading of each page faster for long explorations and extraction workflows, when you do not need images or Flash animations.
Click on the black triangle next to the page view name in the side panel to hide or show the extractors (links, images, contacts, data, tables, lists, guess, scraper, text, news and source views).
Note: you can select, in the general preferences, whether you want the application to remain in the current view or to come to this view when a URL is typed in the address bar.
Dragging text to the page
You can drag a selection from another application to the browser and it will appear as simple text. This is one of the many ways to import URLs from another source: just drag a selection of URLs from a text editor and you will find them in the “links” view. (You can also put them in a .txt file and open the file with the Hub.)

The Links View
Shows the list of URLs found in the current page.

The links view displays a table of the URLs found in the current Web page or file that do not link to media or documents. The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the link was found*
Page URL: The URL of the link itself
Frequency: The number of occurrences of this link in the page
Text: The description text of the link
Filename: The name of the file the URL links to
Type: The type of document
Mime Type: The Mime Type of the file on the server
First Seen: The first time this link was seen*
Last Seen: The last time this link was seen*
Main Doc URL: The URL of the page’s main document. (Useful when a page contains frames or iFrames, to have the parent URL.)*

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
If you wish to download some of the files, simply select them in the table and use the ‘Download Selected Files’ or ‘Download Selected Files in…’ option of the right-click menu.

Bottom Panel Options

When the local checkbox is unchecked, OutWit hides links to the same domain as the current page.

When the cache checkbox is unchecked, OutWit hides links considered to be cached data.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Documents View (pro version)

Shows the list of all document URLs found in the current page.

The documents view displays a table of all document files (.doc, .pdf, .xls, .rtf, .ppt…) found in the file currently displayed in the page view. It includes the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the document was found*
Document URL: The URL of the document
Filename: The name of the file the URL links to
Last Modified: The modification date, if found on the server
Size: The file size
Type: The type of document
Mime Type: The Mime Type of the file on the server

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
If you wish to download some of the documents, simply select them in the table and use the ‘Download Selected Files’ or ‘Download Selected Files in…’ option of the right-click menu.

Bottom Panel Options

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Images View

Shows the list of all images found in the current page.
The images view displays a table of all image files found in the page currently displayed in the page view or in linked pages.

TIP – Optimizing Performance: Right-click on the images view name in the side panel to disable or reactivate automatic image extraction when a new page is loaded. Deactivating this can make the processing of each page faster for long explorations and extraction workflows, when you do not need images. (Also see the page view.)

The table contains the following information:

Source URL: The URL of the page where the image was found*
Image: The thumbnail of the image
Filename: The name of the image file
Size: The size of the image in pixels (width x height)
Media URL: The URL of the image file
Found in: The DOM element where the image was found in the source code (image tag, script, background…)
Description: The description text of the image
Type: The type of image
Mime Type: The Mime Type of the image file on the server
Thumb URL: The URL of the thumbnail (if a high resolution image was found)*

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
If you wish to download some of the images, simply select them in the table and use the ‘Download Selected Files’ or ‘Download Selected Files in…’ option of the right-click menu.

Bottom Panel Options

If the adjacent checkbox is checked, OutWit will look for sequences of pictures by trying to find numerical sequences in the URLs around the found images. For instance: if an image named obama_022.jpg is found, the program will try to find obama_021.jpg and obama_023.jpg on the same server.

When the scripts, styles or backgrounds checkboxes are checked, OutWit looks for images in the corresponding tags of the page source code. Styles is unchecked by default, as style images are often small layout elements of lesser interest.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Contacts View
Shows the list of email addresses and contact elements found in the current page.

The emails view displays a table of the email addresses found in the current Web page / file or in the automatically explored pages. The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the email address was found*
Source Domain: The domain of the page where the email address was found
Page Title: The title of the page where the email address was found
Email: The email address itself
Frequency: The number of occurrences of this email address in the page or in the automatically explored pages
Contact Info Columns: First Name, Last Name, Address, Phone, Fax, Mobile, Toll Free, Title… are added when the ‘Guess Contact Info’ checkbox is checked in the bottom panel.

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.

Bottom Panel Options

Strict: When unchecked, OutWit Hub will look for addresses with a looser format and will accept strings like “name at site dot com” or “pseudo[at]domain.org” as valid email addresses.
Guess Contact Info: When checked, OutWit will try to find additional contact information linked to the email address.
Filter Level: Beside the Guess Contact Info checkbox is a popup menu allowing you to select the level of filtering or strictness for the info recognition. When the filter is maximum, the contact data found is only added if it is very likely to be linked to the email address. When the filter is minimum, all found data is added to the result datasheet, at the risk of grabbing some noise or making occasional mistakes associating the info to the email address.

Note: The contact info extraction is based on recognition by the program of unstructured data in each page.
Recognizing that a series of digits is a phone number rather than a social security number or a date is easy if you know in advance that you are dealing with data from a given country. If you don’t, however, the problem is very far from trivial.
A brief description of the way OutWit searches for names, addresses, phone and fax numbers etc. will help you understand how reliable it can be, depending on the source: the program first checks whether additional contact information can be found in the immediate context of each email address. It then takes all unassigned phone numbers and physical addresses and checks whether each is likely to belong to one of the previously found contacts. Otherwise, these are listed independently, lower in the result datasheet.
For the data to be extracted, it must first be present in the page, of course. Then, if it is, no technology allows for perfect semantic recognition. An address or a phone number can take so many different forms, depending on the country, on the way it is presented or on how words are abbreviated, that we can never expect to reach a 100% success rate.
Email address recognition is nearly perfect in OutWit; phone numbers are recognized rather well in general; physical addresses are more of a challenge: they are better recognized for the US, Canada, Australia and European countries than for the rest of the world. The program recognizes names in many cases. As for other fields like the title, for instance, automatic recognition in unstructured data is too complex at this point and results would not be reliable enough for us to include them unless they are clearly labeled. We are constantly improving our algorithms, so make sure to keep your application up-to-date.
If your need for precision in the extracted data is critical in your workflow and if you cannot afford failed automatic recognition, it may not be a good idea to rely on automatic features like this one. In these cases, you may want to create a scraper for a specific site.

Max Processing Time: Allows you to set the maximum time in seconds that the program should spend analyzing each page when searching for contacts.

Empty/Auto-Empty: This button offers two positions, accessible via the popup arrow on its right side: Empty on Demand, which allows you to only clear the contents of the results datasheet when you decide, or Auto-Empty, which tells the program to clear the results each time a new page is loaded.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Data Section

Gives access to the different data extraction views of the application.

The current version of OutWit Hub Pro offers four data views: Tables, Lists, Guess and Scraped.
Note: You can hide or show those by clicking on the black triangle next to the section name in the side panel.
The Tables and Lists views will help you extract data with an explicit structure in the HTML source code of the page. The other data extractors will be useful when these two are not enough to get the job done. Guess tries to automatically recognize the data structure and Scraped allows you to manually define how the extraction should be done.

The Tables View
Displays the HTML tables found in the current page.

The tables view displays in the datasheet the HTML tables of three rows or more found in the current page. The minimum number of rows required for tables to be extracted can be altered in the preferences (Tools>Preferences>Advanced Tab).
In case of merged cells in the HTML code (using row or column spans), the cells are kept separate in the view datasheet and the values are repeated in the corresponding cells.
If a hypertext link is found in the data of a table row, it will be placed by the program in the URL column at the left of the datasheet. The objective is to gather the useful links in this one column, both for the Lists and Tables views. This column will usually be the simplest way for you to grab collections of links to explore further. If several links are found in each row, OutWit will try to decide which column contains the most significant links. By default, the first column containing URLs will be chosen, unless there is a column with fewer missing links, fewer duplicate links, etc. This is an arbitrary algorithm, but it usually works pretty well.
The first three columns of the datasheet contain the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the list was found*
URL: The most significant URL found in the table row (if any). Often the first link found.

The following columns vary with the data extracted.

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Lists View
Displays the HTML lists found in the current page.

The lists view displays in the datasheet the HTML lists (ul, ol and li tags) found in the current page, keeping the hierarchical level of the items.

If a link is found for a list item, it will be stored in the URL column of the datasheet. If several links are found, only the last one will be kept. The first three columns of the datasheet contain the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the list was found*
URL: The last URL found in the list item (if any)

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options
Add Titles: This option was added in v3.0, as lists are often difficult to identify or understand without the title preceding them. When this option is checked, the program includes the content of the heading tags (h1, h2…) preceding the lists in the HTML page.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Guess View
Displays the data extracted using automatic structure recognition algorithms.

The guess view tries to understand the structure of the data found in the current page, if any.

Note: The program analyzes the available HTML source code of the page. Labels and field/record separators are looked for, using many different strategies. The program eventually gives a rating to each possible structure found and decides on the best possible answer, if any. The challenge of these intelligent algorithms is to understand even non-tabulated data, and we will keep making them more efficient, but the very nature of the problem makes it impossible to ever get close to a 100% success rate.
If your need for the scraped data is critical in your workflow and if you cannot afford failed automatic recognition, it may not be a good idea to rely on automatic features like this one. In these cases, you should probably define a scraper and use the right-click menu option: ‘Auto-Explore’ > ‘Fast Scrape’. This way, if you have thoroughly tested the scraper you have designed, the process will be reliable and reproducible, at least as long as the online source is not altered and remains accessible.

Ordinal: An index composed of three groups of digits separated by dots (hidden by default)

Source URL: As in the datasheets of the other views, the Source URL is placed in the second column and is hidden by default.
Note: Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options
List: When checked, OutWit will try to find a list of records and present them as a table (one record per row, one field per column). When unchecked, the program will try to recognize “specsheet” type of data in the page (one row per field, a Label and a Value for each field). If you uncheck this option, OutWit should do better with simple text data like this:

last name: Knapp

first name: John

age: 34

phone: (674) 555-5621

You can try this option by going to the workshop page (ctrl/cmd-shift-k) and pasting text from a word processor, emails… Guess will usually do better if you paste simple text, using a right-click and choosing Edit>Paste Text.
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Scraped Data View
Displays the results of the application of a scraper to the current page (or to a series of URLs).

The scraped data view displays in a table the data extracted using the active scraper with the highest rating. Each field defined in the scraper corresponds to a column of the datasheet.
If several active scrapers can be applied to the current URL, the possible candidates will be rated according to their version number, their freshness and the specificity of the Apply to URL (mySite.com having a lower priority than www.mySite.com/myPage). If you wish to apply a scraper of a lesser rating, you can deactivate all scrapers with a higher priority in the scraper manager.
Ordinal: An index composed of three groups of digits separated by dots

Source URL: As in the datasheets of the other views, the Source URL is placed in the first column and is hidden by default.
Note: Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.

Bottom Panel Options
The Keep Order checkbox allows you to ask OutWit to keep the columns in the same order as the scraper lines. If this option is checked, all columns will appear in the resulting data, even when they are completely empty. (This can be useful if you wish to export to an Excel or HTML file, for instance as part of a job, to then use this data in a set process, with other applications.)
In case of application of a scraper to whole lists of URLs, like with the ‘Auto-Explore’ > ‘Fast Scrape’ of the datasheets’ right-click menu, the Empty checkbox will be ignored. In all other cases, if this option is checked, the datasheet will be emptied as soon as a new page is loaded.
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Text View
Shows the current page as simple text.

The Text view displays the textual content of the current page and ignores all other content: scripts, media, animations, layout, etc.

Note: You can hide or show the Text related views available in your version of OutWit Hub, by clicking on the black triangle next to the view name in the side panel.

All or parts of the text can be moved to the Catch (or saved to a file).

The Words View (pro version)
Displays the vocabulary used in the page, with the frequency of each word.

The Words view displays a table of significant words and groups of words found in the source code of the current Web page or file. The frequency column gives you, as a fraction, the number of occurrences divided by the total number of words. Note that if you notice a higher number of occurrences than what you can actually see in the Web page, it means that the other occurrences of the word or phrase are in the source code but hidden (like alternate text, invisible blocks, etc.).

Groups of words are recurring successions of two to four words. If OutWit recognizes the page language, “empty words” are ignored in this view. This means that in English, French, German, Spanish and several other occidental languages, very common pronouns, auxiliaries, articles, etc. will be ignored. This covers words like “the”, “is”, “which” etc.

The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the word was found (or the number of pages where it appeared, if ‘Empty’ is unchecked)
Word: The word
Frequency: The number of occurrences of this word in the page

Note: The Ordinal column is hidden by default in this view. Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.

Bottom Panel Options

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The News View
Shows the list of RSS articles found in the current page or in the current domain.

The news view displays a table of the news articles from all RSS feeds found in the current Web page, or in the same domain. The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the feed was found*
Feed URL: The URL of the RSS feed*
Feed Title: The name of the feed*
Feed Link: The link of the HTML page corresponding to the feed*
Feed Description: The description of the feed*
Feed Language: The language of the RSS feed*
Title: The title of the article
Article URL: The link to the full article
Date: The date and time of release
Image: The URL to the attached image*
Category: The category name of the article*
Abstract: The abstract of the article

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Source View
Shows the colorized source code of the current page.

When the current file displayed in the Page view is a Web page, the source view contains the colorized HTML source code of its main document.

In the pro version, a radio control allows you to select if you want to see the original source code that was loaded by the browser when opening the page or the code as it was dynamically altered by scripts after the page was loaded. The dynamic source code is presented on a pale yellow background and the original, on a white background. This will help you recognize the setting immediately.

The source code presented in the scrapers view is another instance of the same panel.

The colorization was conceived for data search rather than programming purposes and emphasis is given to the textual content that is actually displayed on the page: it is shown in black and pops out from the cryptic HTML syntax.

Distinct colors are used for the following types of content:

Displayed text
HTML tags
Links
Comments
HTML Entities
Styles
Scripts
Images

The History View
Displays the list of seen URLs grouped by domain.

The history view doesn’t show the list of URLs that have been visited (this would be redundant with the browser navigation history), but of URLs that have been seen in the current session. This means that, as the history is grouped by domain, after surfing for 15 minutes (or hours) on Web pages related to a certain topic, say astronomy, you will find in this view a list of the most frequently cited domains in this topic.

The History datasheet contains the following information:

Domain: The domain
First Seen on: The time when this domain was first recorded in the current session
Last Seen on: The time when this domain was seen most recently in the current session
Frequency: The number of times this domain was seen in the current session

Bottom Panel Options
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Automators Section

Gives access to the different automators available in the application.

In the current version of OutWit Hub Pro, four kinds of automators can be defined: Scrapers, Macros, Jobs and Queries.

A Scraper is a description of the data structure in a page, defining the markers that can be found in the document source code, around the data you wish to extract.
A Macro is a snapshot of the complete configuration of the application’s various extractors, which can be replayed in a single click for performing a specific exploration and extraction task.
A Job is a preset time and periodicity at which an action should be performed.
A Set of Queries is a directory containing a list of URLs or Query Matrices on which an action (autobrowsing, macro, scraper, slideshow, etc.) can be performed.

The Automator Managers

Allows you to manage the automators stored in your profile.

In each of the automator views (Scrapers, Macros, Jobs and Queries), the manager is the panel presenting the list of all automators of the considered type stored in your profile. The manager allows you to create, delete, import, export automators and gives access to the property and automator editors.
Each automator is identified by its Automator ID (AID) in your profile. It is preceded by an Active checkbox. When this box is unchecked the automator is deactivated and grayed out in the list. You will need to activate it before using it.
If a layout change button is present at the top right corner of the panel, you will be able to switch between horizontal and vertical layout of the window, placing the editor and manager at the bottom or on the right of the screen.

The Scrapers View
A Scraper is a template telling OutWit how to extract information from a page.
When ‘tables’, ‘lists’ or ‘guess’ do not manage to automatically recognize the structure of a page and extract its data, you still have the option to create a Scraper and tell OutWit how it should handle this specific URL (or all the pages of a given Web site, a sub-section thereof, etc.).

What is a scraper?

A scraper is simply a list of the fields you want to recognize and extract. For each field, it specifies the name of the field (ex.: ‘Phone Number’), the strings located immediately before and after the data to extract in the source code of the page, and the format of the data to extract for this field. The pro version also allows you to set a replacement string to alter the extracted data and a delimiter to split the extracted result into several fields.
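For instance, here is what a minimal scraper line could look like for a hypothetical page where phone numbers appear as ‘Tel: (555) 123-4567’ in the source code (no ‘Marker After’ is needed here, as the Format pattern delimits the data itself):

Description:

Phone Number

Marker Before:

Tel:

Format:

/[-() \d]+/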

Creating and Editing Scrapers

The scrapers view can contain either the scraper manager (to create, duplicate, delete previously made scrapers), or the scraper editor, to build and edit them.
In editing mode, the source code of the current page is displayed, for you to easily identify and copy the markers you need. You can select which source code you want your scraper to be applied to: the original (white background) or the dynamic source code (pale yellow background), using the source type popup menu.
In the bottom part of the window is the editor itself, where you can create and modify your scrapers.
To switch from one mode to the other, use the Manage or Edit button.
Activating scrapers
As for all other extractors, the ‘scraped’ view is automatically active when any control of the bottom panel is set to a non-default value (e.g. ‘empty’ is unchecked, ‘move to catch’ is checked…). On the scrapers themselves, an additional control must be set for the extraction to be done automatically:
In the scraper manager, the ‘active’ checkboxes determine whether a scraper should be used when the corresponding URL is loaded. When unchecked, the scraper is deactivated.

When asked to scrape a page, if there is more than one applicable active scraper for that URL, OutWit will apply the one that seems most appropriate and recent, using several criteria (version number, modification time, specificity of the ‘URL contains’ string, etc.).

In the light version, only one scraper can be active at a given time and no more than ten scrapers can be present in the manager.

Scraping Data

If a control of the bottom panel is set to a non-default value in the scraped view, the program will try to find a matching scraper and apply it, as soon as a new page is loaded.

From both the manager and the editor, you can apply the scraper to the current page, using the ‘Execute’ button (or the right-click menu, in the manager). After performing an extraction with a scraper, you will find your results in the scraped data view.
From any datasheet, you can select URLs, right click on one of them and choose ‘Auto-Explore’ > ‘Fast Scrape’ in the popup menu.
You can of course include your scrapers in macros (browsing/digging through URLs or using the ‘fast scraping’ mode) for recurring extraction tasks.

The Scraper Editor (OutWit Scrapers Syntax Reference)

In the scrapers view, the bottom part of the window can either be the Scraper Manager or the Scraper Editor. In the scraper manager, you can see and organize your scrapers and, when double-clicking on one of them or creating a new one, the scraper editor opens and you can create or alter your scraper lines.
The Editor allows you to define the following information:

Apply if URL contains…: The URL to Scrape (or a part thereof). This is the condition to apply the scraper to a page. The string you enter in this field can be a whole URL, a part of URL, or a regular expression, starting and ending with a ‘/’. (In the last case, the string will be displayed in red if the syntax is invalid.) If you try to scrape a page with the ‘Execute’ button when this field doesn’t match the URL, an error message will be displayed. If two or more scrapers match the URL of the page to be scraped, the priority will be given to the most recent, with the most significant condition (longest match).

Note: If you keep getting the message: “This scraper is not destined to the current URL”, this is the field that must be changed. A frequent mistake is to put a whole URL in this field, when the scraper is destined to several pages. Try to enter only the part of the URL which is common to all the pages you wish to scrape, but specific enough to not match unwanted pages.

You may also get this error message if you are trying to apply a disabled scraper to a valid URL. In this case, just check the OK checkbox in the scraper manager.
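For instance (a hypothetical case): if the pages to scrape are www.mySite.com/cars/page1.html, www.mySite.com/cars/page2.html, etc., entering mySite.com/cars/ in the field will apply the scraper to all of them, whereas pasting the full URL of page1.html would restrict it to that single page.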

Source Type (pro version): You can set your scraper to be applied either to the original source code that was loaded by the browser when opening the page or to the code as it was dynamically altered by scripts after the page was loaded.

In the scraper definition itself, each line corresponds to a field of data to be extracted.

To edit or enter a value in a cell, double-click on the cell. To simplify the fabrication of scrapers and avoid typos, the best way is often to select the string you want in the source code and drag it to the cell. You can then edit it as you like.

Description (name of the field): can either contain a simple label like “Phone” or “First Name” or a directive (see below),
A) Marker Before – optional: a string or a Regular Expression marking the beginning of the data to extract,
B) Marker After – optional: a string or a Regular Expression marking the end of the data to extract,
C) Format – optional: a Regular Expression describing the format of the data to extract,
Replace – (pro version) optional: replacement pattern (or value to which this field must be set).
Separator (pro version) optional: delimiter, to split the extracted result into several fields.
List of Labels (pro version) optional: the list of labels to be used, if the result is split into several fields with a separator.

Important Notes:

1) In a scraper, a line doesn’t have to include Marker Before (A), Marker After (B) and Format (C). One or two of these fields can be empty. The authorized combinations are: ABC, AC, BC, AB, A, C.

2) When creating a scraper you can right-click on a marker or format field to find and highlight the occurrences of a string or a pattern in the source code. If you right-click on the description field it will allow you to find the whole scraper line in the source. This is very useful for troubleshooting.

3) The first line of the scraper will be considered by OutWit Hub as the field that starts a new record. This means that each time this scraper line matches data in the page, a new record will be created. Usually, the best way is to follow the order of appearance of the fields in the source document.

In the Format pattern, use the regular expression syntax and do not forget to escape reserved characters.
Note: If you right-click on the text you have entered in a cell, an option will allow you to escape a literal string easily. In the Format field, the content will always be understood as a regular expression, even if not surrounded by / /.

In the Replace string, use \0 to insert the whole string extracted by this line of the scraper, or \1, \2, etc. to include the data captured by parentheses –if any– in the Format regular expression.

For instance, say you extract the string "0987654321" with a given scraper line. Adding a replacement pattern can help you rebuild a whole URL from the extracted data:

If you enter:

http://www.mySite.com/play?id=\0&autostart=true

as replacement string, the scraper line will return

http://www.mySite.com/play?id=0987654321&autostart=true
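Parentheses in the Format field let you reuse only a part of the matched string. For instance, on a hypothetical page where products appear as id=0987654321 in the source code, the following line would return product-0987654321.html:

Description:

Product Page

Format:

/id=([0-9]+)/

Replace:

product-\1.html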

In the Separator, use either a literal string like ',' or ';' or a regular expression like / [,;_\-\/] /.
For technical reasons, all regular expressions used in scrapers are interpreted as case insensitive patterns by default. [A-Z], [A-Za-z] and [a-z] have the same result. This can be changed using the #caseSensitive# directive. This means that ‘Marker Before’, ‘Marker After’ and ‘Format’, which are always converted to regular expressions by the program, are case insensitive by default. Conversely, the ‘Separator’, which is used as is by the program, is case sensitive by default if it is a literal string, and case insensitive if it was entered as a regular expression.

To learn about Regular Expressions, please visit the RegExp Quick Start Guide

When splitting the result with a separator, use the List of Labels to assign a field name to each part of the data. Separate the labels with a comma. If there are fewer labels than split elements or if the Labels field is empty, default labels will be assigned using the description and an index.

Example: the string you want to extract doesn’t have remarkable markers and you do not know how to separate different elements of the data. Say the source code looks like this:

  • Dimensions:35x40x70

If you want the three dimensions in three separate columns, you can reconstitute the structure by entering the following:

Marker Before:

Dimensions:

Marker After:

Separator:

x

Labels:

Height,Width,Depth

Regular expressions can be used to keep the scraper short if you are comfortable with them. In many cases, however, it is also possible to do without. For instance, if you need an OR, just create two scraper lines with the same field name (Description).

Example: If you want your scraper to match both ‘sedan-4 doors’ and ‘coupe-2 doors’

The simple way is to do it in two separate lines:

Description:

car

Format:

sedan-4 doors

Description:

car

Format:

coupe-2 doors

Or you can use a regular expression:

Description:

car

Format:

/(sedan\-4|coupe\-2) doors/

Directives (pro version)

Directives alter the normal behavior of the scraper. They can be located anywhere in the scraper and will be interpreted before all other lines by the program. Directives are identified by # characters in the description field:

Pre-Processing:
#abortIf# and #abortIfNot# Aborts the scraping and interrupts the current automatic exploration if the scraper line matches (or doesn’t match) within the page.
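Example (hypothetical): to interrupt an automatic exploration as soon as a result page announces that there is nothing left to collect:

Description:

#abortIf#

Format:

No results found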
#autoCorrect# is destined to fix common scraper problems. For now it only corrects wrapping offsets happening when the wrong field is used as the record delimiter. (This feature is temporary.)
#caseSensitive# makes the whole scraper case sensitive. Note that, as all the regular expressions and literals of a scraper are combined into a single regular expression at application time, it is not possible to define case sensitivity line by line or field by field. The whole scraper must be conceived with this in mind.
#checkIf#, #checkIfNot# and #check#: if the scraper line matches at least one string in the page (#checkIf#), or does not match anything (#checkIfNot#), or in any case, without condition (#check#), the content of the ‘Replace’ field will alter the OK column of your scraper. A string of 0s and 1s in the Replace field will set the OK checkboxes of the scraper in the same order. Note that the right-click menu on the Replace field of a #check# directive line will allow you to copy the values from the OK column to the cell or copy the cell string to the OK column.

Example:

You want to turn off line 5 of your scraper if the page doesn’t contain “breaking news”:

Description:

#checkIfNot#

Format:

breaking news

Replace:

11110111

#cleanHTML# normalizes the HTML tags before the scrape, placing all attributes in alphabetical order. This can prove useful on occasions when a page was typed by hand (without much rigor) instead of generated automatically.
#concatSeparator#separator# allows you to set the character or string to be used as a delimiter in concatenation functions like #CONCAT#, #DISTINCT#, etc.
#ignoreErrors# when this directive is used, cells where a function returned an error will be empty instead of containing an ##Error message.
#insertIfNot#myFieldName# if the scraper line does not match anything in the page, the content of the ‘Replace’ field will be added once to each row scraped for this page. It is the only way to insert information into your extracted data when the page does not contain something.
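Example (hypothetical): to fill a ‘Stock’ column with a default value whenever a product page does not contain the string ‘In Stock’:

Description:

#insertIfNot#Stock#

Format:

In Stock

Replace:

Out of stock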
#insertIf#myFieldName# the data extracted by this scraper line will be added once to each record scraped in this page, if the scraper line matches one or more strings in the page. It is mostly here as the corollary of the previous directive, but it is a good way to get rid of duplicate columns in certain cases.
#keepOrder# has the same effect as checking the ‘keep order’ checkbox in the scraped view or in a macro, i.e. ensuring that the columns will appear in the result datasheet in the same order as the scraper lines. Setting it directly in the scraper allows you to make sure to always have this behavior with this scraper.
#outline# alters the source code before scraping, keeping only the document/page outline.
#indentedText# alters the source code before scraping, reorganizing the document/page layout into an outline with indented text.
#pause# or #pauseAfter# instructs the scraper to wait, after the page is processed, for the number of seconds set in the Replace field.
#processPatterns# instructs the scraper to check if URLs passed to the #addToQueue# directives are generation patterns. If they are, the patterns will be interpreted and all generated strings will be added to the queue.

#replace# pre-processing replacement: the string (or regular expression) entered in the ‘format’ field will be replaced by the content of the ‘Replace’ field throughout the whole source code of the page, before the scraper is applied.

Example: The page you wish to scrape contains both “USD” and US$. You wish to normalize it before scraping:

Description:

#replace#

Format:

US$

Replace:

USD

#scrapeIf# data will only be extracted from the page if this scraper line matches something in the page source code.
#scrapeIfNot# data will only be extracted from the page if this scraper line doesn’t match anything.

Example: You want to scrape only pages that contain “breaking news”:

Description:

#scrapeIf#

Format:

breaking news

#scrollToEnd# instructs the scraper to scroll down to the end of the page and wait for the number of seconds set in the replace field (usually for AJAX pages, in order to leave time for the page to be refreshed).

Processing:
#addToQueue# stores the data scraped by the line in a global variable. The queue can then be accessed with the #nextToVisit()# function. (See below for more info.)

#exclude#myFieldName# If this directive is used, the content of the ‘Format’ field of the scraper line will not be accepted as a value for myFieldName. If the line matches with the excluded value, the match will be ignored.
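Example (hypothetical): if a ‘Price’ line sometimes matches the placeholder N/A and you prefer to ignore those matches:

Description:

#exclude#Price#

Format:

N/A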
#newRecord# each time the pattern of this scraper line matches a string in the page source, a new record (new row) is created in the result datasheet. The pattern to match can be entered either in the ‘marker before’ field or in the ‘format’ field.
#repeat#myFieldName# the matching or replacement value will be added in a column named myFieldName to all rows following the match of this scraper line.

Example: Say you have a page where the data to scrape is divided by continent, with headings like:

Continent: XXXXX

You can set the scraper to add the continent in a column for every row by adding:

Description:

#repeat#Continent

Marker Before:

Continent:

Marker After:

The repeat directive can be used to set a fixed value in a column by only entering a string in the Replace field:

Example: For inputting data directly into your database without any touchup in the process, you need to add the field “location” with a set value:

Description:

#repeat#Location

Replace:

New Delhi

Note: if a variable is entered in the Replace field, all its values will be concatenated in the repeated output.
#start# switches scraping on. Data will start being extracted in the part of the source code following the match of this scraper line. (Directives are not limited by #start# and #stop#. For instance, if the #scrapeIf# directive matches outside of the start/stop zones, it will still be executed.)

Example: You only want to start scraping after a given title, say:

Synopsis:

You simply need to type the string in the Format field of your scraper line:

Description:

#start#

Format:

Synopsis:

#stop# switches scraping off. Data extraction will stop after the match of this scraper line in the source code. (But the code analysis continues and scraping will start again if a #start# line matches.) Note that if the #stop# line matches before a #start# line (or if there is no #start# line), a #start# directive is implied at the beginning. In other words, in order to be able to stop, the scraping needs to start. Directives are not limited by #start# and #stop#. For instance, if the #scrapeIf# directive matches outside of the start/stop zones, it will still be executed.
#variable#myVariableName# Declares and sets the value of the variable (#myVariableName#). The occurrences of the variable are then replaced, at application time, by the scraped value in all other lines of the scraper. Variables can only be used within the scope of one scraper execution. They cannot be used to transfer information between two scrapers.

Example: Setting and using the variable ‘trend’.

line 1:

Description:

#variable#trend#

Marker Before:

Dow Jones:

Marker After:

 

Format:

/[-+\d,.]+/

line 2:

Description:

#showAlert#

Replace:

#if(#trend#<0,Bear,Bull)#

Anchor Functions: The need for these functions is relatively rare; they will help you solve difficult cases where the data is presented in columns in the HTML page, using blocks floated left or right. #setAnchorRow# stores the row number where this scraper line matches, so that data found later in the page source code can be added to the result table as additional columns, starting at this row number. Thus, when the directive #useAnchorRow# is encountered (and if an anchor row has been previously set), the following fields of data are added starting at the anchor row, until the #useCurrentRow# directive reverts to the normal behavior, adding a new row at the bottom of the result table each time a record separator is found.

Post-Processing:
#nextPage# allows you to tell OutWit Hub how to find the link to the next page to use in an automatic browse process. Use this when the Hub doesn’t find the next page link automatically, or when you wish to manually set a specific course for the exploration. NOTE: As with any feature in scrapers, the next page directive is only applied when the scraped view is active (which means that the view’s bottom panel has non-default settings and the view name is in bold in the side panel).

Example: A typical next page scraper line.

Description:

#nextPage#

Marker Before:

Next page

Format:

/[^"]+/

Replace:

#BASEURL#\0

#cleanData# and #originalData# override the ‘Clean Text’ checkbox in the scraped view. When original data is set, the data is left as is (including HTML tags and entities), when clean data is used, HTML tags are removed from the scraped data.
#nextPage#x# You can add a positive integer rating in the next page directive: if several nextPage directives are used, the first matching line of the highest rating will be chosen. Use #nextPage#0#, the lowest, for the default value. If #nextPage# is used without a rating parameter, it will be considered as the highest rated.

Example: You want to go to the link of the text “Next Page”, if found, or go back to the previous page otherwise:

line 1:

Description:

#nextPage#0#

Replace:

#BACK#

line 2:

Description:

#nextPage#3#

Marker Before:

Next page

Replace:

#BASEURL#\0

#normalizeToK#myFieldName# and #normalizeToUnits#myFieldName# normalize the numerical values in the field myFieldName: they convert them to k units (km, km2, kg…) or decimal units (m, m2, m3, g…) respectively, remove thousand separators and use the dot as a decimal separator.
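For instance (with a hypothetical ‘Distance’ field), a scraped value like ‘1,500 m’ should become 1.5 (km) with #normalizeToK#Distance# and 1500 (m) with #normalizeToUnits#Distance#.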

Debug directives:

#showAlert# displays an alert with the data scraped by the directive line. If only the ‘Replace’ field is filled, the alert will be shown at the end of the scraping.
#showMatches# displays an alert with all the strings that match the scraper patterns.
#showNextPage# displays an alert with the value of the selected next page URL.
#showNextPageCandidates# displays an alert with the list of possible next page URLs found.
#showRecordDelimiter# displays an alert with the name of the field selected as the record delimiter for this scraper.
#showResults# displays an alert with the data grabbed by the scraper.
#showScraper# displays an alert with the content of the scraper as interpreted by the program.

#showScraperErrors# displays an alert if an error occurs. (Most of the time alerts are not welcome as they would block the execution of automatic tasks.)
#showServerErrors# creates a separate column in the result datasheet with error messages returned by the server.
#showSource# displays an alert with the source code to which the scraper is applied (after replacements made by the #replace# directive).
#showOriginalSource# displays an alert with the original source code that was sent to the scraper (before alterations).
#showVariables# displays an alert with the values of all variables.
#showVisited# displays an alert with the list of the URLs visited since the beginning of the browse process.
#simulate# instructs the program to process the scraper without actually applying it. The interpretation is performed and some directives will still work, allowing you to display information for debug. This can be helpful if the scraper application fails –in particular in case of freezes during the application of scrapers with too complex or faulty regular expressions– in order to seek the cause of the problem.

Time Variables (pro version)
The following variables can be used in the ‘Replace’ field to complement or replace the scraped content.
Use #YEAR#, #MONTH#, #DAY#, #HOURS#, #MINUTES#, #SECONDS#, #MILLISECONDS#, #DATE#, #TIME#, #DATETIME# in the ‘Replace’ field to insert the respective values in your replacement string.

Example:

You can add a collection time to the scraper using both a directive and a time variable:

Description:

#repeat#Collected On

Replace:

#DATETIME#

Navigation Variables (pro version)
The following variables can be used in the ‘Replace’ field to complement or replace the scraped content.
Use #URL# (current page URL), #BASEURL# (current page path), #DOMAIN# (current domain), #BACK# (previous page in history), #FORWARD# (next page in history) in the ‘Replace’ field to insert the respective values in your replacement string.

Example:

You just want the source domain in a column ‘Source’:

Description:

#repeat#Source

Replace:

Collected on #DOMAIN#

Redirections:
#REQUESTED-URL# gives the URL that was queried or clicked on.
#REDIRECTED-URL# returns the URL the browser eventually landed on after a redirection, if any, and returns nothing if there was no redirection.
#TARGET-URL# returns the URL the browser eventually landed on after a redirection, if any, and returns the requested (current) URL if there was no redirection.

Host Info:

#HOSTNAME# returns the most probable name of the organization hosting the current Web page.
#HOSTCOUNTRY# (Enterprise version) returns the most probable country of the current Web page.

#ORDINAL# returns the ordinal number of the page being scraped in an automatic exploration. (Note that this is different from the Ordinal ID column in datasheets. The number returned by #ORDINAL# is the first group of digits that constitute the Ordinal ID.)
#COOKIE# returns the content of the cookie(s) that have been set in your browser by the current Website if any.

Replacement functions (pro version)
The following functions can be used in the ‘Replace’ field to alter the scraped content.
These are executed when the scraper line (markers and/or format) match a string in the source code.
NOTE: these functions are still subject to evolution. At this point, they can only be used alone in the Replace field or within a variable declaration.

Put #AVERAGE#, #SUM#, #MAX#, #MIN#, #CONCAT#, #HAPAX#, #UNIQUE#, #STRICTLY-UNIQUE#, #DISTINCT#, #STRICTLY-DISTINCT#, #FIRST#, #LAST#, #SHORTEST# or #LONGEST# in the ‘Replace’ field to replace the scraped values by the corresponding total calculation. (Note that totals cannot serve as record separator. They will only work if not located on the first line of a scraper.)

#AVERAGE#: if scraped values are numerical, the result is replaced by the arithmetic mean of these values

#SUM#: if scraped values are numerical, the result is replaced by the sum of these values
#MIN#: if scraped values are numerical, the result is replaced by the minimum value, otherwise by the first in alphabetical order

#MAX#: if scraped values are numerical, the result is replaced by the maximum value, otherwise by the last in alphabetical order

#CONCAT#: all values are concatenated, using semicolons as separators
#COUNT#: the number of occurrences

#HAPAX#: if only one occurrence is found, it is returned, otherwise the field does not return anything
#UNIQUE#: if only one value is found (whatever the number of occurrences), the value is returned, otherwise the field does not return anything
#STRICTLY-UNIQUE#: (case sensitive) if only one value is found (whatever the number of occurrences), the value is returned, otherwise the field does not return anything
#DISTINCT#: all distinct values are concatenated, using semicolons as separators; duplicate values are ignored (even if in different cases)
#STRICTLY-DISTINCT#: (case sensitive) all distinct values are concatenated, using semicolons as separators; exact duplicates are ignored
#DISTINCT-COUNT#: creates two columns (fields). The first one with the COUNT, the second with the DISTINCT concatenation.
#STRICTLY-DISTINCT-COUNT#: creates two columns (fields). The first one with the COUNT, the second with the STRICTLY-DISTINCT concatenation.
#FIRST#: only the first occurrence is returned
#LAST#: only the last occurrence is returned
#SHORTEST#: only the shortest matching occurrence is returned
#LONGEST#: only the longest matching occurrence is returned
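Example (hypothetical): to extract all the prices of a page and only keep their average in an ‘Average Price’ column (remember that such a line cannot be the first line of the scraper, as totals cannot serve as record separator):

Description:

Average Price

Marker Before:

Price:

Format:

/[\d.,]+/

Replace:

#AVERAGE#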

Operations: #(term1 operator term2)# Works with the following operators: + (addition of integers: 1+3=4; concatenation of strings: out+wit=outwit; incrementing characters: c+3=f), – (subtraction of integers: 5-2=3 or decrementing chars: e-3=b ), * (multiplication), / (division), ^ (power), <, >, =, ==, !=,… (comparison operators): a=A (case-insensitive comparison), a==a (case-sensitive comparison), a!=b (not equal, case insensitive), a!==b (not equal, case sensitive). The terms can be literals, variables or functions. When using equality operators on strings (=, !=, ==, !==), you can now use the wildcard % in the second term to replace any string. (ex. these three statements are true: headstart = Head% ; homeland == h%d ; lighthouse = %HOUSE).
Conditions: #if(condition,valueIfTrue,valueIfFalse)# or #if(condition;valueIfTrue;valueIfFalse)# for conditional replacements. The separator used between the parameters (comma or semicolon) must not be present in the parameters themselves.
Lookup lists: #lookUp(value,listOfValuesToFind,listOfReplacementValues)# or #lookUp(value;listOfValuesToFind;listOfReplacementValues)# for replacing lists of values. The parameters listOfValuesToFind and listOfReplacementValues must include the same number of items, separated by commas or semicolons. The elements of the first list will be respectively replaced by those of the second. The separator used between the parameters must not be present in the parameters themselves.
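Example (hypothetical): to convert a scraped three-letter month into its number, using semicolons between the parameters and commas inside the lists, something like

#lookUp(\0;Jan,Feb,Mar;01,02,03)#

entered in the ‘Replace’ field should replace Jan by 01, Feb by 02 and Mar by 03.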
Replace function (not to be confused with the replace directive): #replace(originalString,stringToFind,replacementString)# or #replace(originalString;stringToFind;replacementString)# replaces the first occurrence of stringToFind by replacementString in originalString.
URL alteration functions: #getParam(URL,parameterName)# returns the value of a parameter in the passed URL and #setParam(URL,parameterName,parameterValue)#, to assign a new value to a parameter. When used in conjunction with #URL# in the #nextPage# directive line, this function allows you to easily set the value of the next page URL in many cases.
Alert: #alert(Your Message)# Displays an alert with the message passed as a parameter (blocking the scraping process while it is shown).

Example:

This scraper line will generate the next URL to explore, incrementing the parameter ‘page’ in the current URL.

Description:

#nextPage#

Replace:

#setParam(#URL#,page,#(#getParam(#URL#,page)#+1)#)#

Automatic Exploration and Hierarchical Scraping (pro version)
It is now possible for a scraper to set the URL of the next page to explore in a browse process (see #nextPage# directive above). Together with this feature comes a replacement function which allows advanced users to develop powerful scraping agents:

#nextToVisit(#myURL#)#, in the ‘Replace’ field, instructs the Hub to give the variable #myURL# the next value which is not found in the list of visited URLs. If you set #variable#myURL# in a scraper line, and if this line matches say 10 strings within the source code of the page, this variable will contain an array of 10 values. The #nextToVisit# directive will give #myURL# the value of the first URL which hasn’t been explored in the current Browse process. This means that, used in conjunction with #nextPage# and #BACK# you can create complex scraping workflows. You can, in particular, create multi-level scraping processes.

#addToQueue# and #nextToVisit()#: This follows exactly the same principle, but without declaring a variable. It is simpler to use but it offers a little less control as it only allows you to have a single stack of URLs to explore. Contrary to variables, the queue can be accessed by any scraper during the process of an exploration. You can put URLs in the queue with one scraper and refer to it with another.

Example 1: Two-level scraping using #addToQueue# and #nextToVisit()#

Say you have a page named ‘Widget List’ with a list of URLs leading to the ‘Widget Detail’ pages where the interesting information is. You just need to create two scrapers:

Scraper #1:

Apply if URL contains:

widget-list

line 1:

Description:

#addToQueue#

Marker Before:

See Widget Description

Replace:

#BASEURL#\0

Line 2:

Description:

#nextPage#

Replace:

#nextToVisit()#

Scraper #2:

Apply if URL contains:

widget-detail

line 1:

Description:

#nextPage#

Replace:

#BACK#

line 2…:

… scrape the data here.

Example 2: Two-level scraping using a variable #nextToVisit(#extractedURLs#)#

Same scenario, but this time, using a variable (for instance because you wish to keep two different kinds of URLs in separate piles):

Scraper #1:

Apply if URL contains:

widget-list

line 1:

Description:

#variable#extractedURLs#

Marker Before:

See Widget Description

Replace:

#BASEURL#\0

Line 2:

Description:

#nextPage#

Replace:

#nextToVisit(#extractedURLs#)#

Scraper #2:

Apply if URL contains:

widget-detail

line 1:

Description:

#nextPage#

Replace:

#BACK#

line 2…:

… scrape the data here.

Note: This may look confusing, but it’s not all that bad once you have grasped the principle.
The idea is that you often have a list L1 that links to another list L2 (n times), which in turn links to the pages P where you want to scrape your data.

Think of it from the end:

You have to make a page scraper (#2 in the example above) for the data in P with #nextPage# set to #BACK# (It’s the “leaf” at the end of the branch, so the program will backtrack once the page is scraped.)
You also have to make one or several list scrapers where you will extract the links from L1, L2… into a variable like #extractedURLs#.
In the list scraper, you also need to set #nextPage#1# (higher priority) to #nextToVisit(#extractedURLs#)# to explore all the pages one after the other,
and, finally (still in the list scraper), set #nextPage#0# (default value) to #BACK#, to backtrack to the higher level, once all #extractedURLs# of the level have been visited.

One of the tricky things is to make sure that each scraper will apply to the right kind of page using the “URL contains” field. This may require a regular expression.

Applying a Scraper to a Page (or Series of Pages)
If you simply want to apply the best matching scraper to the current URL (the page loaded in the page view), just go to the scraped view. If you want to apply it to a series of pages or to the content of a site, you can set the scraped view’s bottom panel as you want (uncheck ‘Empty’ to keep the results in the scraped view OR check ‘Catch selection’ to move them to the catch) and use the Browse or Dig commands to explore the pages you want.
If you need to apply a scraper to a whole list of URLs, another way is to select the rows containing the links you want to scrape (in any view: usually ‘the Catch’, ‘links’, ‘lists’ or ‘guess’), then right-click (ctrl-click on Macintosh) on one of the URLs to scrape (they should all be in the same column) and, in the contextual menu, select ‘Auto-Explore’ > ‘Fast Scrape’. For each of the selected URLs the resulting data will be added in the datasheet of the Scraped view. (A throbber beside the view name shows that the process is running.)
Note that the two methods above are different: applying a scraper by going to the scraped view does the extraction from the source code of the page loaded in the Hub’s browser, whereas using the ‘Fast Scrape’ function on Selected URLs, the program runs an XML HTTP Request for each URL, but doesn’t really load the pages (ignoring images etc.). Most of the time, the result is the same, but the ‘Fast Scraping Mode’ is simply… faster. In some cases, however, the pace can be too high for the server. In other cases, the results can be different or the fast scraping mode can even completely fail: the reason is that in the normal mode, events can happen that dynamically alter a page (mostly due to the execution of javascript scripts). These dynamic changes will not occur in the fast scraping mode, as scripts are not executed. This means that dynamically added information, javascript redirections, page reloads… will simply not happen in fast scraping mode. If you notice this kind of behavior, the best way is to accept the slower method and browse through the URLs, doing the scraping page after page.

Temporization
You can set the exploration speed in the Time Settings tab of the Preferences panel (Tools menu). By default, the temporization between pages is set to 4 seconds. You can lower it as much as you want, but do make sure that you are respecting the sites’ terms of use and that you are not overusing the servers.

Use of Regular Expressions
Regular Expressions are a powerful syntax, used to search specific patterns in text content. They can be used in several places of OutWit Hub:

In the bottom panel of each widget (images, links, contacts…) the Select If Contains text box allows you to select items of the list above it that contain the typed string. By starting and ending the string with the character / you can use Regular Expressions in these text boxes.
In the Scraper Editor located in the ‘Scrapers’ widget, Marker Before and Marker After can be either a literal string or a Regular Expression. Format is always interpreted as a Regular Expression.
Lastly, the ‘URL to Scrape’ attributed to a scraper can also be a regular expression. In this case, the scraper can be applied to any URL matching the pattern.

To use regular expressions, write your string between slashes: /myRegExp/. The pattern will be displayed in green when the syntax is correct, in red otherwise.
IMPORTANT NOTES:

The ‘Format’ field of the Scraper Editor is always interpreted as a regular expression, even if not marked with slashes.

For technical reasons, all regular expressions used in scrapers are interpreted as case-insensitive patterns by default: [A-Z], [A-Za-z] and [a-z] give the same result. This can be changed using the #caseSensitive# directive.
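
In JavaScript-style terms, this default behaves as if every pattern carried the i (ignore case) flag:

console.log(/[a-z]+/i.test("PHONE"));    // true
console.log(/[A-Z]+/i.test("phone"));    // true
console.log(/[A-Za-z]+/i.test("Phone")); // true

// Without i (the behavior the #caseSensitive# directive restores):
console.log(/[a-z]+/.test("PHONE"));     // false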
Here is what you should know if you are using regular expressions:

Ultra Quick Start Guide
Quick Start Guide
More

Regular Expressions Ultra Quick Start
Regular expression patterns are strings to match in a text, surrounded with / (slashes) and including a series of reserved characters used as wildcards (i.e. representing classes of characters or notable positions, like word boundaries).

The three most useful patterns:

use the pattern \s* to match a succession of zero or more space characters, tabs, returns, etc.
use the pattern [^<]+ to match a succession of one or more characters until the next <
use the pattern [a-z]+ to match a succession of one or more letters
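
A quick demonstration of the three patterns on a made-up sample:

const sample = '<span   class="label">\n  Phone\t</span>';

console.log(/>\s*Phone/.test(sample));     // true: \s* absorbs the return and spaces
console.log(sample.match(/>[^<]+</)?.[0]); // everything from > up to the next <
console.log(sample.match(/[a-z]+/i)?.[0]); // "span": the first run of letters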

The two mistakes you are most likely to make, once you have learned more about RegExps:

The character . (dot) doesn’t mean ‘any character’ but ‘any character except line-break characters’ (returns, line feeds, form feeds, etc.), so do not use .* to say ‘anything’. Instead, use [\s\S]*, which matches any succession of space or non-space characters, i.e. anything at all, including line breaks.
Among the characters that need to be escaped in a RegExp pattern is the very common / (slash). If you forget to escape these, the regular expression will not work. You need to escape it like this: \/ (backslash followed by slash).
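
Both mistakes are easy to check (JavaScript-style syntax, as above):

// Mistake 1: the dot does not cross line breaks, [\s\S] does.
const twoLines = "first line\nsecond line";
console.log(/first.*second/.test(twoLines));      // false: .* stops at the line break
console.log(/first[\s\S]*second/.test(twoLines)); // true

// Mistake 2: a literal / must be escaped inside a /…/ pattern.
console.log(/<\/span>/.test("some text </span>")); // true: \/ matches the slash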

Example:

The pattern /<span[^>]+>\s*Phone\s*:/ will match any opening <span …> tag followed by Phone, followed by the colon character (:), whatever the number of spaces, tabs or returns between these elements.
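
Testing this pattern against a few made-up source fragments:

const pattern = /<span[^>]+>\s*Phone\s*:/i; // i mirrors the Hub's case-insensitive default

console.log(pattern.test('<span class="label">Phone :')); // true
console.log(pattern.test('<span id="p">\n\tPhone:'));     // true: \s* absorbs the return and tab
console.log(pattern.test('<div>Phone:'));                 // false: not a <span> tag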

To learn more about Regular Expressions, you can go to our RegExp Quick Start Guide.

Regular Expressions Quick Start
Marking a Regular Expression: /myRegExp/
Most of the time, simple strings will be enough as markers or selection criteria. A literal string is typed and searched as is in the data. Therefore, if you want to use a regular expression instead, you must mark it so that the program can identify it as such. This is done by adding a / before and after the regexp pattern.

Escaping Special Characters
Characters that have a meaning in the regular expression syntax, like . $ * + ? | - ^ \ ( ) { } [ ] /, should be ‘escaped’ when used literally (i.e. when they stand for the character itself, not for part of the syntax). Escaping means placing a backslash character \ before the special character to have it treated literally. To search for a backslash character, for instance, double it (\\) so that the first one escapes the second.
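
A short check of what escaping changes:

// Escaped, $ and . are literal characters; unescaped, they are syntax:
console.log(/\$9\.99/.test("price: $9.99")); // true
console.log(/\$9\.99/.test("price: $9x99")); // false: \. only matches a real dot

// A doubled backslash matches one literal backslash:
console.log(/C:\\Users/.test("C:\\Users\\me")); // true (the string contains C:\Users\me)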

Most common “special” characters in regular expressions

Wildcard
. (dot): any character except a line break (or carriage return)

Character Classes (Ranges of Characters)
In a character class, a caret character ^ placed immediately after the opening bracket negates the class: [^...] matches any character not listed.
[abc] list: any of the characters a, b, c
[^abc] exclusion list: any character except a, b, c
[a-z] range: any character from a to z
[^aeiou] any character which is not a vowel
[a-zA-Z0-9] any character from a-z, A-Z, or 0-9
[^0-9aeiou] any character that is neither a digit nor a vowel

Escaped matching characters

\r line break (carriage return)

\n Unix line break (line feed)

\t tab

\f page break (form feed)

\\ backslash

\s any space character (space, tab, return, line feed, form feed)

\S any non-space character (any character not matched by \s)

\w any word character (a-z, A-Z, 0-9, or _)

\W any non-word character (all characters not included by \w, incl. returns)

\d any digit (0-9)

\D any non-digit character (including carriage return)

\b any word boundary (position between a \w character and a \W character)

\B any position that is not a word boundary

Alternation
| (pipe): Separates two expressions and matches either

Position
^: (when not in a character class) beginning of string

$: end of string

Quantifiers
x*: zero or more x

x+: one or more x

x?: zero or one x

x{COUNT}: exactly COUNT x, where COUNT is an integer

x{MIN,}: at least MIN x, where MIN is an integer

x{MIN,MAX}: at least MIN x, but no more than MAX (note: no space after the comma)

Note:

+ and * are ‘greedy’: they match the longest string possible. If you do not want this “longest match” behavior, you can use non-greedy quantifiers by adding a ?.

*?: zero or more (non-greedy)

+?: one or more (non-greedy)

??: zero or one (non-greedy)

For example, applied to the text <b>one</b> <b>two</b>, the greedy pattern /<.*>/ matches the whole string, from the first < to the last >, whereas the non-greedy /<.*?>/ matches only the first <b>.
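
The same contrast, as a quick check:

const text = "<b>one</b> <b>two</b>";
console.log(text.match(/<.*>/)?.[0]);  // "<b>one</b> <b>two</b>" (greedy: longest match)
console.log(text.match(/<.*?>/)?.[0]); // "<b>" (non-greedy: shortest match)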
