Google Webmaster Tools

How to capture the ‘not provided’ data – xxx – 201309xx

http:xxx

It’s official. The last nail in the ‘not provided’ coffin has been hammered in, and now all of us have to live without access to keyword data. It had been coming for years. Some of us took note of it. Many didn’t.

A few months ago, Niswey did. When we saw we were dealing with well over half of a client’s website data being under ‘not provided’, we decided we had to do something. Whatever we did, the campaigns we ran, the relevant content we wrote, the webinars we did, would not amount to much if we didn’t know what the big chunk of the traffic was really doing at the site.

Even if we didn’t get to the actual figures, we really needed to see what the trends were. That’s when we hunted for how to get to know the trends hidden behind ‘not provided’. We set it up in Google Analytics, and we now have a very good picture of the keywords and their impact.

Our strategies for customers are not driven by keywords; our focus is on creating a relevant customer experience at the website. But the keyword data would give us a pretty good analytics picture to support our strategies.

Here’s how we set up access to the ‘not provided’ data.

1. Log into your Google Analytics account. Select the required profile. Now click on the Admin tab. You need to have administrator privileges for this.

2. Under the Profile section, click on Filters.

3. Now click on Create New Filter and enter the filter name. In the Filter Type field, select Custom Filter. Now select the Advanced radio button.

4. Now fill up the values in the boxes as mentioned here:

a. Against Field A->Extract A, select Campaign Term and (.not provided.)

b. Against Field B->Extract B, select Request URI and (.*)

c. Against Output To -> Constructor , select Campaign Term and np – $B1

Select the radio buttons as shown in the image: Yes for Field A Required, Field B Required and Override Output Field, and No for Case Sensitive.

5. Click Save.
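
If it helps to see what those settings actually do, here is a rough sketch of the same logic in plain JavaScript (the hit object and its field names are invented for illustration; the real work happens inside Google Analytics):

    // Sketch of the Advanced filter from step 4 (illustrative only).
    // Field A -> Extract A: Campaign Term, (.not provided.)
    // Field B -> Extract B: Request URI, (.*)
    // Output To -> Constructor: Campaign Term, np – $B1
    function applyNotProvidedFilter(hit) {
      var fieldA = /(.not provided.)/;  // matches "(not provided)"
      var fieldB = /(.*)/;              // captures the full request URI as $B1
      var a = fieldA.exec(hit.campaignTerm);
      var b = fieldB.exec(hit.requestUri);
      if (a && b) {                     // both fields are required
        hit.campaignTerm = 'np – ' + b[1];  // override the output field
      }
      return hit;
    }

    // Example: a visit that Google reports as "(not provided)"
    var hit = { campaignTerm: '(not provided)', requestUri: '/blog/how-to-do-x/' };
    console.log(applyNotProvidedFilter(hit).campaignTerm);  // "np – /blog/how-to-do-x/"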

Now you will find the ‘not provided’ data in the keywords section in Google Analytics.

After the setup, you will see that the ‘not provided’ data drops to zero.

Do remember that this will not give you the exact keywords. But it will show you the pages on which the visitors are landing. If you look at the data carefully, you will usually be able to make a good guess at the keywords they searched for.

If you think keyword data from Google Analytics was the best thing ever for your work, the process described above will give you the next best thing.

Update: As pointed out by David on Twitter, the credit for the process goes to Dan Barker, while we learnt it from SearchEngineWatch. Thanks for letting us know David :)


By Premnath321 | September 27, 2013 | Blog | 16 Comments

16 thoughts on “How to capture the ‘not provided’ data”

    1. Premnath321 (Post author): @Dheeraj: pls do share results too :)

       Says: Hi, I set everything up as you mentioned in your post, but it’s not working for me.

       Reply (Post author): The data will start appearing going forward, once you have completed the setup. You cannot find previous data.


      Google Keyword ‘(Not Provided)’: How to Move Forward – Ray Comstock – 20131016

      http://searchenginewatch.com/article/2300838/Google-Keyword-Not-Provided-How-to-Move-Forward

      [Chart: STAT Search Analytics – not provided Counter]

      Without a doubt, Google’s recent changes make performance reporting less accurate. SEO professionals and marketers no longer have the raw data that we once used to measure SEO results. We will need to use different KPIs and trending metrics to approximate the data that is now lost.

      However these changes aren’t a surprise. It has been widely assumed by the SEO community for some time that this change was going to happen (although few expected it to be so soon).

      Google isn’t the only company making “secure search” a priority. Browsers such as IE10, Firefox 14+, and Mobile Safari have put measures in place to mask keyword referral data.

      Fortunately, many SEO professionals and organizations have been preparing for this eventuality. It starts with having a solid plan in place to report on data that we know historically has a high correlation to the success that we were once able to directly measure.

      The good news is that, unlike Google’s Panda and Penguin updates, this change doesn’t affect our approach to optimization; for the most part it only affects performance reporting (and the ability to use analytics data for keyword research).

      As Google’s Distinguished Engineer Matt Cutts has said: “Succeeding in SEO will be the same as it’s always been if you’re doing it right – give the users a great experience.”

      By developing user-centered content that is valuable and informative, and publishing to the web using best practices, you will see positive business results – assuming that you’re coupling that with other SEO best practices.

      Let’s dive into a new approach to SEO performance reporting using the metrics we still have in conjunction with a couple new KPIs. This extrapolation should provide quite accurate organic search performance results and allow you to understand if you’re successfully driving more visits based on your optimization activities.

      Understanding the Background

      To SEO and digital marketing professionals alike, the “(not provided)” or “keyword unavailable” issue has been going on since late 2011. Since that time, Google has been redirecting a growing number of users to a secure page for their search activity (https://www.google.com). The end result is that all search referring data that traditional analytics tools have used to understand which keywords drove visitors from Google is now blocked.

      When Google initially launched their secure search, many marketers began seeing that a portion of their keyword data in Google Analytics fell into a “(not provided)” category. In fact, Google estimated at that time that keyword unavailable searches wouldn’t exceed 10 percent.

      Initially, a searcher had to be logged into one of their respective Google accounts in order to produce any sort of keyword “(not provided)” data. This meant that referring keyword data was no longer being fully displayed in analytics as Google aimed to provide their users an amount of privacy when searching.

      However, the percentage of organic search keyword traffic coming from keywords that were “(not provided)” grew steadily in past years, to the point that many sites were accumulating more than 50 percent of keyword “(not provided)” data (and in some cases upwards of 80 percent or more).

      Things changed again in late September when Google rolled out major changes toward encrypting search activity altogether. Now, when any user goes to Google to search, they are automatically redirected to the https:// version of Google, i.e. an SSL-encrypted search.

      This update only affects organic search data. Paid search data from Google continues to report on keyword referrals.

      There is no doubt that secure search will be the trend going forward and we should assume for the sake of planning and scalability that keyword referral data is a thing of the past from an analytics perspective.

      What the Loss of Keyword Data Really Means

      Quick summary:

      • Changes only affect how we measure and report SEO performance.
      • Organic traffic from Google can no longer be tracked at a keyword level via analytics.
      • There will be a limited amount of keyword referral data available in Google Webmaster Tools.
      • No longer have visibility into traffic numbers:
        • Brand / Non-Brand.
        • Long-Tail Performance.
        • By Keyword Group.
      • Decrease in visibility for new keyword opportunities based on analytics data.
      • We need to use a different metric set to understand SEO performance.
      • We should expand the number of keywords we check rankings for in Google that correlate to high performance URLs.

      Again, SEO still works the same way. But, not having keyword performance data affects SEO practitioners and digital marketers in two distinct ways.

      1. How to Measure Success and Performance

      SEO professionals have historically used a combination of ranking, traffic, and conversion metrics as the primary KPIs to measure SEO performance.

      Now, based on the new Google change, the following metrics are still available:

      • Overall Organic Search Traffic By Engine
      • Total conversions from Organic Traffic / By URL
      • Search Rankings for Critical Terms
      • Search Rankings by Page Tags / Types
      • Search Rankings by Keyword Tag

      These are no longer available:

      • Year-Over-Year Brand / Non Brand Total SEO Traffic
      • Year-Over-Year SEO Traffic by Keyword Tag
      • Conversions by Keyword / Keyword Tag
      • Keyword Traffic Patterns by URL
      • Long-Tail Keyword Traffic Patterns

      2. How to Research Keyword Opportunities in the Era of “Keyword Unavailable” Performance Data

      This is a much smaller issue but still deserves attention. Historically, analytics data has been an excellent source for uncovering additional keyword opportunities and long-tail permutations that had a propensity to drive traffic. However, this data was used largely in conjunction with other keyword data sources like:

      • Google Keyword Planner
      • PPC / Paid Search Data
      • Competitive Analysis
      • Intuitive Understanding of the Market / User Personas
      • Third Party Tools (SEMRush, Keyword Discovery, Wordtracker, etc.)

      Going forward, greater emphasis will be placed on these data sets as the foundation of keyword research, especially PPC impression data, which will be the most accurate source of information to identify opportunity.

      How to Report on SEO Performance if Keyword Data is ‘(Not Provided)’

      What KPI set should be used as the primary gauge of SEO success going forward? Earlier we identified the historical KPIs we’ve used to measure SEO success as well as which of those KPIs are still available.

      Let’s take a more detailed look at how to use data that’s still available, and which other KPIs you should incorporate into your reporting methodology. Below are four primary metrics to measure search performance going forward.

      1. Total Organic Search Visitors

      This will still be your primary metric. “Did traffic go up or down in comparison to a previous time period and is that change substantial relative to our goal?”

      Unfortunately, because brand and non-brand segmentation of this traffic is no longer feasible, it’s less clear whether SEO efforts were primarily responsible for a shift in performance or whether it was mainly due to a shift in demand across keywords whose rankings have remained consistent. This is especially true for brand-related searches, where a company will typically rank number one for its own brand.

      Therefore, any change in brand traffic levels isn’t usually considered a result of SEO activities when the ranking doesn’t change for the brand terms. This isn’t as true for large companies that have multiple brands or sub-brands, where they are less likely to own the number one spot for all brand-related terms.

      2. URL Level Traffic

      Although we can no longer see the keywords that drive traffic to a website from Google, we can see which pages that traffic lands on. By identifying the pages that drive the most organic search traffic to the site and the keywords those pages are ranking for, we can correlate changes in traffic with changes in rankings and identify positive or negative movements.

      In many cases this will be difficult since we no longer have visibility into the keywords driving traffic (with the exception of Google Webmaster Tools data). However, we can get greater context around these traffic and ranking numbers by analyzing them in conjunction with the Google Webmaster Tools keyword data.

      A sample SEO URL performance reporting structure might look like:

      URL: http://www.example.com
      Total Traffic: Last Period: xxxx Current Period: xxxx Change: +/- xxxx
      Keyword 1:
      Rankings Last Period: #4 Current Period: #1 Change: +3
      Traffic Last Period: xxx Current Period: xxx Change: xxx
      (traffic numbers only if they exist in GWT)
      Keyword 2:
      Rankings Last Period: #10 Current Period: #6 Change: +4
      Traffic Last Period: xxx Current Period: xxx Change: xxx
      (traffic numbers only if they exist in GWT)
      ETC
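
      A hedged sketch of how that structure could be assembled programmatically (JavaScript; the input objects for traffic and rankings are hypothetical and would come from your own analytics and rank-tracking exports):

      // Illustrative only: build a URL-level report from two hypothetical inputs.
      function buildUrlReport(url, traffic, rankings) {
        return {
          url: url,
          totalTraffic: {
            lastPeriod: traffic.last,
            currentPeriod: traffic.current,
            change: traffic.current - traffic.last
          },
          keywords: rankings.map(function (kw) {
            return {
              keyword: kw.keyword,
              rankLastPeriod: kw.rankLast,
              rankCurrentPeriod: kw.rankCurrent,
              rankChange: kw.rankLast - kw.rankCurrent  // positive = moved up
            };
          })
        };
      }

      // Hypothetical example data
      var report = buildUrlReport(
        'http://www.example.com',
        { last: 1200, current: 1450 },
        [
          { keyword: 'keyword 1', rankLast: 4, rankCurrent: 1 },
          { keyword: 'keyword 2', rankLast: 10, rankCurrent: 6 }
        ]
      );
      console.log(JSON.stringify(report, null, 2));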

      3. Use Webmaster Tools

      You can still get keyword referral data in Google Webmaster Tools. It also gives you impression versus click data so you have visibility into the keywords people are using and where your site got an impression in the search results. Note that this data isn’t 100 percent accurate and is typically only available for a relatively small overall percentage of search queries for most larger companies.

      Comparing keyword traffic volumes over time will give you a trending direction for your SEO program, especially for competitive non-brand keywords.

      Therefore, using this data in conjunction with the other data points as part of a trending performance report will show the effects of the SEO program. This will be especially telling when coupled with the URL level traffic and rankings for those keywords that have data in Google Webmaster tools.

      Since the number of keywords reported on is not comprehensive and the data is not 100 percent accurate, analysis derived from GWT should be treated as trending data. It is a KPI to be considered in conjunction with total traffic, URL traffic, and search rankings in order to form a comprehensive view of the overall effectiveness of the SEO program during any particular time period.

      4. Search Rankings

      Search rankings will actually gain in importance (contrary to what Google has historically said they want) because of this update since marketers can no longer see which keywords have driven traffic to their site. Therefore, it will be important to check rankings for keywords that have historically driven traffic to your site since you won’t be able to directly measure changes in traffic levels for those keywords anymore. Analyzing ranking changes across keywords that have historically driven traffic will now be a critical tool in identifying and reacting to negative traffic changes.

      It will also be important to carefully track which URLs are ranking for which keywords in order to correlate ranking changes to traffic changes. This insight will allow us to better understand what is happening to traffic at the URL level.

      Using these four primary data sets in conjunction with one another can help you develop a comprehensive overview of your SEO performance and begin to answer questions about what happened and why.

      Here are four additional data sets that will add context to the four primary metrics:

      5. Use Google AdWords

      AdWords impression data can be used in conjunction with Google Keyword Planner data to identify new keyword opportunities.

      6. Look at Non-Google Keyword Data

      While Bing and Yahoo don’t provide nearly the amount of traffic that Google does, insights can still be made about the keywords that are driving traffic to your site, in particular at the URL level. This is especially true for those sites that have a significant amount of traffic.

      7. Look at Historical Data and Trends

      You still have all your historical keyword data in your analytics platform from before this secure search update. This data will be extremely valuable for identifying campaigns and keywords that have consistently performed well. This is important information from a keyword opportunity identification standpoint as well as for understanding URL-level traffic trends.

      We’re now using page-level data in conjunction with ranking data to understand performance changes (since we don’t know exactly what’s happening anymore in terms of which keywords are driving traffic and which keywords have declined in traffic).

      By researching historical trends for the URLs that are being reported on, you can get a better idea of the keywords that have historically driven traffic and whether those keywords were primarily brand or non-brand keywords. This allows you to better understand the cause and effect of traffic changes to those URLs.

      Historical data also gives insights into the seasonality of your market. This allows you to better understand the potential causes of performance changes.

      8. Google Trends

      Google Trends can give you insights into what is trending and thus what is bringing you traffic (especially as it relates to understanding how your brand traffic might be performing).

      Summary

      Using data analysis to understand and identify performance changes is critical for SEO professionals so that they can quickly and effectively respond to negative changes, prioritize resources and accurately report performance to executives and other team members.

      In the past, keyword level analytics data has been the focus of this type of analysis and therefore has been critical in accomplishing these goals. In the absence of this data, based on the new Google changes, new metrics will need to be prioritized for these purposes.

      While these new metrics aren’t as accurate as keyword level data, they do provide a solid alternative to understanding SEO performance.


      Nikki Johnson says:  Solid post, Ray … I really like the punch list of metrics you’ve put together. As much as I cursed the day when Google made the announcement about 100% encrypted organic searches, we were all increasingly forced to use these types of metrics to deal with the gaping hole of “not provided” keywords as it grew wider and wider with each passing month — even before the official pronouncement was made. (I address that sentiment in a little greater detail here: http://www.plugingroup.com/google-keyword-search-data-changes.) Given how many keywords were falling into that “not provided” category, I’m almost relieved that it’s out of its misery, based on how crippled it had become toward the end. It was becoming increasingly difficult to work with an incomplete data set. Thanks again for the thoughtful coverage of where to turn now.

      Says: - Hi, thanks for the post.

      If we use the search term exclusion list with Universal Analytics, do you know if the keyword is excluded even though it is not provided – or not ?

      For example let’s say I want to exclude my own brand : will the search terms report include in the not provided section people who came from organic traffic, but using my brand as a keyword ?

      I am finding the Search Queries Report in Google Webmaster Tools most useful now for getting organic keyword data. I had almost completely forgotten about this old and once-fairly-useless report! Although the data is made up of Google approximations, it gives a good overview of which keywords are performing well and which might be slipping. You might find my post on using this report useful: http://www.koozai.com/blog/analytics/webmaster-tools-organic-keywords-rankings/

      CidJeremy – A lot of good information. How my team and I have adapted to the lack of keyword data from Google is to focus on visitor behavior patterns. Meaning, where they enter the site, then what actions they take next. I’ve always believed this to be a more accurate indicator of the value your website provides anyway. If you’re able to accurately target your ideal customer and direct them to the correct page based on their search query, you can take them by the virtual hand and guide them through the “sales cycle” of your website. Entry and exit pages should be a key metric to follow.

      PrashantJain – It was very much expected that Google would take keyword data out of the equation. This change is definitely going to impact strategy planning for SEO webmasters. I agree that the focus will now shift to URLs rather than the keywords driving traffic to the site.

      We all will have to analyze our webpages and improve user experience there.

      Says:
      I don’t mind that keyword data won’t be available in the UK at some point in the near future; I think it puts the focus on creating better content in the long term. However, I do feel that performing some of the suggested methods above is going to be rather time consuming. I certainly don’t have time to trawl through historical data to try and spot trends. If you know your market and are already doing a good job, using best practice SEO techniques, then just continue on as you are; this shouldn’t really affect you. When it comes to reporting, most of my clients are concerned with results directly in relation to sales and enquiries.

      Says:
      I don’t think you get my point Dean. Firstly as it relates to reporting results related to sales, how are you going to prove whether a sale was generated from a brand or non brand term? If your client gives you credit for sales that originate from a previously held number one listing on a brand related term then you are fortunate but most people do not have said luxury. Secondly, if your traffic to a particular URL goes down, how are you going to figure out why it went down and whether or not the problem is something that can be fixed based on either application of best practices or correcting an inadvertent error (which happens all too frequently with larger clients who have IT teams that constantly make changes that marketing is not always aware of). Without reporting against some of the metrics I have described, you will be at a significant disadvantage to address either concern. Also I would challenge you that you don’t have time to analyze data to spot trends. I would argue that all SEO professionals should use data analysis to drive their strategy. Seems like you might need a better reporting and analysis solution. Good luck.

      Tom Slage says:
      Yep, thing’re lookin’ bleak, ma.

      But that’s what this industry is about, and I’d rather bootstrap SOMETHING and make it better whenever possible than just sit around and bellyache. Thanks for helping with that Catfish.

      That said, the biggest conundrum is the URLs with traffic dominated by branded terms. If we’re getting tiny incremental gains from long tail terms we’ll never know. Measuring SEO is now about the “big play” then, as far as effective measurement is concerned. As such, maybe it makes sense to extend reporting frequencies to look at periods that encompass more SEO activity. So instead of monthly reports, what about quarterly reports, coupled with reports of what SEO tactics took place in that period. And of course proving gains ABOVE previous trending is really what we’re after.

      Says:
      The unfortunate part of it, Tom, is that it puts more focus back on rankings, which don’t really account for long-tail traffic – traffic that, especially at the enterprise level, accounts for a significant share if not the majority of total traffic.

      Having said that, we all know what we need to be doing from an SEO perspective, and reporting on the results, while important, doesn’t change what we should do.

      Says:
      Reading what you typed below reminds me of a story about my great grandma’s state-fair-winning apple pies, which she made every year for 20 years, winning the blue ribbon every time. It’s been 5 years since she passed away, and my older sister picked up the pie making. Do you know, my sister hasn’t won a blue ribbon in the past 5 years; no matter what she does, it’s NOT THE SAME! Do I need to connect the dots for you?
      It is what it is…. “While these new metrics aren’t as accurate as keyword level data, they do provide a solid alternative to understanding SEO performance.”

      Says:
      How do I calculate accurate ROI on non-brand performance when the top 50 pages of my site are dominated by brand traffic?

      Says:
      Unfortunately Dave, there is no way to do that anymore. But you can look at the non-brand keywords that have driven traffic to those pages in the past and continue to track rankings for those keywords. Then see if there are correlations between ranking changes and traffic changes, in context with the changes you see for those keywords if they exist in Google Webmaster Tools. Make no mistake, it’s more about estimated trends now than actually measuring performance, and that is unfortunately the world we live in now. But on the flip side, it is no different than the challenge that social media folks have in understanding their performance and ROI. It will, however, be impossible to show gains and losses for long-tail, non-brand keyword traffic, which is also unfortunate.

      Says:
      First off, great post!

      In some cases, using the new advanced segments to create groups based on referral source/medium and landing page can give you a pretty accurate idea of brand vs non-brand performance.

      As long as your brand traffic usually lands on pages that are different from non-brand traffic you can effectively calculate ROI.

      I have an example showing how to do this analysis that is a bit lengthy so I will just link to it:

      http://www.digital-performance-marketing-group.com/blog/2014/1/13/get-creative-how-to-use-landing-pages-and-segments-to-make-up-for-missing-not-provided-keyword-data

      Says:
      Keyword estimator tools were always off; and not by just 30-40% but several hundred % points.

      Take a look at the youtube case study we did awhile back at http://www.youtube.com/watch?v=wK3bRjheH8o.

      Best bet: fire up the largest keyword list you can muster (with local geos), test live in Google on the PPC AdWords program, bucket those keywords into target and non-target, then get to work!

      Says:
      Excellent, thorough summary!

      Our most fruitful SEO analytics have been about the long tail and most of the fallbacks in this article will work well, if they work at all, for mainly the big head words. What a mess.

      We have tried most of these ideas but have decided that analyzing Bing and Yahoo traffic is by far the most useful if we need “visit quality” in the equation, which we emphatically do. So, we’re applying what we learn from Bing and Yahoo directly to our Google SEO, with a little help, but not much, from Google’s remaining tools.

      I would love to see a summary of what is known about the demographics, lifestyles and online behavior of Google users vs Bing and Yahoo users. There are a lot of glib stereotypes floating around, but what does the research really say?


      How to Use PPC Data to Guide SEO Strategy in a ‘(Not Provided)’ World – Ben Goodsell – 20131021

      http://searchenginewatch.com/article/2301732/How-to-Use-PPC-Data-to-Guide-SEO-Strategy-in-a-Not-Provided-World

      We can no longer precisely track traffic for Google in organic search at the keyword level. As “(not provided)” creeps its way up to 100 percent, so does the lack of our ability to track Google organic keyword conversions.

      Tell your friends, family, loved ones, the boss. Then if you haven’t immediately lost their attention with the use of acronyms and jargon, also let them know that we’re still able to measure our efforts and gain strategic insight in many ways.

      This article is an attempt to explain what we currently see in keyword reports, show how PPC data can help guide SEO efforts, and finally consolidate some initial thoughts and ideas to assist in moving forward.

      Smart SEO professionals will still prove their worth. Together we can overcome this daunting hurdle.

      What Do We See in Google Organic Keyword Reports?

      Short answer: We aren’t seeing an accurate representation of keywords people are using to get to our sites.

      The easiest way to look at this is by visualizing the browser versions that are still passing keyword referral data.

      [Chart: Google Organic Visit Share vs. Provided Query Share]

      Above, the light green series is the percentage of visits that still pass keywords, shown next to the darker series for total Google organic visits.

      In essence, we’re mostly seeing keywords from outdated versions of Safari and MSIE (Internet Explorer). So the search behavior associated with the demographics using outdated browsers is what we see coming from Google in analytics packages like Google Analytics. Probably not a comprehensive picture into what is actually happening.
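
      You can approximate this picture for your own site by segmenting Google organic visits by browser and counting how many still carry a keyword. A minimal sketch, with made-up analytics rows:

      // Illustrative only: estimate the share of Google organic visits that still
      // pass a keyword, from hypothetical rows exported out of an analytics tool.
      var visits = [
        { browser: 'Safari 5', keyword: 'blue widgets', count: 120 },
        { browser: 'Internet Explorer 9', keyword: 'widget store', count: 80 },
        { browser: 'Chrome 30', keyword: '(not provided)', count: 900 },
        { browser: 'Firefox 24', keyword: '(not provided)', count: 300 }
      ];

      var total = 0, provided = 0;
      visits.forEach(function (row) {
        total += row.count;
        if (row.keyword !== '(not provided)') {
          provided += row.count;
        }
      });

      console.log('Provided query share: ' + (100 * provided / total).toFixed(1) + '%');
      // "Provided query share: 14.3%" for this made-up data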

      Using PPC Data to Guide SEO Strategy

      Google needs marketers to be able to quantify their efforts when it comes to AdWords. Therefore, keyword data is passed and there to take advantage of.

      The thought here is that if a page performs well in a PPC campaign, it will translate to performing well at the top of organic listings, though people clicking ads versus organic listings probably behave differently to some degree.

      There are many ways PPC data could be used to help guide SEO strategy; this is just one to get the juices flowing.

      Step 1: Identify Top Performing PPC Landing Pages

      If using Google Analytics, from Dashboard click Acquisition > Adwords > Destination URLs. Assuming you have sufficient conversion tracking set up here, it should give you all the information you need to understand which pages are doing the best.

      After filtering out the homepage, sorting by the conversion metric of your choice, adding Keyword as a secondary dimension, then exporting 100 rows you will have the top performing 100 landing page/keyword combinations for PPC. Revenue is always a good indication that people like what they see.

      [Screenshot: Using PPC data for SEO strategy]
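
      If you would rather do the filtering and sorting on the export instead of in the interface, the same step might look roughly like this (hypothetical exported rows; `conversions` stands in for whichever conversion metric you chose):

      // Illustrative only: drop the homepage, sort by a conversion metric,
      // and keep the top 100 landing page / keyword combinations.
      function topPpcCombos(rows) {
        return rows
          .filter(function (r) { return r.landingPage !== '/'; })
          .sort(function (a, b) { return b.conversions - a.conversions; })
          .slice(0, 100);
      }

      // Hypothetical export
      var rows = [
        { landingPage: '/', keyword: 'brand', conversions: 500 },
        { landingPage: '/widgets/', keyword: 'buy widgets', conversions: 42 },
        { landingPage: '/gadgets/', keyword: 'cheap gadgets', conversions: 17 }
      ];
      console.log(topPpcCombos(rows));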

      Step 2: Pull Ranking Data

      Next, pull in Google Webmaster Tool Ranking data for the associated keywords. You can access this data in Google Analytics from Dashboard > Acquisition > Search Engine Optimization > Queries, or in Google Webmaster Tools.

      Specify the largest date range possible (90 days) and download the report. Then use VLOOKUP to pull in ranking data into the spreadsheet containing the top PPC landing page/keyword combinations.

      [Screenshot: Using PPC data and SEO rankings strategy]

      Step 3: Form SEO Strategy

      Now that we know where our site shows up in organic for the top PPC keyword/landing URL combinations, we can begin forming strategy.

      One obvious strategy is to make sure that the PPC and organic landing pages are the same. Sending PPC traffic to organic canonical pages can only increase the possibilities of linking and social sharing, assuming the organic page converts well.

      Another option is to filter the Average Rank column to only include first page rankings, in an attempt to identify low-hanging fruit. Once an opportunity is identified, compare SEO metrics to determine where focus should be placed and how best to meet and beat your competitors.

      Additional Thoughts on SEO Strategy in a 100% ‘(Not Provided)’ World

      1. ‘(Not Provided)’ Still Counts as Organic

      Conversion information is still applied to the organic channel, don’t forget! We no longer have the ability to say someone who Googled [hocus pocus] bought $1,000 worth of “hocus pocus” stuff. But we can say that someone clicked an organic listing, landed on the hocus pocus page, and bought $1,000 of stuff.

      Note: “(not provided)” shouldn’t be confused with the issue of iOS 6 organic traffic showing up as direct. Last we checked this was hiding about 14 percent of Google searches, but is becoming less of an issue with the adoption of iOS7.

      2. Bing Still Has Organic Keyword-Level Tracking

      Bing doesn’t use secure search, so we can still see what people are searching to get to our sites, conversions, sales, etc. Bing data could help quantify SEO efforts, but it’s still only 9.1 percent of organic search share.

      Note: People searching Bing versus Google probably behave differently to some degree.

      3. Google Webmaster Tool Search Query Data Provides Partial Insight

      Google gives us access to the top 2,000 search queries every day. After understanding limitations, the search query report can be invaluable as it gives a glimpse of how your site performs from Google’s side of the fence. Google also recently mentioned they will be increasing the amount of data available to a year!

      By linking Google Webmaster Tools with AdWords, Google also has given us a report using the same search query data except with more accurate numbers (not rounded).

      Conclusion

      Clearly, page-level tracking is more important than ever. Google has forced SEO professionals to look at what pages are ranking and where, and then pull in other sources to guess on performance and form strategies.

      Google will most likely respond to the outcry by giving us access to more detailed search query data in Google Webmaster Tools. As mentioned before, they have already announced an increase of data from 90 days to a year. This may be a sign of how they might help us out in the future.

      Hi Ben! - I believe that there are actually many ways PPC data could be used to help guide SEO strategy. I totally agree with you on your conclusion: “Google has forced SEO professionals to look at what pages are ranking and where, and then pull in other sources to guess on performance and form strategies.”

      2 ways to get around this on a budget:
      1. set up campaigns and focus on optimizing Quality Scores (which accounts for keyword relevance and landing page experience) as this information is displayed without spending a large budget to advertise certain keywords
      2. Like “cgrantski” previously mentioned, utilize Bing Webmaster Tools, though there’s also no guarantee it’s proportional to Google

      Good article, thank you. - This is not a dig at this article, just a comment about SEO in general. I am disappointed that the SEO industry, particularly those who are making quite a bit of money in it, are not doing any recent research (or publishing it, anyway) to support/disconfirm the following:

      “…people clicking ads versus organic listings probably behave differently to some degree.”

      “People searching Bing versus Google probably behave differently to some degree.”

      In an industry where people spend a lot of time decoding the Google black box, it would be nice to see similar effort going into understanding human questions. I’m referring to real research, not opinions based on experience.

      If there is such research out there, please post a link!

      Great point cgrantski (I realize now PROBABLY is said a lot in this article … ), if you have any particular research in mind I’d definitely be interested in reading it!


      How to Find Keyword Conversions by URL Using Google Webmaster Tools – Ben Goodsell – 20140221

      http://searchenginewatch.com/article/2330149/How-to-Find-Keyword-Conversions-by-URL-Using-Google-Webmaster-Tools

      In January Google announced that numbers will no longer be rounded in Google Webmaster Tool Search Query reports. With that announcement these reports became 20 to 30 percent more accurate.

      Not even available from the API, the Top Pages report is the only place you can find page-level search query data. Does this make it the most valuable report around?

      This article walks through how to get keyword-to-landing-page data by using the Top Pages report as a template, then consolidating analytics conversions and trending the results over time in a very basic way.

      Tools used:

      • Google Webmaster Tools
      • Google Analytics
      • Excel

      Capture Top Pages Report Data

      Google Webmaster Tools Search Query reports are the only way to get decently comprehensive keyword data (we have to take what we can get from Google).

      • Set the dates to the first week of February and expand page entries (figure 2 below) to reveal all keywords.

      Tips: Toggling from the bottom up is quicker, and Noah Haibach at Lunametrics has a nice JavaScript workaround for doing all of this automatically.

      [Screenshot: Google Webmaster Tools Top Pages report]

      • Select, copy, then paste all data into Excel.

      Note: Excel took a while to think about wanting to paste.

      [Screenshot: Google Webmaster Tools Top Pages data pasted into Excel]

      After pasting, format to remove all links, insert a column to the left of Impressions, add new column headers, and save as a new .xlsx file.

      Note: If pages contain a trailing space, be sure to remove it; otherwise they won’t match up when we use VLOOKUP later.

      [Screenshot: GA data combined with GWT data]

      • Use the same process to create a similar tab for week 2 of February.

      We now have the template to begin consolidating data from other sources, specifically Google Webmaster Tool Search Query Rankings and analytics Visits and Conversions.

      Consolidate Data

      Using Excel’s VLOOKUP function we’re going to begin to add data from the Google Webmaster Tool Search Query report and Google Analytics (see link if you don’t know how to use VLOOKUP, also quick run through here).

      Tip: Keep the downloads for reference later.

      • Pull Average Rank from Google Webmaster Tool Search Query report.

      Make sure you have the date set properly (this is set for piping in data to the week 1 tab).

      [Screenshot: Google Webmaster Tools Top Queries tab]

      Change 25 rows to 10, then change the grid.s parameter in the URL to the total number of rows given, in this case 2453.

      [Screenshot: exporting all rows from Google Webmaster Tools]

      Hit enter and then click “Download this table”. Open the file so that you have it and your report .xlsx in accessible windows. We’re going to use this file to pull in Average Position data per keyword.

      In the week 1 report tab (make sure you pulled week 1 GWT data), enter =vlookup and arrow over to the cell you want to use as the lookup_value, then enter a comma.

      [Screenshot: starting the VLOOKUP formula]

      In the Google Webmaster Tool Search Query download, highlight the data you want to use for the table_array and add a comma. We want the column H (8th column from the left) values to be returned, so add an 8, a comma, and finally a zero, then hit enter.

      [Screenshot: selecting the VLOOKUP table_array]

      The full formula looked like this:

      =VLOOKUP(B3,'[www-yoursite-com_20140218T230012Z_TopSearchQueries_20140201-20140207.csv]www-yoursite-com_20140218T23'!$A$2:$H$2454,8,0)

      • lookup_value – B3
      • table_array – '[www-yoursite-com_20140218T230012Z_TopSearchQueries_20140201-20140207.csv]www-yoursite-com_20140218T23'!$A$2:$H$2454
      • col_index_num – 8
      • [range_lookup] – 0

      Drag the formula down to all applicable cells, making sure not to overwrite the Average Rank values that already exist for pages. It is normal for #N/A to show up for queries with fewer clicks. Search and replace all such instances with 1, since keywords can’t be registered in Google’s system without a click.

      Repeat this process for week 2.
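
      For anyone more comfortable scripting than dragging formulas, here is a hedged JavaScript equivalent of that lookup step, including the “replace #N/A with 1” substitution (the field names are assumptions, not the exact headers of the GWT export):

      // Illustrative only: merge Average Rank from the GWT Search Query download
      // into the Top Pages rows, defaulting missing keywords to rank 1
      // (mirroring the "search and replace #N/A with 1" step above).
      function mergeAverageRank(topPagesRows, searchQueryRows) {
        var rankByQuery = {};
        searchQueryRows.forEach(function (row) {
          rankByQuery[row.query] = row.avgPosition;  // column H in the download
        });
        return topPagesRows.map(function (row) {
          row.avgPosition = rankByQuery.hasOwnProperty(row.query)
            ? rankByQuery[row.query]
            : 1;                                     // #N/A -> 1
          return row;
        });
      }

      // Hypothetical data
      console.log(mergeAverageRank(
        [{ page: '/widgets/', query: 'buy widgets' }, { page: '/gadgets/', query: 'rare gadget' }],
        [{ query: 'buy widgets', avgPosition: 3.2 }]
      ));
      // "rare gadget" has no match in the download, so it gets avgPosition 1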

      • Pull Organic Conversion data for URLs from Google Analytics.

      Ensuring the proper dates are used, navigate to Customization (upper navigation) -> Create a New Custom Report -> Fill it out so it looks like the image below. Goal Starts can be any conversion data you want to include.

      [Screenshot: Google Analytics custom report for organic landing page visits and conversions]

      Change Show rows (bottom right) to 25, then find the explorer-table.rowCount parameter in the URL and substitute the number after %3D with the number of rows in the GA result set. Hit enter, then Export -> CSV.

      [Screenshot: exporting all rows from Google Analytics]

      Use the VLOOKUP process described previously to add conversions to both the week 1 and week 2 tab.

      The final product should be two tabs that use the Google Webmaster Tool Top Pages report as a framework, combined with analytics visits and conversion data. Next step: taking this information and creating the Ultimate Google Webmaster Tool Dashboard.

      [Screenshot: GA data combined with GWT data]

      The Ultimate Google Webmaster Tool Dashboard

      Note: Most / all of you are probably better than me at putting together reports and visuals.

      The only thing ultimate about this solution is that it’s a way to visualize the correlation between URL conversion rate and keyword clicks and impressions.

      What we’re looking at is only the top 25 URLs; expanding this process to include more URLs is simple, as noted earlier. This represents about 60 percent of the site’s Google organic search traffic.

      Highlighted in green is the homepage of the site. We can see that our homepage was presented in search results 28.92 percent less, but our CTR is up almost 40 percent and our conversions up 33.05 percent.

      Tip: Percent change formula is: =(new # – old #)/old #
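
      The same formula as a tiny helper, if you are computing the changes outside the spreadsheet (the numbers below are placeholders, not the dashboard figures):

      // Percent change = (new - old) / old
      function percentChange(oldValue, newValue) {
        return (newValue - oldValue) / oldValue * 100;
      }

      console.log(percentChange(100, 71).toFixed(2) + '%');   // "-29.00%"
      console.log(percentChange(120, 160).toFixed(2) + '%');  // "33.33%"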

      [Screenshot: Google Webmaster Tool conversion dashboard]

      Looking at our week tabs we can see it’s because Google listed our page in many new searches that were not made in week 1. So while our page was not presented in search results as much, it was listed in 36 searches not made the previous week. By pulling in what we have in keyword level conversion data from GA we can really start to narrow down which keywords were responsible for this conversion increase and begin using it to form new strategies.

      [Screenshot: Google Webmaster Tool dashboard – keywords, not rankings]

      Conclusion

      Unfortunately since the introduction of SSL/”(not provided)” we are no longer able to directly tie conversion data to a search someone used to enter our site. We can now only correlate.

      This article merely scratches the surface of what could and should be done with this data. Enterprise level tools are doing what is shown and more on a massive scale. The key is finding ways to trend over time.


        I can’t seem to get this to work. Anyone have any advice? Like why does the saved xlsx contain different fields to the fields saved from the top pages report in webmaster tools?

        Note: If pages contain a trailing space after pasting from Google Webmaster Tools to Excel, be sure to remove otherwise values won’t match up when VLOOKUP is used later.

        I’ve had trouble too. The google analytics screenshots in these instructions look nothing like the google analytics dashboard I see when I log in. Any tips?


        Using ALL of Google Webmaster Tools data to tackle (not provided) – Noah Haibach – 20140123

        Source: http://www.lunametrics.com/blog/2014/01/23/google-webmaster-tools-data-not-provided/

        I couldn’t believe it when I saw the January 7, 2014 Webmaster Tools update,

        “data in the search queries feature will no longer be rounded / bucketed.”

        At first I thought, why would Google go through all that trouble to obfuscate keyword data in Google Analytics, when they planned on handing all that data back through the search query reports in Webmaster Tools? And of course, they didn’t plan on anything of the sort. The relatively minor update only removes bucketing, and does not address the big issue, that they display only 20% to 25% of search query data. I held out hope that, as it appears in the before and after pictures, the sampling rate had been increased from around 20% to around 35%. But while I’ve noticed small changes in some accounts, it does not appear they’ve made this improvement.

        [Chart: Webmaster Tools search queries before the January 7, 2014 update]

        [Chart: Webmaster Tools search queries after the January 7, 2014 update]

         So, how much of a boon IS the retraction of bucketing in GWT’s search queries? There definitely isn’t anyone complaining. It’s great to no longer see “<10 clicks” for our long tail queries. Of course, the biggest cost of (not provided) to the digital marketing community is the new-found powerlessness to relate search intent with landing page and overall site performance. While much energy and creativity is channeled towards addressing this issue with third party tools, I believe there is yet untapped insight inside Google Webmaster Tools.

        Patch Analytics with Webmaster Tools

        Before we get into the scope of this article, it is worth a shout out to Ben Goodsell who came up with a nice way to beat the bucketing over a year ago. Now that we no longer have to worry about bucketing, we can use an easier variation of his method to combat (not provided). After downloading the organic keyword data from Google Analytics and the search query data from Google Webmaster Tools, I used the latter (now accurate) data to correct the former. I won’t go into the details of my Excel setup, but I included a screenshot below. I can post the setup if there is interest. In this case, we went from 2283 visits with defined keywords in GA to 6802, using the GWT data. Of course when you only start with 4% of your organic visits as not (not provided), a 198%  increase is not as impressive. Still, it is better than nothing.

        [Screenshot: Combining GWT Search Query Data with GA Keywords]
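
        The exact Excel setup isn’t shown here, but one plausible way to script the same correction (JavaScript, with assumed field names) is to merge the two keyword lists and take the larger figure per keyword, so queries hidden behind “(not provided)” in GA still get counted:

        // Illustrative only, not the author's actual setup: combine GA organic
        // keyword visits with GWT search query clicks, keyword by keyword.
        function patchKeywords(gaRows, gwtRows) {
          var combined = {};
          gaRows.forEach(function (r) { combined[r.keyword] = r.visits; });
          gwtRows.forEach(function (r) {
            combined[r.query] = Math.max(combined[r.query] || 0, r.clicks);
          });
          return combined;
        }

        // Hypothetical data
        console.log(patchKeywords(
          [{ keyword: 'blue widgets', visits: 12 }],
          [{ query: 'blue widgets', clicks: 30 }, { query: 'cheap widgets', clicks: 9 }]
        ));
        // { 'blue widgets': 30, 'cheap widgets': 9 }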

        Re-connecting Queries with Landing Pages

        Short of using Google Tag Manager to import optimized keywords into your Google Analytics (which everyone should also do, by the way), Webmaster Tools still provides the last in-house way of connecting search queries with your site content. Below is the Search Query->Top Pages report from GWT.

        [Screenshot: The Top Pages Report in GWT]

        Notice the number of clicks, circled in green. When I first saw this, I did another impression of the Toronto Raptor, thinking I had discovered a loophole in GWT’s sampling methods. But of course, the ‘displaying 136,552 of 153,511 clicks’ means that nearly 90% of clicks are accounted for in terms of landing page. When you drill down into Keyword by Page, observe that only the standard 20% to 30% of search queries are accounted for. Still pretty neat, though, huh? You can now get an (exact) number of clicks for a given page for any search queries that made it past Google’s sampling method. What could we do with that data? Well it would be great to export it, play around with it, and see what types of additional insights we can draw. Which brings us to the next point of our review of GWT.

        Poor API Support!

        The only part of Google Webmaster Tools as frustrating as the (former) bucketing and (ongoing) sampling is the lack of official API support. There is an official Java API, but it cannot return search query data; only crawled keywords, crawl issues, etc. And the unofficial APIs that I have seen (PHP and Python) do not support easy OAuth integration, and have only limited support for search queries. Even the Google Analytics integration is lacking. The search query data cannot be combined with any meaningful GA metric, and, to make things worse, the imported data is still being bucketed! So, to access the Search Queries->Top Pages report without any heavy coding, we need to use the GWT user interface.

        Search Queries->Top Pages Export

        Unlike the standard Top Queries report, we cannot export the complete Top Pages report via the UI. The best we can do is export the summary table with a breakdown only by pages (and not search queries). We could also technically scroll down the page, expanding each of the links by hand, but that would be painful. I wrote a couple of JavaScript functions to automate the process. The code is rough, but it does download ‘page’, ‘search query’, and ‘clicks’ columns for each search query entry, in TSV format for Excel. The code is available from GitHub, and is also included below. I have only used it in Chrome.

        [Screenshot: Exporting Top Page Reports]

        Steps to export your Search Query->Top Page report from Google Webmaster Tools:

        1.  Log into GWT and navigate to Search Traffic->Search Queries->Top Pages.
        2.  Set the grid.s=25 parameter in the URL to however many pages you want to download. You should also order the pages by clicks if you are downloading less than the maximum number of rows.

        https://www.google.com/webmasters/tools/top-search-queries?hl=en&siteUrl=http://www.yoursite.com/&de=20140121&db=20131101&more=true&qv=amount&type=urls&prop=WEB&region&grid.s=719

        3. Set your desired date range. Up to three months prior is available in GWT. As a side note, it might be a good idea to back up your data every three months.
        4. Press F12 to open the JavaScript Developer Tools. Select ‘Console’.
        5. First, copy and paste the below JavaScript code into the Developer Tools console. Hit enter. You will be presented with an alert for each page entry in the table that Google is unable to expand. Simply hit enter to cycle through the alerts. When it appears all alerts are done, and all the page entries that Google can access have been expanded, proceed to the next step.
        //expand page entries
        (function(){
          var pages = document.getElementsByClassName('goog-inline-block url-detail');
          for(var i = 0; i < pages.length; i++){
            pages[i].setAttribute('href', '#');   // keep the click on the current page
            pages[i].setAttribute('target', '');  // prevent opening a new tab
            pages[i].click();                     // trigger the row expansion
          }
        })();
        
        6. Second, copy and paste the below JavaScript into the Developer Tools console. Hit enter. As long as your pop-ups are not disabled, you will be prompted to download a TSV with your GWT search queries->page data.
        //generate download link
        (function(){

          // Build an index of the page rows: for each page row in the table,
          // store a 2-item array of [row index in the table, page path].
          // Page rows are collected separately from query rows and then sorted.
          var temp = document.getElementById('grid').children[1].children;
          var indices = [];
          var tableEntries = Array.prototype.slice.call(temp);
          var pageTds = document.getElementsByClassName('url-expand-open');
          var i, test, count, thisPage, thisCSV, queries, l, encodedUri, link;

          for(i = 0; i < pageTds.length; i++){
            temp = tableEntries.indexOf(pageTds[i]);
            indices.push([temp, pageTds[i].children[0].children[0].text]);
          }

          pageTds = document.getElementsByClassName('url-expand-closed');

          for(i = 0; i < pageTds.length; i++){
            temp = tableEntries.indexOf(pageTds[i]);
            indices.push([temp, pageTds[i].children[0].children[0].text]);
          }

          indices.sort(function(a, b){ return a[0] - b[0]; });

          // This is complicated: the row indices need adjusting, since the
          // aggregate page listing is a row just like the expanded query rows.
          for(i = indices.length - 1; i > 0; i--){
            test = indices[i][0] - indices[i - 1][0];
            if(test === 1){
              indices[i - 1][1] = indices[i][1];
              indices[i][0]++;
            }
          }

          thisCSV = "";
          queries = document.getElementsByClassName("url-detail-row");

          // count tracks when to update the page column for the TSV
          // (convoluted, but this was done quickly, not elegantly).
          count = 0;

          for(i = 0; i < queries.length; i++){
            if(indices[count][0] === i){
              thisPage = indices[count][1];

              do {
                count++;
                test = indices[count][0] - indices[count - 1][0];
              } while(test === 1);

              // The pages and keywords are all in the same kind of tags and were
              // counted at the same level in the index above, so re-align here.
              indices[count][0] -= count;
            }

            thisCSV += thisPage + "\t";
            l = queries[i].children[0].children.length;

            if(l > 0) thisCSV += queries[i].children[0].children[0].text + "\t";
            else thisCSV += queries[i].children[0].innerHTML + "\t";

            thisCSV += queries[i].children[3].children[0].innerHTML + "\n";
          }

          // Create a data-URI link and click it as a means to save the TSV.
          encodedUri = "data:text/csv;charset=utf-8," + encodeURI(thisCSV);
          link = document.createElement("a");
          link.setAttribute("href", encodedUri);

          // Update the file name with a timestamp if you want.
          link.setAttribute("download", "GWT_data.tsv");
          link.click();
        })();
        

        Delving into the Data

        Now that we’ve downloaded the data, let’s talk about what we can do with it. Why did we even download it in the first place? Well, as we mentioned in step 3, GWT data is only available for the past three months. If you regularly backup your data, you will have access to more than three months, and may be able to conduct better keyword analysis. In addition to maintaining historical data, we may be able to glean insight by sorting it and comparing to other data sets. I’ll outline how I used Excel for such a project. My approach was to increase the proportion of total data accounted for by the data displayed in Google Webmaster tools, based on the following assumption.

        Assumption:
         the process by which Google filters (chooses which queries are displayed in GWT) is not dependent on the keywords themselves. In other words, while Google might, for example, tend to display fewer long-tail keywords to us, they are not always blocking the same keywords on a weekly or monthly basis. If the above assumption holds true, we can partition the data into weekly or monthly segments, and then estimate clicks for queries that appear in some time segments but not in others. This technique would likely be safer when working with monthly data, as there is a better chance the above assumption is met. For the sake of demonstration, I download the last three months’ Search Query->Top Pages data and partition it into six two-week segments. After importing into Excel, I create a master list, previewed below.

        [Screenshot: Exported TSV of GWT page-level search queries]

        The fourth column is an index that represents the two-week time period. Next I create a pivot chart with the data, and I am able to display a chart with search queries as rows and the two-week time periods as columns. The values listed as visits are actually clicks. This method is most applicable to search queries with a medium level of clicks. These queries are common enough that they can be expected to be searched every two weeks or every month, but not so common that they need to be regularly included in the GWT reports (or else be conspicuously absent).

        [Screenshots – Left: Pivot Chart of Page-Level Search Queries. Right: With Missing Clicks Estimated.]

        Results

        Using this method, I’ve accounted for 13% more clicks (visits) without introducing new keywords. Further, I’ve only used:

        1. three months of search query data, and
        2. a small website with
        3. quickly changing web pages (the vast majority of landing pages are blog articles).

        This method will be even more useful for:

        1. Those with more than three months historic data
        2. larger websites
        3. websites with more-static web pages.

        Extensions

        1. Scale the monthly estimated corrections using aggregate search volume data. This will help to determine whether the absence of a search query is due to filtering by Google or just a lack of interest by searchers.
        2. Use Dimension Widening to import the refined search query data into Google Analytics, associating it with the landing page dimension.

        Assumptions Revisited

        I had reasoned that, between two-week periods, there are keywords that are sometimes displayed to us in Google Webmaster Tools and sometimes blocked. For any search queries where some two-week periods have zero clicks, I average how many clicks per two-week period they received over the three-month period, and assign that value to the given query. While there are certainly legitimate cases where a search query had no clicks for a given week, I reason that the error of wrongly assigning a search query click to a given page is smaller than the gain netted in terms of better understanding our search queries (and on a page-by-page basis, at that).

        And what if Google is consistently hiding the same keywords over the three-month period? I would argue that this would be very hard for Google to achieve while still displaying a relatively consistent percentage of total queries. (What happens if site traffic drops so much on a small website that Google would be forced to display more than 20% or 30% of keywords?) They probably need to display keywords that have previously been classified as hidden, even if they do not admit it.
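
        To make the fill-in step concrete, here is a minimal JavaScript sketch of the averaging described above. The input shape is an assumption (one object per query/period pair, with periods indexed 0 to 5); it illustrates the idea rather than the exact spreadsheet workflow used here.

        // Minimal sketch (assumed input shape): rows look like
        // { query: "...", period: 0-5, clicks: N }, one row per query/period pair.
        // For each query, periods reporting zero clicks are assigned the average
        // clicks per two-week period observed for that query.
        function fillMissingClicks(rows, periodCount) {
          var byQuery = {};
          rows.forEach(function (r) {
            if (!byQuery[r.query]) {
              byQuery[r.query] = [];
              for (var i = 0; i < periodCount; i++) byQuery[r.query].push(0);
            }
            byQuery[r.query][r.period] += r.clicks;
          });

          var estimates = {};
          Object.keys(byQuery).forEach(function (q) {
            var clicks = byQuery[q];
            var total = clicks.reduce(function (a, b) { return a + b; }, 0);
            var avg = total / periodCount; // average clicks per two-week period
            estimates[q] = clicks.map(function (c) { return c > 0 ? c : avg; });
          });
          return estimates; // query -> observed-or-estimated clicks per period
        }

        Queries that appear in every period are left untouched; only the “sometimes missing” ones pick up an estimated value.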

        Anna N. says:
        Great post! Can you provide more details how to use the Dimension Widening and import the refined search query data into Google Analytics?
        Noah Haibach says:
        I probably should have expanded on what I meant by using dimension widening. We of course have no way of matching organic search visits with their actual search queries. What we might do, however, is probabilistically assign search queries to visits based on landing page. I admit the process is a bit tricky, and requires more initial setup in Excel. Here are the steps as I see them:

        1. Download the GWT Top Pages report as I outlined in the article.
        2. Create a new view for your property in GA.
        3. Set a visit-level custom dimension to serve as a key for the dimension widening. Its value should be set to a random number between 1 and n, where n is larger than the largest number of organic visits for a landing page displayed (in GWT) for a two-week period (for us, it might be 500-1000). Use Math.random() to set this value (see the sketch after these steps).
        4. Set up dimension widening in GA based on two keys. The first key would join on landing page, and the second would join on the random value you set with the previous step.
        5. Use the GWT Top Pages report downloaded in step 1. Use the prior two weeks’ search queries, and distribute the random numbers (from 1 to n) so they represent the distribution of visits among the search queries. Do this for each page.
        6. Generate a CSV from the updated GWT Top Pages report. This CSV will have to be updated every two weeks, or as quickly as search trends change for your content.
        7. The Dimension Widening, along with a filter, could be used to rewrite all (not provided) entries as a custom dimension ‘keyword or estimated if not provided’.

        The initial setup is a bit complicated for this method. And there would be upkeep involved to update the Dimension Widening CSV every two weeks or month or so. While it does not reassign ACTUAL search queries to their respective visits, it could provide a more granular understanding of search intent than current landing page reports. It would be especially helpful for large websites that are less impacted by the error involved with our estimations. Let me know if I can provide more specifics or clarification.
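
        For step 3, here is a minimal sketch of setting the random join key. It assumes Universal Analytics (analytics.js) is already loaded with a tracker created, that custom dimension slot 1 is configured as session-scoped in GA, and that n = 1000; the slot number and n are assumptions, not part of the steps above.

        // Sketch of step 3 (assumptions: analytics.js already loaded and a tracker
        // created, custom dimension slot 1 configured as session-scoped in GA,
        // n = 1000 as the upper bound on organic visits per landing page per
        // two-week period).
        var n = 1000;
        var joinKey = Math.floor(Math.random() * n) + 1; // random integer from 1 to n

        ga('set', 'dimension1', String(joinKey)); // the dimension-widening join key
        ga('send', 'pageview');                   // send the key with the pageview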

        Also, we might wish to simplify the data we downloaded from GWT by removing stop words and grouping similar search queries (before using it for GA dimension widening).

        Chris says:
        Very clever stuff.

        If I understood correctly, the crux is:

        If there’s a time frame (e.g. a two-week period) when a keyword doesn’t appear, and it usually does, then add the keyword with an estimated value (instead of no keyword).

        For this method, how many two-week periods in a row can a keyword be missing and still get an estimated value assigned to it?

        As the data set gets bigger you’d need to account for this, or you may be adding keywords (e.g. seasonal keywords) when they don’t exist!

        Noah Haibach says:
        Thanks Chris, and great question!

        To answer your second question first: yes, you definitely need to consider the seasonality of your site/the keyword. I think I mentioned, as a possible extension, that you could try to put certain keywords in context by checking them against Google Web Trends and other services.

        Your first question needs a more nuanced answer. How many two-week periods in a row can a keyword be missing and still get an estimated value assigned? We can never be 100% sure; we must always take a probabilistic approach. I’ll give a rough example where we use a non-parametric estimation. Suppose we have four two-week periods and we have a keyword with the following appearances:

        weeks 1-2 – 1 query
        weeks 3-4 – 0 queries
        weeks 5-6 – 2 queries
        weeks 7-8 – 1 query

        We can thus say that over the two-month period, the rough chance that the keyword appears on any given day is (1+0+2+1)/60 = 1/15. According to the binomial distribution, the probability that we observe no queries in a two-week period is:

        P(X = 0 | n = 14, p = 1/15) ≈ 0.38

        So there’s a good chance that the zero-query period was due to chance. Note that this is by no means an accurate model of the probability (the real thing would be more complicated), but it’s a good enough approach to give us an idea of whether zero queries in a two-week period could simply be due to chance.
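
        Spelling out the arithmetic behind that 0.38 figure (a direct application of the binomial probability mass function with k = 0):

        P(X = 0 \mid n = 14,\ p = \tfrac{1}{15}) = \binom{14}{0}\, p^{0} \left(1 - p\right)^{14} = \left(\tfrac{14}{15}\right)^{14} \approx 0.38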

        Mike Sullivan says:
        Hey Noah, just released a new product you might like — it downloads all that data you can’t get at through the non-existent API using a simple Excel add-in (Windows only, sorry). First release…more to come…

        Tommy says:
        I have been trying and trying to get this code to download via Chrome, but so far no luck. The first part of the code, when it runs, returns undefined. Yet it looks good in Webmaster Tools. However, when I run the second part of the code to generate a download link, it is returned with an error of “TypeError: Cannot read property ’0′ of undefined” – any advice? Thank you!!!!

        Adam says:
        I had the same problem as you had when copying the script straight from the article into my browser. But when I copied it to a notepad first it worked fine.

        Gerhard Angeles says:
        It worked for me when I refreshed the page.

        Anyways, is there any way to include impressions in the tsv file?

        Thanks so much for this Noah. A great contribution to SEO in today’s (not provided) world.


Cleaning Up Catch from OutWit

Export the Catch of the table of exploded queries.

Search and replace within Excel to clean up the &nbsp; entities, etc.

Copy the column containing the JavaScript corresponding to new nodes into Notepad++.

Search and replace to clean up the JavaScript – the script needs to be shortened before importing it back into Excel.

Within plain Notepad, replace any commas with periods, since we are going to save the file with a .csv extension.

Open the .csv file within Notepad and save it as a text file with a .csv extension – this file (and, for some reason, not the Notepad++ file) can be imported into a new Excel worksheet, and we then copy that column into the original Excel file.
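
If you would rather script this cleanup than do it by hand in Excel and Notepad++, here is a minimal Node.js sketch; the file names are hypothetical, and the two replacements mirror the manual steps above (strip the &nbsp; entities, swap commas for periods before saving as .csv):

// Minimal sketch (hypothetical file names): strip &nbsp; entities and replace
// commas with periods so the cleaned text can be saved as .csv and imported
// into Excel without breaking the column layout.
const fs = require("fs");

const raw = fs.readFileSync("outwit_catch_export.txt", "utf8");
const cleaned = raw
  .replace(/&nbsp;/g, " ")  // clean up the non-breaking-space entities
  .replace(/,/g, ".");      // commas would otherwise be read as CSV separators

fs.writeFileSync("outwit_catch_cleaned.csv", cleaned);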

OutWit Hub – FAQs

General

What is OutWit Hub and when should I use it?

When you are looking for something on the Web, search engines give you lists of links to the answers. The purpose of OutWit Hub is to actually go retrieve the answers for you and save them on your disk as data files, Excel tables, lists of email addresses, collections of documents, images…

If your question has one simple answer, it will be at the top of Wikipedia or Google results and you don’t need OutWit for that. When you know, however, that it would take you 20, 50, 500 clicks to get what you want, then odds are you do need OutWit Hub:

The Hub is an all-in-one application for extracting and organizing data, images, documents from online sources. It offers a wealth of data recognition, autonomous exploration, extraction and export features to simplify Web research. OutWit Hub exists both as a Firefox Add-on and as a standalone application for Windows, Mac OS, and Linux.

OK, I have downloaded OutWit Hub and I am running it. Now what?

We have an open list of 1,728 first things you can do with the application but we believe the best first thing is to run the built-in tutorials from the Help menu (Help>Tutorials).

Automatic Exploration

I want OutWit Hub to browse through a series of result pages but the ‘Next in Series’ and ‘Browse’ buttons are disabled. How come?

When opening a Web page, OutWit analyzes the source code and tries to understand as many things as possible about the page. The first thing it does is find navigation links (next, previous…) and, when it does, the ‘Next in Series’ arrow and ‘Browse’ double arrows become active. If they are inactive, it is because OutWit did not find any additional pages. There are many workarounds to do the scrape without having to click on all the links manually: depending on the case, the best alternative solutions are using the Dig function (with advanced settings in the pro version), generating the URLs to explore, making a ‘self-navigating’ scraper with the #nextPage# directive or, finally, grabbing the URLs you want to scrape, putting them in a directory of queries and using this directory to do a new automatic exploration. (Note that for the latter, it is also possible to grab the links to the Catch in one macro and address the column of the Catch by the name you gave it in a second macro, by typing ‘catch/yourColumnName’ in the Start Page textbox.)

Some links are not working in the Standalone version of the Hub. What should I do?

These are links for which target=blank was specified in the source code. OutWit Hub cannot open separate popup windows but you can open them within the Hub. For this, check the “Open popup links in the application window” preference (Tools>Preferences>General).

Auto-Explore Functions and Fast Scraping are slower in the current version than in the previous. Why is that?

They are not, in fact. The program’s exploration functions work exactly the same way. It is possible, though, that your preference settings have changed during the upgrade. Temporization and pause settings should actually be more precise and reliable than in previous versions. You can fine-tune all this in Tools>Preferences>Time Settings. Another recent preference which may have an impact on the exploration speed is ‘Bypass Browser Cache’ in the ‘Advanced’ panel: not using the cache does slow the browsing down, so you may want to set it to ‘Never’. If, after this, you are still experiencing performance issues, consider disabling processes you may not need by right-clicking on ‘page’ in the left side bar.

The next page button functions correctly but when trying to do a Browse to capture the information, the application runs only 2 pages then stops. Why is that?

- Cause: the next page link is probably a javascript link and it is probably the same in all pages, so the program thinks this URL has already been visited and stops the exploration.

- Solution: there is a preference (Tools>Preferences>General) just for this. Uncheck “Only visit pages once…”. Important: Do not forget to check it back afterwards or your next Dig would probably last forever and bring back huge amounts of redundant data.
Data: Extracting, Importing, Exporting…

I would like to extract the details of all the products/events/companies in this site/directory/list of subsidiaries… Could you please advise me on how to do that?

Unfortunately this is the purpose of the hundreds of features covered in the present Help, so it is difficult to answer in one sentence, but the general principle is this:

Go through the standard extractors (documents, lists, tables, guess…) by clicking in the left side panel. Either you find that one of them gives you the results you want, –in which case it is just a matter of exporting the data– or you need to create a scraper for that site. In the second case, you first need to go to one of the detail pages, build a scraper in the ‘scrapers’ view for that page, test it on a few other pages. Then go to the list of results you need to grab and have OutWit browse through all the links and apply your new scraper. This can be done in two ways: either by actually going to each page (‘browse’ or ‘dig’ or a combination of both if you have the pro version) or by ‘Fast Scraping’ them (applying your scraper to selected URLs –right-click: Auto-Explore>Fast Scrape in any datasheet– or ‘Fast Scrape’ in a macro).

How can I import lists of links (URLs) or other strings into OutWit Hub?

There are many different ways to do this. Here are a few:

Put them into a text file (.txt or .csv), and open the file from the File menu. (Note that on some systems, the program may try to open .csv files with another application. In this case, just rename your file with the .txt extension.) You will find your URLs in the links view and the text in the text view.

Drag them directly from another application to the page or queries view of the Hub.

If they are in a local HTML file, simply open the file from the File menu and you will be able to process it with the Hub as any Web page.
Copy the links from whatever application they are in (you can also copy HTML source code or simple text containing URLs), right-click in the page view of the Hub and choose Edit>Paste Links.

Once your links are in the Hub, you simply need to select them, right-click on one of them and select ‘Send to Queries’ to create a directory of URLs that you will then be able to use in any way you like (in a macro for instance, or doing an automatic exploration directly from the right-click menu).

How can I import CSV or other tabulated data into OutWit Hub?

Simply open the file (.txt, .csv …) from the File menu. (Note that on some systems, the program may try to open .csv files with another application. In this case, just rename your file with the .txt extension.) If the original data was correctly tabulated, you should find the data well structured in the guess view. If the data was less structured, well, the Hub will do what it can.

I have made a scraper which works fine on the page I want to scrape, but when I do a browse and set the ‘scraped’ view to collect the data, it grabs the data of the first page over and over again. What is happening?

You are probably trying to scrape information from AJAX pages where the data is dynamically added to the page by Javascript scripts. You need to set the type of source to be used by your scraper to Dynamic. When you do, the source code of the page will be displayed on a pale yellow background. Note that you will probably have to adapt your scraper if it was created for the Original source, as the code may have changed slightly.

How can I convert a list of values into a String Generation Pattern?

If the values are in one of the Hub’s datasheets, just select them, right-click on one of them and select “Insert Rows…”. If they are in a file on your hard disk, simply import them into a directory of queries (see above) and do the same.

What is the maximum number of rows of data OutWit Hub can extract and export? After a certain number of rows, when exporting, I get a dialog telling me a script is unresponsive. What should I do?

In our tests, we have extracted and successfully exported up to 1.3 million rows (of two or three columns). Obviously, the limit varies a lot from system to system, depending on the platform, the RAM, etc. When exporting more than 50,000 or 100,000 rows, you may see such dialogs, even several times in a row, when you click on Continue. There is a checkbox to prevent it from coming back. (Note that Excel XML export is always much more demanding than CSV or TXT.) Don’t forget that you can move your results to the catch and save the catch itself in a file if you need to reuse the contents or just for backup purposes (File Menu). A catch file can only be read again in OutWit Hub but it is much faster to save than exporting the data.

The program doesn’t find all the email addresses in this website. Why is that?

There are several ways to have OutWit look for emails in a site. The fastest is to select Fast-Search For emails>In Current Domain, either from the Navigation menu or from the popup menu you get when you right-click on the page. This method, however, doesn’t explore all pages in the site. It only looks for the most obvious (contacts, team, about us…) pages that can be found. If you want to systematically explore all pages in a site, you will have to use the Dig function, within domain, at the depth level you wish.

Why doesn’t the program find contact information (phone, address…) for some of the email addresses?

First, of course, the info has to be present in the page. Then, if it is there, no technology allows for perfect semantic recognition. An address or a phone number can take so many different forms, depending on the country, on the way it is presented or on how words are abbreviated, that we can never expect to reach a 100% success rate.

Email address recognition is nearly exhaustive in OutWit; phone numbers are recognized rather well in general; physical addresses are more of a challenge: they are better recognized for the US, Canada, Australia and European countries than for the rest of the world. The program recognizes names in many cases. As for other fields like the title, for instance, automatic recognition in unstructured data is too complex at this point and results would not be reliable enough for us to include them unless they are clearly labeled. We are constantly improving our algorithms, so you should make sure to keep your application up to date.

I am observing the progress and I see that no new line is added for some pages when I am sure there is an email address or other info that should be found. Why is that?

This page (or one containing similar info) was probably visited before. Results are automatically deduplicated. This means that if an email address –or just a phone number or physical address– has already been found, the row containing this data will be updated (and no new row, created) when a new occurrence is found.

User Interface

How do I make a hidden column visible in a datasheet?

In the top right corner of every datasheet in the application is a little icon depicting a table with its header: the Column Picker. If you click on this icon, a popup menu will allow you to hide or show the different columns of the datasheet. Only visible columns are moved to the Catch and exported by default (this behavior can be changed with a custom export layout).

What is the Ordinal ID?

The Ordinal column is hidden by default in all datasheets. Use the column picker (icon at the top right corner of any datasheet) to display it. The Ordinal ID is an index composed of three groups of digits separated by dots. The first number is the number of the page from which the data line was extracted (it can only be higher than 1 if the ‘empty’ checkbox is unchecked or if the datasheet is the result of a fast scrape). The second number is the position of the data block in the page (can only be more than 1 in ‘tables’, ‘lists’, ‘scraped’ and ‘news’ views). The last number is the position of the data line in the block (or in the page, if there is only one data block in the page).

Install

I cannot enter my serial number in the Registration Dialog of OutWit Hub. The program keeps saying the key is invalid.

Your key was sent to you by email when you purchased the application. It is a series of letters and digits similar to this: 6YT3X-IU6TR-9V45E-AFS43-89U64. It must not be confused with the login password to your account on outwit.com, which was also sent to you by email (if you cannot find one of these email messages, please check your spam folder).

If you are wondering whether the Hub you are using is a pro or a light version, you will find the answer in the window title. Up to now, we haven’t had a single case where a valid serial number would not work. You might be experiencing a very rare bug, but this seems very unlikely after several years. The key needs to be entered exactly as it is in the mail you received. So either you are not typing it precisely right (in which case you should simply copy and paste the email address and the key from our original mail) or you are typing something completely different (the login to your outwit.com account, for instance?). If you have changed email addresses since you purchased your license, remember that the one to use is the one with which you originally placed your order.

I have installed OutWit Hub for Firefox (or Docs or Images) then reloaded Firefox but I don’t see the OutWit icon on my toolbar. What can I do?

Three possibilities:

1) You didn’t download the Firefox add-on but the standalone application. In this case, you just need to install the software and double-click on its icon, as you would for any other application.

2) You do have the add-on and the install worked but the icon is simply missing from the toolbar. In this case, select ‘OutWit’ in the ‘Tools’ menu, then select OutWit Hub (or the appropriate outfit) in the sub-menu. If you want to add the icon to your toolbar, right-click on the toolbar and select ‘Customize’ then drag and drop the OutWit icon onto it.

3) The add-on install failed. In this case, the most probable reason is that, even though you just downloaded the program, you do not have the latest version. The one you downloaded (probably from a third party) doesn’t work with the current version of Firefox. Download the latest version from outwit.com. Of course, every now and then, it may also be a real compatibility problem. So if the above doesn’t work or doesn’t apply, please create a support ticket on outwit.com and we’ll do our best to help.

How can I revert to OutWit Hub 3.0?

If you have upgraded to version 4 by mistake or have a problem with a feature and wish to revert to version 3, make sure your version (Hub and runner) is 4.0.4.35 or higher and type outwit:downgrade in the Hub’s address bar. (Please tell us if you believe you have discovered a problem in this version.)

Troubleshooting

On OutWit Hub For Firefox, I have been experiencing new issues recently: unresponsive scripts, timeouts, strange behaviors on pages that used to work fine… what can I do to revert to factory settings?

We are not aware of incompatibilities with other add-ons, but it can always happen; some of your Firefox preferences could also have been changed by another extension, or files may have been corrupted in your profile. You can try to create a blank profile and reinstall OutWit Hub (or other OutWit extensions) from outwit.com. This will bring you back to the initial state. Here is how to proceed on Windows:

http://kb.mozillazine.org/Creating_a_new_Firefox_profile_on_Windows

and on other platforms:

http://support.mozilla.com/kb/Managing+profiles

Can I create a new profile in OutWit Hub Standalone?

With the standalone version, the principle is almost exactly identical to the way it works in Firefox (see above paragraph).

Windows: click “Start” > “Run”, and type :
“C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe” -no-remote -ProfileManager

Macintosh: Run the Terminal application and type :
/Applications/OutWit\ Hub.app/Contents/MacOS/outwit-hub -no-remote -ProfileManager

Linux: open a terminal and type :
[path to directory]/outwit-hub -no-remote -ProfileManager

If you need instructions to go further, refer to the profile manager instructions for Firefox:

http://support.mozilla.org/en-US/kb/profile-manager-create-and-remove-firefox-profiles

Where is my profile directory?

In OutWit Hub (Standalone or Firefox Add-on), if you type about:support in the address bar, you will get a page with important information about your system and configuration. In this page, you will find a button that will lead you to your profile directory. Among the files you will see there, the ones with .owc extensions are Catch files, and files ending with .owg are User Gear files (the User Gear is the database where all your automators are stored). You can back these files up or rename them if you plan to alter your profile.

OutWit Help Files


Next in series: Loads the next page in a series

Active when OutWit finds a navigation link to the following page (i.e. if the current page is part of a series, like a result page for a query in a search engine).

Browse: Auto-browses through the pages of a series.

Active when OutWit finds a navigation link to the following page (i.e. if the current page is part of a series, like a result page for a query in a search engine). Right-clicking or holding down the Browse button opens a menu allowing you to limit the number of pages to explore. Escape or a second click of the button will stop the browse.

Dig: Automatically explores the links of the current page.

Active when OutWit finds links in the current page. Right-clicking or holding down the Dig button opens a menu allowing you to limit the exploration within or outside the current domain and to set the depth of the dig. Depth = 0 will browse through all the links of the page; Depth = 1 will also explore all the links of the pages visited. Escape or a second click of the button will stop the dig. (Only links matching the list of extensions set in the advanced preference panel are explored. Some link types are systematically filtered out from the exploration: log-out pages, feeds which cannot be opened by the browser, etc.)

Up to Site Home: Goes up to the home page of the current site.

Active when the current page is not the home page of a site. Goes up one level towards the top of the current site’s hierarchy.

Slideshow: Displays the images of the page as a slideshow.

Active when OutWit finds images in the current page. The slideshow can be viewed in full screen or in the page widget. If the current page is part of a series, the slideshow will go on as long as a next page is found.

Address Bar: for URLs, macros or search queries.

You can type here a URL to load, a query which will be forwarded to the preferred search engine, or a macro to execute.

The Standalone Application

OutWit Hub exists in two guises: a standalone application and a Firefox add-on.

Both versions are basically the same program and are able to fulfill the same functions. There are however a few specificities corresponding to their nature. The ones which are worth noting are the following:

(If you wish to get to your OutWit files, please first read the Frequently Asked Questions, Troubleshooting section, for info on the Profile files in both the Standalone app. and the Firefox add-on.)

The standalone application can be launched from a terminal:

Windows: “C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe”
Mac OS: /Applications/OutWit\ Hub.app/Contents/MacOS/outwit-hub
Linux: run outwit-hub from the location where you unpacked the zip file.

To run the standalone application from a terminal, you can include the following command-line parameters (see the example after this list):

-url “http://…” to load a URL after starting. Using quotes around the URL is safer in case of special characters, especially on Windows.
-macro xxx to execute the macro corresponding to the Automator ID (AID) xxx in your profile (see the list of macros in the macro manager).

-quit-after to instruct the application to quit after executing the tasks of the command line.
-p to open the profile manager.
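
For example, assuming a macro with AID 12 exists in your profile (a hypothetical ID), the following Windows invocation loads a page, runs that macro and quits when it is done:

"C:\Program Files (x86)\OutWit\OutWit Hub\outwit-hub.exe" -url "http://www.example.com" -macro 12 -quit-after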

The standalone profile files with your automators and catch are located by default in (replace XXX with your user name):

Windows: C:\Users\XXX\AppData\Roaming\OutWit\outwit-hub\Profiles\
Mac OS: /Users/XXX/Library/Application\ Support/OutWit/outwit-hub/Profiles
Linux: outwit-hub/Profiles.
Note: In OutWit Hub (Standalone or Firefox Add-on), if you type about:support in the address bar, you will get a page with important information about your system and configuration. In this page, you will find a button that will lead you to your profile directory. The file named User_Gear.owg contains your scrapers, macros, etc. and catch.owc contains the data you placed in your Catch. Your profile directory also contains backup folders where old versions of these files are stored. (In the Firefox Add-on version, your OutWit profile files are located within the Firefox profile.)

The Firefox Add-on allows the Hub to open new browser windows as new Firefox tabs or windows. The standalone version doesn’t have this capacity.

OutWit Hub’s Menus

The menus give access to the main features of the application.

The application menus, located at the top of the screen, are:

the File Menu
the Edit Menu
the View Menu
the Navigation Menu
the Tools Menu
the Help Menu
the Registration/Upgrade Menu

A contextual menu, the right-click popup Menu, can be used in all datasheets of the application.

The File Menu
Gives access to the file saving/loading and data export functions.

Available options may vary with the view and the license level of your product.

Open…
Opens the File Picker Dialog to select one or several files from the hard disk or a local resource. Some file types can be explored and processed by OutWit to recognize and extract content (html, htm, xhtml, xml, txt, csv, owc…). When the selected file can be processed by OutWit Hub, it will be opened directly in OutWit Hub; otherwise, it will be ignored (or, in the case of OutWit Hub for Firefox, it will be opened/processed by Firefox). When several files are selected in the Open Dialog, if some or all of them can be explored by OutWit, they will be browsed successively by the program. If one or several files are OutWit Automators or Catch files, they will be imported after confirmation by the user.
Notes about importing data:

You can open .html files of course, but also .txt, .sql, .csv… files of many different types and formats and process them with the Hub. The guess view should do a good job recognizing the fields of tabulated files in most cases –if the file is not too exotic.
Putting a list of URLs in a .txt or .csv file and opening it in the Hub is one of the easiest ways to import links for automatic exploration and processing. They will appear in the links view, from which they can be grabbed, sent to a directory in the queries view…

Save Page As…
Same command as in any browser: saves the current page on the hard disk. The attached files and images will be saved in a folder named after the page, suffixed with “_files”.
Download Selected Files
Downloads all documents and images found in the selected rows and saves them to the current destination folder on your hard disk. (The same option can be found in the datasheet right-click menu.)
Download Selected Files in…
Downloads all documents and images found in the selected rows, opening the folder picker to let you decide where you want the files to be saved. (The same option can be found in the datasheet right-click menu.)
Load a Catch File…
Opens the File Picker to select a Catch file to open on the hard disk or a local resource.
Save Catch File as…
Save the content of the Catch as an OutWit Catch file (.owc) to the hard disk or a local resource.
Export Catch as…
Exports the content of the Catch to a file on your hard disk, in one of the available formats (Excel, CSV, HTML, SQL).
Export Selection as…
Exports the selected data to a file on your hard disk, in one of the available formats (Excel, CSV, HTML, SQL). (The same option can be found in the datasheet right-click menu.)
Empty Catch
Deletes the contents of the Catch panel.
Manage User Gear
Allows you to Export or Import the User Gear database, which contains all your automators. This way, you can easily transfer your scrapers, macros… from one profile to another or from the add-on to the standalone version.

The Edit Menu
Gives access to the application’s text and datasheet editing functions.

Available options may vary with the view and the license level of your product.

Editing Functions
The standard Cut, Copy, Paste, Duplicate and Delete functions apply to the selection. In a datasheet, they apply to rows.
Insert, delete, edit, copy and empty functions are available for cells. Columns can be inserted or deleted.
Insert Row
Inserts an empty row to the current datasheet, before the selected row.
Insert Rows
The Insert Rows function allows you to generate strings using the Query Generation Pattern format. Inserts the generated rows to the current datasheet, after the selected row.
Select All
Selects all rows of the datasheet.
Invert Selection
Deselects all selected rows of the datasheet and selects all rows that were not selected.
Select Similar
Selects all rows of the datasheet with content similar to that of the selected cell. The default threshold used for determining similarity is 40 (0 selecting only identical values and 100 selecting everything). Use the sub-menu items to increase or decrease the threshold and select more or fewer rows.
Select Identical
Selects all rows of the datasheet with content identical to that of the selected cell.
Select Different
Selects all rows of the datasheet with content different from that of the selected cell.

The View Menu

Gives access to the application’s display options.

Available options may vary with the view and the license level of your product.

Slideshow
Displays the images of the page as a slideshow.
Full Screen
Displays the page in full screen. In this mode, menus disappear. To exit the full screen mode, use the platform function key or press the escape key.
Show/Hide Catch
Displays or hides the Catch Panel at the bottom of the application interface.
Show/Hide Log
Displays or hides the Log Panel at the top of the application interface.
Show/Hide Info
Displays or hides the Info/Message Bar at the top of the application interface.
Switch View Mode
Rolls through the different display settings for the current view: Data only (spreadsheet display), Export Layout only (HTML, CSV…) or a split view with both.
Highlight Series of Links
When checked, the program will highlight links of the same group or level, to simplify automatic exploration.
Show Exploration Button
When checked, the program will display a button in the page with which you can display the main automatic exploration functions in a simple click. (The exploration menu can also be displayed by right-clicking on the page.)
Windows
Lists and gives access to the windows currently open in Firefox.
Views
Lists and gives access to the Hub’s views.

The Navigation Menu
Gives access to the application’s Navigation options.

Available options may vary with the view and the license of your product.

Fast Search for Contacts and Auto-Explore Pages are also accessible using the right-click menu on the page or the Exploration Button.

Back
Goes back one page in the navigation history.
Forward
Goes forward one page in the navigation history.
Next in series
Loads the next page in a series (more info on the next page function.)

Active when OutWit finds a navigation link to the following page (i.e. if the current page is part of a series, like a result page for a query in a search engine).

Fast Search for Contacts
The program sends queries to the site(s) and searches for emails without loading the pages in the browser. Available options in this sub-menu vary with the current page and context. They include:

In Current Website
The program sends queries to the current site and searches for contacts without actually loading the pages in the browser. Not all pages are explored. OutWit Hub tries to locate the ones that are likely to include contact information.
In All Links
The program sends queries to the URLs found in the current page to search for email addresses and contact information.
In Selected Links
The program sends queries to the selected links, searching for email addresses and contact information.
In Highlighted Links
The program sends queries to the highlighted links, searching for email addresses and contact information. (Hover over the links to highlight series of links.)
In Linked Websites
The program browses through the pages of the current series of result pages (if any) and sends queries to the external URLs found (linking outside the current domain) to search for email addresses and contact information.

Auto-Explore Pages
The program actually visits and loads each page of a series or selection. Available options in this sub-menu vary with the current page and context.

Browse Selected Links
Auto-browses through the links that are selected in the current page.
Browse Highlighted Links
Auto-browses through the links that are highlighted in the current page. (Hover over the links to highlight series of links.)
Browse Series of Result Pages
Auto-browses through the pages of a series.

Active when OutWit finds a navigation link to a following page (i.e. if the current page is part of a series, like a result page for a query in a search engine). Escape or a second click on the Browse button will stop the auto-browse process. Right-clicking or clicking and holding down the Browse button shows a menu allowing you to choose the extent of the automatic browse to perform (2, 3, 5, 10 or all pages).
Dig / Browse & Dig Result Pages
Gives access to the Dig sub-menu: The Dig function is a systematic exploration of all links found in a page, in a whole site or in a series of result pages. In order to not visit hundreds of unwanted pages randomly, you can set a number of limitations. You can visit pages if they are within the same domain as the current page, outside the page domain or you can visit any link found. You can also specify the Depth of your exploration: Depth 0 is the list of links found in the page, depth 1 also includes all the links found in each visited page and depth 2 does the same one level below. In the Advanced Settings dialog, you can combine all these criteria and even set an additional filter with a string (or a regular expression) which must be present in the URL for the program to explore it. (Only links matching the list of extensions set in the advanced preference panel are explored in a Dig. Some link types are also systematically filtered out from the exploration: log-out pages, feeds which cannot be opened by the browser, etc.)

Reload the page
Reloads the current page.
Stop All Processes
Aborts current processes, like the loading of a page, the dig and browse functions, the execution of a macro, etc. In many instances, the escape key has the same effect. Only active when OutWit is browsing, digging, loading a page, etc.
Pause All Processes
Pausing complex processes and resuming them at a later time is not always possible. This function gives a simple solution by suspending all processing while displaying an alert and waiting for a click. Only active when OutWit is browsing, digging, loading a page, etc.
Bookmarks
Gives access to the bookmarks.
History
Gives access to the navigation history.
Workshop
Loads the ‘Workshop Page’, a blank page where you can paste and edit any textual content or data to be processed with OutWit Hub.

The Tools Menu
Gives access to additional tools and features.

Available options may vary with the view and the license of your product.

Reset All Views
Reverts the settings in the bottom panels of every view to their original values.
Clear History
Clears your browsing history. You can choose to erase everything, or specifically the history of pages you visited, your form-filling history, your cache or all your cookies.
Downloads
Opens the download panel.

Preferences
Opens the OutWit Hub’s user preference panel.
Apply Scraper
Applies an applicable scraper to the current page.
Apply Macro
Applies a generic macro to the current page.
Error Console
Displays the error console with messages (blue), warnings (yellow) and errors (pink) that have occurred recently.

The Datasheet Right-Click Menu
In all datasheets, additional features can be accessed with a right click on the selected items.

Available options can vary with the view and the license level of your product.
Note that this menu has changed in versions 3.x and 4.x

Edit
Gives access to the Edit sub-menu, with the standard Editing functions and more.

Editing Functions
Cut, Copy and Paste functions are available for cell editing.
Copy Cell(s)
Copies the content of selected cells. Use it to get the contents of selected cells in a column as a list of values.
Edit Cell
Allows for inline editing of a cell content.
Replace in Cell(s)
Opens the replace dialog for replacements in the selected cells of the current column.
Replace All
Opens the replace dialog for replacements in the whole datasheet.
Rename Column
In data views, this option allows you to change the header of a dynamic column.
Empty Cell(s)
Empties the selected cells of the current column.
Duplicate
Duplicates the contents of selected rows and inserts the duplicates as new rows after the selection.

Insert
Gives access to the Insert and Split sub-menu, with cell/row/column insertion functions.

Insert Row
Inserts a new blank row after the selection.
Insert Rows
Gives access to the String Generation Panel and inserts the generated strings as new rows before the selection. This Insert Rows function allows you to generate strings using the Query Generation Pattern format.
Split First/Last Names
If the selected cell values are recognized as people names, this function inserts new ‘First Name’ and ‘Last Name’ columns before the selected column (if these do not already exist) and fills them with the corresponding values found in the selected cell(s). Note that, for now, only one pair of First Name/Last Name columns can exist in the datasheet.
Split Cell(s) to Rows
If the selected cell values contain a character recognized as an item separator (;,-/), this function inserts new rows below the selected rows and fills them with the split values of the selected cells, duplicating the content of the other cells of the selected rows. Note that, as with all ‘intelligent’ functions, this one can sometimes have unexpected results, but it can nevertheless save you a lot of time in many repetitive tasks.
Split Cell(s) to Columns
If the selected cell values contain a character recognized as an item separator (;,-/), this function inserts new columns left of the selected column and fills them with the split values of the selected cells. Note that, as with all ‘intelligent’ functions, this one can sometimes have unexpected results, but it can nevertheless save you a lot of time in many repetitive tasks. (A minimal splitting sketch follows this menu.)
Insert Column
In data views, inserts a new blank column before the selection. This option only applies to dynamic columns.
Insert Cell(s)
In data views, inserts new blank cells before the selection. This option only applies to dynamic columns.
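As a rough illustration of how cell values containing a recognized separator (;,-/) can be split into extra rows or columns, here is a small Python sketch; the separator set and the row-duplication logic are assumptions based on the descriptions above, not OutWit Hub's actual code.

import re

SEPARATORS = r"[;,\-/]"   # the separators mentioned above, as a character class

def split_cell_to_columns(value):
    # '35-40-70' -> ['35', '40', '70']
    return [part.strip() for part in re.split(SEPARATORS, value) if part.strip()]

def split_cell_to_rows(row, column):
    # Duplicate the row once per split value of the given column.
    return [{**row, column: part} for part in split_cell_to_columns(row[column])]

print(split_cell_to_columns("red; green; blue"))
print(split_cell_to_rows({"name": "Widget", "colors": "red; green"}, "colors"))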

Delete
Gives access to the Delete sub-menu, to delete cells, rows or columns.

Delete
Deletes the selected row(s).
Delete Unselected
Deletes the row(s) that are not selected.
Delete Column
In data views, deletes the selected column. This option only applies to dynamic columns.
Delete Columns
In data views, this option gives you access to a sub-menu allowing you to delete columns containing fewer than a certain number of populated cells. This option, which only applies to dynamic columns, is very useful to clean up large scrapes where useless columns have been created by poorly populated data fields.
Delete Cell(s)
In data views, deletes the selected cells and moves all the cells located to the right of the selected column one position to the left. This option only applies to dynamic columns.
Delete Duplicates
Gives access to a sub-menu to delete cell duplicates (rows containing an identical value to the selected cell in the same column) or row duplicates (rows where all cells are identical to the cells of the selected row). It is also possible, through the same menu, to delete all cell or row duplicates of the datasheet.

Select
Gives access to the Select sub-menu, with various ways to select cells or rows.

Select All
Selects all rows of the datasheet.
Invert Selection
Deselects all selected rows of the datasheet and selects all rows that were not selected.
Select Block
In the lists, tables, scraped and news views, this function will select the whole block (list, table, scraped page or rss feed) where the selected row is located. Note: the selection is done using the second group of digits in the Ordinal ID. (Use the column picker at the top right corner of the datasheet to show the Ordinal column, if it is not visible.)
Select Similar
Selects all rows of the datasheet with content similar to that of the selected cell. The default threshold used for determining similarity is 40 (0 selecting only identical values and 100 selecting everything). Use the sub-menu items to increase or decrease the threshold and select more or fewer rows. (See the sketch after this menu.)
Select Identical
Selects all rows of the datasheet with content identical to that of the selected cell.
Select Different
Selects all rows of the datasheet with content different from that of the selected cell.
Select Duplicates
Gives access to a sub-menu to select cell duplicates (rows containing an identical value to the selected cell in the same column) or row duplicates (rows where all cells are identical to the cells of the selected row). It is also possible, through the same menu, to select all cell or row duplicates of the datasheet.
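OutWit Hub does not document the similarity measure behind Select Similar; purely to illustrate how a 0–100 threshold can drive such a selection, here is a Python sketch using a generic edit-distance ratio (an assumption, not the actual metric).

from difflib import SequenceMatcher

def is_similar(a, b, threshold=40):
    # 0 selects only identical values, 100 selects everything.
    distance = (1 - SequenceMatcher(None, a.lower(), b.lower()).ratio()) * 100
    return distance <= threshold

rows = ["OutWit Hub", "Outwit hub", "outwit-hub.com", "Something else"]
selected = [r for r in rows if is_similar("OutWit Hub", r, threshold=40)]
print(selected)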

Auto-Explore
This sub-menu gives access to automation functions that you can apply to the URLs of the selected column in the selected rows of the datasheet. It gives you the capacity to explore the pages or documents and apply extractors, according to the current configuration of the application.

Browse
The program explores the links included in the current selection one after the other. During the exploration, all active extraction processes will be executed on page load, depending on the settings of the views’ bottom panel.

Dig
The program explores the links found in the pages of the current selection’s URLs. The exploration will be done within the domain of each link, with a depth of 1. During the Dig process, all active extractions will be executed on page load, according to the settings of the views’ bottom panel.

Fast Scrape
Applies a Scraper to a list of Selected URLs. When this function is invoked, XML HTTP requests are sent to all the selected URLs, to retrieve the source code of each one. The most relevant scraper is applied to each source, without loading images and without any other extraction being performed. All extracted data is sent to the Scraped view (which is not emptied during the process, regardless of the state of the Empty checkbox). (A minimal sketch of this flow follows this menu.)
Fast Scrape (Include Selected Data)
Same function as ‘Fast Scrape’ above, except that the data fields included in the selection will be added to the scraped results. This saves you the work of merging back the records after the scraping, if you need to keep information from the original data.
Apply a Generic Macro
This function allows you to apply a generic macro to the selected URLs. Generic macros are simply macros for which no specific URL is set in the Start Page field.
Open URL in a New Window
In the Firefox Add-on: When the selected data contains a URL, it will be opened in a new browser window.
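Conceptually, Fast Scrape retrieves each URL's raw source over plain HTTP and applies the scraper to it without rendering the page. The following Python sketch mimics that flow with the standard library; the fetching helper and the regex-based field extraction are illustrative assumptions, not OutWit Hub's internals.

import re
import urllib.request

def fetch_source(url):
    # Retrieve the raw HTML source, without loading images or running scripts.
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", "replace")

def fast_scrape(urls, fields):
    # fields: {field_name: regex with one capturing group}
    records = []
    for url in urls:
        try:
            source = fetch_source(url)
        except Exception:
            continue
        record = {"Source URL": url}
        for name, pattern in fields.items():
            match = re.search(pattern, source, re.I | re.S)
            record[name] = match.group(1).strip() if match else ""
        records.append(record)
    return records

# Hypothetical usage:
# fast_scrape(["https://example.com/"], {"Title": r"<title>(.*?)</title>"})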

Download
Gives you access to the Download sub-menu. Note that a preference (in Tools>Preferences>Export) allows you to automatically rename the downloaded files.

Download Selected Files
Downloads all documents and images found in the selected rows and saves them to the current destination folder on your hard disk.
Download Selected Files in…
Downloads all documents and images found in the selected rows, opening the folder picker to let you decide where you want the files to be saved.

First Names
Gives you access to the First Names sub-menu.

The First Name Dictionary is used to enhance the recognition of contacts in Web pages. A default dictionary of a few thousand first names from around the world is already included in the program. You can add your own using these options. Note that the dictionary can be saved and loaded from the File menu.

Remember First Name
Choosing this option when a first name is selected in the datasheet will add it to your dictionary.
Forget First Name
Choosing this option when a first name is selected in the datasheet will remove it from your dictionary.

Clean Up
Gives access to the Cleaning & Normalization sub-menu.

Clean Contents
Gives access to the String Cleaning sub-menu.

To Lower Case
Converts all characters of the selected cells to lower case.
To Upper Case
Converts all characters of the selected cells to upper case.
Capitalize Words
Converts the first character of each word in the selected cells to Upper case and the others to lower case.
Dust It
Cleans the text as well as possible and capitalizes the words.
Zap It
Cleans the text of all non-alphabetical characters and capitalizes the words.

Normalize All Figures / Selected Figures in Column
When this function is executed on a selection, the numerical data contained in each selected cell of the selected column (or in the whole datasheet, depending on the selected option) is reformatted and converted to the corresponding value in metric units (if a numerical value is found with a non-metric unit). Numerical values are normalized as much as possible: thousand separators are removed, dots are used as decimal separators, trailing zeros in decimals are removed, etc. The purpose of this function is not to create a nice formatting but rather to homogenize the formats so that the values can be processed and sorted. Note: the feature is watched by dozens of unit tests in our system and works rather well. There are, however, many possible causes for misinterpretation of numbers in a text, so please do not rely on this function for processes involved in the piloting of commercial airliners, nuclear power plants, etc. (A minimal normalization sketch follows this sub-menu.)
To Units: Values will be converted to meters, square meters, cubic meters, grams etc.

To k Units: Values will be converted to kilometers, square kilometers, kilograms etc.
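The following Python sketch illustrates the kind of normalization described above: removing thousand separators, using dots as decimal separators and converting a value to metric. The unit table is a tiny illustrative subset and the parsing rules are assumptions, not OutWit Hub's actual logic.

import re

UNIT_FACTORS = {"in": 0.0254, "ft": 0.3048, "mi": 1609.344, "lb": 453.59237}  # to meters / grams

def normalize_figure(text):
    match = re.search(r"([\d.,\s]+)\s*([a-zA-Z]*)", text.strip())
    if not match:
        return text
    number, unit = match.group(1).replace(" ", ""), match.group(2).lower()
    if "," in number and "." in number:
        number = number.replace(",", "")       # 1,234.5 -> 1234.5
    elif "," in number:
        number = number.replace(",", ".")      # 1234,5  -> 1234.5 (decimal comma)
    value = float(number)
    if unit in UNIT_FACTORS:
        value *= UNIT_FACTORS[unit]
    return ("%f" % value).rstrip("0").rstrip(".")  # drop trailing zeros in decimals

print(normalize_figure("1,234.50 in"))   # inches converted to meters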

Send to Queries
Sends strings to a directory of Queries.

Send Cell(s) to Queries
Sends the selected cells to the chosen directory of the queries view:

New Directory: A new directory will be created with the selected items.

directoryName: The selected items will be sent to the chosen directory.

Send Link(s) to Queries
The first links found in the selected rows will be sent to the chosen directory of the queries view:

New Directory: A new directory will be created with the selected items.

directoryName: The selected items will be sent to the chosen directory.

Export Selection as…
Exports the selected data to a file on your hard disk, in one of the available formats (Excel, HTML, Text, CSV, SQL).

The Page Right-Click Menu
In the browser panel, additional features can be accessed with a right click on the page or a click on the Exploration Button.

Available options can vary with the context and the license level of your product.
Edit (right-click menu)
Gives access to the Edit sub-menu.

Copy Links
If a part of the current page was selected in the browser panel, OutWit Hub will copy the links found in the selection, otherwise all the links of the page will be copied to the clipboard.
Copy
Copies the selection to the clipboard.
Paste
Pastes the clipboard content.
Paste Text
Pastes the clipboard content as plain text.
Paste Links
Pastes the URLs found in the clipboard content. Note that if you use this function on the browser, the list of links will replace the currently displayed page.
Send Copied Links to Queries
If there are links (URLs) in your clipboard (copied from OutWit Hub or any other application), this function sends them to a new or existing directory of the queries view.
Send Highlighted Links to Queries
When you hover over a link, OutWit Hub highlights the series of links it belongs to; this function sends the whole series to a new or existing directory of the queries view.
Send Page Links to Queries
Sends all URLs found in the current page to a new or existing directory of the queries view.
Edit Page Tools
Gives access to a series of functions to alter or reformat the current page (the resulting page can be used as the source for all extractions or saved to your hard disk):

Extract All Page Links: Replaces the currently displayed page with a generated HTML page containing all the links found in this page.
Outline Page: Replaces the currently displayed page with a generated outline of the original page, only keeping the section and paragraph titles and subtitles.
Indent Page: Replaces the currently displayed page with a generated outline of the original page, including the text content, indented within the outline.
Decode MIME inclusions: Replaces MIME inclusions (if any) within the currently displayed page with legible (and extractable) decoded text.

Select Similar
Selects links that belong to the same series or that are at the same hierarchical level as the selected link.
Select All
Selects all the page content.
Find
Looks for a string or regular expression in the page.

Options (right-click menu)
Allows you to disable images, plugins and/or javascript in order to enhance the performance during large automatic explorations. (These settings are persistent between sessions. Do not forget to switch them back to revert to normal browsing.)

Fast Search for contacts & Auto-Explore Pages
Gives access to a series of automatic navigation functions. (See option details in the Navigation Menu.)
Apply Scraper
Gives access to the Scraper Application sub-menu. Applies the most pertinent scraper to the current page or to the selected / highlighted links. When a scraper is applied to links with this function, Fast Scrape mode will be used. If you do not want to use the Fast Scrape mode, use the Auto-Explore Pages function instead, after having set the scraped view to receive the data.
Apply Macro
Applies a generic macro with the current page as start page.
First Names
Gives access to the First Names sub-menu.

The First Name Dictionary is used to enhance the recognition of contacts in Web pages. A default dictionary of a few thousand first names from around the world is already included in the program. You can add your own using these options. Note that the dictionary is located in your automator database, which can be saved and loaded from the File menu.

Remember First Name
Choosing this option when a first name is selected in the page will add it to your dictionary.
Forget First Name
Choosing this option when a first name is selected in the page will remove it from your dictionary.

OutWit Hub’s Views — The Side Panel

The side panel on the left of your screen contains all available views of the application.

The different views allow you to dissect the page into its various data elements.

Some display extracted data (links, contacts, text…), while others give you access to tools for performing specific extraction tasks (automators).
Items may be collapsed: to display the views they contain, click on the triangle pointing to the right (►). Some of the sections containing views (like Data) are not clickable, as they do not correspond to a view. You need to open the section, if it is collapsed, and select one of the views inside it.
Note: Some views are present in both light and pro versions, with limited or disabled features in the light version, others are only present in the pro version.

page
Displays the current web page or document analyzed in the other widgets.

links
Lists URLs found in the current page.

documents
(Pro) Lists documents found in the current page.

images
Lists images found in the current page.

emails
Lists contact info found in the current page.

data
Contains the data extraction tools.

tables
Extracts HTML table contents.

lists
Extracts HTML list contents.

guess
Tries to guess the structure of the data and extract it.

scraped
Applies the most pertinent active scraper to the page.

text
Displays the current page as simple text.

words
(Pro) Displays the vocabulary used in the page, with the frequency of each word.

news
Displays RSS news found in the current page or domain.

source
Displays the HTML source of the page.

automators
Contains the automation tools.

queries
(Pro) Allows you to create directories of URLs.

scrapers
Allows you to create and edit data scrapers.

macros
(Pro) Allows you to create and edit macros.

jobs
(Pro) Allows you to program the execution of a task.

history
Displays the navigation history, grouped by domain name.

The Page View
This is the browser: it displays the current web page or file that is being analyzed in the other views.

When in the page view, you can navigate through Web pages as you would in any Web browser. You can also open a local file or even drag a folder from your hard disk to the url bar to see (and navigate through) its content.

Exploration Button and Right-Click Menu
If active, the Exploration Button, represented by a magnifying lens with an at sign (@), is located at the top left corner of the browser. When moving your cursor across the page, you will see it placing itself above the series of links that the program recognizes and highlights. Automatic navigation functions are available by clicking on this button or by right-clicking directly on the page. See details in the Page Right-Click Menu.

TIP – Optimizing Performance: Right-click on the page to disable or reactivate images and plugins in OutWit’s browser. Deactivating them can make the loading of each page faster for long explorations and extraction workflows, when you do not need images or flash animations.
Click on the black triangle next to the page view name in the side panel to hide or show the extractors (links, images, contacts, data, tables, lists, guess, scraper, text, news and source views).
Note: you can select, in the general preferences, whether you want the application to remain in the current view or to come to this view when a URL is typed in the address bar.
Dragging text to the page
You can drag a selection from another application to the browser and it will appear as simple text. This is one of the many ways to import URLs from another source: just drag a selection of URLs from a text editor and you will find them in the “links” view. (You can also put them in a .txt file and open the file with the Hub.)

The Links View
Shows the list of URLs found in the current page.

The links view displays a table of the URLs found in the current Web page or file that do not link to media or documents. The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the link was found*
Page URL: The URL of the link itself
Frequency: The number of occurrences of this link in the page
Text: The description text of the link
Filename: The name of the file the URL links to
Type: The type of document
Mime Type: The Mime Type of the file on the server
First Seen: The first time this link was seen*
Last Seen: The last time this link was seen*
Main Doc URL: The URL of the page’s main document. (Useful when a page contains frames or iFrames, to have the parent URL.)*

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
If you wish to download some of the files, simply select them in the table and use the ‘Download Selected Files’ or ‘Download Selected Files in…’ option of the right-click menu.

Bottom Panel Options

When the local checkbox is unchecked, OutWit hides links to the same domain as the current page.

When the cache checkbox is unchecked, OutWit hides links considered to be cached data.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Documents View (pro version)

Shows the list of all document URLs found in the current page.

The documents view displays a table of all document files (.doc, .pdf, .xls, .rtf, .ppt…) found in the file currently displayed in the page view. It includes the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the document was found*
Document URL: The URL of the document
Filename: The name of the file the URL links to
Last Modified: The modification date, if found on the server
Size: The file size
Type: The type of document
Mime Type: The Mime Type of the file on the server

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
If you wish to download some of the documents, simply select them in the table and use the ‘Download Selected Files’ or ‘Download Selected Files in…’ option of the right-click menu.

Bottom Panel Options

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Images View

Shows the list of all images found in the current page.
The images view displays a table of all image files found in the page currently displayed in the page view or in linked pages.

TIP – Optimizing Performance: Right-click on the images view name in the side panel to disable or reactivate automatic image extraction when a new page is loaded. Deactivating this can make the processing of each page faster for long explorations and extraction workflows, when you do not need images. (Also see the page view.)

The table contains the following information:

Source URL: The URL of the page where the image was found*
Image: The thumbnail of the image
Filename: The name of the image file
Size: The size of the image in pixels (width x height)

Media URL: The URL of the image file
Found in: The DOM element where the image was found in the source code (image tag, script, background…)
Description: The description text of the image
Type: The type of image
Mime Type: The Mime Type of the image file on the server
Thumb URL: The URL of the thumbnail (if a high resolution image was found)*

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
If you wish to download some of the images, simply select them in the table and use the ‘Download Selected Files’ or ‘Download Selected Files in…’ option of the right-click menu.

Bottom Panel Options

If the adjacent checkbox is checked, OutWit will look for sequences of pictures, by trying to find numerical sequences in the URLs of the found images. For instance: if an image named obama_022.jpg is found, the program will try to find obama_021.jpg and obama_023.jpg on the same server.
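As an illustration of the 'adjacent' idea, this small Python sketch derives the previous and next image URLs from a numbered filename; the regex and zero-padding handling are assumptions for the example, not OutWit Hub's code.

import re

def adjacent_urls(url):
    match = re.search(r"(\d+)(?=\.\w+$)", url)   # last number before the file extension
    if not match:
        return []
    number, width = match.group(1), len(match.group(1))
    neighbours = []
    for offset in (-1, 1):
        candidate = str(int(number) + offset).zfill(width)
        neighbours.append(url[:match.start()] + candidate + url[match.end():])
    return neighbours

print(adjacent_urls("http://example.com/img/obama_022.jpg"))
# ['http://example.com/img/obama_021.jpg', 'http://example.com/img/obama_023.jpg']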

When the scripts, styles or backgrounds checkboxes are checked, OutWit looks for images in the corresponding tags of the page source code. Styles is unchecked by default, as style images are often small layout elements of lesser interest.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Contacts View
Shows the list of email addresses and contact elements found in the current page.

The emails view displays a table of the email addresses found in the current Web page / file or in the automatically explored pages. The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the address was found*
Source Domain: The domain of the page where the address was found
Page Title: The title of the page where the address was found
Email: The email address itself
Frequency: The number of occurrences of this email address in the page or in the automatically explored pages
Contact Info Columns: First Name, Last Name, Address, Phone, Fax, Mobile, Toll Free, Title… are added when the ‘Guess Contact Info’ checkbox is checked in the bottom panel.

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.

Bottom Panel Options

Strict: When unchecked, OutWit Hub will look for addresses with a looser format and will accept strings like “name at site dot com” or “pseudo[at]domain.org” as valid email addresses.
Guess Contact Info: When checked, OutWit will try to find additional contact information linked to the email address.
Filter Level: Beside the Guess Contact Info checkbox is a popup menu allowing you to select the level of filtering or strictness for the info recognition. When the filter is maximum, the contact data found is only added if it is very likely to be linked to the email address. When the filter is minimum, all found data is added to the result datasheet, at the risk of grabbing some noise or making occasional mistakes associating the info to the email address.

Note: The contact info extraction is based on recognition by the program of unstructured data in each page.
Recognizing that a series of digits is a phone number rather than a social security number or a date is easy if you know in advance that you are dealing with data from a given country. If you don’t, however, the problem is very far from trivial.
A brief description of the way OutWit searches for names, addresses, phone and fax numbers etc. will help you understand how reliable it can be, depending on the source: The program first checks whether additional contact information can be found in the immediate context of each email address. After this, it takes all non-assigned phone numbers and physical addresses and sees if each is likely to belong to one of the previously found contacts. Otherwise, these are listed independently, lower in the result datasheet.
For the data to be extracted, it must first be present in the page, of course. Then, if it is, no technology allows for perfect semantic recognition. An address or a phone number can take so many different forms, depending on the country, on the way it is presented or on how words are abbreviated, that we can never expect to reach a 100% success rate.
Email address recognition is nearly perfect in OutWit; phone numbers are recognized rather well in general; physical addresses are more of a challenge: they are better recognized for the US, Canada, Australia and European countries than for the rest of the world. The program recognizes names in many cases. As for other fields like the title, for instance, automatic recognition in unstructured data is too complex at this point and results would not be reliable enough for us to include them unless they are clearly labeled. We are constantly improving our algorithms, so you should make sure to keep your application up-to-date.
If your need for precision in the extracted data is critical in your workflow and if you cannot afford failed automatic recognition, it may not be a good idea to rely on automatic features like this one. In these cases, you may want to create a scraper for a specific site.
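As the note above says, email recognition is the most reliable part of this process. Purely to illustrate the kind of pattern matching involved, and the looser formats accepted when 'Strict' is unchecked, here is a Python sketch; the patterns are illustrative assumptions, not OutWit Hub's actual rules.

import re

OBFUSCATED_AT = re.compile(r"\s*(?:\[at\]|\(at\)|\s+at\s+)\s*", re.I)
OBFUSCATED_DOT = re.compile(r"\s*(?:\[dot\]|\(dot\)|\s+dot\s+)\s*", re.I)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def find_emails(text, strict=True):
    if not strict:
        # Normalize obfuscated forms such as "name at site dot com" or "pseudo[at]domain.org".
        text = OBFUSCATED_AT.sub("@", text)
        text = OBFUSCATED_DOT.sub(".", text)
    return EMAIL.findall(text)

sample = "Contact: jane at example dot org, or pseudo[at]domain.org"
print(find_emails(sample, strict=True))    # only well-formed addresses
print(find_emails(sample, strict=False))   # obfuscated forms accepted too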

Max Processing Time: Allows you to set the maximum time in seconds that the program should spend analyzing each page when searching for contacts.

Empty/Auto-Empty: This button offers two positions, accessible via the popup arrow on its right side: Empty on Demand, which allows you to only clear the contents of the results datasheet when you decide, or Auto-Empty, which tells the program to clear the results each time a new page is loaded.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Data Section

Gives access to the different data extraction views of the application.

The current version of OutWit Hub Pro offers four data views: Tables, Lists, Guess and Scraped.
Note: You can hide or show those by clicking on the black triangle next to the section name in the side panel.
The Tables and Lists views will help you extract data with an explicit structure in the HTML source code of the page. The other data extractors will be useful when these two are not enough to get the job done. Guess tries to automatically recognize the data structure and Scraped allows you to manually define how the extraction should be done.

The Tables View
Displays the HTML tables found in the current page.

The tables view displays in the datasheet the HTML tables of three rows or more found in the current page. The minimum number of rows required for tables to be extracted can be altered in the preferences (Tools>Preferences>Advanced Tab).
In case of merged cells in the HTML code, using row or column spans, the cells are kept separate in the view datasheet and the values will be repeated in the corresponding cells. By default, tables of fewer than three rows are ignored; this can be changed in the preferences.
If a hypertext link is found in the data of a table row, it will be placed by the program in the URL column at the left of the datasheet. The objective is to gather the useful links in this one column, both for the Lists and Tables views. This column will usually be the simplest way for you to grab collections of links to explore further. If several links are found in each row, OutWit will try to decide which column contains the most significant links. By default, the first column containing URLs will be chosen, unless there is a column with fewer missing links, fewer duplicate links, etc. This is an arbitrary algorithm, but it usually works pretty well.
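The scoring below is only an illustration of that kind of heuristic (prefer the first column with URLs, unless another one has fewer missing and fewer duplicate links); it is an assumption, not OutWit Hub's exact rule.

def pick_link_column(columns):
    # columns: {column_name: [url or None, ...]}
    def score(urls):
        present = [u for u in urls if u]
        missing = len(urls) - len(present)
        duplicates = len(present) - len(set(present))
        return (missing, duplicates)
    # min() keeps the first column in case of a tie, matching the default described above.
    return min(columns, key=lambda name: score(columns[name]))

table = {
    "col1": ["http://a/1", None, "http://a/1"],
    "col2": ["http://b/1", "http://b/2", "http://b/3"],
}
print(pick_link_column(table))   # 'col2': no missing links, no duplicates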
The first three columns of the datasheet contain the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the list was found*
URL: The most significant URL found in the table row (if any). Often the first link found.

The following columns vary with the data extracted.

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Lists View
Displays the HTML lists found in the current page.

The lists view displays in the datasheet the HTML lists (<ul>, <ol> and <li> tags) found in the current page, keeping the hierarchical level of the items.

If a link is found for a list item, it will be stored in the URL column of the datasheet. If several links are found, only the last one will be kept. The first three columns of the datasheet contain the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the list was found*
URL: The last URL found in the list item (if any)

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options
Add Titles: This option was added in v3.0, as lists are often difficult to identify or understand without the title preceding them. When this option is checked, the program includes the content of the title and heading tags found in the HTML page.

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Guess View
Displays the data extracted using automatic structure recognition algorithms.

The guess view tries to understand the structure of the data found in the current page, if any.

Note: The program analyzes the available HTML source code of the page. Labels and field/record separators are looked for, using many different strategies. The program eventually gives a rating to each possible structure found and decides on the best possible answer, if any. The challenge for these intelligent algorithms is to understand even non-tabulated data and we will make sure they become more and more efficient, but the very nature of the problem makes it impossible to ever get close to a 100% success rate.
If your need for the scraped data is critical in your workflow and if you cannot afford failed automatic recognition, it may not be a good idea to rely on automatic features like this one. In these cases, you should probably define a scraper and use the right click menu option: ‘Auto-Explore’ > ‘Fast Scrape’. This way, if you have thoroughly tested the scraper you have designed, the process will be reliable and reproducible, at least as long as the online source is not altered and remains accessible.

Ordinal: An index composed of three groups of digits separated by dots (hidden by default)

Source URL: As in the datasheets of the other views, the Source URL is placed in the second column and is hidden by default.
Note: Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options
List: When checked, OutWit will try to find a list of records and present them as a table (one record per row, one field per column). When unchecked, the program will try to recognize “specsheet” type of data in the page (one row per field, a Label and a Value for each field). If you uncheck this option, OutWit should do better with simple text data like this:

last name: Knapp

first name: John

age: 34

phone: (674) 555-5621

You can try this option by going to the workshop page (ctrl/cmd-shift-k) and pasting text from a word processor, emails… Guess will usually do better if you paste simple text, using a right-click and choosing Edit>Paste Text.
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Scraped Data View
Displays the results of the application of a scraper to the current page (or to a series of URLs).

The scraped data view displays in a table the data extracted using the active scraper with the highest rating. Each field defined in the scraper corresponds to a column of the datasheet.
If several active scrapers can be applied to the current URL, the possible candidates will be rated according to their version number, their freshness and the specificity of the Apply to URL (mySite.com having a lower priority than www.mySite.com/myPage). If you wish to apply a scraper of a lesser rating, you can deactivate all scrapers with a higher priority in the scraper manager.
Ordinal: An index composed of three groups of digits separated by dots

Source URL: As in the datasheets of the other views, the Source URL is placed in the first column and is hidden by default.
Note: Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.

Bottom Panel Options
The Keep Order checkbox allows you to ask OutWit to force the columns in the same order as the scraper. If this option is checked, all columns will appear in the resulting data, even when they are completely empty. (This can be useful if you wish to export to an Excel or HTML file, for instance as part of a job, to then use this data in a set process, with other applications.)
In case of application of a scraper to whole lists of URLs, like with the ‘Auto-Explore’ > ‘Fast Scrape’ of the datasheets’ right-click menu, the Empty checkbox will be ignored. In all other cases, if this option is checked, the datasheet will be emptied as soon as a new page is loaded.
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Text View
Shows the current page as simple text.

The Text view displays the textual content of the current page and ignores all other content: scripts, media, animations, layout, etc.

Note: You can hide or show the Text related views available in your version of OutWit Hub, by clicking on the black triangle next to the view name in the side panel.

All or parts of the text can be moved to the Catch (or saved to a file).

The Words View (pro version)
Displays the vocabulary used in the page, with the frequency of each word.

The Words view displays a table of significant words and groups of words found in the source code of the current Web page or file. The frequency column gives you, as a fraction, the number of occurrences divided by the total number of words. Note that if you notice a higher number of occurrences than what you can actually see in the Web page, it means that the other occurrences of the word or phrase are in the source code but hidden (like alternate text, invisible blocks, etc.).

Groups of words are recurring successions of two to four words. If OutWit recognizes the page language, “empty words” are ignored in this view. This means that in English, French, German, Spanish and several other occidental languages, very common pronouns, auxiliaries, articles, etc. will be ignored. This covers words like “the”, “is”, “which” etc.
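As a rough illustration of this view's logic, here is a Python sketch counting word frequencies while ignoring a few 'empty words'; the stop-word list is a tiny sample and the fraction here divides by the number of remaining words, both assumptions for the example.

import re
from collections import Counter

STOP_WORDS = {"the", "is", "which", "a", "an", "of", "and", "to", "in"}

def word_frequencies(text):
    words = [w for w in re.findall(r"[a-zA-Z']+", text.lower()) if w not in STOP_WORDS]
    total = len(words)
    return {word: "%d/%d" % (count, total) for word, count in Counter(words).most_common()}

print(word_frequencies("The hub is a tool. The hub extracts data, which is useful."))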

The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the word was found (or the number of pages where it appeared, if ‘Empty’ is unchecked)
Word: The word
Frequency: The number of occurrences of this word in the page

Note: The Ordinal column is hidden by default in this view. Use the column picker at the top right corner to show/hide them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.

Bottom Panel Options

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The News View
Shows the list of RSS articles found in the current page or in the current domain.

The news view displays a table of the news articles from all RSS feeds found in the current Web page, or in the same domain. The table contains the following information:

Ordinal: An index composed of three groups of digits separated by dots*
Source URL: The URL of the page where the feed was found*
Feed URL: The URL of the RSS feed*
Feed Title: The name of the feed*
Feed Link: The link of the HTML page corresponding to the feed*
Feed Description: The description of the feed*
Feed Language: The language of the RSS feed*
Title: The title of the article
Article URL: The link to the full article
Date: The date and time of release
Image: The URL to the attached image*
Category: The category name of the article*
Abstract: The abstract of the article

Note: Columns marked with an asterisk (*) are hidden by default. Use the column picker at the top right corner to show them. When a row is moved to the Catch, the Source URL is always included in the moved data, even if the column is hidden in the view datasheet. All other hidden columns are ignored in the transfer.
Bottom Panel Options

In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the filter controls and right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Source View
Shows the colorized source code of the current page.

When the current file displayed in the Page view is a Web page, the source view contains the colorized HTML source code of its main document.

In the pro version, a radio control allows you to select if you want to see the original source code that was loaded by the browser when opening the page or the code as it was dynamically altered by scripts after the page was loaded. The dynamic source code is presented on a pale yellow background and the original, on a white background. This will help you recognize the setting immediately.

The source code presented in the scrapers view is another instance of the same panel.

The colorization was conceived for data search rather than programming purposes and emphasis is given to the textual content that is actually displayed on the page: it is shown in black and pops out from the cryptic HTML syntax.

A distinct color is used for each of the following elements:

Displayed text
HTML tags
Links
Comments
HTML Entities
Styles
Scripts
Images

The History View
Displays the list of seen URLs grouped by domain.

The history view doesn’t show the list of URLs that have been visited (this would be redundant with the browser navigation history), but the list of URLs that have been seen in the current session. This means that, as the history is grouped by domain, after surfing for 15 minutes (or hours) on Web pages related to a certain topic, say astronomy, you will find in this view a list of the most frequently cited domains in this topic.
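The grouping itself can be pictured with a short Python sketch: each seen URL updates a per-domain record with first/last seen timestamps and a counter (an illustration of the idea, not OutWit Hub's storage).

from urllib.parse import urlparse
from datetime import datetime

def record_seen(history, url):
    domain = urlparse(url).netloc
    now = datetime.now()
    entry = history.setdefault(domain, {"First Seen on": now, "Frequency": 0})
    entry["Last Seen on"] = now
    entry["Frequency"] += 1
    return history

history = {}
for url in ["http://nasa.gov/apod", "http://nasa.gov/news", "http://eso.org/"]:
    record_seen(history, url)
print(history)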

The History datasheet contains the following information:

Domain: The domain
First Seen on: The time when this domain was first recorded in the current session
Last Seen on: The time when this domain was seen most recently in the current session
Frequency: The number of times this domain was seen in the current session

Bottom Panel Options
In general, as for the other views, the content of this view can be filtered and sorted to extract specific data, using the right-click functions in the datasheet. You can also move it to the Catch (or export it to a file).

The Automators Section

Gives access to the different automators available in the application.

In the current version of OutWit Hub Pro, four kinds of automators can be defined: Scrapers, Macros, Jobs and Queries.

A Scraper is a description of the data structure in a page, defining the markers that can be found in the document source code, around the data you wish to extract.
A Macro is a snapshot of the complete configuration of the application’s various extractors, which can be replayed in a single click for performing a specific exploration and extraction task.
A Job is a preset time and periodicity at which an action should be performed.
A Set of Queries is a directory containing a list of URLs or Query Matrices on which an action (autobrowsing, macro, scraper, slideshow, etc.) can be performed.

The Automator Managers

Allows you to manage the automators stored in your profile.

In each of the automator views (Scrapers, Macros, Jobs and Queries), the manager is the panel presenting the list of all automators of the considered type stored in your profile. The manager allows you to create, delete, import, export automators and gives access to the property and automator editors.
Each automator is identified by its Automator ID (AID) in your profile. It is preceded by an Active checkbox. When this box is unchecked the automator is deactivated and grayed out in the list. You will need to activate it before using it.
If a layout change button is present at the top right corner of the panel, you will be able to switch between horizontal and vertical layout of the window, placing the editor and manager at the bottom or on the right of the screen.

The Scrapers View
A Scraper is a template telling OutWit how to extract information from a page.
When ‘tables’, ‘lists’ or ‘guess’ do not manage to automatically recognize the structure of a page and extract its data, you still have the option to create a Scraper and tell OutWit how it should handle this specific URL (or all the pages of a given Web site, a sub-section thereof, etc.).

What is a scraper?

A scraper is simply a list of the fields you want to recognize and extract. For each field, it specifies the name of the field (ex.: ‘Phone Number’), the strings located immediately before and after the data to extract in the source code of the page and the format of the data to extract for this field. The pro version also allows you to set a replacement string to alter the extracted data and a delimiter, to split the extracted result into several fields.

Creating and Editing Scrapers

The scrapers view can contain either the scraper manager (to create, duplicate, delete previously made scrapers), or the scraper editor, to build and edit them.
In editing mode, the source code of the current page is displayed, for you to easily identify and copy the markers you need. You can select which source code you want your scraper to be applied to: the original (white background) or the dynamic source code (pale yellow background), using the source type popup menu.
In the bottom part of the window is the editor itself, where you can create and modify your scrapers.
To switch from one mode to the other, use the Manage or Edit button.
Activating scrapers
As for all other extractors, the ‘scraped’ view is automatically active when any control of the bottom panel is set to a non-default value (i.e. ‘empty’ is unchecked, ‘move to catch’ is checked, etc.). On the scrapers themselves, an additional control must be set for the extraction to be done automatically:
In the scraper manager, the ‘active’ checkboxes determine whether a scraper should be used when the corresponding URL is loaded. When unchecked, the scraper is deactivated.

When asked to scrape a page, if more than one active scraper is applicable to that URL, OutWit will apply the one that seems most appropriate and recent, using several criteria (version number, modification time, specificity of the ‘URL contains’ string, etc.).

In the light version, only one scraper can be active at a given time and no more than ten scrapers can be present in the manager.

Scraping Data

If a control of the bottom panel is set to a non-default value in the scraped view, the program will try to find a matching scraper and apply it, as soon as a new page is loaded.

From both the manager and the editor, you can apply the scraper to the current page, using the ‘Execute’ button (or the right-click menu, in the manager). After performing an extraction with a scraper, you will find your results in the scraped data view.
From any datasheet, you can select URLs, right click on one of them and choose ‘Auto-Explore’ > ‘Fast Scrape’ in the popup menu.
You can of course include your scrapers in macros (browsing/digging through URLs or using the ‘fast scraping’ mode) for recurring extraction tasks.

The Scraper Editor (OutWit Scrapers Syntax Reference)

In the scrapers view, the bottom part of the window can either be the Scraper Manager or the Scraper Editor. In the scraper manager, you can see and organize your scrapers and, when double-clicking on one of them or creating a new one, the scraper editor opens and you can create or alter your scraper lines.
The Editor allows you to define the following information:

Apply if URL contains…: The URL to Scrape (or a part thereof). This is the condition to apply the scraper to a page. The string you enter in this field can be a whole URL, a part of a URL, or a regular expression, starting and ending with a ‘/’. (In the latter case, the string will be displayed in red if the syntax is invalid.) If you try to scrape a page with the ‘Execute’ button when this field doesn’t match the URL, an error message will be displayed. If two or more scrapers match the URL of the page to be scraped, the priority will be given to the most recent, with the most significant condition (longest match). (A minimal matching sketch follows the notes below.)

Note: If you keep getting the message: “This scraper is not destined to the current URL”, this is the field that must be changed. A frequent mistake is to put a whole URL in this field, when the scraper is destined to several pages. Try to enter only the part of the URL which is common to all the pages you wish to scrape, but specific enough to not match unwanted pages.

You may also get this error message if you are trying to apply a disabled scraper to a valid URL. In this case, just check the OK checkbox in the scraper manager.
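As a minimal sketch of the matching described above, the condition can be read either as a plain substring or, when written between slashes, as a regular expression. This is an illustration, not OutWit Hub's code; the case-insensitive flag is an assumption here, mirroring the default noted further below for scraper patterns.

import re

def scraper_applies(condition, url):
    if len(condition) > 1 and condition.startswith("/") and condition.endswith("/"):
        return re.search(condition[1:-1], url, re.I) is not None
    return condition in url

print(scraper_applies("mySite.com/products", "http://www.mySite.com/products?id=42"))
print(scraper_applies(r"/products\?id=\d+/", "http://www.mySite.com/products?id=42"))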

Source Type (pro version): You can set your scraper to be applied either to the original source code that was loaded by the browser when opening the page or to the code as it was dynamically altered by scripts after the page was loaded.

In the scraper definition itself, each line corresponds to a field of data to be extracted.

To edit or enter a value in a cell, double-click on the cell. To simplify the fabrication of scrapers and avoid typos, the best way is often to select the string you want in the source code and drag it to the cell. You can then edit it as you like.

Description (name of the field): can either contain a simple label like “Phone” or “First Name” or a directive (see below),
A) Marker Before – optional: a string or a Regular Expression marking the beginning of the data to extract,
B) Marker After – optional: a string or a Regular Expression marking the end of the data to extract,
C) Format – optional: a Regular Expression describing the format of the data to extract,
Replace – (pro version) optional: replacement pattern (or value to which this field must be set).
Separator (pro version) optional: delimiter, to split the extracted result into several fields.
List of Labels (pro version) optional: the list of labels to be used, if the result is split into several fields with a separator.

Important Notes:

1) In a scraper, a line doesn’t have to include Marker Before (A), Marker After (B) and Format (C). One or two of these fields can be empty. The authorized combinations are: ABC, AC, BC, AB, A, C.

2) When creating a scraper you can right-click on a marker or format field to find and highlight the occurrences of a string or a pattern in the source code. If you right-click on the description field it will allow you to find the whole scraper line in the source. This is very useful for troubleshooting.

3) The first line of the scraper will be considered by OutWit Hub as the field that starts a new record. This means that each time this scraper line matches data in the page, a new record will be created. Usually, the best way is to follow the order of appearance of the fields in the source document.

In the Format pattern, use the regular expression syntax and do not forget to escape reserved characters. Note: If you right-click on the text you have entered in a cell, an option will allow you to escape a literal string easily. In the Format field, the content will always be understood as a regular expression, even if not surrounded by / /.

In the Replace string, use \0 to insert the whole string extracted by this line of the scraper, or \1, \2, etc. to include the data captured by parentheses –if any– in the Format regular expression.

For instance, say you extract the string “0987654321” with a given scraper line. Adding a replacement pattern can help you rebuild a whole URL from the extracted data:

If you enter:

http://www.mySite.com/play?id=\0&autostart=true

as replacement string, the scraper line will return

http://www.mySite.com/play?id=0987654321&autostart=true
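Conceptually, a scraper line behaves as if the markers and format were combined into one regular expression, with the Replace pattern rebuilding the output from \0 and the captured groups. The Python sketch below illustrates that reading of the example above; the helper and the sample markup are assumptions, not OutWit Hub's internal code.

import re

def apply_line(source, before="", after="", fmt=r"[^<]+", replace=None):
    pattern = re.escape(before) + "(" + fmt + ")" + re.escape(after)
    results = []
    for match in re.finditer(pattern, source, re.I):
        value = match.group(1)                      # what \0 refers to
        if replace is None:
            results.append(value)
            continue
        out = replace.replace(r"\0", value)
        for i, group in enumerate(match.groups()[1:], start=1):
            out = out.replace("\\%d" % i, group or "")   # \1, \2... from the Format groups
        results.append(out)
    return results

source = '<span class="id">0987654321</span>'
print(apply_line(source, before='class="id">', after="</span>", fmt=r"\d+",
                 replace=r"http://www.mySite.com/play?id=\0&autostart=true"))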

In the Separator, use either a literal string like , or ; or a regular expression like / [,;_\-\/] /.
For technical reasons, all regular expressions used in scrapers are interpreted as case insensitive patterns by default: [A-Z], [A-Za-z] and [a-z] have the same result. This can be changed using the #caseSensitive# directive. This means that ‘Marker Before’, ‘Marker After’ and ‘Format’, which are always converted to regular expressions by the program, are case insensitive by default. Conversely, the ‘Separator’, which is used as is by the program, is case sensitive by default if it is a literal string, and case insensitive if it was entered as a regular expression.

To learn about Regular Expressions, please visit the RegExp Quick Start Guide

When splitting the result with a separator, use the List of Labels to assign a field name to each part of the data. Separate the labels with a comma. If there are fewer labels than split elements or if the Labels field is empty, default labels will be assigned using the description and an index.

Example: the string you want to extract doesn’t have distinctive markers and you do not know how to separate different elements of the data. Say the source code looks like this:

  • Dimensions:35x40x70

If you want the three dimensions in three separate columns, you can reconstitute the structure by entering the following:

Marker Before: Dimensions:
Marker After: (left empty)
Separator: x
Labels: Height,Width,Depth

Regular expressions can be used to keep the scraper short if you are comfortable with them. In many cases, however, it is also possible to do without. For instance, if you need an OR, just create two scraper lines with the same field name (Description).

Example: If you want your scraper to match both ‘sedan-4 doors’ and ‘coupe-2 doors’

The simple way is to do it in two separate lines:

Description: car
Format: sedan-4 doors

Description: car
Format: coupe-2 doors

Or you can use a regular expression:

Description: car
Format: /(sedan\-4|coupe\-2) doors/

Directives (pro version)

Directives alter the normal behavior of the scraper. They can be located anywhere in the scraper and will be interpreted before all other lines by the program. Directives are identified by # characters in the description field:

Pre-Processing:
#abortIf# and #abortIfNot# Aborts the scraping and interrupts the current automatic exploration if the scraper line matches (or doesn’t match) within the page.
#autoCorrect# is intended to fix common scraper problems. For now it only corrects wrapping offsets happening when the wrong field is used as the record delimiter. (This feature is temporary.)
#caseSensitive# makes the whole scraper case sensitive. Note that, as all the regular expressions and literals of a scraper are combined into a single regular expression at application time, it is not possible to define case sensitivity line by line or field by field. The whole scraper must be conceived with this in mind.
#checkIf# and #checkIfNot# If the scraper line matches at least one string in the page (#checkIf#), does not match anything (#checkIfNot#), or in any case, without condition (#check#), the content of the ‘replace’ field will alter the OK column of your scraper. A string of 0s and 1s in the replace field will set the OK checkboxes of the scraper in the same order. Note that the right-click menu on the replace field of a #check# directive line will allow you to copy the values from the OK column to the cell or copy the cell string to the OK column.

Example:

You want to turn off line 5 of your scraper if the page doesn’t contain “breaking news”:

Description:

#checkIfNot#

Format:

breaking news

Replace:

11110111

#cleanHTML# normalizes the HTML tags before the scrape, placing all attributes in alphabetical order. This can prove useful on occasions when a page was typed by a person (without rigor) instead of generated automatically.
#concatSeparator#separator# allows you to set the character or string to be used as a delimiter in concatenation functions like #CONCAT#, #DISTINCT#, etc.
#ignoreErrors# when this directive is used, cells where a function returned an error will be empty instead of containing an ##Error message.
#insertIfNot#myFieldName# if the scraper line does not match anything in the page, the content of the ‘Replace’ field will be added once to each row scraped for this page. It is the only way to insert information into your extracted data when the page does not contain it.
#insertIf#myFieldName# the data extracted by this scraper line will be added once to each record scraped in this page, if the scraper line matches one or more strings in the page. It is mostly here as the corollary of the previous directive, but it is a good way to get rid of duplicate columns in certain cases.
#keepOrder# has the same effect as checking the ‘keep order’ checkbox in the scraped view or in a macro, i.e. ensuring that the columns will appear in the result datasheet in the same order as the scraper lines. Setting it directly in the scraper allows you to make sure to always have this behavior with this scraper.
#outline# alters the source code before scraping, keeping only the document/page outline.
#indentedText# alters the source code before scraping, reorganizing the document/page layout into an outline with indented text.
#pause# or #pauseAfter# instructs the scraper to wait, after the page is processed, for the number of seconds set in the Replace field.
#processPatterns# instructs the scraper to check if URLs passed to the #addToQueue# directives are generation patterns. If they are, the patterns will be interpreted and all generated strings will be added to the queue.

#replace# pre-processing replacement: the string (or regular expression) entered in the ‘format’ field will be replaced by the content of the ‘Replace’ field throughout the whole source code of the page, before the scraper is applied.

Example: The page you wish to scrape contains both “USD” and US$. You wish to normalize it before scraping:

Description:

#replace#

Format:

US$

Replace:

USD
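
Conceptually, the directive behaves like a plain search-and-replace run over the page source before extraction. A tiny Python sketch, with a made-up source string:

# Normalize the page source before the scraper runs, as the #replace#
# directive does: every "US$" becomes "USD".
page_source = "Price: US$ 25.00 ... Total: 250 USD"  # hypothetical source
normalized = page_source.replace("US$", "USD")
print(normalized)  # Price: USD 25.00 ... Total: 250 USD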

#scrapeIf# data will only be extracted from the page if this scraper line matches something in the page source code.
#scrapeIfNot# data will only be extracted from the page if this scraper line doesn’t match anything.

Example: You want to scrape only pages that contain “breaking news”:

Description:

#scrapeIf#

Format:

breaking news

#scrollToEnd# instructs the scraper to scroll down to the end of the page and wait for the number of seconds set in the replace field (usually for AJAX pages, in order to leave time for the page to be refreshed).

Processing:
#addToQueue# stores the data scraped by the line in a global variable. The queue can then be accessed with the #nextToVisit()# function. (See below for more info.)

#exclude#myFieldName# If this directive is used, the content of the ‘Format’ field of the scraper line will not be accepted as a value for myFieldName. If the line matches with the excluded value, the match will be ignored.
#newRecord# each time the pattern of this scraper line matches a string in the page source, a new record (new row) is created in the result datasheet. The pattern to match can be entered either in the ‘marker before’ field or in the ‘format’ field.
#repeat#myFieldName# the matching or replacement value will be added in a column named myFieldName to all rows following the match of this scraper line.

Example: Say you have a page where the data to scrape is divided by continent, with each section introduced by a line such as:

Continent: XXXXX

You can set the scraper to add the continent in a column for every row by adding:

Description:

#repeat#Continent

Marker Before:

Continent:

Marker After:

The repeat directive can be used to set a fixed value in a column by only entering a string in the Replace field:

Example: For inputting data directly into your database without any touch-up in the process, you need to add the field “location” with a set value:

Description:

#repeat#Location

Replace:

New Delhi

Note: if a variable is entered in the Replace field, all its values will be concatenated in the repeated output.
#start# switches scraping on. Data will start being extracted in the part of the source code following the match of this scraper line. (Directives are not limited by #start# and #stop#. For instance, if the #scrapeIf# directive matches outside of the start/stop zones, it will still be executed.)

Example: You only want to start scraping after a given title, say:

Synopsis:

You simply need to type the string in the Format field of your scraper line:

Description:

#start#

Format:

Synopsis:

#stop# switches scraping off. Data extraction will stop after the match of this scraper line in the source code. (But the code analysis continues and scraping will start again if a #start# line matches.) Note that if the #stop# line matches before a #start# line (or if there is no #start# line), a #start# directive is implied at the beginning. In other words, in order to be able to stop, the scraping needs to start. Directives are not limited by #start# and #stop#. For instance, if the #scrapeIf# directive matches outside of the start/stop zones, it will still be executed.
#variable#myVariableName# Declares and sets the value of the variable (#myVariableName#). The occurrences of the variable are then replaced, at application time, by the scraped value in all other lines of the scraper. Variables can only be used within the scope of one scraper execution. They cannot be used to transfer information between two scrapers.

Example: Setting and using the variable ‘trend’.

line 1:

Description:

#variable#trend#

Marker Before:

Dow Jones:

Marker After:

 

Format:

/[-+\d,.]+/

line 2:

Description:

#showAlert#

Replace:

#if(#trend#<0,Bear,Bull)#
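
For readers more comfortable with code, here is a rough Python equivalent of these two lines. The ‘Dow Jones’ sample string is invented, and the float conversion is an assumption about how the captured value would be compared to 0:

import re

source = "Dow Jones: -123.45 (close)"  # hypothetical source line

# Line 1: capture the number after "Dow Jones:" into a 'trend' variable.
m = re.search(r"Dow Jones:\s*([-+\d,.]+)", source, re.IGNORECASE)
trend = float(m.group(1).replace(",", "")) if m else None

# Line 2: the #if(#trend#<0,Bear,Bull)# replacement.
if trend is not None:
    print("Bear" if trend < 0 else "Bull")  # Bear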

Anchor Functions: (The need for these functions is relatively rare; they help you solve difficult cases where the data is presented in columns in the HTML page, using blocks with left or right ‘float’ styles.) #setAnchorRow# stores the row number where this scraper line matches, so that data found later in the page source code can be added to the result table as additional columns, starting at this row number. Thus, when the #useAnchorRow# directive is encountered (and if an anchor row has been previously set), the following fields of data are added starting at the anchor row, until the #useCurrentRow# directive reverts to the normal behavior of adding a new row at the bottom of the result table each time a record separator is found.

Post-Processing:
#nextPage# allows you to tell OutWit Hub how to find the link to the next page to use in an automatic browse process. Use this when the Hub doesn’t find the next page link automatically, or when you wish to manually set a specific course for the exploration. NOTE: As with any scraper feature, the next page directive is only applied when the scraped view is active (which means that the view’s bottom panel has non-default settings and the view name is in bold in the side panel).

Example: A typical next page scraper line.

Description:

#nextPage#

Marker Before:

Next page

Format:

/[^"]+/

Replace:

#BASEURL#\0
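
Outside the scraper, the #BASEURL#\0 replacement essentially means “resolve the matched link against the current page’s base URL”. A minimal Python sketch, with made-up URLs:

from urllib.parse import urljoin

# Hypothetical values: the relative link matched by the scraper line (\0)
# and the current page's base URL (#BASEURL#).
base_url = "http://www.example.com/catalog/"
matched_href = "results?page=2"

next_page = urljoin(base_url, matched_href)
print(next_page)  # http://www.example.com/catalog/results?page=2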

#cleanData# and #originalData# override the ‘Clean Text’ checkbox in the scraped view. When original data is set, the data is left as is (including HTML tags and entities), when clean data is used, HTML tags are removed from the scraped data.
#nextPage#x# You can add a positive integer rating in the next page directive: if several nextPage directives are used, the first matching line of the highest rating will be chosen. Use #nextPage#0#, the lowest, for the default value. If #nextPage# is used without a rating parameter, it will be considered as the highest rated.

Example: You want to go to the link of the text “Next Page”, if found, or go back to the previous page otherwise:

line 1:

Description:

#nextPage#0#

Replace:

#BACK#

line 2:

Description:

#nextPage#3#

Marker Before:

Next page

Replace:

#BASEURL#\0

#normalizeToK#myFieldName# and #normalizeToUnits#myFieldName# normalizes numerical value in the field myFieldName: converts it to decimal units (m, m2, m3, g…) or k units (km, km2, kg…), removes thousand separators and uses the dot as a decimal separator.

Debug directives:

#showAlert# displays an alert with the data scraped by the directive line. If only the ‘Replace’ field is filled, the alert will be shown at the end of the scraping.
#showMatches# displays an alert with all the strings that match the scraper patterns.
#showNextPage# displays an alert with the value of the selected next page URL.
#showNextPageCandidates# displays an alert with the list of possible next page URLs found.
#showRecordDelimiter# displays an alert with the name of the field selected as the record delimiter for this scraper.
#showResults# displays an alert with the data grabbed by the scraper.
#showScraper# displays an alert with the content of the scraper as interpreted by the program.

#showScraperErrors# displays an alert if an error occurs. (Most of the time alerts are not welcome as they would block the execution of automatic tasks.)
#showServerErrors# creates a separate column in the result datasheet with error messages returned by the server.
#showSource# displays an alert with the source code to which the scraper is applied (after replacements made by the #replace# directive).
#showOriginalSource# displays an alert with the original source code that was sent to the scraper (before alterations).
#showVariables# displays an alert with the values of all variables.
#showVisited# displays an alert with the list of the URLs visited since the beginning of the browse process.
#simulate# instructs the program to process the scraper without actually applying it. The interpretation is performed and some directives will still work, allowing you to display information for debugging. This can be helpful if applying the scraper fails (in particular in case of freezes caused by overly complex or faulty regular expressions), in order to find the cause of the problem.

Time Variables (pro version)
The following variables can be used in the ‘Replace’ field to complement or replace the scraped content.
Use #YEAR#, #MONTH#, #DAY#, #HOURS#, #MINUTES#, #SECONDS#, #MILLISECONDS#, #DATE#, #TIME#, #DATETIME# in the ‘Replace’ field to insert the respective values in your replacement string.

Example:

You can add a collection time to the scraper using both a directive and a time variable:

Description:

#repeat#Collected On

Replace:

#DATETIME#

Navigation Variables (pro version)
The following variables can be used in the ‘Replace’ field to complement or replace the scraped content.
Use #URL# (current page URL), #BASEURL# (current page path), #DOMAIN# (current domain), #BACK# (previous page in history), #FORWARD# (next page in history) in the ‘Replace’ field to insert the respective values in your replacement string.

Example:

You just want the source domain in a column ‘Source’:

Description:

#repeat#Source

Replace:

Collected on #DOMAIN#

Redirections:
#REQUESTED-URL# gives the URL that was queried or clicked on.
#REDIRECTED-URL# returns the URL the browser eventually landed on after a redirection, if any, and returns nothing if there was no redirection.
#TARGET-URL# returns the URL the browser eventually landed on after a redirection, if any, and returns the requested (current) URL if there was no redirection.

Host Info:

#HOSTNAME# returns the most probable name of the organization hosting the current Web page.
#HOSTCOUNTRY# (Enterprise version) returns the most probable country of the current Web page.

#ORDINAL# returns the ordinal number of the page being scraped in an automatic exploration. (Note that this is different from the Ordinal ID column in datasheets. The number returned by #ORDINAL# is the first group of digits that constitute the Ordinal ID.)
#COOKIE# returns the content of the cookie(s) that have been set in your browser by the current Website if any.

Replacement functions (pro version)
The following functions can be used in the ‘Replace’ field to alter the scraped content.
These are executed when the scraper line (markers and/or format) match a string in the source code.
NOTE: these functions are still subject to evolution. At this point they can only be used alone in the Replace field, though they can now also be used in a variable declaration.

Put #AVERAGE#, #SUM#, #MAX#, #MIN#, #CONCAT#, #HAPAX#, #UNIQUE#, #STRICTLY-UNIQUE#, #DISTINCT#, #STRICTLY-DISTINCT#, #FIRST#, #LAST#, #SHORTEST# or #LONGEST# in the ‘Replace’ field to replace the scraped values by the corresponding total calculation. (Note that totals cannot serve as record separator. They will only work if not located on the first line of a scraper.)

#AVERAGE#: if scraped values are numerical, the result is replaced by the arithmetic mean of these values

#SUM#: if scraped values are numerical, the result is replaced by the sum of these values
#MIN#: if scraped values are numerical, the result is replaced by the minimum value, otherwise by the first in alphabetical order

#MAX#: if scraped values are numerical, the result is replaced by the maximum value, otherwise by the last in alphabetical order

#CONCAT#: all values are concatenated, using semicolons as separators
#COUNT#: the number of occurrences

#HAPAX#: if only one occurrence is found, it is returned, otherwise the field does not return anything
#UNIQUE#: if only one value is found (whatever the number of occurrences), the value is returned, otherwise the field does not return anything
#STRICTLY-UNIQUE#: (case sensitive) if only one value is found (whatever the number of occurrences), the value is returned, otherwise the field does not return anything
#DISTINCT#: all distinct values are concatenated, using semicolons as separators; duplicate values are ignored (even if in different cases)
#STRICTLY-DISTINCT#: (case sensitive) all distinct values are concatenated, using semicolons as separators; exact duplicates are ignored
#DISTINCT-COUNT#: creates two columns (fields). The first one with the COUNT, the second with the DISTINCT concatenation.
#STRICTLY-DISTINCT-COUNT#: creates two columns (fields). The first one with the COUNT, the second with the STRICTLY-DISTINCT concatenation.
#FIRST#: only the first occurrence is returned
#LAST#: only the last occurrence is returned
#SHORTEST#: only the shortest matching occurrence is returned
#LONGEST#: only the longest matching occurrence is returned

Operations: #(term1 operator term2)# works with the following operators:
+ addition of integers (1+3=4), concatenation of strings (out+wit=outwit), incrementing characters (c+3=f)
- subtraction of integers (5-2=3), decrementing characters (e-3=b)
* multiplication, / division, ^ power
<, >, =, ==, !=, … comparison operators: a=A (case-insensitive comparison), a==a (case-sensitive comparison), a!=b (not equal, case insensitive), a!==b (not equal, case sensitive)
The terms can be literals, variables or functions. When using equality operators on strings (=, !=, ==, !==), you can now use the wildcard % in the second term to replace any string (e.g. these three statements are true: headstart = Head% ; homeland == h%d ; lighthouse = %HOUSE).
Conditions: #if(condition,valueIfTrue,valueIfFalse)# or #if(condition;valueIfTrue;valueIfFalse)# for conditional replacements. The separator used between the parameters (comma or semicolon) must not be present in the parameters themselves.
Lookup lists: #lookUp(value,listOfValuesToFind,listOfReplacementValues)# or #lookUp(value;listOfValuesToFind;listOfReplacementValues)# for replacing lists of values. The parameters listOfValuesToFind and listOfReplacementValues must include the same number of items, separated by commas or semicolons. The elements of the first list will be respectively replaced by those of the second. The separator used between the parameters must not be present in the parameters themselves.
Replace function (not to be confused with the #replace# directive): #replace(originalString,stringToFind,replacementString)# or #replace(originalString;stringToFind;replacementString)# replaces the first occurrence of stringToFind with replacementString in originalString.
URL alteration functions: #getParam(URL,parameterName)# returns the value of a parameter in the passed URL and #setParam(URL,parameterName,parameterValue)#, to assign a new value to a parameter. When used in conjunction with #URL# in the #nextPage# directive line, this function allows you to easily set the value of the next page URL in many cases.
Alert: #alert(Your Message)# Displays an alert with the message passed as a parameter (and blocking the scraping process).

Example:

This scraper line will generate the next URL to explore, incrementing the parameter ‘page’ in the current URL.

Description:

#nextPage#

Replace:

#setParam(#URL#,page,#(#getParam(#URL#,page)#+1)#)#
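
A rough Python equivalent of that line, using the standard library’s URL helpers. The URL is made up, and the two helpers only mirror the get/set-parameter logic, not OutWit Hub’s own functions:

from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_param(url, name, value):
    """Return url with the query parameter `name` set to `value`."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query[name] = [str(value)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

def get_param(url, name, default="0"):
    """Return the value of query parameter `name`, or a default."""
    return parse_qs(urlparse(url).query).get(name, [default])[0]

current_url = "http://www.example.com/search?q=widgets&page=3"  # hypothetical
next_url = set_param(current_url, "page", int(get_param(current_url, "page")) + 1)
print(next_url)  # http://www.example.com/search?q=widgets&page=4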

Automatic Exploration and Hierarchical Scraping (pro version)
It is now possible for a scraper to set the URL of the next page to explore in a browse process (see #nextPage# directive above). Together with this feature comes a replacement function which allows advanced users to develop powerful scraping agents:

#nextToVisit(#myURL#)#, in the ‘Replace’ field, instructs the Hub to give the variable #myURL# the next value which is not found in the list of visited URLs. If you set #variable#myURL# in a scraper line, and if this line matches say 10 strings within the source code of the page, this variable will contain an array of 10 values. The #nextToVisit# directive will give #myURL# the value of the first URL which hasn’t been explored in the current Browse process. This means that, used in conjunction with #nextPage# and #BACK# you can create complex scraping workflows. You can, in particular, create multi-level scraping processes.

#addToQueue# and #nextToVisit()#: This follows exactly the same principle, but without declaring a variable. It is simpler to use but offers a little less control, as it only allows you to have a single stack of URLs to explore. Unlike variables, the queue can be accessed by any scraper during an exploration: you can put URLs in the queue with one scraper and refer to it with another.

Example 1: Two-level scraping using #addToQueue# and #nextToVisit()#

Say you have a page named ‘Widget List’ with a list of URLs leading to the ‘Widget Detail’ pages where the interesting information is. You just need to create two scrapers:

Scraper #1:

Apply if URL contains:

widget-list

line 1:

Description:

#addToQueue#

Marker Before:

See Widget Description

Replace:

#BASEURL#\0

Line 2:

Description:

#nextPage#

Replace:

#nextToVisit()#

Scraper #2:

Apply if URL contains:

widget-detail

line 1:

Description:

#nextPage#

Replace:

#BACK#

line 2…:

… scrape the data here.

Example 2: Two-level scraping using a variable #nextToVisit(#extractedURLs#)#

Same scenario, but this time, using a variable (for instance because you wish to keep two different kinds of URLs in separate piles):

Scraper #1:

Apply if URL contains:

widget-list

line 1:

Description:

#variable#extractedURLs#

Marker Before:

See Widget Description

Replace:

#BASEURL#\0

Line 2:

Description:

#nextPage#

Replace:

#nextToVisit(#extractedURLs#)#

Scraper #2:

Apply if URL contains:

widget-detail

line 1:

Description:

#nextPage#

Replace:

#BACK#

line 2…:

… scrape the data here.

Note: This may look confusing, but it’s not all that bad once you have grasped the principle.
The idea is that you often have a list L1 that links to another list L2 (n times), which in turn links to the pages P where you want to scrape your data.

Think of it from the end:

You have to make a page scraper (#2 in the example above) for the data in P, with #nextPage# set to #BACK#. (It’s the “leaf” at the end of the branch, so the program will backtrack once the page is scraped.)
You also have to make one or several list scrapers where you extract the links from L1, L2… into a variable like #extractedURLs#.
In the list scraper, you also need to set #nextPage#1# (higher priority) to #nextToVisit(#extractedURLs#)# to explore all the pages one after the other,
and finally (still in the list scraper) set #nextPage#0# (default value) to #BACK#, to backtrack to the higher level once all #extractedURLs# of the level have been visited.

One of the tricky things is to make sure that each scraper will apply to the right kind of page using the “URL contains” field. This may require a regular expression.
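
If it helps, here is a tool-agnostic Python sketch of the overall flow. The URLs and the extract_links/scrape_detail helpers are hypothetical placeholders: the queue plays the role of #addToQueue#/#nextToVisit()#, and the loop’s bookkeeping stands in for the #BACK# backtracking.

from collections import deque

queue = deque(["http://www.example.com/widget-list"])  # hypothetical start page
visited = set()
results = []

def extract_links(url):
    # Placeholder: would return the detail-page URLs found on a list page.
    return [f"{url.replace('widget-list', 'widget-detail')}?id={i}" for i in range(3)]

def scrape_detail(url):
    # Placeholder: would return the record scraped from a detail page.
    return {"url": url}

while queue:
    url = queue.popleft()
    if url in visited:
        continue               # like #nextToVisit()#, skip already-visited URLs
    visited.add(url)
    if "widget-list" in url:   # "Apply if URL contains: widget-list"
        queue.extend(extract_links(url))
    elif "widget-detail" in url:
        results.append(scrape_detail(url))

print(len(results), "records scraped")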

Applying a Scraper to a Page (or Series of Pages)
If you simply want to apply the best matching scraper to the current URL (the page loaded in the page view), just go to the scraped view. If you want to apply it to a series of pages or to the content of a site, set the scraped view’s bottom panel as you want (uncheck ‘Empty’ to keep the results in the scraped view, OR check ‘Catch selection’ to move them to the catch) and use the Browse or Dig commands to explore the pages you want.
If you need to apply a scraper to a whole list of URLs, another way is to select the rows containing the links you want to scrape (in any view: usually ‘the Catch’, ‘links’, ‘lists’ or ‘guess’), then right-click (ctrl-click on Macintosh) on one of the URLs to scrape (they should all be in the same column) and, in the contextual menu, select ‘Auto-Explore’ > ‘Fast Scrape’. For each of the selected URLs the resulting data will be added in the datasheet of the Scraped view. (A throbber beside the view name shows that the process is running.)
Note that the two methods above are different: applying a scraper by going to the scraped view does the extraction from the source code of the page loaded in the Hub’s browser, whereas using the ‘Fast Scrape’ function on Selected URLs, the program runs an XML HTTP Request for each URL, but doesn’t really load the pages (ignoring images etc.). Most of the time, the result is the same, but the ‘Fast Scraping Mode’ is simply… faster. In some cases, however, the pace can be too high for the server. In other cases, the results can be different or the fast scraping mode can even completely fail: the reason is that in the normal mode, events can happen that dynamically alter a page (mostly due to the execution of javascript scripts). These dynamic changes will not occur in the fast scraping mode, as scripts are not executed. This means that dynamically added information, javascript redirections, page reloads… will simply not happen in fast scraping mode. If you notice this kind of behavior, the best way is to accept the slower method and browse through the URLs, doing the scraping page after page.

Temporization
You can set the exploration speed in the Time Settings tab of the Preferences panel (Tools menu). By default, the temporization between pages is set to 4 seconds. You can lower it as much as you want, but do make sure that you are respecting the sites’ terms of use and that you are not overusing the servers.

Use of Regular Expressions
Regular Expressions are a powerful syntax, used to search specific patterns in text content. They can be used in several places of OutWit Hub:

In the bottom panel of each widget (images, links, contacts…) the Select If Contains text box allows you to select items of the list above it that contain the typed string. By starting and ending the string with the character / you can use Regular Expressions in these text boxes.
In the Scraper Editor located in the ‘Scrapers’ widget, Marker Before and Marker After can be either a literal string or a Regular Expression. Format is always interpreted as a Regular Expression.
Lastly, the ‘URL to Scrape’ attributed to a scraper can also be a regular expression. In this case, the scraper can be applied to any URL matching the pattern.

To use regular expressions, write your string between slashes: /myRegExp/. The pattern will be displayed in green when the syntax is correct, in red otherwise.
IMPORTANT NOTES:

The ‘Format’ field of the Scraper Editor is always interpreted as a regular expression, even if not marked with slashes.

For technical reasons, all regular expressions used in scrapers are interpreted as case insensitive patterns by default. [A-Z], [A-Za-z] and [a-z] have the same result. This can be changed using the #caseSensitive# directive.
Here is what you should know if you are using regular expressions:

Ultra Quick Start Guide
Quick Start Guide
More

Regular Expressions Ultra Quick Start
Regular expression patterns are strings to match in a text, surrounded with / (slashes) and including a series of reserved characters used as wildcards (i.e. representing ranges of characters or remarkable features).

The three most useful patterns:

use the pattern \s* to match a succession of zero or more space characters, tabs, returns, etc.
use the pattern [^<]+ to match a succession of one or more characters until the next <
use the pattern [a-z]+ to match a succession of one or more letters

The two mistakes you are most likely to make, once you have learned more about RegExps:

The character . (dot) doesn’t mean ‘any character’, but ‘any character except return characters’ (returns, line feeds, form feeds etc.), so do not use .* to say ‘anything’. Instead, you should use [\s\S]*, for instance, which means any succession of space or non-space characters (i.e. anything, including line breaks).
Among the characters that need to be escaped in a RegExp pattern is the very common / (slash). If you forget to escape it, the regular expression will not work. You need to escape it like this: \/ (backslash followed by slash).

Example:

The pattern /<span[^>]+>\s*Phone\s*:/ will match any <span> tag followed by ‘Phone’, followed by the colon character (:), whatever the number of spaces, tabs or returns between these elements.
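
A quick Python check of that pattern against a made-up snippet (re.IGNORECASE mirrors the default case-insensitive behavior of scrapers):

import re

html = '<span class="label">  Phone : </span> 555-0100'  # hypothetical snippet
pattern = re.compile(r"<span[^>]+>\s*Phone\s*:", re.IGNORECASE)
print(bool(pattern.search(html)))  # True, regardless of the spacing around "Phone"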

To learn some more about Regular Expressions, you can go to our RegExp Quick Start Guide

Regular Expressions Quick Start
Marking a Regular Expression: /myRegExp/
Most of the time, simple strings will be enough as markers or selection criteria. Such a literal string must be typed and will be searched as is in the data. Therefore, if you want to use a regular expression instead, you must mark it, so that the program can identify it as such. This is done by adding a / before and after the reg exp pattern.

Escaping Special Characters
Characters that are used in the regular expressions syntax, like .$*+-^\(){}[]/ should be ‘escaped’ when used literally in a regular expression (i.e. when used as the character itself, not as part of the reg. exp. syntax). Escaping means placing a backslash character \ before that special character to have it be treated literally. To search for a backslash character, for instance, double it \\ so that its first occurrence will escape the second.

Most common “special” characters in regular expressions

Wildcard
. (dot): any character except a line break (or carriage return)

Character Classes (Ranges of Characters)
In a character class, a caret character ^ excludes all the characters specified by the class if it is placed immediately after the opening bracket: [^...].
[abc] list: any of the characters a, b or c
[^abc] exclusion list: any character except a, b, c
[a-z] range: any character from a to z
[^aeiou] any character which is not a vowel
[a-zA-Z0-9] any character from a-z, A-Z, or 0-9
[^0-9aeiou] any character that is neither a digit nor a vowel

Escaped matching characters

\r line break (carriage return)

\n Unix line break (line feed)

\t tab

\f page break (form feed)

\\ backslash

\s any space character (space, tab, return, line feed, form feed)

\S any non-space character (any character not matched by \s)

\w any word character (a-z, A-Z, 0-9, _, and certain 8-bit characters)

\W any non-word character (all characters not included by \w, incl. returns)

\d any digit (0-9)

\D any non-digit character (including carriage return)

\b any word boundary (position between a \w character and a \W character)

\B any position that is not a word boundary

Alternation
| (pipe): Separates two expressions and matches either

Position
^: (when not in a character class) beginning of string

$: end of string

Quantifiers
x*: zero or more x

x+: one or more x

x?: zero or one x

x{COUNT}: exactly COUNT x, where COUNT is an integer

x{MIN,}: at least MIN x, where MIN is an integer

x{MIN, MAX}: at least MIN x, but no more than MAX

Note:

+ and * are ‘greedy’: they match the longest string possible. If you do not want this “longest match” behavior, you can use non-greedy quantifiers, by adding a ?.

*?: zero or more (non-greedy)

+?: one or more (non-greedy)

??: zero or one (non-greedy)
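
A short Python comparison of the two behaviors on a made-up snippet:

import re

html = "<b>first</b> and <b>second</b>"  # hypothetical snippet

greedy = re.search(r"<b>.*</b>", html).group(0)
lazy = re.search(r"<b>.*?</b>", html).group(0)

print(greedy)  # <b>first</b> and <b>second</b>  (longest possible match)
print(lazy)    # <b>first</b>                    (stops at the first closing tag)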


Scrapers – Chrome Extensions

Web Scraper

https://github.com/martinsbalodis/web-scraper-chrome-extension

Web Scraper is a Chrome browser extension built for data extraction from web pages. Using this extension you can create a plan (sitemap) describing how a web site should be traversed and what should be extracted. Using these sitemaps, Web Scraper will navigate the site accordingly and extract all the data. The scraped data can later be exported as CSV.

Features

  1. Scrape multiple pages
  2. Sitemaps and scraped data are stored in the browser’s local storage or in CouchDB
  3. Multiple data selection types
  4. Browse scraped data
  5. Export scraped data as CSV
  6. Import, Export sitemaps
  7. Depends only on Chrome browser

How to use it

Imagine an online store selling items you are interested in. The items are grouped by category, and only 10 items are visible per page; the rest of the items are accessible via pagination.

To scrape this kind of page you need to create a Sitemap which starts with the landing page. After that you can continue with selector tree creation. Start by creating Url selectors for the navigation links and pagination links. Then create an Element selector for a list item, and after that create Text selectors for the item descriptors. The resulting Sitemap should look like the one in the image below. When your Sitemap is done you can start scraping it.

Selector tree

Selectors

There are different types of selectors for different types of data. Use this table to find the one suited to your needs. If there is no selector that fits your needs, you can try to create one; the scraper is built in a way that makes it very easy to implement new selectors.

Selector | Returned records | Returned data | Can lead to new Jobs | Can have child selectors
Text | 1 or * | text | N | N
Element | 1 or * | None | N | Y
Group | 1 | JSON | N | N
Link | 1 or * | text, url | Y | Y
Image | 1 or * | image src | N | N
HTML | 1 or * | html | N | N
Element Attribute | 1 or * | text | N | N

Text

Used for text selection. All HTML will be stripped and only text will be returned. You can additionally apply a regex to the resulting data. The regex is applied before data export, so you can change it after the data is scraped. If a link element is selected, its href attribute will also be returned, but the scraper will not follow the link.

Element

This selector will not return any data. Use this selector to select multiple elements and add child selectors within it.

Group

Use the Group selector to select multiple items. The resulting items’ data will be serialized as JSON and stored within one record.

Link

Use this selector to select links. The scraper will follow links and select data from each child page.

Image

Use this selector to retrieve the image src attribute. The image itself will not be stored, as it cannot be exported as CSV.

HTML

This selector will return html and text within the selected element.

Element Attribute

This selector can extract an attribute of an HTML element. For example you might want to extract the title attribute from this link: <a href="#" title="my title">link</a>.

Issues

Submit issues in the issue tracker. Please attach an exported sitemap if possible.

License

LGPLv3

New Ways to Track Keyword Rank

From a post by AJ Kohn, Blind Five Year Old – 20130113

http://www.blindfiveyearold.com/new-ways-to-track-keyword-rank

Tracking keyword rank is as old as the SEO industry itself. But how you do (and use) it is changing. Are you keeping up?

This post covers how I create and use rank indexes and introduces a new and improved way to track rank in Google Analytics.

Rankageddon

In December of 2012 both Raven and Ahrefs made the decision to shut down their rank tracking features because they violated Google’s Terms of Service. The reaction from the SEO industry was predictable.

The debate about why Google began to enforce the TOS (I think it has to do with the FTC investigation) and the moaning about how unfair it is doesn’t interest me. Both SEOmoz and Authority Labs still offer this service and the way many use rank needs to change anyway.

Every obstacle is an opportunity. Trite but true.

Is Rank Important?

To be honest, I don’t use rank that much in my work. This has to do with a combination of the clients I choose to work with and my philosophy that increasing productive traffic is the true goal.

Yet, you’d have to be soft in the head not to understand that securing a higher rank does produce more traffic. Being on the first page matters. Getting in the top three results can produce significant traffic. Securing the first position is often a huge boon to a business. Duh!

But rank is the extrinsic measurement of your activities. It’s a Google grade. Rank isn’t the goal but the result.

Unfortunately, too many get obsessed with rank for a specific keyword and spend way too much time trying to move it just one position up by any means necessary. They want to figure out what the teacher is going to ask instead of just knowing the material cold.

Rank Indexes

So how do I use rank? I create rank indexes.

A rank index is the aggregate rank of a basket of keywords representing a query class that has an impact on your bottom line. For an eCommerce client you might have a rank index for products and one for categories. I often create a rank index for each modifier class I identify for a client.

Usually a rank index will contain between 100 and 200 keywords that represent that query class. The goal is to ensure that those keywords reflect the general movement of that class and that changes in rank overall will translate into productive traffic. There’s no sense in measuring something that doesn’t move your business.

If that rank index moves down (lower is better) then you know your efforts are making a difference.

Executives Love Indexes

A rank index is also a great way to report to C Level executives. These folks understand index funds from an investment perspective. They get this approach and you can steer them away from peppering you with ‘I did this search today and we’re number 4 and I want to be number 1’ emails.

It becomes not about any one term but the aggregate rank of that index. That’s a better conversation to have in my opinion. A rank index keeps the conversation on how to move the business forward instead of moving a specific keyword up.

Getting Rank Index Data

If you’re using SEOmoz you export the entire keyword ranking history to CSV.

SEOmoz Export Full Keyword History to CSV

After a bit of easy clean up you should have something that looks like this in Excel.

SEOmoz Keyword History Raw Data

At this point I simply copy and paste this data into my prior framework. I’ve already configured the data ranges in that framework to be inclusive (i.e. – 50,000 rows) so I know that I can just refresh my pivot table and everything else will automagically update.

If you’re using Authority Labs you’ll want to export a specific date and simply perform the export each week.

Authority Labs Keyword Ranking Export

There’s a bit more clean up for Authority Labs data but in no time you get a clean four column list.

Authority Labs Keyword Data

Unlike the SEOmoz data where you replace the entire data in your framework, you simply append this to the bottom of your data. Once again, you know the pivot table will update because the data range has been configured to be quite large.

Creating The Rank Index Pivot Table

You can review my blow by blow of how to create a pivot table (though I’m not using a new version of Excel so it all looks different anyway.) It’s actually a lot easier now than it was previously which is something of a miracle for Microsoft in my view.

Keyword Rank Index Pivot Table

You’ll use the keyword as your row label, date as the column label and the Average of rank as the values. It’s important to use a label so you can create different indexes for different query classes. Even if you only have one index, use a label so you can use it as a filter and get rid of the pesky blank column created by the empty cells in your data range.

You may notice that there are a lot of 100s and that is by design.

Keyword Rank Index Pivot Table Options

All those non-ranked terms need to be counted somehow right? I chose to use 100 because it was easy and because Authority Labs reports up to (and sometimes beyond) that number.

Turning Rank Data Into A Rank Index

Now that you have all the rank data it’s time to create the rank index and associated metrics.

Keyword Rank Index Calculated Data

Below the pivot table it’s easy to use a simple AVERAGE function as well as various COUNTIF functions to create these data points. Then you can create pretty dashboard reports.
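
For reference, the same calculations expressed as a small Python sketch. The list of rank values is made up and simply stands in for the column of weekly ranks under your pivot table; the buckets mirror the AVERAGE and COUNTIF thresholds used in the spreadsheet:

# Hypothetical weekly rank values for the keywords in one index.
ranks = [1, 3, 7, 12, 48, 100, 100, 25, 2, 64]

average_rank = sum(ranks) / len(ranks)
ranking_only = [r for r in ranks if r < 100]            # drop the non-ranking placeholder
average_of_ranking_terms = sum(ranking_only) / len(ranking_only)

outside_top_50 = sum(1 for r in ranks if r > 50)
top_10 = sum(1 for r in ranks if r < 11)
top_3 = sum(1 for r in ranks if r < 4)
top_1 = sum(1 for r in ranks if r == 1)

print(average_rank, average_of_ranking_terms, outside_top_50, top_10, top_3, top_1)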

Keyword Rank Index Reports

Average Rank is the one I usually focus on but the others are sometimes useful as well and certainly help clients better understand the situation. A small caveat about the Average Rank. Because you’re tracking non-ranking terms and assigning them a high rank (100) the average rank looks a bit goofy and the movement within that graph can sometimes be quite small. Because of this you may wind up using the Average of Ranking Terms as your presentation graph.

Average of Ranking Terms Graph

I don’t care much about any individual term as long as the index itself is going in the right direction.

Projecting Traffic

I can always look at the details if I want and I’ve also created a separate tab which includes the expected traffic based on the query volume and rank for each term.

Rank Index Traffic Projections

This simply requires you to capture the keyword volume (via Google Adwords), use a click distribution table of your choosing and then do a VLOOKUP.

IFERROR(([Google Adwords Keyword Volume])*(VLOOKUP([Weekly Rank],[SERP Click Distribution Table],2,0)),0)

You’ll need to divide by 4 to get the weekly volume but at that point you can match that up to real traffic in Google Analytics by creating a regex based advanced segment using the keywords in that index.

Of course, you have to adjust for (not provided) and the iOS attribution issue so this is very far from perfect. And that’s what got me really thinking about whether rank and rank indexes could be relied on as a stable indicator.
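
For what it’s worth, here is a rough Python sketch of that projection. The click distribution numbers are purely illustrative (the choice of CTR table is left up to you), and the divide-by-4 gives the weekly figure mentioned above:

# Hypothetical click-through rates by ranking position.
ctr_by_position = {1: 0.33, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}

def projected_weekly_visits(monthly_volume, rank):
    ctr = ctr_by_position.get(rank, 0.0)   # positions off the table get no projected clicks
    return monthly_volume * ctr / 4        # monthly volume -> weekly estimate

print(projected_weekly_visits(12000, 2))   # 450.0
print(projected_weekly_visits(12000, 40))  # 0.0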

What is Rank?

The rise in (not provided) and the discrepancies often seen between reported rank volume and the traffic that shows up point to the increase in personalization. SERPs are no longer as uniform as they once were and personalization is only going to increase over time.

So you might have a ‘neutral’ rank of 2 but your ‘real’ rank (including context and personalization) might be more like a 4 or 5.

That’s why Google Analytics rank tracking seems so attractive, because you can get real world ranking data based on user visits. But that method is limited and makes reporting a huge pain in the ass. The data is there but you can’t easily turn it into information … until now.

Improved Google Analytics Rank Tracking

I got to talking to Justin Cutroni (a really nice and smart guy) about the difficulties around tracking rank in Google Analytics. I showed him how I use rank indexes to better manage SEO efforts and over the course of a conversation (and a number of QA iterations) he figured out a way to deliver keyword rank the way I wanted in Google Analytics.

Keyword Rank Tracking In Google Analytics with Events

Using Events and the value attached to it, we’ve been able to create real keyword rank tracking in Google Analytics.

The Avg. Value is calculated by dividing the Event Value by Total Events. You could change this calculation once you do the export to be Event Value by Unique Events if you’re concerned about those users who might refresh the landing page and trigger another Event. I haven’t deployed this on a large site yet to know whether this is a real concern or not. Even if it is, you can always change it in the export.

Keyword Rank Tracking Data via Analytics Events

So you can just make Avg. Value a calculated field and then continue to tweak the exported data so that it’s in a pivot table friendly format. That means adding a date column, retaining the Event Action column but renaming it keyword, adding a Tag column, and retaining the Avg. Value column.

You essentially want it to mimic the four column exports from other providers. I suppose you could keep a bunch of this stuff in there and not use it in the pivot table too. I just like it to be clean.

Event Based Rank Tracking Code

Start tracking rank this way on any Google Analytics enabled site by dropping the following code into your header.

Google Analytics Rank Tracking Code

To make it easier, the code can be found and copied at jsFiddle. Get it now!

Just like the old method of tracking rank in Google Analytics, this method relies on finding the cd parameter (which is the actual rank of that clicked result) in the referring URL. This time we’re using Event Tracking to record rank and putting it in a field which treats it as a value.

The code has also been written in a way to ensure it does not impact your bounce rate. So there’s no downside to implementation. You will find the data under the Content > Events section of Google Analytics.

Where To Find Average Rank in Google Analytics

Just click on Content, Top Events and then RankTracker and you’ll find keyword ranking data ready for your review.

Google Analytics Rank Indexes

I’ve been working at applying my index approach using this new Event based Google Analytics rank tracking data. The first thing you’ll need to do is create an advanced segment for each index. You do this by creating a regex of the keywords in that index.

Rank Index Regex Advanced Segment

Sometimes you might not get a click on a term that is ranked 20th and certainly not those that are ranked 50th. That’s a constraint of this method but you can still populate an entire list of keywords in that index by doing a simple VLOOKUP.

IFERROR(VLOOKUP(A1,'Export Event Data'!$A$1:$E$5000,5,FALSE),100)

The idea is to find the keyword in your export data and report the rank for that keyword. If the keyword isn’t found, return a value of 100 (or any value you choose). From there it’s just about configuring the data so you can create the pivot table and downstream reports.
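
The same lookup-with-a-default, sketched in Python with made-up keywords:

# Ranks come from the Event export where a click was recorded; anything
# missing falls back to 100 (or whatever value you choose).
export_ranks = {"blue widgets": 4, "cheap widgets": 17}   # keyword -> Avg. Value
index_keywords = ["blue widgets", "cheap widgets", "widget reviews"]

index_ranks = {kw: export_ranks.get(kw, 100) for kw in index_keywords}
print(index_ranks)  # {'blue widgets': 4, 'cheap widgets': 17, 'widget reviews': 100}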

Caveats

This new way of tracking is different and has some limitations. So let’s deal with those head-on instead of creating a grumble-fest.

The coverage isn’t as high as I’d like because of (not provided) and the fact that the cd parameter is still only delivered in about half of the referrers from Google. I’m trying to find out why this is the case and hope that Google decides to deliver the cd parameter in all referrers.

Full coverage would certainly increase the adoption of rank tracking in Google Analytics and reduce those seeking third party scraped solutions, something Google really doesn’t like. It’s in their self-interest to increase the cd parameter coverage.

As an aside, you can get some insight into the rank of (not provided) terms and match those to landing pages, which could be pretty useful.

Rank of Not Provided Terms by Landing Page

The other limitation is that you only get the rank for those queries that received clicks. So if you’re building a rank index of terms you want to rank for but aren’t, and tracking it over time, it becomes slightly less useful. Though as I’ve shown above you can track the average of ranking terms and of the index as a whole at the same time.

One of the better techniques is to find terms that rank at 11 to 13 and push them up to the front page, usually with some simple on-page optimization. (Yes, seriously, it’s way more effective than you read about.) So this type of tracking might miss a few of these since few people get to page 2 of results. Then again, if you see a rank of 11 for a term with this tracking that’s an even higher signal that getting that content to the front page could be valuable.

Finally, the data configuration is, admittedly, a bit more difficult so you’re working a tad harder to get this data. But on the other hand you’re seeing ranking data from real users. This could get really interesting as you apply geographic based advanced segments. Larger organizations with multiple locations might be able to determine which geographies they rank well in versus those where they’re struggling.

And not Or

At this point I can’t say that I’d scrap traditional rank tracking techniques altogether, though I’m sure Google would like me to say as much. Instead, I think you should use the new Google Analytics Event Based Rank Tracking in conjunction with other ranking tools.

First off, it’s free. So there’s no reason not to start using it. Second, you get to see real world rank, which while limited in scope can be used to compare against neutral rank offerings. Lastly, if you’re trying to future proof your efforts you need to be prepared for the potential end to traditional ranking tools, or for variation in personalization so high that it makes them unreliable.

Did I mention this new rank tracking method is free?

I’m looking forward to putting this into practice and comparing one tracking method to the other. Then we’ll see the potential variance between personalized ranking versus anonymized ranking.

TL;DR

The recent closure of third-party rank tracking services is an opportunity to think about rank in a different way. Using a rank index can help keep you focused on moving the business forward instead of a specific keyword. To future proof your efforts you should implement improved Google Analytics rank tracking for free.



COMMENTS:

  1. KANE JAMISON The first rank tracking company that can automate those four rank index graphs on a project level will certainly get some business. One click export of that dashboard to PDF… would be great.
  2. MIKE WILTON Interesting read AJ. Unfortunately even with this it seems you are still going to be heavily reliant on keyword ranking data from a toolset from moz or Authority Labs. In looking at your analytics ranking data, what’s to stop you from just using the average position data that Google provides when you link up analytics and webmaster tools? Just curious what additional data/benefits you are getting out of your analytics method in comparison to the position average provided by Google (aside from the fact Webmaster tools data is only the last 30 days).
  3. AJ KOHN  Thanks Kane and yes, anyone who could reduce the friction of this process would likely get a bunch of business.  I’m also exploring whether the Google Analytics Event based rank tracking would be available via the Analytics API. (I’m pretty sure it is.) So you could also go that route too. Or Google could finally decide to just integrate this into Google Analytics.
  4. SYLVAIN Thanks for the tips, it could be useful !
  5. AJ KOHN Mike, Good point about linking Webmaster Tools data to Analytics! There are a few reasons I like using this new Events based tracking. The first is I have greater granularity of keywords (outside of the top 1000 for a site) and I can apply advanced segments against the data so I can create easier rank indexes. I also like the fact that I have another set of data I can compare against. It’ll be interesting to see how these numbers compare to the Webmaster Tools Data.
  6. AJ KOHN Thanks Sylvain, I hope you get some use out of it.
  7. TED IVES Really good work there AJ, and great timing – I had been thinking about starting to use that GWT data but am going to check this out as well. Too bad the longer-term direction appears to be that (not provided) will consume all the data like a massive black hole anyway, eventually!
  8. AJ KOHN Thanks Ted. The GWT data is good but this is a) pretty easy to implement and b) gives you a lot more reporting and filtering capabilities. It’s true though that (not provided) will consume more of the keyword picture as users search logged in. Of course, that’s a double-whammy of sorts because that’s an indication that those SERPs are personalized, making those traditional rank tracking methods less and less reliable over time.
  9. MATT MORGAN Pretty amazing process and ingenious AJ. I’ve been trying to find a better way to report the success of our SEO campaigns for our clients, so this is timely for me as well. Two questions: 1) Is this practical for my local SEO clients or overkill? 2) I’d be interested in seeing what you send to your clients after you analyze everything.
  10. SAHIL Well done AJ, great article. Most of the clients are really obsessed by keyword rankings. They really never let companies focus on rank index which can help them move their business forward. Keyword rank tracking must be done internally to check how their search traffic is improving for particular high competitive keywords just to gauge where they stand. There are some amazing tools which help people find accurate rankings for the keywords – seomoz rank tracker, sescout, rank watch, etc. Business have to also mainly focus on usability, conversions and ORM which are very important to maintain the trust that an user has on them.
  11. STEFAN Great article, thanks for sharing! We use SEOlytics for international Rankings, their API is also already integrated in the SeoTools for Excel add-in. Makes it very easy to transfer data, and create a kind of self-updating dashboard in Excel as well.
  12. CURTIS WORTHINGTON Awesome as usual AJ. Very in depth piece. Never really thought to do an index rank before but looks very beneficial.
  13. ROSENBAUM Great post AJ. I don’t get why using the code is better than the old way. When I export the data to a Pivot Table I can get the average anyway.
  14. RAMI Thank you for this article. However, I’m unable to see the code that I should add when I click on the link. Anyone else is having that problem? Thanks
  15. BILL BEAN Rank index approach = forehead slapping moment. That’s a phrase I’ve been looking for. Worth the price of admission. But wait, there’s more… I’m stealing this quote (or at least the concept) for client conversations: “A rank index keeps the conversation on how to move the business forward instead of moving a specific keyword up. ”
  16. AJ KOHN Matt, I think doing this for a local client is likely overkill. You probably don’t have enough keywords (upwards of 100) to create a good index. As for what I send to clients, it’s usually just a self-updated dashboard report. I provide commentary or insight when I see movement or may do a deep dive into the details if the index lines move materially.
  17. AJ KOHN Saheil, Thank you. There is an obsession with keyword rankings. That was reinforced when Raven and Ahrefs decided to close their rank tracking features. Clearly rank matters but for larger organizations the focus of keywords misses the point. The index keeps the discussion on moving the business forward through many of the things you mention rather than specific tactics to move a keyword up a few positions.
  18. AJ KOHN Stefan, Excellent! And you’re right there are a number of ways you can automate the transfer of data through an API, or even writing some Python scripts. You can often find a way to get to that self-updating dashboard if you spend some time (serious time) configuring the process. It’s well worth it though.
  19. AJ KOHN Thanks Curtis. I’m eager to see if this works for other people who are working with larger sites.
  20. AJ KOHN Thanks Rosenbaum. The new way allows for easier analysis and does most of the heavy lifting for you.
  21. AJ KOHN I can click and see it now Rami. Is anyone else having this problem?
  22. AJ KOHN Thanks Bill. It’s gratifying that you picked out that sentence. (Of course, I did highlight it.) But I really do think that’s the real value of a rank index. The conversation changes and you can start working to build a business instead of moving keywords.
  23. SUNITA BIDDU This is one of the most valuable and “new” reads of the day. I was always inclined to measuring productive traffic and not ranks. This not only just pushed me with an easy but also gonna help my customers a better level of measuring the efforts. Thanks AJ :) Glad I found this link on Google+
  24. AJ KOHN Glad I could help Sunita and let me know how the techniques work for you.
  25. SLAVA Can’t you create a script that will be parsing top 100 google results for your keywords and return you your current position?
  26. STEVE ROSS Thank you! Will be applying these tips to rank index. Hope this works for us!
  27. AJ KOHN You can Slava and that’s how most rank tracking is performed. It just happens to violate Google’s Terms of Service.
  28. AJ KOHN Good luck Steve and let me know how it works for you.
  29. DANA TAN Thanks for a perfectly-timed post AJ! This ties in very neatly with a dilemma I just posted at SEOMoz and Dr. Pete Meyers made exactly the same suggestion, i.e. to track the performance of a group of keywords rather than fixating on vanity keywords. Just in case you’re interested the thread is here: http://www.seomoz.org/q/can-seo-increase-a-page-s-authority-or-can-authority-only-be-earned-via-rcs My post [plea for help???] was all prompted by one of “those” emails from the CEO of the company where I am an in-house SEO. You know…”I just Googled [this keyword] and Competitor X is beating us. I give you 6 months to beat them.” This post is exactly what I needed. Fortunately, he’s a pretty smart guy and isn’t averse to listening to reason, so as long as I can make my case with good data, I think it’ll be better for the company (lol, and my job security!) in the long run. Thanks for the awesome work. I agree with Kane Jamison. The first company to automate this process would get my business in a heartbeat.
  30. ANGIE SCHOTTMULLER (@ASCHOTTMULLER) Stellar post, AJ! Lots of effort and thought obviously went into this process. Very well done!
  31. DYLAN Nice post. I might implement some of these techniques into http://www.serpscan.com (a rank tracker).
  32. ANDRE BUXEY WOW! This is a really well thought out post and an interesting way to look at “rank”; will pass the code on to my dev guy to have a look at!
  33. DAVIDE DI PROSSIMO Very very interesting AJ. You know, this obsession with ranking makes me feel a little tired and bored. As you said, it is better to focus on productive traffic than just ranking. But again, yes, it is beneficial being on the first page after a search. I mean… I do not even remember if I ever in my life turned to page number 2 after a Google search. I gotta go now! I need to try implementing all this knowledge you passed through. Thanks again
  34. DANA TAN Hi AJ, I am a big fan of your blog and have heard you speak at MozCon and most recently as a participant in a Google Hangout #maximpact hosted by @MaxMinzer. I am working my way through this post. It’s brilliant, but I am stuck. I’ve got a beautiful pivot table [thanks for that - just that was awesome]. I made it to this phrase: “Below the pivot table it’s easy to use a simple AVERAGE function as well as various COUNTIF functions to create these data points. Then you can create pretty dashboard reports.” I think I understand how to get the Average Rank dashboard, but I am not understanding how to create the other dashboard reports, i.e. “Terms Outside of the Top 50.” Can you provide an illustration? Thanks!
  35. AJ KOHN Thanks for the kind words Dana and I’m glad you’ve gotten most of the way through the set-up. To get those other reports you’re going to use COUNTIF functions. So for Terms Outside of the Top 50 you’d use: =COUNTIF(F5:F154,">50") where F5:F154 is the range of values in your pivot table. The others look similar. Top 10: =COUNTIF(F5:F154,"<11") Top 3: =COUNTIF(F5:F154,"<4") Top 1: =COUNTIF(F5:F154,"1") Then it’s just graphing it all, or is that where you’re stuck? [A Python sketch of the same rank buckets appears after this comment thread.]
  36. DANA TAN Thanks AJ, This is perfect. I really appreciate it. The graphic I’ve got down (thanks to Annie Cushing!). This is totally amazing stuff. Love it!
  37. DANA TAN Okay, I’ve got it all set up and it’s looking beautiful! One last question AJ, you make reference to using the “Average of Ranking Terms” dashboard instead of the “Average Rank” dashboard and I would like to compare those two, but can’t figure out the easiest way to omit the non-ranking keywords from my pivot table. Can you explain how/where to pull the data for the “Average of Ranking Terms” dashboard chart? Thanks again, Dana
  38. GIORGIO Thanks!! I implemented the code on both of my blogs! Hope it will work fine :)
  39. ERIC TSAI AJ, awesome piece, and I totally agree with you on moving the conversation in the right direction with a keyword index. I’m wondering if this can help further validate the true CTR of organic keywords with some modeling, as we know GWT’s impressions/CTRs are simply not accurate and/or reliable.
  40. MIKE Really awesome stuff AJ! Was curious though: if you’re using the “cd” parameter, that means you can get a value of, let’s say, 15 and actually be on the first page, due to Google’s Universal Search, right? Also, about what Mike Wilton said about the Webmaster Tools data in Analytics: you can use that, yes, but you can’t get it through the API :( .
  41. RYAN BRADLEY I agree that it’s not about rankings at the end of the day, it’s about traffic. But most clients (who are uninformed) want to see rankings reports. When they pay hundreds if not thousands of dollars, they want to make sure they rank and see reports of which keywords they rank for. I know this can be solved somewhat by some education on the agency or consultants part but this is without a doubt a plight many consultants and agencies deal with on a daily basis.
  42. AJ KOHN Ryan, You’re absolutely right which is why I’m trying to give agencies and consultants another way to present rankings to clients that wind up being more productive. If you can move them away from specific rank to a rank index the conversation changes and you’ll be better able to truly help that client move their business forward. No doubt, easier said than done.
  43. KAMIL ORUÇBAZ It is also possible to track the visit source. For example: http://www.google.com.tr/url?sa=t&rct=j&q=kaleci&source=video&cd=2&cad=rja&ved=0CDcQtwIwAQ&url=http%3A%2F%2Fwww.uzmantv.com%2Fiyi-bir-kaleci-ne-tur-ozelliklere-sahip-olmali&ei=VosPUerCO4jJhAepy4DQCQ&usg=AFQjCNEh5NUkS6DByQuqo0-tqFwoGE216w

    q=kaleci
    cd=2
    source=video
    url

  44. AJ KOHN Yes Kamil, there are a lot of interesting things you can track by parsing the referral string aren’t there?
  45. DON STURGILL This is exactly what I’ve been wanting to do, AJ. Next step will be to read the article slowly and see if I can walk through and reproduce your system using my tools. Much appreciated. Hey, I’m going to miss your presentation at SMX West, but wish you great success.
  46. PETER MEAD Yeah, I like it. I really like it. I will need to spend some time and get this setup properly. This is really going to go down well for my clients. Thanks, Peter Mead
  47. AJ KOHN Thanks Don and let me know if you have any problems with the new process.
  48. AJ KOHN Great Peter. Let me know how it works for you once you have it up and running.
  49. JOSH BRAATEN AJ – This is an amazing technique you’ve uncovered and made available for everyone. I’ve seen it before, but just rediscovered it as part of preparation for a deck I’m working on. I’ll be sure to credit and direct folks to this awesome page.
  50. GIORGIO Hi AJ, I implemented this on my blog but it seems like it can only capture about 30% of the data. My blog’s total organic visits are 8,000 per month but the total events for ranktracker are only 2,303. Many keywords can’t be found in that report. I changed document.referrer.match(/google\.com/gi to document.referrer.match(/google\.fr/gi because it’s a French blog. Can you tell me what’s wrong?
  51. ANDREA Cool snippet, I’m interested too in tracking my websites in Italy/USA/Germany. Giorgio, are you sure your problem isn’t related to keyword placement, which could occur on the 2nd results page?
  52. GEORGE PHILLIP Hey, thanks for the rank tracking tips. Must admit I haven’t seen this one before
  53. TYLER MAGNUSSON I have to admit I’m a bit frustrated with Google’s crackdown on SEO tactics. I understand they’re trying to fold SEO from its own separate discipline into the more general umbrella of digital marketing, but Google’s emphasis on content marketing and social signaling is being shoved down our throats. In an attempt to move away from something that can be “gamed,” they’ve moved towards something else that can still be gamed. Perhaps I’m just a bit frustrated that a few things I could once do easily, like tracking keyword rankings and finding competitor keyword phrases are being marginalized.
  54. E-ROCK CHRISTOPHER Thanks for the great idea! Found this after reading Justin’s site. Implemented the idea using Google Tag Manager, which in and of itself is pretty cool. Now if only there were a rockstar way to get that (not provided) info!!
  55. JIM ROBINSON Hi AJ. I’ve found this post extremely valuable and re-read it a number of times. I implemented this using Event based tracking in Google Analytics on a site that gets around 3M search referrals per month, so I’m getting a nice volume of data. One thing I’m seeing, however, is event values like 820, 737, 644, etc. Any thoughts on what’s happening when the cd parameter is populated with numbers like these?
  56. AJ KOHN Jim, I’m thrilled that it’s been working for you. So the question here is how often you’re seeing those high numbers? Is it often or are they outliers? If it’s the latter, I’d guess that there are some people actually going back that far, perhaps in some sort of research mode. I know as an SEO I’m often on pages that no one else would ever get to regularly. If it’s the former, it might be image based search results. I used to see the cd parameter passed by image referrals but haven’t recently. But perhaps a few still do based on datacenter (?) or some other oddity.
  57. JIM ROBINSON It’s the former. It drops off gradually as you hit those high numbers – not just sporadic entries. I think image search makes a lot of sense here since you can pretty easily view that many results in an image query. Plus the site in question is very image focused. Great insight, AJ. I’m going to dig in further.
  58. KIRK Well to start with I didn’t even know my Analytics had an average rank.
  59. BARBARA VAROVSKY Awesome article! )) By the way, I know a tool that was not affected by Google’s new policies. This rank checking app is called Rank Tracker from SEO PowerSuite (you can download it here: http://www.seopowersuite.com/rank-tracker/). The app is cool and handles even advanced rank tracking tasks (Universal search results tracking, geo-targeted search and a lot more!).
  60. NATHAN The Analytics avg rank is a great tool to identify long tail opportunities that can be improved with an on-page tweak. This is great insight!
  61. GLEN WILSON Well… that was excellent. Again, the interwebs delivers me what I was looking for, and I now have another tool and technique in my arsenal. I agree with Barbara as well; I use Rank Tracker and love using it. The developers are constantly updating it with the latest algorithms too.
  62. GVANTO Thanks for this article – absolutely agree on a rank index being a much better indicator than individual keywords (particularly like the “I want to be number 1” from-the-boss comment!) If this process could be made simpler, it would definitely provide value. Would love to see more posts of this kind …
  63. D Hi, thanks very much for sharing this script. I use it regularly and have created some handy reports by combining it with custom dimensions in UA, such as one for “Page Type”. I wanted to comment that after tracking for a couple of months, the event action “(not provided)” is upwards of 90% of the total number of events (and I am not in a computer-savvy industry). You mentioned that the cd is only sent about half the time; is this the part that sends the keyword information? As mentioned, it is still useful for matching up landing pages to avg. Google ranking. The application I mentioned above, for example (avg. rank of each page type), does not rely on keywords, but I can still monitor the ranking effect of making content improvements and tweaks. Anyway, just wanted to report my findings so far
  64. MARY KAY LOFURNO Thanks for sharing this script. I finally got it up as a test on one of my product sites. I am looking forward to rolling it out to all our sites. Thanks, Mary Kay
  65. DAN CARTER Well to start with I didn’t even know my Analytics had an average rank.
  66. KRUNAL Hi, it’s great to collect this data directly into GA. However, I can’t find Content > Events in my GA! Can you please advise where this report lives in the current GA version, and what the report is called?
  67. KRUNAL I got the RankTracker events in Google Analytics, however I don’t see any keywords anywhere! Can you please advise where I can get those reports.
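
A minimal Python sketch of the rank buckets from comment 35, for anyone who prefers to do the counting outside a spreadsheet. The pandas DataFrame and its column names here are illustrative assumptions; substitute the keyword/average-rank export you actually build from your pivot table.

    # Pandas equivalent of the COUNTIF-based dashboard figures (illustrative data).
    import pandas as pd

    ranks = pd.DataFrame({
        "keyword": ["widget reviews", "buy widgets", "widget sizes", "acme widgets"],
        "avg_rank": [3.2, 1.0, 57.4, 8.9],
    })

    dashboard = {
        "Top 1": int((ranks["avg_rank"] == 1).sum()),
        "Top 3": int((ranks["avg_rank"] < 4).sum()),
        "Top 10": int((ranks["avg_rank"] < 11).sum()),
        "Outside the Top 50": int((ranks["avg_rank"] > 50).sum()),
        "Average rank": round(ranks["avg_rank"].mean(), 2),
    }
    print(dashboard)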

Generating Keyword Clusters

From a post by Dorcas Alexander, Lunametrics – 20130404

http://www.lunametrics.com/blog/2013/04/04/keyword-clusters-nlp-analysis/#sr=g&m=o&cp=or&ct=-tmc&st=(opu%20qspwjefe)&ts=1392951128

Now I’d like to turn to something complex, or at least with the potential for complexity: keyword analysis.

Keywords can be a rich source of visitor intent. I’m talking about search queries that lead to visits, as well as terms entered in site search after visitors arrive.

But looking at the top 100 or even top 1,000 keywords (ranked by your favorite metric: bounce rate, conversion rate, or whatever you like) won’t necessarily lead to the most accurate analysis because it neglects information in the long tail, which may be on the order of tens of thousands or more keywords.

If you’ve spent any time examining keyword data, you’ve observed similar terms dispersed throughout the long tail. I want to group those terms and analyze each group’s aggregated data to give a more complete picture. So what’s the best way to do that?

The Answer: Keyword Clusters

Of course, I’m not the first person to propose that analyzing groups or clusters of keywords can lead to more valuable insight than analyzing individual keywords alone.

In January, AJ Kohn wrote about his method for creating keyword rank indexes by exporting a CSV file with keyword rank history and leveraging pivot tables in Excel. Together with Justin Cutroni, he describes how to use event tracking to put keyword rank data into Google Analytics.

Another article from a couple years ago describes clustering keywords by their performance on various metrics. The author mentions (but doesn’t go into detail about) using tools like SPSS or SAS to do the cluster analysis and come up with related groups of terms.

And recently SEOmoz published an article about tracking SEO ‘broad match’ keywords in Google Analytics. Author Tracy Mu creates keyword clusters using regular expressions and then saves them as advanced segments. She then applies four segments at a time to custom reports and, really smartly, saves those reports as GA shortcuts.
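
As a rough illustration of that regular-expression approach, here is a minimal Python sketch that assigns raw queries to hand-defined “broad match” clusters. The cluster patterns and sample queries are illustrative assumptions; in GA the same patterns would live in advanced segments.

    # Group queries into hand-defined "broad match" clusters via regular expressions.
    import re
    from collections import defaultdict

    clusters = {
        "pricing": re.compile(r"\b(price|pricing|cost|cheap)\b", re.I),
        "how-to": re.compile(r"\b(how to|tutorial|guide)\b", re.I),
        "reviews": re.compile(r"\b(review|reviews|vs|comparison)\b", re.I),
    }

    queries = ["acme widget price", "how to install acme widget", "acme vs globex widgets"]

    grouped = defaultdict(list)
    for q in queries:
        matched = [name for name, pattern in clusters.items() if pattern.search(q)]
        grouped[matched[0] if matched else "other"].append(q)

    for name, terms in grouped.items():
        print(name, terms)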

The Next Question: Linguistic Complexity

All of those techniques are interesting and useful, but not quite what I’m looking for. The first two methods group keywords by a non-linguistic feature such as rank or performance. What keywords are in those groups? Still individual keywords dispersed across a slightly shorter long tail.

The last method, borrowing the idea of broad match from paid search, does what I want but with a limited number of clusters. The other drawback (for me) is that I don’t want to guess which keyword clusters to create. That’s a little too much art and not enough science.

What I really want to do is apply text analytics methods to discover patterns in keyword data, related to the semantic domain of the customer, and create related keyword groups automatically. This would account for linguistic complexity in all the forms actually produced by site visitors, a seemingly endless variety of word choices, phrasing, and spelling.

Text Analytics to the Rescue

I found someone else looking for the same thing in a question on Stack Overflow about using Python to cluster search engine keywords. The tricky part, as suggested in the question, is developing a domain-specific word source rather than relying on a more generally-informed source like WordNet.

One way to develop a customized word source, or “topic library”, is to mine web content related to the customer’s industry, and then cross-reference it with a database of phrases (such as the customer’s actual keywords). This allows for identification of phrases that will be treated as one word, as well as proper nouns and acronyms that may be specific to the customer’s products or services.

I’m planning to combine the customized topic library with a tool like the Python Natural Language Toolkit to create keyword clusters for better analysis. I’ll keep you updated on the results.
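
For readers who want a feel for the mechanics, here is a rough sketch of that direction: stem the keywords with the Python Natural Language Toolkit, vectorize them, and let a clustering algorithm group them. The sample keywords and the choice of two clusters are illustrative assumptions, and the domain-specific topic library is left out entirely.

    # Sketch: stem keywords with NLTK, vectorize with TF-IDF, cluster with k-means.
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    keywords = [
        "cheap running shoes", "running shoe reviews", "best trail running shoes",
        "marathon training plan", "beginner marathon training", "trail shoes review",
    ]

    stemmer = PorterStemmer()

    def stem_tokens(text):
        return [stemmer.stem(tok) for tok in text.split()]

    vectorizer = TfidfVectorizer(tokenizer=stem_tokens)
    X = vectorizer.fit_transform(keywords)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for keyword, label in zip(keywords, labels):
        print(label, keyword)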

Responses:

Sean says:

About 2 years ago I scraped the top 100 results off of Google for roughly 30K keywords and built a matrix of keyword x website. Threw that into R’s hclust function, and about 3 days later I had a dendrogram of what Google thought of the keywords. It wasn’t too bad; a lot of work needs to be done on figuring the “distance” between two keywords, since I was just learning this stuff as I went. But I figure Google knows which keywords are related, and if it shows similar results for two keywords, that must be an indicator, no?

Dorcas Alexander Dorcas Alexander says:

Hi Sean, I thought I might end up using R eventually, but wanted to give the Python Toolkit a try first. And yes, figuring the distance between keywords is another potentially time-intensive part of the process, and probably requires some trial-and-error experimentation. I would also guess that Google showing similar results for two keywords is an indicator of shorter distance. Thanks for your comments!

Seamus says:

Dorcas, Great to see a post on this. Since forever I have been looking for a way to group keywords semantically. Long tail is great, but I have never been able to find a way to group `themes` in, say, 10/20k long tail keywords. Thanks for a nice round up of info. Having tried all kinds of tools, one that I have found (a bit useful) that uses textual analysis is Open Calais. I await your update with interest…

Dorcas Alexander Dorcas Alexander says:

Thanks for your comments, Seamus. I took a quick look at Open Calais and it seems like something I might want to try. With all the tools I try, it’s a matter of finding the right balance between letting the tool do the work and having more control (meaning having to do more work myself).

Overview of Cluster Analysis

Source:  https://en.wikipedia.org/wiki/Cluster_analysis – (most) links to Wikipedia

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including values such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It will often be necessary to modify data pre-processing and model parameters until the result achieves the desired properties.

Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy and typological analysis. The subtle differences are often in the usage of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest. This often leads to misunderstandings between researchers coming from the fields of data mining and machine learning, since they use the same terms and often the same algorithms, but have different goals.

Cluster analysis originated in anthropology with Driver and Kroeber in 1932, was introduced to psychology by Zubin in 1938 and Robert Tryon in 1939,[1][2] and was famously used by Cattell beginning in 1943[3] for trait theory classification in personality psychology.

Clusters and clusterings

According to Vladimir Estivill-Castro, the notion of a “cluster” cannot be precisely defined, which is one of the reasons why there are so many clustering algorithms.[4] There is a common denominator: a group of data objects. However, different researchers employ different cluster models, and for each of these cluster models again different algorithms can be given. The notion of a cluster, as found by different algorithms, varies significantly in its properties. Understanding these “cluster models” is key to understanding the differences between the various algorithms. Typical cluster models include:

  • Connectivity models: for example hierarchical clustering builds models based on distance connectivity.
  • Centroid models: for example the k-means algorithm represents each cluster by a single mean vector.
  • Distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the Expectation-maximization algorithm.
  • Density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space.
  • Subspace models: in Biclustering (also known as Co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes.
  • Group models: some algorithms do not provide a refined model for their results and just provide the grouping information.
  • Graph-based models: a clique, i.e., a subset of nodes in a graph such that every two nodes in the subset are connected by an edge can be considered as a prototypical form of cluster. Relaxations of the complete connectivity requirement (a fraction of the edges can be missing) are known as quasi-cliques.

A “clustering” is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example a hierarchy of clusters embedded in each other. Clusterings can be roughly distinguished as:

  • hard clustering: each object belongs to a cluster or not
  • soft clustering (also: fuzzy clustering): each object belongs to each cluster to a certain degree (e.g. a likelihood of belonging to the cluster)

There are also finer distinctions possible, for example:

  • strict partitioning clustering: here each object belongs to exactly one cluster
  • strict partitioning clustering with outliers: objects can also belong to no cluster, and are considered outliers.
  • overlapping clustering (also: alternative clustering, multi-view clustering): while usually a hard clustering, objects may belong to more than one cluster.
  • hierarchical clustering: objects that belong to a child cluster also belong to the parent cluster
  • subspace clustering: while an overlapping clustering, within a uniquely defined subspace, clusters are not expected to overlap.

Clustering algorithms

Clustering algorithms can be categorized based on their cluster model, as listed above. The following overview will only list the most prominent examples of clustering algorithms, as there are possibly over 100 published clustering algorithms. Not all provide models for their clusters and can thus not easily be categorized.

There is no objectively “correct” clustering algorithm; as has been noted, “clustering is in the eye of the beholder.”[4] The most appropriate clustering algorithm for a particular problem often needs to be chosen experimentally, unless there is a mathematical reason to prefer one cluster model over another. An algorithm that is designed for one kind of model has no chance on a data set that contains a radically different kind of model.[4] For example, k-means cannot find non-convex clusters.[4]

Connectivity based clustering (hierarchical clustering)

Connectivity based clustering, also known as hierarchical clustering, is based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect “objects” to form “clusters” based on their distance. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will form, which can be represented using a dendrogram, which explains where the common name “hierarchical clustering” comes from: these algorithms do not provide a single partitioning of the data set, but instead provide an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge, while the objects are placed along the x-axis such that the clusters don’t mix.

Connectivity based clustering is a whole family of methods that differ by the way distances are computed. Apart from the usual choice of distance functions, the user also needs to decide on the linkage criterion (since a cluster consists of multiple objects, there are multiple candidates to compute the distance to) to use. Popular choices are known as single-linkage clustering (the minimum of object distances), complete linkage clustering (the maximum of object distances) or UPGMA (“Unweighted Pair Group Method with Arithmetic Mean”, also known as average linkage clustering). Furthermore, hierarchical clustering can be agglomerative (starting with single elements and aggregating them into clusters) or divisive (starting with the complete data set and dividing it into partitions).

These methods will not produce a unique partitioning of the data set, but a hierarchy from which the user still needs to choose appropriate clusters. They are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge (known as the “chaining phenomenon”, in particular with single-linkage clustering). In the general case, the complexity is O(n³), which makes them too slow for large data sets. For some special cases, optimal efficient methods (of complexity O(n²)) are known: SLINK[5] for single-linkage and CLINK[6] for complete-linkage clustering. In the data mining community these methods are recognized as a theoretical foundation of cluster analysis, but often considered obsolete. They did however provide inspiration for many later methods such as density based clustering.

  • Linkage clustering examples
  • Single-linkage on Gaussian data. At 35 clusters, the biggest cluster starts fragmenting into smaller parts, while before it was still connected to the second largest due to the single-link effect.

  • Single-linkage on density-based clusters. 20 clusters extracted, most of which contain single elements, since linkage clustering does not have a notion of “noise”.
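
A minimal SciPy sketch of connectivity-based clustering, assuming toy 2-D data and an arbitrary distance cut-off; both are illustrative.

    # Agglomerative clustering: build a single-linkage hierarchy, then cut it
    # at a distance threshold to obtain flat clusters.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
        rng.normal(loc=3.0, scale=0.3, size=(20, 2)),
    ])

    Z = linkage(points, method="single")               # single linkage: minimum pairwise distance
    labels = fcluster(Z, t=1.0, criterion="distance")  # cut the dendrogram at distance 1.0
    print(sorted(set(labels)))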

Centroid-based clustering

In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances to the cluster centers are minimized.

The optimization problem itself is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well known approximative method is Lloyd’s algorithm,[7] often actually referred to as the “k-means algorithm”. It does, however, only find a local optimum, and is commonly run multiple times with different random initializations. Variations of k-means often include such optimizations as choosing the best of multiple runs, but also restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (k-means++) or allowing a fuzzy cluster assignment (fuzzy c-means).

Most k-means-type algorithms require the number of clusters – k – to be specified in advance, which is considered to be one of the biggest drawbacks of these algorithms. Furthermore, the algorithms prefer clusters of approximately similar size, as they will always assign an object to the nearest centroid. This often leads to incorrectly cut borders between clusters (which is not surprising, as the algorithm optimizes cluster centers, not cluster borders).

K-means has a number of interesting theoretical properties. On the one hand, it partitions the data space into a structure known as a Voronoi diagram. On the other hand, it is conceptually close to nearest neighbor classification, and as such is popular in machine learning. Third, it can be seen as a variation of model based classification, and Lloyd’s algorithm as a variation of the Expectation-maximization algorithm for this model, discussed below.

  • k-Means clustering examples
  • K-means separates data into Voronoi-cells, which assumes equal-sized clusters (not adequate here)

  • K-means cannot represent density-based clusters
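
A short scikit-learn sketch of centroid-based clustering; the toy data and the choice of k = 2 are illustrative assumptions.

    # k-means (Lloyd's algorithm): k is fixed in advance, and several random
    # restarts (n_init) are used to pick the best local optimum found.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
        rng.normal(loc=4.0, scale=0.5, size=(50, 2)),
    ])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(km.cluster_centers_)  # one mean vector per cluster
    print(km.inertia_)          # sum of squared distances to the nearest centroid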

Distribution-based clustering

The clustering model most closely related to statistics is based on distribution models. Clusters can then easily be defined as objects belonging most likely to the same distribution. A nice property of this approach is that this closely resembles the way artificial data sets are generated: by sampling random objects from a distribution.

While the theoretical foundation of these methods is excellent, they suffer from one key problem known as overfitting, unless constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing the appropriate model complexity inherently difficult.

One prominent method is known as Gaussian mixture models (using the expectation-maximization algorithm). Here, the data set is usually modelled with a fixed (to avoid overfitting) number of Gaussian distributions that are initialized randomly and whose parameters are iteratively optimized to fit better to the data set. This will converge to a local optimum, so multiple runs may produce different results. In order to obtain a hard clustering, objects are often then assigned to the Gaussian distribution they most likely belong to; for soft clusterings, this is not necessary.

Distribution-based clustering is a semantically strong method, as it not only provides you with clusters, but also produces complex models for the clusters that can also capture correlation and dependence of attributes. However, using these algorithms puts an extra burden on the user: to choose appropriate data models to optimize, and for many real data sets, there may be no mathematical model available that the algorithm is able to optimize (e.g. assuming Gaussian distributions is a rather strong assumption on the data).

  • Expectation-Maximization (EM) clustering examples
  • On Gaussian-distributed data, EM works well, since it uses Gaussians for modelling clusters

  • Density-based clusters cannot be modeled using Gaussian distributions
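
A brief scikit-learn sketch of distribution-based clustering with a Gaussian mixture fitted by EM; the toy data and the number of components are illustrative assumptions.

    # Gaussian mixture via EM: predict() gives a hard clustering, predict_proba()
    # gives the soft (per-component) membership probabilities.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    points = np.vstack([
        rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
        rng.normal(loc=3.0, scale=1.0, size=(100, 2)),
    ])

    gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(points)
    hard_labels = gmm.predict(points)
    soft_labels = gmm.predict_proba(points)
    print(gmm.means_)
    print(hard_labels[:5], soft_labels[0])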

Density-based clustering

In density-based clustering,[8] clusters are defined as areas of higher density than the remainder of the data set. Objects in these sparse areas – that are required to separate clusters – are usually considered to be noise and border points.

The most popular[9] density based clustering method is DBSCAN.[10] In contrast to many newer methods, it features a well-defined cluster model called “density-reachability”. Similar to linkage based clustering, it is based on connecting points within certain distance thresholds. However, it only connects points that satisfy a density criterion, in the original variant defined as a minimum number of other objects within this radius. A cluster consists of all density-connected objects (which can form a cluster of an arbitrary shape, in contrast to many other methods) plus all objects that are within these objects’ range. Another interesting property of DBSCAN is that its complexity is fairly low – it requires a linear number of range queries on the database – and that it will discover essentially the same results (it is deterministic for core and noise points, but not for border points) in each run, therefore there is no need to run it multiple times. OPTICS[11] is a generalization of DBSCAN that removes the need to choose an appropriate value for the range parameter ε, and produces a hierarchical result related to that of linkage clustering. DeLi-Clu,[12] Density-Link-Clustering, combines ideas from single-linkage clustering and OPTICS, eliminating the ε parameter entirely and offering performance improvements over OPTICS by using an R-tree index.

The key drawback of DBSCAN and OPTICS is that they expect some kind of density drop to detect cluster borders. Moreover, they cannot detect intrinsic cluster structures which are prevalent in the majority of real life data. A variation of DBSCAN, EnDBSCAN,[13] efficiently detects such kinds of structures. On data sets with, for example, overlapping Gaussian distributions – a common use case in artificial data – the cluster borders produced by these algorithms will often look arbitrary, because the cluster density decreases continuously. On a data set consisting of mixtures of Gaussians, these algorithms are nearly always outperformed by methods such as EM clustering that are able to precisely model this kind of data.

  • Density-based clustering examples
  • Density-based clustering with DBSCAN.

  • DBSCAN assumes clusters of similar density, and may have problems separating nearby clusters

  • OPTICS is a DBSCAN variant that handles different densities much better
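
A small scikit-learn sketch of density-based clustering; eps and min_samples are illustrative assumptions tuned to the toy data.

    # DBSCAN: points with at least min_samples neighbours within eps become core
    # points; points reachable from no cluster are labelled -1 (noise).
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    dense_blob = rng.normal(loc=0.0, scale=0.2, size=(100, 2))
    scattered_noise = rng.uniform(low=-3.0, high=3.0, size=(20, 2))
    points = np.vstack([dense_blob, scattered_noise])

    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)
    print("clusters:", sorted(set(labels) - {-1}))
    print("noise points:", int((labels == -1).sum()))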

Recent developments

In recent years considerable effort has been put into improving the performance of existing algorithms.[14][15] Among them are CLARANS (Ng and Han, 1994)[16] and BIRCH (Zhang et al., 1996).[17] With the recent need to process larger and larger data sets (also known as big data), the willingness to trade semantic meaning of the generated clusters for performance has been increasing. This led to the development of pre-clustering methods such as canopy clustering, which can process huge data sets efficiently, but the resulting “clusters” are merely a rough pre-partitioning of the data set, to then analyze the partitions with existing slower methods such as k-means clustering. Various other approaches to clustering have been tried, such as seed based clustering.[18]

For high-dimensional data, many of the existing methods fail due to the curse of dimensionality, which renders particular distance functions problematic in high-dimensional spaces. This led to new clustering algorithms for high-dimensional data that focus on subspace clustering (where only some attributes are used, and cluster models include the relevant attributes for the cluster) and correlation clustering that also looks for arbitrary rotated (“correlated”) subspace clusters that can be modeled by giving a correlation of their attributes. Examples for such clustering algorithms are CLIQUE[19] and SUBCLU.[20]

Ideas from density-based clustering methods (in particular the DBSCAN/OPTICS family of algorithms) have been adopted to subspace clustering (HiSC,[21] hierarchical subspace clustering and DiSH[22]) and correlation clustering (HiCO,[23] hierarchical correlation clustering, 4C[24] using “correlation connectivity” and ERiC[25] exploring hierarchical density-based correlation clusters).

Several different clustering systems based on mutual information have been proposed. One is Marina Meilă’s variation of information metric;[26] another provides hierarchical clustering.[27] Using genetic algorithms, a wide range of different fit-functions can be optimized, including mutual information.[28] Also, message passing algorithms, a recent development in computer science and statistical physics, have led to the creation of new types of clustering algorithms.[29]

Evaluation of clustering results

Evaluation of clustering results sometimes is referred to as cluster validation.

There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. These measures are usually tied to the type of criterion being considered in assessing the quality of a clustering method.

Internal evaluation

When a clustering result is evaluated based on the data that was clustered itself, this is called internal evaluation. These methods usually assign the best score to the algorithm that produces clusters with high similarity within a cluster and low similarity between clusters. One drawback of using internal criteria in cluster evaluation is that high scores on an internal measure do not necessarily result in effective information retrieval applications.[30] Additionally, this evaluation is biased towards algorithms that use the same cluster model. For example k-Means clustering naturally optimizes object distances, and a distance-based internal criterion will likely overrate the resulting clustering.

Therefore, the internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another, but this does not imply that one algorithm produces more valid results than another.[4] Validity as measured by such an index depends on the claim that this kind of structure exists in the data set. An algorithm designed for some kind of model has no chance if the data set contains a radically different set of models, or if the evaluation measures a radically different criterion.[4] For example, k-means clustering can only find convex clusters, and many evaluation indexes assume convex clusters. On a data set with non-convex clusters, neither the use of k-means, nor of an evaluation criterion that assumes convexity, is sound.

The following methods can be used to assess the quality of clustering algorithms based on internal criterion:

The Davies–Bouldin index can be calculated by the following formula:
 DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)
where n is the number of clusters, c_x is the centroid of cluster x, \sigma_x is the average distance of all elements in cluster x to centroid c_x, and d(c_i, c_j) is the distance between centroids c_i and c_j. Since algorithms that produce clusters with low intra-cluster distances (high intra-cluster similarity) and high inter-cluster distances (low inter-cluster similarity) will have a low Davies–Bouldin index, the clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm based on this criterion.
The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance. For each cluster partition, the Dunn index can be calculated by the following formula:[31]
 D = \min_{1 \leq i \leq n} \left\{ \min_{1 \leq j \leq n,\, j \neq i} \left\{ \frac{d(i,j)}{\max_{1 \leq k \leq n} d'(k)} \right\} \right\}
where d(i,j) represents the distance between clusters i and j, and d'(k) measures the intra-cluster distance of cluster k. The inter-cluster distance d(i,j) between two clusters may be any of a number of distance measures, such as the distance between the centroids of the clusters. Similarly, the intra-cluster distance d'(k) may be measured in a variety of ways, such as the maximal distance between any pair of elements in cluster k. Since internal criteria seek clusters with high intra-cluster similarity and low inter-cluster similarity, algorithms that produce clusters with a high Dunn index are more desirable.
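
A short sketch of internal evaluation on toy data: the Davies–Bouldin index via scikit-learn, and a straightforward Dunn index computed from the formula above (centroid distance between clusters as d(i,j), maximum pairwise distance within a cluster as d'(k)); the data and labels are illustrative assumptions.

    # Internal validation: lower Davies-Bouldin is better, higher Dunn is better.
    import numpy as np
    from itertools import combinations
    from scipy.spatial.distance import pdist
    from sklearn.metrics import davies_bouldin_score

    rng = np.random.default_rng(0)
    points = np.vstack([rng.normal(0.0, 0.4, (30, 2)), rng.normal(4.0, 0.4, (30, 2))])
    labels = np.array([0] * 30 + [1] * 30)

    print("Davies-Bouldin:", davies_bouldin_score(points, labels))

    clusters = [points[labels == k] for k in np.unique(labels)]
    inter = min(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))   # d(i, j): centroid distance
                for a, b in combinations(clusters, 2))
    intra = max(pdist(c).max() for c in clusters)                 # d'(k): cluster diameter
    print("Dunn index:", inter / intra)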

External evaluation

In external evaluation, clustering results are evaluated based on data that was not used for clustering, such as known class labels and external benchmarks. Such benchmarks consist of a set of pre-classified items, and these sets are often created by humans (experts). Thus, the benchmark sets can be thought of as a gold standard for evaluation. These types of evaluation methods measure how close the clustering is to the predetermined benchmark classes. However, it has recently been discussed whether this is adequate for real data, or only for synthetic data sets with a factual ground truth, since classes can contain internal structure, the attributes present may not allow separation of clusters, or the classes may contain anomalies.[32] Additionally, from a knowledge discovery point of view, the reproduction of known knowledge may not necessarily be the intended result.[32]

Some of the measures of quality of a cluster algorithm using external criterion include:

The Rand index computes how similar the clusters (returned by the clustering algorithm) are to the benchmark classifications. One can also view the Rand index as a measure of the percentage of correct decisions made by the algorithm. It can be computed using the following formula:
 RI = \frac {TP + TN} {TP + FP + FN + TN}
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. One issue with the Rand index is that false positives and false negatives are equally weighted. This may be an undesirable characteristic for some clustering applications. The F-measure addresses this concern.
The F-measure can be used to balance the contribution of false negatives by weighting recall through a parameter \beta \geq 0. Let precision and recall be defined as follows:
 P = \frac {TP } {TP + FP }
 R = \frac {TP } {TP + FN}
where P is the precision rate and R is the recall rate. We can calculate the F-measure by using the following formula:[30]
 F_{\beta} = \frac {(\beta^2 + 1)\cdot P \cdot R } {\beta^2 \cdot P + R}
Notice that when \beta = 0, F_0 = P. In other words, recall has no impact on the F-measure when \beta = 0, and increasing \beta allocates an increasing amount of weight to recall in the final F-measure.
  • Pair-counting F-Measure is the F-Measure applied to the set of object pairs, where objects are paired with each other when they are part of the same cluster. This measure is able to compare clusterings with different numbers of clusters.
  • Jaccard index
The Jaccard index is used to quantify the similarity between two datasets. The Jaccard index takes on a value between 0 and 1. An index of 1 means that the two datasets are identical, and an index of 0 indicates that the datasets have no common elements. The Jaccard index is defined by the following formula:
 J(A,B) = \frac {|A \cap B| } {|A \cup B|} = \frac{TP}{TP + FP + FN}
This is simply the number of unique elements common to both sets divided by the total number of unique elements in both sets.
The Fowlkes-Mallows index computes the similarity between the clusters returned by the clustering algorithm and the benchmark classifications. The higher the value of the Fowlkes-Mallows index the more similar the clusters and the benchmark classifications are. It can be computed using the following formula:
 FM = \sqrt{ \frac {TP}{TP+FP} \cdot \frac{TP}{TP+FN}  }
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. The FM index is the geometric mean of the precision and recall, P and R, while the F-measure is their harmonic mean.[35] Moreover, precision and recall are also known as Wallace’s indices B^I and B^{II}.[36]
A confusion matrix can be used to quickly visualize the results of a classification (or clustering) algorithm. It shows how different a cluster is from the gold standard cluster.
  • The Mutual Information is an information-theoretic measure of how much information is shared between a clustering and a ground-truth classification; it can detect a non-linear similarity between two clusterings. The Adjusted Mutual Information is the corrected-for-chance variant, with reduced bias for varying cluster numbers.
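
A compact sketch that computes the pair-counting measures above directly from TP/FP/FN/TN; the counts are illustrative assumptions, and in practice they come from comparing a clustering against benchmark labels.

    # External validation from pair counts, following the formulas above.
    from math import sqrt

    TP, FP, FN, TN = 20, 5, 10, 65
    beta = 1.0  # F-measure weight; beta = 0 reduces F to precision

    rand_index = (TP + TN) / (TP + FP + FN + TN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f_beta = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    jaccard = TP / (TP + FP + FN)
    fowlkes_mallows = sqrt(precision * recall)  # geometric mean of precision and recall

    print(rand_index, f_beta, jaccard, fowlkes_mallows)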

Applications

Business and marketing
Market research
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers, and for use in market segmentation, product positioning, new product development and selecting test markets.
Grouping of shopping items
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn’t have the concept of a SKU).
World wide web
Social network analysis
In the study of social networks, clustering may be used to recognize communities within large groups of people.
Search result grouping
In the intelligent grouping of files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. There are currently a number of web-based clustering tools such as Clusty.
Slippy map optimization
Flickr‘s map of photos and other map sites use clustering to reduce the number of markers on a map. This makes the map both faster to render and less visually cluttered.
Computer science
Image segmentation
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
Recommender systems
Recommender systems are designed to recommend new items based on a user’s tastes. They sometimes use clustering algorithms to predict a user’s preferences based on the preferences of other users in the user’s cluster.
Markov chain Monte Carlo methods
Clustering is often utilized to locate and characterize extrema in the target distribution.
Social science
Crime analysis
Cluster analysis can be used to identify areas where there are greater incidences of particular types of crime. By identifying these distinct areas or “hot spots” where a similar crime has happened over a period of time, it is possible to manage law enforcement resources more effectively.
Educational data mining
Cluster analysis is for example used to identify groups of schools or students with similar properties.
Typologies
From poll data, projects such as those undertaken by the Pew Research Center use cluster analysis to discern typologies of opinions, habits, and demographics that may be useful in politics and marketing.


Related methods

See also category: Data clustering algorithms

References

  1. Bailey, Ken (1994). “Numerical Taxonomy and Cluster Analysis”. Typologies and Taxonomies. p. 34. ISBN 9780803952591.
  2. Tryon, Robert C. (1939). Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality. Edwards Brothers.
  3. Cattell, R. B. (1943). “The description of personality: Basic traits resolved into clusters”. Journal of Abnormal and Social Psychology, 38, 476–506.
  4. Estivill-Castro, Vladimir (20 June 2002). “Why so many clustering algorithms — A Position Paper”. ACM SIGKDD Explorations Newsletter 4 (1): 65–75. doi:10.1145/568574.568575.
  5. R. Sibson (1973). “SLINK: an optimally efficient algorithm for the single-link cluster method”. The Computer Journal (British Computer Society) 16 (1): 30–34. doi:10.1093/comjnl/16.1.30.
  6. D. Defays (1977). “An efficient algorithm for a complete link method”. The Computer Journal (British Computer Society) 20 (4): 364–366. doi:10.1093/comjnl/20.4.364.
  7. Lloyd, S. (1982). “Least squares quantization in PCM”. IEEE Transactions on Information Theory 28 (2): 129–137. doi:10.1109/TIT.1982.1056489.
  8. Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek (2011). “Density-based Clustering”. WIREs Data Mining and Knowledge Discovery 1 (3): 231–240. doi:10.1002/widm.30.
  9. Microsoft Academic Search: most cited data mining articles: DBSCAN is at rank 24, when accessed on 4/18/2010.
  10. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu (1996). “A density-based algorithm for discovering clusters in large spatial databases with noise”. In Evangelos Simoudis, Jiawei Han, Usama M. Fayyad. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226–231. ISBN 1-57735-004-9.
  11. Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander (1999). “OPTICS: Ordering Points To Identify the Clustering Structure”. ACM SIGMOD International Conference on Management of Data. ACM Press. pp. 49–60.
  12. Achtert, E.; Böhm, C.; Kröger, P. (2006). “DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking”. LNCS: Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science 3918: 119–128. doi:10.1007/11731139_16. ISBN 978-3-540-33206-0.
  13. S. Roy, D. K. Bhattacharyya (2005). “An Approach to Find Embedded Clusters Using Density Based Techniques”. LNCS Vol. 3816. Springer Verlag. pp. 523–535.
  14. D. Sculley (2010). “Web-scale k-means clustering”. Proc. 19th WWW.
  15. Z. Huang. “Extensions to the k-means algorithm for clustering large data sets with categorical values”. Data Mining and Knowledge Discovery, 2:283–304, 1998.
  16. R. Ng and J. Han. “Efficient and effective clustering method for spatial data mining”. In: Proceedings of the 20th VLDB Conference, pages 144–155, Santiago, Chile, 1994.
  17. Tian Zhang, Raghu Ramakrishnan, Miron Livny. “An Efficient Data Clustering Method for Very Large Databases”. In: Proc. Int’l Conf. on Management of Data, ACM SIGMOD, pp. 103–114.
  18. Can, F.; Ozkarahan, E. A. (1990). “Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases”. ACM Transactions on Database Systems 15 (4): 483. doi:10.1145/99935.99938.
  19. Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P. (2005). “Automatic Subspace Clustering of High Dimensional Data”. Data Mining and Knowledge Discovery 11: 5. doi:10.1007/s10618-005-1396-1.
  20. Karin Kailing, Hans-Peter Kriegel and Peer Kröger. “Density-Connected Subspace Clustering for High-Dimensional Data”. In: Proc. SIAM Int. Conf. on Data Mining (SDM’04), pp. 246–257, 2004.
  21. Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2006). “Finding Hierarchies of Subspace Clusters”. LNCS: Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science 4213: 446–453. doi:10.1007/11871637_42. ISBN 978-3-540-45374-1.
  22. Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Müller-Gorman, I.; Zimek, A. (2007). “Detection and Visualization of Subspace Cluster Hierarchies”. LNCS: Advances in Databases: Concepts, Systems and Applications. Lecture Notes in Computer Science 4443: 152–163. doi:10.1007/978-3-540-71703-4_15. ISBN 978-3-540-71702-7.
  23. Achtert, E.; Böhm, C.; Kröger, P.; Zimek, A. (2006). “Mining Hierarchies of Correlation Clusters”. Proc. 18th International Conference on Scientific and Statistical Database Management (SSDBM): 119–128. doi:10.1109/SSDBM.2006.35. ISBN 0-7695-2590-3.
  24. Böhm, C.; Kailing, K.; Kröger, P.; Zimek, A. (2004). “Computing Clusters of Correlation Connected Objects”. Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data – SIGMOD ’04. p. 455. doi:10.1145/1007568.1007620. ISBN 1581138598.
  25. Achtert, E.; Böhm, C.; Kriegel, H. P.; Kröger, P.; Zimek, A. (2007). “On Exploring Complex Relationships of Correlation Clusters”. 19th International Conference on Scientific and Statistical Database Management (SSDBM 2007). p. 7. doi:10.1109/SSDBM.2007.21. ISBN 0-7695-2868-6.
  26. Meilă, Marina (2003). “Comparing Clusterings by the Variation of Information”. Learning Theory and Kernel Machines. Lecture Notes in Computer Science 2777: 173–187. doi:10.1007/978-3-540-45167-9_14. ISBN 978-3-540-40720-1.
  27. Alexander Kraskov, Harald Stögbauer, Ralph G. Andrzejak, and Peter Grassberger (2003). “Hierarchical Clustering Based on Mutual Information”. ArXiv q-bio/0311039.
  28. Auffarth, B. (2010). “Clustering by a Genetic Algorithm with Biased Mutation Operator”. WCCI CEC. IEEE, July 18–23, 2010. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.170.869
  29. B. J. Frey and D. Dueck (2007). “Clustering by Passing Messages Between Data Points”. Science 315 (5814): 972–976. doi:10.1126/science.1136800. PMID 17218491.
  30. Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press. ISBN 978-0-521-86571-5.
  31. Dunn, J. (1974). “Well separated clusters and optimal fuzzy partitions”. Journal of Cybernetics 4: 95–104. doi:10.1080/01969727408546059.
  32. Ines Färber, Stephan Günnemann, Hans-Peter Kriegel, Peer Kröger, Emmanuel Müller, Erich Schubert, Thomas Seidl, Arthur Zimek (2010). “On Using Class-Labels in Evaluation of Clusterings”. In Xiaoli Z. Fern, Ian Davidson, Jennifer Dy. MultiClust: Discovering, Summarizing, and Using Multiple Clusterings. ACM SIGKDD.
  33. W. M. Rand (1971). “Objective criteria for the evaluation of clustering methods”. Journal of the American Statistical Association 66 (336): 846–850. doi:10.2307/2284239. JSTOR 2284239.
  34. E. B. Fowlkes & C. L. Mallows (1983). “A Method for Comparing Two Hierarchical Clusterings”. Journal of the American Statistical Association 78: 553–569.
  35. L. Hubert and P. Arabie. “Comparing partitions”. Journal of Classification, 2(1), 1985.
  36. D. L. Wallace. “Comment”. Journal of the American Statistical Association, 78: 569–579, 1983.
  37. R. B. Zadeh, S. Ben-David. “A Uniqueness Theorem for Clustering”. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
  38. J. Kleinberg. “An Impossibility Theorem for Clustering”. Proceedings of the Neural Information Processing Systems Conference, 2002.
  39. Bewley, A. et al. “Real-time volume estimation of a dragline payload”. IEEE International Conference on Robotics and Automation, 2011: 1571–1576.
  40. Basak, S.C., Magnuson, V.R., Niemi, C.J., Regal, R.R. “Determining Structural Similarity of Chemicals Using Graph Theoretic Indices”. Discr. Appl. Math. 19, 1988: 17–44.
  41. Huth, R. et al. “Classifications of Atmospheric Circulation Patterns: Recent Advances and Applications”. Ann. N.Y. Acad. Sci. 1146, 2008: 105–152.

Chris Ridings on PageRank

In 1998, search engines added link analysis to their bag of tricks. As a result, content spam and cloaking alone could no longer fool the link analysis engines and garner spammers unjustifiably high rankings. Spammers and SEOs adapted by learning how link analysis works. The SEO community has always been active – its members, then and now, hold conferences, write papers and books, host weblogs, and sell their secrets. The most famous and informative SEO papers were written by Chris Ridings: PageRank Explained: Everything you’ve always wanted to know about PageRank and PageRank Uncovered. These papers offer practical strategies for hoarding PageRank and avoiding such undesirable things as PageRank leak. Search engines consider unethical SEOs to be adversaries; some web analysts, however, call them an essential part of the web food chain, because they drive innovation and research and development.

Langville & Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings, p. 44.