Scraping Top Queries in GWT

Set the Parameters

  • Select Report Type = Top Queries.
  • Set Date Range with “Start Date” and “End Date”.
  • Select Filter = {All | Web | Mobile} – disregard “Location” and “Traffic” options.
  • Set Rows to display to the total number of Top Queries by editing the “grid.s” parameter in the URL.
  • In practice, we generate 3 datasets (Filter = {All | Web | Mobile}) for a given Date Range.
  • The most important single dataset is Filter = Web.
  • The comparison of Filter = Web vs. Filter = Mobile may also be of interest – if so, use Filter = All to generate the exhaustive list of Queries to support this comparison.

Download two main datasets and save as .csv files:

  • Click on “Download This Table” – save as dtt-tq.csv.
  • Click on “Download Chart Data” – save as dcd-tq.csv.

Import the .csv files to Excel

  • Import dtt-tq.csv into Excel as comma-delimited, UTF-8 format Data – save as <start date>-<end date>-<filter>-dtt-tq.xls, e.g. 20140415-20140428-web-dtt-tq.xls.
  • Do the same for dcd-tq.csv – save as <start date>-<end date>-<filter>-dcd-tq.xls.

Generate iMacros script to Browse the Queries

  • Every Top Query links to two tables – (1) the Query’s target URLS; (2) the Query’s CTR as a function of POSITIONS on the SERP (without regard to the target URL). We will use iMacros to browse and download these tables for every Query:
  • Sort the “Top Queries” by Impressions-Descending – as we will want to align with <start date>-<end date>-<filter>-dtt-tq.xls from the Import step above.
  • Copy the source code for the “Top Queries” page to source.txt.
  • Using PowerGREP, re-write source.txt as a series of EVENT TYPE commands. PowerGREP searches for (wmt.common.display.Star\('star_)([^\u0027]+) and, for each match, collects:
    
    EVENT TYPE=CLICK SELECTOR="#qd_\2" BUTTON=0
    EVENT TYPE=CLICK SELECTOR="#download-urls>DIV" BUTTON=0
    EVENT TYPE=CLICK SELECTOR="#top-search-queries>DIV:nth-of-type(13)>DIV:nth-of-type(3)>BUTTON" BUTTON=0
    EVENT TYPE=CLICK SELECTOR="#download-positions>DIV" BUTTON=0
    EVENT TYPE=CLICK SELECTOR="#top-search-queries>DIV:nth-of-type(13)>DIV:nth-of-type(3)>BUTTON" BUTTON=0
    BACK
    

    Note: one or both of the values of xx and yy in DIV:nth-of-type(xx)>DIV:nth-of-type(yy) seem to change from session to session, and may need to be updated in the iMacros script. Be prepared: this is a long process, the script crashes periodically, and when it does you must delete the commands that have already executed before re-running the remainder.

  • Copy and paste the “EVENT TYPE … BACK” lines from source.txt into scrape-tq.iim.
  • Run scrape-tq.iim to browse the Queries and download/save the URLS and POSITIONS tables as *.csv files.
  • Separate the POSITIONS and URLS *.csv files into two folders.
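As a sketch of what the PowerGREP step produces, here is a small Python stand-in. The file names source.txt and scrape-tq.iim are the ones from the steps above; the regex and the EVENT template are the same, and the nth-of-type indices remain session-dependent:

```python
import re

# Regex mirroring the PowerGREP search: capture the id that follows
# wmt.common.display.Star('star_ for every Top Query row.
STAR_RE = re.compile(r"wmt\.common\.display\.Star\('star_([^']+)")

# The DIV:nth-of-type indices (13 and 3 here) change between sessions --
# verify them against the live page before running the macro.
TEMPLATE = (
    'EVENT TYPE=CLICK SELECTOR="#qd_{qid}" BUTTON=0\n'
    'EVENT TYPE=CLICK SELECTOR="#download-urls>DIV" BUTTON=0\n'
    'EVENT TYPE=CLICK SELECTOR="#top-search-queries>DIV:nth-of-type(13)>DIV:nth-of-type(3)>BUTTON" BUTTON=0\n'
    'EVENT TYPE=CLICK SELECTOR="#download-positions>DIV" BUTTON=0\n'
    'EVENT TYPE=CLICK SELECTOR="#top-search-queries>DIV:nth-of-type(13)>DIV:nth-of-type(3)>BUTTON" BUTTON=0\n'
    'BACK\n'
)

def build_macro(page_source: str) -> str:
    """Emit one EVENT-TYPE block per Top Query found in the page source."""
    return "".join(TEMPLATE.format(qid=qid) for qid in STAR_RE.findall(page_source))
```

To use it, read source.txt, pass the contents to build_macro, and write the result to scrape-tq.iim.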

Edit the Query-specific URLS and POSITIONS files

Deleting the Header

  • We use PowerGREP to delete the Header in the *.csv files, so it isn’t repeated when we merge the files into one:
  • In the URLS files, delete “Page,Impressions,Change,Clicks,Change,CTR,Change”
  • In the POSITIONS files, delete “Position in search results,Impressions,Clicks,CTR”
  • We will insert a modified Header (“New Header” = “Query,” & “Old Header”) into the merged file.
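If PowerGREP isn't to hand, the header deletion can be sketched in Python. The header strings are the ones listed above; the one-folder-per-file-type layout is the one from the previous step:

```python
from pathlib import Path

# Known first lines of the per-Query downloads (from GWT).
HEADERS = {
    "urls": "Page,Impressions,Change,Clicks,Change,CTR,Change",
    "positions": "Position in search results,Impressions,Clicks,CTR",
}

def strip_header(text: str, header: str) -> str:
    """Drop the header line if the file starts with it; leave anything else alone."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].rstrip("\r\n") == header:
        return "".join(lines[1:])
    return text

def strip_folder(folder: str, header: str) -> None:
    # Assumes one folder of URLS files and one of POSITIONS files,
    # as produced in the previous step.
    for path in Path(folder).glob("*.csv"):
        path.write_text(strip_header(path.read_text(encoding="utf-8"), header),
                        encoding="utf-8")
```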

Inserting the Query = FileName

URLS files

Using PowerGREP’s Search & Replace:

  • To deal with lines beginning with a \u0022 (double quote) in front of the URL: (^\u0022)(.*) –> %FILENAMENODOT%\u002C\1\2
  • To deal with lines with no leading quote: (^http)(.*) –> %FILENAMENODOT%\u002C\1\2

POSITIONS files

Using PowerGREP’s Search & Replace:

  • To deal with lines beginning with positions 1 to 5: (^[\d\w]+)([\d\w\u002B\u002C\u0025]+$) –> %FILENAMENODOT%\u002C\1\2
  • For lines beginning with “6 to 10”: (^6 to 10)([\d\w\u002B\u002C\u0025]+$) –> %FILENAMENODOT%\u002C\1\2
  • Then (^2nd page)([\d\w\u002B\u002C\u0025]+$) –> %FILENAMENODOT%\u002C\1\2
  • And (^3rd page )(\u002B)([\d\w\u002B\u002C\u0025]+$) –> %FILENAMENODOT%\u002C\1\2\3
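Both sets of substitutions amount to prefixing each data line with the Query taken from the file name (PowerGREP's %FILENAMENODOT%, i.e. the name minus its extension). A rough Python equivalent, assuming one Query per file and the line shapes described above:

```python
import re
from pathlib import Path

# URLS rows start with " or http; POSITIONS rows start with a position
# label (1..5, "6 to 10", "2nd page", "3rd page +"). This single pattern
# covers both file types; anything else (e.g. blank lines) is left alone.
DATA_LINE = re.compile(r'^(["h]|\d|2nd page|3rd page)')

def prefix_query(text: str, query: str) -> str:
    """Prepend the Query to every data line, mimicking %FILENAMENODOT%."""
    out = []
    for line in text.splitlines():
        out.append(f"{query},{line}" if DATA_LINE.match(line) else line)
    return "\n".join(out)

def prefix_file(path: Path) -> None:
    # path.stem is the file name without its extension -- the Query.
    path.write_text(prefix_query(path.read_text(encoding="utf-8"), path.stem),
                    encoding="utf-8")
```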

Merge the Query-specific URLS and POSITIONS files

  • Open a cmd window, navigate to the folder containing the URLS *.csv files, and run the command copy *.csv urls.csv. Do likewise in the folder containing the POSITIONS *.csv files to create positions.csv.
  • Import urls.csv and positions.csv into Excel as comma-delimited, UTF-8 format Data. Save as <start date>-<end date>-<filter>-urls-tq.xls and <start date>-<end date>-<filter>-positions-tq.xls.
  • Insert the Column Headings:
    • In <start date>-<end date>-<filter>-urls-tq.xls, insert “Query,Page,Impressions,Change,Clicks,Change,CTR,Change”
    • In <start date>-<end date>-<filter>-positions-tq.xls, insert “Query,Position,Impressions,Clicks,CTR”
  • Scan the *.xls files – usually there are a few entries out-of-whack – it’s usually clear how to fix them.
  • To decode the percent-encoded Unicode characters in some of the URLs, we found this site works well – http://code.cside.com/3rdpage/us/url/converter.html – paste in all the URLs, including the “%”s, decode, then paste the results back into Excel.
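The merge and the URL decoding can both be done locally as well. A minimal Python sketch (folder and output names are placeholders for the URLS and POSITIONS folders above):

```python
from pathlib import Path
from urllib.parse import unquote

def merge_csvs(folder: str, out_name: str) -> Path:
    """Concatenate every *.csv in folder into one file -- the Python
    equivalent of `copy *.csv urls.csv` in a cmd window."""
    folder = Path(folder)
    out = folder / out_name
    with out.open("w", encoding="utf-8") as dst:
        for path in sorted(folder.glob("*.csv")):
            if path != out:  # don't fold the output file into itself
                dst.write(path.read_text(encoding="utf-8"))
    return out

def decode_url(url: str) -> str:
    """Decode %xx escapes locally, instead of using the converter site."""
    return unquote(url)
```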

Repeat for the other Filters defined in Set the Parameters.

Leaving us with …

For each of the 3 Filters defined in Set the Parameters, we generate 4 Excel files:

  • <start date>-<end date>-<filter>-dtt-tq.xls – from “Download This Table”
  • <start date>-<end date>-<filter>-dcd-tq.xls – from “Download Chart Data”
  • <start date>-<end date>-<filter>-urls-tq.xls – from scraping/merging “Top Queries” URLS tables
  • <start date>-<end date>-<filter>-positions-tq.xls – from scraping/merging “Top Queries” POSITIONS tables

Also see Re-presenting CTRs as a function of Query x Position on SERP.

allenpg