All posts by allenpg

iMacros Javascript Scripting Interface

Automate complex tasks: iMacros commands refer to web page elements, so any programming logic must be put into a script that then uses iMacros to automate the website. For this purpose iMacros for Firefox contains a built-in Javascript Scripting Interface, which runs directly inside the browser.

The following information focuses on this built-in Javascript Scripting Interface. The supported commands are described below.

Note that the syntax of the regular, commercial Windows Web Scripting Interface and the built-in Firefox Javascript Scripting Interface is identical (except where explicitly noted). Therefore they use the same documentation.


By default each Javascript step is shown during replay. This option is useful for testing and debugging, but it slows down the Javascript execution artificially. To run Javascript at its normal (very fast) speed please uncheck this option.


Javascript code running inside iMacros. The //imacros-js:showsteps yes/no comment at the top of your Javascript file (including the //) overrides the global setting of the “Show Javascript” checkbox in the options dialog.

Examples: iMacros for Firefox automatically installs the SI-Send-Macro-Code.js - View Script Source Code:

SI-Send-Macro-Code.js

Sample Javascript script for use with iMacros for Firefox.

 /*Simple send code example */
 var MyMacroCode
 var jsNewLine="\n"
 MyMacroCode = "CODE:"
 var i
 
 MyMacroCode = MyMacroCode+"URL GOTO=http://www.iopus.com" + jsNewLine
 MyMacroCode = MyMacroCode+"URL GOTO=http://forum.iopus.com"
 iimDisplay("Send Macro via iimPlay")
 iimPlay(MyMacroCode)
 
 /*Some different ways to do looping*/
 iimDisplay("For Loop")
 for (i = 1; i <= 2; i++)
 {
   iimDisplay("i="+i)
   iimPlay("CODE:URL GOTO=http://forum.iopus.com/viewtopic.php?t="+i*10)
 }
 
 iimDisplay("While Loop")
 var i=1;
 while (i<=2)
 {
   iimDisplay("i="+i)
   iimPlay("CODE:URL GOTO=http://forum.iopus.com/viewtopic.php?t="+i*100)
   i=i+1;
 }
 
 iimDisplay("Do...While Loop")
 i = 1;
 do
 {
   iimDisplay("i="+i)
   iimPlay("CODE:URL GOTO=http://forum.iopus.com/viewtopic.php?t="+i*1000)
   i++;
 }
 while (i <= 2)
   
 /*How to generate a random wait time*/
 var mydelay
 /*Generate a number between 1 and 10*/
 mydelay=Math.floor(10*Math.random())+1;
 iimDisplay("Random wait t="+mydelay)
 MyMacroCode = "CODE:"
 MyMacroCode = MyMacroCode+"URL GOTO=http://wiki.imacros.net" + jsNewLine
 MyMacroCode = MyMacroCode+"WAIT SECONDS=" + mydelay + jsNewLine
 MyMacroCode = MyMacroCode+"URL GOTO=http://wiki.imacros.net/iMacros_for_Firefox"
 iimPlay(MyMacroCode)
 
 iimDisplay("Script completed.")
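
The random wait can also be factored into small helpers. This is only a minimal sketch: randomInt and buildWaitMacro are hypothetical names that are not part of iMacros, and the sketch assumes you want a whole number of seconds between two inclusive bounds.

```javascript
// Hypothetical helper (not part of iMacros): random whole number in [min, max].
function randomInt(min, max) {
  // Math.random() is in [0, 1), so the floor term yields 0 .. (max - min).
  return min + Math.floor(Math.random() * (max - min + 1));
}

// Hypothetical helper: macro code that waits between two page loads.
function buildWaitMacro(firstUrl, secondUrl, seconds) {
  return "CODE:" +
    "URL GOTO=" + firstUrl + "\n" +
    "WAIT SECONDS=" + seconds + "\n" +
    "URL GOTO=" + secondUrl;
}

var delay = randomInt(1, 10);
var waitMacro = buildWaitMacro("http://wiki.imacros.net",
                               "http://wiki.imacros.net/iMacros_for_Firefox",
                               delay);
// Inside iMacros for Firefox you would now call: iimPlay(waitMacro)
```

Keeping the string building in a function makes it easy to check the generated macro text before it is played.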

Important: iMacros macros must have the “.iim” file extension and Javascript scripts must have the “.js” file extension.

Note: Firefox can be remote controlled by the regular iMacros Scripting Interface via iimInit (“-fx”). The Javascript Scripting Interface does not include iimInit and iimExit, because they are not required. The Javascript runs inside the browser. The regular iMacros Scripting Interface is now available for Linux. It allows you to remote control Firefox and Chrome via Python.

Running multiple iMacros js scripts simultaneously

If you need to run more than one js script in iMacros for Firefox at the same time, you have to use a different Firefox profile for each script and make sure each opens as a different process.

Scripting Firefox

Mozilla Firefox, the complete browser, can be scripted with the commercial iMacros Enterprise Edition (= iMacros Scripting API). So while the free Javascript scripting runs inside Firefox, the API allows you to control Firefox from external software (C++, C#, Python, Perl,…). For details, see the chapter with the iimOpen command.

iimDisplay()

Displays a short message in the iMacros browser. A typical usage would be to distinguish several running iMacros Browsers or display information on the current position within the script.

Syntax


int ret_code = iimDisplay ( String message [, int timeout] ) 

Parameters

  • String message
    The message that is to be displayed in the iMacros Browser
    -or-
    #HIDEDISPLAY# – hides the message box
    #KIOSKMODE# – enables kiosk mode
    #KIOSKMODEOFF# – disables kiosk mode
  • int timeout
    The optional timeout value determines when the Scripting Interface returns a timeout error if the command is not completed in time. The default value is 10 seconds.

Examples

Visual Basic Script example:

Dim imacros1, imacros2, iret 

Set imacros1 = CreateObject("imacros") 
iret = imacros1.iimInit() 
iret = imacros1.iimDisplay("This is the 1st iMacros Browser")   

Set imacros2 = CreateObject("imacros") 
iret = imacros2.iimInit() 
iret = imacros2.iimDisplay("This is the 2nd iMacros Browser")
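
Inside iMacros for Firefox the same command is called directly as iimDisplay(). The following is a minimal sketch of progress messages: showProgress is a hypothetical helper, and the fallback stub (including its return value of 1) is an assumption that only exists so the snippet also runs outside iMacros.

```javascript
// Use the real iimDisplay inside iMacros; otherwise fall back to a stub.
// The stub and its return value are assumptions for illustration only.
var display = (typeof iimDisplay === "function")
  ? iimDisplay
  : function (message) { console.log("[iimDisplay] " + message); return 1; };

// Hypothetical helper: show the current position within the script.
function showProgress(step, total, label) {
  var message = "Step " + step + " of " + total + ": " + label;
  display(message);
  return message;
}

showProgress(1, 2, "open start page");
showProgress(2, 2, "extract data");
```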


In iMacros for Chrome, if the sidebar is not available (e.g. if you start the browser from the Scripting Interface API or run macros from the bookmarks menu), errors and iimDisplay() messages are shown in a desktop notification pop-up window.

iimSet()

Defines variables for use inside the macro and assigns values to them. There are limitations as to what variables you can set using this command. You can set all built-in variables which you also can set via the command line. Additionally, you can set all user defined variables. After iimPlay all variables are erased. The return code is always 0.

Syntax

int ret_code = iimSet ( String VARNAME, String VARVALUE )

Parameters

  • String VARNAME
    A string defining which variable is to be set. The variable is created by iimSet. It does not have to be defined somewhere. Use VARNAME to create a user defined variable named {{VARNAME}} (case insensitive). Note: You can not use any of the built-in variables with iimSet.
  • String VARVALUE
    The value which is to be assigned to the variable.
    In contrast to TAG commands, blank spaces do not need to be replaced by <SP>. iimSet() takes care of that.

Examples

Loop over a number, for example to extract one table element after the other

Dim imacros, iret, i 
Set imacros = CreateObject("imacros") 
iret = imacros.iimInit() 
For i=0 To 4  
  ' You have to convert the value into a string! 
  iret = imacros.iimSet("myloop", CStr(i)) 
  iret = imacros.iimPlay("mymacro") 
Next

Note that variables defined with iimSet lose their values after each iimPlay. This is by design. If you want to use the same variables and values in another macro, you need to use iimSet again:


iret = imacros.iimSet("greeting", "hello") 
iret = imacros.iimPlay("1st-macro") 
 
iret = imacros.iimSet("greeting", "hello") 
iret = imacros.iimPlay("2nd-macro")
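
The same pattern can be sketched in Javascript. The helper playForEach and the stand-in object returned by makeFakeApi are hypothetical and not part of iMacros; the stand-in only imitates the behaviour described above (variables are erased after each iimPlay), and its success code of 1 is an assumption.

```javascript
// Hypothetical helper: play macroName once per value and set the macro
// variable before every run, because iimSet values do not survive iimPlay.
function playForEach(api, macroName, varName, values) {
  var codes = [];
  for (var i = 0; i < values.length; i++) {
    api.iimSet(varName, String(values[i]));   // pass the value as a string
    codes.push(api.iimPlay(macroName));
  }
  return codes;
}

// Stand-in for the real interface so the sketch runs outside iMacros.
function makeFakeApi() {
  var vars = {};
  var log = [];
  return {
    log: log,
    iimSet: function (name, value) { vars[name] = value; return 0; },
    iimPlay: function (macro) {
      log.push(macro + ":" + JSON.stringify(vars));
      vars = {};   // imitate the reset that happens after each iimPlay
      return 1;    // assumed success code
    }
  };
}

var fake = makeFakeApi();
playForEach(fake, "mymacro", "myloop", [0, 1, 2]);
console.log(fake.log.join("\n"));
// Inside iMacros for Firefox you would pass the real functions instead:
// playForEach({ iimSet: iimSet, iimPlay: iimPlay }, "mymacro", "myloop", [0, 1, 2]);
```

Passing the interface as an argument keeps the loop logic testable without a browser.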


iimPlay()

Plays a macro. After the macro has played, all options that have been set with the iimSet command are reset. Use iimGetLastExtract to get the extracted text. Upon the next iimPlay() call, internal variables such as !TIMEOUT_PAGE and !EXTRACT will also be reset to their default values.

There are two fundamentally different ways of playing a macro using the iimPlay command. The first is to specify the filename (without the extension) of the macro in the String macro parameter. The other is to generate macro code on-the-fly in your program, preceded by “CODE:”, and pass it directly to iimPlay via the String macro parameter (see note below).

Syntax

int ret_code = iimPlay ( String macro [, int timeout] )

Parameters

  • String macro
    Either the macro’s filename without the extension, a path to the macro file, or a string holding the macro code.

(1) iimPlay (“demo-download”) – If you just supply the macro name, iMacros looks for the file in the standard macro folder (as specified in the Options dialog).
(2) iimPlay (“c:\MyMacros\macro1.iim”) – Full path*
(3) iimPlay (“Test\macro1”) – Relative path* to the iMacros Macros folder
(4) iimPlayCode (“URL GOTO….”) (old: iimPlay (“CODE:URL GOTO….”)) – Code example and tips: see note below.

* Backslashes in the path need to be escaped when using Javascript or any other language that requires backslashes in paths to be escaped.
For example: “c:\\MyMacros\\macro1.iim”

  • int timeout
    The optional timeout value. If iimPlay does not return before this time span, the Scripting Interface returns a timeout error -3. No extraction data is returned in this case. The default value is 600 seconds. This is the timeout for the overall macro runtime. This value should not be confused with the several timeouts inside a macro. The iimPlay timeout is typically triggered by a browser crash, a browser freeze or if the macro runtime exceeds this value.
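
The backslash rule for paths mentioned above can be checked with a short Javascript sketch. countBackslashes is a hypothetical helper used only to make the difference visible.

```javascript
// In a Javascript string literal, "\\" stands for one backslash character.
var escapedPath = "c:\\MyMacros\\macro1.iim";

// Without escaping, "\M" and "\m" are read as plain "M" and "m",
// so the backslashes silently disappear from the path.
var brokenPath = "c:\MyMacros\macro1.iim";

// Hypothetical helper: count the backslashes actually stored in a string.
function countBackslashes(text) {
  return text.split("\\").length - 1;
}

console.log(escapedPath);                    // prints c:\MyMacros\macro1.iim
console.log(countBackslashes(escapedPath));  // prints 2
console.log(brokenPath);                     // prints c:MyMacrosmacro1.iim
```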

Error Handling

iimPlay returns a detailed error code for every problem encountered. Please see the Scripting Interface Return Codes and the general iMacros Error-Codes, which are transmitted via the iimPlay command back to the calling application.

The return codes of iimPlay can not only be used to deal with “big” issues such as web browser crashes, but are often simply used to react to missing elements on a website. If an element is not found on a website, the TAG command reports an error and iimPlay returns this error to the script. Example: If you extract book ISBN numbers, some books may not have an ISBN number and the TAG command reports a “not found” error.

The error codes of iimPlay are exactly the same as those you get from the iMacros Browser/IE/Firefox itself. In addition, there are Scripting Interface specific error codes that deal with unexpected errors, timeouts or browser crashes.
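
A minimal Javascript sketch of this kind of error handling follows. playAndCheck and fakeApi are hypothetical names, and the error code -1 and the error text used by the stand-in are made up for illustration; consult the return code lists for the real values.

```javascript
// Hypothetical helper: play a macro and report the outcome.
// api must provide iimPlay and iimGetLastError.
function playAndCheck(api, macro) {
  var code = api.iimPlay(macro);
  if (code < 0) {
    // Negative return codes signal an error; fetch the matching text.
    return { ok: false, code: code, message: api.iimGetLastError() };
  }
  return { ok: true, code: code, message: "" };
}

// Stand-in interface so the sketch runs outside iMacros. It pretends that
// the macro "missing-isbn" fails; the code -1 and the text are made up.
var fakeApi = {
  lastError: "",
  iimPlay: function (macro) {
    if (macro === "missing-isbn") {
      this.lastError = "element not found (made-up text)";
      return -1;
    }
    this.lastError = "";
    return 1;
  },
  iimGetLastError: function () { return this.lastError; }
};

var result = playAndCheck(fakeApi, "missing-isbn");
if (!result.ok) {
  console.log("Skipping this book: " + result.message);
}
```

Inside iMacros for Firefox the real functions would be passed instead, for example playAndCheck({ iimPlay: iimPlay, iimGetLastError: iimGetLastError }, "mymacro").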

Examples

Play a macro located in the Macros\ directory of your iMacros installation (Visual Basic Script example):

Dim imacros, iret 
Set imacros = CreateObject("imacros") 
iret = imacros.iimOpen() 
iret = imacros.iimPlay("mymacro") 

Play some on-the-fly generated code (Visual Basic Script example):

Dim imacros, iret, mycode, myURL 

myURL = "http://www.iopus.com"  

mycode = "URL GOTO=" + myURL + vbNewLine 
mycode = mycode + "TAG POS=1 TYPE=FONT ATTR=TXT:<SP><SP>Online<SP>Store" 

Set imacros = CreateObject("imacros") 
iret = imacros.iimOpen() 
iret = imacros.iimPlayCode(mycode)

Note

Relative path

  • You can pass a relative path to the iimPlay command. For example, if your macro is in a subfolder “test” of the iMacros Macros folder, you may use iimPlay(“test\yourmacro”). The same is valid for the iMacros for Firefox built-in Javascript Scripting Interface.

CODE:

  • The recommended method for playing a macro generated on-the-fly is to assign the entire macro to a single string and then use one call to iimPlayCode to play the macro. While it is possible to use multiple calls to iimPlayCode to play each line of your macro separately, keep in mind that each time you call iimPlay or iimPlayCode, all of the iMacros internal variables are reset, and this can produce undesired results if you call each line of your macro this way.
  • Several commands in a macro generated on-the-fly must be separated by a line break. Use vbNewLine or vbCrLf in Visual Basic, \r\n in C, C++ or C#, or \n in Javascript.
  • iimPlayCode is not yet supported in iMacros for Firefox. Please continue to use iimPlay(“CODE:…”) instead.
  • Use the iMacros Editor “Code Generator” (in the File menu) for converting your macro to inline code.
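
The single-string approach recommended above can be sketched in Javascript as follows. buildMacro is a hypothetical helper, not part of iMacros; the sketch uses the “CODE:” prefix because it targets iimPlay in iMacros for Firefox.

```javascript
// Hypothetical helper: join macro commands into one "CODE:" string so the
// whole macro is played with a single call.
function buildMacro(commands) {
  return "CODE:" + commands.join("\n");
}

var macroCode = buildMacro([
  "URL GOTO=http://www.iopus.com",
  "WAIT SECONDS=2",
  "URL GOTO=http://forum.iopus.com"
]);
console.log(macroCode);
// Inside iMacros for Firefox you would now call: iimPlay(macroCode)
```

Because all commands travel in one call, the internal variables are reset only once, after the whole macro has run.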

Drop-down list boxes

  • If you start a macro via iimPlay which contains a TAG TYPE=SELECT… statement and the specified value is not in the drop-down list, then the iimPlay command returns a -1700 error code. In the corresponding error message (see iimGetErrorText) the maximum index is given. You can use this value, for example, to always select the last entry of a changing drop-down list.

Playing iMacros for Firefox Javascript (.js) files

  • The version of iimPlay provided with the iMacros Enterprise Edition supports the playback of Javascript (.js) scripts in iMacros for Firefox. For example:

ret = iim1.iimOpen("-fx")
ret = iim1.iimPlay("MyScript.js")
  • The version of iimPlay provided with the built-in Javascript scripting interface in iMacros for Firefox only supports the playback of macro (.iim) files. However, there is a workaround as described in the following forum post:

iimGetLastExtract()

Name change: Please use iimGetExtract instead. See API enhancements for details.

Returns the contents of the !EXTRACT variable. If the last command was iimPlay and if EXTRACT is used inside a macro iimGetLastExtract returns the extracted text. If the EXTRACT command could not find the extraction anchor then an #EANF# (Extraction Anchor Not Found) message is returned. If there is no EXTRACT command in the macro which was just played then iimGetLastExtract returns an empty string (“”).

If several EXTRACT commands appear in one macro, the results are separated by the string [EXTRACT]. If complete tables were extracted, adjacent table elements are separated by the string #NEXT# and ends of table rows are delimited by the string #NEWLINE#.
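
The separators described above can be split with a few lines of Javascript. This is only a sketch: parseExtract is a hypothetical helper, the sample string is made up, and the sketch assumes that every table row, including the last one, ends with #NEWLINE#.

```javascript
// Hypothetical helper: split the raw extract string into separate results.
// - "#EANF#" (anchor not found) becomes null
// - table extracts become an array of rows, each row an array of cells
// - everything else stays a plain string
function parseExtract(raw) {
  return raw.split("[EXTRACT]").map(function (item) {
    if (item === "#EANF#") {
      return null;
    }
    if (item.indexOf("#NEXT#") === -1 && item.indexOf("#NEWLINE#") === -1) {
      return item;
    }
    return item.split("#NEWLINE#")
      .filter(function (row) { return row !== ""; })
      .map(function (row) { return row.split("#NEXT#"); });
  });
}

// Made-up sample: one text, one missing anchor, one 2x2 table.
var sample = "Moby Dick[EXTRACT]#EANF#[EXTRACT]a#NEXT#b#NEWLINE#c#NEXT#d#NEWLINE#";
console.log(JSON.stringify(parseExtract(sample)));
// prints ["Moby Dick",null,[["a","b"],["c","d"]]]
```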

Syntax

String extract = iimGetLastExtract ( [int index_of_extracted_text]   )

Parameters

Since version 6 this command supports the option to return the extracted information separately, so no further parsing and splitting is required:

iimGetLastExtract () – returns all extracted information at once

iimGetLastExtract (0) – returns all extracted information at once

iimGetLastExtract (1) – returns 1st extracted data

iimGetLastExtract (2) – returns 2nd extracted data (and so on)

Examples

Display the extracted results from a macro (Visual Basic Script example):

Dim imacros, iret 
Set imacros = CreateObject("imacros") 
iret = imacros.iimInit() 
iret = imacros.iimPlay("myextractmacro") 
MsgBox "The extract was: "+ vbNewline + _ 
  imacros.iimGetLastExtract() 
iret = imacros.iimExit()

See Also

iimInit, iimPlay, iimDisplay, iimExit, iimGetLastError, iimTakeBrowserScreenshot

iimGetLastError()

Returns the text associated with the last error.

Name change: Please use iimGetErrorText instead. See API enhancements for details.

Syntax

String err_message = iimGetLastError()

Parameters

None

Examples

Display a dialog if iMacros cannot be initialized (Visual Basic Script example):

Dim imacros, iret 
Set imacros = CreateObject("imacros") 
iret = imacros.iimInit() 
If iret < 0 Then 
  MsgBox "An error occurred: " + vbNewline + _ 
    imacros.iimGetLastError() 
End If

 

Integrate Python and Eclipse IDE

There are two main ways you can work with Python: through the command line or through an IDE. I’ve chosen the Eclipse IDE.

  1. Eclipse requires a Java Virtual Machine (JVM) – download and install the Java Development Kit.
  2. Download and install the 32-bit Kepler version of Eclipse.
  3. Install plug-ins to integrate Eclipse and Python:
    • Mylyn:
      1. Help -> Install New Software
      2. Beside “Work with” – Add “Mylyn” – “http://download.eclipse.org/mylyn/releases/latest”
      3. Select All and follow prompts to install – restart Eclipse
    • Pydev:
      1. Help -> Install New Software
      2. Beside “Work with” – Add “Pydev” – “http://pydev.org/updates”
      3. Select All and follow prompts to install – restart Eclipse
  4. Configure Pydev – within Eclipse, select Window -> Preferences -> Pydev -> Interpreters -> Python Interpreter, enter c:\Python34\python.exe in the top panel and click through the remaining prompts to finish the setup.

Setting up Python in Windows 8.1

Set up Python on Windows 8.1

1. Visit the official Python download page and grab the Windows installer. Choose the 32-bit version.

2. Run the installer and accept all the default settings, including the “C:\Python34” directory it creates.


3. Next, set the system’s PATH variable to include directories that include Python components and packages we’ll add later. To do this:

  • Open the Control Panel (you can find it using Search on the Charms Bar).
  • In the Control Panel, search for and open System.
  • In the dialog box, select Advanced System Settings.
  • In the next dialog, select Environment Variables.
  • In the User Variables section, edit the PATH statement to include this (if there is no PATH variable, click NEW to create one):
C:\Python34;C:\Python34\Lib\site-packages\;C:\Python34\Scripts\;

4. Now, you can open a command prompt (Charms Bar | Search | cmd) and type:

C:\> python

That will load the Python interpreter:

Python 3.4.1  etc etc
Type "help", "copyright", "credits" or "license" for more information.
>>>

Because of the settings you included in your PATH variable, you can now run this interpreter — and, more important, a script — from any directory on your system.

Press Control-Z plus Return to exit the interpreter and get back to a C: prompt.

Set up useful Python packages

setuptools and pip are installed with Python 3.4.1 and will cover most of your installation needs, so there is no need to add pip separately. Mechanize, Requests and BeautifulSoup are must-have utilities for web scraping, and we’ll add those next:

C:\> pip install mechanize
C:\> pip install requests
C:\> pip install beautifulsoup4

4. csvkit, which was covered here, is a great tool for dealing with comma-delimited text files. Add it:

C:\> pip install csvkit

You’re now set to get started using and learning Python under Windows 8.1. If you’re looking for a handy guide, start with the Official Python tutorial.

BUT FIRST, … install Eclipse IDE to support working in Python.

Eclipse & JVM

How to install Python 3.4.1 on CentOS 6

CentOS 6 ships with Python 2.6.6 and several critical system utilities, for example yum, will break if the default Python interpreter is upgraded. The trick is to install new versions of Python in /usr/local so that they can live side-by-side with the system version.

Execute all the commands below as root either by logging in as root or by using sudo.

Preparations – install prerequisites

In order to compile Python you must first install the development tools and a few extra libs. The extra libs are not strictly needed to compile Python but without them your new Python interpreter will be quite useless.

 

Things to consider

Before you compile and install Python there are a few things you should know and/or consider:

Unicode

Python has a long and complicated history when it comes to Unicode support. Since Python 3.3 the Unicode support has been completely rewritten, so in Python 3.4 strings are automatically stored using the most efficient encoding possible.

Shared library

You should probably compile Python as a shared library. If you compile Python as a shared library you must also tell it how to find the library. Our option:

  • Compile the path into the executable by adding this to the end of the configure command: LDFLAGS="-Wl,-rpath /usr/local/lib"

Use “make altinstall” to prevent problems

It is critical that you use make altinstall when you install your custom version of Python. If you use the normal make install you will end up with two different versions of Python in the filesystem, both named python. This can lead to problems that are very hard to diagnose.

Download, compile and install Python

Here are the commands to download, compile and install Python.

After running the commands above your newly installed Python interpreter will be available as /usr/local/bin/python3.4. The system version of Python 2.6.6 will continue to be available as /usr/bin/python, /usr/bin/python2 and /usr/bin/python2.6.

Setuptools + pip

Setuptools has replaced Distribute as the official package manager used for installing packages from the Python Package Index. Setuptools and pip are installed with Python 3.4.1. pip builds on top of Setuptools and provides a few extra functions that are useful when you manage your packages.

The packages will end up in /usr/local/lib/pythonX.Y/site-packages/ (where X.Y is the Python version).

What’s next?

Since you are using Python 3.4 you don’t need to install virtualenv because that functionality is already built in.

Each isolated Python environment (also called sandbox) can have its own Python version and packages. This is very useful when you work on multiple projects or on different versions of the same project.

Create your first isolated Python environment

When you use pyvenv to create a sandbox you must install setuptools and pip inside the sandbox. You can reuse the ez_setup.py file you downloaded earlier and just run it after you activate your new sandbox.

Unix (wget)

Most Linux distributions come with wget.

Download ez_setup.py and run it using the target Python version. The script will download the appropriate version and install it for you:

> wget https://bootstrap.pypa.io/ez_setup.py -O - | python

Note that you may need to invoke the command with superuser privileges to install to the system Python:

> wget https://bootstrap.pypa.io/ez_setup.py -O - | sudo python

Alternatively, Setuptools may be installed to a user-local path:

> wget https://bootstrap.pypa.io/ez_setup.py -O - | python - --user

Unix including Mac OS X (curl)

If your system has curl installed, follow the wget instructions but replace wget with curl and -O with -o. For example:

> curl https://bootstrap.pypa.io/ez_setup.py -o - | python

Advanced Installation

For more advanced installation options, such as installing to custom locations or prefixes, download and extract the source tarball from Setuptools on PyPI and run setup.py with any supported distutils and Setuptools options. For example:

setuptools-x.x$ python setup.py install --prefix=/opt/setuptools

Use --help to get a full options list, but we recommend consulting the EasyInstall manual for detailed instructions, especially the section on custom installation locations.

PHP, jQuery, Javascript Notes

.serializeArray() – jQuery

The .serializeArray() method creates a JavaScript array of objects, ready to be encoded as a JSON string. It operates on a jQuery object representing a set of form elements. The .serializeArray() method uses the standard W3C rules for successful controls to determine which elements it should include; in particular the element cannot be disabled and must contain a name attribute. No submit button value is serialized since the form was not submitted using a button. Data from file select elements is not serialized. This method can act on a jQuery object that has selected individual form elements, such as <input>, <textarea>, and <select>. However, it is typically easier to select the <form> tag itself for serialization. This produces the following data structure (provided that the browser supports console.log):

[
  {
    name: "a",
    value: "1"
  },
  {
    name: "b",
    value: "2"
  },
  {
    name: "c",
    value: "3"
  },
  {
    name: "d",
    value: "4"
  },
  {
    name: "e",
    value: "5"
  }
]
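
Once you have an array in this shape, plain Javascript is enough to encode it or to reshape it. The sketch below uses hard-coded sample data instead of a real form, and toObject is a hypothetical helper, not a jQuery method.

```javascript
// Sample data in the shape shown above (normally the result of calling
// .serializeArray() on a form).
var fields = [
  { name: "a", value: "1" },
  { name: "b", value: "2" },
  { name: "c", value: "3" },
  { name: "d", value: "4" },
  { name: "e", value: "5" }
];

// Encode the array as a JSON string.
var json = JSON.stringify(fields);

// Hypothetical helper: collapse the array into a name -> value object.
// Repeated names (e.g. checkbox groups) are collected into arrays.
function toObject(pairs) {
  var result = {};
  pairs.forEach(function (pair) {
    if (!Object.prototype.hasOwnProperty.call(result, pair.name)) {
      result[pair.name] = pair.value;
    } else if (Array.isArray(result[pair.name])) {
      result[pair.name].push(pair.value);
    } else {
      result[pair.name] = [result[pair.name], pair.value];
    }
  });
  return result;
}

console.log(JSON.stringify(toObject(fields)));
// prints {"a":"1","b":"2","c":"3","d":"4","e":"5"}
```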

 

.append() – jQuery

The .append() method inserts the specified content as the last child of each element in the jQuery collection (to insert it as the first child, use .prepend()). With .append(), the selector expression preceding the method is the container into which the content is inserted. Similar to other content-adding methods such as .prepend() and .before(), .append() also supports passing in multiple arguments as input. Supported input includes DOM elements, jQuery objects, HTML strings, and arrays of DOM elements.

$(“el”) – element (tag name) selector – jQuery

An element to search for – by Name. Refers to the tagName of DOM nodes. JavaScript’s getElementsByTagName() function is called to return the appropriate elements when this expression is used.

$(“#el”) – id selector – jQuery

An ID to search for, specified via the id attribute of an element. For id selectors, jQuery uses the JavaScript function document.getElementById(), which is extremely efficient. Calling jQuery() (or $()) with an id selector as its argument will return a jQuery object containing a collection of either zero or one DOM element. Each id value must be used only once within a document. If more than one element has been assigned the same ID, queries that use that ID will only select the first matched element in the DOM. This behavior should not be relied on, however; a document with more than one element using the same ID is invalid. If the id contains characters like periods or colons you have to escape those characters with backslashes.
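
The escaping rule for periods and colons can be sketched as a small helper. idSelector is a hypothetical name, and the sketch only handles the two characters mentioned above; jQuery 3.0 and later also ship jQuery.escapeSelector() for the general case.

```javascript
// Hypothetical helper: build an id selector string, escaping periods and
// colons with a backslash so jQuery does not read them as class or
// pseudo-class markers.
function idSelector(id) {
  return "#" + id.replace(/([.:])/g, "\\$1");
}

console.log(idSelector("user.name"));   // prints #user\.name
console.log(idSelector("form:email"));  // prints #form\:email
// With jQuery loaded: $( idSelector("user.name") )
```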

 

jQuery()


Return a collection of matched elements either found in the DOM based on passed argument(s) or created by passing an HTML string.

jQuery( selector [, context ] ) Returns: jQuery

Description: Accepts a string containing a CSS selector which is then used to match a set of elements.

In the first formulation listed above, jQuery() — which can also be written as $() — searches through the DOM for any elements that match the provided selector and creates a new jQuery object that references these elements:

$( "div.foo" );

If no elements match the provided selector, the new jQuery object is “empty”; that is, it contains no elements and has a .length property of 0.

Selector Context

By default, selectors perform their searches within the DOM starting at the document root. However, an alternate context can be given for the search by using the optional second parameter to the $() function. For example, to do a search within an event handler, the search can be restricted like so:

$( "div.foo" ).click(function() {
  $( "span", this ).addClass( "bar" );
});

When the search for the span selector is restricted to the context of this, only spans within the clicked element will get the additional class.

Internally, selector context is implemented with the .find() method, so $( "span", this ) is equivalent to $( this ).find( "span" ).

Using DOM elements

The second and third formulations of this function create a jQuery object using one or more DOM elements that were already selected in some other way. When passing an array, each element must be a DOM element; mixed data is not supported. A jQuery object is created from the array elements in the order they appeared in the array; unlike most other multi-element jQuery operations, the elements are not sorted in DOM order.

A common use of single-DOM-element construction is to call jQuery methods on an element that has been passed to a callback function through the keyword this:

$( "div.foo" ).click(function() {
  $( this ).slideUp();
});

This example causes elements to be hidden with a sliding animation when clicked. Because the handler receives the clicked item in the this keyword as a bare DOM element, the element must be passed to the $() function before applying jQuery methods to it.

XML data returned from an Ajax call can be passed to the $() function so individual elements of the XML structure can be retrieved using .find() and other DOM traversal methods.

$.post( "url.xml", function( data ) {
  var $child = $( data ).find( "child" );
});

 

When a jQuery object is passed to the $() function, a clone of the object is created. This new jQuery object references the same DOM elements as the initial one.

As of jQuery 1.4, calling the jQuery() method with no arguments returns an empty jQuery set (with a .length property of 0). In previous versions of jQuery, this would return a set containing the document node.

At present, the only operations supported on plain JavaScript objects wrapped in jQuery are: .data(), .prop(), .on(), .off(), .trigger() and .triggerHandler(). The use of .data() (or any method requiring .data()) on a plain object will result in a new property on the object called jQuery{randomNumber} (e.g. jQuery123456789). Should .trigger( "eventName" ) be used, it will search for an “eventName” property on the object and attempt to execute it after any attached jQuery handlers are executed. It does not check whether the property is a function or not. To avoid this behavior, .triggerHandler( "eventName" ) should be used instead.

Chosen

Chosen (v1.1.0)

Chosen has a number of options and attributes that allow you to have full control of your select boxes.

Options

The following options are available to pass into Chosen on instantiation.

Example:

  $(".my_select_box").chosen({
    disable_search_threshold: 10,
    no_results_text: "Oops, nothing found!",
    width: "95%"
  });

Option (default) and description:

  • allow_single_deselect (default: false)
    When set to true on a single select, Chosen adds a UI element which selects the first element (if it is blank).
  • disable_search (default: false)
    When set to true, Chosen will not display the search field (single selects only).
  • disable_search_threshold (default: 0)
    Hide the search input on single selects if there are fewer than (n) options.
  • enable_split_word_search (default: true)
    By default, searching will match on any word within an option tag. Set this option to false if you want to only match on the entire text of an option tag.
  • inherit_select_classes (default: false)
    When set to true, Chosen will grab any classes on the original select field and add them to Chosen’s container div.
  • max_selected_options (default: Infinity)
    Limits how many options the user can select. When the limit is reached, the chosen:maxselected event is triggered.
  • no_results_text (default: “No results match”)
    The text to be displayed when no matching results are found. The current search is shown at the end of the text (e.g., No results match “Bad Search”).
  • placeholder_text_multiple (default: “Select Some Options”)
    The text to be displayed as a placeholder when no options are selected for a multiple select.
  • placeholder_text_single (default: “Select an Option”)
    The text to be displayed as a placeholder when no options are selected for a single select.
  • search_contains (default: false)
    By default, Chosen’s search matches starting at the beginning of a word. Setting this option to true allows matches starting from anywhere within a word. This is especially useful for options that include a lot of special characters or phrases in ()s and []s.
  • single_backstroke_delete (default: true)
    By default, pressing delete/backspace on multiple selects will remove a selected choice. When false, pressing delete/backspace will highlight the last choice, and a second press deselects it.
  • width (default: original select width)
    The width of the Chosen select box. By default, Chosen attempts to match the width of the select box you are replacing. If your select is hidden when Chosen is instantiated, you must specify a width or the select will show up with a width of 0.
  • display_disabled_options (default: true)
    By default, Chosen includes disabled options in search results with a special styling. Setting this option to false will hide disabled results and exclude them from searches.
  • display_selected_options (default: true)
    By default, Chosen includes selected options in search results with a special styling. Setting this option to false will hide selected results and exclude them from searches.
    Note: this is for multiple selects only. In single selects, the selected result will always be displayed.

Attributes

Certain attributes placed on the select tag or its options can be used to configure Chosen.

Example:

  <select class="my_select_box" data-placeholder="Select Your Options">
    <option value="1">Option 1</option>
    <option value="2" selected>Option 2</option>
    <option value="3" disabled>Option 3</option>
  </select>

Attribute and description:

  • data-placeholder
    The text to be displayed as a placeholder when no options are selected for a select. Defaults to “Select an Option” for single selects or “Select Some Options” for multiple selects.
    Note: This attribute overrides anything set in the placeholder_text_multiple or placeholder_text_single options.
  • multiple
    The attribute multiple on your select box dictates whether Chosen will render a multiple or single select.
  • selected, disabled
    Chosen automatically highlights selected options and disables disabled options.

Classes

Classes placed on the select tag can be used to configure Chosen.

Example:

  <select class="my_select_box chosen-rtl">
    <option value="1">Option 1</option>
    <option value="2">Option 2</option>
    <option value="3">Option 3</option>
  </select>
Classname Description
chosen-rtl

Chosen supports right-to-left text in select boxes. Add the class chosen-rtl to your select tag to support right-to-left text options.

Note: The chosen-rtl class will pass through to the Chosen select even when the inherit_select_classes option is set to false.

Triggered Events

Chosen triggers a number of standard and custom events on the original select field.

Example:

  $('.my_select_box').on('change', function(evt, params) {
    do_something(evt, params);
  });
Event Description
change

Chosen triggers the standard DOM event whenever a selection is made (it also sends a selected or deselected parameter that tells you which option was changed).

Note: in order to use change in the Prototype version, you have to include the Event.simulate class. The selected and deselected parameters are not available for Prototype.

chosen:ready Triggered after Chosen has been fully instantiated.
chosen:maxselected Triggered if max_selected_options is set and that limit is exceeded.
chosen:showing_dropdown Triggered when Chosen’s dropdown is opened.
chosen:hiding_dropdown Triggered when Chosen’s dropdown is closed.
chosen:no_results Triggered when a search returns no matching results.

Note: all custom Chosen events (those that begin with chosen:) also include the chosen object as a parameter.
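
As a rough sketch of how the selected or deselected parameter might be consumed, the helper below turns the params object into a log message. The name describeChange is hypothetical, and it assumes the parameter holds the changed option's value; the jQuery wiring is shown only as a comment because it needs a page with jQuery and Chosen loaded.

```javascript
// Hypothetical helper: summarize the params object that Chosen (jQuery version)
// passes to a 'change' handler. Assumption: params carries either a `selected`
// or a `deselected` key holding the value of the option that changed.
function describeChange(params) {
  if (params && params.selected !== undefined) {
    return 'selected: ' + params.selected;
  }
  if (params && params.deselected !== undefined) {
    return 'deselected: ' + params.deselected;
  }
  // Prototype version, or a change not caused by Chosen: no details available.
  return 'changed (no details)';
}

// In a page with jQuery and Chosen loaded, it could be wired up like this:
// $('.my_select_box').on('change', function (evt, params) {
//   console.log(describeChange(params));
// });

console.log(describeChange({ selected: '2' }));
console.log(describeChange({ deselected: '3' }));
```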

Triggerable Events

You can trigger several events on the original select field to invoke a behavior in Chosen.

Example:

  // tell Chosen that a select has changed
  $('.my_select_box').trigger('chosen:updated');
Event Description
chosen:updated This event should be triggered whenever Chosen’s underlying select element changes (such as a change in selected options).
chosen:activate This is the equivalent of focusing a standard HTML select field. When activated, Chosen will capture keypress events as if you had clicked the field directly.
chosen:open This event activates Chosen and also displays the search results.
chosen:close This event deactivates Chosen and hides the search results.

Simple PHP Scraper Class

I gave a presentation entitled “The SEOs Guide to Scraping Everything” on May 10th at the SEOmoz and SEER Interactive Meetup in Philadelphia, PA.  Since I only had 8 minutes to present, I figured I’d augment my presentation by providing a simple PHP scraper class that people can use (and extend) to get started with scraping.

You can download the scraper class here.

There’s a quick sample for how to use the scraper class below my slide deck from the meetup:

Using the Scraper:

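<?php
// The start of this snippet was lost; the instantiation and the empty proxy list are assumptions.
$scraper = new Eppie_Service_Scraper();
$proxies = array(); // optional list of proxy addresses
$scraper->setProxies($proxies);

$scraper->scrape('http://www.cnn.com');
var_dump($scraper);
?>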

And here’s the actual scraper class:

 

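class Eppie_Service_Scraper{

    // Declared up front: creating properties on the fly is deprecated as of PHP 8.2
    public $_proxies, $_url, $_keyword, $_curl_result, $_dom;
    public $_title, $_meta_desc, $_meta_kw, $_img_alt_pct;
    public $_h1 = array(), $_h2 = array(), $_h3 = array(), $_h4 = array(), $_h5 = array(), $_h6 = array();
    public $_img_alt = array(), $_strong = array();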

    public function __construct(){
        // set proxies -- you can add your own here or use the setProxies method
        $this->_proxies = array();
    }

    public function scrape($url)
    {
        $this->_url = $url;
        $dom = new DOMDocument();
        $proxy = $this->_pickProxy();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_REFERER, "");
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19");
        if($proxy){
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
            curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        }
        $body = curl_exec($ch);
        curl_close($ch);

        // curl_exec() returns false on failure, and loadHTML() cannot parse an empty body
        if($body === false || $body === '')
            return false;

        $this->_curl_result = $body;
        @$dom->loadHTML($body);
        $this->_dom = $dom;

        $this->_parseDOM();
    }

    public function setProxies($proxies)
    {
        $this->_proxies = $proxies;
    }

    private function _pickProxy()
    {
        if(count($this->_proxies) > 0)
            return $this->_proxies[rand(0, count($this->_proxies) - 1)];
        else return false;
    }

    public function setKeyword($keyword)
    {
        $this->_keyword = $keyword;
    }

    private function _parseDOM()
    {
        $xpath = new DOMXPath($this->_dom);
        $title = $xpath->query("//head/title");
        $meta_desc = $xpath->query("//head/meta[@name='description']/@content");
        $meta_kw = $xpath->query("//head/meta[@name='keywords']/@content");
        $h1 = $xpath->query("//h1");
        $h2 = $xpath->query("//h2");
        $h3 = $xpath->query("//h3");
        $h4 = $xpath->query("//h4");
        $h5 = $xpath->query("//h5");
        $h6 = $xpath->query("//h6");
        $img = $xpath->query("//img");
        $img_alt = $xpath->query("//img[@alt!='']/@alt");
        $strong = $xpath->query("//strong | //b");
        $body = $xpath->query("//body");

        if($title->length > 0)
            $this->_title = $title->item(0)->nodeValue;

        if($meta_desc->length > 0)
            $this->_meta_desc = $meta_desc->item(0)->nodeValue;

        if($meta_kw->length > 0)
            $this->_meta_kw = $meta_kw->item(0)->nodeValue;

        if($h1->length > 0)
        {
            for($i=0; $i < $h1->length; $i++)
                $this->_h1[] = $h1->item($i)->nodeValue;
        }

        if($h2->length > 0)
        {
            for($i=0; $i < $h2->length; $i++)
                $this->_h2[] = $h2->item($i)->nodeValue;
        }

        if($h3->length > 0)
        {
            for($i=0; $i < $h3->length; $i++)
                $this->_h3[] = $h3->item($i)->nodeValue;
        }

        if($h4->length > 0)
        {
            for($i=0; $i < $h4->length; $i++)
                $this->_h4[] = $h4->item($i)->nodeValue;
        }

        if($h5->length > 0)
        {
            for($i=0; $i < $h5->length; $i++)
                $this->_h5[] = $h5->item($i)->nodeValue;
        }

        if($h6->length > 0)
        {
            for($i=0; $i < $h6->length; $i++)
                $this->_h6[] = $h6->item($i)->nodeValue;
        }

        if($img_alt->length > 0)
        {
            for($i=0; $i < $img_alt->length; $i++)
                $this->_img_alt[] = $img_alt->item($i)->nodeValue;
        }

        // avoid a division by zero on pages without any <img> elements
        $this->_img_alt_pct = ($img->length > 0) ? ($img_alt->length / $img->length) * 100 : 0;

        if($strong->length > 0)
        {
            for($i=0; $i < $strong->length; $i++)
                $this->_strong[] = $strong->item($i)->nodeValue;
        }

    }

}

Using DOMXPath for Parsing Page Content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
If you only have a small set of HTML pages to scrape into a database, regexes might work fine; they suit a limited, one-time job (from a community wiki).

To apply XPath methods, we first download the content and then tidy it up so that it can be loaded into DOMDocument and DOMXPath objects.

Here I’ve summarized the basic steps for using the DOMXPath class:
  1. Initialize a DOMDocument class instance from page content (work with HTML as with XML)
  2. Initialize a DOMXPath class instance from DOMDocument class instance.
  3. Parse the DOMXPath object.

1. Initializing a DOMDocument class instance from page content

  • create a new DOMDocument class instance

    $DOM = new DOMDocument;
    libxml_use_internal_errors(true);

When using libxml_use_internal_errors(), be sure to clear the internal error buffer with libxml_clear_errors(). If you don’t, and you use this in a long-running process, you may find that all your memory is used up (sourced from here). See the ‘enable user error handling’ bullet point.
  • load the HTML text into the DOMDocument object

    if (!$DOM->loadHTML($page))

  • enable user error handling

    {
        $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message . '<br/>';
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }

Now the DOMDocument object (named ‘$DOM’) contains all the target text as an HTML DOM structure. It’s ready for different methods and properties to be applied.

2. Initializing a DOMXPath object from the DOMDocument object

  • Initialize a DOMXPath object for further parsing

    $xpath = new DOMXPath($DOM);

Now XPath methods can be applied to the content.

3. Parsing the DOMXPath object

As a test page I took the Blocks Testing Ground page and wrote some code that uses XPath to retrieve data.

    $case1 = $xpath->query('//*[@id="case1"]')->item(0);
    $query = 'div[not (@class="ads")]/span[1]';
    $entries = $xpath->query($query, $case1);
    foreach ($entries as $entry) {
        echo " {$entry->firstChild->nodeValue} <br /> ";
    }

How the libxml library reacts to malformed HTML

The libxml library gave no warning about malformed HTML outside the part of the DOM structure being parsed, yet it did report an error for malformed HTML that was the direct subject of the parse:

  • No warning for this case: <p><p><p>
  • For a missing bracket: <div prod='name1' <div …> and then for the extra opened tag: <div prod='name1' ><div> the library raised an exception for the DOMXPath ‘query’ method.

The whole Scraper Listing

<?php
$curl = curl_init('http://testing-ground.scraping.pro/blocks');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
curl_close($curl);
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
if (!$DOM->loadHTML($page))
    {
        $errors="";
        foreach (libxml_get_errors() as $error)  {
            $errors.=$error->message."<br/>";
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }
$xpath = new DOMXPath($DOM);
$case1 = $xpath->query('//*[@id="case1"]')->item(0);
$query = 'div[not (@class="ads")]/span[1]';
$entries = $xpath->query($query, $case1);
foreach ($entries as $entry) {
    echo " {$entry->firstChild->nodeValue} <br /> ";
}
?>

http://scraping.pro/5-best-xpath-cheat-sheets-and-quick-references/#more-5731

 

User-Sensitive PageRank

Problems with Conventional PageRank

Conventional PageRank computes authority weights of different HTML pages based on a Random Surfer Model.

In this model a steady-state distribution of the Markov chain is computed based on a transition matrix defined by a surfer that uniformly randomly follows the page out-links. To meet certain mathematical requirements, a blend of such a random surfer with uniform teleportation is typically used.

In such an approach, a surfer either:

  • randomly selects an out-link to follow with probability c, or
  • “gets bored” and randomly selects some unconnected page to jump (or teleport) to, with probability 1−c.

According to a conventional formulation, PageRank can be introduced as a vector defined over all nodes of a Web graph that satisfies the following PageRank linear system:

$$p = c P^{T} p + (1 - c) v. \qquad (1)$$

Here $P$ is a Markov transition matrix, and:

  • $P_{ij} = 1/\deg(i)$ if there is a link $i \to j$, and $P_{ij} = 0$ if there is no link $i \to j$
  • $c$ is a teleportation coefficient usually picked around 0.85 – 0.9
  • $v = (1/n, 1/n, \ldots, 1/n)$ is a uniform teleportation vector
  • $n$ is the total number of all Web pages

The system can be rewritten in a more straightforward component-wise way that explicitly uses the Web graph structure ($\deg(i)$ is the out-degree of node $i$):

$$p_j = c \sum_{i \to j} \frac{p_i}{\deg(i)} + (1 - c) v_j. \qquad (2)$$

Many iterative methods of solving PageRank equation (1) have been proposed. 1
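
One of the simplest such methods is plain power iteration. The sketch below applies it to equation (1) on a made-up three-page graph; the function name, the graph, and the fixed iteration count are assumptions for illustration, and dangling pages are not handled.

```javascript
// Minimal power-iteration sketch for equation (1) on a tiny, made-up graph.
// links[i] lists the pages that page i links to. Every page here has at least
// one out-link, so dangling pages are not handled.
function pageRank(links, c = 0.85, iterations = 100) {
  const n = links.length;
  const v = 1 / n;                       // entry of the uniform teleportation vector
  let p = new Array(n).fill(1 / n);      // start from the uniform distribution
  for (let it = 0; it < iterations; it++) {
    const next = new Array(n).fill((1 - c) * v);
    for (let i = 0; i < n; i++) {
      const share = c * p[i] / links[i].length;   // c * p_i / deg(i)
      for (const j of links[i]) next[j] += share;
    }
    p = next;
  }
  return p;
}

// Page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
const ranks = pageRank([[1, 2], [2], [0]]);
console.log(ranks.map(x => x.toFixed(3)));
```

Because every page in this toy graph has out-links, each iteration preserves the total rank of 1.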

And though the numerical properties of PageRank are relatively well studied, the usefulness of conventional formulations of PageRank in the relevancy ranking of query search results (one of its primary uses) is debatable. This is due in large part to the fact that some of the basic assumptions underlying widely used PageRank formulations are either flawed or not reflective of reality. Indeed, this fact is evidenced in the many attempts which have been made to adjust PageRank formulations to more realistic settings from the time of its introduction.

For example, the assumption that all the outgoing links in a Web page are followed by a surfer uniformly randomly is unrealistic. In reality, links can be classified into different groups, some of which are followed rarely if at all (e.g., disclaimer links). Such “internal links” are known to be less reliable and more self-promotional than “external links” yet are often weighted equally. Attempts to assign weights to links based on IR similarity measures have been made but are not widely used. 2

The uniform teleportation jump to all the Web pages is another example of an unrealistic assumption upon which conventional PageRank formulations are based. That is, nothing is further from reality than the assumption that users begin new sessions on major portals and obscure home pages with equal probability. Alternatively, it is sometimes assumed that teleportation is restricted to a trusted set of pages or sites. 3 However, this assumption is equally flawed in that it is intended to combat link spam rather than being reflective of real-world user behavior. An additional and less recognized problem is that attrition from different pages is very different and therefore cannot accurately be described by the same scalar coefficient 1−c.

Conventional PageRank formulations have another issue which relates to the manner in which they are used in practice. That is, because of the vast number of pages on the Web, PageRank computing is typically implemented with regard to aggregations of pages by site, host, or domain, also referred to as “blocked” PageRank. 4 In formulating viable blocked PageRank computations, links between pages have to be somehow aggregated to a block level. Unfortunately, most heuristics for performing this aggregation do not work well.

In view of the foregoing, new formulations of PageRank are needed which address these shortcomings.

User-Sensitive PageRank - P. Berkhin et al (2006)

Techniques are provided for generating an authority value of a first one of a plurality of documents and/or for generating a variety of ways to compute PageRank with reference to various types of data corresponding to actual user behavior.

Figure 1 is a flow diagram which illustrates this general idea. User data 100 which reflect the behavior and/or demographics of an underlying user population are collected and indexed (102). At least some of these data track the navigational behavior of the user population with regard to documents, pages, sites, and domains visited, and links selected. The user population, the computing context, and the techniques for collecting these data may vary considerably.

Figure 1.

First Component

A first component of the authority value is generated with reference to outbound links associated with the first document. The outbound links enable access to a first subset of the plurality of documents.

According to a more specific embodiment, generation of the first component includes assigning a weight to each of the outbound links. Each of the weights is derived with reference to a portion of the user data representing a frequency with which the corresponding outbound link was selected by a population of users.

Second Component

A second component of the authority value is generated with reference to a second subset of the plurality of documents. Each of the second subset of documents represents a potential starting point for a user session.

According to another specific embodiment, generation of the second component of the authority value includes generating a teleportation distribution which includes a term for each of the second subset of documents. Each of the terms is derived with reference to a portion of the user data representing relevance of the corresponding document among a population of users.

Third Component

A third component of the authority value is generated representing a likelihood that a user session initiated by any of a population of users will end with the first document.

The first, second, and third components of the authority value are combined to generate the authority value.

At least one of the first, second, and third components of the authority value is computed with reference to user data relating to at least some of the outbound links and the second subset of documents.

According to some embodiments, authority value components generated with reference to user data may also be generated with reference to conventional formulations for these components such as, for example, the components represented in equation (1). Moreover, these new and conventional components may be blended together to varying degrees to generate the authority value components of the present invention.

According to yet another embodiment, an authority value of a first one of a plurality of documents is generated. Text associated with each of a plurality of inbound links enabling access to the first document is identified. A weight is assigned to the text associated with each of the inbound links. Each of the weights is derived with reference to user data representing a frequency with which the corresponding inbound link was selected by a population of users. The authority value is generated with reference to the weights.

 

PageRank computation is performed for a plurality of pages and/or documents using a PageRank formulation constructed according to the present invention (104). Such PageRank formulations include at least one component which is derived with reference to the user data. In addition, the PageRank computation may be performed for each page/document on the Web or at some higher level of aggregation (e.g., site, host, domain, etc.). The PageRank computations may then be employed in support of a wide variety of applications (106) such as, for example, in relevancy determinations for the ranking of search results in response to user queries. And because the set of pages, the connections between them, and user behavior may change over time, the user data collection and PageRank computations may be iterated (dashed line) to ensure that they reflect the most current conditions in the computing environment.

Various embodiments of the present invention may employ PageRank formulations which incorporate or make use of user data in a variety of ways which address one or more of the issues described above. For example, as noted above the assumption of uniform endorsement along all outward-bound links associated with a page is unrealistic, e.g., internal links (e.g., disclaimer links) are typically not equal to external links. To the contrary, users “vote” by their behavior in terms of the links they actually select. Moreover, the popularity of links selected is not static, but changes over time.

Therefore, according to various embodiments, empirical data corresponding to link selection behavior by users are employed to weight outbound links in a PageRank formulation such that this user behavior is taken into account. According to a specific embodiment, the number of users who browsed from page i to page j along a link connecting the two pages is employed to assign to the link a weight which reflects a likelihood that a user will move along the directed edge corresponding to the link. Additional details regarding exemplary techniques by which this weighting may be accomplished are provided in U.S. Pat. No. 6,792,419 for System And Method For Ranking Hyperlinked Documents Based On A Stochastic Backoff Processes, the entire disclosure of which is incorporated herein by reference for all purposes.

Because most pages have very little traffic associated with them, and the traffic they do have corresponds to a low confidence estimate of user intent, according to a specific embodiment of the invention, the terms in the Markov transition matrix of equation (1) may instead be derived as follows:

$$w_{ij} = \frac{1 + \alpha\, n_{i \to j}}{\deg(i) + \alpha \sum_{i \to j} n_{i \to j}} \qquad (3)$$

where $\alpha \ge 0$ reflects some Laplace smoothing factor, and $n_{i \to j}$ is the number of users following a particular link. It should be noted that coefficient $\alpha = 0$ corresponds to a conventional formulation of this component. Notice also that higher values of $n_{i \to j}$ represent a higher impact on $w_{ij}$, in agreement with the fact that higher values imply higher confidence.
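
A small numeric sketch may make equation (3) more concrete. It assumes that the sum in the denominator runs over all out-links of page i, so that the weights of one page add up to 1; the function name and the click counts are made up.

```javascript
// Sketch of equation (3): w_ij = (1 + alpha * n_ij) / (deg(i) + alpha * sum of n_ij).
// clickCounts[j] is the (made-up) number of users who followed link i -> j, with
// one entry per out-link of page i, so deg(i) is the number of entries.
// Assumption: the denominator sums the counts over all out-links of page i.
function linkWeights(clickCounts, alpha) {
  const targets = Object.keys(clickCounts);
  const deg = targets.length;
  const total = targets.reduce((sum, j) => sum + clickCounts[j], 0);
  const weights = {};
  for (const j of targets) {
    weights[j] = (1 + alpha * clickCounts[j]) / (deg + alpha * total);
  }
  return weights;
}

const counts = { about: 90, disclaimer: 0, partner: 10 };
const conventionalWeights = linkWeights(counts, 0);   // every link gets 1/deg(i)
const smoothedWeights = linkWeights(counts, 0.1);     // popular links gain weight
console.log(conventionalWeights, smoothedWeights);
```

With alpha = 0.1 the rarely followed disclaimer link keeps a small non-zero weight instead of dropping to zero, which is the point of the smoothing.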

While equation (3) does incorporate some measure of the likelihood that specific links will be selected by users, more specific embodiments of the invention are contemplated which reflect further refinement of the underlying assumptions. That is, for example, users are not equal. Rather, they are part of a social network in which different weights can be assigned to different users based on a variety of factors. In addition, because the popularity of pages and links change over time, the incorporation of one or more recency factors into the PageRank formulation may be desirable. Third, the use of user data enables the creation of a targeted PageRank by aggregating user behavior over a particular user segment as defined by demographics, behavioral characteristics, user profile, etc.

According to a more specific embodiment of the invention, these refinements result in the following generalization of equation (3), in which $u$ denotes a user and $S$ stands for a particular user segment:

$$w_{ij} = \frac{1 + \alpha \sum_{u \in S \,\cap\, u \in i \to j} f(u)}{\deg(i) + \alpha \sum_{u \in S \,\cap\, u \in i \to j} f(u)} \qquad (3A)$$

Here $u \in i \to j$ means that user $u$ followed link $i \to j$. According to one formulation, $u$ reflects user meta-data which may include, but are not limited to, weight, recency, tenure, and time spent on a page, thus yielding:

$$w_{ij} = \frac{1 + \alpha \sum_{u \in S \,\cap\, u \in i \to j} f(u_{\text{weight}}, u_{\text{recency}}, u_{\text{tenure}}, u_{\text{time spent on } j})}{\deg(i) + \alpha \sum_{u \in S \,\cap\, u \in i \to j} f(u_{\text{weight}}, u_{\text{recency}}, u_{\text{tenure}})}. \qquad (3B)$$

Yet another specific embodiment reflects a further generalization of this idea. That is, conditioning by a user segment may assume use of a step function that is equal to one for users within $S$ and to zero for users outside $S$. However, it should be noted that this idea may be generalized to any probability distribution $\rho$ (in practice we can assign different significance levels to different user segments), thus yielding:

$$w_{ij} = \frac{1 + \alpha \sum_{u \in i \to j} \rho_u f(u)}{\deg(i) + \alpha \sum_{u \in i \to j} \rho_u f(u)}. \qquad (3C)$$
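
The segment-based variants can be sketched in the same style. In the code below, rho plays the role of the per-user weight in equation (3C), and a 0/1 step function for rho gives the segment restriction of equation (3A). The record layout, the field names, and the choice to sum the denominator over all out-links of the page are assumptions.

```javascript
// Sketch of equations (3A)/(3C). Each click record is one user who followed a
// link from page i to `target`. rho(user) is the segment weight (a 0/1 step
// function reproduces (3A)); f(user) is a per-user score. Field names are made up.
// Assumption: the denominator sums the scores over all out-links of page i.
function segmentWeights(targets, clicks, alpha, rho, f) {
  const score = Object.fromEntries(targets.map(j => [j, 0]));
  let total = 0;
  for (const { target, user } of clicks) {
    const s = rho(user) * f(user);
    score[target] += s;
    total += s;
  }
  const weights = {};
  for (const j of targets) {
    weights[j] = (1 + alpha * score[j]) / (targets.length + alpha * total);
  }
  return weights;
}

const clicks = [
  { target: 'a', user: { age: 25, weight: 1.0 } },
  { target: 'a', user: { age: 40, weight: 2.0 } },
  { target: 'b', user: { age: 30, weight: 0.5 } },
];
const under35 = u => (u.age < 35 ? 1 : 0);   // step function for segment S
const byWeight = u => u.weight;              // f(u) based only on a user weight
const segW = segmentWeights(['a', 'b', 'c'], clicks, 1, under35, byWeight);
console.log(segW);
```

The 40-year-old user falls outside the segment, so that click contributes nothing to the weight of link a.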

It should be noted that embodiments of the invention may work on any level of aggregation (i.e., for blocked PageRank formulations). For example, for a site or host level graph, a link between site $I$ and site $J$ exists if there are pages $i$ and $j$ connected by a hyperlink such that $i \in I$, $j \in J$. Now we can assign weights $W_{IJ}$ to the link $I \to J$ using a formula similar to any of (3)-(3C), with $N_{IJ}$ being a count of users who proceeded from any page $i$ in site $I$ to any page $j$ in site $J$.
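
As an illustration of this aggregation step, the sketch below rolls page-level click counts up into site-level counts. The records, the example hostnames, and the idea of using the URL hostname as the site are assumptions; how to treat links within one site is left open here.

```javascript
// Sketch: roll page-level click counts n_ij up into site-level counts N_IJ.
// The records and the siteOf mapping (hostname of the URL) are made up.
function aggregateBySite(pageClicks, siteOf) {
  const counts = new Map();
  for (const { from, to, n } of pageClicks) {
    const key = siteOf(from) + ' -> ' + siteOf(to);
    counts.set(key, (counts.get(key) || 0) + n);
  }
  return counts;
}

const hostOf = url => new URL(url).hostname;
const siteCounts = aggregateBySite([
  { from: 'http://a.example/home', to: 'http://b.example/docs', n: 7 },
  { from: 'http://a.example/blog', to: 'http://b.example/faq', n: 5 },
  { from: 'http://b.example/docs', to: 'http://a.example/home', n: 2 },
], hostOf);
console.log(siteCounts);
```

The resulting counts can then be fed into a site-level version of the weight formula.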

Because of “dangling” pages, i.e., pages having no out-links, and because of the requirement of a graph’s strong connectivity (i.e., the Markov transition matrix $P$ has to be irreducible), a degree of teleportation is added to the PageRank formulation of equation (1) as described above. And a typical teleportation distribution $v = (v_j)$ used in a conventional PageRank formulation is selected either uniformly or uniformly among a subset of trusted pages. As noted above, both approaches have shortcomings. That is, users do not start from obscure pages with the same probability as from popular hubs (e.g., think of the effect of bookmarks), and uniform teleportation actually leads to link-based spam. On the other hand, what can be trusted is in dispute, and a restrictive definition of trust defeats the purpose of creating a strongly-connected graph.

Therefore, according to various embodiments of the invention, user data are utilized to meaningfully estimate a teleportation distribution for a PageRank formulation. Consider different user sessions. Each session has a first or a starting page. Let $m_j$ be the count of how many times a page $j$ was a first page in a session. Then, according to a specific embodiment, a realistic teleportation distribution $v'$ can be defined as a blend of a more conventional distribution (e.g., $v$ as defined above) with a user-data-based component as follows:

$$v'_j = \beta v_j + (1 - \beta) \frac{m_j}{\sum_j m_j} \qquad (4)$$

where $0 \le \beta \le 1$ is a tuning parameter which adjusts the degree of blending of the two components. Again, it should be noted that $\beta = 1$ corresponds to a conventional formulation of this component. A higher $\beta$ means a larger degree of exploration and a lesser degree of relying on behavioral data. According to one exemplary embodiment, $\beta = 0.2$ is recommended as a reasonable tradeoff. It should be noted that equation (4) can be generalized in a manner similar to the generalization of equation (3) to equations (3A)-(3C) to incorporate user network utility, user tenure, recency, and time spent on a page.

Even if relatively few pages on the Web actually have a non-zero count $m_j$, the idea leads to a good teleportation distribution with a small $\beta$ accounting for a degree of exploration. The fact that only a small fraction of pages on the Web would have a significant teleportation component agrees with the well-known fact that a small portion of pages actually carries the bulk volume of PageRank distribution. Again, in deriving this teleportation distribution, we can take into account many other characteristics beyond frequency counts, as was done for equations (3A)-(3C).

The above-described embodiments suggest simple yet powerful frameworks for addressing two of the faulty assumptions underlying conventional PageRank formulations, i.e., uniform link weighting and uniform teleportation. According to further embodiments of the invention, another shortcoming of conventional PageRank formulations, i.e., the teleportation coefficient $c$, is addressed. Previously, it has been assumed that given a particular page, a random surfer “becomes bored” and jumps or “teleports” to a new session (i.e., at a new page) with uniform probability $(1 - c)$. In reality, uniformly assuming this dropout rate is a very bad approximation. Therefore, according to various embodiments of the invention, user data are utilized to estimate individual teleportation coefficients for specific pages or blocks.
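
Before turning to page-specific coefficients, the blended teleportation distribution of equation (4) can be sketched as follows. The session counts are made up, and falling back to the uniform value when no session data exist at all is an assumption that the text does not spell out.

```javascript
// Sketch of equation (4): v'_j = beta * v_j + (1 - beta) * m_j / sum of m_j.
// sessionStarts[j] is the (made-up) number of sessions that began on page j.
function teleportation(sessionStarts, beta = 0.2) {
  const n = sessionStarts.length;
  const total = sessionStarts.reduce((a, b) => a + b, 0);
  return sessionStarts.map(m => {
    const uniform = 1 / n;                               // conventional v_j
    // Assumption: with no session data at all, use the uniform value instead.
    const observed = total > 0 ? m / total : uniform;
    return beta * uniform + (1 - beta) * observed;
  });
}

const blended = teleportation([800, 150, 50, 0, 0]);
console.log(blended.map(x => x.toFixed(3)));
```

Pages that never start a session still receive the small uniform share beta / n, which is the "degree of exploration" mentioned above.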
Let $g_i$ be the fraction of sessions that end on page $i$ among all sessions containing $i$. Then, according to a specific embodiment, a page-specific estimate of a dropout rate may be given by:

$$(1 - c_i) = (1 - c)\gamma + (1 - \gamma) g_i \qquad (5)$$

where $c$ is a conventional teleportation coefficient, and $0 \le \gamma \le 1$ is a tuning parameter which enables varying degrees of blending of conventional teleportation coefficients with page-specific data. Here $\gamma = 1$ corresponds to a conventional formulation, with $\gamma = 0.25$ being a reasonable default.
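
Equation (5) is simple enough to check with a few lines of code. In this sketch the session counts are made up, and using the conventional dropout rate for pages without any recorded sessions is an assumption.

```javascript
// Sketch of equation (5): (1 - c_i) = (1 - c) * gamma + (1 - gamma) * g_i,
// where g_i = (sessions ending on page i) / (sessions containing page i).
function pageCoefficient(endedHere, containing, c = 0.85, gamma = 0.25) {
  // Assumption: a page with no recorded sessions gets the conventional rate.
  const g = containing > 0 ? endedHere / containing : 1 - c;
  const dropout = (1 - c) * gamma + (1 - gamma) * g;
  return 1 - dropout;   // this is c_i
}

const stickyPage = pageCoefficient(5, 100);     // few sessions end here
const exitPage = pageCoefficient(60, 100);      // many sessions end here
const conventional = pageCoefficient(60, 100, 0.85, 1);
console.log(stickyPage, exitPage, conventional);
```

A page on which many sessions end gets a lower c_i, so less of its rank flows along its out-links and more goes to teleportation.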

As discussed above, equations (3), (4), and (5) compute quantities related to PageRank formulations with reference to data corresponding to actual user behavior. In addition, further generalizations make it possible to account for other elements of user behavior such as, for example, user network utility, user recency, user tenure, time spent on a page, etc., e.g., equations (3A)-(3C). However, because the confidence levels for user behavior estimates relating to infrequently visited pages are low, some regularization may be desirable for specific embodiments of the invention.

It can be argued that the fraction of pages for which user data are available is small in comparison with the realm of all Web pages. Were it not so, the count of visits per page would serve as a good approximation of authority. Therefore, as described above, embodiments of the invention utilize authority propagation from conventional PageRank formulations while deriving out-link weights, teleportation vectors, and teleportation coefficients based on user behavioral data, thus blending these two types of data to varying degrees. Thus, embodiments of the invention provide more accurate PageRank authority of all pages, including pages that have little or no visitation.

Put another way, embodiments of the present invention consolidate conventional formulations applicable to any pages with new formulations applicable to the relatively few frequently visited, and therefore high-authority, pages. According to some of the exemplary formulations described herein, this consolidation may be achieved to varying degrees using a kind of Laplace smoothing represented in equations (3)-(5) by parameters $\alpha$, $\beta$, $\gamma$. For $\alpha = 0$ and $\beta = \gamma = 1$ the formulations are reduced to the conventional formulations represented by equation (1). On the other hand, if any one of these three parameters departs from these values, some level of blending occurs and is therefore within the scope of the invention. Thus, it should be noted that embodiments of the invention are contemplated in which these tuning parameters range in value such that only one, two, or all three of the corresponding components are in play.
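
The reduction to the conventional case can be checked directly by substituting these parameter values into equations (3), (4), and (5):

```latex
\begin{aligned}
\alpha = 0:&\quad w_{ij} = \frac{1 + 0 \cdot n_{i \to j}}{\deg(i) + 0 \cdot \sum_{i \to j} n_{i \to j}} = \frac{1}{\deg(i)} = P_{ij}, \\
\beta = 1:&\quad v'_j = 1 \cdot v_j + (1 - 1)\,\frac{m_j}{\sum_j m_j} = v_j, \\
\gamma = 1:&\quad (1 - c_i) = (1 - c) \cdot 1 + (1 - 1)\, g_i = 1 - c .
\end{aligned}
```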

Further refinements and applications of the present invention will now be described.

User Segment Personalized PageRank

Many attempts have been made to define personalized PageRank formulations. For example, by selecting a narrow set of topic specific pages and restricting teleportation to these pages, a topical PageRank formulation can be constructed. According to specific embodiments of the present invention, PageRank formulations (or individual components thereof) derived in accordance with the present invention may be flexibly and straightforwardly applied to or used with any type of personalized PageRank formulation.

For example, user segmentation is commonly used in targeted advertising. A user segment can be defined in terms of a user demographic profile (e.g., age, gender, income, etc.), user location, user behavior, etc. Any or all of equations (3)-(5) above can then be specified to reflect any such user segment in that they are constructed with reference to user data corresponding to an underlying population which, in turn, can be restricted to the relevant user segment. Moreover, as discussed above, such formulations can take into account any probabilistic distribution of user relevancy such as, for example, assigning weights to different users on the basis of an age range distribution.

Blocked PageRank

As discussed above, PageRank formulations are often applied to aggregations at the host, site, or domain levels, often referred to as blocked PageRank. Blocked PageRank is useful in acceleration of PageRank computing and in PageRank personalization. To construct a blocked PageRank formulation, parameters for a factorized directed graph are defined. For example, equal weights may be assigned for any link from one block to another as between two blocks having nodes connected by a directed edge. However, such a formulation would not distinguish between a pair of blocks connected by a single spurious link, and a pair of blocks connected by multiple direct edges. A variety of schemes have been developed to derive weights for block super-edges, but performance in practice has yielded mixed results.

However, because user behavior naturally aggregates at the various different “block” levels (i.e., site, host, domain, etc.), the various PageRank formulations of the present invention naturally scale up to the various block levels.

Overall PageRank Iterations

PageRank computing is related to the so-called simple power iteration method. This method depends on parameters such as edge probability distribution and teleportation described above. Equations (3)-(5) above and the generalization exemplified by equation (3A) thus lead to the following:

$$p_j^{(n+1)} = \sum_{i \to j} c_i w_{ij} p_i^{(n)} + (1 - c_i) v_j \qquad (6)$$

where transition weights $w_{ij}$ are defined by equation (3) or its analogs (e.g., equations (3A)-(3C)), teleportation distribution $v$ is defined by equation (4) or its analogs, and teleportation coefficients $c_i$ are defined by equation (5) or its analogs. And any derived iterative schemes that accelerate PageRank convergence and/or construct or compute blocked PageRank which employ any of the PageRank formulations or components thereof described herein are within the scope of the present invention.
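
A minimal sketch of this iteration on a made-up three-page graph is shown below. It is one possible reading rather than the patent's exact scheme: the teleportation term is interpreted as the sum over all pages i of (1 - c_i) p_i v_j, so that the total rank stays at 1. The matrix, coefficients, and function name are assumptions.

```javascript
// Sketch of the iteration in equation (6) on a made-up three-page graph.
// w[i][j]: transition weight for link i -> j (each row sums to 1, 0 = no link),
// c[i]: page-specific teleportation coefficient, v[j]: teleportation distribution.
// Assumption: the teleportation term is read as sum_i (1 - c_i) * p_i * v_j,
// which keeps the total rank equal to 1.
function iterate(w, c, v, steps = 200) {
  const n = v.length;
  let p = new Array(n).fill(1 / n);
  for (let s = 0; s < steps; s++) {
    const next = new Array(n).fill(0);
    let teleportMass = 0;
    for (let i = 0; i < n; i++) {
      teleportMass += (1 - c[i]) * p[i];
      for (let j = 0; j < n; j++) next[j] += c[i] * w[i][j] * p[i];
    }
    for (let j = 0; j < n; j++) next[j] += teleportMass * v[j];
    p = next;
  }
  return p;
}

const w = [
  [0, 0.7, 0.3],
  [0, 0, 1],
  [1, 0, 0],
];
const c = [0.85, 0.6, 0.9];
const v = [0.68, 0.2, 0.12];
const finalRanks = iterate(w, c, v);
console.log(finalRanks.map(x => x.toFixed(3)));
```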

Time Dynamics

In principle, PageRank should be periodically recomputed because the Web graph grows and its topology changes with time. In addition to this purely topological change, core pages with the same in-links and out-links still come in and out of fashion or significance over time. This is particularly important given that there is no “garbage collection” on the Web. Yet another advantage of the PageRank formulations of the present invention is that time dynamics are relatively straightforward to incorporate. For example, a discount procedure such as exponential averaging could readily be applied to user behavior counts to emphasize recent events and discount old ones. Not only does such a modification capture temporally dependent changes in page popularity, it also operates as a de facto Web garbage collection utility.
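
A minimal sketch of such a discount procedure follows. It applies exponential averaging to a chronological series of per-page behavior counts; the function name `decayed_counts`, the smoothing factor `alpha`, and the sample data are assumptions chosen for illustration.

```python
def decayed_counts(period_counts, alpha=0.5):
    """Exponentially average a chronological list of per-page count dicts.

    Each new period contributes with weight alpha; the accumulated
    history is discounted by (1 - alpha).
    """
    smoothed = {}
    for period in period_counts:
        pages = set(smoothed) | set(period)
        smoothed = {
            page: alpha * period.get(page, 0.0)
            + (1.0 - alpha) * smoothed.get(page, 0.0)
            for page in pages
        }
    return smoothed


# Hypothetical counts: "old" was popular once, "new" is popular now.
history = [{"old": 100}, {"new": 50}, {"new": 50}]
current = decayed_counts(history)
```

A page that stops receiving visits sees its count shrink by a factor of (1 - alpha) in every period, which is how the discounting acts as a form of garbage collection.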

Other Applications

As will be understood, the various PageRank formulations of the present invention may be used in conjunction with other information to evaluate page relevance in ranking search results according to any of a wide variety of techniques. However, it should be noted that the PageRank formulations of the present invention may be used in a wide variety of other applications. An example of one such application is controlling the manner in which a web crawling application crawls the Web. That is, the PageRank formulations of the present invention may be used to support decision making by a web crawler to determine whether and on which links associated with a given page to crawl.
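
One way a crawler might use such scores is sketched below: the out-links of a page are filtered by a minimum score and then visited from highest to lowest score. The function name `crawl_order`, the `threshold` parameter, and the sample scores are hypothetical, and a real crawler would weigh additional signals.

```python
import heapq


def crawl_order(links, scores, threshold=0.0):
    """Return links scoring at or above threshold, highest score first."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [
        (-scores.get(url, 0.0), url)
        for url in links
        if scores.get(url, 0.0) >= threshold
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Hypothetical scores for the out-links of one page.
page_scores = {"/a": 0.10, "/b": 0.50, "/c": 0.01}
order = crawl_order(["/a", "/b", "/c"], page_scores, threshold=0.05)
```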

Moreover, the basic principles described herein can be generalized beyond PageRank formulations. Consider anchor text, which is known to be one of the most useful features in ranking retrieved Web search results. It is usually assembled by aggregating the text strings of the different HTML href tags associated with incoming links. However, since incoming links differ in popularity, this text can be assigned weights derived according to the present invention. According to the invention, knowledge of user behavior may be incorporated into such a technique as follows. Given a target page j, anchor texts corresponding to incoming links i→j are weighted with user behavior scores w_ij computed as described above. As will be understood, various formulas may be used in relevancy ranking to aggregate hyperlink anchor text. Any of those formulas may be modified in accordance with the present invention to reflect link weights corresponding to user behavior in a manner similar to equations (3)-(3C).
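
The weighting of anchor text can be illustrated with a short sketch. It assumes the simplest possible aggregation, a per-term sum of the link weights, which is only one of the many formulas alluded to above; the function name `weighted_anchor_terms` and the sample weights are hypothetical.

```python
from collections import defaultdict


def weighted_anchor_terms(incoming):
    """Sum link weights per anchor-text term for one target page j.

    incoming: iterable of (anchor_text, w_ij) pairs for links i -> j.
    """
    term_scores = defaultdict(float)
    for text, weight in incoming:
        for term in text.lower().split():
            term_scores[term] += weight
    return dict(term_scores)


# Hypothetical incoming links with behavior-derived weights.
links_to_j = [
    ("cheap flights", 0.70),
    ("click here", 0.05),
    ("Flights to Rome", 0.25),
]
terms = weighted_anchor_terms(links_to_j)
top_term = max(terms, key=terms.get)
```

Here the rarely followed "click here" link contributes almost nothing, while terms from frequently followed links dominate the aggregate.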

Embodiments of the present invention may be employed to compute PageRank or similar formulations in any of a wide variety of computing contexts. For example, as illustrated in FIG. 2, implementations are contemplated in which the relevant population of users interacts with a diverse network environment via any type of computer (e.g., desktop, laptop, tablet, etc.) 202, media computing platforms 203 (e.g., cable and satellite set top boxes and digital video recorders), handheld computing devices (e.g., PDAs) 204, cell phones 206, or any other type of computing or communication platform.

And according to various embodiments, user data processed in accordance with the invention may be collected using a wide variety of techniques. For example, collection of data representing a user’s interaction with specific Web pages may be accomplished using any of a variety of well known mechanisms for recording a user’s online behavior. However, it should be understood that such methods of data collection are merely exemplary and that user data may be collected in many other ways. For example, user data may be collected when a user registers with, for example, a particular web site or service.

Once collected, the user data are processed and stored in some centralized manner. This is represented in FIG. 2 by server 208 and data store 210 which, as will be understood, may correspond to multiple distributed devices and data stores. The invention may also be practiced in a wide variety of network environments (represented by network 212) including, for example, TCP/IP-based networks, telecommunications networks, wireless networks, etc. In addition, the computer program instructions with which embodiments of the invention are implemented may be stored in any type of computer-readable media, and may be executed according to a variety of computing models including a client/server model, a peer-to-peer model, on a stand-alone computing device, or according to a distributed computing model in which various of the functionalities described herein may be effected or employed at different locations.

  1. For an introduction to this subject see A Survey on “PageRank” Computing, P. Berkhin, Internet Mathematics, Vol. 2, No. 1, pp. 73-120, 2005.
  2. See, for example, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, M. Richardson and P. Domingos, Advances in Neural Information Processing Systems 14, MIT Press, 2002.
  3. See, for example, Combating Web Spam with TrustRank, Z. Gyongyi, H. Garcia-Molina, J. Pedersen, in Proceedings of the 30th VLDB Conference, Toronto, Canada, ACM Press, 2004.
  4. See, for example, Exploiting the Block Structure of the Web for Computing PageRank, S. Kamvar, T. Haveliwala, C. Manning, G. Golub, Stanford University Technical Report, 2003.