iMacros Javascript Scripting Interface

Automate complex tasks: iMacros commands refer to web page elements, so any programming logic must be put into a script that then uses iMacros to automate the website. For this purpose iMacros for Firefox contains a built-in Javascript Scripting Interface, which runs directly inside the browser.

The following information focuses on this built-in Javascript Scripting Interface. The supported commands are described below.

Note that the syntax of the regular, commercial Windows Web Scripting Interface and the built-in Firefox Javascript Scripting Interface is identical (except where explicitly noted). Therefore they share the same documentation.


By default each Javascript step is shown during replay. This option is useful for testing and debugging, but it slows down the Javascript execution artificially. To run Javascript at its normal (very fast) speed please uncheck this option.


The //imacros-js:showsteps yes/no comment at the top of your Javascript file (including the //) overrides the global setting of the “Show Javascript” checkbox in the Options dialog.

Examples: iMacros for Firefox automatically installs the SI-Send-Macro-Code.js - View Script Source Code:

SI-Send-Macro-Code.js

Sample Javascript script for use with iMacros for Firefox.

 /*Simple send code example */
 var MyMacroCode
 var jsNewLine="\n"
 MyMacroCode = "CODE:"
 var i
 
 MyMacroCode = MyMacroCode+"URL GOTO=http://www.iopus.com" + jsNewLine
 MyMacroCode = MyMacroCode+"URL GOTO=http://forum.iopus.com"
 iimDisplay("Send Macro via iimPlay")
 iimPlay(MyMacroCode)
 
 /*Some different ways to do looping*/
 iimDisplay("For Loop")
 for (i = 1; i <= 2; i++)
 {
   iimDisplay("i="+i)
   iimPlay("CODE:URL GOTO=http://forum.iopus.com/viewtopic.php?t="+i*10)
 }
 
 iimDisplay("While Loop")
 var i=1;
 while (i<=2)
 {
   iimDisplay("i="+i)
   iimPlay("CODE:URL GOTO=http://forum.iopus.com/viewtopic.php?t="+i*100)
   i=i+1;
 }
 
 iimDisplay("Do...While Loop")
 i = 1;
 do
 {
   iimDisplay("i="+i)
   iimPlay("CODE:URL GOTO=http://forum.iopus.com/viewtopic.php?t="+i*1000)
   i++;
 }
 while (i <= 2)
   
 /*How to generate a random wait time*/
 var mydelay
 /*Generate a whole number between 1 and 10 (inclusive)*/
 mydelay=Math.floor(10*Math.random())+1;
 iimDisplay("Random wait t="+mydelay)
 MyMacroCode = "CODE:"
 MyMacroCode = MyMacroCode+"URL GOTO=http://wiki.imacros.net" + jsNewLine
 MyMacroCode = MyMacroCode+"WAIT SECONDS=" + mydelay + jsNewLine
 MyMacroCode = MyMacroCode+"URL GOTO=http://wiki.imacros.net/iMacros_for_Firefox"
 iimPlay(MyMacroCode)
 
 iimDisplay("Script completed.")

Important: iMacros macros must have the “.iim” file extension and Javascript scripts must have the “.js” file extension.
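The random wait in the sample script can be isolated into two small helpers. This is only a sketch: randomInt and waitCommand are hypothetical names that are not part of iMacros, and the code assumes plain Javascript with no iMacros functions, so it also runs outside the browser.

```javascript
// Hypothetical helper (not part of iMacros): random whole number in [min, max].
function randomInt(min, max) {
  // Math.random() is in [0, 1), so the result never exceeds max.
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Hypothetical helper: build the WAIT line used in the sample script.
function waitCommand(seconds) {
  return "WAIT SECONDS=" + seconds;
}

// Inside iMacros for Firefox this could be played as:
//   iimPlay("CODE:" + waitCommand(randomInt(1, 10)));
```

Using Math.floor rather than Math.round keeps every value in the range equally likely and rules out a wait of 0 seconds.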

Note: Firefox can be remote controlled by the regular iMacros Scripting Interface via iimInit (“-fx”). The Javascript Scripting Interface does not include iimInit and iimExit, because they are not required. The Javascript runs inside the browser. The regular iMacros Scripting Interface is now available for Linux. It allows you to remote control Firefox and Chrome via Python.

Running multiple iMacros js scripts simultaneously

If you need to run more than one js script in iMacros for Firefox at the same time, you have to use a different Firefox profile for each script and make sure each opens as a different process.

Scripting Firefox

Mozilla Firefox, the complete browser, can be scripted with the commercial iMacros Enterprise Edition (= iMacros Scripting API). So while the free Javascript scripting runs inside Firefox, the API allows you to control Firefox from external software (C++, C#, Python, Perl,…). For details, see the chapter with the iimOpen command.

iimDisplay()

Displays a short message in the iMacros browser. A typical usage would be to distinguish several running iMacros Browsers or display information on the current position within the script.

Syntax


int ret_code = iimDisplay ( String message [, int timeout] ) 

Parameters

  • String message
    The message that is to be displayed in the iMacros Browser
    -or-
    #HIDEDISPLAY# – hides the message box
    #KIOSKMODE# – enables kiosk mode
    #KIOSKMODEOFF# – disables kiosk mode
  • int timeout
    The optional timeout value determines when the Scripting Interface returns a timeout error if the command is not completed in time. The default value is 10 seconds.
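For the built-in Javascript interface, a progress message could be produced as sketched below. The function name showProgress and the message wording are assumptions made for illustration. The display function is passed in as a parameter so that the sketch does not depend on iimDisplay being defined.

```javascript
// `display` stands in for iimDisplay; passing it in keeps the sketch
// runnable outside iMacros as well.
function showProgress(display, current, total) {
  var message = "Step " + current + " of " + total;
  display(message);
  return message;
}

// Inside iMacros for Firefox:
//   showProgress(iimDisplay, 2, 5);
```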

Examples

Visual Basic Script example:

Dim imacros1, imacros2, iret 

Set imacros1 = CreateObject("imacros") 
iret = imacros1.iimInit() 
iret = imacros1.iimDisplay("This is the 1st iMacros Browser")   

Set imacros2 = CreateObject("imacros") 
iret = imacros2.iimInit() 
iret = imacros2.iimDisplay("This is the 2nd iMacros Browser")


In iMacros for Chrome, if the sidebar is not available (e.g. if you start the browser from the Scripting Interface API or run macros from the bookmarks menu), errors and iimDisplay() messages are shown in a desktop notification pop-up window.

iimSet()

Defines variables for use inside the macro and assigns values to them. There are limitations as to what variables you can set using this command. You can set all built-in variables which you also can set via the command line. Additionally, you can set all user defined variables. After iimPlay all variables are erased. The return code is always 0.

Syntax

int ret_code = iimSet ( String VARNAME, String VARVALUE )

Parameters

  • String VARNAME
    A string defining which variable is to be set. The variable is created by iimSet. It does not have to be defined somewhere. Use VARNAME to create a user defined variable named {{VARNAME}} (case insensitive). Note: You can not use any of the built-in variables with iimSet.
  • String VARVALUE
    The value which is to be assigned to the variable.
    In contrast to TAG commands, blank spaces must not be replaced by <SP>. iimSet() takes care of that.

Examples

Loop over a number, for example to extract one table element after the other

Dim imacros, iret, i 
Set imacros = CreateObject("imacros") 
iret = imacros.iimInit() 
For i=0 To 4  
  ' You have to convert the value into a string! 
  iret = imacros.iimSet("myloop", CStr(i)) 
  iret = imacros.iimPlay("mymacro") 
Next

Note that variables defined with iimSet lose their values after each iimPlay. This is by design. If you want to use the same variables and values in another macro, you need to use iimSet again:


iret = imacros.iimSet("greeting", "hello") 
iret = imacros.iimPlay("1st-macro") 
 
iret = imacros.iimSet("greeting", "hello") 
iret = imacros.iimPlay("2nd-macro")
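
The same set-before-every-play pattern might look as follows in Javascript. This is a sketch under assumptions: api is a hypothetical object that exposes iimSet and iimPlay, and playWithVariables and playForEachIndex are illustrative names, not part of the Scripting Interface.

```javascript
// `api` is a hypothetical wrapper exposing iimSet and iimPlay.
function playWithVariables(api, macroName, variables) {
  // Variables are erased after every iimPlay, so set them before each call.
  Object.keys(variables).forEach(function (name) {
    api.iimSet(name, String(variables[name])); // values are passed as strings
  });
  return api.iimPlay(macroName);
}

// Play the macro once per index, collecting the return codes.
function playForEachIndex(api, macroName, varName, count) {
  var codes = [];
  for (var i = 0; i < count; i++) {
    var vars = {};
    vars[varName] = i;
    codes.push(playWithVariables(api, macroName, vars));
  }
  return codes;
}

// Inside iMacros for Firefox, one possible call:
//   playForEachIndex({ iimSet: iimSet, iimPlay: iimPlay }, "mymacro", "myloop", 5);
```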

iimPlay()

Plays a macro. After the macro has played, all options that have been set with the iimSet command are reset. Use iimGetLastExtract to get the extracted text. Upon the next iimPlay() call, internal variables such as !TIMEOUT_PAGE and !EXTRACT will also be reset to their default values.

There are two fundamentally different ways of playing a macro using the iimPlay command. The first is to specify the filename (without the extension) of the macro in the String macro parameter. The other is to generate macro code on-the-fly in your program, preceded by “CODE:”, and pass it directly to iimPlay via the String macro parameter (see note below).

Syntax

int ret_code = iimPlay ( String macro [, int timeout] )

Parameters

  • String macro
    Either the macro’s filename without the extension, a string holding macro commands or the macro code.

(1) iimPlay (“demo-download”) – If you just supply the macro name, iMacros looks for the file in the standard macro folder (as specified in the Options dialog).
(2) iimPlay (“c:\MyMacros\macro1.iim”) – Full path*
(3) iimPlay (“Test\macro1”) – Relative path* to the iMacros Macros folder
(4) iimPlayCode (“URL GOTO….”) (old: iimPlay (“CODE:URL GOTO….”)) => Code Example, Tips: see note below.

* Backslashes in the path need to be escaped when using Javascript or any other language that requires backslashes in paths to be escaped.
For example: “c:\\MyMacros\\macro1.iim”

  • int timeout
    The optional timeout value. If iimPlay does not return before this time span, the Scripting Interface returns a timeout error -3. No extraction data is returned in this case. The default value is 600 seconds. This is the timeout for the overall macro runtime. This value should not be confused with the several timeouts inside a macro. The iimPlay timeout is typically triggered by a browser crash, a browser freeze or if the macro runtime exceeds this value.
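
The backslash escaping mentioned for the macro parameter can be illustrated in Javascript as below. The paths are the ones used in the list above. macroPath is a hypothetical helper, added only to show that the resulting string contains single backslashes.

```javascript
// In a Javascript string literal, "\\" produces one backslash character.
var fullPath = "c:\\MyMacros\\macro1.iim";
var relativePath = "Test\\macro1";

// Hypothetical helper: join folder and macro names with a backslash.
function macroPath(parts) {
  return parts.join("\\");
}

// Inside iMacros for Firefox:
//   iimPlay(fullPath);
//   iimPlay(macroPath(["Test", "macro1"]));
```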

Error Handling

iimPlay returns a detailed error code for every problem encountered. Please see the Scripting Interface Return Codes and the general iMacros Error-Codes, which are transmitted via the iimPlay command back to the calling application.

The return codes of iimPlay can not only be used to deal with “big” issues such as web browser crashes, but are often simply used to react to missing elements on a website. If an element is not found on a website, the TAG command reports an error, and iimPlay returns this error to the script. Example: If you extract book ISBN numbers, some books may not have an ISBN number and the TAG command reports a “not found” error.

The error codes of iimPlay are exactly the same as those you get from the iMacros Browser/IE/Firefox itself. In addition, there are Scripting Interface specific error codes that deal with unexpected errors, timeouts or browser crashes.
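
A script can branch on the return code roughly as sketched here. It assumes that negative codes signal errors (as in the VBScript example further below that tests iret < 0) and that -3 is the Scripting Interface timeout mentioned in the timeout parameter description. describeReturnCode is a hypothetical name, and the labels it returns are made up for illustration.

```javascript
// Hypothetical helper: turn an iimPlay return code into a short label.
function describeReturnCode(code) {
  if (code >= 0) {
    return "ok";            // assumption: non-negative means no error
  }
  if (code === -3) {
    return "timeout";       // timeout code named in the text above
  }
  return "error " + code;   // any other negative code, e.g. a TAG "not found"
}

// Inside iMacros for Firefox, one possible use:
//   var ret = iimPlay("mymacro");
//   if (describeReturnCode(ret) !== "ok") { iimDisplay(iimGetLastError()); }
```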

Examples

Play a macro located in the Macros\ directory of your iMacros installation (Visual Basic Script example):

Dim imacros, iret 
Set imacros = CreateObject("imacros") 
iret = imacros.iimOpen() 
iret = imacros.iimPlay("mymacro") 

Play some on-the-fly generated code (Visual Basic Script example):

Dim imacros, iret, mycode, myURL 

myURL = "http://www.iopus.com"  

mycode = "URL GOTO=" + myURL + vbNewLine 
mycode = mycode + "TAG POS=1 TYPE=FONT ATTR=TXT:<SP><SP>Online<SP>Store" 

Set imacros = CreateObject("imacros") 
iret = imacros.iimOpen() 
iret = imacros.iimPlayCode(mycode)

Note

Relative path

  • You have the option to use the relative path to the iimPlay command. For example, if your macro is in a subfolder “test” of the iMacros Macros folder, you may use iimPlay(“test\yourmacro”). The same is valid for the iMacros for Firefox built-in Javascript Scripting Interface.

CODE:

  • The recommended method for playing a macro generated-on-the fly is to assign the entire macro to a single string and then use one call to iimPlayCode to play the macro. While it is possible to use multiple calls to iimPlayCode to play each line of your macro separately, keep in mind that each time you call iimPlay or iimPlayCode, all of the iMacros internal variables are reset, and this can produce undesired results if you call each line of your macro this way.
  • Several commands in a macro generated on-the-fly must be separated by the CR (carriage return) symbol. These are vbNewLine or vbCrLf in Visual Basic or \r\n in C, C++ or C#.
  • iimPlayCode is not yet supported in iMacros for Firefox. Please continue to use iimPlay(“CODE:…”) instead.
  • Use the iMacros Editor “Code Generator” (in the File menu) for converting your macro to inline code.
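
The single-string approach recommended in this note could be sketched in Javascript as follows. buildMacroCode is a hypothetical helper name. The "\n" separator follows the sample script near the top of this page, and the "CODE:" prefix is used because iimPlayCode is not available in iMacros for Firefox.

```javascript
// Hypothetical helper: join macro commands into one "CODE:" string so the
// whole macro is played with a single iimPlay call.
function buildMacroCode(commands) {
  return "CODE:" + commands.join("\n");
}

var macro = buildMacroCode([
  "URL GOTO=http://www.iopus.com",
  "URL GOTO=http://forum.iopus.com"
]);

// Inside iMacros for Firefox:
//   iimPlay(macro);
```

Keeping the commands in an array makes it easy to add or remove lines without touching the separators.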

Drop-down list boxes

  • If you start a macro via iimPlay which contains a TAG TYPE=SELECT… statement and the specified value is not in the drop-down list, then the iimPlay command returns a -1700 error code. In the corresponding error message (see iimGetErrorText) the maximum index is given. You can use this value, for example, to always select the last entry of a changing drop-down list.

Playing iMacros for Firefox Javascript (.js) files

  • The version of iimPlay provided with the iMacros Enterprise Edition supports the playback of Javascript (.js) scripts in iMacros for Firefox. For example:

ret = iim1.iimOpen("-fx")
ret = iim1.iimPlay("MyScript.js")
  • The version of iimPlay provided with the built-in Javascript scripting interface in iMacros for Firefox only supports the playback of macro (.iim) files. However, there is a workaround as described in the following forum post:

iimGetLastExtract()

Name change: Please use iimGetExtract instead. See API enhancements for details.

Returns the contents of the !EXTRACT variable. If the last command was iimPlay and if EXTRACT is used inside a macro iimGetLastExtract returns the extracted text. If the EXTRACT command could not find the extraction anchor then an #EANF# (Extraction Anchor Not Found) message is returned. If there is no EXTRACT command in the macro which was just played then iimGetLastExtract returns an empty string (“”).

If several EXTRACT commands appear in one macro, the results are separated by the string [EXTRACT]. If complete tables were extracted, adjacent table elements are separated by the string #NEXT# and ends of table rows are delimited by the string #NEWLINE#.
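
Splitting such a result string might be done as in the sketch below. The separators come from the description above. The function names are hypothetical, and dropping empty rows is an assumption in case a table ends with a trailing #NEWLINE#.

```javascript
// Split the combined result of several EXTRACT commands.
function splitExtracts(raw) {
  return raw.split("[EXTRACT]");
}

// Turn one extracted table into an array of rows, each an array of cells.
function parseExtractedTable(extract) {
  return extract
    .split("#NEWLINE#")
    .filter(function (row) { return row.length > 0; }) // assumption: ignore empty rows
    .map(function (row) { return row.split("#NEXT#"); });
}

// True when the extraction anchor was not found.
function isAnchorMissing(extract) {
  return extract === "#EANF#";
}

// Inside iMacros for Firefox:
//   var tables = splitExtracts(iimGetLastExtract()).map(parseExtractedTable);
```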

Syntax

String extract = iimGetLastExtract ( [int index_of_extracted_text]   )

Parameters

Since version 6 this command supports the option to return the extracted information separately, so no further parsing and splitting is required:

iimGetLastExtract () – returns all extracted information at once

iimGetLastExtract (0) – returns all extracted information at once

iimGetLastExtract (1) – returns 1st extracted data

iimGetLastExtract (2) – returns 2nd extracted data (and so on)

Examples

Display the extracted results from a macro (Visual Basic Script example):

Dim imacros, iret 
Set imacros = CreateObject("imacros") 
iret = imacros.iimInit() 
iret = imacros.iimPlay("myextractmacro") 
MsgBox "The extract was: "+ vbNewline + _ 
  imacros.iimGetLastExtract() 
iret = imacros.iimExit()

See Also

iimInit, iimPlay, iimDisplay, iimExit, iimGetLastError, iimTakeBrowserScreenshot

iimGetLastError()

Returns the text associated with the last error.

Name change: Please use iimGetErrorText instead. See API enhancements for details.

Syntax

String err_message = iimGetLastError()

Parameters

None

Examples

Display a dialog if iMacros cannot be initialized (Visual Basic Script example):

Dim imacros, iret 
Set imacros = CreateObject("imacros") 
iret = imacros.iimInit() 
If iret < 0 Then 
  MsgBox "An error occurred: " + vbNewline + _ 
    imacros.iimGetLastError() 
End If

 

Integrate Python and Eclipse IDE

There are two main ways you can work with Python: through the command line or through an IDE. I’ve chosen the Eclipse IDE.

  1. Eclipse requires Java Virtual Machine (JVM) – Download and install the Java Developer Kit.
  2. Download and install the 32-bit Kepler version of Eclipse.
  3. Install plug-ins to integrate Eclipse and Python:
    • Mylyn:
      1. Help -> Install New Software
      2. Beside “Work with” – Add “Mylyn” – “http://download.eclipse.org/mylyn/releases/latest”
      3. Select All and follow prompts to install – restart Eclipse
    • Pydev:
      1. Help -> Install New Software
      2. Beside “Work with” – Add “Pydev” – “http://pydev.org/updates”
      3. Select All and follow prompts to install – restart Eclipse
  4. Configure Pydev – within Eclipse, select Window -> Preferences -> Pydev -> Interpreters -> Python Interpreter, enter c:\Python34\python.exe in the top panel, and click through the remaining prompts to finish the setup.

Setting up Python in Windows 8.1

1. Visit the official Python download page and grab the Windows installer. Choose the 32-bit version.

2. Run the installer and accept all the default settings, including the “C:\Python34” directory it creates.


3. Next, set the system’s PATH variable to include directories that include Python components and packages we’ll add later. To do this:

  • Open the Control Panel (you can find it using Search on the Charms Bar).
  • In the Control Panel, search for and open System.
  • In the dialog box, select Advanced System Settings.
  • In the next dialog, select Environment Variables.
  • In the User Variables section, edit the PATH statement to include this (if there is no PATH variable, click NEW to create one):
C:\Python34;C:\Python34\Lib\site-packages\;C:\Python34\Scripts\;

4. Now, you can open a command prompt (Charms Bar | Search | cmd) and type:

C:\> python

That will load the Python interpreter:

Python 3.4.1  etc etc
Type "help", "copyright", "credits" or "license" for more information.
>>>

Because of the settings you included in your PATH variable, you can now run this interpreter — and, more important, a script — from any directory on your system.

Press Control-Z plus Return to exit the interpreter and get back to a C: prompt.

Set up useful Python packages

setuptools and pip are installed with Python 3.4.1, and they will cover most of your installation needs. Mechanize, Requests and BeautifulSoup are must-have utilities for web scraping, and we’ll add those next:

C:\> pip install mechanize
C:\> pip install requests
C:\> pip install beautifulsoup4

csvkit, which was covered here, is a great tool for dealing with comma-delimited text files. Add it:

C:\> pip install csvkit

You’re now set to get started using and learning Python under Windows 8.1. If you’re looking for a handy guide, start with the Official Python tutorial.

BUT FIRST, … install Eclipse IDE to support working in Python.

Eclipse & JVM

How to install Python 3.4.1 on CentOS 6

CentOS 6 ships with Python 2.6.6 and several critical system utilities, for example yum, will break if the default Python interpreter is upgraded. The trick is to install new versions of Python in /usr/local so that they can live side-by-side with the system version.

Execute all the commands below as root either by logging in as root or by using sudo.

Preparations – install prerequisites

In order to compile Python you must first install the development tools and a few extra libs. The extra libs are not strictly needed to compile Python but without them your new Python interpreter will be quite useless.

 

Things to consider

Before you compile and install Python there are a few things you should know and/or consider:

Unicode

Python has a long and complicated history when it comes to Unicode support. Since Python 3.3 (and therefore in Python 3.4) the Unicode support has been completely rewritten and strings are automatically stored using the most efficient encoding possible.

Shared library

You should probably compile Python as a shared library. If you compile Python as a shared library you must also tell it how to find the library. Our option:

  • Compile the path into the executable by adding this to the end of the configure command: LDFLAGS="-Wl,-rpath /usr/local/lib"

Use “make altinstall” to prevent problems

It is critical that you use make altinstall when you install your custom version of Python. If you use the normal make install you will end up with two different versions of Python in the filesystem, both named python. This can lead to problems that are very hard to diagnose.

Download, compile and install Python

Here are the commands to download, compile and install Python.

After running the commands above your newly installed Python interpreter will be available as /usr/local/bin/python3.4. The system version of Python 2.6.6 will continue to be available as /usr/bin/python, /usr/bin/python2 and /usr/bin/python2.6.

Setuptools + pip

Setuptools has replaced Distribute as the official package manager used for installing packages from the Python Package Index. Setuptools and pip are installed with Python 3.4.1. pip builds on top of Setuptools and provides a few extra functions that are useful when you manage your packages.

The packages will end up in /usr/local/lib/pythonX.Y/site-packages/ (where X.Y is the Python version).

What’s next?

Since you are using Python 3.4 you don’t need to install virtualenv because that functionality is already built in.

Each isolated Python environment (also called sandbox) can have its own Python version and packages. This is very useful when you work on multiple projects or on different versions of the same project.

Create your first isolated Python environment

When you use pyvenv to create a sandbox you must install setuptools and pip inside the sandbox. You can reuse the ez_setup.py file you downloaded earlier and just run it after you activate your new sandbox.

Unix (wget)

Most Linux distributions come with wget.

Download ez_setup.py and run it using the target Python version. The script will download the appropriate version and install it for you:

> wget https://bootstrap.pypa.io/ez_setup.py -O - | python

Note that you may need to invoke the command with superuser privileges to install to the system Python:

> wget https://bootstrap.pypa.io/ez_setup.py -O - | sudo python

Alternatively, Setuptools may be installed to a user-local path:

> wget https://bootstrap.pypa.io/ez_setup.py -O - | python - --user

Unix including Mac OS X (curl)

If your system has curl installed, follow the wget instructions but replace wget with curl and -O with -o. For example:

> curl https://bootstrap.pypa.io/ez_setup.py -o - | python

Advanced Installation

For more advanced installation options, such as installing to custom locations or prefixes, download and extract the source tarball from Setuptools on PyPI and run setup.py with any supported distutils and Setuptools options. For example:

setuptools-x.x$ python setup.py install --prefix=/opt/setuptools

Use --help to get a full options list, but we recommend consulting the EasyInstall manual for detailed instructions, especially the section on custom installation locations.

PHP, jQuery, Javascript Notes

.serializeArray() – jQuery

The .serializeArray() method creates a JavaScript array of objects, ready to be encoded as a JSON string. It operates on a jQuery object representing a set of form elements. The .serializeArray() method uses the standard W3C rules for successful controls to determine which elements it should include; in particular the element cannot be disabled and must contain a name attribute. No submit button value is serialized since the form was not submitted using a button. Data from file select elements is not serialized. This method can act on a jQuery object that has selected individual form elements, such as <input>, <textarea>, and <select>. However, it is typically easier to select the <form> tag itself for serialization. This produces the following data structure (provided that the browser supports console.log):

[
{
name: "a",
value: "1"
},
{
name: "b",
value: "2"
},
{
name: "c",
value: "3"
},
{
name: "d",
value: "4"
},
{
name: "e",
value: "5"
}
]
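
Once the array has this shape, it can be processed with plain JavaScript. The sketch below hard-codes the structure shown above instead of calling .serializeArray(), so it runs without jQuery or a form. fieldsToObject is a hypothetical helper, and collecting repeated names into arrays is one possible design choice, not jQuery behavior.

```javascript
// Same shape as the .serializeArray() output shown above.
var fields = [
  { name: "a", value: "1" },
  { name: "b", value: "2" },
  { name: "c", value: "3" },
  { name: "d", value: "4" },
  { name: "e", value: "5" }
];

// Hypothetical helper: collapse name/value pairs into a single object.
// Repeated names (e.g. checkboxes) are collected into an array.
function fieldsToObject(list) {
  var result = {};
  list.forEach(function (field) {
    if (Object.prototype.hasOwnProperty.call(result, field.name)) {
      result[field.name] = [].concat(result[field.name], field.value);
    } else {
      result[field.name] = field.value;
    }
  });
  return result;
}

// Encode the array as a JSON string, as mentioned in the description.
var json = JSON.stringify(fields);
```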

 

.append() – jQuery

The .append() method inserts the specified content as the last child of each element in the jQuery collection (to insert it as the first child, use .prepend()). With .append(), the selector expression preceding the method is the container into which the content is inserted. Similar to other content-adding methods such as .prepend() and .before(), .append() also supports passing in multiple arguments as input. Supported input includes DOM elements, jQuery objects, HTML strings, and arrays of DOM elements.

$(“el”) – name selector – jQuery

An element to search for – by Name. Refers to the tagName of DOM nodes. JavaScript’s getElementsByTagName() function is called to return the appropriate elements when this expression is used.

$(“#el”) – id selector – jQuery

An ID to search for, specified via the id attribute of an element. For id selectors, jQuery uses the JavaScript function document.getElementById(), which is extremely efficient. Calling jQuery() (or $()) with an id selector as its argument will return a jQuery object containing a collection of either zero or one DOM element. Each id value must be used only once within a document. If more than one element has been assigned the same ID, queries that use that ID will only select the first matched element in the DOM. This behavior should not be relied on, however; a document with more than one element using the same ID is invalid. If the id contains characters like periods or colons you have to escape those characters with backslashes.
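
The escaping rule for periods and colons can be sketched as a small helper. idSelector is a hypothetical name. It handles only the two characters named above, so other special characters would need additional handling.

```javascript
// Hypothetical helper: build an id selector string, escaping periods and
// colons with a backslash so they are not read as class or pseudo-class syntax.
function idSelector(id) {
  return "#" + id.replace(/([.:])/g, "\\$1");
}

// With jQuery loaded, the result would be passed to $():
//   $( idSelector("form.field:1") );
```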

 

jQuery()


Return a collection of matched elements either found in the DOM based on passed argument(s) or created by passing an HTML string.

jQuery( selector [, context ] ) Returns: jQuery

Description: Accepts a string containing a CSS selector which is then used to match a set of elements.

In the first formulation listed above, jQuery() — which can also be written as $() — searches through the DOM for any elements that match the provided selector and creates a new jQuery object that references these elements:

$( "div.foo" );

If no elements match the provided selector, the new jQuery object is “empty”; that is, it contains no elements and has a .length property of 0.

Selector Context

By default, selectors perform their searches within the DOM starting at the document root. However, an alternate context can be given for the search by using the optional second parameter to the $() function. For example, to do a search within an event handler, the search can be restricted like so:

$( "div.foo" ).click(function() {
  $( "span", this ).addClass( "bar" );
});

When the search for the span selector is restricted to the context of this, only spans within the clicked element will get the additional class.

Internally, selector context is implemented with the .find() method, so $( "span", this ) is equivalent to $( this ).find( "span" ).

Using DOM elements

The second and third formulations of this function create a jQuery object using one or more DOM elements that were already selected in some other way. When passing an array, each element must be a DOM element; mixed data is not supported. A jQuery object is created from the array elements in the order they appeared in the array; unlike most other multi-element jQuery operations, the elements are not sorted in DOM order.

A common use of single-DOM-element construction is to call jQuery methods on an element that has been passed to a callback function through the keyword this:

$( "div.foo" ).click(function() {
  $( this ).slideUp();
});

This example causes elements to be hidden with a sliding animation when clicked. Because the handler receives the clicked item in the this keyword as a bare DOM element, the element must be passed to the $() function before applying jQuery methods to it.

XML data returned from an Ajax call can be passed to the $() function so individual elements of the XML structure can be retrieved using .find() and other DOM traversal methods.

$.post( "url.xml", function( data ) {
  var $child = $( data ).find( "child" );
});

 

When a jQuery object is passed to the $() function, a clone of the object is created. This new jQuery object references the same DOM elements as the initial one.

As of jQuery 1.4, calling the jQuery() method with no arguments returns an empty jQuery set (with a .length property of 0). In previous versions of jQuery, this would return a set containing the document node.

At present, the only operations supported on plain JavaScript objects wrapped in jQuery are: .data(), .prop(), .on(), .off(), .trigger() and .triggerHandler(). The use of .data() (or any method requiring .data()) on a plain object will result in a new property on the object called jQuery{randomNumber} (e.g. jQuery123456789). Should .trigger( "eventName" ) be used, it will search for an “eventName” property on the object and attempt to execute it after any attached jQuery handlers are executed. It does not check whether the property is a function or not. To avoid this behavior, .triggerHandler( "eventName" ) should be used instead.

Chosen

Chosen (v1.1.0)

Chosen has a number of options and attributes that allow you to have full control of your select boxes.

Options

The following options are available to pass into Chosen on instantiation.

Example:

  $(".my_select_box").chosen({
    disable_search_threshold: 10,
    no_results_text: "Oops, nothing found!",
    width: "95%"
  });

Option | Default | Description
allow_single_deselect | false | When set to true on a single select, Chosen adds a UI element which selects the first element (if it is blank).
disable_search | false | When set to true, Chosen will not display the search field (single selects only).
disable_search_threshold | 0 | Hide the search input on single selects if there are fewer than (n) options.
enable_split_word_search | true | By default, searching will match on any word within an option tag. Set this option to false if you want to only match on the entire text of an option tag.
inherit_select_classes | false | When set to true, Chosen will grab any classes on the original select field and add them to Chosen’s container div.
max_selected_options | Infinity | Limits how many options the user can select. When the limit is reached, the chosen:maxselected event is triggered.
no_results_text | “No results match” | The text to be displayed when no matching results are found. The current search is shown at the end of the text (e.g., No results match “Bad Search”).
placeholder_text_multiple | “Select Some Options” | The text to be displayed as a placeholder when no options are selected for a multiple select.
placeholder_text_single | “Select an Option” | The text to be displayed as a placeholder when no options are selected for a single select.
search_contains | false | By default, Chosen’s search matches starting at the beginning of a word. Setting this option to true allows matches starting from anywhere within a word. This is especially useful for options that include a lot of special characters or phrases in ()s and []s.
single_backstroke_delete | true | By default, pressing delete/backspace on multiple selects will remove a selected choice. When false, pressing delete/backspace will highlight the last choice, and a second press deselects it.
width | Original select width | The width of the Chosen select box. By default, Chosen attempts to match the width of the select box you are replacing. If your select is hidden when Chosen is instantiated, you must specify a width or the select will show up with a width of 0.
display_disabled_options | true | By default, Chosen includes disabled options in search results with a special styling. Setting this option to false will hide disabled results and exclude them from searches.
display_selected_options | true | By default, Chosen includes selected options in search results with a special styling. Setting this option to false will hide selected results and exclude them from searches. Note: this is for multiple selects only. In single selects, the selected result will always be displayed.

Attributes

Certain attributes placed on the select tag or its options can be used to configure Chosen.

Example:

  <select class="my_select_box" data-placeholder="Select Your Options">
    <option value="1">Option 1</option>
    <option value="2" selected>Option 2</option>
    <option value="3" disabled>Option 3</option>
  </select>

Attribute | Description
data-placeholder | The text to be displayed as a placeholder when no options are selected for a select. Defaults to “Select an Option” for single selects or “Select Some Options” for multiple selects. Note: This attribute overrides anything set in the placeholder_text_multiple or placeholder_text_single options.
multiple | The attribute multiple on your select box dictates whether Chosen will render a multiple or single select.
selected, disabled | Chosen automatically highlights selected options and disables disabled options.

Classes

Classes placed on the select tag can be used to configure Chosen.

Example:

  <select class="my_select_box chosen-rtl">
    <option value="1">Option 1</option>
    <option value="2">Option 2</option>
    <option value="3">Option 3</option>
  </select>
Classname Description
chosen-rtl

Chosen supports right-to-left text in select boxes. Add the class chosen-rtl to your select tag to support right-to-left text options.

Note: The chosen-rtl class will pass through to the Chosen select even when the inherit_select_classes option is set to false.

Triggered Events

Chosen triggers a number of standard and custom events on the original select field.

Example:

  $('.my_select_box').on('change', function(evt, params) {
    do_something(evt, params);
  });
Event Description
change

Chosen triggers the standard DOM event whenever a selection is made (it also sends a selected or deselected parameter that tells you which option was changed).

Note: in order to use change in the Prototype version, you have to include the Event.simulate class. The selected and deselected parameters are not available for Prototype.

chosen:ready Triggered after Chosen has been fully instantiated.
chosen:maxselected Triggered if max_selected_options is set and the user tries to exceed that total.
chosen:showing_dropdown Triggered when Chosen’s dropdown is opened.
chosen:hiding_dropdown Triggered when Chosen’s dropdown is closed.
chosen:no_results Triggered when a search returns no matching results.

Note: all custom Chosen events (those that begin with chosen:) also include the chosen object as a parameter.

Triggerable Events

You can trigger several events on the original select field to invoke a behavior in Chosen.

Example:

  // tell Chosen that a select has changed
  $('.my_select_box').trigger('chosen:updated');
Event Description
chosen:updated This event should be triggered whenever Chosen’s underlying select element changes (such as a change in selected options).
chosen:activate This is the equivalent of focusing a standard HTML select field. When activated, Chosen will capture keypress events as if you had clicked the field directly.
chosen:open This event activates Chosen and also displays the search results.
chosen:close This event deactivates Chosen and hides the search results.

Simple PHP Scraper Class

I gave a presentation entitled “The SEOs Guide to Scraping Everything” on May 10th at the SEOmoz and SEER Interactive Meetup in Philadelphia, PA.  Since I only had 8 minutes to present, I figured I’d augment my presentation by providing a simple PHP scraper class that people can use (and extend) to get started with scraping.

You can download the scraper class here.

There’s a quick sample for how to use the scraper class below my slide deck from the meetup:

Using the Scraper:

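<?php
// The opening lines were lost from the original listing and are reconstructed here;
// the proxy list is a placeholder (leave it empty to scrape without a proxy).
$proxies = array();
$scraper = new Eppie_Service_Scraper();
$scraper->setProxies($proxies);

$scraper->scrape('http://www.cnn.com');
var_dump($scraper);
?>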

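And here’s the actual scraper class:

class Eppie_Service_Scraper{

    // Declared up front: PHP 8.2+ deprecates creating properties dynamically.
    public $_proxies = array();
    public $_url, $_keyword, $_curl_result, $_dom;
    public $_title, $_meta_desc, $_meta_kw, $_img_alt_pct;
    public $_h1 = array(), $_h2 = array(), $_h3 = array();
    public $_h4 = array(), $_h5 = array(), $_h6 = array();
    public $_img_alt = array(), $_strong = array();

    public function __construct(){
        // set proxies -- you can add your own here or use the setProxies method
        $this->_proxies = array();
    }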

    public function scrape($url)
    {
        $this->_url = $url;
        $dom = new DOMDocument();
        $proxy = $this->_pickProxy();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_REFERER, "");
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_6) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19");
        if($proxy){
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
            curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
        }
        $body = curl_exec($ch);
        curl_close($ch);

        // curl_exec() returns false on failure; an empty body cannot be parsed either
        if($body === false || $body === ''){
            return;
        }

        $this->_curl_result = $body;
        @$dom->loadHTML($body);
        $this->_dom = $dom;

        $this->_parseDOM();
    }

    public function setProxies($proxies)
    {
        $this->_proxies = $proxies;
    }

    private function _pickProxy()
    {
        if(count($this->_proxies) > 0)
            return $this->_proxies[rand(0, count($this->_proxies) - 1)];
        else return false;
    }

    public function setKeyword($keyword)
    {
        $this->_keyword = $keyword;
    }

    private function _parseDOM()
    {
        $xpath = new DOMXPath($this->_dom);
        $title = $xpath->query("//head/title");
        $meta_desc = $xpath->query("//head/meta[@name='description']/@content");
        $meta_kw = $xpath->query("//head/meta[@name='keywords']/@content");
        $h1 = $xpath->query("//h1");
        $h2 = $xpath->query("//h2");
        $h3 = $xpath->query("//h3");
        $h4 = $xpath->query("//h4");
        $h5 = $xpath->query("//h5");
        $h6 = $xpath->query("//h6");
        $img = $xpath->query("//img");
        $img_alt = $xpath->query("//img[@alt!='']/@alt");
        $strong = $xpath->query("//strong | //b");
        $body = $xpath->query("//body");

        if($title->length > 0)
            $this->_title = $title->item(0)->nodeValue;

        if($meta_desc->length > 0)
            $this->_meta_desc = $meta_desc->item(0)->nodeValue;

        if($meta_kw->length > 0)
            $this->_meta_kw = $meta_kw->item(0)->nodeValue;

        if($h1->length > 0)
        {
            for($i=0; $i < $h1->length; $i++)
                $this->_h1[] = $h1->item($i)->nodeValue;
        }

        if($h2->length > 0)
        {
            for($i=0; $i < $h2->length; $i++)
                $this->_h2[] = $h2->item($i)->nodeValue;
        }

        if($h3->length > 0)
        {
            for($i=0; $i < $h3->length; $i++)
                $this->_h3[] = $h3->item($i)->nodeValue;
        }

        if($h4->length > 0)
        {
            for($i=0; $i < $h4->length; $i++)
                $this->_h4[] = $h4->item($i)->nodeValue;
        }

        if($h5->length > 0)
        {
            for($i=0; $i < $h5->length; $i++)
                $this->_h5[] = $h5->item($i)->nodeValue;
        }

        if($h6->length > 0)
        {
            for($i=0; $i < $h6->length; $i++)
                $this->_h6[] = $h6->item($i)->nodeValue;
        }

        if($img_alt->length > 0)
        {
            for($i=0; $i < $img_alt->length; $i++)
                $this->_img_alt[] = $img_alt->item($i)->nodeValue;
        }

        // avoid a division by zero on pages without any <img> elements
        $this->_img_alt_pct = ($img->length > 0) ? ($img_alt->length / $img->length) * 100 : 0;

        if($strong->length > 0)
        {
            for($i=0; $i < $strong->length; $i++)
                $this->_strong[] = $strong->item($i)->nodeValue;
        }

    }

}

Using DOMXPath for Parsing Page Content in PHP

The DOMXPath class is a convenient and popular means of parsing HTML content with XPath.
If you only have a small set of HTML pages to scrape data from and stuff into a database, regexes might work fine; they work well for a limited, one-time job (from a community wiki).

If we are going to apply XPath methods, then after downloading the content we had better clean it up so that it loads properly into DOM and DOMXPath objects.

Here I’ve summed up the basic steps of using the DOMXPath class:
  1. Initialize a DOMDocument class instance from page content (work with HTML as with XML)
  2. Initialize a DOMXPath class instance from DOMDocument class instance.
  3. Parse the DOMXPath object.

1. Initializing a DOMDocument class instance from page content

  • create a new DOMDocument class instance

    $DOM = new DOMDocument;
    libxml_use_internal_errors(true);

When using libxml_use_internal_errors(), be sure to clear the internal error buffer with libxml_clear_errors(). If you don’t, and you use this in a long-running process, you may find that all your memory is used up (this caveat comes from an external source). See the ‘enable user error handling’ bullet point.

  • load the HTML text into the DOMDocument object

    if (!$DOM->loadHTML($page))

  • enable user error handling

    {
        $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message . '<br/>';
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }

Now the DOMDocument object (named ‘$DOM’) contains all the target text as a HTML DOM structure. It’s ready for different methods and properties to be applied.

2. Initializing a DOMXPath object from the DOMDocument object

  • Initialize a DOMXPath object for further parsing

    $xpath = new DOMXPath($DOM);

Now XPath methods can be applied to the content.

3. Parsing the DOMXPath object

As a test page I took the Blocks Testing Ground page and wrote some code that uses XPath to retrieve data.

    $case1 = $xpath->query('//*[@id="case1"]')->item(0);
    $query = 'div[not (@class="ads")]/span[1]';
    $entries = $xpath->query($query, $case1);
    foreach ($entries as $entry) {
        echo " {$entry->firstChild->nodeValue} <br /> ";
    }

How the libxml library reacts to malformed HTML

The libxml library gave no warning about malformed HTML that was unrelated to the part of the DOM structure being parsed, yet it did issue an error for malformed HTML that was the direct subject of the parse:

  • No warning for this case: <p><p><p>
  • For a missing bracket, <div prod='name1' <div …>, and then for the extra opened tag, <div prod='name1' ><div>, the library issued an exception in the DOMXPath ‘query’ method.

The whole Scraper Listing

    <?php
    $curl = curl_init('http://testing-ground.scraping.pro/blocks');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $page = curl_exec($curl);
    if(curl_errno($curl)) // check for execution errors
    {
        echo 'Scraper error: ' . curl_error($curl);
        exit;
    }
    curl_close($curl);
    $DOM = new DOMDocument;
    libxml_use_internal_errors(true);
    if (!$DOM->loadHTML($page))
    {
        $errors = "";
        foreach (libxml_get_errors() as $error) {
            $errors .= $error->message . "<br/>";
        }
        libxml_clear_errors();
        print "libxml errors:<br>$errors";
        return;
    }
    $xpath = new DOMXPath($DOM);
    $case1 = $xpath->query('//*[@id="case1"]')->item(0);
    $query = 'div[not (@class="ads")]/span[1]';
    $entries = $xpath->query($query, $case1);
    foreach ($entries as $entry) {
        echo " {$entry->firstChild->nodeValue} <br /> ";
    }
    ?>

http://scraping.pro/5-best-xpath-cheat-sheets-and-quick-references/#more-5731

 

A Simple Guide to Five Normal Forms in Relational Database Theory

William Kent, “A Simple Guide to Five Normal Forms in Relational Database Theory”, Communications of the ACM 26(2), Feb. 1983, 120-125. Also IBM Technical Report TR03.159, Aug. 1981. Also presented at SHARE 62, March 1984, Anaheim, California. Also in A.R. Hurson, L.L. Miller and S.H. Pakzad, Parallel Architectures for Database Systems, IEEE Computer Society Press, 1989. [12 pp]


Copyright 1996 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.


William Kent
Sept 1982



1 INTRODUCTION
2 FIRST NORMAL FORM
3 SECOND AND THIRD NORMAL FORMS
    3.1 Second Normal Form
    3.2 Third Normal Form
    3.3 Functional Dependencies
4 FOURTH AND FIFTH NORMAL FORMS
    4.1 Fourth Normal Form
        4.1.1 Independence
        4.1.2 Multivalued Dependencies
    4.2 Fifth Normal Form
5 UNAVOIDABLE REDUNDANCIES
6 INTER-RECORD REDUNDANCY
7 CONCLUSION
8 ACKNOWLEDGMENT
9 REFERENCES


1 INTRODUCTION

The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date [4].

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

2 FIRST NORMAL FORM

First normal form [1] deals with the “shape” of a record type.

Under first normal form, all occurrences of a record type must contain the same number of fields.

First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn’t deal with records having a variable number of fields.
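
As a rough illustration only (the record layout, field names, and sample values below are hypothetical and do not come from the paper), a record with a variable-length repeating group can be flattened into records that all have the same fields:

```javascript
// Hypothetical unnormalized records: the number of skills varies per employee.
const unnormalized = [
  { employee: "Smith", skills: ["cook", "type"] },
  { employee: "Jones", skills: ["type"] },
];

// Flatten the repeating group: one fixed-shape record per (employee, skill).
function toFirstNormalForm(records) {
  return records.flatMap((r) =>
    r.skills.map((skill) => ({ employee: r.employee, skill }))
  );
}

const flat = toFirstNormalForm(unnormalized);
console.log(flat);
```

Every resulting record has exactly two fields, which is the "same number of fields" property that first normal form asks for.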

3 SECOND AND THIRD NORMAL FORMS

Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.

Under second and third normal forms, a non-key field must provide a fact about the key, the whole key, and nothing but the key. In addition, the record must satisfy first normal form.

We deal now only with “single-valued” facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase “Y is a fact about X” signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

3.1 Second Normal Form

Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------

The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:

  • The warehouse address is repeated in every record that refers to a part stored in that warehouse.
  • If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
  • Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
  • If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse’s address.

To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

-------------------------------  --------------------------------- 
| PART | WAREHOUSE | QUANTITY |  | WAREHOUSE | WAREHOUSE-ADDRESS |
====================-----------  =============--------------------

When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term “normalization” is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.
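
The retrieval cost described above can be sketched in a few lines. This is a minimal illustration, assuming in-memory arrays in place of record types; the part names, warehouse codes, and addresses are made up:

```javascript
// Normalized inventory: two record types (sample values are invented).
const stock = [
  { part: "bolt", warehouse: "W1", quantity: 100 },
  { part: "bolt", warehouse: "W2", quantity: 40 },
  { part: "nut", warehouse: "W1", quantity: 75 },
];
const warehouses = [
  { warehouse: "W1", address: "123 Main St." },
  { warehouse: "W2", address: "321 Center St." },
];

// Addresses of all warehouses stocking a given part: search the stock
// records, then connect each one to its matching warehouse record.
function addressesStocking(part) {
  const addressOf = new Map(warehouses.map((w) => [w.warehouse, w.address]));
  return stock
    .filter((s) => s.part === part)
    .map((s) => addressOf.get(s.warehouse));
}

console.log(addressesStocking("bolt")); // [ '123 Main St.', '321 Center St.' ]
```

Each address is stored once, so a change of address is a single update, at the price of the extra lookup step.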

3.2 Third Normal Form

Third normal form is violated when a non-key field is a fact about another non-key field, as in

------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------

The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT — in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:

  • The department’s location is repeated in the record of every employee assigned to that department.
  • If the location of the department changes, every such record must be updated.
  • Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
  • If a department has no employees, there may be no record in which to keep the department’s location.

To satisfy third normal form, the record shown above should be decomposed into the two records:

-------------------------  -------------------------
| EMPLOYEE | DEPARTMENT |  | DEPARTMENT | LOCATION |
============-------------  ==============-----------

To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

3.3 Functional Dependencies

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is “functionally dependent” on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can’t be two records having the same X value.
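
The definition can be turned into a simple check over a set of records. The sketch below is an assumption-laden illustration: the function name and sample data are invented, and it only tests the records it is given, whereas a functional dependency is a rule about all records that may ever be stored.

```javascript
// Returns true if the fields in `ys` are functionally dependent on the fields
// in `xs` within `records`: no two records share an X-value but differ in Y-value.
function holdsFunctionalDependency(records, xs, ys) {
  const seen = new Map();
  for (const r of records) {
    const x = JSON.stringify(xs.map((f) => r[f]));
    const y = JSON.stringify(ys.map((f) => r[f]));
    if (seen.has(x) && seen.get(x) !== y) return false;
    seen.set(x, y);
  }
  return true;
}

// Invented sample data in the EMPLOYEE / DEPARTMENT / LOCATION shape.
const employees = [
  { employee: "Art", department: "Sales", location: "New York" },
  { employee: "Bob", department: "Sales", location: "New York" },
  { employee: "Cal", department: "Research", location: "San Francisco" },
];

const fdHolds = holdsFunctionalDependency(employees, ["department"], ["location"]);

// One inconsistent record is enough to break the dependency.
const fdBroken = holdsFunctionalDependency(
  [...employees, { employee: "Dan", department: "Sales", location: "Boston" }],
  ["department"],
  ["location"]
);

console.log(fdHolds, fdBroken); // true false
```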

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person’s address is a single-valued fact, i.e., a person has only one address. If we don’t provide unique identifiers for people, then there will not be a functional dependency in the data:

----------------------------------------------
|   PERSON   |       ADDRESS                 |
-------------+--------------------------------
| John Smith | 123 Main St., New York        |
| John Smith | 321 Center St., San Francisco |
----------------------------------------------

Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.

---------------------------------------
|   PERSON   |       ADDRESS          |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC   |
---------------------------------------

We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

----------------------------------------------------------
| EMPLOYEE  |  FATHER    |  FATHER'S-ADDRESS             |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York        |
| Bob Smith | John Smith | 123 Main Street, NYC          |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------

However, in formal terms, there is no functional dependency here between FATHER’S-ADDRESS and FATHER, and hence no violation of third normal form.

4 FOURTH AND FIFTH NORMAL FORMS

Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By “many-to-many” we mean that an employee may have several skills, and a skill may belong to several employees.

Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.

4.1 Fourth Normal Form

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

The term “independent” will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================

Instead, they should be represented in the two records

--------------------  -----------------------
| EMPLOYEE | SKILL |  | EMPLOYEE | LANGUAGE |
====================  =======================

Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  |          |   
| Smith    | type  |          |
| Smith    |       | French   |
| Smith    |       | German   |
| Smith    |       | Greek    |
-------------------------------

This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |   
| Smith    | type  | German   |
| Smith    | type  | Greek    |
-------------------------------

(b) Minimal number of records, with null values:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |   
| Smith    | type  | German   |
| Smith    |       | Greek    |
-------------------------------

(c) Unrestricted:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |   
| Smith    | type  |          |
| Smith    |       | German   |
| Smith    | type  | Greek    |
-------------------------------

(3) A “cross-product” form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | cook  | German   |
| Smith    | cook  | Greek    |
| Smith    | type  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |
-------------------------------

Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:

  • If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
  • Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
  • Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn’t leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn’t also been deleted.

Fourth normal form minimizes such update problems.
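
To make the difference concrete, here is a small sketch assuming the two separate record types are held as in-memory arrays. The helper names and the added skill "drive" are hypothetical:

```javascript
// Two separate record types, as recommended under fourth normal form.
const employeeSkills = [
  { employee: "Smith", skill: "cook" },
  { employee: "Smith", skill: "type" },
];
const employeeLanguages = [
  { employee: "Smith", language: "French" },
  { employee: "Smith", language: "German" },
  { employee: "Smith", language: "Greek" },
];

// Adding a skill touches exactly one record and never involves languages.
function addSkill(employee, skill) {
  employeeSkills.push({ employee, skill });
}

// Under the "cross-product" policy in a single record type, the same change
// would need one new record per language the employee has.
function crossProductRecordsNeeded(employee) {
  return employeeLanguages.filter((l) => l.employee === employee).length;
}

addSkill("Smith", "drive");
console.log(employeeSkills.length);              // 3
console.log(crossProductRecordsNeeded("Smith")); // 3
```

One insert in the normalized design corresponds to three inserts under the cross-product policy, and the gap grows with the number of languages.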

4.1.1 Independence

We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are “independent” in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That’s precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages become meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith    | cook  | French   |
| Smith    | type  | French   |
| Smith    | type  | German   |
| Smith    | type  | Greek    |
-------------------------------

Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.

4.1.2 Multivalued Dependencies

For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the “cross-product” maintenance policy mentioned above. That is, for our example, every one of an employee’s skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no “information” in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
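
The cross-product condition can be tested mechanically. The following is a sketch under stated assumptions: the function name is invented, records are plain objects, and the check looks only at the records supplied rather than at the rule governing all possible records.

```javascript
// True if, for every employee, each of their skills appears paired with each
// of their languages, i.e. the pairings carry no information of their own.
function pairingsCarryNoInformation(records) {
  const byEmployee = new Map();
  for (const { employee, skill, language } of records) {
    if (!byEmployee.has(employee)) {
      byEmployee.set(employee, {
        skills: new Set(),
        languages: new Set(),
        pairs: new Set(),
      });
    }
    const e = byEmployee.get(employee);
    e.skills.add(skill);
    e.languages.add(language);
    e.pairs.add(JSON.stringify([skill, language]));
  }
  // The observed pairs are a subset of skills x languages, so equal sizes
  // mean every possible pairing is present.
  for (const e of byEmployee.values()) {
    if (e.pairs.size !== e.skills.size * e.languages.size) return false;
  }
  return true;
}

const rec = (skill, language) => ({ employee: "Smith", skill, language });

// The "cross-product" table: independent facts.
const crossProduct = [
  rec("cook", "French"), rec("cook", "German"), rec("cook", "Greek"),
  rec("type", "French"), rec("type", "German"), rec("type", "Greek"),
];

// Smith cooks French cuisine only: the pairings are meaningful.
const dependent = [
  rec("cook", "French"),
  rec("type", "French"), rec("type", "German"), rec("type", "Greek"),
];

console.log(pairingsCarryNoInformation(crossProduct)); // true
console.log(pairingsCarryNoInformation(dependent));    // false
```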

We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:

  • An employee uses certain skills on certain projects.
  • An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where “EP” represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.

4.2 Fifth Normal Form

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford    | car     | 
| Smith | GM      | truck   | 
-----------------------------

This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.

But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford    | car     | 
| Smith | Ford    | truck   | 
| Smith | GM      | car     | 
| Smith | GM      | truck   | 
| Jones | Ford    | car     | 
-----------------------------

In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:

-------------------   ---------------------   ------------------- 
| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
|-------+---------|   |---------+---------|   |-------+---------|
| Smith | Ford    |   | Ford    | car     |   | Smith | car     |
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   |
| Jones | Ford    |   | GM      | car     |   | Jones | car     |
-------------------   | GM      | truck   |   -------------------
                      ---------------------

These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.
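
The reconstruction can be sketched as a three-way join. This is only an illustration, assuming each record type is held as an array of pairs; the function name is invented, while the data is taken from the tables above:

```javascript
// The three two-field record types from the normalized form.
const agentCompany = [["Smith", "Ford"], ["Smith", "GM"], ["Jones", "Ford"]];
const companyProduct = [["Ford", "car"], ["Ford", "truck"], ["GM", "car"], ["GM", "truck"]];
const agentProduct = [["Smith", "car"], ["Smith", "truck"], ["Jones", "car"]];

// Apply the rule: an agent sells a product for a company if the agent
// represents the company, the company makes the product, and the agent
// sells the product.
function reconstruct() {
  const sells = new Set(agentProduct.map(([a, p]) => `${a}|${p}`));
  const result = [];
  for (const [agent, company] of agentCompany) {
    for (const [maker, product] of companyProduct) {
      if (maker === company && sells.has(`${agent}|${product}`)) {
        result.push([agent, company, product]);
      }
    }
  }
  return result;
}

const threeField = reconstruct();
console.log(threeField.map((r) => r.join(" / ")));
```

The result contains the same five AGENT/COMPANY/PRODUCT records as the three-field table, so nothing is lost by the decomposition when the rule holds.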

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.
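The additive versus multiplicative growth can be checked with a small Python sketch. It assumes, as the x+y count implies, that the company-product records for these companies and products are already on file; the function name `added_records` and the placeholder names are hypothetical.

```python
from itertools import product


def added_records(products, companies, agent="NewAgent"):
    """Count records added for one new agent in each representation.

    Assumes every listed company already makes every listed product, so no
    new COMPANY-PRODUCT records are needed in the normalized form.
    """
    agent_company = {(agent, c) for c in companies}
    agent_product = {(agent, p) for p in products}
    unnormalized = {(agent, c, p) for c, p in product(companies, products)}
    return len(agent_company) + len(agent_product), len(unnormalized)


# x = 3 products, y = 4 companies
normalized_count, unnormalized_count = added_records(
    ["p1", "p2", "p3"], ["c1", "c2", "c3", "c4"]
)
print(normalized_count, unnormalized_count)  # 7 12
```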

It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can’t determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.

The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford    | car     | 
| Smith | Ford    | truck   | 
| Smith | GM      | car     | 
| Smith | GM      | truck   | 
| Jones | Ford    | car     | 
| Jones | Ford    | truck   | 
| Brown | Ford    | car     | 
| Brown | GM      | car     | 
| Brown | Toyota  | car     | 
| Brown | Toyota  | bus     | 
-----------------------------

-------------------   ---------------------   ------------------- 
| AGENT | COMPANY |   | COMPANY | PRODUCT |   | AGENT | PRODUCT |
|-------+---------|   |---------+---------|   |-------+---------|
| Smith | Ford    |   | Ford    | car     |   | Smith | car     | Fifth
| Smith | GM      |   | Ford    | truck   |   | Smith | truck   | Normal
| Jones | Ford    |   | GM      | car     |   | Jones | car     | Form
| Brown | Ford    |   | GM      | truck   |   | Jones | truck   |
| Brown | GM      |   | Toyota  | car     |   | Brown | car     |
| Brown | Toyota  |   | Toyota  | bus     |   | Brown | bus     |
-------------------   ---------------------   -------------------

Observe that:

  • Jones sells cars and GM makes cars, but Jones does not represent GM.
  • Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
  • Brown represents Ford and Brown sells buses, but Ford does not make buses.
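These observations can be verified with a short Python sketch, again assuming set-of-tuples record types and hypothetical helper names. It joins each pair of two-field record types on their shared field and lists the untrue records each pair implies.

```python
records = {
    ("Smith", "Ford", "car"), ("Smith", "Ford", "truck"),
    ("Smith", "GM", "car"), ("Smith", "GM", "truck"),
    ("Jones", "Ford", "car"), ("Jones", "Ford", "truck"),
    ("Brown", "Ford", "car"), ("Brown", "GM", "car"),
    ("Brown", "Toyota", "car"), ("Brown", "Toyota", "bus"),
}


def project(rows, i, j):
    """Keep only fields i and j of each record, dropping duplicates."""
    return {(row[i], row[j]) for row in rows}


agent_company = project(records, 0, 1)
company_product = project(records, 1, 2)
agent_product = project(records, 0, 2)

# Join each pair of record types on the field they share.
via_company = {(a, c, p) for a, c in agent_company
               for c2, p in company_product if c == c2}
via_agent = {(a, c, p) for a, c in agent_company
             for a2, p in agent_product if a == a2}
via_product = {(a, c, p) for c, p in company_product
               for a, p2 in agent_product if p == p2}

# Records implied by a pair of record types but absent from the original.
for name, joined in [("agent-company + company-product", via_company),
                     ("agent-company + agent-product", via_agent),
                     ("company-product + agent-product", via_product)]:
    print(name, sorted(joined - records))

# Only the intersection of all three joins matches the original records.
print(via_company & via_agent & via_product == records)  # True
```

Each pairwise join yields at least one untrue record, including the three cases listed above.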

Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of these violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.

5 UNAVOIDABLE REDUNDANCIES

Normalization certainly doesn’t remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown in Section 4.1.1, it seems unavoidable that we record the fact that “Smith can type” several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that “Smith sells cars” several times.

6 INTER-RECORD REDUNDANCY

The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the “ultimate” normal form with respect to such redundancies.

Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:

-------------------------  -------------------------
| EMPLOYEE | DEPARTMENT |  | DEPARTMENT | LOCATION |
============-------------  ==============-----------

-----------------------
| EMPLOYEE | LOCATION |
============-----------

In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.
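A small Python sketch may make the redundancy concrete. The employee, department, and location values are hypothetical sample data (the text gives only the record layouts), and `compose` is an illustrative helper name. It assumes each employee has one department and each department has one location.

```python
# Hypothetical sample data for the three record types.
employee_department = {("Adams", "Sales"), ("Baker", "Sales"), ("Clark", "Research")}
department_location = {("Sales", "Chicago"), ("Research", "Boston")}
employee_location = {("Adams", "Chicago"), ("Baker", "Chicago"), ("Clark", "Boston")}


def compose(emp_dept, dept_loc):
    """Derive (employee, location) pairs by matching on department."""
    return {(emp, loc) for emp, dept in emp_dept
            for dept2, loc in dept_loc if dept == dept2}


# The third record type is fully derivable from the other two.
redundant = compose(employee_department, department_location) == employee_location
print(redundant)  # True

# If Sales moves, only one DEPARTMENT-LOCATION record is updated, and the
# stored EMPLOYEE-LOCATION records no longer agree with it.
moved = (department_location - {("Sales", "Chicago")}) | {("Sales", "Denver")}
stale = employee_location - compose(employee_department, moved)
print(sorted(stale))  # [('Adams', 'Chicago'), ('Baker', 'Chicago')]
```

Each record type taken alone satisfies third normal form; the inconsistency shows up only when the record types are compared with one another.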

Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].

7 CONCLUSION

While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:

  • Single-valued vs. multi-valued facts.
  • Dependency on the entire key.
  • Independent vs. dependent facts.
  • The presence of mutual constraints.
  • The presence of non-unique or non-singular representations.

And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.

8 ACKNOWLEDGMENTS

I am very grateful to Ted Codd and Ron Fagin for reading earlier drafts and making valuable comments, and especially to Chris Date for helping clarify some key points.

9 REFERENCES

  1. E.F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Comm. ACM 13 (6), June 1970, pp. 377-387. The original paper introducing the relational data model.
  2. E.F. Codd, “Normalized Data Base Structure: A Brief Tutorial”, ACM SIGFIDET Workshop on Data Description, Access, and Control, Nov. 11-12, 1971, San Diego, California, E.F. Codd and A.L. Dean (eds.). An early tutorial on the relational model and normalization.
  3. E.F. Codd, “Further Normalization of the Data Base Relational Model”, R. Rustin (ed.), Data Base Systems (Courant Computer Science Symposia 6), Prentice-Hall, 1972. Also IBM Research Report RJ909. The first formal treatment of second and third normal forms.
  4. C.J. Date, An Introduction to Database Systems (third edition), Addison-Wesley, 1981. An excellent introduction to database systems, with emphasis on the relational model.
  5. R. Fagin, “Multivalued Dependencies and a New Normal Form for Relational Databases”, ACM Transactions on Database Systems 2 (3), Sept. 1977. Also IBM Research Report RJ1812. The introduction of fourth normal form.
  6. R. Fagin, “Normal Forms and Relational Database Operators”, ACM SIGMOD International Conference on Management of Data, May 31-June 1, 1979, Boston, Mass. Also IBM Research Report RJ2471, Feb. 1979. The introduction of fifth normal form.
  7. W. Kent, “A Primer of Normal Forms”, IBM Technical Report TR02.600, Dec. 1973. An early, formal tutorial on first, second, and third normal forms.
  8. T.-W. Ling, F.W. Tompa, and T. Kameda, “An Improved Third Normal Form for Relational Databases”, ACM Transactions on Database Systems 6 (2), June 1981, pp. 329-346. One of the first treatments of inter-relational dependencies.