Understand the Web as a Community

To mash the web, we must first understand it. Mash Maker uses a collaboratively-edited shared database of extractors to extract meaning from pages on the web.

You can teach Mash Maker how to understand pages like the one you are currently browsing by expanding The Extractor Editor and using a simple point-and-click interface to pick out things of interest.

If you find that the extractor for a page is broken, you can edit it and save your changes, much like a wiki. Although it is possible for a user to vandalise extractors, it is also easy for another user to revert such vandalism by browsing the extractor's History.

Extractors as Categories

Mash Maker categorizes pages based on the extractor that extracts data for them. When a user saves a Mashup for a particular page, it will be suggested for all other pages that are managed by the same extractor.

It is important to be aware of this when editing an extractor. If you create a new extractor to manage pages that were previously managed by a different extractor, then Mash Maker will no longer suggest Mashups associated with the previous extractor.
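The relationship between extractors and suggested Mashups can be pictured with a minimal sketch. This is not Mash Maker's implementation; the store and the function names `save_mashup` and `suggest_mashups` are hypothetical, and the only thing taken from the text is that Mashups are keyed by the extractor that manages the page.

```python
from collections import defaultdict

# Hypothetical in-memory store: extractor name -> Mashups saved on its pages.
mashups_by_extractor = defaultdict(list)

def save_mashup(extractor_name, mashup):
    """Associate a Mashup with the extractor managing the current page."""
    mashups_by_extractor[extractor_name].append(mashup)

def suggest_mashups(extractor_name):
    """Return the Mashups saved for any page managed by this extractor."""
    return list(mashups_by_extractor.get(extractor_name, []))

# A Mashup saved on one listings page is suggested on every listings page.
save_mashup("Craigslist Listings", "Map the listings")
print(suggest_mashups("Craigslist Listings"))

# A page taken over by a different extractor no longer sees that Mashup.
print(suggest_mashups("Craigslist Apartments"))
```

In this sketch the second lookup comes back empty, which is the effect described above: Mashups stay with the extractor they were saved under.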

The Data Tree

You can see the information that Mash Maker currently has for a page by opening the data tree panel in The Expert Sidebar. This panel shows both the data that the current extractor has extracted from the current page, and also any information that has been added by Widgets.

The data tree shows exactly the same information that is visible to the widgets on the page.

The Extractor Editor

To edit the extractor for the current page, open The Expert Sidebar and switch to the extractor tab. The extractor editor interface shows you the data tree that the current extractor has extracted from the page, and provides controls that allow you to edit the extractor. Clicking on any item in the data tree will highlight the place on the page that that data was extracted from.

The buttons at the top of the sidebar let you pick something new from the page or edit the URL Handler for the page. The details panel lets you tell Mash Maker more about the item that is currently selected in the data tree. The buttons at the bottom let you refresh the tree, publish the extractor, or browse past versions and other extractors for the same domain.

Pick from the Page

To add new information from the page to the data tree, click on the "Pick From Page" button, then click on the information on the page that you are interested in. This can be either text or an image.

When you have picked something out, Mash Maker will ask you whether it is a property of the entire page, or a property of some sub-item on the page. If you say it is a sub-item, you will be prompted to choose either a previously identified kind of sub-item, or give a name for a new kind of sub-item.

If the selection is a property of an item that Mash Maker is not currently familiar with, then it will ask you to show it the shape of the item by clicking buttons marked "bigger", "smaller", "down more", and "down less". The difference between "bigger" and "down more" is that "bigger" expands outwards through parent nodes on the page, while "down more" expands to include more nodes that are successors of the current selection.
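The difference between growing a selection upwards and growing it downwards can be sketched on a toy page tree. This is an illustration only: the `Node` and `Selection` classes are hypothetical, and the sketch assumes that "successors" means the siblings that follow the selected node, which the text does not state explicitly.

```python
class Node:
    """A toy page node; a stand-in for a real DOM node."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

class Selection:
    """A start node plus a count of following siblings (an assumption)."""
    def __init__(self, start):
        self.start = start
        self.extra = 0

    def _following(self):
        # Siblings that come after the start node under the same parent.
        if self.start.parent is None:
            return []
        siblings = self.start.parent.children
        index = next(i for i, s in enumerate(siblings) if s is self.start)
        return siblings[index + 1:]

    def bigger(self):
        # Expand outwards: the selection becomes the parent node.
        if self.start.parent is not None:
            self.start = self.start.parent
            self.extra = 0

    def down_more(self):
        # Expand downwards: include one more following sibling.
        if self.extra < len(self._following()):
            self.extra += 1

    def down_less(self):
        if self.extra > 0:
            self.extra -= 1

    def nodes(self):
        return [self.start] + self._following()[:self.extra]

page = Node("page")
title_line = Node("title-line", page)
details_line = Node("details-line", page)
footer = Node("footer", page)
link = Node("link", title_line)

selection = Selection(link)
selection.bigger()      # link -> title-line
selection.down_more()   # also take in details-line
print([n.name for n in selection.nodes()])
```

Here "bigger" moves the selection from the link to the line that contains it, while "down more" widens the item to cover the next line as well, without climbing to the whole page.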

Once you have picked out the property you are interested in, and identified the item it is part of, the interface returns to the normal data tree and the newly identified property is selected. You can now use the details panel to describe this property in more detail.

Every property needs to be given a name. Since these names are used by widgets to identify information, you should try to make property names consistent with those used for other similar pages. To help you do this, Mash Maker provides a Property Browser that allows you to browse the property names defined by other extractors, and set up sub-property relationships between properties.

Match Rules

The property details panel allows one to enable one or more match rules that fine-tune the way that Mash Maker extracts data from the page. They are as follows:

multi: allow there to be multiple instances of this property. E.g. a person may have multiple phone numbers.
required: ignore any items that do not have this property. This is used to help identify which things really are items, rather than just things that look like items.
no subs: only include the text of the selected node itself, and not the text of nodes inside it.
expert: turn on a number of lower-level controls, including direct XPath editing.
prefix: require that the selection contain a particular text prefix.
postfix: require that the selection contain a particular text postfix.
position: identify the correct property based on its position on the page.
regexp: apply the given regular expression to the text. The result value is the first matching group.

The easiest way to learn how these work is probably to play with them and see how they behave.
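As a rough aid to intuition, the three text-based rules (prefix, postfix, regexp) can be approximated as below. This is a sketch under assumptions, not Mash Maker's code: the function `apply_text_rules` is hypothetical, prefix and postfix are assumed to be tested against the start and end of the text without being stripped, and a regexp with no group is assumed to return the whole match.

```python
import re

def apply_text_rules(text, prefix=None, postfix=None, regexp=None):
    """Return the extracted value, or None if a rule rejects the text."""
    # prefix / postfix: reject selections that lack the required text.
    if prefix is not None and not text.startswith(prefix):
        return None
    if postfix is not None and not text.endswith(postfix):
        return None
    if regexp is None:
        return text
    match = re.search(regexp, text)
    if match is None:
        return None
    # The result value is the first matching group; falling back to the
    # whole match when the pattern has no group is an assumption.
    return match.group(1) if match.re.groups else match.group(0)

print(apply_text_rules("Price: $1200 / month", prefix="Price:", regexp=r"\$(\d+)"))
# prints 1200
print(apply_text_rules("Rent: $1200 / month", prefix="Price:"))
# prints None
```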

Extractor Settings

Click on the root node of the data tree to edit the properties of the extractor itself. An extractor has a name, a type, a URL regexp, a priority, and a list of example URLs.

The name is what will appear in the Mashup Gallery, and should be chosen with this in mind. You should not change the name of an existing extractor, since the name is used as the key that Mashups are associated with.

The type is used to help suggest widgets that can be applied to this extractor's pages. Click the "Types" button to browse the types that have previously been associated with extractors.

The URL regexp determines which pages this extractor will be applied to. You should take some care when writing this regexp to ensure that it matches all the pages this extractor works for, but does not match pages that are better handled by other extractors.

In some cases you will want to have extractors whose regular expressions match overlapping sets of pages. In these cases, you should use the extractor priority to determine which extractor is used. For example, the "Craigslist Apartments" extractor has a higher priority than the "Craigslist Listings" extractor, since it implements a special case. The extractor with the highest priority number wins.
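The interaction between URL regexps and priorities can be sketched as follows. The extractor names come from the example above, but the `Extractor` class, the `choose_extractor` function, the regular expressions, the priority values, and the URLs are all hypothetical. The sketch also assumes the regexp may match anywhere in the URL, which the text does not specify.

```python
import re
from dataclasses import dataclass

@dataclass
class Extractor:
    name: str
    url_regexp: str
    priority: int

def choose_extractor(url, extractors):
    """Among extractors whose regexp matches, pick the highest priority."""
    matching = [e for e in extractors if re.search(e.url_regexp, url)]
    return max(matching, key=lambda e: e.priority, default=None)

# Hypothetical patterns and priorities for the two extractors named above.
extractors = [
    Extractor("Craigslist Listings", r"craigslist\.org/", priority=1),
    Extractor("Craigslist Apartments", r"craigslist\.org/apa/", priority=2),
]

# Both regexps match this URL; the higher priority number wins.
print(choose_extractor("http://sfbay.craigslist.org/apa/123.html", extractors).name)
# prints Craigslist Apartments

# Only the general regexp matches here.
print(choose_extractor("http://sfbay.craigslist.org/sss/456.html", extractors).name)
# prints Craigslist Listings
```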

An extractor also has a list of example URLs. When you edit an extractor, the current URL is automatically added to this list. Before saving your changes, you are strongly encouraged to verify that the extractor still works on all the other example URLs.

History

One of the weaknesses of allowing anyone to edit an extractor is that sometimes someone will break an extractor that used to work, either through deliberate vandalism or through an honest mistake. When this happens, you can easily revert to a previous version of the extractor using the history browser.

The history browser also allows you to browse all extractors associated with the current domain. This is particularly useful for the times when you want to get back to an extractor you wrote a while ago, but can't find a page that is matched by its URL regexp.

URL Handlers

In addition to understanding the meaning of the content on a page, it is also useful to be able to understand the meaning of its URL. This is used in particular by the Copy and Paste widget, which creates a URL for one site using information from the current page.

The URL Handler is also used by the Note widget, which uses the URL Handler to determine when two URLs refer to the same resource and thus should have the same note associated with them.

Clicking on the "Edit URL Handler" button will open the URL Handler Editor.

The URL Handler Editor associates strings in the URL with named properties of the query that the URL corresponds to.
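One way to picture what a URL Handler provides is a mapping from parts of the URL to named properties, which can then be compared to decide whether two URLs refer to the same resource. The sketch below is an assumption-laden illustration: the `HANDLER` mapping, the property names, the functions, and the `maps.example.com` URLs are all hypothetical, and it only looks at query-string parameters, whereas a real handler may also use other parts of the URL.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical handler: query parameter name -> named property.
HANDLER = {"q": "search terms", "near": "location"}

def url_properties(url, handler):
    """Extract the named properties that the handler knows about."""
    query = parse_qs(urlsplit(url).query)
    return {
        prop: query[param][0]
        for param, prop in handler.items()
        if param in query
    }

def same_resource(url_a, url_b, handler):
    """Treat two URLs as equal if host and named properties agree."""
    if urlsplit(url_a).netloc != urlsplit(url_b).netloc:
        return False
    return url_properties(url_a, handler) == url_properties(url_b, handler)

a = "http://maps.example.com/?q=pizza&near=Berkeley&session=abc"
b = "http://maps.example.com/?near=Berkeley&q=pizza&session=xyz"

print(url_properties(a, HANDLER))
print(same_resource(a, b, HANDLER))
```

In this sketch the two URLs differ in parameter order and in a session parameter the handler ignores, so they are treated as the same resource, which is the kind of decision the Note widget relies on.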