Differences
This shows you the differences between two versions of the page.
intertwingle [2008-11-15 17:13] – 81.188.78.24 | intertwingle [2008-11-15 17:39] – 81.188.78.24 | ||
---|---|---|---|
Line 7: | Line 7: | ||
May 18th | May 18th | ||
- | Submitted by Jamie Zawinski | + | Submitted by Jamie Zawinski to Miscellaneous. |
" | " | ||
Line 37: | Line 37: | ||
The sheer multitude of representations-of-objects yields a colossal number of potential links to follow, which is why I anticipate link-chasing to be a (usually) far easier method of excavation than searching. For example, here are the headers of a typical message: | The sheer multitude of representations-of-objects yields a colossal number of potential links to follow, which is why I anticipate link-chasing to be a (usually) far easier method of excavation than searching. For example, here are the headers of a typical message: | ||
- | Date: Sun, 3 Jul 94 16:40:07 PDT | + | |
- | From: Jamie Zawinski < | + | |
- | To: eng | + | From: Jamie Zawinski < |
- | Subject: | + | To: eng |
- | In-Reply-To: | + | Subject: |
- | Message-ID: | + | In-Reply-To: |
- | References: | + | Message-ID: |
+ | References: | ||
There is a great deal of structure there: | There is a great deal of structure there: | ||
- | | + | **Sun, 3 Jul 94 16:40:07 PDT** |
- | This is a representation of a point in time. From here one can envision traversing to a list of other messages within some range of that moment: that hour, that day, that month, that year. | + | This is a representation of a point in time. From here one can envision traversing to a list of other messages within some range of that moment: that hour, that day, that month, that year. |
- | | + | **Jamie Zawinski < |
- | This is a description of a particular person. From here one should be able to easily get to information related to that person: an address book entry, or a list of all messages sent by them, or sent to them, or any number of other annotations. | + | This is a description of a particular person. From here one should be able to easily get to information related to that person: an address book entry, or a list of all messages sent by them, or sent to them, or any number of other annotations. |
- | | + | **Jamie Zawinski** |
- | This is a name, not a person, and names are notoriously non-unique. From here it would be useful to get to a list of all known people who have claimed that name (from the set of people who are message senders or recipients.) | + | This is a name, not a person, and names are notoriously non-unique. From here it would be useful to get to a list of all known people who have claimed that name (from the set of people who are message senders or recipients.) |
- | | + | **jwz@mcom.com** |
- | This is an email address, not a person, and while one email address is usually not used by more than one person, it's quite common for one person to have many email addresses (or many variations on the same address.) From here it would be useful to get to a list of all known people who have used that address (from the set of people who are message senders or recipients) and from there to the set of other addresses used by that person or those people. One might also find it useful to get a list of messages associated with this address (while excluding messages from other addresses of the same person.) | + | This is an email address, not a person, and while one email address is usually not used by more than one person, it's quite common for one person to have many email addresses (or many variations on the same address.) From here it would be useful to get to a list of all known people who have used that address (from the set of people who are message senders or recipients) and from there to the set of other addresses used by that person or those people. One might also find it useful to get a list of messages associated with this address (while excluding messages from other addresses of the same person.) |
- | | + | **eng** |
- | This is an email address, yet it happens to be a mailing list. There is no one person associated with it, yet the set of operations one might like to perform on it is very similar. | + | This is an email address, yet it happens to be a mailing list. There is no one person associated with it, yet the set of operations one might like to perform on it is very similar. |
- | | + | **printing** |
- | This is unstructured text, and what one does with unstructured text is attempt to match patterns in it. There are any number of other properties associated with this particular piece of text: it is in a header field called Subject in a message from Jamie Zawinski, on Sunday, July 3rd, and so on. All of these are interesting properties that are within one or two link-hops of the text itself. Their proximity is what makes them interesting. | + | This is unstructured text, and what one does with unstructured text is attempt to match patterns in it. There are any number of other properties associated with this particular piece of text: it is in a header field called Subject in a message from Jamie Zawinski, on Sunday, July 3rd, and so on. All of these are interesting properties that are within one or two link-hops of the text itself. Their proximity is what makes them interesting. |
- | | + | **Chris Houck** |
- | A name, as above. | + | A name, as above. |
- | | + | **Chris Houck' |
- | An ambiguous reference to a message. From here, one should be able to get to the set of all messages from someone who claimed the name Chris Houck. | + | An ambiguous reference to a message. From here, one should be able to get to the set of all messages from someone who claimed the name Chris Houck. |
- | | + | **Chris Houck' |
- | Another reference to a message, probably less ambiguous. | + | Another reference to a message, probably less ambiguous. |
- | | + | **< |
- | < | + | **< |
- | These also are references to particular messages, the least ambiguous representations so far; however, they are still slightly ambiguous, since message IDs refer to original messages: there could be multiple copies of these messages with slightly different headers or other annotations within the message-store. | + | These also are references to particular messages, the least ambiguous representations so far; however, they are still slightly ambiguous, since message IDs refer to original messages: there could be multiple copies of these messages with slightly different headers or other annotations within the message-store. |
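The header walk-through above can be sketched with Python's standard `email` package. This is only an illustration, not the parser the essay describes: the `header_links` helper and the `(kind, value)` pairs are made-up names, the `From` address is taken from the list above, and the two message IDs are hypothetical placeholders because the source elides the real ones.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Header values come from the example above where the source shows them.
# The In-Reply-To and Message-ID values are hypothetical placeholders.
RAW = """\
Date: Sun, 3 Jul 94 16:40:07 PDT
From: Jamie Zawinski <jwz@mcom.com>
To: eng
Subject: printing
In-Reply-To: <placeholder-parent@example.invalid>
Message-ID: <placeholder-child@example.invalid>

(body omitted)
"""


def header_links(raw):
    """Return (kind, value) pairs, one per object the headers refer to."""
    msg = message_from_string(raw)
    links = []
    # A point in time: the starting point for "that hour, that day, ...".
    when = parsedate_to_datetime(msg["Date"])
    links.append(("date", when.date().isoformat()))
    # The From field yields two distinct objects: a name and an address.
    name, address = parseaddr(msg["From"])
    links.append(("name", name))
    links.append(("address", address))
    # The recipient is an address too, even though it is a mailing list.
    links.append(("address", parseaddr(msg["To"])[1]))
    # Unstructured text: something to pattern-match against.
    links.append(("text", msg["Subject"]))
    # The least ambiguous references to other messages.
    for field in ("In-Reply-To", "Message-ID"):
        links.append(("msg-id", msg[field].strip()))
    return links


LINKS = header_links(RAW)
for kind, value in LINKS:
    print(kind, value)
```

Each pair is a candidate link: a browser built on this would let the user click the value and land on the set of other messages that share it.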
Any time there is a link, one can imagine an equal but opposite counter-link: | Any time there is a link, one can imagine an equal but opposite counter-link: | ||
Line 84: | Line 85: | ||
Further structure exists outside of the message headers themselves: | Further structure exists outside of the message headers themselves: | ||
- | * Messages live in folders. | + | |
+ | * Folders have names. | ||
+ | * Folders are sometimes arranged in a hierarchy. | ||
+ | * Folders tend to store messages linearly, in a particular order: thus, each message has ``previous'' | ||
+ | * Messages can contain other messages (forwarded messages, or digests.) Each such message is a message in its own right, but the containment relationship can be important. | ||
+ | * Messages have bodies. | ||
+ | * The bodies can contain unstructured text. | ||
+ | * The bodies can contain text that is named, for example, an attached text file which has a file name or description specified in its attachment headers. | ||
+ | * The bodies can contain binary objects which, while not textually searchable, are named and described. | ||
+ | * Bodies can contain hyperlinks. Plain-text messages might happen to have detectable URLs in them, and HTML messages have many mechanisms for referring to other objects. This implies that it would be interesting to traverse from a message, to information about a web page that it refers to, and back to a set of messages which refer to objects on that server. | ||
- | o Folders have names. | + | ====searches are intersections.==== |
- | + | ||
- | o Folders are sometimes arranged in a hierarchy. | + | |
- | + | ||
- | o Folders tend to store messages linearly, in a particular order: thus, each message has ``previous'' | + | |
- | + | ||
- | * Messages can contain other messages (forwarded messages, or digests.) Each such message is a message in its own right, but the containment relationship can be important. | + | |
- | + | ||
- | * Messages have bodies. | + | |
- | + | ||
- | o The bodies can contain unstructured text. | + | |
- | + | ||
- | o The bodies can contain text that is named, for example, an attached text file which has a file name or description specified in its attachment headers. | + | |
- | + | ||
- | o The bodies can contain binary objects which, while not textually searchable, are named and described. | + | |
- | + | ||
- | o Bodies can contain hyperlinks. Plain-text messages might happen to have detectable URLs in them, and HTML messages have many mechanisms for referring to other objects. This implies that it would be interesting to traverse from a message, to information about a web page that it refers to, and back to a set of messages which refer to objects on that server. | + | |
- | + | ||
- | searches are intersections. | + | |
Following a link only gives you one dimension of mobility. A search can be seen as following multiple links, and finding the intersection (or union) of the results of those links. | Following a link only gives you one dimension of mobility. A search can be seen as following multiple links, and finding the intersection (or union) of the results of those links. | ||
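A minimal sketch of that idea, assuming a toy in-memory store: every link is a key that maps back to the set of messages reachable from it, and a search is a set operation over those sets. The message data and the names `build_links`, `search_all` and `search_any` are invented for the example.

```python
from collections import defaultdict

# Hypothetical message store: id -> the objects each message links to.
MESSAGES = {
    1: {"from": "jwz", "to": "eng", "words": {"printing", "postscript"}},
    2: {"from": "jwz", "to": "eng", "words": {"lunch"}},
    3: {"from": "houck", "to": "eng", "words": {"printing"}},
    4: {"from": "jwz", "to": "houck", "words": {"printing"}},
}


def build_links(messages):
    """Map each (kind, value) link back to the ids of messages that carry it."""
    links = defaultdict(set)
    for msg_id, msg in messages.items():
        links[("from", msg["from"])].add(msg_id)
        links[("to", msg["to"])].add(msg_id)
        for word in msg["words"]:
            links[("word", word)].add(msg_id)
    return links


def search_all(links, *terms):
    """Follow every link and keep the messages reachable from all of them."""
    results = [links.get(term, set()) for term in terms]
    return set.intersection(*results) if results else set()


def search_any(links, *terms):
    """Follow every link and keep the messages reachable from any of them."""
    results = [links.get(term, set()) for term in terms]
    return set.union(*results) if results else set()


LINKS = build_links(MESSAGES)
# Messages from jwz, to eng, mentioning "printing".
print(sorted(search_all(LINKS, ("from", "jwz"), ("to", "eng"), ("word", "printing"))))  # [1]
# Messages either from or to houck.
print(sorted(search_any(LINKS, ("from", "houck"), ("to", "houck"))))  # [3, 4]
```

Following a single link is the degenerate case: a search with exactly one term.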
Line 110: | Line 102: | ||
Any link-relationship should be searchable. For example: | Any link-relationship should be searchable. For example: | ||
- | * All messages from person between date and date that have pattern in the body. | + | |
- | + | * All messages from person which contain a message from person. | |
- | * All messages from person which contain a message from person. | + | * All messages to mailing-list which refer to URL. |
- | + | * All messages containing text in the main body, but not in an attachment. | |
- | * All messages to mailing-list which refer to URL. | + | * All messages with an attachment whose file name contains string. |
- | + | ||
- | * All messages containing text in the main body, but not in an attachment. | + | |
- | + | ||
- | * All messages with an attachment whose file name contains string. | + | |
- | implementation. | + | ====implementation.==== |
The basic components of this system are: | The basic components of this system are: | ||
- | 1. parser. | + | ====1. parser.==== |
- | | + | The module which reads the existing message store (directories of BSD mbox files, or news spool directories, |
- | | + | It needs to understand where messages begin and end, understand how to descend into MIME structures, how to translate HTML into indexable text, how to recognise URLs, and so on, and so on. |
- | | + | It will presumably generate an intermediate data representation which can be more easily fed to the database. A pretty-printed version of the representation of a message might look like this (if you will excuse my lisp-centric upbringing; here in the modern world, this would presumably be done with XML): |
- | | + | <code lisp> |
- | (:db-id " | + | (:message |
- | (: | + | (:db-id " |
- | | + | (: |
- | (: | + | (:addr "Jamie Zawinski" |
- | | + | (: |
- | (: | + | (:news " |
- | | + | (: |
- | (:link " | + | (:text " |
+ | (:link " | ||
(: | (: | ||
- | | + | (:type " |
- | (:body " | + | |
- | (:link " | + | |
- | (: | + | |
- | (: | + | |
(: | (: | ||
- | | + | (:type " |
- | (:name " | + | |
- | (: | + | |
- | (:text " | + | |
- | (:link " | + | |
- | (:link " | + | |
(: | (: | ||
- | | + | (:type " |
(: | (: | ||
- | | + | (:type " |
- | (: | + | |
- | These objects are shallow: that last " | + | </code> |
- | Deeply nested MIME structures (multipart/ forms) | + | These objects |
- | A more formal representation might be | + | Deeply nested MIME structures (multipart/ forms) are also flattened. Content-Disposition is always assumed to be inline for purposes of indexing; we index the body of any part that is of a text type. There is no special handling for multipart/ |
- | msg_desc | ||
- | | ||
- | *msg_body | ||
- | msg_header= | ||
- | msg_body | ||
- | | ||
- | header_name | ||
- | | ||
- | *newsgroup / *msg_id / date | ||
- | mailbox | ||
- | name = keyword | ||
- | address | ||
- | newsgroup | ||
- | msg_id | ||
- | date = < | ||
- | text_part | ||
- | content_type | ||
- | link_part | ||
- | addr_id_part | ||
- | url | ||
- | attach_part | ||
- | | ||
- | | ||
- | | ||
- | *link_part *addr_id_part | ||
- | attach_name | ||
- | attach_desc | ||
- | attach_value | ||
- | db_id | ||
- | keyword | ||
- | text = <an uninterned, | ||
- | | ||
- | (Note: I've actually already written this parser; it's not a lot of code, but it seems to work fairly well. If anyone is seriously interested in taking this project and running with it, I'll see about getting permission to release that code.) | + | A more formal representation might be |
- | 2. database. | + | < |
+ | msg_desc | ||
+ | *link_part *addr_id_part | ||
+ | *msg_body | ||
+ | msg_header | ||
+ | msg_body | ||
+ | | ||
+ | header_name | ||
+ | header_body | ||
+ | *newsgroup / *msg_id / date | ||
+ | mailbox | ||
+ | name = keyword | ||
+ | address | ||
+ | newsgroup | ||
+ | msg_id | ||
+ | date = < | ||
+ | text_part | ||
+ | content_type | ||
+ | link_part | ||
+ | addr_id_part | ||
+ | url | ||
+ | attach_part | ||
+ | | ||
+ | | ||
+ | | ||
+ | *link_part *addr_id_part | ||
+ | attach_name | ||
+ | attach_desc | ||
+ | attach_value | ||
+ | db_id | ||
+ | keyword | ||
+ | text = <an uninterned, | ||
+ | | ||
- | The module which stores the output of the parser on disk in some quickly-retrievable format. It needs to have both relational and full-text-indexing properties; many of the searches we want to do could be accomplished with a database that was nothing but a glorified set of hash tables; but body searches need to be done in some more clever way. (Perhaps simply putting every word in a hash table would be sufficient, but I doubt it.) And more to the point, the text searches have to take advantage of the tagging of the data, so that, for example, constraining a search to be in the subject and not the body actually makes the search go faster instead of slower. | + | </ |
- | Incremental updates are probably pretty important. | + | (Note: |
- | It seems clear that RDF would be the way to go here. | ||
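As a rough illustration of the parser stage described above (not the code the author says he has already written), the sketch below uses Python's standard `email` package to descend into a MIME structure and flatten it into one shallow record. The record layout, the `flatten` and `sample_message` names, the URL pattern and the sample message contents are all assumptions made for the example.

```python
import re
from email import message_from_string, policy
from email.message import EmailMessage

# Deliberately naive URL detector; a real parser would need more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def flatten(msg):
    """Flatten a (possibly nested) MIME message into one shallow record."""
    record = {"headers": {}, "text": [], "links": [], "attachments": []}
    for name in ("From", "To", "Subject", "Date", "Message-ID"):
        if msg[name] is not None:
            record["headers"][name.lower()] = str(msg[name])
    # walk() visits every part at any depth, so nesting disappears here.
    for part in msg.walk():
        if part.is_multipart():
            continue
        filename = part.get_filename()
        if filename:
            record["attachments"].append(filename)
        # Every text part is indexed, whatever its Content-Disposition.
        if part.get_content_maintype() == "text":
            body = part.get_content()
            record["text"].append(body)
            record["links"].extend(URL_PATTERN.findall(body))
    return record


def sample_message():
    """Build a hypothetical two-part message and round-trip it through text."""
    msg = EmailMessage()
    msg["From"] = "Jamie Zawinski <jwz@mcom.com>"
    msg["Subject"] = "printing"
    msg.set_content("See http://example.invalid/notes for details.\n")
    msg.add_attachment("column a, column b\n", subtype="csv", filename="table.csv")
    return message_from_string(msg.as_string(), policy=policy.default)


RECORD = flatten(sample_message())
print(RECORD["headers"])
print(RECORD["links"], RECORD["attachments"])
```

The record corresponds loosely to the pretty-printed `:message` form: headers, text parts, links and attachment names all sit at one level, ready to be handed to the database.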
- | 3. query tool. | + | ==== 2. database.==== |
+ | |||
+ | The module which stores the output of the parser on disk in some quickly-retrievable format. It needs to have both relational and full-text-indexing properties; many of the searches we want to do could be accomplished with a database that was nothing but a glorified set of hash tables; but body searches need to be done in some more clever way. (Perhaps simply putting every word in a hash table would be sufficient, but I doubt it.) And more to the point, the text searches have to take advantage of the tagging of the data, so that, for example, constraining a search to be in the subject and not the body actually makes the search go faster instead of slower. | ||
- | All of the web search engines force the user to type in boolean expressions. Sometimes that's ok, but we should do something better, that lets the user construct expressions | + | Incremental updates are probably pretty important. I doubt we could get away with a setup that required a nightly update. |
- | Drawing on the notion | + | It seems clear that RDF would be the way to go here. |
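One way to picture the "tagged" full-text requirement is an inverted index whose keys carry the field a word was found in. The sketch below is an assumption-laden toy (an in-memory dictionary, whitespace tokenising, a made-up `TaggedIndex` class), not a proposal for the real storage layer, but it shows why constraining a search to the subject can be cheaper rather than dearer, and how messages can be added incrementally.

```python
from collections import defaultdict


class TaggedIndex:
    """Inverted index keyed by (field, word) rather than by word alone."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, word) -> message ids
        self.fields = set()

    def add(self, msg_id, field, text):
        """Index one field of one message; can be called at any time."""
        self.fields.add(field)
        for word in text.lower().split():
            self.postings[(field, word)].add(msg_id)

    def search(self, word, field=None):
        """Look a word up in one field, or in every known field if none is given."""
        word = word.lower()
        fields = [field] if field is not None else sorted(self.fields)
        hits = set()
        for name in fields:
            # One dictionary lookup per field consulted.
            hits |= self.postings.get((name, word), set())
        return hits


INDEX = TaggedIndex()
INDEX.add(1, "subject", "printing")
INDEX.add(1, "body", "the printer is broken")
INDEX.add(2, "body", "printing works again")

print(sorted(INDEX.search("printing", field="subject")))  # [1]
print(sorted(INDEX.search("printing")))                   # [1, 2]
```

A search restricted to one field touches a single key, while an unrestricted search touches one key per field, so the narrower query does strictly less work.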
- | 4. presentation tools. | + | ====3. query tool.==== |
- | There are objects, sets of objects, and presentation tools. There is a presentation tool for each kind of object; and one for each kind of object set. | + | All of the web search engines force the user to type in boolean expressions. Sometimes that's ok, but we should do something better, that lets the user construct expressions with a GUI. |
- | names, addresses, or people. | + | Drawing on the notion that searches are really set operations, perhaps one aspect of the search tool could be drag-and-drop: |
- | The presentation tools for these kinds of objects needn' | + | ==== 4. presentation tools.==== |
- | user = "Jamie Zawinski < | + | There are objects, sets of objects, and presentation tools. There is a presentation tool for each kind of object; and one for each kind of object set. |
- | Getting back to the drag-and-drop idea, dragging that button onto an existing search tool could expand the search to include that term. | + | =====names, addresses, or people.===== |
- | One should | + | The presentation tools for these kinds of objects needn' |
- | BBDB convinces me that this is an absolute requirement. | + | user = "Jamie Zawinski < |
- | The problem with the annotation notion is that it's the first time that we consider a piece of data which is not merely a projection of data already present in the message store: it is out-of-band data that needs to be stored somewhere. In the address book? In LDAP? I have no idea. | + | Getting back to the drag-and-drop idea, dragging |
- | sets of people. | + | One should be able to store annotations on people: even something as simple as a single text field would add a great deal of power. These annotations should themselves be searchable. These annotations should be able to contain (clickable!) references to other people |
- | Perhaps a simple list is sufficient, with options to sort in various ways (by last name, first name, email, host-name, or host-domain.) | + | BBDB convinces me that this is an absolute requirement. |
- | messages. | + | The problem with the annotation notion is that it's the first time that we consider a piece of data which is not merely a projection of data already present in the message store: it is out-of-band data that needs to be stored somewhere. In the address book? In LDAP? I have no idea. |
- | Presenting a single message is straightforward: | + | =====sets of people.===== |
- | Annotations of messages would be interesting as well. For example, one might want to make a note to one's self that two messages from different people refer to the same issue and should be dealt with at the same time. | + | Perhaps |
- | sets of messages. | + | =====messages.===== |
- | This presentation has to be fairly powerful; it needs to present | + | Presenting a single message is straightforward: |
- | It should also be able to incrementally update | + | Annotations of messages would be interesting |
- | Note that, to this view, the concept | + | =====sets |
- | Today, I can point my ``message set browser'' | + | This presentation has to be fairly powerful; |
- | Annotating a message-set could mean manually including and excluding specific messages: a message-set could be considered a ``bucket'' | + | It should also be able to incrementally update as results are coming back from the database, so that the user can see the results they're getting |
- | Presentation tools should be linked as well: one should be able to pick up the sets displayed in one tool and project them into another. | + | Note that, to this view, the concept of ``folder'' |
- | * Show me all messages with word in body. | + | Today, I can point my ``message set browser'' |
- | * Drag the sender column away: that' | + | Annotating a message-set could mean manually including and excluding specific messages: a message-set could be considered a ``bucket'' |
- | * In the people browser, click on an address: refine | + | Presentation tools should be linked as well: one should be able to pick up the sets displayed |
- | | + | * Show me all messages with word in body. |
+ | * Drag the sender column away: that's a set of people, therefore it is displayed using a ``people browser'' | ||
+ | * In the people browser, click on an address: refine the search to contain only those in the same domain as that address. A new, smaller list of people is presented. | ||
+ | | ||
Perhaps the message-set presentation is a simulated IMAP folder. Perhaps the message and message-set presentation tools are a mail reader. | Perhaps the message-set presentation is a simulated IMAP folder. Perhaps the message and message-set presentation tools are a mail reader. | ||
Line 273: | Line 266: | ||
The other components are server-side, | The other components are server-side, | ||
- | future. | + | ====future.==== |
- | + | ||
- | There are other interesting data-visualization possibilities here as well; since really what we have is nodes and connections between them, tools like graphers and histogram charts might also be applicable, to answer questions like | + | |
- | * show me a graph of the age-distribution of my unanswered mail, or, | + | There are other interesting data-visualization possibilities here as well; since really what we have is nodes and connections between them, tools like graphers and histogram charts might also be applicable, to answer questions like |
- | | + | * show me a graph of the age-distribution of my unanswered mail, or, |
+ | | ||
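The first of those questions can be sketched as a tiny text histogram. Everything here is illustrative: the message records are invented, "unanswered" is simplified to mean "no message in the store names this one in its In-Reply-To", and the one-week bucket size is arbitrary.

```python
from collections import Counter
from datetime import date

# Hypothetical store; "in_reply_to" is the counter-link used to find answers.
MESSAGES = [
    {"id": "<a>", "date": date(1998, 11, 1), "in_reply_to": None},
    {"id": "<b>", "date": date(1998, 11, 3), "in_reply_to": "<a>"},
    {"id": "<c>", "date": date(1998, 10, 2), "in_reply_to": None},
    {"id": "<d>", "date": date(1998, 11, 9), "in_reply_to": None},
]


def unanswered(messages):
    """Messages that no other message in the store replies to."""
    answered = {m["in_reply_to"] for m in messages if m["in_reply_to"]}
    return [m for m in messages if m["id"] not in answered]


def age_histogram(messages, today, bucket_days=7):
    """Count unanswered messages per age bucket (bucket 0 is the newest)."""
    counts = Counter()
    for m in unanswered(messages):
        age = (today - m["date"]).days
        counts[age // bucket_days] += 1
    return dict(sorted(counts.items()))


HISTOGRAM = age_histogram(MESSAGES, today=date(1998, 11, 10))
for bucket, count in HISTOGRAM.items():
    print(f"{bucket * 7:3d}+ days  {'#' * count}")
```

The interesting part is `unanswered`, which is just the reply link followed backwards; the charting itself is ordinary once the set exists.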
- | | + | The object/ |
- | | + | This sort of model is not applicable merely to the domain of messages; it applies equally well to any corpus which has structured, potentially-ambiguous references (or rather, representations of references.) |
- | | + | For example, source code. |
Copyright © 1998-2003 The Mozilla Organization. Last modified November 10, 1998 | Copyright © 1998-2003 The Mozilla Organization. Last modified November 10, 1998 | ||