Natural Language Processing For Multi-User Virtual Worlds
[June 18, 2000: This paper and all concepts and code it contains are hereby placed in the public domain. You may use and modify them for any reason. If you find value in this paper, a small donation would be appreciated.]
Text-based multi-user virtual world systems involve many natural language processing issues. Many of the systems accept commands in a somewhat English-like form, and most output English text to describe the results of the commands. Here's a sample transcript of an interaction with Waterpoint [Fox97], a MOO [Cur97] [Fox98] running JHCore [Fox98] (user input is prefaced by ">"):
>look
The Gull Point Lighthouse
A basic circular room forms the main entrance and the base of the light tower of the lighthouse. Some supplies are on a shelf. The door out is closed. Dark stairs spiral down into the basement. An old narrow ladder leads upward. A door to the east leads to the house. It is open.
You see a games chest here.
Gus is here, off in another world.
>look at stairs
The stairs lead down into the dark basement.
>close house door
You close the house door.
>look at chest
A wooden chest with a hinged door on top.
Contents:
  an Auction game box
  an Abalone board
  a Rack-O! game box
>get game from chest
You haven't specified which "game" you mean.
>get abalone from chest
You remove an Abalone board from the games chest.
>give abalone to gus
You give the Abalone board to Gus.
>look at gus
Clearly, a warrior princess. Gus wears scuba gear.
She is awake, but has been staring off into space for 8 hours.
Carrying:
  an Abalone board
>go east
You open the house door.
The Hallway
Worn hardwood floor and a dim ceiling light adorns this hallway in the old house attached to the lighthouse. The door to the lighthouse tower is open.
Some parts of this text are fixed, such as the first two sentences of the lighthouse's description or the description of the stairs. Other parts are generated, either to describe the underlying representation of the state of the world (such as the fact that the house door was open and that the games chest was in the room) or to describe an event that occurred (such as getting the Abalone board from the chest and giving it to Gus). Much of the generated text is produced in an ad-hoc way, however, usually involving pattern matching and replacement; for example, the chest has a "remove" message property of the form:

%Nd %n:(removes) %di from %id.
Here, "%Nd" is replaced by the (capitalized) definite name of the actor (in this case "You", since the actor is the user); "%n:(removes)" is replaced by the form of the verb "removes" that agrees with the actor (in this case "remove" to agree with "you"); "%di" is replaced by the indefinite form of the direct object of the action (in this case the Abalone board); and "%id" is replaced by the definite form of the indirect object of the action (the chest itself).
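The substitution scheme just described can be sketched in a few lines of Python (the function, its toy agreement rule, and the dictionary layout are mine, not actual MOO code):

```python
import re

def render(template, actor, viewer, dobj, iobj):
    """Fill a %-coded message template from one viewer's perspective."""
    second_person = (viewer == actor["name"])
    out = template
    # %Nd: capitalized definite name of the actor ("You" for the actor itself).
    out = out.replace("%Nd", "You" if second_person else actor["name"])
    # %n:(verb): the form of the verb that agrees with the actor; as a toy
    # rule, strip the final "s" when the subject is "you".
    out = re.sub(r"%n:\((\w+)\)",
                 lambda m: m.group(1)[:-1] if second_person else m.group(1),
                 out)
    # %di: indefinite form of the direct object.
    out = out.replace("%di", dobj["indefinite"])
    # %id: definite form of the indirect object.
    out = out.replace("%id", iobj["definite"])
    return out

msg = "%Nd %n:(removes) %di from %id."
board = {"indefinite": "an Abalone board"}
chest = {"definite": "the games chest"}

print(render(msg, {"name": "Ragnar"}, "Ragnar", board, chest))
# -> You remove an Abalone board from the games chest.
print(render(msg, {"name": "Ragnar"}, "Gus", board, chest))
# -> Ragnar removes an Abalone board from the games chest.
```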
Note that this message property could actually have been much simpler:

%Nd %n:(removes) %di from the games chest.
The reason for using "%id" instead of "the games chest" is for reuse: there is a generic container object that has this message property, and all container objects inherit the property from it (the MOO object system is object-based, rather than class-based). So the chest object need only have a name property set on itself, and this message (and others like it) will substitute that name into the appropriate place.
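The lookup behind this reuse can be illustrated with a small prototype-style class in Python (a sketch of the idea only; the class and property names are mine):

```python
class MooObject:
    """A toy prototype-style object: properties not set locally are
    looked up on the parent chain, as in the MOO object system."""

    def __init__(self, parent=None, **props):
        self.parent = parent
        self.props = props

    def get(self, key):
        obj = self
        while obj is not None:
            if key in obj.props:
                return obj.props[key]
            obj = obj.parent
        raise KeyError(key)

generic_container = MooObject(remove_msg="%Nd %n:(removes) %di from %id.")
games_chest = MooObject(parent=generic_container, name="games chest")

# The chest inherits the message property from the generic container...
print(games_chest.get("remove_msg"))  # -> %Nd %n:(removes) %di from %id.
# ...while supplying only its own name for the substitution to use.
print(games_chest.get("name"))        # -> games chest
```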
The reason for using "%Nd %n:(removes)" instead of "You remove", however, is more essential: since this is a multi-user system, different users see different text representations of the same events. Thus, the user connected to the Gus character saw

Ragnar removes an Abalone board from the games chest.
The command parsing is done in a similarly ad-hoc pattern-matching fashion; all commands must be of one of the following forms:

<verb>
<verb> <dobj>
<verb> <prep> <iobj>
<verb> <dobj> <prep> <iobj>
where <dobj> and <iobj> must refer to objects in the room or otherwise accessible to the user's character, and <prep> must be one of 28 prepositions or prepositional phrases (such as "on top of"), grouped into 15 synonym sets; for example, "with" and "using" are considered synonyms. Object methods (called "verbs" for obvious reasons) may have dobj, prep, and iobj arguments attached to them; dobj and iobj arguments may be "this", in which case the command must refer to the object in that position; "any", in which case the command may refer to any object in that position; or "none", in which case the command must not have a phrase in that position. The prep argument may be "any", "none", or one of the 15 preposition synonym sets. For example, the house door object has a method "close this none none"; the games chest object has a method "get any from this"; the Abalone board object has a method "give this to any"; and the lighthouse room object has methods "look none none none" and "look none at any". The command parser must decide from the command and the current state of the world which method to run with which arguments; as shown above with the command "get game from chest", if there is any ambiguity it prints an error message and aborts.
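As an illustration, here is a Python sketch of this matching scheme (the preposition table is abridged to a few of the 15 synonym sets, and the data shapes are my own simplification of MOO's):

```python
# Each preposition maps to the label of its synonym set (abridged).
PREP_SYNONYMS = {"with": "with/using", "using": "with/using",
                 "from": "from/out of", "out of": "from/out of",
                 "to": "to", "at": "at"}

def matches(spec, verb, dobj, prep, iobj, this):
    """Does a method's (verb, dobj, prep, iobj) spec match a parsed
    command issued against the object `this`?"""
    sverb, sdobj, sprep, siobj = spec
    if sverb != verb:
        return False
    for want, got in ((sdobj, dobj), (siobj, iobj)):
        if want == "none" and got is not None:
            return False
        if want == "this" and got != this:
            return False
        # "any" accepts whatever is (or is not) in that position.
    if sprep == "none":
        return prep is None
    if sprep == "any":
        return True
    return prep is not None and PREP_SYNONYMS.get(prep) == sprep

# "get abalone from chest" against the chest's method "get any from this":
print(matches(("get", "any", "from/out of", "this"),
              "get", "abalone board", "from", "games chest", "games chest"))
# -> True
# "look at stairs" against the room's method "look none at any":
print(matches(("look", "none", "at", "any"),
              "look", None, "at", "stairs", "lighthouse"))
# -> True
```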
2. The Task
As you can see, language processing in this system is rather complicated yet still fairly limited. The parser cannot understand commands of the form "give Gus abalone board", let alone more complex constructions such as "make Gus give abalone board to me", "put abalone board in Auction game box in games chest", or "close games chest then get it and give it to Gus". And while the text generation system tries to look nice by allowing you to specify whether a word should use a definite or indefinite article, the use of definite articles really ought to depend on the discourse situation. For example, "Ragnar gives you the Abalone board" happened to be appropriate for Gus in this case, because she had just seen me remove it from the chest, but if I had walked into the room carrying it and given it to her, she should have seen "Ragnar gives you an Abalone board". Ideally, it would have made use of anaphora, and just said "Ragnar removes an Abalone board from the games chest. He gives it to you."
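The kind of discourse-sensitive article choice being argued for here can be sketched in Python, with a per-hearer set of already-seen objects standing in for a real discourse model (the function and data shapes are illustrative):

```python
def noun_phrase(obj, seen):
    """Choose "the" only if this hearer has already encountered the object."""
    if obj in seen:
        article = "the"
    else:
        article = "an" if obj[0].lower() in "aeiou" else "a"
    seen.add(obj)  # the hearer now knows about the object
    return f"{article} {obj}"

gus_has_seen = set()
# Gus watches the board come out of the chest, then receives it:
print(f"Ragnar removes {noun_phrase('Abalone board', gus_has_seen)} "
      f"from the games chest.")
# -> Ragnar removes an Abalone board from the games chest.
print(f"Ragnar gives you {noun_phrase('Abalone board', gus_has_seen)}.")
# -> Ragnar gives you the Abalone board.
```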
Clearly the system needs to be more oriented towards natural language processing and knowledge representation in order to implement these kinds of improvements. As I was reading about unification grammars in [Nor92], I realized that since the Prolog-like deduction system was inherently reversible, unification grammars could feasibly handle both command parsing and text generation. So, as a proof of concept, I decided to implement a toy system involving a command parser and a text generator to describe events to multiple users, using Norvig's Common Lisp unification grammar system. In particular, it should be able to take as input:
and produce four descriptions of the resulting event:
The description from the book's perspective is perhaps irrelevant, but in the general case, any participant is capable of observing the event and needs to be notified-- imagine instead Fred carrying Mary and giving her to John. In any case, it's a simple matter for descriptions given to inanimate objects to just be ignored.
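The fan-out to all four observers can be sketched in Python, assuming a toy event representation of my own (the actual system's representation differs):

```python
fred = {"name": "Fred"}
mary = {"name": "Mary"}
book = {"name": "the book"}

def describe(event, observer):
    """Render one giving event from a particular observer's point of view."""
    actor, recipient, obj = event
    subj = "You" if observer is actor else actor["name"]
    verb = "give" if observer is actor else "gives"
    rec = "you" if observer is recipient else recipient["name"]
    thing = "you" if observer is obj else obj["name"]
    return f"{subj} {verb} {rec} {thing}."

event = (fred, mary, book)
for observer in (fred, mary, book, "third-party"):
    print(describe(event, observer))
# -> You give Mary the book.
# -> Fred gives you the book.
# -> Fred gives Mary you.
# -> Fred gives Mary the book.
```

The book's description is generated like any other participant's; in a real system the server would simply discard notifications sent to inanimate objects.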
3. The Implementation
My first attempt tried to use the full English grammar of [Nor92] chapter 21 to parse a command into a semantic data structure and then reverse the deduction to generate text. However, I soon discovered that that grammar system was not, in fact, reversible. In particular, the "and*" predicate was implemented with a function that assumed that its first argument was bound; also, several rules used the "if" predicate, which was not reversible in the "else" case because a test involving unbound variables was always considered true, and the cut prevented backtracking into the else clause. The quantifier metavariable mechanism also seemed problematic, and not necessary for the task at hand.
So I moved back a few sections to 20.3, pages 694-5 in particular. That simple grammar and lexicon had most of what I needed, and the example even showed how it could be used to generate text, so I used that as my starting point instead. Here is my final grammar (to run, replace "../paip" with the path to the paip source code directory relative to your current directory):
The main additions I made are the distinction between finite and nonfinite verb inflections, as well as new VP rules for ditransitive, transitive+to, and the "make" form with a VP complement. I also beefed up the semantic representation somewhat to account for participants in events; for example, in the event
it's not obvious how to determine which entities are the participants in the event: they are not simply the leaves of the tree, because we want to use "the(book)" rather than just "book" (allowing the definite article to carry some disambiguating meaning); also, "Mary" should only be counted as a participant once, not twice. Instead this event is represented as
in order to explicitly flag which objects are the participants in the event.
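In Python terms (this shape is my own illustration, not the paper's Lisp representation), the contrast between collecting leaves and flagging participants looks like this. In a nested event such as "Fred makes Mary give herself the book", "Mary" occurs twice as a leaf but is one participant, and we want the full phrase ("the", "book") rather than the bare "book" leaf:

```python
# A nested event term, with the participants flagged explicitly.
give = {"pred": "give", "args": ("Mary", ("the", "book"), "Mary")}
event = {"pred": "make", "args": ("Fred", give),
         "participants": ["Fred", "Mary", ("the", "book")]}

def leaves(term):
    """Naively collect the leaves of an event term."""
    if isinstance(term, dict):
        return [x for arg in term["args"] for x in leaves(arg)]
    if isinstance(term, tuple):
        return [x for part in term for x in leaves(part)]
    return [term]

# The leaves duplicate Mary and lose the determiner structure...
print(leaves(event))          # -> ['Fred', 'Mary', 'the', 'book', 'Mary']
# ...while the explicit list gives each participant once, phrase intact.
print(event["participants"])  # -> ['Fred', 'Mary', ('the', 'book')]
```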
Here is the code to process a command, given the above grammar:
Here's the system in action:
The symbol 'third-party is meant to represent all observers not participating in the event; the observer may also be specified explicitly:
More complex examples:
4. Other Work
The idea of using unification grammars for text generation is not new; in fact, one of the most popular text generation systems, FUF/SURGE [Elh97], is based on previous work in syntactic realization with FUGs (functional unification grammars). FUF is a language for writing unification grammars, while SURGE is a large English grammar written in FUF. While the DCG (definite clause grammar) implementation in [Nor92] uses conjunctions of quantified first-order terms as its semantic representation, FUF uses unordered sets of features, or attribute-value pairs; these sets can be viewed as constraints. The input to the FUF/SURGE system is a set of constraints on what is to be said, which is unified with the grammar, to determine how it is to be said; this unified feature set is then sent to a linearizer which produces the English text. One of the main emphases of FUF, as described in [Elh93], is the ability to tailor the output to the hearer's knowledge and desire state. While this is similar to the multi-user virtual world situation, where we want to tailor the output to the hearer's knowledge state and discourse history, FUF is more concerned with resolving lexical choice directed by the speaker's argumentative intent, i.e. choosing words that are more likely to persuade the hearer. It does appear to take discourse history into account, though, so it is probably a superset of the desired functionality.
Another approach to text generation is taken in RealPro [Lav97]. Unlike FUF, RealPro performs no lexical choice; its input is a "Deep Syntactic Structure", or DSyntS, which is a fully lexicalized dependency syntax tree, i.e. a tree of words whose arcs indicate syntactic roles (like "subject") rather than semantic roles (like "agent"). Rather than using unification for linearizing, it performs more straightforward tree-structure modification, using a cascading series of independent modules. The system was designed to be small, fast, and portable (available in C++ and Java), rather than having the broad coverage and complex text planning features of FUF/SURGE; however, its features do not at all address our task of tailoring text output to the hearer's discourse state, so it would only solve a small part of our problem.
A third approach is taken by PENMAN and NIGEL (summarized in [Elh93]; a successor, KOMET-PENMAN(ML), is briefly described in [Bat94]). PENMAN traverses a systemic grammar (basically an annotated DAG), choosing features for each choice point and realizing by side-effect the detailed linguistic structure, which is then passed to NIGEL to be realized. The choices made encompass both discourse planning and lexical choice, and can be the result of querying a knowledge base or interacting with a user who is guiding the text generation manually. This system can be thought of as the function-oriented dual to FUF's structure-oriented system; traversal of the systemic grammar is done procedurally, with input from the choice/query mechanism, while FUF's search is implicit in unification and directed by the feature structures and the declarative grammar.
5. Conclusion And Future Directions
As a toy example, this project was a successful proof of concept that both command parsing and text generation for a multi-user virtual world can be handled using a unification grammar. However, there are a number of improvements to be made, both small scale and large, before this type of system could replace what is currently being used.
One obvious difficulty with the current system is the need to add new rules for every verb and object. A more flexible lexicon system such as the one described in sections 20.11 and 20.12 in [Nor92] would come in handy, or perhaps a simpler system that used the local environment of methods and objects as the lexicon.
Another needed feature is pronouns, including reflexives (yourself, himself, herself). Currently, if an event involves the same person as subject and direct or indirect object, it looks a little funny:
Pronouns are also needed in sentences like "Fred makes you give Fred the book", as well as cross-sentence anaphora in more static texts such as room descriptions. The problem of definite reference also needs to be addressed, both in parsing and in generation.
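A toy Python sketch of the reflexive fix (the feature names and gender table are illustrative, not part of the implemented system):

```python
def realize_object(subject, obj):
    """Render an object NP, using a reflexive when it repeats the subject."""
    if obj is subject:
        if obj.get("person") == 2:
            return "yourself"
        return {"m": "himself", "f": "herself"}[obj["gender"]]
    return obj["name"]

fred = {"name": "Fred", "person": 3, "gender": "m"}
gus = {"name": "Gus", "person": 3, "gender": "f"}

print(f"Fred gives {realize_object(fred, fred)} the book.")
# -> Fred gives himself the book.
print(f"Gus looks at {realize_object(gus, gus)}.")
# -> Gus looks at herself.
print(f"Fred gives {realize_object(fred, gus)} the book.")
# -> Fred gives Gus the book.
```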
On a larger scale, it might be possible to use FUF and SURGE directly, in which case you get most of these kinds of things for free. However, such a large, general-purpose system might be overkill, and performance might become a problem. If nothing else, it would serve as a useful experiment and a source of ideas for how to better represent semantic and discourse knowledge in the current system.