Inductive Logic Programming for Classification of Interrogative Sentences

and Frame Grammars over a Restricted Domain

Cameron A. Hughes Tracey Hughes

Youngstown State University

CTEST Laboratories

Abstract

Inductive Logic Programming (ILP) is a form of machine learning in which relations, expressed in predicate logic, are induced from examples. In ILP, logic is used as the hypothesis language, and the primary artifact of learning is a set of predicate logic formulas that constitute a logic program. Because the primary artifact of learning is a logic program, ILP is often described as the intersection of machine learning and logic programming.

One of the primary advantages of using ILP for machine learning is that it provides a very general way of specifying a priori knowledge, or domain knowledge. In addition to providing a convenient method of specifying background knowledge, ILP focuses on the learning of relations as opposed to attribute-value learning, and its support for recursive structures is of particular utility in learning those relations. In this paper we demonstrate a technique for using ILP to learn sentence frame grammars to be used in interrogative sentence processing. A sentence frame is one of the simplest techniques in natural language processing for capturing a grammatical description. The grammatical description that we focus on in this paper is the interrogative sentence.

1. Introduction

Inductive Logic Programming (ILP) is a method of machine learning in which relations, typically represented as Horn clauses in predicate logic, are induced from examples. In ILP, predicate logic is used as the hypothesis language, and the result of learning is also represented as logical formulae in predicate logic. This makes logic programming languages such as Prolog and Mercury particularly suitable for implementing ILP methods. ILP methods present positive and negative examples of relations to the learner, the program that will induce the unknown relations. Given the positive and negative examples, the learner must produce a relation or set of relations that capture the positive examples and exclude the negative examples. If the hypothesis (artifact) that the learner produces covers the positive examples, the learner is said to be complete. If the artifact does not cover any of the negative examples, the learner is said to be consistent. One of the primary goals of an ILP method is to be both complete and consistent. We can now define the basic form of the ILP problem: given background knowledge B, a set of positive examples E+, and a set of negative examples E-, find a hypothesis H such that B together with H entails every example in E+ (completeness) and does not entail any example in E- (consistency).

In this paper we are interested in finding a logic program P that learns the appropriate sentence frame grammar from the set of positive examples and negative examples. An example of a sentence frame grammar for the sentence:

Who hit the boy with the sword?

is

[IPN, V, Det, N, Prep, Det, N]

Each word in the sentence is associated by position with the appropriate part of speech. A sentence frame describes an acceptable grammatical pattern. The ILP method that we are looking for will induce an appropriate sentence frame grammar that describes the primary interrogative sentence types.

Sentence Frame Grammars as High-Level Filters

The complexity and scope of developing question & answer systems are well known and have been established for quite some time (Lehnert, 1978), (Cohen, Perrault, Allen, 1982). At CTEST Laboratories we are investigating question answering using the ISIS (Interrogative Sentence Interface Subsystem) system. With ISIS we attack the problem of answering questions at several levels. Questions that are well known and frequently asked are stored in a knowledge base, and one of the first tasks of ISIS is to determine whether it recognizes the question from some previous interaction. This is a matter of simple pattern matching. If the question can be found in the FAQ, no additional parsing is done. ISIS uses the FAQ as one of its high-level filters. The results of our investigation into machine learning methods have allowed us to add sentence frame grammars as another high-level filter. The sentence frame grammar is represented as Horn clauses. If the question being posed matches a sentence frame, then no further parsing is required and the next level of processing is immediately entered. Questions that make it beyond our initial filters are prepared for deeper analysis. Sentence frame grammars can thus be used as a preprocessing phase in the complex enterprise of building question & answer systems.
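The layered filtering described above can be sketched as follows. This is an illustrative Python sketch, not the paper's Prolog implementation; the lexicon, FAQ entries, and frame list are made-up stand-ins.

```python
# Sketch of ISIS-style high-level filtering: FAQ lookup first, then a
# sentence-frame match, and only then a hand-off to deeper parsing.

def tag(sentence, lexicon):
    """Map each word to a part-of-speech label via simple lexicon lookup.
    (A real system needs a lexicon covering every word it may see.)"""
    return [lexicon[w.lower()] for w in sentence.split()]

def answer(question, faq, frames, lexicon):
    # Filter 1: exact match against the FAQ knowledge base -- no parsing needed.
    if question in faq:
        return ("faq", faq[question])
    # Filter 2: does the question's POS sequence match a learned sentence frame?
    frame = tag(question, lexicon)
    if frame in frames:
        return ("frame", frame)  # skip parsing; enter the next processing level
    # Otherwise the question is handed off for deeper analysis.
    return ("deep_parse", None)

lexicon = {"who": "ipn", "hit": "v", "the": "det", "boy": "n",
           "with": "prep", "sword": "n"}
frames = [["ipn", "v", "det", "n", "prep", "det", "n"]]
faq = {"How do I reset my password": "See the account settings page."}

print(answer("Who hit the boy with the sword", faq, frames, lexicon))
```

Because both filters are constant-time lookups over precomputed structures, the expensive parser runs only on questions that neither filter recognizes.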

Our Approach

Rather than focusing on the formal rules for constructing interrogative sentences (questions), we decided to look at what forms questions actually take in a variety of practical situations. We took random questions from a number of sources: FAQs from popular computer help files and questions from Internet newsgroups. In addition to these, we randomly took questions from:

The Morning of the Day Appointed for a General Thanksgiving (William Wordsworth)

The Two April Mornings (William Wordsworth)

To the Cuckoo (William Wordsworth)

The Two Thieves; or, the Last Stage of Avarice (William Wordsworth)

The Three Wishes (The Brothers Grimm)

Ayala's Angel (Anthony Trollope)

A Christmas Carol (Charles Dickens)

Dracula (Bram Stoker)

The Republic (Plato)

Son of the Wolf (Jack London)

The Spectacles (Edgar Allan Poe)

Sister Carrie (Theodore Dreiser)

A Vindication of the Rights of Woman (Mary Wollstonecraft)

William Wilson (Edgar Allan Poe)

Our sets of positive and negative examples were taken from these sources. The form of our question examples ranged in complexity from questions of the form:

"And you?"

to questions of the form:

"What devil or what witch was ever so great as Attila, whose blood is in these veins?"

Our goal was to present these random questions from random sources to an inductive logic programming system so that it could generalize the question forms and give us a set of interrogative sentence frames. Instead of using the well-known rules of grammar to generate the frames, we decided to see what rules would surface as a result of inductive logic programming applied to random questions.

Our Process

A set of 1000 random questions was taken from various sources. The random questions were then filtered to remove noise, e.g., duplicate questions, non-questions, duplicate forms, and misspelled words. Our primary processing tools for this phase were tools commonly found in Unix/Linux operating system environments, among them grep, sed, and awk. We also developed a couple of C++ programs to help remove some of the noise. The noise removal process left us with 144 positive examples and 50 negative examples.
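The noise-removal pass can be sketched along the following lines. This is an illustrative Python sketch; the real pipeline used grep/sed/awk and C++, and the filtering criteria below (question mark as the non-question test, case-insensitive deduplication) are simplified assumptions.

```python
# Sketch of the noise-removal pass: normalize whitespace, drop
# non-questions, and deduplicate near-identical question forms.

def clean(raw_questions):
    seen = set()
    kept = []
    for q in raw_questions:
        q = " ".join(q.split())       # normalize runs of whitespace
        key = q.lower().rstrip("?")   # duplicate forms may differ only in case
        if not q.endswith("?"):       # crude non-question filter
            continue
        if key in seen:               # drop duplicates
            continue
        seen.add(key)
        kept.append(q)
    return kept

raw = ["And you?", "and you?", "This is not a question.", "Who   hit the boy?"]
print(clean(raw))   # -> ['And you?', 'Who hit the boy?']
```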

We then used the W.E.K.A data mining tool to help us do some further analysis. The goal of this analysis was to provide us with rules or patterns that we could use for the lexicon component. The lexicon is part of the background knowledge for our ILP process. Since many English words can be used as more than one part of speech, it is not always clear which usage of a word applies. Since we are using machine learning to identify sentence frames, we decided to recursively use machine learning to help us with the lexicon background knowledge. We supplied the data for six attributes to W.E.K.A:

Attribute 1 Interrogative Pronoun Present (Yes / No)

Attribute 2 Auxiliary Verb Present (Yes / No)

Attribute 3 Question Indicator at the Beginning of the Sentence (Yes / No)

Attribute 4 Length of the Sentence

Attribute 5 Number of Question Indicators in the Sentence

Attribute 6 Distance of the Question Object from the Question Indicator
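The six attributes above can be extracted from a tokenized question roughly as follows. This Python sketch is illustrative only: the word lists and the exact definitions of "question indicator" and "question object" are our assumptions, not the paper's specification, and the question-object position is supplied by hand here.

```python
# Sketch: computing the six W.E.K.A attributes for one question.

IPN = {"who", "whom", "whose", "what", "which", "where", "when", "why", "how"}
AUX = {"do", "does", "did", "is", "are", "was", "were", "can", "could",
       "will", "would", "should", "have", "has", "had"}

def attributes(question, object_index):
    words = [w.strip('?",.').lower() for w in question.split()]
    indicators = [i for i, w in enumerate(words) if w in IPN or w in AUX]
    return {
        "ipn_present": any(w in IPN for w in words),                 # attribute 1
        "aux_present": any(w in AUX for w in words),                 # attribute 2
        "indicator_first": bool(indicators) and indicators[0] == 0,  # attribute 3
        "length": len(words),                                        # attribute 4
        "indicator_count": len(indicators),                          # attribute 5
        "object_distance": (abs(object_index - indicators[0])        # attribute 6
                            if indicators else None),
    }

print(attributes("Who hit the boy with the sword?", object_index=3))
```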

Some Observations

We noted several interesting things about our data in the preprocessing phase of the W.E.K.A tool.

First, for attributes 1 and 2, the interrogative pronoun or auxiliary verb indicators were missing in 58% of the examples. These indicators make it easier to identify a question; if they are missing, that makes the results of the ILP even more interesting. The second interesting result from the W.E.K.A tool was the mean and standard deviation for the radius of the question object relative to the question indicator. This radius was captured in our attribute 6. W.E.K.A reported this distance with a mean of 3.524 and a standard deviation of 3.508. This result would be useful for the lexicon. Our lexicon has a predicate part_of_speech(X,Y...): given a word X, it returns the part of speech for that word. If a word can be used as more than one part of speech, we use rules to determine which part of speech to return.

Since we know that we are dealing with interrogatives, the distance can be useful when we have to guess what the part of speech is. In a situation where we have ambiguity about which part of speech to return, we can use the distance to either rule in or rule out certain alternatives. We ran W.E.K.A's rules.JRip algorithm; rules.JRip is an implementation of the RIPPER rule induction algorithm. This algorithm gave us further relationships between the distance attribute and the question indicators that will be investigated in our ISIS system. The third interesting thing to note was that in 31% of the sentences that had a question indicator, the indicator was not at the beginning of the sentence. Our positive examples had a maximum sentence length of 38 words and a minimum sentence length of 1 word. These observations have heuristic potential for improving the lexicon component of the background knowledge.
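One way the distance statistics could break part-of-speech ties is sketched below. This is an illustration of the idea, not the paper's actual rules: the one-standard-deviation window and the tie-breaking policy are our assumptions.

```python
# Sketch: using the observed question-object distance distribution
# (mean 3.524, std 3.508 from W.E.K.A) to disambiguate part of speech.

MEAN, STD = 3.524, 3.508

def choose_pos(candidates, distance):
    """Prefer the noun reading (the question object is typically a noun)
    when the word's distance from the question indicator falls within one
    standard deviation of the mean; otherwise fall back to the first
    candidate part of speech."""
    if "noun" in candidates and abs(distance - MEAN) <= STD:
        return "noun"
    return candidates[0]

# A word like "watch" can be a verb or a noun; close to the expected
# question-object distance, the noun reading is ruled in.
print(choose_pos(["verb", "noun"], distance=3))   # -> noun
print(choose_pos(["verb", "noun"], distance=12))  # -> verb
```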

Our Program

The primary components of an inductive logic program are the background knowledge, the positive examples, the negative examples, and the IIM (Inductive Inference Machine). In our investigation we used Prolog to represent each component. The predicate to be learned is called the target predicate.

Target Predicate:

hypothesis(X)

The IIM:

% Base case: the word list is exhausted and the induced frame is new,
% so record it as a hypothesis.
logic_machine([], Structure, Learned) :-
    reverse(Structure, Learned),
    not(hypothesis(Learned)),
    assert(hypothesis(Learned)), !.

% Base case: the word list is exhausted but the frame is already known.
logic_machine([], Structure, Learned) :-
    reverse(Structure, Learned),
    hypothesis(Learned), !.

% Recursive case: look up a part of speech for the current word in the
% background knowledge and extend the partial frame with it.
logic_machine([Word|List], Structure, Learned) :-
    background(Predicate),
    Clause =.. [Predicate, Word],  % e.g. Clause = noun(sword)
    call(Clause),                  % succeeds if Word has this part of speech
    Term =.. [Predicate, _],       % frame slot, e.g. noun(_)
    append([Term], Structure, Grammar),
    logic_machine(List, Grammar, Learned).
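For readers who do not use Prolog, the behavior of logic_machine can be rendered roughly as follows in Python. The toy lexicon is an assumption, and this sketch takes the first matching part of speech rather than backtracking over alternatives as the Prolog clauses do.

```python
# Rough Python rendering of logic_machine: walk the word list, look up
# each word's part of speech in the background knowledge, and record
# the resulting frame as a hypothesis if it is new.

background = {
    "ipn": {"who", "what"},
    "verb": {"hit", "saw"},
    "det": {"the", "a"},
    "noun": {"boy", "sword"},
    "prep": {"with", "of"},
}

hypotheses = []  # plays the role of the asserted hypothesis/1 facts

def logic_machine(words):
    frame = []
    for word in words:
        # background(Predicate), Clause =.. [Predicate, Word], call(Clause)
        pos = next(p for p, vocab in background.items() if word in vocab)
        frame.append(pos)
    if frame not in hypotheses:  # not(hypothesis(Learned)) -> assert
        hypotheses.append(frame)
    return frame

print(logic_machine("who hit the boy with the sword".split()))
```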

Sample Background Knowledge Predicates:

background(noun).

background(ip).

background(det).

background(pn).

background(adj).

background(adv).

background(proper_noun).

background(conj).

background(interj).

background(prep).

background(tverb).

background(iverb).

background(verb).

After training on the 144 positive examples and 50 negative examples, we applied the learned program to a test set of 30 questions drawn from the original 1000. The program correctly recognized 25 of the 30 questions (83%) and did not cover any negative examples. When we applied a parser written using the formal rules for interrogative sentences, only 40% of the sentences were correctly identified as interrogatives.

Conclusion

ILP can be used to generate simple interrogative frame grammars from real-world data. Since the frame grammars are derived from how questions are actually formed, rather than from how they should be formed according to prescriptive rules of grammar, the template frame grammar is likely to recognize more questions than the traditional syntax-based approach. Further, the use of the W.E.K.A tool for data analysis and inductive rule generation had a productive impact on the construction of our lexicon background component.

References

Lehnert, Wendy G. The Process of Question Answering, 1978.

Cohen, Philip R., C. Raymond Perrault, and James F. Allen. Beyond Question Answering, 1982.

Bergadano, Francesco, and Daniele Gunetti. Inductive Logic Programming: From Machine Learning to Software Engineering, 1996.

Covington, Michael. Natural Language Processing for Prolog Programmers, 1994.

Lehnert, Wendy, and Martin Ringle. Strategies for Natural Language Processing, 1982.

Matthews, Clive. An Introduction to Natural Language Processing Through Prolog, 1998.

Huth, Michael, and Mark Ryan. Logic in Computer Science: Modelling and Reasoning about Systems, 2005.

Bratko, Ivan, and Stephen Muggleton. Applications of Inductive Logic Programming, 1995.

Srinivasan, Ashwin, and David Page. ILP: A Short Look Back and a Longer Look Forward, 2003.

Santos Costa, Vitor, et al. Query Transformations for Improving the Efficiency of ILP Systems, 2003.

Claveau, Vincent, and Pascale Sébillot. Learning Semantic Lexicons from a Part-of-Speech and Semantically Tagged Corpus Using Inductive Logic Programming, 2003.
