Tuesday, July 15, 2008

Quite an Intelligent Fellow

I've been tinkering with OpenCyc yet again, along with putting together the framework for a chatbot system. My idea is as follows:
Get a word list that allows me to reference the parts of speech for any English word.
Create a system which parses words, sentences, paragraphs, pages, chapters, and books (conceptually, just a large number of related chapters).
Record the valid structures for each construct (sentences, paragraphs, etc.). Each type of sentence will be correlated with a particular Cyc microtheory. This will allow me to assign a metadata tag to any given sentence that is known. A microtheory is a concept in the Cyc ontology system that encapsulates particular ideas that may be factually or semantically different from other concepts. Think of inheritance and polymorphism, but used to create a hierarchical view of the world, built from hundreds of thousands of basic common-sense concepts and millions of assertions related to those concepts. It allows an AI system to use the proper granularity for arbitrary contexts by drawing on inferences, hypotheses, and static data.

The problem with most chatbots is a lack of real intelligent flexibility. You can create a billion template/response structures and your bot will imitate intelligent conversation, but what you're really doing is running an if/then routine over and over again. It takes only three degrees of separation between the original input and confusing the 'natural' train of thought before you hit the limitations of most chatbots built on this structure.

However, training a bot over time on actual conversation has some serious drawbacks as well. You end up with rules being created on the fly that may be entirely wrong, with no way of preventing it unless you edit the data manually.

My solution is to semantically tag valid English sentence structures and assign a particular microtheory to each tag, or class of tags. By using those tags in Cyc microtheories, telling the system what questions it has to answer about particular sentences in order to 'understand' them, I can create an input parser that takes any given English sentence and stores it as a semantically valid data construct.

Once stored, I will again turn to Cyc and classify sentence types, and potential 'proper' responses for particular sentence classes.
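To make the tagging idea concrete, here's a minimal sketch of the pipeline: look up each word's part of speech in the word list, assign a coarse structure tag to the sentence, and map that tag to a microtheory. Everything here is a hypothetical stand-in: the tiny lexicon fakes the real 210,000-word database, and the microtheory names are placeholders, not actual Cyc identifiers.

```python
# A tiny stand-in for the full parts-of-speech database.
POS_LEXICON = {
    "the": "det", "cat": "noun", "dog": "noun",
    "sat": "verb", "is": "verb", "on": "prep",
    "mat": "noun", "what": "wh", "where": "wh",
}

# Map a coarse sentence-structure tag to a (hypothetical) microtheory name.
TAG_TO_MICROTHEORY = {
    "declarative": "SimpleAssertionMt",
    "interrogative": "QueryMt",
}

def tag_sentence(sentence):
    """Return (pos_tags, structure_tag, microtheory) for a sentence."""
    words = sentence.strip().rstrip(".?!").lower().split()
    pos = [POS_LEXICON.get(w, "unknown") for w in words]
    # Crude structural classification: a real version would use the
    # recorded valid sentence structures, not surface punctuation.
    if sentence.rstrip().endswith("?") or (pos and pos[0] == "wh"):
        structure = "interrogative"
    else:
        structure = "declarative"
    return pos, structure, TAG_TO_MICROTHEORY[structure]

print(tag_sentence("The cat sat on the mat."))
print(tag_sentence("Where is the dog?"))
```

The point of the microtheory mapping is that once a sentence is filed under a tag, the system knows which set of questions it must answer to claim it 'understands' that sentence.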
Statements, queries, imperatives, and so on will each have responses appropriate to their semantic data. I will program responses and link the response categories to each sentence class.

So I have a concept for a chatbot system which can take arbitrary (syntactically valid) English sentences, understand them by linking them to an ontological database and creating new concepts as necessary, and return a response based on the actual content of each sentence.

I've gotten a few Cyc microtheories drawn up. I have a 210,000-word parts-of-speech database. What I need now is ideas for large quantities of text which I can parse to get as broad a range as possible of sentence, paragraph, page, chapter, and book structures. Does anyone know where I can find really large corpora?

A potential offshoot of this is classification of corpora, using each class as an input to a neural net, in order to train patterns specific to styles of writing and genre... so you could create a microtheory that described a story and have the chatbot output a novel. Given a large enough corpus, you could train on particular authors' styles of writing. I would of course market this software, make millions of dollars, and take over the world.

Anyway, what I'm looking for is ideas as to where to look for parsable data. I'd need structured content, like news articles, books, and so on. The only hardcoding I'm going to do is for things like predictive spellchecking for unknown words and dealing with broad classes of inputs and responses. I'm hoping that such a system would be able to handle specific inputs and outputs dynamically, and easily pass the Turing test.

I'm also considering IRC logs, chatroom logs, and other "conversation" corpora, but those present problems such as slang, deliberate misspellings, horrible grammar, and extreme ambiguity. I think I should leap one hurdle at a time... so the first is a consistent, pre-edited, dry corpus.
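The sentence-class-to-response-category linking can be sketched the same way. In this toy version the classifier is pure surface heuristics standing in for the Cyc-backed classification, and the canned responses are obviously placeholders for responses generated from the sentence's actual semantic content:

```python
# Hypothetical response categories linked to sentence classes.
RESPONSE_CATEGORIES = {
    "statement": lambda s: "Noted: " + s,
    "query": lambda s: "Let me look that up.",
    "imperative": lambda s: "I'll try to do that.",
}

def classify(sentence):
    """Crude surface classifier standing in for the ontology-backed one."""
    text = sentence.strip()
    if text.endswith("?"):
        return "query"
    first = text.split()[0].lower()
    # Illustrative imperative markers; a real system would use parse structure.
    if first in {"please", "stop", "go", "tell", "give"}:
        return "imperative"
    return "statement"

def respond(sentence):
    category = classify(sentence)
    return RESPONSE_CATEGORIES[category](sentence)

print(respond("Is it raining?"))     # Let me look that up.
print(respond("Tell me a story."))   # I'll try to do that.
```

The dispatch-table shape is the useful part: new sentence classes and response categories can be added without touching the parser, which is what lets the response side grow independently of the input side.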
