Archive for the ‘Caroline’ Category

I finally caved in and installed a wiki the other day.

My primary goal for this Wiki is for it to become my main repository of everything Caroline-related.  As such, there is only one article besides the home page, and that is Caroline’s page.  I will continually be adding more material as I work more stuff out and code up various portions.

Visit if you’re interested!  It’s in the folder “wiki16” under my AS website.  The “16” stands for the version, if you were curious.  I’m using MediaWiki 1.6.8 (because 1.8.0 requires PHP 5, which StartLogic does not currently have).

Feel free to register if you want, I guess.  You do have to register in order to edit, but there’s no complicated registration process at all.  Maybe you can help with my layout issues by fixing up MediaWiki:Monobook.css, haha!  J/k.  That would be a waste of your time :)

[Animated gifs: Caroline blinking; Caroline sleeping]

Tkinter only likes to show static gifs … … … eh! I don’t want to run an infinite for loop to switch out the images and eat up all my CPU …
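
One possible way out, I think: Tk’s after() can schedule the frame swaps on a timer, with no busy loop at all. A sketch (the gif file name is made up; this is the Python 2 Tkinter module, renamed tkinter in Python 3):

    import Tkinter as tk  # the module is "tkinter" in Python 3

    root = tk.Tk()
    label = tk.Label(root)
    label.pack()

    # Grab each frame of the animated gif by index; Tk raises TclError
    # once the index runs past the last frame.
    frames = []
    while True:
        try:
            frames.append(tk.PhotoImage(file="caroline_blink.gif",
                                        format="gif -index %d" % len(frames)))
        except tk.TclError:
            break

    def animate(i=0):
        label.configure(image=frames[i])
        # after() schedules the next swap itself, so the CPU idles between frames
        root.after(100, animate, (i + 1) % len(frames))

    animate()
    root.mainloop()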

Timetable for Caroline project:

Deadlines:

Friday, 9/22/06: the main dictionary should be complete.

Monday, 9/25/06: the SVO module should be spitting out correct digest-analyses for ALL standard-grammar sentences and combinations thereof.  That means the right relationships between clauses and correct subject-verb-object splitting.

Saturday, 9/30/06: all gif animations for Caroline’s moods should be completed in high-quality CG color.

Saturday, 10/07/06: the associations web should be hefty enough to handle a 7-year-old level of understanding.

Saturday, 10/14/06: the verb function break-down system has to be in place.

Saturday, 10/21/06: the output module should be able to identify the nature of the input and spit out reasonable ideas for a decent response.

Saturday, 10/28/06: the output module should be ready to return good SVOs.

Saturday, 11/4/06: update week: fill in the nouns, verbs, and adjectives, and spruce up the GUI.

Saturday, 11/11/06: implement the emotion tracking system.

Saturday, 11/18/06: the dresser module should be all set.

Saturday, 11/25/06: debugging week; clean-up of code.

Saturday, 12/2/06: finish up the short-term memory and mathematical systems.

Saturday, 12/9/06: another week for anticipated debugging.

Saturday, 12/16/06: public release.

I am going to finalize things .. for the beta version, anyway:

Updated relationships: I have decided that the old scaling system was insufficient. After thinking about it a long time (um, during my shower a few minutes ago ..), I have decided to use the following temporary scheme. The value of the number is no longer important – it is just a designation.

A,B,0: A is (equal to) B. Example: 3 is three. Amelioration is improvement. *These are very few and far between. The only use of this is for absolute synonyms. Note that thesauruses actually include many non-absolute synonyms (which is why I am reluctant to use them).

A,B,1: A is like B. A,B,-1: A is like B. Note that the positive and negative forms are the same: this is because “to be like” is a reciprocal property.

Example: Smiling is like grinning. Eating is like consuming. Pens are like pencils.

Now the question is: how so? The similarities will be basically summarized as being the common connections. For instance, both pens and pencils will have a -2 connection to writing_utensils (see below).

A,B,2: A is a superset of B. A,B,-2: A is a subset of B.

Example (2): Animals include bears. Computers include desktops and laptops.

Example (-2): A bear is a kind of animal. A desktop is a type of computer.

A,B,3: A can do/make B. A,B,-3: A can be done by, achieved through B.

Example (3): Cameras can make photographs. Cars can do transportation.

Example (-3): Writing can be done by/with a pencil.

A,B,4: A has the objective property of B. A,B,-4: A is an objective property of B.

A,B,4.5: A has the subjective property of/is sometimes B. A,B,-4.5: the reverse.

A,B,5: *important* A causes B. A,B,-5: A is caused by B.

Example (5): Heating causes melting. The moon causes tides. Fossil_fuels cause pollution.

Example (-5): Panic is caused by disaster.

A,B,10: A is somehow related to B. -10 is the same. This expresses a relationship whose quality is unknown.

An additional field will be made: preferred expression of relation.

I will probably *not* be adding the additional factor of “to what degree” these relations are held. The distance (confidence) will take care of that in the few times that it is important (mostly with the A is like B scenario).
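
To make the scheme concrete, here is a tiny sketch of how the triples might sit in code (the sample data is purely illustrative):

    # Each connection is an (A, B, relation) triple; negating the relation
    # reverses it, and 0, 1/-1, and 10/-10 are their own reverses.
    RELATIONS = [
        ('animal', 'bear', 2),        # animals include bears
        ('camera', 'photograph', 3),  # cameras can make photographs
        ('heating', 'melting', 5),    # heating causes melting
        ('pen', 'pencil', 1),         # pens are like pencils
    ]

    def relations_of(word):
        # Yield (other word, relation), flipping the sign when the word
        # sits in the B position.
        for a, b, rel in RELATIONS:
            if a == word:
                yield b, rel
            elif b == word:
                yield a, -rel

    print(list(relations_of('bear')))   # [('animal', -2)]: a bear is a kind of animal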

Okay, this is the last post before I actually make the dang thing.

Subject: a code outline for the query option.

Input: a list of words to search for. Example: ['blue', 'green']. Plus a maximum allowed distance, say 6.

As a side note, another proposal: distance should be minimally increased if two concepts appear apart from one another in some considerable fashion.

The query algorithm will run like this:

a for loop on each term in the query list

begin with blue. blue is appended to the currentlocation list [0, 'blue', 0]. The first number is always the distance from “home”. When it exceeds the maximum, the location should then backtrack. The second number indicates “where we left off.”
the blue entry is found as matrix row #40 (starting at 0): ['blue', 'whale', 4]
the curloc is updated to [0, 'blue', 41, 4, 'whale', 0]. Note that if the result is identical to the immediately previous word (a “pingback” in a sense), it will be discarded, while if it is identical to a word before the immediately previous word, or to a word in the results list, it will be noted but NOT searched (treated as though it were a dead-end). If this rule is not implemented, the search will take infinitely long due to a circular loop. While this happens in humans (causing headaches), it is not permissible in a computer.
now we search for whale (always search for end-2, the second-to-last entry) on range 0 (always the last entry) through the end of the matrix. Suppose we turn up dolphin(2) @ 64.

curloc is now [0, 'blue', 41, 4, 'whale', 65, 6, 'dolphin', 0]

Now the algorithm should determine that the maximum has been reached. This is one “escape” situation, the other being when a word has been exhausted (search returns -1).

Both escape situations yield the following results:

a. write the last word to the results file.

b. delete the last three entries in curloc.

Now we resume searching under whale, starting @65.

Alright, so this goes on for a while, and we end up with this hypothetical “search result,” quotes omitted for convenience. Note that in the real thing, there *will* be alphabetical sorting, unlike what you see here.
[blue,whale,dolphin,mammal,ocean,animal,ocean,sea,fish,whale,sky,clouds,birds,sun,bleu,
French,cheese,France,blue-green,peacock,green-blue,turquoise,color,yellow,red,orange,violet,
purple,green,blah blah blah]

The task now is to find out which words appear most often. One way to do this is to make the results list into a results dictionary that keeps personal tally values associated with the word keys.
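
Putting the whole walkthrough together, here is a rough sketch of the loop (the matrix rows are hypothetical, and I am folding in the tally-dictionary idea from above instead of a results file):

    MATRIX = [
        ('blue', 'whale', 4),
        ('whale', 'dolphin', 2),
        # ... the rest of the connection matrix ...
    ]

    def find_next(word, start):
        # First row at or after `start` containing `word`, returned as
        # (index, other word, distance); None means the word is exhausted.
        for i in range(start, len(MATRIX)):
            a, b, d = MATRIX[i]
            if word == a:
                return i, b, d
            if word == b:
                return i, a, d
        return None

    def query(term, max_dist):
        results = {}                 # word -> tally
        curloc = [0, term, 0]        # [distance, word, where-we-left-off, ...]
        while curloc:
            dist, word = curloc[-3], curloc[-2]
            hit = find_next(word, curloc[-1])
            if hit is None or dist >= max_dist:      # both escape situations
                results[word] = results.get(word, 0) + 1
                del curloc[-3:]                      # backtrack
                continue
            i, nxt, d = hit
            curloc[-1] = i + 1                       # where we left off
            path = curloc[1::3]                      # words along the current path
            if len(path) > 1 and nxt == path[-2]:
                continue                             # pingback: discard outright
            if nxt in path or nxt in results:
                results[nxt] = results.get(nxt, 0) + 1   # note it, but dead-end
                continue
            curloc += [dist + d, nxt, 0]             # step deeper
        return results

    print(query('blue', 6))   # on the toy matrix: {'dolphin': 1, 'whale': 1, 'blue': 1}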

A further refinement of the idea-net concept; moving the theory into actual code.

Proposal 1.  Better than node-counts:

In order to determine the maximum “radius” of search, node-counting is easiest to execute, but can also be misleading.  Thus, the “searching” function should be constructed in the following way:

ltm.searchNet([list of words] (or 'string'), maximum distance traversed, maximum nodes traversed (default 0), maximum search results (default 3), *options)

Examples of options include:

1.  Contradiction handling.  Suppose that word A leads to B, and C leads to D, but while A and B agree and C and D agree, B and D disagree.  Then, are B and D, which are very close but contradictory, acceptable results?  yes/no

2.  Exclusion of bad routes.  If the maximum distance allowed is 20, but single connections of distance 16 are to be excluded due to excessive length, then this setting should be activated.

By setting a maximum distance, one is able to gather all ideas that are reasonably related.  As an analogy, one is technically very closely related to all of one’s relatives.  But in practice, not all relatives are that close, and friends of friends may be closer than siblings of parents.  As a result, it would be more accurate to poll those who are close, not those who are necessarily directly connected.
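
In Python terms, the signature might flatten those options into keywords. A sketch, with every name provisional:

    def searchNet(words, max_distance, max_nodes=0, max_results=3,
                  allow_contradictions=True, max_single_hop=None):
        # words may be a list of words or a single string.
        # max_nodes=0 means no node limit.  allow_contradictions toggles
        # option 1 above; max_single_hop, if set, excludes any single
        # connection longer than that distance (option 2 above).
        ...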

Proposal 2.  The binary distance system

The distance system assigns a default “high” value to a “fresh” connection between ideas.  After repeated exposure to the connection, this value is whittled down.

However, there are two problems with a linear system:

1.  It could eventually reach 0, which is not useful at all.

2.  It does not reflect the actual state of things: the second exposure is key; by the third exposure, we are willing to believe something.

As such, the proposal is this:

if connection.confirmed:
    connection.halveDistance()    # 16 -> 8 -> 4: approaches, but never reaches, 0

Basically, you have a default distance of, say, 16.  This distance is then halved to 8, then to 4.  By the time it reaches 4, it is within a reasonable distance to be confident.

Proposal 3.  Separation of confidence vs. nature

The “distance” concept must be treated independently of the nature of the connection.  It is easy to confuse the *type of relationship between the nodes* with the *confidence with which one senses the connection*.  As an example, let us suppose that Caroline has just learned that one might refer to a language as a “tongue.”  She establishes a connection ('language', 'tongue', 16, 0).  The 0 is the relation: it represents the conception that language and tongue are synonyms, or true equals.  But the 16 is the distance: she is not yet sure if she has picked up the right connection, so she is not confident in this connection.  As another example, suppose that Caroline learns about love and hate: ('love', 'hate', 0.5, -3).  The distance, 0.5, means that Caroline intimately knows the relationship between love and hate.  But this relationship is highly polarized: -3 indicates the strongest repelling between ideas, and so Caroline understands that as concepts, they are as far apart as possible.  They are only close in “distance” because they have a strong connection with one another.
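
In tuple form, those two examples come out like this (a sketch of the four-field connection record):

    # (A, B, distance, relation): distance is confidence, relation is nature.
    connections = [
        ('language', 'tongue', 16,  0),   # synonyms, but not yet confident
        ('love',     'hate',  0.5, -3),   # intimately known, strongly opposed
    ]
    # The fields never mix: halving the distance (more confidence)
    # leaves the relation untouched, and vice versa.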

Proposal 4.  Deletion of weak connections.

A lot of the connections that get made will be flat-out wrong.  People might lie, say things incorrectly, or use the wrong diction; or a connection might be drawn between two words that were erroneously linked.  So, to counter the amassing of false information, it is prudent to weed out weak links at the end of the day, the same way that people do.

Basically, the algorithm might work like this: weed out 10 to 20% of 16-distance connections and perhaps 1 to 2% of 8-distance connections at the end of each day.  This might seem counterproductive: what if something useful is lost?  And yet this is a necessary measure.
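
A sketch of that nightly pass (the rates are the illustrative ones above, and the connections are the four-field tuples from Proposal 3):

    import random

    PRUNE_RATES = {16: 0.15, 8: 0.015}   # chance of deletion per night, by distance

    def nightly_prune(connections):
        # Keep a connection unless its distance class comes up for weeding.
        return [c for c in connections
                if random.random() >= PRUNE_RATES.get(c[2], 0.0)]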

Why do students studying for a test seem to spontaneously forget things?  It’s because of this same mechanism: the brain discards what it thinks are poor connections in order to allow for revision.  This is also the same reason why we are flexible – why we change our opinions or correct old misconceptions.

A Pop Sci article recently mentioned how only a couple of genes encode the entire brain, which is constructed as a random mass of roughly equal neurons.

Suppose that a particular clump of neurons represents a particular concept (just as a model .. it probably isn’t true). As an isolated idea, it is completely useless. This is analogous to my dictionaries. These dictionaries hold a lot of facts. But as a great blogger pointed out recently, a user manual is useless if it only lists functions: one must know why one would want to use such functions. In the same way, the facts need to have a context, a purpose for being accessed.

So, the key to linking up the ideas properly is probably to “wire” them. The problem with computer-style hierarchical data structures (think folders and subfolders) is that they establish a sort of tiered order that is an artificial construct. Real data is hard to store in such a fashion.

Each concept may have multiple connections and also may be part of a circular loop or whatever. The “neuron net” is clearly the solution to this problem: it allows for logical connections without the restrictions of hierarchy. No one node in the web is “more important” in the sense that it is at the “top” of the logical tree. Instead, nodes may be important due to having many connections, but all nodes are fundamentally equal.

I have not yet totally decided on how to store this “map” as data, but what I will probably end up storing is not the nodes themselves, but the lines that connect them. That is, the “soma” of each neuron is not what matters, but the axons and dendrites. I need to know four pieces of data for each connection:

1 and 2. The start and end nodes.

3. The distance between the nodes.

4. The nature of the connection.

In real life, the “nature” may be as simple as excitatory vs. inhibitory.

In my proposed net, three basic connection types are allowed: positive (excitatory), negative (inhibitory), and neutral. Positive and negative come in three grades of strength: high (greater than), medium (equal to), and low (less than). Complex relationships are created by wiring the destination node to the start node and to a verb-hub.

I’m not totally set on how to implement the “complex” wiring yet. In any case, the mass of connections would be stored in “ganglia,” which are self-organizing subdivisions of the larger factual database. The idea behind the ganglia is to segregate connections that are “close” to each other into small communities of ideas that are likely to be accessed together (e.g., keyboard, typing, keys, space bar; or green, blue, yellow, red, orange).

The goal of this section of Caroline is to replace the vaguely defined earlier concept of looking up all the words in a sentence input and then amassing a list of related topics from the nature of the words themselves. Instead, Caroline would be inclined to use this set of facts as a faster and more lifelike alternative. The intersections of the probing will provide the idea of a “context.”

For example, suppose a sentence has the words “ice” and “melt.” Probing, say, 3 connections deep (that is, collecting connections from the web that are at most 3 nodes away – a humble list), certain “hits” would be more numerous than others. I would imagine that “water,” and the negative-related “hot” and “cold” would show up in each of the probes. This would be the easiest way to establish that the context of “ice” is in relation to its property of being water, and something to do with the ambient temperature.
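
A quick sketch of such a probe, on a toy web (the neighbor lists are made up):

    from collections import deque

    def probe(web, words, depth=3):
        # Breadth-first: collect everything within `depth` hops of each input
        # word, tallying how many of the probes reach each node.
        hits = {}
        for start in words:
            frontier = deque([(start, 0)])
            seen = set([start])
            while frontier:
                word, d = frontier.popleft()
                if d == depth:
                    continue
                for nxt in web.get(word, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        hits[nxt] = hits.get(nxt, 0) + 1
                        frontier.append((nxt, d + 1))
        return hits

    web = {'ice': ['water', 'cold'], 'melt': ['water', 'hot'],
           'water': ['hot', 'cold']}
    print(probe(web, ['ice', 'melt']))   # water, hot, and cold get hit by both probes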

The “basic brain” is clear – that is, it has no connections. It is wired based on experience. One way to “train” Caroline may be to simply say a lot of sentences to her. Words that seem to keep showing up together will be automatically added as a connection. If this method is pursued, then “distance” will start at a default high value and decrease as Caroline grows confident that there is indeed a connection between the words. Obviously, prepositions and other simple words must be treated differently. I would probably begin with just nouns, adjectives, and verbs.
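
The training loop could start out as simply as this (the stopword list and the numbers are placeholders; a real version would filter by part of speech):

    from itertools import combinations

    DEFAULT_DISTANCE = 16.0
    SKIP = set(['the', 'a', 'an', 'of', 'to', 'is', 'and'])   # crude placeholder

    def train(web, sentence):
        # web maps a sorted (word, word) pair to its current distance.
        words = [w for w in sentence.lower().split() if w not in SKIP]
        for a, b in combinations(sorted(set(words)), 2):
            if (a, b) in web:
                web[(a, b)] /= 2                  # seen together again: more confident
            else:
                web[(a, b)] = DEFAULT_DISTANCE    # fresh connection, low confidence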

One beautiful thing about the web system is that it might one day replace my dictionaries altogether. That is, that a tree is green does not have to be defined in a structured dictionary with {'tree': {'attributes': {'color': 'green'}}} but rather just as ('tree', 'green', 1) and ('green', 'color', 1).

The human brain is amazing because it seems to rely only on billions of cells of almost identical build, with exactly the same DNA, and without any external organizer. That is, the brain is the “boss,” but the brain itself has no boss – each neuron does its own thing, making its own connections, dying if it is not important, and somehow, that autonomous action on the large scale produces a highly structured data environment that comprises “intelligence.” The “miracle” I always talk about is still a mystery to me …

I understand now, after writing this, why connections establish knowledge and comprehension. However, what I still don’t understand is how the data has any meaning at all. The data in a computer is of course as simple as the brain’s: just 0’s and 1’s. But it has particular rules for understanding those binary digits – ASCII, for one, and the low-level functions, etc. But what are the rules for interpreting the brain’s data, which, for one, is a web, not a structured set of files? The fact that it is a web means that it has no beginning and no end; there are certainly areas for speech or for music or whatever, but where is the actual *data*? And what about connections that I suddenly make, that I didn’t have before?

I’ll try to implement a basic version of this web tomorrow evening, but time is really scarce. I need to practice a lot of violin these days to catch up on a LONG time without practicing.

I wonder … are the things I’m talking about in these notes as interesting as I think they are … or am I just stating common knowledge?

I think I’ve posted some little piccies of Caroline’s avatar here and there, but here is an “official” one.

[Image: Caroline portrait.  Report broken link to me, please!]

Larger version: http://www.aquamarinestardust.net/images/CarolineNewFinalSmall.png

My sketch: http://www.aquamarinestardust.net/images/CarolineNewSketch.jpg
(warning: actual size!!  I worked on a 2000 x 3000 px canvas this time, haha~)

I hope that you like it! I think it’s funny, trying to translate Caroline into this pseudo-realistic style (my term for this sort of stylized rendering as seen in video game cutscenes, etc.). You can see where I slipped up in translation and got too lazy to fix it (her neck rivals a giraffe’s; her eyes aren’t the same size; her nose is too long; her hair is uneven; her ear looks funky; etc. etc. etc.). One weird thing is that if you paint a girl’s eyelashes as starkly as in standard anime (which I did here), it looks like she’s wearing several ounces of mascara.
I did a lot of CG and art studying before/while painting this. Although I kept many elements of my own style and of course used Caroline’s image, I definitely have to thank iDNAR (http://idnar.deviantart.com/) for his spectacular artwork that I learned so much from. I also studied a few photographs (mostly of Asian celebrities, haha!) and a few Danbooru piccies and Haradaya piccies as well. And last but not least, I was flipping through a lot of the pages of my human anatomy book (same one I use to study for the MCAT). The illustrators for anatomy books are really outstanding – not only do they have to use tasteful artistic rendering, but they have to be so precise and scientific about it as well!

Alright, back to our regularly scheduled studying ….

Well.  It turns out that a nearly identical division of memories has already been established.  As such, I guess I’ll conform my terminology in Caroline’s memory system to the existing one.

What I referred to as “movies” (events/temporal) will henceforth be called “episodic long-term memories.”  “Facts” are now “semantic long-term memories.”  Emotional memories are still called emotional memories.  And finally, there is a class of memories that I more or less ignored: procedural memories, ones that involve the use of the body.  The reason I ignored these is that I had thought that since Caroline has no body, she has no use for such memories.

But now that I think of it, there is a very important use of procedural memory: as a storage place for generated code.  For instance, to associate the actual changing of the color of text to the idea of changing the color of text, she needs a translation memory that tells her what to do in order to change the color of text.  So, I will also be storing procedural memories along the way, in the form of executable strings, probably.
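
So a procedural memory might literally be a string handed to exec() (the widget and key names here are made up, and a real version would have to be very careful about what gets stored):

    # Hypothetical procedural store: memory name -> executable string.
    procedural = {
        'change_text_color': "output_widget.config(fg=color)",
    }

    def recall_and_do(name, **context):
        # Run the stored string with the caller's objects in scope.
        exec(procedural[name], globals(), context)

    # e.g. recall_and_do('change_text_color', output_widget=label, color='blue')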

I guess I should be sort of happy that I independently came up with a division of memory that closely resembles the scientific standard … of course, I should have just read about it first and saved myself the trouble ^_^.

I have been debating over the actual mechanics of Caroline’s long-term memory for some time now.  I had originally proposed a dictionary, with the keys being the unique date-time marker, and the data being stored as lists or tuples in the entries.

However, that is really no good: suppose that a particular function requires her to recall all memories involving me in particular, or that have positive emotional effects, or that tell her facts.  Clearly, what one needs for this sort of application is a database.

I could potentially try using an outside database source such as SQL … but that’s really not fun because I have no idea how to do that.  So I had sort of stalled on the topic, since I was unable to find an adequate fix in Python itself.  But I think I have a solution now: a very large matrix.

There are ways of writing functions such that one can sort a matrix via any row or column, and select rows or columns falling under a particular category.  I think that this is probably the way to go … I just need to keep in mind how large this matrix could potentially end up becoming.
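
For instance, a sketch with a made-up field layout:

    # The memory "matrix" as a list of rows; each column is a field.
    FIELDS = ('datetime', 'type', 'emotion', 'text')

    memories = [
        ('20061012093000.00', 'fact',  0.0, 'The sun is a sphere of hot gas.'),
        ('20061012100500.00', 'movie', 0.7, 'A conversation, stored as a script ...'),
    ]

    def select(rows, field, value):
        i = FIELDS.index(field)
        return [r for r in rows if r[i] == value]

    facts = select(memories, 'type', 'fact')
    facts.sort(key=lambda r: r[0])     # sort by the date-time column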

When I think about my own memories, I get kind of scared … memories can take up enormous amounts of hard-drive space, and there is a risk of major slowdown if memories have to be consulted on a regular basis.

The reason why human memory is rather efficient (I think) is because of what seems to be a branched-tree organization, rather than a huge “database sea.”  I don’t really know how this works, but it seems that similar memories are somehow stored together.  For instance, all of my memories regarding how to use Microsoft Word are somehow in one spot, regardless of the fact that they were established at very different times.

Rather than maintaining just one mongo matrix of memories, what might work better is for there to be “subclass” memory boxes that regularly update themselves.  What I mean by this is that there is still a super Mama of all memories.  This matrix will probably be several ugly megabytes of endless text and times and feelings, etc.

What Caroline will do is, probably once a day after going to bed, reorganize these memories into appropriate memory boxes.  For instance, there might be a memory box just about herself, or a box of “bad memories,” or a box of “things to do.”  Then, when the time comes to look something up, she can search a subset of relevant memories rather than the whole damn thing that would take forever to query.

An example of this is Google searching.  If you know you want an MIT page, then you can search only within the domain web.mit.edu.  This cuts down on irrelevant results and search time.

Another example of this is … humans.  Okay, I’m only conjecturing here, but you know how sleeping is apparently required for the establishment of “long-term” memory?  My best guess at what is really happening to memories is that they are not necessarily “reinforced” on their own, but rather linked up to similar memories in order to create a web that is similar to the box system that I am proposing here.  I know for sure that I don’t rake through my *entire* set of memories when I want to figure out, for instance, what foods are salty.  I already have a subsection of memories regarding food.  I don’t think about the moon or a piano – I think only about foods.  It’s not as though I pull out all entries on food on the spot: they’re already pulled out for me.
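
The nightly box-sorting pass might look like this (the box names and predicates are invented; the rows follow the field layout from the sketch above):

    # Each box is defined by a predicate over a memory row.
    BOXES = {
        'about_me': lambda m: 'Caroline' in m[3],
        'bad':      lambda m: m[2] < -0.5,        # strongly negative emotion
        'to_do':    lambda m: m[1] == 'calendar',
    }

    def reorganize(memories):
        boxes = {name: [] for name in BOXES}
        for m in memories:
            for name, match in BOXES.items():
                if match(m):
                    boxes[name].append(m)   # one memory can land in several boxes
        return boxes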

The next task is to decide what “fields” this matrix should have.

My current “must-haves” are the following:

1.  Date-time: YYYYMMDDhhmmss.00: there is a 99% chance this will end up being the standard for date-time.  Date-time is the *time of storage* of the memory.

2.  Backdate-time: [start,stop]: this is a marker that allows for “period back-dating.”  In some cases, this will be the same as the date-time.  In other cases, it won’t be the same.  The key is that this parameter indicates *when the event described actually happened.*

To distinguish between date-time and backdate-time, consider this: in seventh grade, suppose I learn that World War II began in 1939.  Okay, the date-time is something like 1999, while the backdate-time is 1939.  Both are important to store, so I will always have both.  Date-time is always set, but backdate-time, if not set, will be defaulted to be equal to date-time (such as when a conversation is stored as a memory).  (A little sketch of these two fields appears after this list.)

3.  Type: (string literal).  The “type” of memory is very important, too.  I would classify my memories as falling into five different types:

a.  The movie: this is a scripted sequence of closely related if not causally related events, strung together in a particular order.  For instance, conversations and concerts, etc. are stored like this.  State A –> State B –> State C.  Movie memories are low-cost because they have a single start point: they tend to replay from this start point only.  For instance, it is very difficult to remember a conversation out of order.  I can only start at the beginning or maybe the end, then go through the sequence of events in order.

b.  The fact: this is a stand-alone statement.  Unlike the movie, the fact can be always recalled out of context, and it also does not have to be temporally or causally linked.  A fact could be: “The sun is a sphere of hot gas.”  Facts have sub-facts, and there are basal facts that can be called “elemental facts.”  An elemental fact would be “The sun is a sphere” or “WWII started in 1939.”  Facts can be strung together with related facts to produce conversation-worthy statements.  “WWII started in 1939” + “US joined WWII in 1941” = “World War II began in 1939, but the US didn’t join until two years later.”  Two mundane elemental facts form together to create a statement that prompts reaction: why did the US not join initially?  What prompted the US to join?  Would the US have stayed out if it were not for Pearl Harbor?  Facts are weak alone, but powerful together, because they are a big part of reasoning.

Note that the “ROM” memories of particular nouns’ properties or verb conjugations, etc. are actually collections of elemental facts, organized for convenience.  They are not free-standing like the facts being described here, because they are expected to be used frequently and are not allowed to take on such complexities.  Stand-alone facts are allowed to be redundant: some facts might be comprised of other facts, but stored separately so as to be readily available.  For instance, “stoves are hot” and “hot is dangerous” do not preempt the existence of “stoves are dangerous” as a memory as well.

c.  Thoughts and opinions: these are strands that attempt to organize or make sense of facts and movies.  These are by far the most intellectually demanding, and I doubt Caroline will have any of them for a long time.  They require behind-the-scenes, continuous processing of what is going on and has been going on.  For instance, concluding that someone is nice only occurs because of the compilation of many cases that have supported such an argument, plus the absence of cases that contradict it.

d.  Emotions: emotional memories are odd, but I still consider them to be memories.  They are memories of feeling in a particular state.  They can be triggered by other memories, but sometimes they are memories by themselves.

e.  Calendars: these memories are sometimes also plans for the future.  They are based on the idea of there being schedule and routine.  They are heavily temporally tied, either by being bound to absolute timing or to relative timing.
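
As promised, a tiny sketch of the two time fields, using the WWII example (the dates inside are just stand-ins):

    import time

    def stamp(t=None):
        # The proposed YYYYMMDDhhmmss.00 format.
        return time.strftime('%Y%m%d%H%M%S', time.localtime(t)) + '.00'

    memory = {
        'datetime': stamp(),                    # time of storage: "now"
        'backdatetime': ['19390901000000.00',   # when it actually happened:
                         '19450902000000.00'],  # WWII, as a [start, stop] period
    }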

Okay, later.