\( \newcommand\Der{\mathrm{D}} \newcommand\dif{\mathrm{d}} \newcommand\Pmf{\mathrm{p}}% probability mass function \newcommand\Prm{\mathrm{P}}% probability measure \)

1 Agent

See also the Wikipedia article on intelligent agents.

1.1 Stateless agent

Agent input \(x : \Real \to Input\).

Agent output \(y : \Real \to Output\).

Constant behavior \( f \) such that \[ y(t) = f(x(t)) \]

Also called "reflex agent"?

1.2 Stateful agent

Is the agent's memory part of the agent or part of the environment?

State \(m\).

Constant behavior \(f\) such that \[ (m(t+\dif t),y(t)) = f(m(t),x(t)) \]
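
A minimal Python sketch of one such stateful step, with a toy running-sum memory and a toy thresholded output; `step`, the memory rule, and the output rule are illustrative assumptions, not part of the model above.

  # Stateful agent: a constant behavior f mapping (memory, input) to (memory', output).
  # The running-sum memory and the threshold output below are illustrative choices.
  def step(m, x):
      m_next = m + x                        # m(t + dt)
      y = 1.0 if m_next > 0 else 0.0        # y(t)
      return m_next, y

  m = 0.0                                   # initial memory
  for x in [0.5, -1.0, 2.0]:                # input stream x(t)
      m, y = step(m, x)
      print(m, y)

A stateless agent is the special case in which `step` ignores and does not change `m`.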

1.3 Vectorial

\( Input = \Real^m \)

\( Output = \Real^n \)

1.4 Logical, and perhaps probabilistic

Given data/observations/experiments \( x_1 \wedge \ldots \wedge x_n \), infer the simplest hypothesis \( y \) such that \( y \to (x_1 \wedge \ldots \wedge x_n) \).

Example of generalization: \( p(a) \to \forall x [p(x)] \).

1.5 Continuous agent

Input: \( x(t) \).

Output: \( y(t) \).

Internal: \( y(t) = f(t,x(t)) \).

1.6 Input-output model?

An agent has input and output.

An agent logic is a function \(A : M \times I \to M \times O\) where \(M\) is the memory type, \(I\) is the input type, and \(O\) is the output type.

We assume that the world remembers the agent's memory.

1.7 Discrete dynamical system model?

Let \(w\) be a world.

Let \(a\) be an agent in world \(w\).

Let \(x~t\) be the input of the agent at time \(t\).

Let \(y~t\) be the output of the agent at time \(t\).

Let \(m~t\) be the memory of the agent at time \(t\).

We assume that the agent needs one time step to compute the output. \[ \begin{aligned} y~(t+1) &= Y~(x~t)~(m~t)~t \\ m~(t+1) &= M~(x~t)~(m~t)~t \\ x~(t+1) &= X~(x~t)~(y~t)~t \end{aligned} \]
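
A minimal Python sketch of this coupled recurrence; the concrete update rules Y, M, X below are placeholders chosen only to make the loop runnable.

  # One time step: the new x, y, m are computed from the old x, y, m simultaneously.
  def Y(x, m, t): return x + m            # output rule (placeholder)
  def M(x, m, t): return 0.9 * m + x      # memory rule (placeholder)
  def X(x, y, t): return 0.5 * x - y      # world rule: output feeds back into input (placeholder)

  x, y, m = 1.0, 0.0, 0.0                 # x(0), y(0), m(0)
  for t in range(5):
      x, y, m = X(x, y, t), Y(x, m, t), M(x, m, t)
      print(t + 1, x, y, m)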

1.8 Endofunction?

See file:endo.html (The endofunction model of worlds and agents, and its philosophical implications)?

2 What

Intelligence model

Legg & Hutter AIXI

Adaptation model

Boredom model

Learning model

Knowledge/thought/reasoning model

Feeling model

Agent A feels feeling F with strength S iff …?

Do people become bored when they meditate?

3 Agent intelligence model

We assume that you have read file:endo.html.

Now we define intelligence, but we have to define the required things first.

We define the orbit1 of \(A\) at \(x\) as the sequence \(A^0(x), A^1(x), A^2(x), \ldots\).

We have a function \(judge : (S')^\infty \to \Real\) that judges an orbit.

Now we assume that the states \(x \in S'\) are distributed uniformly. Define \(p(r)\) as the probability of drawing a state \(x\) whose orbit satisfies \(judge(A^0(x), A^1(x), A^2(x), \ldots) \le r\). The shape of the distribution \(p\) describes the intelligence of the agent.

The function \(penalty : S' \to \Real\) defines the undesirability of an agent state. Alternatively, the function \(reward : S' \to \Real\) defines the desirability of an agent state. The function measures how bad or how good the agent performs. This is the agent's hidden objective function. This is hardwired. This is arbitrary. The agent doesn't have to be aware of this. An intelligent agent acts to make its \(penalty(x)\) as close to zero as possible in the long term for as many \(x\) as possible.

The agent displays intelligent behavior if it can minimize the long-term penalty from many starting states. The most intelligent agent is the one that minimizes its lifelong sum of penalty?
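
A hedged sketch of how one could estimate the distribution \(p\): sample start states uniformly, iterate the agent, and judge a finite prefix of each orbit. The dynamics `A`, the `penalty`, the `judge`, and the orbit length are all illustrative assumptions.

  import random

  def A(x):          return 0.5 * x                      # toy endofunction on states
  def penalty(x):    return abs(x)                       # hidden objective: distance from 0
  def judge(states): return sum(penalty(s) for s in states) / len(states)

  def orbit(x, steps=20):                                # finite prefix of the orbit of A at x
      states = [x]
      for _ in range(steps):
          states.append(A(states[-1]))
      return states

  # Estimate p(r) = P[judge(orbit of x) <= r] under uniformly sampled start states.
  samples = [judge(orbit(random.uniform(-10, 10))) for _ in range(10000)]
  def p(r): return sum(s <= r for s in samples) / len(samples)
  print(p(1.0), p(5.0))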

4 Measuring the intelligence of a phase space trajectory?

We can think of a human as a dynamical system. Given two phase space trajectories, the more intelligent one is the more homeostatic, the more stabilizing, the more controlling. (Why?)

5 Self

An agent constructs its self model by correlating its output with the part of its input that is fed back from that output.

Self is the extent of control.

6 Brain?

The brain is an associative machine and a pattern recognizer?

The brain finds associations across space and associations across time, and finds spatial patterns, temporal patterns, and spatiotemporal patterns.

What is the relationship between association and pattern?

"Together" means in a space interval under a few meters and a time interval under a few hundred milliseconds.

Each time agent X observes input E and input F having the same value, X increases the association strength between E and F.

The input is \(X = [x_1 \ldots x_n]\), an \(n\)-dimensional vector; or an infinite vector with finitely many non-zero cells. Each cell is in \([0,1]\).

The correlation matrix is \(M\), an \(m \times n\) matrix, so that \(MX\) has \(m\) cells; or an infinite matrix with finitely many non-zero cells.

The output is \(Y = MX\). Each cell is in \([0,1]\). The output can be interpreted classically as the output itself, or probabilistically, in which case each output cell is the probability of firing the corresponding output actuator.

\[ \begin{aligned} Y_0 &= M_0 X_0 \\ W_{k+1} &= f(W_k, Y_k) \\ X_{k+1} &= g(W_{k+1}) \\ Y_{k+1} &= M_{k+1} X_{k+1} \end{aligned} \] for \(k = 0, 1, 2, \ldots\)

Some of the output feeds back into the input. Some of the input comes from the environment. This enables the agent to understand itself, that is, to build a model of itself, but not necessarily to know itself.

The agent wants to minimize its approximation error Y - M X. That is, the agent wants to build an accurate self model. The question is: How should it update its M for the next time step?
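
One possible answer, offered as an assumption rather than as the text's answer, is an online least-mean-squares (delta-rule) update that nudges \(M\) toward reproducing the feedback the agent actually observes. The ground-truth matrix `M_true` and the learning rate are illustrative.

  import numpy as np

  rng = np.random.default_rng(0)
  n, m = 4, 3
  M = np.zeros((m, n))                       # the agent's self model, to be learned
  M_true = rng.normal(size=(m, n))           # stand-in for the real feedback channel
  lr = 0.1                                   # learning rate (assumption)

  for _ in range(1000):
      x = rng.uniform(0, 1, size=n)          # input cells in [0, 1]
      y_obs = M_true @ x                     # feedback the agent observes
      y_pred = M @ x                         # the agent's prediction
      M += lr * np.outer(y_obs - y_pred, x)  # delta-rule step on the approximation error

  print(np.abs(M - M_true).max())            # small: the self model has become accurate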

I think the key to making an AI is to make a machine that can establish association across spacetime.

The brain does not understand causality. It only understands association. The mind understands mathematics and causality.

7 AI/ML taxonomy?

7.1 What should the categories be?

Artificial intelligence is constrained optimization.

Generative vs discriminative.

The type of an expert system is \(Facts \to Query \to Answer\). Decision tree. Linearized decision tree.

A learning algorithm is stable iff its generalization error is bounded.

7.2 Functions in intelligence models

Some types of functions related to learning:

  domain | codomain | name
  \(\{0,1\}^*\) | \(\{0,1\}^*\) | compression (if bijective)
  \(\{0,1\}^*\) | \(\{0,1\}\) | decider
  \([0,1]^n\) | \([0,1]\) | neuron
  \(E\) | \(C\) finite | classification (if surjective)
  \(E\) finite | \(C\) finite | discrete classification (if surjective)
  \(E\) finite | \(C\) of size 2 | discrete binary classification (if surjective)
  \(\Nat\) | \(E\) | sequence
  \(E^n\) | \(E\) | stateless next-value predictor with lag \(n\)
  \(C \times E^n\) | \(C \times E\) | stateful next-value predictor with lag \(n\)

7.3 Hyperplane classifier

Let \(h\) be a hyperplane.

Define \(m : \Real^\infty \to \{0,1\}\), the hard linear binary classifier of \(h\), as \(m~x = [h~x \ge 0]\) where \([x]\) is 1 iff \(x\) is true or 0 iff \(x\) is false.

Soft classifier: define \(m~x = \tanh~(h~x)\), which squashes the raw score into \((-1,1)\).
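
A minimal sketch of the hard and soft classifiers, representing the hyperplane \(h\) by a weight vector and an offset (an assumption; the text leaves \(h\) abstract).

  import numpy as np

  w = np.array([1.0, -2.0])                   # normal vector of the hyperplane (illustrative)
  b = 0.5                                     # offset (illustrative)

  def h(x):    return w @ x + b               # signed score of x
  def hard(x): return 1 if h(x) >= 0 else 0   # hard classifier: [h x >= 0]
  def soft(x): return np.tanh(h(x))           # soft classifier: score squashed into (-1, 1)

  x = np.array([2.0, 0.3])
  print(hard(x), soft(x))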

7.4 Support vector machine

A training point \(x\) is a support of \(h\) iff it is the closest point to \(h\) among all points in the class of \(x\).

Alternative formulation: An upper level is a hyperplane \(h_u\) such that \(\forall a \in U : h_u~a > 0\). A lower level is a hyperplane \(h_l\) such that \(\forall b \in L : h_l~b < 0\). Let \(h_u\) and \(h_l\) be parallel. Maximize the distance between \(h_u\) and \(h_l\). Then \(h_u\) is the upper margin and \(h_l\) is the lower margin. Define \(h\) as the hyperplane exactly between \(h_u\) and \(h_l\).

Define \(m : \Real^\infty \to \{0,1\}\), the support vector machine (SVM) of \(h\), as \(m~x = [h~x \ge 0]\). Such SVM is a binary classifier.
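
A hedged usage sketch with scikit-learn's linear-kernel SVC on toy data; the point clouds and the large C (to approximate the hard margin) are illustrative choices.

  import numpy as np
  from sklearn.svm import SVC

  rng = np.random.default_rng(0)
  upper = rng.normal(loc=[2, 2], scale=0.3, size=(20, 2))    # class U
  lower = rng.normal(loc=[-2, -2], scale=0.3, size=(20, 2))  # class L
  X = np.vstack([upper, lower])
  y = np.array([1] * 20 + [0] * 20)

  clf = SVC(kernel="linear", C=1e6)          # large C approximates the hard-margin SVM
  clf.fit(X, y)
  print(clf.support_vectors_)                # the supports: closest training points to the hyperplane
  print(clf.predict([[1.5, 1.0], [-1.0, -2.5]]))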

8 Ramble?

Intelligence is an ordering (2018-04-26). This idea goes back at least to 2004 in [2]. Intelligence is an ordering of systems. An order is a transitive antisymmetric relation.

How do we decide which of system \(A\) and system \(B\) is more intelligent in task \(T\)?

Let \(T(A)\) denote how well system \(A\) does task \(T\). This is a number. Higher is better. We can invent any measurement. Our definition of "intelligence" is only as good as this measurement.

We say "\(A\) is \(T\)-better than \(B\)" iff \(T(A) > T(B)\).

Let \(S\) be a set of tasks.

We say "\(A\) \(S\)-dominates \(B\)" iff \(T(A) > T(B)\) for every task \(T \in S\).

We define "to be more \(S\)-intelligent than" to mean "to \(S\)-dominate".

The \(S\)-domination relation forms a strict partial order (irreflexive and transitive) of all systems.

Example: Which is more intelligent, a dog or a rock? That depends on the task. It's the rock if the task is to sit still. It's the dog if the task is to move around.
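
A small sketch of the \(S\)-domination check; the task scores for the dog and the rock are made up for illustration.

  # T(A): how well system A does task T; higher is better (made-up numbers).
  scores = {
      "dog":  {"sit_still": 0.2, "move_around": 0.9},
      "rock": {"sit_still": 1.0, "move_around": 0.0},
  }

  def dominates(a, b, S):
      """a S-dominates b iff a scores strictly higher on every task in S."""
      return all(scores[a][t] > scores[b][t] for t in S)

  print(dominates("dog", "rock", {"move_around"}))               # True
  print(dominates("dog", "rock", {"sit_still", "move_around"}))  # False: neither dominates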

Intelligence is function optimization (2018-04-27). Let \(g\) be a goal function. A system's \(g\)-intelligence is how well it optimizes \(g\). What is "how well"? Optimization (extremization) is either minimization or maximization.

Intelligence is what?

Intelligence is a spectrum. Is a human intelligent? Is a rock intelligent? A human is more intelligent than a rock. Is a human pretending to be a rock intelligent?

Can an intelligent system look non-intelligent (hide its intelligence)?

We can measure intelligence as numbers.

Adapting needs learning.

We say X adapts to Y iff Y surprises X less as time goes by. (Whose idea is this?)

Intelligence needs state. State needs time. Intelligence is control. An intelligent system is a special case of control system.

Intelligence relative to something is a real number.

Is a company, which consists of intelligent people, intelligent?

Alan Turing proposed the Turing test.

I think we use the word 'intelligence' to refer to a stabilizing behavior that is complex enough to elude a simple explanation.

I think we agree that we are intelligent.

We cannot know if something is intrinsically intelligent. We can only determine intelligence from what we can observe.

How do we determine how intelligent something is? An intelligent being may elude detection by pretending to be unintelligent.

What is a mathematical theory of intelligence?

(RAMBLE; DELETE)

Here I try an alternative formalization to [4].

Let \(E\) be a set of environments.

Let \(G : E \to \Real\) be a goal function. The value of \(G(e)\) measures how well the agent performs in environment \(e\).

The intelligence of the agent with respect to \(G\) across \(E\) is \(\int_E G\).
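
A hedged Monte Carlo sketch of \(\int_E G\); the environment space and the goal function are illustrative placeholders.

  import random

  def sample_environment():                 # illustrative E: a difficulty in [0, 1]
      return random.uniform(0.0, 1.0)

  def G(e):                                 # illustrative goal: performance in environment e
      return 1.0 - e ** 2

  N = 100000
  estimate = sum(G(sample_environment()) for _ in range(N)) / N
  print(estimate)                           # close to the exact integral 2/3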

A performance consists of an agent and an environment.

Assumption: The agent cannot modify \(G\).

Behavior is a function taking an environment and outputting something.

Intelligence is relative to \(G\) and \(E\): goal and environment.

If we see longevity as an intelligence test, then an illiterate farmer who lives to 80 is more intelligent than a scientist who dies at 20, but a rock that has been there for 100 years would be even more intelligent than the farmer.

If we see money as an intelligence test, then a corrupt politician who steals billions of dollars without getting caught is more intelligent than an honest farmer who only has tens of thousands of dollars.

Gaming the system is a sign of intelligence. It is hard to design a goal function that gives the desired outcome without undesired side effects.

IQ tests are intelligence measures with a small environment set.

Lifespan may be an intelligence measure with a huge environment set.

A human can optimize several goal functions across the same environment set. A human may be asked to clean a floor, to write a report, to run a company, to cook food, and to find the quickest route between home and office, and optimize them all.

Some goal functions for humans may be:

  • Maximize happiness
  • Minimize pain
  • Optimize the level of a chemical in the brain
  • Optimize the time integral of such chemical
  • Maximize the chance of survival

But I don't know the root goal function that explains all those behaviors.

9 <2019-11-27> On adaptive systems and boredom

The law of adaptive systems: "Every adaptive system converges to a state in which all kind of stimulation ceases."2

Corollary: All jobs eventually become boring.

If something excites us, we will eventually get used to it.

But how come we don't get bored of sex? Do we?

10 Required mathematics?

Here I try to learn the minimal amount of functional analysis and approximation theory required for learning theory.

10.1 Assumed background knowledge

I assume that the reader is a Bachelor of Computer Science who graduated in 2011. As of 2018, functional analysis does not seem to be in any computer science curriculum345. The closest thing to functional analysis in such a curriculum seems to be ordinary differential equations.

10.2 Notations

\( \Real \) is the set of all real numbers. If you are a finitist, just think of a set as a predicate: think of \(\Real\) as a predicate such that \( \Real(x) \) is true iff \(x\) is a real number, and then replace the formula \( x \in \Real \) with the formula \( \Real(x) \) in your mind.

\( [0,1] \) is the unit interval. It is the set of every real number between 0 and 1, including 0 and 1. Formally, \( [0,1] = \{ x ~|~ x \in \Real, 0 \le x \le 1 \} \).

\( C(A) \) is the space of every continuous function whose domain is the set \(A\) and whose codomain is \(\Real\). Formally, \( C(A) = \{ f ~|~ f : A \to \Real, ~ f \text{ continuous} \} \).

10.3 In what sequence should I learn?

Here are the easy things. We need to memorize these definitions.

Relation (a domain, a codomain, and a set of pairs). Function (a special kind of relation). Function space.6 Measure.7 Distance or metric.8 Norm. Inner product. Do not confuse measure with metric.

Here are some rather hard things that need some thinking.

Should we think of a matrix as a rectangle containing numbers or as a linear function?

A real-valued function can be seen as a vector.9 "In modern introductory texts to functional analysis, the subject is seen as the study of vector spaces endowed with a topology"10. Why do we adopt this view?

These slides11 (slide 20: Lagrange multipliers are common.)

It would be nice if this Wikipedia article12 related those opaquely-named function spaces instead of just dumbly listing them.

Do we need to know these?

Functional analysis.13 Hilbert spaces. Banach spaces. Compact spaces. Continuity. Smoothness. Differentiability.

Reproducing kernel Hilbert space14 is an application of functional analysis to machine learning.15

10.4 TODO Name this space

Find the name of the space of every function from unit hypercube to unit interval. Find the name of the space \( \{ f ~|~ f : [0,1]^n \to [0,1] \} \). I guess these keywords: embedding, projection. I guess these areas: functional analysis, approximation theory, topology.

Cybenko 1989 [1] uses the notation \(C(I_n)\) to mean the space of every continuous function from \([0,1]^n\) to \(\Real\).16 He refers to [5] for the notations.

From [1]:

  • "a fundamental result in digital signal processing is the fact that digital filters made from unit delays and constant multipliers can approximate any continuous transfer function arbitrarily well."
  • "The main result of this paper is a demonstration of the fact that sums of the form (1) are dense in the space of continuous functions on the unit cube if \(\sigma\) is any continuous sigmoidal function."
  • "In a well-known resolution of Hilbert's 13th problem, Kolmogorov showed" the Kolmogorov representation theorem.17

Best linear approximation[3]?

universal approximation theorem

10.5 What

The phrase "x approximates y" means "x is close to y", which implies distance, which implies metric space.

How close is the approximation? Suppose that the function \(g\) approximates the function \(f\) in interval \(I\). Then:

  • The "approximation error at \(x\)" is \(g(x) - f(x)\).
  • The "maximum absolute error" is \(\max_{x \in I} \abs{g(x) - f(x)}\).

How do we measure the distance between two \(\Real \to \Real\) functions \(f\) and \(g\)? There are several ways. Which should we use?

  • The maximum norm in interval \(I\) is \(\max_{x \in I} \abs{f(x) - g(x)}\). This norm is also called the uniform norm, supremum norm, Chebyshev norm, infinity norm, norm-infinity, \(L_\infty\)-norm. Why is it called "uniform"? WP:Uniform norm.
  • The squared \(L_2\)-norm: \(\int_{x \in I} [f(x)-g(x)]^2 ~ dx\). Its square root is the \(L_2\)-norm.
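
A small numerical sketch of both distances on a grid, taking \(f = \sin\) and \(g(x) = x\) on \(I = [0, 1]\) as illustrative choices.

  import math

  f = math.sin                              # the function
  g = lambda x: x                           # its approximation near zero

  N = 100000
  xs = [i / N for i in range(N + 1)]
  diffs = [f(x) - g(x) for x in xs]

  max_norm = max(abs(d) for d in diffs)         # uniform / Chebyshev / L_infinity norm
  l2_sq = sum(d * d for d in diffs) / (N + 1)   # approximates the integral of (f - g)^2 over [0, 1]
  print(max_norm, math.sqrt(l2_sq))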

10.6 Courses

10.7 Subfields of approximation theory

  • Classical approximation theory deals with univariate real functions \(\Real \to \Real\).
  • Multivariate approximation theory deals with multivariate real functions \(\Real^m \to \Real^n\).

10.8 Scenarios

  • Suppose we want to approximate the function \(f\), but we don't know the equation for \(f\); we only have a few input-output samples.
    • Can we approximate \(f\)?
    • How do approximation and curve-fitting relate?

10.9 Overview

  • What is a multivariate polynomial?
  • Commonly conflated concepts

10.10 What

10.11 Why are Chebyshev polynomials important?

  • WP:Chebyshev polynomials
    • Why is it important? How does it relate to best approximation?
      • "Chebyshev polynomials are important in approximation theory because the roots of the Chebyshev polynomials of the first kind, which are also called Chebyshev nodes, are used as nodes in polynomial interpolation. The resulting interpolation polynomial minimizes the problem of Runge's phenomenon and provides an approximation that is close to the polynomial of best approximation to a continuous function under the maximum norm."

10.12 Machine learning as relation approximation?

  • Machine learning, statistical modelling, function approximation, and curve fitting are related.
  • Generalize function approximation to relation approximation.
  • A function can be stated as a relation.
  • A relation can be stated as a function.

10.13 Least-square approximation of overdetermined system of linear equations?

  • Consider the least-square solution to an overdetermined system of linear equations. Is such solution a kind of approximation?
    • There is no exact solution to begin with?
    • Why is it called "least-squares approximation"?
    • How can we approximate something that does not exist?
      • 1.2 approximates 1.23. Both 1.2 and 1.23 exist.
      • By contrast, there is no \(X\) such that \(AX = B\).

10.14 Approximation schemes?

10.15 How do we approximate a function?

Is it even possible to approximate arbitrary functions?

10.16 Why do we approximate?

  • Because it is practically inevitable.
    • Fundamental reason: Because computers are finite.
    • Practical reason: Trade-off between computation time and precision.
      • The more error we can afford, the faster we can run.
        • May be related: 2013 monograph "Faster Algorithms via Approximation Theory" pdf

10.17 Approximation by truncation

We can approximate a series by truncating it.

Suppose that the series \(y = x_0 + x_1 + \ldots\) converges.

Suppose that the sequence \(\langle x_0, x_1, \ldots \rangle\) converges to zero.

Pick where to cut. Pick a natural number \(n\).

Then the partial sum \(x_0 + \ldots + x_n\) approximates the series \(y\). We cut its tail. We take finitely many summands from the beginning.

Here come examples: Truncate all the series!

10.17.1 Power series truncation

Below we truncate a power series.

Decimal truncation: \(1.2\) approximates \(1.23\). Remember that a decimal number is a series. For example, the number \(1.23\) is the power series \[ \ldots 01.230 \ldots = \ldots + 0 \cdot 10^1 + 1 \cdot 10^0 + 2 \cdot 10^{-1} + 3 \cdot 10^{-2} + 0 \cdot 10^{-3} + \ldots. \]

Polynomial truncation: \(1 + x\) approximates \(1 + x + x^2\) for \(x\) near zero.

Taylor series truncation: \(1 + x + \frac{x^2}{2}\) approximates \(e^x\) for \(x\) near zero. Remember the Taylor series expansion \(e^x = \sum_{n \in \Nat} \frac{x^n}{n!}\).
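
A quick numerical check of that truncation near zero (illustrative).

  import math

  x = 0.1
  truncated = 1 + x + x**2 / 2              # first three terms of the Taylor series of e^x
  print(truncated, math.exp(x))             # 1.105 vs 1.10517...
  print(abs(truncated - math.exp(x)))       # error is on the order of x^3 / 6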

Below we truncate the ratio of two power series.

Rational truncation: \(12/23\) approximates \(123/234\).

WP:Padé approximation is a truncation of a ratio of series.

Fourier series truncation: The Wikipedia example animates how a Fourier series converges to the sawtooth function as more terms are added.

Digression: Is a (complex) Fourier series a power series? Reminder: a Fourier series looks like \(\sum_{k=-\infty}^{\infty} c_k e^{ikt}\); the one-sided sum \(\sum_{k=0}^{\infty} c_k e^{ikt}\) is a power series in \(e^{it}\).

WP:Laurent series truncation?

  1. Digression: What is an analytic function?

    A function is analytic iff it can be represented by a power series.

    Formally, a function \(f\) is analytic iff for every \(x_0 \in \dom(f)\), there is a power series centered at \(x_0\) that converges to \(f\) in a neighborhood of \(x_0\).

    See also WP:Definition of "analytic function".

    Taylor series expansion is illustrated in the 2015 slides "Taylor Series: Expansions, Approximations and Error" (pdf)

  2. Digression: What is the relationship between polynomial and power series?

    A polynomial is an algebraic expression. It is not a function.

    A power series is a kind of infinite polynomial.

    WP:Formal power series: "A formal power series is a generalization of a polynomial, where the number of terms is allowed to be infinite."

10.17.2 Iteration truncation

Example: Let \(f(x) = 1 + \frac{1}{x}\). Iterating \(f\) and stopping after finitely many steps truncates the continued fraction below.

Continued fraction truncation: We know that \[ 1 + \frac{1}{1 + \frac{1}{1 + \ldots}} = \frac{1 + \sqrt{5}}{2} = \Phi. \] We can truncate that continued fraction to approximate \(\Phi\).
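
A small sketch computing the truncated continued fraction and watching it converge to \(\Phi\).

  import math

  def truncated_phi(depth):
      """Evaluate 1 + 1/(1 + 1/(...)) truncated at the given depth."""
      value = 1.0
      for _ in range(depth):
          value = 1.0 + 1.0 / value
      return value

  phi = (1 + math.sqrt(5)) / 2
  for depth in (1, 5, 10, 20):
      approx = truncated_phi(depth)
      print(depth, approx, abs(approx - phi))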

Seeing those examples makes me wonder whether all approximations are truncation.

10.18 Approximation vs estimation

Differences:

  • Approximation is part of analysis. Estimation is part of statistics.
  • Approximation does not involve sampling. Estimation involves sampling.
  • Epistemology: Approximation converges to a knowable value. Estimation may converge to a possibly unknowable value (the value exists, but it is impractical for us to know what it actually is). Example: we approximate \(\pi\), and we estimate the average height of all living people on Earth.
  • Epistemology: Approximation does not guess. Estimation does.

Similarities:

  • Both have a notion of "error". Approximation has error. Estimation has bias and uncertainty.
  • Both are instances of modeling (simplification).

11 Bibliography

[1] Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems. 2, 4 (1989), 303–314. url: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf>.

[2] Hutter, M. 2004. Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science & Business Media.

[3] Khavinson, S.Y. 1997. Best approximation by linear superpositions (approximate nomography). American Mathematical Society.

[4] Legg, S. and Hutter, M. 2007. Universal intelligence: A definition of machine intelligence. (2007).

[5] Rudin, W. 1973. Functional analysis. McGraw-Hill Series in Higher Mathematics.


  1. "Orbit" is a standard mono-unary algebra terminology.

  2. <2019-11-27> https://en.wikipedia.org/wiki/Adaptive_system

  3. https://functionalcs.github.io/curriculum/

  4. https://www.csd.cs.cmu.edu/academic/undergraduate/bachelors-curriculum-admitted-2017

  5. https://cs.stanford.edu/degrees/ug/Requirements.shtml

  6. https://en.wikipedia.org/wiki/Function_space

  7. https://en.wikipedia.org/wiki/Measure_(mathematics)

  8. https://en.wikipedia.org/wiki/Metric_(mathematics)

  9. https://en.wikipedia.org/wiki/Function_space#In_linear_algebra

  10. https://en.wikipedia.org/wiki/Functional_analysis

  11. https://courses.cs.washington.edu/courses/cse590a/09wi/mathfoundation.pdf

  12. https://en.wikipedia.org/wiki/Function_space#Functional_analysis

  13. https://en.wikipedia.org/wiki/Functional_analysis

  14. https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space

  15. https://www.quora.com/What-are-the-most-notable-applications-of-functional-analysis-to-computer-science

  16. https://math.stackexchange.com/questions/84238/is-there-a-shorthand-or-symbolic-notation-for-differentiable-or-continuous

  17. https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Arnold_representation_theorem