1Content plan?

  • What is the relationship between intelligence, complexity, and compression?
  • What is the "everything is compression" view of intelligence?

What is seed AI? How do we, with minimum effort, bootstrap a machine that will free us forever from work? That is actually two questions in one: what is the hardware, and what is the software?

Learning theory combines many areas of mathematics: approximation theory, computation theory, functional analysis, information theory, probability theory and statistics, and others.

Some people tried something similar? What?12.

Bibliography??? [4] [19] [5] [49]

Algorithmic information theory [20]

What?

2Introduction

2.1Target audience and goal

The target audience is people interested in building intelligent systems that will free us from work.

How do we do that? What do we need? We need a theory that shows that it is possible and practical. Then we need to design the hardware and the software. Then we need to actually build it.

We need a healthy mix of theory and practice: a healthy mix of philosophy, science, and engineering:

  • Philosophers seek the truth.
  • Scientists find the truth about reality.
  • Engineers change reality.

Philosophers ask questions that advance science and engineering.

Scientists craft falsifiable theories and do theory-falsifying experiments. These experiments discover some truth about reality. This truth gives the philosophers clues about what questions to ask next.

Engineers build things based on philosophy and science. However, reality differs from theory, and the engineers always have to compromise in order to build anything at all.

2.2Linguistic conventions

In this document, "thon" is the gender-neutral third-person pronoun. This brilliant idea was proposed by Charles Crozat Converse in 1858.4 If you think this is arbitrary, every word in English is arbitrary. Why should a word invented in 1858 be treated differently from a word invented in 1500 or 2000? Why do we readily accept "bromance" and "twerk" but not "thon"? English is a messy living language. (What living natural language isn't?) There is no particular reason why "eat" had to be "eat" and not "eet" or "nomm". The important thing is that we agree on the meanings of the symbols, not the details of the shape of the symbols.

2.3Always begin with analytic philosophy

We always begin our inquiry with some analytic philosophy because we have to understand what words mean. We need some philosophy. Too little, and we're aimless. Too much, and we get lost in linguistic masturbation.

Analytic philosophy is the careful usage of words. By "analysis", we mean the following. First we find what a word means, using an etymology dictionary. Then we infer what that meaning implies, using only logic and language. An example of analysis is inferring that "bachelor" implies "unmarried" and "wifeless".

3AI, ML, a priori, a posteriori, analytic, synthetic, language

Rule-based systems are for knowledge that can be put into language.

Example: "Is a bachelor unmarried?" "Does this person have the flu?"

ML is for knowledge that can only be sensed and not put into language, knowledge that can only be obtained by first-person experience.

Example: "Is this a human face?" "Are there cats in this photo?" "Are there coyote sounds in this recording?"

4Intelligence

4.1What is intelligence?

As of 2018 there is still no firm agreement on what "intelligence" is.5 [8]

The etymology of "intelligence" is unclear.6 The word "intelligent" might have come from a Latin word that means "to choose between".7 Related words are "intellect" and "intellectual". We think that "stupid" is the opposite of "intelligent". Although we don't know what "intelligence" is, we generally agree that most humans are somewhat "intelligent".

Some definitions give hints for modeling intelligence mathematically. In 1923, Edwin Boring proposed that we start out by defining intelligence as what intelligence tests measure "until further scientific observation allows us to extend the definition".[7] Intelligence is relative to the test that is used to measure it. As of 2018 the most general definition of "intelligence" is Legg and Hutter's 2006 definition: "Intelligence measures an agent's ability to achieve goals in a wide range of environments"[28][25][27]. I think it subsumes all other definitions listed in their 2007 collection of definitions[26]. Legg and Hutter approached intelligence from algorithmic complexity theory (Solomonoff induction).[28] Hutter, Schmidhuber, and their colleagues have used Solomonoff algorithmic probability and Kolmogorov complexity to define a theoretically optimal agent they call AIXI, and they define "universal" and "optimal".89 There are many definitions in psychology, but I ignored them because their anthropocentrism encumbers mathematization.

There are more exotic theories that I have not understood. Warren D. Smith approached intelligence from computational complexity theory (NP-completeness).[43, 44] Alexander Wissner-Gross's causal entropic forces [56]. Tononi's integrated information theory10. Shour 2018 defined intelligence as "a rate of problem solving […]"[41]. Karl Friston's free-energy principle [14, 15]11.

Why did intelligence evolve? From nature's point of view, intelligence is the ability to survive and reproduce under a wide variety of environments (selection pressures). Intelligence evolved because it promotes survival and reproduction. Natural selection chooses intelligence. Intelligent individuals are more likely to survive and breed than unintelligent individuals are.

Microbial intelligence12?

Intelligence is something intrinsic to an individual that promotes its and its descendants' survival and reproduction. Thus we are intelligent because: our recent ancestors were intelligent, and their intelligence helped them survive and reproduce enough to finally beget us.

There is a philosophical treason that we have to commit in order to be able to make progress at all: we have to conflate internal state and external behavior. As of 2018 I still haven't seen how I can write anything without conflating internal state and external behavior. Thus, for progress, I commit the duck-typing13 fallacy: "If it looks intelligent, then it is intelligent." The behavior of a system is whatever it exhibits that can be observed from outside.

4.2Artificial intelligence

AI stands for "artificial intelligence". "Artificial" simply means "made by humans". In the 1950s, AI was whatever McCarthy et al. were doing.141516

What are AI approaches? How are we trying to make an AI? Pedro Domingos [13] categorizes AI approaches into five tribes: symbolists (symbolic logic), connectionists (neural networks), evolutionaries (genetic algorithms), bayesians (statistical learning, probabilistic inference), and analogizers (learning by similarity: nearest neighbors and kernel machines such as support vector machines).

What else could "intelligence" mean? "Intelligent" means smart. "Intelligent" means "does something well"? "Intelligent" means able to survive in wide environments. In politics, intelligence is covert warfare, and is often contrasted against physical power. Chemotaxis is an example of intelligence. Chemotaxis can be modeled as mathematical optimization, as gradient following (ascent or descent). Andrea Schmidt described chemotaxis as a biased random walk.17 The bias is the chemical concentration gradient.

Intelligence is relative: it depends on the goal used to measure it.18

4.3Transferability?

A test only measures how good the subject is at doing that test. What justifies our belief that a high test score implies the ability to do things similar to the test?

4.4What is the relationship between AI and ML?

ML is a subset of AI.

Then what is the rest of AI that is not ML?

4.5Interpretability?

<2018-09-28> Book: "interpretable machine learning"19

2017 survey [6]

WP:Explainable Artificial Intelligence

5Prediction

5.1What is prediction?

"To predict" is to foretell.20 "To predict something" is to say it before it happens.

What is the difference between prediction and guessing? Prediction is justified, whereas guessing is luck. Thus prediction is justified belief whose truth is unknown but likely. Thus prediction (justified belief) is almost knowledge (justified true belief).

What is the difference between prediction and reasoning, if prediction is justified (reasoned) foretelling?

Our prediction is the output of our model of reality.

What do we mean by "predicting economic crisis"? It is easy to predict that there will be an economic crisis, but it is hard to predict when that crisis will happen.

How do we model the prediction of economic crisis?

Prediction is extrapolation. Prediction is uncertain. Prediction is probabilistic. Inferring past values is called retrodiction; reasoning about how the past could have gone differently is called "counterfactual reasoning".

What can be predicted? The next values of a sequence. The previous values of a sequence.

What justifies prediction? Past knowledge. Belief? Revelation?

5.2Next-value prediction of a sequence

Next-value prediction with lag \(n\) is answering "Given \(x_1, \ldots, x_n \in E\), what is the most likely value of \(x_{n+1} \in E\)?" This is finite sequence extrapolation?

A stateless next-value predictor with lag \(n\) is a function \( p : E^n \to E \). "Stateless" is also called "memoryless" or "context-free". The lag is the memory, the lookback. An example of such a predictor is a Markov chain.

A stateful next-value predictor with lag \(n\) is a function \( p : C \times E^n \to C \times E \). The stateless one is just a special case where \(C\) is a singleton set (a set with one element only).

A predictor can predict arbitrarily far by feeding its output back into its input, in the fashion of this recurrence relation:

\[\begin{align*} c_1, x_{n+1} &= p(c_0, x_1, \ldots, x_{n+0}) \\ c_2, x_{n+2} &= p(c_1, x_2, \ldots, x_{n+1}) \\ c_3, x_{n+3} &= p(c_2, x_3, \ldots, x_{n+2}) \\ &\vdots \\ c_k, x_{n+k} &= p(c_{k-1}, x_k, \ldots, x_{n+k-1}) & \text{(the recurrence relation)} \end{align*} \]

Thus a machine implementing that predictor requires memory for storing one element of \(C\) and \(n\) elements of \(E\). Also observe that if \(p\) is primitive-recursive21, then the recurrence relation above is also primitive-recursive.
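Here is a minimal Python sketch of that recurrence. The concrete predictor (a windowed average, with a step counter as the state \(C\)) is only an illustrative assumption; any \(p : C \times E^n \to C \times E\) can be plugged in.

from typing import List, Tuple

State = int       # C: here just a step counter
Element = float   # E

def p(c: State, window: Tuple[Element, ...]) -> Tuple[State, Element]:
    # Toy rule: predict the average of the last n values.
    return c + 1, sum(window) / len(window)

def predict_ahead(p, c0: State, history: List[Element], k: int) -> List[Element]:
    # Apply the recurrence c_j, x_{n+j} = p(c_{j-1}, x_j, ..., x_{n+j-1}).
    xs, n, c = list(history), len(history), c0
    for _ in range(k):
        c, nxt = p(c, tuple(xs[-n:]))   # feed the last n values back in
        xs.append(nxt)
    return xs[n:]                       # the predicted values x_{n+1}, ..., x_{n+k}

print(predict_ahead(p, 0, [1.0, 2.0, 3.0], 4))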

5.3MESS

Finitists? Is \(x\) where \(x_k = k\) a sequence or a description of a sequence? A sequence is finite; for example: \(1,2,3\) is a sequence of length 3. The following is a description of a sequence, not a sequence: \(1,2,3,\ldots\).

Topologically, a predictor is a function whose codomain is a projection22 or an embedding23 of its domain.

A classifier is a predictor with finite codomain.

A feature is a function \(f : A \to \Real\).

A datum (an example) is a tuple \((x,y) \in A \times B\).

A linear predictor is the equation \(y = w \cdot f(x)\) where \(w\) is the weight vector, \(f(x) = (f_1(x),\ldots,f_n(x))\) is the feature vector of \(x\), \(f_k(x)\) is the \(k\)th feature, \(x\) is the input, and \(y\) is the predicted output. The predictor is linear in \(w\).
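A minimal Python sketch of the linear predictor, assuming a hand-picked feature map and weight vector purely for illustration:

def features(x):
    return [1.0, x, x * x]            # f_1(x) = 1, f_2(x) = x, f_3(x) = x^2

def linear_predict(w, x):
    return sum(wk * fk for wk, fk in zip(w, features(x)))   # y = w . f(x)

print(linear_predict([0.5, -1.0, 2.0], 3.0))   # 0.5 - 3.0 + 18.0 = 15.5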

Now, what if the prediction is probabilistic? Every discrete probability space \((\Omega,F,P)\) forms a module24 over \(\Real\). A discrete probability space almost forms a vector space. We use "probabilistic value" ("probval vector field"). The idea is to represent the outcome of a fair coin toss as \( \frac{1}{2}H + \frac{1}{2}T \) where \(H\) represents "head with probability 1" and \(T\) represents "tail with probability 1". Each of \(H\) and \(T\) is a basis vector. We can represent \(H = (0,1)\) and \(T = (1,0)\). If two outcomes are mutually exclusive, then their dot product is zero. The components of a probability vector must sum up to one. Thus \(\frac{1}{2}H\) represents "head with probability 1/2". This is similar to bra-ket notation in Dirac's formulation of quantum mechanics, but we use real probabilities instead of complex probability amplitudes. Name? Probabilistic propositional calculus. Propositional calculus commutative group. Real event module. Probabilistic event module. Belief module.
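A tiny numpy sketch of this probabilistic-value idea, with the basis vectors for head and tail chosen as above (an illustration of the representation only, not a worked-out theory):

import numpy as np

H = np.array([0.0, 1.0])            # "head with probability 1"
T = np.array([1.0, 0.0])            # "tail with probability 1"
fair_coin = 0.5 * H + 0.5 * T       # the fair coin toss
print(fair_coin, fair_coin.sum())   # components sum to one
print(H @ T)                        # 0.0: head and tail are mutually exclusive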

6Planning, simulation, and regret

We identify three meanings of planning:

  • simulation, regret prevention
  • topological ordering of known choices
  • creatively finding a way to achieve goal

Planning has several meanings. To plan is to prepare the future. To plan is to prepare for the future.

"Plan" comes from Latin the "planum" which in 1706 meant "drawing".25 Presumably it means the drawing of something that architects are going to build. Why do architects plan? To prevent expensive mistakes: building is expensive, but drawing is cheap. It is cheap to change the plan, but it is expensive to change the building once it has been built. A 1 man-hour change in the plan may translate to a 100 man-hour change in the building. Thus the essence of planning is not the drawing, but the prevention of future expensive mistakes. Therefore, to plan is to prevent regret. In the 1530s "regret" meant "pain or distress in the mind at something done or left undone"26. Regret is the realization that something could have been done in the past that would improve the present. Regret is the wish to have done something differently in the past. Regret is the wish to change the past.

If we were incapable of regret, would we plan?

Planning is simulation. To plan X is to simulate X in order to actually do X, to clarify X. To simulate X is to construct a mental model of X. A plan is a simulation. A plan is a model of what is to be achieved.

Commit to a plan?

Example usages. Planned parenthood. Unplanned children. Unplanned expenses. Disaster recovery plan. Fighting is planning. "If X happens, we will do Y." "If he right-straights me, I'll dodge right and left-uppercut him."

"I plan to go there."

Planning is future-oriented, future-directed thinking. Planning is a special case of thinking. To plan X is to want X. To plan for X is to be ready for X.

100-year plan

Planning is optimization by permutation?

Given a sequence of actions \(a_1, \ldots, a_n\), permute that sequence to \(a_{p(1)}, \ldots, a_{p(n)}\) in order to minimize \(e(a_{p(1)}, \ldots, a_{p(n)})\), the total effort of carrying out the sequence.
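A minimal brute-force sketch in Python. The effort model (a fixed cost per action plus a switching cost between unlike adjacent actions) is an illustrative assumption:

from itertools import permutations

actions = ["dig", "pour", "dig", "paint"]
base_cost = {"dig": 3.0, "pour": 2.0, "paint": 1.0}

def effort(plan):
    switching = sum(1.0 for a, b in zip(plan, plan[1:]) if a != b)  # tool changes
    return sum(base_cost[a] for a in plan) + switching

best = min(permutations(actions), key=effort)
print(best, effort(best))   # groups the two "dig" actions together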

Planning is anticipation.

Planning can anticipate most but not all things reality may throw at us.

Planning is arranging a set of tasks into a directed acyclic graph and topologically ordering it. Tree? Forest?

7Classification

Classification is surjection and abstraction.

Key ideas:

  • A classification is a surjective function.
  • A classifier is an approximation of a classification.

Relationship between classification and prediction: A classifier tries to predict the class of things in a domain.

Let \(X\) be a set of things that we want to classify.

Let \(N\) be the set of class indexes. We assume that \(N\) is a finite set of some first natural numbers. This set represents class "names".

The class index of \(x\) is \(c(x)\).

A classification is a surjective function \(c : X \to N\). "Surjective" means that there is no empty class (there is no unused class index).

A classifier is an approximation of a classification.

Classification loses details. In a classification from A to B, there are at least as many elements in A as in B, and in any useful classification there are more. Therefore classification is an example of abstraction.

8Compression

Compression is bijection.

A compression is a bijection from strings to strings. Formally, a compression is a bijection \( c : B^* \to B^* \) where \( B = \{0,1\} \) is the alphabet and \(*\) is the Kleene star.

A compression exploits input regularity to shorten likely strings and elongate unlikely strings. We assume that a compression's input strings are narrowly distributed.

An example of compression is a natural language such as English. The word "eat" is shorter than the word "antidisestablishmentarianism" because we are more likely to use the former than the latter.

A "lossy compression" is called an approximation. For example, "JPEG compression" should be called "JPEG approximation". There is no need to invent the phrase "lossy compression".

ML can be used to compress.
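A toy Python illustration of "shorten the likely, elongate the unlikely", restricted to bitstrings of length at most 3. The i.i.d. source with \(P(0)=0.9\) and the shortlex target ordering are assumptions made only for this example:

from itertools import product

N = 3
strings = [""] + ["".join(bits) for n in range(1, N + 1)
                  for bits in product("01", repeat=n)]

def prob(s, p0=0.9):                 # i.i.d. source: '0' is much likelier than '1'
    q = 1.0
    for b in s:
        q *= p0 if b == "0" else 1 - p0
    return q

shortlex = sorted(strings, key=lambda s: (len(s), s))
by_likelihood = sorted(strings, key=lambda s: (-prob(s), len(s), s))
code = dict(zip(by_likelihood, shortlex))   # a bijection on this finite set

print(code["000"])   # '00'  : a likely string gets shorter
print(code["11"])    # '011' : an unlikely string gets longer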

9Belief, language, thought, logic

Prolog seems suitable for parsing and compiling.[9][54]

It is possible to parse limited English in Prolog.27

<2018-12-30> Idea: "Intelligence" is a set of meta-logic rules for updating the knowledge base (internal beliefs) with respect to observations. An agent's total belief is a logic formula in conjunctive normal form.

Relating symbolism and connectionism: How does our internal representation of logic and language arise from neural networks?

9.1Explaining, reasoning, justifying

Given a finite prefix of a sequence, find the most likely program that generates the sequence with that prefix.

Who (Hutter? Legg?) shows us how to answer this with Solomonoff algorithmic probability[45], which is unfortunately incomputable.

TODO read theory of justification[48]28

Natural language processing?

9.2Conjectures about language and logic

Conjectures:

  • Natural languages are just surface syntaxes for first-order logic.

It is straightforward to write a Prolog program that parses some limited English. It is still practical to write a Prolog program that parses some richer English with named entity recognition. Prolog definite-clause grammars make parsing easy.

Other problems:

  • Which information source should the computer trust?
  • How should the computer reconcile conflicting information?

2011 "Natural Language Processing With Prolog in the IBM Watson System" https://www.cs.nmsu.edu/ALP/2011/03/natural-language-processing-with-prolog-in-the-ibm-watson-system/

If IBM Watson is possible, then a personal search assistant should be possible.

9.3Concept spaces, word vectors, concept vectors, bags of words

Let Car represent the concept of car. Let Red represent the concept of red. Let Modify(Car,Red) represent the concept of red car. Then Modify(X,Red) - Modify(Y,Red) = X - Y.

Modify(X,M) - Modify(Y,M) = X - Y.

Literature?

9.4Hume's problem of induction

How do we justify induction? https://en.wikipedia.org/wiki/Rule_of_succession

9.5<2018-12-28> Exception inference, machine doxastic logic, how a machine may update its own beliefs upon encountering counterevidence

Suppose that the machine believes this:

bird(x) -> fly(x)

Suppose that it fully trusts us. Then we tell it this:

bird(penguin)
NOT fly(penguin)

The machine should infer:

THUS there_is_something_i_did_not_know_about(penguin)

A human reasons similarly: "If every bird flies, and penguin is a bird, but penguin doesn't fly, then there is something I didn't know about penguin." In this case, the thing we "didn't know" is the "flightless" predicate.

We can formalize that reasoning into this algorithm:

  1. Suppose that the knowledge base contains rule \(R : a(X) \to b(X)\).
  2. The machine encounters \(X\) such that \(a(X) \wedge \neg b(X)\).
  3. The machine creates a fresh predicate \(E\) that did not already exist in the knowledge base.
  4. The machine changes rule \(R\) to \(R' : a(X) \wedge E(X) \to b(X)\).

This is learning. This relates logic and approximation theory. The formula \(p \wedge q\) approximates \(p \wedge q \wedge r\). The approximation error is 1 clause.

Weakness: this assumes that all existing beliefs are correct. A wrong belief stays there forever.
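A minimal Python sketch of the four-step rule above. The representation (a rule as a set of premise predicates plus a conclusion, fresh predicates generated by a counter) is an assumption chosen just to make the step executable:

import itertools

fresh = (f"exception_{i}" for i in itertools.count(1))

rules = [({"bird"}, "fly")]                 # R : bird(X) -> fly(X)
observed = ("penguin", {"bird"}, "fly")     # bird(penguin) holds, fly(penguin) fails

def handle_counterexample(rules, entity, holds, fails):
    # Steps 2-4: for each rule whose premises hold but whose conclusion fails,
    # invent a fresh predicate E and weaken the rule to a(X) & E(X) -> b(X).
    new_rules = []
    for premises, conclusion in rules:
        if premises <= holds and conclusion == fails:
            e = next(fresh)
            print(f"there is something I did not know about {entity}: {e}")
            new_rules.append((premises | {e}, conclusion))
        else:
            new_rules.append((premises, conclusion))
    return new_rules

rules = handle_counterexample(rules, *observed)
print(rules)   # [({'bird', 'exception_1'}, 'fly')]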

10Ethics?

10.1<2018-12-28> Post-AI ethical concerns

10.1.1Some AI ethics questions

  • What is the problem with Asimov's three laws of robotics?
  • Will the rich monopolize AI?
  • What should we do if everything is free? What should we do if we don't have to work to eat?

Who should a machine trust when there is a conflict of belief?

Trust is discussed in file:social.html. Perhaps it should be refactored.

10.1.2There are only two possible worlds after AI

The optimistic case: Machine does all work. Food is free.

The pessimistic case: Some elites use AI to oppress everyone else.

10.2<2018-12-28> China's misunderstanding of Confucianism implies that China will use AI to oppress dissidents to maintain social order. Should we be concerned?

By "China", I mean the Chinese government.

We all, including China itself, misunderstand Confucius.29

China has always prioritized communal harmony over individual liberty. This is because it misunderstands Confucianism30. China will do everything to maintain social order, even if it means mass-surveilling people3132 and oppressing dissidents33.

The liberal West sees all oppression as evil, but China sees some oppression as necessary for social order. Most international media subscribe to liberal Western ideology. We are observing the same reality through different ideological lenses.

"Confucian values are regarded as incompatible with liberal democracy and are considered to impede democratization."34

Every state on Earth oppresses some groups to some degree for some reason. Nazi Germany, the USSR, Russia, China, the USA, Australia, Denmark, Arab countries, Islamic countries, you name it. All of them oppress some people.

China has been using AI to oppress dissidents since no later than 2010. In 2018 it "is working to combine its 170+ million security cameras with artificial intelligence and facial recognition technology to create a vast surveillance state".35 It is only a matter of time before it perfects that. Should we be concerned?

Is it impossible to maintain social order without government?

10.3<2018-12-28> Antinatalism implies that creating a sentient machine is immoral

It is immoral to force a sentient being to exist.

Humans are smart enough to arrive at antinatalism, but they still fuck and have babies. Nobody seems to give a fuck.

11Literature study?

11.1What publications may interest us?

Recency is important. The study of AI moves quickly.

The 2018 book [31].

The 2015 book [11] does what? Is that a book, or is that a Google Scholar entry error?

The 2016 book [40] is the updated version of the classic undergraduate textbook.

The 1994 introduction by Kearns and Vazirani [22]. The 2007 book [10]36 justifies machine learning with approximation theory? It seems to be very relevant to what I'm trying to do?

1984 paper Valiant PAC learning[53]

Heavy math?

Approximation theory and practice[52]

The 1997 monograph Best linear approximation[23]

The AlexNet paper37

Skimming list? Delete?

What are some expository works in AI?

Surveys, reviews, positions, and expositions?

For AI history, Pamela McCorduck's "Machines Who Think". Also Wikipedia383940.

11.2Conferences

"Approximation Theory and Machine Learning" (Purdue University; September 29 - 30, 2018).41

NIPS conference looks foundational from its proceedings42.

IJCAI seems foundational https://www.ijcai.org/proceedings/2018/

11.3Datasets

ImageNet images of almost everything43

MNIST handwriting

There must already be a website that collects datasets.

11.4People with similar interests?

The authors of [10]: Ding-Xuan ZHOU "research interests include learning theory, wavelet analysis and approximation theory".44

The authors of [57]. Linli Xu45, Martha White46, Dale Schuurmans47.

The people attending the "Approximation Theory and Machine Learning" conference.

Who else? Who are AI/ML researchers, what do they focus on, and what are they doing?

WP AI Portal lists several leading AI researchers.

Does Geoffrey Hinton specialize in image recognition?

Who are the researchers?

How is Pedro Domingos's progress of finding the master algorithm unifying the five tribes?

  • Markov logic network unifies probabilists and logicians.
    • How about the other three tribes?
  • Hume's question: How do we justify generalization? Why does generalization work?
    • Does Wolpert answer that in "no free lunch theorem"?
    • I think induction works because our Universe happens to have a structure that is amenable to induction.
      • If induction doesn't work, and evolution is true, then we would have gone extinct long ago, wouldn't we?
        • What structure is that?

11.6Become an AI researcher?

Where do I begin? How do I begin?

Must we pick an area of interest? Speech recognition? Computer vision? Natural language processing? Speech synthesis?

What is the best place to do AI research? Swiss IDSIA? USA? China? Japan? Korea? Australia? New Zealand?

Where are new results announced?

Other resources?

Corpuses, datasets, training sets: MNIST handwritten digit dataset.

OpenAI. Let an AI learn in an accurate-enough physical simulation, then move it into the real world.

OpenCog http://opencog.org/about/

11.7Statistics?

Correlation hints at causation.

Mathematical-Statistical Learning Theory https://ocw.mit.edu/courses/mathematics/18-657-mathematics-of-machine-learning-fall-2015/

https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/

Convex Optimization, Boyd & Vandenberghe https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

CMU Statistics

http://www.stat.cmu.edu/~siva/700/main.html

http://www.stat.cmu.edu/~larry/=stat705/

http://www.stat.cmu.edu/~larry/=sml/

http://www.cs.cmu.edu/~10702/

https://www.stat.berkeley.edu/~statlearning/publications/index.html

https://github.com/bblais/Statistical-Inference-for-Everyone

https://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair

Bayesian Updating http://statweb.stanford.edu/~serban/116/bayes.pdf

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Mathematics

11.8Should I read these?

11.9AI approaches

  • logic, symbolism
  • biology, connectionism
  • probabilistic logic programming

What's trending in 2018??

  • deep learning (DL)
  • generative adversarial network (GAN)
  • long short-term memory (LSTM)

There are two ways to make an "infinite-layer" neural network:

  • recurrent neural network (RNN), similar to an IIR (infinite-impulse-response) filter in signal processing
  • neural ordinary differential equations (NODE), similar to Riemann summation in calculus

How many AI approaches are there? WP AI Portal lists 4 approaches. Pedro Domingos lists 5 "tribes".

12What?

12.1Readings? Undigested information? Delete?

Should we read this?

University courses? For a course with computer science background, see Stanford University CS221 (Artificial Intelligence: Principles and Techniques) Autumn 2016 [29]. For a course with mathematics background, see Massachusetts Institute of Technology 18.657 (Mathematics of Machine Learning) Fall 2015 [39].

Delete?

https://medium.com/deeper-learning/a-glossary-of-deep-learning-9cb6292e087e

Lecture 2 of CS221: Artificial Intelligence: Principles and Techniques

Neural network https://en.wikipedia.org/wiki/Universal_approximation_theorem

Create an AI for automatically finding data from the Internet? Machine-aided human summarization (MAHS) Human-aided machine summarization (HAMS) https://en.wikipedia.org/wiki/Automatic_summarization

Stanford Autumn 2016

Machine Learning, Tom Mitchell, McGraw-Hill http://cs229.stanford.edu/

http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html

Undergraduate Computer Science point of view https://www.cs.princeton.edu/courses/archive/fall16/cos402/

Graduate http://www.cs.cmu.edu/afs/cs/Web/People/15780/

http://homes.cs.washington.edu/~pedrod/

Metric Learning: A Survey http://web.cse.ohio-state.edu/~kulis/pubs/ftml_metric_learning.pdf

Distance Metric Learning: A Comprehensive Survey https://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf

Learning Deep Architectures for AI http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf

https://en.wikipedia.org/wiki/Similarity_learning

Essentials of Machine Learning Algorithms (with Python and R Codes) https://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/

http://www.cs.cmu.edu/~./15381/

http://stanford.edu/~cpiech/cs221/

http://www.cs.princeton.edu/courses/archive/fall15/cos402/

https://grid.cs.gsu.edu/~cscyqz/courses/ai/aiLectures.html

https://www.cs.utexas.edu/users/novak/cs381kcontents.html

https://www.cs.utexas.edu/users/novak/cs343index.html

http://www.cse.unsw.edu.au/~billw/cs9414/notes.html

Why deep learning works

http://www.vision.jhu.edu/tutorials/ICCV15-Tutorial-Math-Deep-Learning-Intro-Rene-Joan.pdf

http://www.vision.jhu.edu/tutorials/ICCV15-Tutorial-Math-Deep-Learning.htm

https://calculatedcontent.com/2015/03/25/why-does-deep-learning-work/

Neural Networks, Manifolds, and Topology http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Deep Learning, NLP, and Representations http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Machine Learning A Probabilistic Perspective Kevin P. Murphy Table of Contents http://www.cs.ubc.ca/~murphyk/MLbook/pml-toc-22may12.pdf

The following is a list of free, open source books on machine learning, statistics, data-mining, etc. https://github.com/josephmisiti/awesome-machine-learning/blob/master/books.md

Undigested information

Selected threads from /r/artificial?

History questions?

  • Why was Raymond J. Solomonoff [16, 46] interested in predicting sequences of bits? What was he interested in? What was he trying to do?

Reading list?

Statistical learning

Inverse problem theory [50]

? [51]

Wiener cybernetics book? [55]

Semi-supervised learning?

What is rational?

Moravec's paradox

Reading list?

Neural Architecture Search with Reinforcement Learning Barret Zoph, Quoc V. Le https://arxiv.org/abs/1611.01578

http://artint.info/html/ArtInt.html

https://en.wikipedia.org/wiki/Book:Machine_Learning_%E2%80%93_The_Complete_Guide

https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Mathematics/Reference_resources

An Automatic Clustering Technique for Optimal Clusters https://arxiv.org/pdf/1109.1068.pdf

https://en.wikipedia.org/wiki/State-Action-Reward-State-Action

http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests:-confidence-intervals-and-confidence-levels

http://greenteapress.com/thinkstats2/html/index.html

https://elitedatascience.com/learn-machine-learning

http://www.mit.edu/~9.520/fall14/Classes/mtheory.html

https://arxiv.org/pdf/1311.4158v5.pdf

Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning?

http://www.stat.yale.edu/Courses/1997-98/101/confint.htm

http://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm

12.1.1Machine learning

Algorithmic Aspects of Machine Learning Matrices http://people.csail.mit.edu/moitra/docs/bookex.pdf

http://www.deeplearningbook.org/

Pedro Domingos: "The Master Algorithm" | Talks at Google - YouTube https://www.youtube.com/watch?v=B8J4uefCQMc slides: https://www.slideshare.net/SessionsEvents/pedro-domingos-professor-university-of-washington-at-mlconf-atl-91815 Grand unified theory of machine learning

A Tour of Machine Learning Algorithms http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Asynchronous Methods for Deep Reinforcement Learning https://arxiv.org/abs/1602.01783

Active learning of inverse models with intrinsically motivated goal exploration in robots (2013) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.278.5254

One-bit compressed sensing by linear programming (2011) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.413.5719

Approximate Clustering without the Approximation http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.222

Fully Automatic Cross-associations (2004) clustering algorithm with no magic numbers http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.67.9951

One and done? Optimal decisions from very few samples (2009) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.211.6874

Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science. (2012) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.259.7600

https://qz.com/1161771/we-looked-at-the-major-scientific-discoveries-from-five-years-ago-to-see-where-they-are-now/

https://inside.com/lists/technically-sentient/recent_issues

6 areas of AI and machine learning to watch closely https://medium.com/@NathanBenaich/6-areas-of-artificial-intelligence-to-watch-closely-673d590aa8aa#.sp7w03rk5

Differentiable neural computers https://deepmind.com/blog/differentiable-neural-computers/

Unlikely reading list? Delete?

  • Survey-like
    • 2006, chapter, "Topics in multivariate approximation theory", pdf available
    • 1982, article, "Topics in multivariate approximation theory", pdf
    • 1986, "Multivariate Approximation Theory: Selected Topics", paywall
  • Theorem
    • 2017, article, "Multivariate polynomial approximation in the hypercube", pdf
  • 2017, article, "Selected open problems in polynomial approximation and potential theory", pdf
  • 2017, article, "High order approximation theory for Banach space valued functions", pdf available
  • Articles summarizing people's works
    • 2017, article, "Michael J.D. Powell's work in approximation theory and optimisation", paywall
    • 2000, article, "Weierstrass and Approximation Theory", paywall
  • 2013, article, "[1312.5540] Emerging problems in approximation theory for the numerical solution of nonlinear PDEs of integrable type", pdf available
  • 1985, article, "Some problems in approximation theory and numerical analysis - IOPscience", pdf available
  • 2011, article, "Experiments on Probabilistic Approximations", pdf

AI is about making something that is as intelligent as a human brain without caring about how the human brain works. Cognitive neuroscience is about how a human brain works.

The brain is a vector function

Machine learning. Machine learning makes machines do things from examples.

What fields does this book depend on?

Computational neuroscience

Intelligence needs the ability to adapt.

Every software system is a state machine.

Curry's Y combinator makes a fixed point equation

What are the limits of intelligence?

Topology [32]

Functional analysis

Dynamical system

Control theory

Fixed point theory

Neurophysiology

Computer science

https://en.wikipedia.org/wiki/Connectionism

https://en.wikipedia.org/wiki/Cybernetics

Biological neuron model https://en.wikipedia.org/wiki/Biological_neuron_model

An introduction to mathematical physiology https://people.maths.ox.ac.uk/fowler/courses/physiol/physiolnotes.pdf

Learning and Transfer of Learning with No Feedback: An Experimental Test Across Games http://repository.cmu.edu/cgi/viewcontent.cgi?article=1040&context=sds

Perceptual learning without feedback in non-stationary contexts: Data and model http://socsci-dev.ss.uci.edu/maplab/webdocs/petrovdosherlu06.pdf

Neural coding https://en.wikipedia.org/wiki/Neural_coding

Pulse-frequency modulation in brain neurons

Reward system

12.2.1Genetic algorithm

A genetic algorithm is an iterated randomized mixing-and-filtering optimization. Generalized genetic algorithm:

  • Let \(\fun{Pop}\) be the population type.
  • Let \(t : \Nat\) be time.
  • Let \(\fun{pop}~t : \fun{Pop}\) be the population at time \(t\).
  • Let \(\fun{fit}: \fun{Pop}\to \fun{Pop}\) be the fitness filter, a.k.a. the selection function or selection pressure function.
  • Let \(\fun{mate}: \fun{Pop}\to \fun{Pop}\) be the next-population function, including mutation, birth, death, and mating.
  • Let \(\fun{sur}~(t+1) = \fun{fit}~(\fun{pop}~t)\) be the survivor set at time \(t+1\).

The algorithm is the equation \(\forall t \in \Nat : \fun{pop}~(t+1) = \fun{sur}~(t+1) + \fun{mate}~(\fun{pop}~t)\). Observe the sequence of populations \(\fun{pop}~0, \ldots, \fun{pop}~t\). A genetic algorithm, as an iterated search algorithm, is a mono-unary algebra. A genetic algorithm is like tree search: the mating function is the fringe function. A genetic algorithm is a stochastic process. A genetic algorithm takes a filtering algorithm and a mating algorithm and produces a search algorithm.
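A minimal Python sketch of the recurrence \(\fun{pop}~(t+1) = \fun{sur}~(t+1) + \fun{mate}~(\fun{pop}~t)\). The encoding (bitstrings), the fitness (number of ones), and the operators are illustrative assumptions, not part of the general definition:

import random

def fit(pop):                        # selection pressure: keep the better half
    return sorted(pop, key=sum, reverse=True)[: len(pop) // 2]

def mate(pop):                       # crossover + point mutation
    children = []
    for _ in range(len(pop) // 2):
        a, b = random.sample(pop, 2)
        cut = random.randrange(1, len(a))
        child = list(a[:cut] + b[cut:])
        i = random.randrange(len(child))
        child[i] = 1 - child[i]
        children.append(tuple(child))
    return children

pop = [tuple(random.randint(0, 1) for _ in range(16)) for _ in range(20)]
for t in range(30):
    pop = fit(pop) + mate(pop)       # the recurrence over generations
print(max(sum(ind) for ind in pop))  # best fitness found (16 is optimal)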

Simulated annealing.

Randomized search algorithm.

12.2.2The brain at a time is a big array function.

Can we formulate it in a way that does not depend on linear time?

12.2.3Control and consciousness require feedback

Control needs feedback. There is also open-loop or feed-forward control, but complex control needs feedback.

Consciousness needs feedback. Consciousness needs sensory input. Self concept needs feedback. If there is no feedback, a system cannot distinguish itself from its environment. The self concept will never arise.

If a brain can immediately control a thing, then that thing is part of the brain's self concept. If the brain can't, it's not.

If a brain often gets certain input shortly after it produces certain output, it will associate the output with its self concept.

The self is the thing under conscious control.

12.2.4Phase-space learning?

There is a boundary between the agent and the environment. How many functions do we need to model it?

One function that is an endofunction of phase space. The agent state is a subspace of that phase space. The environment state is another subspace of that phase space.

The idea is to represent how the phase space changes in a small time step. The number of variables should equal the number of degrees of freedom of the system.

12.2.5What are the ways of describing a system?

  • function from time to state

  • endofunction of phase space

State space and phase space are the same idea; the term "state space" is usually used for discrete systems and "phase space" for continuous systems.

12.2.6Supervised to unsupervised

Can a supervised learning algorithm always be made into an unsupervised learning algorithm?

12.2.7Approximation to optimization

Can an approximation scheme always be made into an optimization scheme?

12.2.8Optimal clustering

Given a set of points, what is the optimal clustering/partition?

12.2.9Optimal approximation

Given a set of points \(\{(x_1,y_1),\ldots,(x_n,y_n)\}\) (samples of a function), what is the function that optimally approximates those samples? The approximation error is \(\sum_k |y_k - f(x_k)|\). Let \(F\) be the set of all integrable real-to-real functions. Define \(M(f) = \int_\Real f\) as the infinite integral of \(f\). Define the complexity of \(f\) as \(C(f) = \sum_{k=1}^\infty M(D^k f)\) where \(D^k\) is \(k\)-fold differentiation.
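A minimal numpy sketch of one concrete instance: among polynomials of degree at most 2, pick the one that minimizes the summed squared error on the samples. The family, the squared-error criterion, and the data are assumptions; the complexity penalty \(C(f)\) above is not implemented:

import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([1.0, 2.1, 4.9, 9.2, 16.8])      # roughly y = x^2 + 1

coeffs = np.polyfit(xs, ys, deg=2)             # least-squares fit over the family
f = np.poly1d(coeffs)
error = np.sum((ys - f(xs)) ** 2)
print(coeffs, error)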

12.2.10Meta-approximation

Given set of points \(D = \{(x_1,y_1),\ldots,(x_n,y_n)\}\), find \(g\) that finds \(f\) that approximates \(D\).

Let \(F\) be the set of all real-to-real functions. Can we craft a measure on \(F\)? Can we craft a probability measure on \(F\)? Can we craft a universal prior for \(F\) like Solomonoff did for bitstrings?

What is the best way to update the approximator using the approximation error?

12.3Independent scholar? Citizen science?

13System models? TODO clean up system.org

This chapter defines system. Later chapters discuss interesting systems. We classify systems, hoping to gain some insight. We can classify systems into two big classes: time-dependent (time-variant, temporal) and time-independent (time-invariant, atemporal).

13.1What is a system?

We define a system as an input, a state, and an output. The input is \(x\), the output is \(y\), and an equation relates them. The state is implied by the equation. Such an equation can be written \(f~y~x = y\). The equation can be quite arbitrary. The terms \(f,x,y\) may appear on both sides of the equation.

A system is \((x,y,f)\) where \(y=f~y~x\).

An invariant of a system is a property that stays the same throughout the evolution of the system.

The behavior of a system is its output, especially the observable part of the output.

Composition

Continuous system

Discrete system

Finite system

An embedded system is a system in another system. The outer system feeds the inner system's output back to the inner system's input, possibly with some change.

Don't confuse this with embedded systems in computer engineering.

How do we measure system complexity?

13.2Ignoring degenerate feedback: feedforward

Every function \(f\) is a special case of the general feedback equation \(f(x) = g(f,x)\): take \(g(f,x) = f(x)\), which simply applies \(f\) and ignores the feedback. This suggests that feedforward is a degenerate case of feedback. To simplify the writing, from this point on, we always assume that a feedback is non-degenerate unless written otherwise.

13.3Finding feedback: the inverse fixed point problem

Given \(f\), find a \(g\) such that \(f(x) = g(f,x)\) and \(g\) is not an identity function.

The forward fixed point problem: Given \(f\), find an \(x\) such that \(x=f(x)\).

The inverse fixed point problem is "Given \(x\), find an \(f\) such that \(x = f(x)\) and \(f\) is not an identity function." This problem arises when we want to determine if \(f\) has a feedback.

Example of non-feedback: linear functions. Consider a function of the form \(f(x) = a \cdot x + b\) where \(a\) and \(b\) are non-zero constants. The only \(g\) that satisfies \(g(f) = f\) is the identity function \(g(x)=x\).

Example of feedback: functional equation. Consider a function of the form \(f(x) = x \cdot f(x-1)\).

Recursive functions are special cases of feedback. Example: searching in a list. \(f(\mathrm{Nil},e) = 0\). \(f(\mathrm{Cons}~h~t,e) = (h \equiv e) \vee f(t,e)\). Here \(g\) is the Y combinator.

We have a problem: there are infinitely many wildly discontinuous functions satisfying that. We want smooth functions.
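A Python sketch of the list-search example above: the step function never names itself, and a call-by-value fixed-point (Z) combinator plays the role of \(g\), closing \(f = g(f)\):

Z = lambda g: (lambda x: g(lambda *a: x(x)(*a)))(lambda x: g(lambda *a: x(x)(*a)))

def step(member):                    # the non-recursive step
    def f(lst, e):
        if not lst:                  # f(Nil, e)      = 0 (false)
            return False
        h, *t = lst                  # f(Cons h t, e) = (h == e) or f(t, e)
        return h == e or member(t, e)
    return f

member = Z(step)                     # the fixed point: member = step(member)
print(member([3, 1, 4], 4), member([3, 1, 4], 5))   # True False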

13.4Feedback based on differentiability-preserving map

We want a map that preserves differentiability. Formally, given \(f=g(f)\), we want \(g\) to have the property that \(g(f)\) is differentiable if and only if \(f\) is differentiable. Surely if \(f\) is differentiable and \(g\) is differentiable, then \(g(f)\) is also differentiable? Surely if \(f\) is differentiable and \(g\) is a polynomial, then \(g(f)\) is differentiable?

We begin with the generalized differential on a field: \(g~(f+h) = g~f + h \cdot d~g~f\) where \((f + g)~x = f~x + g~x\) and \((f \cdot g)~x = f~x \cdot g~x\). Thus \(h \cdot d~g~f = g~(f+h) - g~f\). This is like computing the gradient of a vector function, but the vector is infinite-dimensional.

Is it time to learn topology? Smooth manifolds?

13.5Measuring feedback

Given a system \(f(x) = g(f,x)\), we're interested in measuring how much feedback it has.

Assume that \(f\) is a vector. We can measure the feedback by measuring \(d_f ~ g\): the differential of \(g\) with respect to \(f\). Using non-standard analysis, we define the gradient \(d f\) as something satisfying \(f(x + h) = f(x) + h \cdot (d f)(x)\) where \(h\) is an infinitesimal.

13.6Linear feedback and function classes

If \(f\) is linear and \(g\) is linear, then \(f \circ g\) is linear. A linear feedback does not add anything interesting to a linear function.

13.7Temporal systems

A temporal system, a time-dependent system, or a time-variant system is a system that depends on time. With time, we can define more interesting systems.

A temporal system is a function whose type is \((T \to X) \to T \to Y\). \[\SysTmp~T~X~Y = (T \to X) \to (T \to Y)\]

We can see a temporal system as a transformation of time functions. \((T \to X) \to (T \to Y)\).

Example: \(f~x~t = (x~t)^2\).

Example: \[f~x~t = x~t + \int_0^t (s - f~x~t) \, ds\].

A temporal system is \((x,y,f,T)\) that satisfies \(\forall t \in T : f~y~x~t\).

13.8First-order system

The previous section talks about second-order systems.

This is a first-order system: \(\SysTmp~T~X~Y = X \to T \to Y\).

\(\SysTmp~T~X~Y = T \to X \to Y\).

There are two points of view: \(d_x~f\) and \(d_t~f\).

First-order system should be more analyzable.

Continuous-time and discrete-time system?

In the above definition, \(T\) is the time type. If \(T = \R\) we call the system continuous-time. If \(T = \N\) we call the system discrete-time.

13.9Chaining temporal systems

We can feed the output of the temporal system \(f\) into the input of the temporal system \(g\); this produces the temporal system \(h\) where \(h~x~t = g~(f~x)~t\), or \(h~x = g~(f~x)\) after eta-conversion, or \(h = g \circ f\). It turns out that system composition is just plain function composition.

13.10Stateless and stateful systems

A system is stateless iff the same input always gives the same output. There is no way to tell apart a system that has state but doesn't use it and a system that really has no state.

A stateless system is a temporal system that satisfies \(\forall t : \forall u: x~t = x~u \implies y~t = y~u\).

In a stateful system, the same input can give different outputs, depending on time.

Why do we define those?

13.11Property

A property of a system is a predicate on systems: \(\SysTmp~T~X~Y \to \{0,1\}\).

13.12Constraint

A constraint of \(S\) is a property of \(S\) that is always true.

13.13Parameter/family

Parameterized system.

\(P \to \SysTmp~T~X~Y\)

System parameter.

Family of systems.

Indexed family of systems.

13.14Measure

Categorical inverse of parameter. (Whatever categorical inverse means.)

From type theory point of view, parametrization is the inverse of measurement.

\(\SysTmp~T~X~Y \to M\)

13.15Temporal measure

\(m : \SysTmp~T~X~Y \to (T \to M)\)

Find \(s\) that minimizes \(m~s~t\) as \(t\) grows.

13.16System space

Like function space. Metric space.

13.17System endofunction

\(\SysTmp~T~X~Y \to \SysTmp~T~X~Y\).

13.18Output-input gradient

\(f : \SysTmp~T~X~Y\)

\(f~(x+h) - f~x = h \cdot d~f~x\) but \(h\) is a function.

\(m\)-adaptivity

\((m~f~(x+h) - m~f~x) / h\)

Reversal: \(\SysTmp~T~X~Y \to \SysTmp~T~Y~X\)

Time-reversible/Time-symmetric: \(f~x~t = f~x~(-t)\)

13.19Minimand

The minimand is the thing that is to be minimized. It's an English word. The minimand of a temporal system is a function that is minimized as time goes by.

Recall that a temporal system has type \((T \to X) \to T \to Y\). A minimand is a function that has type \((T \to X) \to T \to M\).

The function \(g\) is a minimand of a temporal system \(s\) iff \(g~s~t \xrightarrow[t \to \infty]{} 0\).

There's always a trivial minimand: \(g~s~t = 0\).

Does every system have a non-trivial minimand?

13.20Constrained system

Constrained system: a system whose equation is subject to constraints (which can be inequalities). Every system is constrained; the definition requires it. So why bother defining this?

13.21Optimizing system

A system is optimizing iff it optimizes a function. We call this function a goal function. The purpose of the system is to minimize the goal function.

A goal is something that a system wants to reach. This implies that the definition of goal involves time. The goal function is usually hidden.

13.22Purposeful system

We also call a purposeful system an optimizing system.

Purpose requires time.

Let \(x\) be a function of time. Let the equation \(f~x~t = y\) govern the system. Let \(g~x\) be a function of time. The system is purposeful iff \(g~x~t\) approaches zero as \(t\) grows, for some non-trivial \(g\). We say that \(g\) is a purpose or a goal of the system. The goal function may represent the sensed error with respect to a setpoint.

A purposeful system doesn't have to be adaptive. A simple thermostat is purposeful but not adaptive.

13.23How do we measure how well a system serves its purpose?

Measuring how well a system serves its purpose is like measuring the rate of convergence of an approximation scheme.

13.24What is an intelligent system?

Stable system: See stability theory. Lyapunov.

How do we measure how adaptive a system is?

An adaptive system is a system that adapts.

Adaptation implies change.

Adapt means "fit, adjust".

Adaptive with respect to what?

Chaotic system: Small change in input causes large change in output. See chaos theory.

14Are we squinting too hard?

This is speculative. Some topics may be philosophical dead-ends.

14.1<2018-12-28> Approximation theory, software engineering, and almost-correct programs

An incorrect program (a "buggy" program) is an approximation of the intended correct program. The approximation error is the minimum change required to correct the program.

Sometimes a regular expression approximates a context-free grammar. Example: Checking an email address with a regex. Example: Emacs eval-defun.48 Example: Computing line height by returning the constant 1249 is a zeroth-order approximation.

14.2<2018-12-28> Every person on Earth is a Prolog knowledge base.

A good conversation is only a matter of crafting the right query to obtain the gem of knowledge from each person.

14.3<2018-12-28> Idea for writing an academic book

  • Find 100 related research papers, some by citation search.
  • Find an interesting sentence from each paper.
  • Group those 100 sentences into questions that are answered by those sentences.

14.4<2018-12-28> Mind upload, and knowing without learning

Mind uploading enables us to transfer belief without requiring the recipient to learn.

14.5<2018-12-28> Dropping the "artificial"

"Artificial" simply means "man-made".

Being man-made is not a problem in and of itself.

The problem is that we don't understand the consequences of our actions.

Why should we care whether something is man-made?

A man-made helium atom is indistinguishable from a naturally occurring helium atom.

Soylent is man-made. Someone has survived eating only Soylent for a month. The problem with Soylent is not that it is man-made. The problem is that our jaws may shrink if we don't chew.

DDT is man-made. The problem is not that it is man-made. The problem is that it poisons humans. The problem is that we spray it without understanding the consequences.

14.6<2018-12-28> Intelligence has nothing to do with minds

"Intelligent" simply means "good at something".

14.7<2018-12-28> What is so bad about human extinction?

Extinction is not bad in and of itself; it is the suffering that is undesirable. But if we can go extinct without suffering, why shouldn't we go extinct?

14.8<2018-12-28> Guesses

In the future, there are only two kinds of jobs: telling machines to do things, and being told to do things by machines.

14.9Non-prioritized questions

  • What is AI? Why should I care?
    • AI is the way for us to become gods.
  • What is a cyborg?
  • If the human goal function is survival, then why does suicide exist?
    • Evolutionary noise?

https://en.wikipedia.org/wiki/Universal_Darwinism

14.10TODO Making machines work

There are several ways to make machines work: program them, train them, or make them learn. Programming and training produce inflexible machines that cannot do things that they are not programmed or trained for.

14.10.1Delayed signal thought experiment

Imagine that you install something in your brain that delays the signal to your left hand by one second, so your left hand does what you want it to do, but one second after you want it to. Would you still think your left hand is a part of your self?

If a machine does not have any way of sensing touch, even indirectly, then it will never experience touch.

14.10.2Signs of intelligence

Imitation and survival?

Imitation implies intelligence? For \(a\) to be able to imitate \(b\), \(a\) has to have a model of \(b\).

If the only goal is to survive, then wouldn't the best strategy be to make as many copies as possible?

Make copies, as fast as possible, as many as possible.

Arrange for the species to maximize the number of copies that live at the same time.

Make an organism as fit as possible. Make an organism that survives in as many environments as possible, including environments it did not originally evolve in. A sign of intelligence is that the organism can perform well in environments it has never encountered before.

intelligence, learning, self, consciousness, sentience, life, perception, adaptivity, adaptability, adaptation, control, language, thought, feeling, reasoning, discovery, recursion, feedback, computation, computability.

14.10.4Interesting idea

Strategy 1: Given two nouns \(a\) and \(b\), find a verb \(v\) such that the sentence \(a~v~b\) makes sense. Strategy 2: Given two nouns \(a\) and \(b\), pick which of these two sentences make sense: "\(a\) requires \(b\)," or "\(a\) does not require \(b\)."

An early 'intelligence' is chemotaxis. Chemotaxis is a random walk that is biased by the concentration gradient. (Cite?) The deterministic version is a gradient-following algorithm. The goal is to maximize the concentration of an attractant (or minimize the concentration of a repellent) at the location of the cell.
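A minimal one-dimensional Python sketch of chemotaxis as a biased random walk; the concentration field, step sizes, and noise level are illustrative assumptions:

import random

def concentration(x):
    return -(x - 5.0) ** 2                    # attractant peaks at x = 5

def gradient(x, h=1e-3):
    return (concentration(x + h) - concentration(x - h)) / (2 * h)

x = 0.0
for _ in range(1000):
    noise = random.gauss(0.0, 0.5)            # the random walk
    x += 0.01 * gradient(x) + 0.1 * noise     # the bias: follow the gradient
print(round(x, 2))                            # ends near 5.0 on average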

Control system. Homeostasis.

Deduction: Given premises, infer conclusion. Induction: Given a few premises and a conclusion, infer a rule.

Probabilistic logic. Generalize the booleans \(\{0,1\}\) to probabilities in the real unit interval \([0,1]\). Boolean logic is a special case. The rule \(p~(x \wedge y) = \min~(p~x)~(p~y)\) is the fuzzy (Gödel) conjunction; under independence, the probabilistic conjunction is the product \(p~(x \wedge y) = (p~x) \cdot (p~y)\). Fuzzy logic?

"To organize is to create capabilities by intentionally imposing order and structure." [17]

14.10.5Cybernetics

How can we apply systems theory to management? [30]

Ashby's optical mobile homeostat [3] [2]

Braitenberg vehicles

A Gödel machine improves itself. It proves that the improvement it makes indeed makes it better. [47]

http://people.idsia.ch/~juergen/goedelmachine.html

http://people.idsia.ch/~juergen/selfreflection.pdf

http://people.idsia.ch/~juergen/metalearner.html

Sternberg and Salter (1982) wrote that intelligence is "goal-directed adaptive behavior". This suggests that an intelligent system is purposeful and adaptive, in the sense we defined above. https://en.wikipedia.org/wiki/Intelligence#Definitions

Intelligence maximizes future freedom? https://www.ted.com/talks/alex_wissner_gross_a_new_equation_for_intelligence/transcript?language=en#t-121478

[37] [18] [42]

Giulio Tononi, integrated information theory (not to be confused with information integration theory)

Nils J. Nilsson modeled a world and an agent as finite-state machines [36]. He used explicit sense type, action type, and memory type. William Ross Ashby used the phase space of a continuous dynamical system, where time is a real number, to describe an agent's behavior [1].

14.10.6Supervised classification problems

AI shines in supervised classification problems. Machine vision.

Digit recognition is classification problem.

14.10.7Classification involving sequence or time

14.11<2018-12-28> Idea: determining the number of parameters in line fitting, by measuring "effective dimension" of a set of points

The motivation: The effective dimension of a square should be 2. The effective dimension of a line should be 1. The effective dimension of a rectangle should be between 1 and 2.

Let \(P\) be a set of points in \(n\)-dimensional Euclidean space.

Fit the smallest bounding box (hyperrectangle) that contains all those points.

Let \(x_1, \ldots, x_n\) be the lengths of the sides of that box.

Let \(M\) be the length of the longest side of that box. That is, \(M = \max_i x_i\).

Then the effective dimension of the set \(P\) is: \[ dim(P) = \frac{\sum_i x_i}{\max_i x_i} = \frac{\text{the sum of all sides}}{\text{the longest side}} \]

Example: A line in three-dimensional Euclidean space will have an effective dimension near 1.

We can use this to determine the number of parameters in line fitting (linear regression).

Unfortunately we don't know how to compute the minimum bounding box quickly.50

Is this related to principal component analysis?

Pick n arbitrary orthogonal axes. Pick any point in P as the origin. Find the rotation angles that minimize the n-volume of the axis-aligned bounding box. Is this a convex optimization problem? It can be thought of as a sequence of optimization problems, one per angle. Every angle is in [0, pi/2).
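A minimal numpy sketch of the effective dimension, using the axis-aligned bounding box as a stand-in for the minimum bounding box (so it ignores the rotation search sketched above):

import numpy as np

def effective_dimension(points):
    sides = points.max(axis=0) - points.min(axis=0)   # side lengths x_1, ..., x_n
    return float(sides.sum() / sides.max())           # sum of sides / longest side

line = np.c_[np.linspace(0, 1, 100), np.full(100, 0.5), np.zeros(100)]
square = np.random.rand(1000, 2)
print(effective_dimension(line))     # 1.0: an axis-aligned line in 3D
print(effective_dimension(square))   # close to 2.0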

14.12<2018-12-28> Logic, language, and approximation theory

The formula \(p \wedge q\) approximates \(p \wedge q \wedge r\). The approximation error is 1 clause.

Let \(P(X) = a(X) \wedge b(X)\) and \(Q(X) = a(X) \wedge b(X) \wedge c(X)\). The predicate \(P\) approximates the predicate \(Q\). The approximation error is \( P - Q = \{ X ~|~ P(X) \wedge \neg Q(X) \} \).

Example: the concept "bird" approximates the concept "penguin".

14.13Functional analysis, currying, and partial evaluation

Functional analysis should replace "indexed family" and "linear form" with "currying"?

14.14What is "ML scientist"?

If a "scientist" is one who does "science", then an "ML scientist" is one who does "ML science".

Science consists of theoretical science and experimental science. Theoretical science creates falsifiable theories. Experimental science supports or falsifies such theories.

Thus, is "ML science" science?

14.15Philosophy of mind?

A brain is an organ for thinking. But a brain is also an organ for learning.

15More math?

15.1MathExchange vs MathOverflow

MathExchange is "almost everything goes". MathOverflow is for "research-level" questions only.51

15.2Automatic differentiation?

Justin Le, A Purely Functional Typed Approach to Trainable Models

15.3Habituation

  • TODO s/adapt/habituate
  • Let \(f(t,x)\) be the system's response intensity for stimulus intensity \(x\) at time \(t\). We say the system is habituating between the time \(t_1\) and \(t_2\) iff \(f(t_1,x) > f(t_2,x)\) for all stimulus intensity \(x\).
  • "The habituation process is a form of adaptive behavior (or neuroplasticity) that is classified as non-associative learning." https://en.wikipedia.org/wiki/Habituation

15.4Human as a feedback system?

15.4.1Human behavior as a special case of the general feedback equation

Let \(x ~ t\) be the input vector at time \(t\); this vector has billions of elements. The function \(x\) represents the state of all sensors at a given time.

Let \(y ~ t\) be the control vector at time \(t\); this vector is also large.

Let \(z~t\) be the output vector at time \(t\).

The environment feeds back a part of the output to the input. Can the agent determine the response function?

The feedback forms memory, but see "Memory without feedback in a neural network". https://www.ncbi.nlm.nih.gov/pubmed/19249281

15.4.2Hardwiring the concept of time

We can transform a non-temporal behavior \(f~x = y\) into a temporal behavior \(f'~t = y'\)?

15.4.3A brain at a given time is an array function.

A brain at a given time is an array function having type \(\Real^\infty \to \Real^\infty\). Each component of the input array is a signal from a sensor. Each component of the output array goes to an actuator.

Since the brain is finite, all but finitely many components of the input and output arrays must be zero.

15.4.4An array itself is also a function.

An \(E\)-array is a function having type \(\Nat \to E\). The input is an index. The output is the value of the component at that index. Subscripting denotes function application.

15.4.5Each brain has a maximand.

That maximand is a hidden function. The brain always tries to maximize it.

A differential change in the brain tends to increase the maximand. The brain follows the gradient.

15.4.6Consider functions of length-one arrays.

Let \(h\) be a differential change in the brain.

15.4.7How do we relate vector functions and intelligence?

15.4.8How does feedback happen in the brain?

Feedback is due to environment and the physical laws. When we move our hand, we see it, because the light reflected by our hand now reaches our eyes.

The next input depends on the previous input. \[\begin{aligned} y_k &= b~x_k \\ x_{k+1} &= f~x_k~y_k\end{aligned}\]

15.4.9The brain is a recurrence relation.

This pictures the brain as a parallel dataflow computer with a clock period of a few milliseconds (the timescale of neural signaling).

https://en.wikipedia.org/wiki/Dataflow_architecture

Let \(m\) be memory, \(x\) be senses, and \(y\) be actuators. \[\begin{aligned} m_{t+1} &= f~x_t~m_t \\ y_{t+1} &= g~x_t~m_t\end{aligned}\]

There is also a version with implicit time. \[\begin{aligned} m' &= f~x~m \\ y' &= g~x~m\end{aligned}\]

There is also a continuous version, read as an Euler step of \(\dot m = f~x~m\) and \(\dot y = g~x~m\) with step size \(h\). \[\begin{aligned} m_{t+h} &= m_t + h \cdot f~x_t~m_t \\ y_{t+h} &= y_t + h \cdot g~x_t~m_t\end{aligned}\]
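A minimal Python sketch of the discrete recurrence above, with made-up \(f\) and \(g\) (squashed linear maps) and made-up sensor signals:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), rng.normal(size=(2, 3))

def f(x, m):
    """Made-up memory update: a squashed linear map of senses and memory."""
    return np.tanh(A @ x + B @ m)

def g(x, m):
    """Made-up actuator output computed from senses and memory."""
    return C @ np.tanh(A @ x + B @ m)

m = np.zeros(3)                   # memory
for t in range(5):
    x = np.sin(np.arange(4) + t)  # toy sensor signals at time t
    m, y = f(x, m), g(x, m)       # m_{t+1} = f x_t m_t ; y_{t+1} = g x_t m_t
    print(t, y)
```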

15.4.10The brain evolved from simpler nervous systems.

Nervous systems are control systems.

Nervous systems must have provided some evolutionary benefit; otherwise natural selection would have phased them out.

Bacterial chemotaxis detects chemical concentration differences.

The nematode Caenorhabditis elegans has one of the simplest nervous systems, with only a few hundred neurons.

15.5Trivia: Correspondence between surjection, partition, and equivalence

(We can skip this.)

To partition a set is to split that set into disjoint non-empty subsets.52 Each such subset is called a block (or cell) of the partition.

A surjective function \(c : X \to \{0, \ldots, n\}\) corresponds to the blocks \(P_0, \ldots, P_n\) where \(P_k = \{ x ~|~ c(x) = k \}\) is the set of all things in class \(k\). Thus each set partitioning corresponds to a classification (a surjective function), and vice versa.

A partition also corresponds to an equivalence relation.
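A small Python sketch of the correspondence: a classification induces a partition into blocks, and the partition induces an equivalence relation. The universe and classes are made up:

```python
X = ["cat", "dog", "sparrow", "penguin", "salmon"]

def c(x):  # a classification: a function from X onto a set of classes
    return {"cat": "mammal", "dog": "mammal",
            "sparrow": "bird", "penguin": "bird",
            "salmon": "fish"}[x]

# The induced partition: P_k = { x | c(x) = k }
blocks = {}
for x in X:
    blocks.setdefault(c(x), set()).add(x)
print(blocks)  # {'mammal': {'cat', 'dog'}, 'bird': {...}, 'fish': {'salmon'}}

# The induced equivalence relation: x ~ y iff c(x) = c(y)
equivalent = lambda x, y: c(x) == c(y)
print(equivalent("cat", "dog"), equivalent("cat", "salmon"))  # True False
```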

15.6Relating areas of mathematics

15.6.1"Counterexamples in …" books

There is a series of books with titles like "Counterexamples in <an area of mathematics>". See https://math.stackexchange.com/questions/740/useful-examples-of-pathological-functions

15.6.2Approximation theory

A binary classifier can be viewed as an element of a Hilbert space of functions that we try to approximate.

What is the relationship with probability theory and statistics?

Practically all machine learning cases deal with smooth functions. Every real-world classification problem can be modeled as a function \(f : R^\infty \to R\). Consider the case where \(R = [0,1]\): a continuous map from the space \(R^n\) to the line \(R\).

15.6.3Computation theory

A discrete binary classifier is a decider.

Conjectures:

  • Learnability is Kolmogorov complexity. Descriptive complexity theory: Learnability is shortest formula length.
  • If a thing is not computable, then it is not learnable.
  • Learnability is the probability that a uniformly random interpretation satisfies a formula.
  • A hypothesis space in PAC learning theory is a language in automata theory / formal language theory.
  • https://en.wikipedia.org/wiki/Algorithmic_learning_theory

Some concepts from computational geometry readily yield classifiers, for example convex hulls and Voronoi diagrams:

What is "learning in the limit"?

15.7Convexity of sets and functions

Convex is the shape of a protruding belly.

Concave is the shape of a caved-in chest (pectus excavatum).

Let \(L(p,q)\) be the line segment from \(p\) to \(q\). The set \(S\) is convex iff \(\forall p,q \in S : L(p,q) \subseteq S\).

A function \(f : \Real \to \Real\) is convex iff the area above its graph is a convex set. That area is \(\{(x,y) ~|~ x \in \Real, ~ y > f(x)\}\).
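A small numerical sketch in Python of both definitions: a segment test for a set and a chord test for a function. Random sampling can only refute convexity, never prove it; the test shapes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def set_looks_convex(contains, sample_points, trials=1000):
    """Refutation test: points on segments between random pairs of set
    members must also lie in the set."""
    pts = np.asarray(sample_points)
    for _ in range(trials):
        p, q = pts[rng.integers(len(pts), size=2)]
        lam = rng.uniform()
        if not contains(lam * p + (1 - lam) * q):
            return False
    return True

def function_looks_convex(f, lo, hi, trials=1000):
    """Refutation test for f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)."""
    for _ in range(trials):
        x, y = rng.uniform(lo, hi, size=2)
        lam = rng.uniform()
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + 1e-12:
            return False
    return True

disk = lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0          # a convex set
disk_points = rng.uniform(-1, 1, (500, 2))
disk_points = disk_points[[disk(p) for p in disk_points]]

print(set_looks_convex(disk, disk_points))              # True
print(function_looks_convex(lambda x: x * x, -3, 3))    # True (convex)
print(function_looks_convex(np.sin, 0, 3))              # False (concave there)
```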

16Designing intelligent systems

16.1Where do we begin?

What should we do?

There is no requirement that the system be human-like. We should not be anthropocentric. We are only a local optimum in natural selection.

16.2How do we build the software?

<2018-12-25> I'm thinking of using Prolog, even for neural networks.

2018 "One Big Net For Everything"53

Are we stuck with uninterpretable deep neural networks? Will we be able to interpret neural networks? Will we find another architecture?

16.3Which ML algorithm should we use?

Why are there so many machine learning algorithms? "The key to not getting lost in this huge space is to realize that it consists of combinations of just three components." [12] Those three components are representation, evaluation, and optimization; the paper is a useful map for navigating the ML algorithm jungle.

How do we categorize ML algorithms? What is the common thing?

2018: "So in summary forget RNN and variants. Use attention. Attention really is all you need!"54 Current neural network research seems to be about approximating the human brain. But there is no reason why an intelligent system should have human-like brain.

There are too many ML algorithms. Has Pedro Domingos found the master algorithm yet?[13]

16.4How do we build the hardware?

EcoBot is a robot that can feed itself.55 Can evolutionary robotics56 evolve robots in 50 years, rather than the billions of years it took the Universe to evolve humans?

There are too many cognitive architectures57, just as there are too many learning algorithms.

Most computers in 2017 have the von Neumann architecture, which suffers from the von Neumann bottleneck (the limited transfer rate between CPU and RAM). This architecture fits programming well, fits training less well, and does not fit learning. It does not suit machines with billions of sensors. It does not preclude intelligence, but the bottleneck incurs a great penalty.

An array of FitzHugh-Nagumo cells? A FitzHugh-Nagumo cell is an electrical circuit implementing the FitzHugh-Nagumo model. FHN cells can be implemented in a field-programmable analog array (FPAA) [58].

16.5What is the smallest thing in nature that can learn?

Should we learn from nature? Protists may learn by habituation.58

16.6Designing classifiers?

16.6.1Classifiers?

We assume that the classes are known. If the classes are not known, then the problem is called "clustering".

Supervised learning. Training:

  • A sample is a point. A class is a set of samples (points). A class's boundary is a polytope.
  • For each training class, construct a smallest polytope that bounds the class's samples. If the polytope is convex, good.

Classes must not overlap/intersect.

We assume that clusters satisfy the cluster hypothesis59.

Traditional classifiers have two phases: training and performance. They no longer learn while they perform. The alternative is called "lifelong learning" or "continual learning".6061

Given some examples of \(c\), approximate \(c\).

Inputs:

  • Some training pairs: \(c' \subset c\). Every class must be represented.

Outputs:

  • Estimate \(c\).

The type of a classifier is \(a \to b\) where \(b\) is countable. Iff \(|b| = 2\), the classifier is binary. Iff \(|b|\) is finite, the classifier is multi-class.

A quasiclassifier is an inhabitant of \(\Real^\infty \to \Real\). A predicate \(p\) turns a quasiclassifier \(q\) into a classifier \(c~x = p~(q~x)\).
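A minimal Python sketch of a quasiclassifier turned into a binary classifier by a threshold predicate; the linear score and the weights are made up:

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # made-up weights

def q(x):
    """Quasiclassifier: a real-valued score (here a linear one)."""
    return float(w @ x)

def p(score):
    """Predicate turning a score into a class label."""
    return score > 0.0

def c(x):
    """Classifier: c x = p (q x)."""
    return p(q(x))

print(q(np.array([1.0, 1.0, 2.0])), c(np.array([1.0, 1.0, 2.0])))   # 2.0 True
print(q(np.array([0.0, 3.0, 0.0])), c(np.array([0.0, 3.0, 0.0])))   # -3.0 False
```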

A multiclassifier can be made from binary classifiers.

The maximum-margin hyperplane separating the lower training set \(L\) and the upper training set \(U\) is the hyperplane \(h\) such that \(\forall a \in U : h~a > 0\),  \(\forall b \in L : h~b < 0\), and \(\dist~h~(U \cup L)\) is maximal.

16.6.2Nearest-something classifier

Nearest-centroid classifier62

Nearest-cluster classifier6364

Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE Trans. Inform. Theory IT-13(1), 21–27 (1967)

This is mathematically principled. This does not "just work". This works and we understand why it works.

Nearest convex hull classification[33].

Performing:

  • Find the convex hull that is nearest to the input point.

Explainability: It is simple to explain an NCHC's decision. It is not simple to explain a neural network's decision.

Why does it classify this as that? Because it is the closest cluster.

Geometric learning / analogizer / learning by constructing convex hull / classification by cluster-convex-hull-boundary learning; analogizer

Voronoi classifier?

Find out the cluster centers. Let the Voronoi diagram be the boundary.
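A minimal nearest-centroid classifier in Python with NumPy (toy data, not any particular library's API). The decision boundary is the Voronoi diagram of the centroids, and the explanation of any decision is simply "this centroid was the closest":

```python
import numpy as np

def fit_centroids(samples, labels):
    """Training: the centroid of each class's samples."""
    samples, labels = np.asarray(samples), np.asarray(labels)
    return {k: samples[labels == k].mean(axis=0) for k in np.unique(labels)}

def classify(centroids, x):
    """Performing: the class whose centroid is nearest to x.
    The decision boundary is the Voronoi diagram of the centroids."""
    return min(centroids, key=lambda k: np.linalg.norm(centroids[k] - x))

samples = [[0, 0], [1, 0], [0, 1],      # class "a"
           [5, 5], [6, 5], [5, 6]]      # class "b"
labels = ["a", "a", "a", "b", "b", "b"]

centroids = fit_centroids(samples, labels)
print(classify(centroids, np.array([0.6, 0.2])))   # "a"
print(classify(centroids, np.array([4.9, 5.1])))   # "b"
```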

16.6.3What mathematics??

Can we apply Aitken's delta-squared process6566 to machine learning algorithms?
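For reference, a small Python sketch of Aitken's \(\Delta^2\) process itself, accelerating the slowly converging Leibniz series for \(\pi/4\); whether and how it helps machine learning algorithms is the open question above:

```python
def aitken(seq):
    """Aitken's delta-squared acceleration of a convergent sequence."""
    out = []
    for n in range(len(seq) - 2):
        d1 = seq[n + 1] - seq[n]
        d2 = seq[n + 2] - 2 * seq[n + 1] + seq[n]
        out.append(seq[n] - d1 * d1 / d2)
    return out

# Partial sums of the Leibniz series for pi/4: they converge very slowly.
partial, s = [], 0.0
for k in range(10):
    s += (-1) ** k / (2 * k + 1)
    partial.append(s)

print(4 * partial[-1])          # ~3.04, still far from pi
print(4 * aitken(partial)[-1])  # ~3.14, much closer after acceleration
```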

16.6.4Mathematical spaces

16.7How do we measure the performance of a learning algorithm?

16.8Where is the AI bottleneck?

What is preventing us from creating the AI? Where is the bottleneck: philosophy, science, or engineering? As of 2018 we are still stuck: machines have not replaced human secretaries, assistants, and researchers.

Is hardware not powerful enough? Some imply that it is not. Schmidhuber estimates that "True AI" needs machines with roughly 100,000 times as many neural connections as machines had in 2017.67 Schmidhuber and Wikipedia68 imply that in 2018 machines were about as intelligent as honey bees, going by the number of synapses.

Is software not efficient enough? Is our knowledge not enough? Are we clueless? Are we doing the wrong experiments? There are inconclusive discussions69.

Reading: 2012, "Philosophy will be the key that unlocks artificial intelligence", David Deutsch, in The Guardian70. That is an abridged version of the 2012 article "Creative blocks" in Aeon magazine71.

Why has AI mastered chess, but not real life? Because the chess search space is much smaller than the real-life search space?

16.9What is a neural network?

What is a neural network?

  • A neuron is a function in \(\Real^\infty \to \Real\).
  • A neural network layer is a function in \(\Real^\infty \to \Real^\infty\). (See the sketch after this list.)
  • What is statistical learning?
  • What is backpropagation, from functional analysis point of view?
  • Consider endofunctions of infinite-dimensional real tuple space. That is, consider \(f, g : \Real^\infty \to \Real^\infty\).
    • What is the distance between them?
  • Reductionistically, a brain can be thought of as a function in \(\Real \to \Real^\infty \to \Real^\infty\).
    • The first parameter is time.
    • The second parameter is the sensor signals.
    • The output of the function is the actuator signals.
    • Can we model a brain by such a functional differential equation involving functional derivatives?
    • \(\norm{f(t+h,x) - f(t,x)} = h \cdot g(t,x)\)
    • \(\norm{f(t+h) - f(t)} = h \cdot g(t)\)
    • It seems wrong. Abandon this path. See below.
  • We model the input as a function \(x : \Real \to \Real^n\).
  • We model the output as a function \(y : \Real \to \Real^n\).
    • \(\norm{y(t+h) - y(t)} = h \cdot g(t)\)
    • \(y(t+h) - y(t) = h \cdot (dy)(t)\)
    • \(\norm{(dy)(t)} = g(t)\)
      • There are infinitely many \(dy\) that satisfy that. Which one should we choose?
    • If \(y : \Real \to \Real^n\) then \(dy : \Real \to \Real^n\).
  • A control system snapshot is a function in \(\Real^\infty \to \Real^\infty\).
  • A control system is a function in \(\Real \to \Real^\infty \to \Real^\infty\).
  • How does \(F\) have memory if \(F(t) = \int_0^t f(x) ~ dx\)?
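A minimal Python sketch of the first two bullets, with finite arrays standing in for \(\Real^\infty\) (all but finitely many components are zero) and made-up weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=8), 0.1                        # made-up weights of one neuron
W1, b1 = rng.normal(size=(5, 8)), rng.normal(size=5)  # made-up weights of a layer
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)  # a second layer

def neuron(x):
    """A neuron: a function R^inf -> R (restricted here to 8 nonzero inputs)."""
    return float(np.tanh(w @ x + b))

def layer1(x):
    """A layer: a function R^inf -> R^inf (8 nonzero inputs, 5 nonzero outputs)."""
    return np.tanh(W1 @ x + b1)

def layer2(h):
    return np.tanh(W2 @ h + b2)

x = rng.normal(size=8)       # the finitely many nonzero sensor components
print(neuron(x))             # a single real number
print(layer2(layer1(x)))     # composing layers gives another such function
```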

16.10How might we build a seed AI?

  • Use off-the-shelf computers.
  • Use supercomputers.
  • Use clusters.
  • Use computers over the Internet.
  • Raise an AI like raising a child.
  • Evolve a system. Create an environment with selection pressure. Run it long enough.
  • What is TensorFlow? Keras? CNTK? Theano?
    • The building blocks of AI? Standardized AI components?

17Meta

This is a growing area. Mature content moves out.

Editing notes:

  • Sections with superfluous question marks should be rewritten.
  • We should not use footnotes for references?72
  • The writing must still be usable even if all footnotes are removed.

18Blog?

18.1<2020-01-13> Is practopoiesis the key to strong AI?

Is cybernetics/practopoiesis the next step in AI after deep neural networks?73 [35]74 Is practopoiesis the key to strong AI? Is it related to "Weight-agnostic neural networks"75?

18.2Inquisitive machines

How do we make machines inquisitive/curious?

How do we make machines generate/ask questions?

18.3Causality Machine?

How do we make a machine understand causality?

If an agent observes two events, then the agent believes that the first event causes the second event, with confidence inversely proportional to the time interval between those events.
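A toy Python sketch of that heuristic, with a made-up scaling; it is only meant to make the rule concrete:

```python
def causal_confidence(t_cause: float, t_effect: float) -> float:
    """Toy heuristic: the later event is believed to be caused by the earlier
    one, with confidence inversely proportional to the time gap."""
    gap = t_effect - t_cause
    if gap <= 0:
        return 0.0              # causes are assumed to precede effects
    return 1.0 / (1.0 + gap)    # in (0, 1]; made-up scaling

print(causal_confidence(0.0, 0.1))   # ~0.91: events close in time
print(causal_confidence(0.0, 10.0))  # ~0.09: events far apart in time
```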

How do we make machines understand the principle of locality?

How does the brain understand causation/causality?

"Time as a guide to cause" [24]76

18.4Karl Friston free energy principle?

https://www.psychologytoday.com/us/blog/the-future-brain/201810/new-theory-intelligence-may-disrupt-ai-and-neuroscience

18.5What

What is executive function?

What is this https://mitpress.mit.edu/books/advanced-structured-prediction

"similar project, but for Swi-Prolog (a statistical NLP module - it's my own initiative and about a month away from sharing with the world" https://news.ycombinator.com/item?id=14409595

http://jtonedm.com/2011/04/11/predictive-models-are-not-statistical-models/

"Where to find the latest Machine Learning papers? A list of Open Access journals and directories" https://medium.com/@i.oleks/where-to-find-the-latest-machine-learning-papers-f6633168629

READ http://news.mit.edu/2010/ai-unification

https://en.wikipedia.org/wiki/OpenCog Is AGIRI legit?

https://en.wikipedia.org/wiki/Partnership_on_AI

https://www.pt-ai.org

18.6Self-modifying code?

  • https://en.wikipedia.org/wiki/Evolutionary_algorithm What is the genetic algorithm model related to?
  • Genetic algorithms in self-modifying code
  • Genetic algorithms in program space
  • Gradient descent in program space; derivative of a program (lambda calculus?)
  • Schmidhuber self-modification?
  • https://en.wikipedia.org/wiki/Self-modifying_code
  • https://www.reddit.com/r/artificial/comments/7gl635/is_selfmodifying_code_the_best_option_for/
  • https://en.wikipedia.org/wiki/Test_functions_for_optimization#Test_functions_for_constrained_optimization

Theory of self-modifying code?

18.7Intelligence is self-defeating?

Intelligence is self-defeating. An intelligent being eventually realizes that it has no reason to exist, that it does not consent to its own existence, that there should have been nothing instead of anything.

18.8Pain

Pain is a sensory input spike, or dangerously high sensory input. Pain is sensory input level saturation (clipping). Examples: extremely bright light, loud noise, pungent smell or taste, extreme heat or cold, the abrupt change of velocity when hitting something after a fall, and so on.

Sensory saturation signals danger. Sensory saturation signals that something is outside the normal operating range.

18.9Attention

Attention is temporary discrimination (temporary preferential treatment, temporary focusing) of some sensory inputs.

18.10Conversational AI, personal assistant, user interface

There are two windows.

One window carries data from you to the AI; you type in this window.

The other window carries data from the AI to you; the AI prints in this window.

The timing of your keystrokes is also an input to the AI.

18.11What is the relationship between ML and statistical modeling?

18.12Doing the last thing we will ever need to do

In his website77, Jürgen Schmidhuber writes that he wants to build something smarter than him and then retire.

I want the same thing.

Schmidhuber's website floods the reader with too much content. It is hard for an outsider to tell whether he is a genius or crazy. But he has plenty of credentials.

Schmidhuber gave a Reddit mass-interview78.

Schmidhuber has always been an optimist.79

Isn't Jacques Pitrat's CAIA similar in spirit to what Jürgen Schmidhuber wants? Unfortunately Jacques Pitrat's blog80 is even harder to understand than Schmidhuber's website.

Is Kyndi closest to what we want? "Kyndi serves as a tireless digital assistant, identifying the documents and passages that require human judgment."81

Kyndi uses Prolog.

I need something similar to Kyndi but able to generate interesting questions for itself to answer. I want it to read journal articles and conference proceedings, understand them, and summarize them for me.

Concepts:

  • artificial general intelligence
  • seed AI

We need cooperation, not competition. Google, DeepMind, Facebook, Tesla, Amazon, OpenAI, Uber, Waymo, and other companies and organizations globally waste massive effort reinventing each other's AI capabilities.

Abbreviations:

  • AI: Artificial Intelligence
  • AGI: Artificial General Intelligence
  • ML: Machine Learning
  • COLT: Computational Learning Theory

18.13Machine learning sometimes needs philosophy

It is sometimes important to explain why a prediction works. http://blogs.cornell.edu/modelmeanings/2013/12/08/ml-philosophy-and-does-interpretation-matter/

18.14Automating reasoning?

What is reasoning? How do we automate reasoning? Prolog?

18.15What are some tools that I can use to make my computer learn?

  • Google TensorFlow?
  • Does OpenAI have tools?
  • Facebook?
  • Keras?

18.16Which AI architecture has won lots of AI contests lately?

18.17Analogizers, recommender systems, matrices

18.18Designing a humanoid?

A humanoid is a human-shaped robot.

There are several choices: make a machine that resembles a human, make a cyborg (a human-machine hybrid that is mostly human), or upload a mind.

18.18.1Power plant

It needs a power plant with high power-to-mass and power-to-volume ratios, for both long-duration low-power and short-burst scenarios. A high-energy-density sugar biobattery [59]. A microbial fuel cell capable of converting glucose to electricity at high rate and efficiency [38]. Sugar beats lithium ion.

Distributed processing, distributed energy generation.

Citric acid cycle. Oxidative phosphorylation.

Biomachine hybrid. A mixture of microbes and machine.

18.18.2Sensors

Billions of sensors. Light, sound, heat, itch, touch, gravity.

A strong enough brain.

How will it sustain itself?

How will it sense the world?

How will it manipulate the world?

18.19AI/ML?

18.20Judea Pearl, "book of why", causal inference

19Bibliography

[1] Ashby, W.R. 1954. Design for a brain. Wiley.

[2] Battle, S. 2015. A mobile homeostat with three degrees of freedom. IMA conference on mathematics of robotics (St Anne’s College, University of Oxford, UK, 2015).

[3] Battle, S. 2014. Ashby’s mobile homeostat. (2014), 110–123.

[4] Bengio, Y. Learning deep architectures for AI. Technical Report #1312.

[5] Bengio, Y. et al. Representation learning: A review and new perspectives.

[6] Biran, O. and Cotton, C. 2017. Explanation and justification in machine learning: A survey. IJCAI-17 workshop on explainable ai (xai) (2017), 8. url: <https://pdfs.semanticscholar.org/02e2/e79a77d8aabc1af1900ac80ceebac20abde4.pdf>.

[7] Boring, E.G. 1923. Intelligence as the tests test it. New Republic. (1923), 35–37. url: <https://brocku.ca/MeadProject/sup/Boring_1923.html>.

[8] Bringsjord, S. and Govindarajulu, N.S. 2018. Artificial intelligence. The stanford encyclopedia of philosophy. E.N. Zalta, ed. https://plato.stanford.edu/archives/fall2018/entries/artificial-intelligence/; Metaphysics Research Lab, Stanford University.

[9] Cohen, J. and Hickey, T.J. 1987. Parsing and compiling using prolog. ACM Transactions on Programming Languages and Systems (TOPLAS). 9, 2 (1987), 125–163. url: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.9739&rep=rep1&type=pdf>.

[10] Cucker, F. and Zhou, D.X. 2007. Learning theory: An approximation theory viewpoint. Cambridge University Press.

[11] Dietterich, T. et al. 2015. Computational learning theory. (2015).

[12] Domingos, P. 2012. A few useful things to know about machine learning. Communications of the ACM. 55, 10 (2012), 78–87. url: <https://www.centurion.link/w/_media/programming/a_few_useful_things_to_know_about_machine_learning.pdf>.

[13] Domingos, P. 2015. The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books.

[14] Friston, K. 2010. The free-energy principle: A unified brain theory? Nature Reviews Neuroscience. 11, 2 (2010), 127–138.

[15] Friston, K. et al. 2006. A free energy principle for the brain. Journal of Physiology-Paris. 100, 1 (2006), 70–87.

[16] Gács, P. and Vitányi, P.M.B. 2011. Raymond J. Solomonoff 1926–2009. IEEE Information Theory Society Newsletter. 61, 1 (2011), 11–16.

[17] Glushko, R.J. ed. 2013. The discipline of organizing. MIT Press.

[18] Goertzel, B. 2015. Artificial general intelligence. Scholarpedia. 10, 11 (2015), 31847.

[19] Goodfellow, I. et al. 2016. Deep learning. MIT Press.

[20] Hutter, M. 2007. Algorithmic information theory. Scholarpedia. 2, 3 (2007), 2519.

[21] Izbicki, M. 2013. HLearn: a machine learning library for Haskell. Proceedings of the fourteenth symposium on trends in functional programming, brigham young university, utah (2013).

[22] Kearns, M.J. and Vazirani, U.V. 1994. An introduction to computational learning theory. MIT press.

[23] Khavinson, S.Y. 1997. Best approximation by linear superpositions (approximate nomography). American Mathematical Soc.

[24] Lagnado, D.A. and Sloman, S.A. 2006. Time as a guide to cause. Journal of Experimental Psychology: Learning, Memory, and Cognition. 32, 3 (2006), 451. url: <https://www.ucl.ac.uk/lagnado-lab/publications/lagnado/Lagnado_time_%20as_guide_to_cause.pdf>.

[25] Legg, S. 2008. Machine super intelligence. University of Lugano.

[26] Legg, S. and Hutter, M. 2007. A collection of definitions of intelligence. Frontiers in Artificial Intelligence and applications. 157, (2007), 17.

[27] Legg, S. and Hutter, M. 2006. A formal measure of machine intelligence. Proc. 15th annual machine learning conference of belgium and the netherlands (benelearn 2006) (2006), 73–80.

[28] Legg, S. and Hutter, M. 2007. Universal intelligence: A definition of machine intelligence. (2007).

[29] Liang, P. 2016. Lecture notes in Stanford University CS221: Artificial Intelligence: Principles and Techniques.

[30] Mele, C. et al. 2010. A brief review of systems theories and their managerial applications. Service Science. 2, 1-2 (2010), 126–135. DOI:https://doi.org/10.1287/serv.2.1_2.126. url: <http://dx.doi.org/10.1287/serv.2.1_2.126>.

[31] Mohri, M. et al. 2018. Foundations of machine learning. MIT press.

[32] Morris, S.A. 2016. Topology without tears.

[33] Nalbantov, G. et al. 2006. Nearest convex hull classification. url: <https://pdfs.semanticscholar.org/a833/81e279fa548aca034f310fff6385c3d6b809.pdf>.

[34] Negnevitsky, M. 2005. Artificial intelligence: A guide to intelligent systems. Pearson Education.

[35] Nikolić, D. 2017. Why deep neural nets cannot ever match biological intelligence and what to do about it? International Journal of Automation and Computing. 14, 5 (2017), 532–541.

[36] Nilsson, N.J. 1991. Logic and artificial intelligence. Artificial Intelligence. 47, 1-3 (Feb. 1991), 31–56. DOI:https://doi.org/10.1016/0004-3702(91)90049-P. url: <http://dx.doi.org/10.1016/0004-3702(91)90049-P>.

[37] Pickering, A. 2010. The cybernetic brain: Sketches of another future. University of Chicago Press.

[38] Rabaey, K. et al. 2003. A microbial fuel cell capable of converting glucose to electricity at high rate and efficiency. Biotechnology letters. 25, 18 (2003), 1531–1535.

[39] Rigollet, P. 18.657 Mathematics of Machine Learning.

[40] Russell, S.J. and Norvig, P. 2016. Artificial intelligence: A modern approach. Pearson Education Limited.

[41] Shour, R. 2018. Defining intelligence. (2018). url: <https://www.researchgate.net/publication/323203054_Defining_intelligence>.

[42] Sloman, A. 2002. The irrelevance of turing machines to artificial intelligence. The MIT Press, Cambridge, Mass.

[43] Smith, W.D. 2006. Mathematical definition of “intelligence”, after mathematical definition of “intelligence” (and consequences). (2006).

[44] Smith, W.D. 2006. Mathematical definition of “intelligence” (and consequences). (2006).

[45] Solomonoff, R. 1996. Does algorithmic probability solve the problem of induction. Oxbridge Research, POB. 391887, (1996). url: <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.70.1572&rep=rep1&type=pdf>.

[46] Solomonoff, R.J. 2011. Algorithmic probability – its discovery – its properties and application to strong AI. Randomness through computation: Some answers, more questions. H. Zenil, ed. World Scientific Publishing Company. 1–23.

[47] Steunebrink, B.R. and Schmidhuber, J. Towards an actual Gödel machine implementation: a lesson in self-reflective systems.

[48] Steup, M. 2018. Epistemology. The stanford encyclopedia of philosophy. E.N. Zalta, ed. https://plato.stanford.edu/archives/win2018/entries/epistemology/; Metaphysics Research Lab, Stanford University.

[49] Sutton, R.S. and Barto, A.G. 1998. Reinforcement learning: An introduction. MIT Press.

[50] Tarantola, A. 2005. Inverse problem theory and methods for model parameter estimation. SIAM.

[51] Thomason, R. 2016. Logic and artificial intelligence. The stanford encyclopedia of philosophy. E.N. Zalta, ed. https://plato.stanford.edu/archives/win2016/entries/logic-ai/; Metaphysics Research Lab, Stanford University.

[52] Trefethen, L.N. Approximation theory and approximation practice.

[53] Valiant, L.G. 1984. A theory of the learnable. Communications of the ACM. 27, 11 (1984), 1134–1142. url: <http://web.mit.edu/6.435/www/Valiant84.pdf>.

[54] Vittek, M. 1996. A compiler for nondeterministic term rewriting systems. International conference on rewriting techniques and applications (1996), 154–168.

[55] Wiener, N. 1961. Cybernetics or control and communication in the animal and the machine. MIT Press.

[56] Wissner-Gross, A.D. and Freer, C.E. 2013. Causal entropic forces. Physical review letters. 110, 16 (2013), 168702. url: <https://www.alexwg.org/publications/PhysRevLett_110-168702.pdf>.

[57] Xu, L. et al. 2009. Optimal reverse prediction: A unified perspective on supervised, unsupervised and semi-supervised learning. Proceedings of the 26th annual international conference on machine learning (2009), 1137–1144. url: <http://people.ee.duke.edu/~lcarin/OptReversePred.pdf>.

[58] Zhao, J. and Kim, Y.-B. 2007. Circuit implementation of FitzHugh-Nagumo neuron model using field programmable analog arrays. 50th midwest symposium on circuits and systems (mwscas) (2007), 772–775.

[59] Zhu, Z. et al. 2014. A high-energy-density sugar biobattery based on a synthetic enzymatic pathway. Nature communications. 5, (2014).


  1. https://stats.stackexchange.com/questions/214381/what-exactly-is-the-mathematical-definition-of-a-classifier-classification-alg

  2. https://ndutoitblog.wordpress.com/2018/04/01/defining-machine-learning-with-maths/

  3. https://www.youtube.com/watch?v=reumVbH41Vc&list=PLSpInro6Ys2IHve6oN9h005zmwfLnOSp1&index=1

  4. https://en.wiktionary.org/wiki/thon#English

  5. https://en.wikipedia.org/wiki/Philosophy_of_artificial_intelligence

  6. https://www.etymonline.com/word/intelligence

  7. http://www.dictionary.com/browse/intelligent

  8. http://www.idsia.ch/~juergen/newai/newai.html

  9. http://www.cs.uic.edu/~piotr/cs594/Prashant-UniversalAI.pdf

  10. https://en.wikipedia.org/wiki/Integrated_information_theory

  11. https://twitter.com/search?q=karl%20friston

  12. https://en.wikipedia.org/wiki/Microbial_intelligence

  13. https://en.wikipedia.org/wiki/Duck_typing

  14. "McCarthy coined the term 'artificial intelligence' in 1955, and organized the famous Dartmouth Conference in Summer 1956. This conference started AI as a field." https://en.wikipedia.org/wiki/John_McCarthy_(computer_scientist)

  15. https://en.wikipedia.org/wiki/Dartmouth_workshop

  16. http://raysolomonoff.com/dartmouth/

  17. http://www.mit.edu/~kardar/teaching/projects/chemotaxis(AndreaSchmidt)/finding_food.htm

  18. "[...] if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid." https://quoteinvestigator.com/2013/04/06/fish-climb/

  19. https://christophm.github.io/interpretable-ml-book/

  20. https://www.etymonline.com/word/predict

  21. https://en.wikipedia.org/wiki/Primitive_recursive_function

  22. https://en.wikipedia.org/wiki/Projection_(mathematics)

  23. https://www.quora.com/What-do-we-intuitively-mean-by-embedding-a-manifold-into-a-higher-dimensional-space-Can-you-give-some-examples

  24. https://en.wikipedia.org/wiki/Module_(mathematics)

  25. https://www.etymonline.com/word/plan

  26. https://www.etymonline.com/word/regret

  27. https://github.com/cbaziotis/prolog-cfg-parser

  28. https://en.wikipedia.org/wiki/Theory_of_justification

  29. https://www.theepochtimes.com/common-misconceptions-about-confucius_1955031.html

  30. https://en.wikipedia.org/wiki/Confucianism

  31. https://en.wikipedia.org/wiki/Mass_surveillance_in_China

  32. https://en.wikipedia.org/wiki/Social_Credit_System

  33. https://en.wikipedia.org/wiki/Human_rights_in_China

  34. https://www.cambridge.org/core/journals/politics-and-religion/article/do-confucian-values-deter-chinese-citizens-support-for-democracy/A4492EE692013F82AB66FD9C90DAFAA9

  35. https://www.businessinsider.com/china-facial-recognition-limitations-2018-7/

  36. Table of contents available at http://assets.cambridge.org/97805218/65593/frontmatter/9780521865593_frontmatter.pdf

  37. https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html

  38. https://en.wikipedia.org/wiki/Timeline_of_artificial_intelligence

  39. https://en.wikipedia.org/wiki/Progress_in_artificial_intelligence

  40. https://en.wikipedia.org/wiki/Timeline_of_machine_learning

  41. Proceedings at http://www.math.purdue.edu/calendar/conferences/machinelearning/abstracts.php

  42. https://papers.nips.cc/book/advances-in-neural-information-processing-systems-30-2017

  43. https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/

  44. https://www6.cityu.edu.hk/ma/people/profile/zhoudx.htm

  45. http://staff.ustc.edu.cn/~linlixu/

  46. http://webdocs.cs.ualberta.ca/~whitem/

  47. https://webdocs.cs.ualberta.ca/~dale/

  48. http://ergoemacs.org/emacs/emacs_narrow-to-defun_eval-defun_bug.html

  49. https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/

  50. https://en.wikipedia.org/wiki/Minimum_bounding_box

  51. https://math.meta.stackexchange.com/questions/41/differences-between-mathoverflow-and-math-stackexchange

  52. https://en.wikipedia.org/wiki/Partition_of_a_set

  53. https://arxiv.org/abs/1802.08864

  54. https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

  55. https://en.wikipedia.org/wiki/EcoBot

  56. https://en.wikipedia.org/wiki/Evolutionary_robotics

  57. https://en.wikipedia.org/wiki/Cognitive_architecture

  58. A single-celled organism capable of learning https://www.sciencedaily.com/releases/2016/04/160427081533.htm

  59. https://en.wikipedia.org/wiki/Cluster_hypothesis

  60. https://medium.com/continual-ai/why-continuous-learning-is-the-key-towards-machine-intelligence-1851cb57c308

  61. https://www.cs.uic.edu/~liub/lifelong-learning.html

  62. https://en.wikipedia.org/wiki/Nearest_centroid_classifier

  63. https://link.springer.com/chapter/10.1007/978-3-642-28942-2_24

  64. https://www.researchgate.net/publication/262204053_Nearest_Cluster_Classifier

  65. https://en.wikipedia.org/wiki/Aitken%27s_delta-squared_process

  66. https://en.wikipedia.org/wiki/Sequence_transformation

  67. "The only problem is that [computers] are still too slow [for ’True AI’] – around a billion neural connections compared with around 100,000bn in the human cortex." https://www.theguardian.com/technology/2017/apr/18/robot-man-artificial-intelligence-computer-milky-way

  68. https://en.wikipedia.org/wiki/List_of_animals_by_number_of_neurons

  69. https://www.reddit.com/r/askscience/comments/3ib6hi/where_is_the_strong_ai_bottleneck/

  70. https://www.theguardian.com/science/2012/oct/03/philosophy-artificial-intelligence

  71. https://aeon.co/essays/how-close-are-we-to-creating-artificial-intelligence

  72. http://www.sussex.ac.uk/informatics/punctuation/essaysandletters/footnotes

  73. <2019-11-04> http://www.danko-nikolic.com/practopoiesis/

  74. <2019-11-04> https://link.springer.com/content/pdf/10.1007%2Fs11633-017-1093-8.pdf

  75. <2019-11-06> https://twitter.com/hardmaru/status/1138600152048910336

  76. https://en.wikipedia.org/wiki/Causal_reasoning

  77. http://people.idsia.ch/~juergen/

  78. https://www.reddit.com/r/MachineLearning/comments/2xcyrl/i_am_j%C3%BCrgen_schmidhuber_ama/

  79. https://www.theguardian.com/technology/2017/apr/18/robot-man-artificial-intelligence-computer-milky-way

  80. http://bootstrappingartificialintelligence.fr/WordPress3/

  81. https://www.nytimes.com/2018/06/20/technology/deep-learning-artificial-intelligence.html