Science fiction has always run ahead of contemporary technology. In Dr. Asimov's novelette \textit{The Bicentennial Man}, a housekeeper robot named ``Andrew'' was to be introduced to the Martin family in 2005. Yet to this day, large-scale applications of ``Andrew''-like general-purpose robots remain absent. Broadly, two challenges stand in the way: the environment and the task. Recent advances in deep learning have largely addressed the unstructured-environment challenge, thanks to the representation power of deep neural networks. The task specification problem, however, still lacks a satisfactory solution owing to the complexity of robotic tasks.
Compared with way-point based robot teaching and reward- or cost-based learning methods, human demonstrations provide a more intuitive interface for general-purpose task programming. Moreover, learning directly by watching human demonstrations, without tedious kinesthetic teaching or teleoperation, is even more convenient for task teaching.
However, just as the ``no free lunch'' theorem suggests, more convenience brings more challenges. Our trouble comes from the ``correspondence problem'': humans and robots perform even the same task in fundamentally different ways, both perceptually and physically. Perceptually, the observation spaces of the human and the robot can differ; for example, the camera used to record human demonstration videos may differ from the on-board sensor the robot uses when executing the task. Physically, humans and robots have different action spaces, and when collecting demonstrations we have no access to robot actions. The strong task clue provided by state-action pairs in approaches such as behavior cloning (BC), inverse reinforcement learning (IRL), or meta-learning is therefore no longer available. These extra difficulties make learning a controller or policy even harder, since massive robot-environment interaction is commonly required, which can be cost-prohibitive in real-world applications.
Rethinking the ``correspondence problem'', we can infer that only the information of the task definition needs to be transferred from the human demonstrator to the robot imitator. Empirically, this aligns with the human cognitive process in peer and observational learning, which involves first understanding the task from what has been observed before attempting any motor actions. Furthermore, if we can encode the task definition directly from an observed image, the task definition itself should also transfer across different environmental settings and across categorical objects and tools. This gives us a hint for building generalizable robot task learning. Likewise, humans can grasp generalizable task definitions simply by watching others perform the same task under various task settings.
This thesis concerns four questions: (1) how should we encode the task definition directly from an image? (2) how should we learn it from human demonstration videos? (3) how does the learned task definition encoding relate to a controller or policy? (4) how can such task learning be made generalizable? To derive solutions, we introduce geometry as a structured prior, which we explain in more detail below.
There are many possible ways to design a task definition encoder; a simple example is a neural network that takes an input image and outputs a vector encoding the task definition. In practice, however, such an unstructured parameterization does little to ease the tedious controller or policy learning that follows when transferring the task definition to a robot. Can we impose structured constraints on the network so that the learned task encoding admits an easier controller design or policy learning? In this thesis, we introduce geometry as the prior knowledge that structures task representation learning from human demonstration videos.
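As a rough illustration (the notation here is ours, introduced only for this sketch and not as the thesis's formal definition), the unstructured alternative simply maps an image $I$ to a latent vector, whereas a geometry-structured encoding exposes geometric primitives and their associations of the kind used in visual servoing:
\begin{align}
  \text{unstructured:} \quad & z = f_\theta(I) \in \mathbb{R}^{d}, \\
  \text{geometry-structured:} \quad & z = \{ (g_i, g_j, c_{ij}) \}, \quad c_{ij} \in \{\text{point-to-point}, \text{point-to-line}, \text{line-to-line}\},
\end{align}
where $f_\theta$ is an image encoder, the $g$'s are geometric primitives (\eg, points and lines) detected in the image, and each $c_{ij}$ names the association constraint imposed between them.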
\section{Research Questions}
This section introduces geometry and visual servoing and explains why a geometric representation can help, formalizes the task representation learning problem and the kind of representation we need, and then summarizes our approach and what it addresses in deep learning and in visual servoing, respectively.
Concretely, we introduce geometry into the neural network as a structured task representation. The motivation comes from decades of techniques proven effective in the visual servoing literature, which use geometric association constraints (Fig. 1, \eg, point-to-point, point-to-line, line-to-line) and their combinations to represent a task in a general way and to control the robot efficiently.
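To make these constraints concrete, the standard image-based visual servoing formulation (not specific to this thesis) writes each association as an error that a controller drives to zero, for example
\begin{equation}
  e_{\text{pt-pt}} = \mathbf{p} - \mathbf{p}^{*}, \qquad
  e_{\text{pt-line}} = \boldsymbol{\ell}^{\top} \tilde{\mathbf{p}}, \qquad
  e_{\text{line-line}} = \big(\rho - \rho^{*}, \; \theta - \theta^{*}\big),
\end{equation}
where $\mathbf{p}$ and $\mathbf{p}^{*}$ are the current and desired image points, $\tilde{\mathbf{p}}$ is a point in homogeneous image coordinates, $\boldsymbol{\ell}$ holds the coefficients of the target image line, and $(\rho, \theta)$ parameterize an image line. A classical velocity controller then takes the form $\mathbf{v}_{c} = -\lambda \widehat{\mathbf{L}_{e}}^{+} e$, with $\lambda > 0$ a gain and $\widehat{\mathbf{L}_{e}}^{+}$ the pseudo-inverse of an estimate of the interaction matrix.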
Formally, we name the task definition encoding the task representation, and we name learning an encoder that extracts the task definition from an observed image the \textbf{task representation learning problem}.
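In symbols (again ours, for illustration only), given several human demonstration videos $\{V_k\}_{k=1}^{K}$ with frames $V_k = (I_k^{1}, \dots, I_k^{T_k})$, the problem is to learn an encoder
\begin{equation}
  f_\theta : I \mapsto z = f_\theta(I),
\end{equation}
such that $z$ captures the task definition and remains usable by the robot imitator under its own observation space and across different objects and tools.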
This research lies at the intersection of deep learning and traditional visual servoing. It combines the representation power of deep neural networks with the advantages of visual servoing, such as data efficiency, good interpretability, and almost no hardware wear during training. As a result, it transforms the task specification problem in visual servoing into a task representation learning problem fed by several human demonstration videos.
Our goal is to learn a generalizable task representation from several human demonstration videos. We show that such a geometry-structured task representation enables the design of efficient controllers and generalizes to various objects and tools.
In short, we take a view of human demonstrations oriented towards encoding task concepts rather than control: we examine \textit{what to imitate}, the knowledge that is common to demonstrator and imitator and can therefore be transferred between them.