Robotics and Computer Vision

Note: This class was taken during the fall semester of 2017 and was taught by professor Kristin Dana.

Exam 2 Study Guide

Coordinate Frame Transformations

Camera Calibration

Remember the image formation pipeline:

\[World Coordinates -> Camera Coordinates -> 2D Coordinates -> Pixel Coordinates\]

Goal: find \(M\), a 3x4 matrix

\[M = KR [ I | t ]\]

\(K\) is the intrinsic parameter matrix and R is the 3x3 rotation matrix. \(I\) is a 3x3 identity matrix and \(t\) is the 3x1 translation vector.

Need a set of corresponding camera and world points in order to do this.

Recall the equation for finding the image points

\[im_{P} = M W_P\]

In other words, given pixel coordinates and world coordinates for a corresponding point

\[\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = MW_p\]

We then divide by \(w'\) components to take care of the homogeneous coordinates (This is where we lose the information about the 3d component). Thus:

\[x_i = \frac{x'}{w'} = \frac{m_{11}X_i + m_{12}Y_i + m_{13}Z_i + m_{14}}{m_{31}X_i + m_{32}Y_i + m_{33}Z_i + m_{34}}\]

From here you can rearrange the equation for x and set the expression equal to 0.

\[0 = m_{11}X_i + m_{12}Y_i + m_{13}Z_i + m_{14} -x_i(m_{31}X_i + m_{32}Y_i + m_{33}Z_i + m_{34})\]

Similarly we get an equation using \(y_i\)

\[0 = m_{21}X_i + m_{22}Y_i + m_{23}Z_i + m_{14} -y_i(m_{31}X_i + m_{32}Y_i + m_{33}Z_i + m_{34})\]

Then, for each point in the matrix we can get 2 rows in a 2Nx12 column matrix where the 12 columns come from the 12 parameters of the M matrix.

Thus we should be be able to solve for the matrix with around 6 corresponding points where the matrix is determined. (Use SVD and take the last column of V to get a solution in the null space)

Next problem: Finding intrinsic/extrinsic parameters from M

If we want to find the intrinsic parameters we can use QR decomposition in order to determine the intrinsic matrix.

To recover the intrinsic parameters of the rotation matrix:

  1. Invert the first 3x3 columns of the 3x4 rotation matrix
  2. Apply the QR decomposition to \((KR)^{-1}\)
  3. The result of QR gives an orthonormal matrix Q, and the R matrix.
    • The orthonormal matrix \(Q = R^{-1}\) (of \(w_{R_c}\) - the camera to world roation) and \(R = K^{-1}\)

To summarize then, in order to calibrate and find the camera matrix parameters \(K\) and \(R\) and \(W_{t_c}\)

Real-World Calibration

Because lenses are not perfect and real cameras do not exactly follow the pinhole-camera model, there are other parameters that need to be taken into account

While we won’t go in depth on those in this course we still should know how other systems work. The Caltech calibration system which is included in the OpenCV distribution is available and uses checkerboard patterns in order to properly calibrate cameras account for lens distortions. It uses nonlinear optimizations in order to solve the problem.

Typically about 30 images are needed for proper calibration.

Lens distortions:

Lens distortions - radial, barrel, and pincushion distortions are all possible results of camera lenses and must be taken into account to properly calibrate a camera.

Simple distortion models:

3D Reconstruction

We can perform 3D reconstruction via triangulation if we have 2 images where we have the M matrix of each camera (Or the \(K\), \(R\), and \(t\). However because we’re looking to do 3D reconstruction we assume we don’t have access to the 3D points in space. So the M matrices must not have been obtained from the the DLT method explained in the calibration section above.

In fully calibrated construction we:

For stereo reconstruction of an x, y coordinate match, do the following

Given \(K = \begin{bmatrix} \frac{f}{s_x} & s & o_x \\ 0 & \frac{f}{s_y} & o_y \\ 0 & 0 & 1 \end{bmatrix}\)

Uncalibrated Reconstruction and Epipolar Geometry

Using epipolar geometry is a kind of “shortcut” to the full triangulation of fully calibrated cameras. It allows us to get the 3d point without having to find the 11+ parameter matrix that is required as seen in the previous method.

Epipolar Geometry


Property: All epipolar lines go through the epipole

Using a fundamental matrix to perform reconstruction is called projective reconstructions. Using the essential matrix is reconstruction up to a scale factor

We know that a \(p_r = R(p_l - t)\). However w.r.t the camera matrices the translation \(t = o_r - o_l\) which is just the difference of the camera centers. Thus the \(R\) we refer to from here on is \(r_{R_l}\)

Computing F (or E), the 8 point algorithm

Fundamental Matrix Properties

The fundamental matrix is of rank 2.

Now for every point in the image, the epipolar constraint is satisfied, even at the epipole.

Partial Calibration

Finding Camera Matrices from F

Right Camera Matrix:

\[M_r = K_r [ I | 0 ]\]

Left Camera Matrix:

\[M_l = K_l [ e_x F | e ]\]

Where \(e\) is the epipole and \(e_x\) is the cross product of the epipole

\[e_x = \begin{bmatrix} 0 & -e_3 & e_2 \\ e_3 & 0 & -e_1 \\ -e_2 & e_1 & 0 \end{bmatrix}\]

Finding Camera Matrices from F

First define the following:

\[W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
\[Z = \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}\]

Factor \(\rightarrow E = SR\) where \(SVD(E) = U\Sigma V^T\):

From this you must calculate the left and right matrices using all combinations of S and R, and the combinations that result in all triangulated points in front of the camera (\(+Z\)) will result in the correct reconstruction.

\[E = SR\]
\[M_r = K_r [ I | 0 ]\]
\[M_l = K_l [ R | t ]\]

Image Rectification

Makes correspondence easier if images are parallel.

Given two parallel images which have known intrinsic parameters we can represent the essential matrix as:

\[E = [t_x] R = \begin{bmatrix} 0&0&0 \\ 0&0&-B \\ 0&B&0 \end{bmatrix}\]

So in other words, the goal of rectification is to find a homography, \(H\), which causes the images to ‘become’ parallel

We can use disparity to help calculate \(B\) where

\[p_l - p_r = \frac{Bf}{z} = \text{disparity}\]

Because of this, correpsondence is easier because corresponding points lie on the same y-value pixels.

To calculate:

Trw = [Mextright ; 0 0 0 1];
Tlw = [Mextleft; 0 0 0 1];
Twr = inv(Trw); % can be done using transpose
Twl = inv(Tlw); % can be done using transpose
Tlr = Tlw*Twr;
% Rotation from right to left coordinate frame 
Rlr = Tlr(1:3,1:3);
% translation
tlr = Tlr(1:3,4);

% For rectification, we want the x axis of the new coordinate frame to be the vector connecting the
% optical axes, so we can write the first column as:
v1 = tlr./norm(tlr);
% The y axis should be perpendicular to the optical axis [0 0 1] and the x
% axis, so take the cross product
v2 = [-v1(2) v1(1) 0 ]';
v2 = v2./norm(v2);
% the new optical axis or z axis, must be perpendicular to the x and y axis
v3 = cross(v1,v2);
Rrectleft = [v1'; v2';v3';]'; % Now we can write the rotation matrix by interpreting the columns
% to rectify the points in the right camera, first multiply by Rrect
% so that the relative orientation of the two coordinates is the same;
% then Rlr will align the right coordinate points with the left coordinate
% frame to make parallel cameras. 

Interest Point Detection

Many algorithms rely on point correspondences.

How do we find them?

A few options - most widely used are SIFT (Scale Invariant Feature Transform) and SURF

Deep Learning

Deep learning is all about gradient optimization. The overall goal is to always minimize some type of loss function.

The basic procedure is as follows:

These are all usually repeated through stochastic gradient descent (SGD)

The network is really just a composition of functions for each layer represented as something like \(N(X, W) = f1(f2(f3(f4(X, W))))\)

Weights can be calculated via an iterative process where \(W_1 = W_0 + \alpha \frac{\partial f(W_0)}{\partial x_i}\)

To calculate SGD on networks the weights are typically a function of the previous layers’ weights and outputs and the current layer’s inputs

We can use the Jacobian matrix to help us calculate the SGD for our networks. The jacobian matrix paramters are defined as the following:

\[J_{ij} = \frac{\partial f_i}{\partial x_j}\]

So in order to calculate the backpropagation of the network we

Classifier Accuracy

Computational Appearance - Reflectance, Texture, and Light Fields

Reflectance is the existence of a BRDF function or Bidirectional Reflectance Distribution Function. Basically this means that the angle at which light reflects will change depending on the object.

Mathematically, the function will, given an input angle and an output angle, return the amount of reflected light for the two specific angles.

Work by prof. Dana has been done to entirely measure the reflectance at all viewing angles by using a parabolic mirror.

Motion Estimation

Given two images where there is movement (via camera, or world movement), we want to find out the function of that motion. For our purposes we assume that the motion is linear or affine.

That means the intensity of any given \((x, y)\) coordinate pair at a time \(t\) is given as \(i(x, y, t)\) and the same intensity should be found at \(i(x+u, y+v, t+\Delta t)\) in the 2nd image.

These u and v are composed of linear equations such that \(u = ax + by + c\) and \(v = dx + ey + f\).

Overall this is called the brightness constancy assumption and we use it to solve for our u and v parameters

the equation for this is

\[i_t + i_x (ax + by + c) + i_y (dx + ey + f) = 0\]

If we do this for all points on the image we can use a least-squares estimate to obtain the best fit parameters for a, b, c, d, e, and f

Robotics and Computer Vision - zac blanco