Gaze-Tracker

Building lightweight eye trackers for mobile devices using simple Convolutional Neural Networks. This repo contains the work done during GSoC 2022 under @INCF.

Introduction

During GSoC'22 I worked on the Gaze Track project started last year, which is based on implementing, fine-tuning, and experimenting with Google's paper Accelerating eye movement research via accurate and affordable smartphone eye tracking.

Eye tracking can be used for a range of purposes, from improving accessibility for people with disabilities to improving driver safety. However, modern state-of-the-art mobile eye trackers are bulky devices that require careful setup and calibration, and they tend to be expensive. The aim of this project is therefore to develop an affordable, open-source alternative to these eye trackers.

My main task during the GSoC period was to implement the model architecture proposed by Google in TensorFlow, run SVR experiments, and compare the results to Abhinav's and Dinesh's versions. Please refer to their posts for details on their implementations.

Dataset

Every model trained in this project was developed on a subset of the large MIT GazeCapture dataset, released in 2016. The dataset can be accessed by registering on its website. Each participant's recording includes the captured frames along with JSON files containing metadata such as bounding-box coordinates for the face and eyes, as well as the number of frames, face detections, and eye detections.

The official GazeCapture repository provides excellent explanations of the dataset’s file structure and the information it contains.

Splits

Only frames with valid face and eye detections are included in the final dataset; a frame is discarded if any of these detections is missing.

The dataset is therefore generated by applying the following filters:

  1. Only Phone Data
  2. Only portrait orientation
  3. Valid face detections
  4. Valid eye detections

After applying the filters listed above, the dataset contains 501,735 frames from 1,241 participants.
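As a rough sketch, this filtering can be expressed per subject directory using the GazeCapture JSON files; the field names below follow the official repository's layout but should be treated as assumptions here:

```python
import json
import os

def valid_frame_indices(subject_dir):
    """Return indices of frames that pass the filters above.
    Field names follow the GazeCapture JSON layout and are assumptions here."""
    load = lambda name: json.load(open(os.path.join(subject_dir, name)))
    info, screen = load('info.json'), load('screen.json')
    face, left, right = load('appleFace.json'), load('appleLeftEye.json'), load('appleRightEye.json')

    if 'iPhone' not in info['DeviceName']:                        # 1. phone data only
        return []
    keep = []
    for i in range(info['TotalFrames']):
        if (screen['Orientation'][i] == 1                         # 2. portrait orientation
                and face['IsValid'][i]                            # 3. valid face detection
                and left['IsValid'][i] and right['IsValid'][i]):  # 4. valid eye detections
            keep.append(i)
    return keep
```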

For base model training, two types of splits are considered.

MIT Split

Similar to GazeCapture, the MIT Split keeps the train/validation/test division at the per-participant level: a participant's data appears in only one of the train, validation, or test sets. This helps the model generalize better, since the same person never appears in more than one split.
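One way to express such a participant-level split is with scikit-learn's GroupShuffleSplit, sketched below; the actual MIT split uses the assignments that ship with GazeCapture, so this is only illustrative:

```python
from sklearn.model_selection import GroupShuffleSplit

# `frame_paths` and `participant_ids` are assumed parallel lists, one entry per frame.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, test_idx = next(splitter.split(frame_paths, groups=participant_ids))
# Every participant's frames fall entirely into either the train or the test indices.
```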

The details regarding the split are as follows:

| Split | Number of participants | Total frames |
|---|---|---|
| Train | 1,075 | 427,092 |
| Validation | 45 | 19,102 |
| Test | 121 | 55,541 |

Google Split

Google split their dataset on unique ground-truth points rather than on participants. This means that each participant's frames can appear in the train, test, and validation sets.

The details regarding the split are as follows:

| Split | Number of participants | Total frames |
|---|---|---|
| Train | 1,241 | 366,940 |
| Validation | 1,219 | 50,946 |
| Test | 1,233 | 83,849 |

The Network

Using TensorFlow, we reproduce the neural network architecture described in the Google paper and its supplementary information.

The network architecture is depicted in the diagram below.

gazetracknetwork.jpg
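As a rough orientation only, a heavily simplified Keras sketch of this kind of two-tower eye model could look like the following; the input sizes, layer widths, and layer names here are illustrative assumptions, not the exact configuration from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def eye_tower():
    """Convolutional tower applied to an eye crop (sizes are illustrative)."""
    return tf.keras.Sequential([
        layers.Conv2D(32, 7, strides=2, activation='relu'),
        layers.AveragePooling2D(2),
        layers.Conv2D(64, 5, activation='relu'),
        layers.AveragePooling2D(2),
        layers.Conv2D(128, 3, activation='relu'),
        layers.Flatten(),
    ])

left_eye = tf.keras.Input(shape=(128, 128, 3), name='left_eye')
right_eye = tf.keras.Input(shape=(128, 128, 3), name='right_eye')
landmarks = tf.keras.Input(shape=(8,), name='landmarks')   # eye-corner coordinates

tower = eye_tower()                                         # weights shared between both eyes
eye_feats = layers.Concatenate()([tower(left_eye), tower(right_eye)])

lm_feats = layers.Dense(16, activation='relu')(landmarks)
x = layers.Concatenate()([eye_feats, lm_feats])
x = layers.Dense(8, activation='relu')(x)
x = layers.Dense(4, activation='relu', name='penultimate')(x)  # features later reused for the SVR
gaze = layers.Dense(2, name='gaze_xy')(x)                      # on-screen (x, y) prediction

model = tf.keras.Model([left_eye, right_eye, landmarks], gaze)
```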

Training

The model binary we received from Google, whose pipeline we are trying to reproduce, is in .tflite format and takes its input as TFRecords. So we first convert our data into .tfrec files to feed into the model.
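As an illustration, a minimal sketch of how one frame might be serialized into a TFRecord; the feature names and preprocessing here are assumptions, not the exact schema used by the pipeline:

```python
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

def serialize_frame(left_eye_jpeg, right_eye_jpeg, landmarks, gaze_xy):
    """Pack one frame into a tf.train.Example (feature names are illustrative)."""
    features = {
        'left_eye': _bytes_feature(left_eye_jpeg),    # encoded left-eye crop
        'right_eye': _bytes_feature(right_eye_jpeg),  # encoded right-eye crop
        'landmarks': _float_feature(landmarks),       # eye-corner landmarks
        'gaze': _float_feature(gaze_xy),              # ground-truth (x, y) in cm
    }
    return tf.train.Example(features=tf.train.Features(feature=features)).SerializeToString()

# Write all frames of a split into a single .tfrec shard.
with tf.io.TFRecordWriter('train.tfrec') as writer:
    for frame in frames:  # `frames` is assumed to be an iterable of preprocessed samples
        writer.write(serialize_frame(*frame))
```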

When our trained model is converted to its TFLite version, there is a possibility of a significant accuracy drop. This can be mitigated by post-training quantization within the TensorFlow pipeline itself, implemented very similarly to Google's pipeline.
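A sketch of post-training quantization with the standard TFLite converter; the model path is a placeholder, and the exact quantization options used in the project may differ:

```python
import tensorflow as tf

# Load the trained Keras model (path is a placeholder).
model = tf.keras.models.load_model('checkpoints/gazetrack_base')

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# The default optimization enables post-training quantization of the weights.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('gazetrack.tflite', 'wb') as f:
    f.write(tflite_model)
```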

I used the Reduce LR on Plateau learning rate scheduler. Experiments were carried out with Exponential LR, Reduce LR on Plateau, and no LR scheduler; Reduce LR on Plateau gave the best results. This is the opposite of what was observed in Abhinav's and Dinesh's PyTorch versions.
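A minimal sketch of this scheduler in Keras; the monitored quantity, factor, and patience below are illustrative, not the exact values used in training:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate when the validation loss stops improving.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                              patience=5, min_lr=1e-6, verbose=1)

model.compile(optimizer='adam', loss='mse')
model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[reduce_lr])
```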

The loss we used was Mean Squared Error (MSE), and the Mean Euclidean Distance (MED) metric was defined as:

```python
import numpy as np

def mean_euc(a, b):
    """Mean Euclidean distance between predicted and ground-truth gaze points (in cm)."""
    euc_dist = np.sqrt(np.sum(np.square(a - b), axis=1))
    return euc_dist.mean()
```

Results

Base model results -

We compare our results with Dinesh's (PyTorch implementation from last year) and Abhinav's (PyTorch implementation with changed hyperparameters).

| Split | TF Implementation | Dinesh's | Abhinav's |
|---|---|---|---|
| MIT | 2.03cm | 2.03cm | 2.06cm |
| Google | 1.80cm | 1.86cm | 1.68cm |

Following the TensorFlow pipeline, we are able to get comparable results. This will be useful later when we compare our own TFLite version with the TFLite binary provided by Google.

TF model checkpoints are available on the project repository.

Here are some visualizations of gaze predictions from this year's TensorFlow implementation.

The ‘+’ signs are the ground-truth gaze locations, dots are base-model predictions, and downward triangles are the mean of the base-model predictions for that particular ground-truth location. Each gaze location has several frames associated with it, and therefore several predictions. Colour coding relates predictions to their corresponding ground truth: all dots and triangles of a colour correspond to the ‘+’ of the same colour. The star (*) marks the camera position, which is at the origin.
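A sketch of this plotting convention with matplotlib; `per_point_predictions` is an assumed dict mapping each ground-truth (x, y) location to the list of predictions for its frames:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
for color, (gt, preds) in zip(plt.cm.tab10.colors, per_point_predictions.items()):
    preds = np.asarray(preds)
    ax.scatter(*gt, marker='+', s=120, color=color)                 # ground-truth location
    ax.scatter(preds[:, 0], preds[:, 1], marker='.', color=color)   # per-frame predictions
    ax.scatter(*preds.mean(axis=0), marker='v', s=80, color=color)  # mean prediction
ax.scatter(0, 0, marker='*', s=150, color='black')                  # camera at the origin
plt.show()
```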

MIT Split

MIT-110-192-merged.png

Google Split

GS-2590-2138-merged.png

SVR Implementation

The next task was to compare SVR results with the current implementations. In their pipeline, Google extract the output of shape (1, 4) from the penultimate layer of the multilayer feed-forward convolutional neural network (CNN) and fit it at a per-user level to build a high-accuracy personalized model. We follow the same approach.

To get the output of the penultimate layer, a hook is attached to the model. Once the penultimate-layer output of shape (x, 4) has been obtained, a multioutput SVR regressor is fitted on it per user, using the test set of the trained model.
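In Keras, one way to expose the penultimate layer and fit the per-user SVR is sketched below; the layer name, SVR parameters, and variable names are assumptions:

```python
import tensorflow as tf
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Sub-model exposing the 4-dimensional penultimate layer
# ('penultimate' is a placeholder for the actual layer name).
feature_model = tf.keras.Model(inputs=model.inputs,
                               outputs=model.get_layer('penultimate').output)

feats = feature_model.predict(user_frames)        # shape (n_frames, 4), one user's frames
# One SVR per output dimension (gaze x and y), fitted for this user only.
svr = MultiOutputRegressor(SVR(kernel='rbf', epsilon=0.1))
svr.fit(feats[fit_idx], gaze_xy[fit_idx])         # fit_idx / test_idx come from the chosen split
personalized_preds = svr.predict(feats[test_idx])
```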

For sweeping the parameters of the SVR, we follow what Google stated in their supplementary material.

To select the best value, the epsilon parameter of the multioutput regressor was swept between 0.01 and 1000. For fitting the SVR, the test set was divided in two ratios, 70:30 and 2/3:1/3, and we then performed 3-fold and 5-fold grid searches. This gives the best parameters for each individual, which are then used to fit the SVR.
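A sketch of such a per-user sweep with scikit-learn's GridSearchCV; the grid below only spans the epsilon range mentioned above, and the remaining parameters are assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Epsilon swept on a log scale between 0.01 and 1000.
param_grid = {'estimator__epsilon': np.logspace(-2, 3, 11)}

# 70:30 split of one user's penultimate-layer features (shuffle on/off per experiment).
X_fit, X_eval, y_fit, y_eval = train_test_split(feats, gaze_xy,
                                                train_size=0.7, shuffle=True)

search = GridSearchCV(MultiOutputRegressor(SVR(kernel='rbf')),
                      param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(X_fit, y_fit)
best_svr = search.best_estimator_   # personalized model for this user
```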

Various Splits for SVR

There are two within-individual SVR personalization versions: one fitted on random data points/samples (all frames), and one fitted on unique ground-truth values (30 points).

Within both of these versions, there are two sub-versions, corresponding to the 70:30 and 2/3:1/3 splits used for fitting.

The unique ground-truth values version corresponds more closely to the real-life scenario, since the random data points version may place very similar samples in both the train and test sets, which inflates the scores without genuinely generalizing.

Another split we tried is the No Shuffle split, where we use, say, the first 70% of the test-set points for fitting the SVR and the remaining 30% for testing it. This also corresponds to the actual use case, where the SVR is first calibrated and then the subject uses the model.

From each of the above-mentioned splits, we select the 10 users with the highest number of frames. This is data the base model has not seen, so the SVR is fitted on it.

MIT Split

1. Mean Results Comparison

Base-model Results:

| Implementation | MED |
|---|---|
| Abhinav's | 1.82cm |
| New (TF) | 1.68cm |

Post-SVR Results:

1. Random data points/samples (All Frames)

| Implementation | 70 & 30 (Shuffle = True) | 70 & 30 (Shuffle = False) | 2/3 & 1/3 (Shuffle = True) | 2/3 & 1/3 (Shuffle = False) |
|---|---|---|---|---|
| Abhinav's | 1.46cm | - | - | - |
| New (TF) | 1.48cm | 1.69cm | 1.49cm | 1.64cm |

2. Unique ground truth values (30 points)

| Implementation | 70 & 30 (Shuffle = True) | 70 & 30 (Shuffle = False) | 2/3 & 1/3 (Shuffle = True) | 2/3 & 1/3 (Shuffle = False) |
|---|---|---|---|---|
| Abhinav's | 1.76cm | - | - | - |
| New (TF) | 1.73cm | 1.75cm | 1.84cm | 1.72cm |

Base Model MED vs Post SVR MED:

1. Random data points/samples (All Frames)

| Version | 70 & 30 (Shuffle = True) | 70 & 30 (Shuffle = False) | 2/3 & 1/3 (Shuffle = True) | 2/3 & 1/3 (Shuffle = False) |
|---|---|---|---|---|
| TF Base Model MED | 1.79cm | 1.88cm | 1.79cm | 1.86cm |
| Post SVR MED | 1.48cm | 1.69cm | 1.49cm | 1.64cm |

2. Unique ground truth values (30 points)

| Version | 70 & 30 (Shuffle = True) | 70 & 30 (Shuffle = False) | 2/3 & 1/3 (Shuffle = True) | 2/3 & 1/3 (Shuffle = False) |
|---|---|---|---|---|
| TF Base Model MED | 1.78cm | 1.75cm | 1.83cm | 2cm |
| Post SVR MED | 1.73cm | 1.75cm | 1.84cm | 1.72cm |

2. Per-Individual Comparison:

1. Random data points/samples (All Frames)

| User ID | No. of frames | Base Model MED (mean across all versions) | SVR-3CV 70&30 (Shuffle = True) | SVR-3CV 70&30 (Shuffle = False) | SVR-3CV 2/3&1/3 (Shuffle = True) | SVR-3CV 2/3&1/3 (Shuffle = False) |
|---|---|---|---|---|---|---|
| 3183 | 874 | 1.38cm | 1.34cm | 1.42cm | 1.35cm | 1.32cm |
| 1877 | 860 | 2.03cm | 1.28cm | 1.13cm | 1.32cm | 1.09cm |
| 1326 | 784 | 1.53cm | 1.31cm | 1.47cm | 1.29cm | 1.44cm |
| 3140 | 783 | 1.54cm | 1.54cm | 1.44cm | 1.56cm | 1.45cm |
| 2091 | 788 | 1.70cm | 1.80cm | 1.98cm | 1.81cm | 1.92cm |
| 2301 | 864 | 1.86cm | 1.36cm | 1.75cm | 1.34cm | 1.69cm |
| 2240 | 801 | 1.46cm | 1.24cm | 1.52cm | 1.23cm | 1.46cm |
| 382 | 851 | 2.38cm | 2.44cm | 2.89cm | 2.44cm | 2.75cm |
| 2833 | 796 | 1.71cm | 1.68cm | 1.86cm | 1.67cm | 1.87cm |
| 2078 | 786 | 1.24cm | 0.82cm | 1.42cm | 0.83cm | 1.37cm |

2. Unique ground truth values (30 points)

| User ID | SVR-3CV 70&30 (Shuffle = True) | SVR-3CV 70&30 (Shuffle = False) | SVR-3CV 2/3&1/3 (Shuffle = True) | SVR-3CV 2/3&1/3 (Shuffle = False) |
|---|---|---|---|---|
| 3183 | 1.83cm | 0.85cm | 1.88cm | 1.58cm |
| 1877 | 1.82cm | 1.46cm | 1.64cm | 1.47cm |
| 1326 | 2.39cm | 2.10cm | 2.12cm | 2.09cm |
| 3140 | 1.20cm | 1.30cm | 1.73cm | 1.58cm |
| 2091 | 1.81cm | 1.99cm | 1.94cm | 1.73cm |
| 2301 | 1.43cm | 1.50cm | 1.77cm | 1.61cm |
| 2240 | 1.26cm | 1.62cm | 1.16cm | 1.73cm |
| 382 | 2.43cm | 2.52cm | 2.69cm | 2.39cm |
| 2833 | 1.82cm | 1.82cm | 1.89cm | 1.79cm |
| 2078 | 1.27cm | 1.28cm | 1.09cm | 1.20cm |

Analysis

We can see that the mean errors when considering all frames are lower than in the unique ground-truth values version. This is due to the data leakage discussed previously. We also notice that the overall mean errors post-SVR are significantly lower than the base model errors. When the set is not shuffled before splitting, the error increases, since this mimics the real-life scenario where the user may look at new ground-truth points. When we consider frames with unique ground-truth values, the per-individual errors vary a lot, which results in almost similar mean errors. Since the SVR was fitted on only 30 frames, it has not generalized well and has possibly learned some unwanted features. This will be addressed in future work.

Google Split

1. Mean Results Comparison

Base-model Results:

| Implementation | MED |
|---|---|
| Abhinav's | 1.15cm |
| New (TF) | 1.24cm |

Base Model MED vs Post SVR MED:

1. Random data points/samples (All Frames)

| Version | 70 & 30 (Shuffle = True) | 70 & 30 (Shuffle = False) | 2/3 & 1/3 (Shuffle = True) | 2/3 & 1/3 (Shuffle = False) |
|---|---|---|---|---|
| TF Base Model MED | 1.31cm | 1.31cm | 1.32cm | 1cm |
| Post SVR MED | 1.04cm | 1.14cm | 1.12cm | 1.04cm |

2. Per-Individual Comparison:

| User ID | No. of frames | Base Model MED (mean across all versions) | SVR-3CV 70&30 (Shuffle = True) | SVR-3CV 70&30 (Shuffle = False) | SVR-3CV 2/3&1/3 (Shuffle = True) | SVR-3CV 2/3&1/3 (Shuffle = False) |
|---|---|---|---|---|---|---|
| 503 | 965 | 1.38cm | 1.37cm | 1.35cm | 1.32cm | 1.41cm |
| 1866 | 1018 | 1.34cm | 0.86cm | 1.24cm | 1.18cm | 0.88cm |
| 2459 | 1006 | 1.48cm | 0.69cm | 0.81cm | 0.81cm | 0.68cm |
| 1816 | 989 | 1.04cm | 0.92cm | 0.93cm | 0.92cm | 0.94cm |
| 3004 | 983 | 1.22cm | 1.18cm | 1.07cm | 1.05cm | 1.16cm |
| 3253 | 978 | 1.26cm | 0.84cm | 1.07cm | 0.98cm | 0.84cm |
| 1231 | 968 | 1.39cm | 1.09cm | 1.33cm | 1.36cm | 1.06cm |
| 2152 | 957 | 1.28cm | 1.36cm | 1.28cm | 1.27cm | 1.38cm |
| 2015 | 947 | 1.27cm | 1.12cm | 1.23cm | 1.2cm | 1.11cm |
| 1046 | 946 | 1.24cm | 0.97cm | 1.07cm | 1.07cm | 0.97cm |

Analysis

Since the Google split has frames from each individual in both sets, it results in very low errors on the 10-individual dataset compared to the MIT split. Google use this version, and quite possibly this is the reason their mean errors are very low (0.46±0.03cm).

App

An Android app was used to collect the data. The users' photos were taken at random moments while circles/dots appeared on the screen. The centre of the circle is recorded as the (x, y) gaze coordinate, and frames were assigned to a particular coordinate based on their timestamps.

New Learnings

Future Scope and Improvements

References

1. Eye Tracking for Everyone
K. Krafka*, A. Khosla*, P. Kellnhofer, H. Kannan, S. Bhandarkar, W. Matusik and A. Torralba
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

2. Accelerating eye movement research via accurate and affordable smartphone eye tracking
Valliappan, N., Dai, N., Steinberg, E., He, J., Rogers, K., Ramachandran, V., Xu, P., Shojaeizadeh, M., Guo, L., Kohlhoff, K. and Navalpakkam, V.
Nature communications, 2020

Acknowledgements

I would like to thank my mentors Dr. Suresh Krishna and Mr. Dinesh Sathia Raj for their guidance in every aspect of this project. This work would not have been possible without their support.