Visualization Recommender System
lower the barrier to exploring basic visualizations
- by automatically generating results
- for analysts to search and select, not manually specify
Abstract

machine learning-based approach
- learns visualization design choices
- from a large corpus of datasets
- from associated visualisations
by…
- identify five key design choices
- using one million dataset-visualisation pairs
evaluation
for : generalizability and uncertainty
by : benchmark with a crowdsourced test set
result : comparable to human performance
Problem Formulation
Data visualization communicates information by representing data with visual elements
representations are specified using…
encodings : map from data to the retinal properties of graphical marks
retinal properties
- e.g. position, length, colour
graphical marks
- e.g. point, line, rectangle
e.g.
To create a scatterplot showing the relationship between MPG and Hp,
encode each pair of data points as a circle on a 2D plane
while specifying other retinal properties such as size and colour
Method
Vega-lite
- selecting mark type and fields to be encoded along the x- and y-axes
Tableau
- placing the 2 columns onto the respective column and row shelves
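The Vega-Lite route above can be sketched as a spec written out as a Python dict. This is a minimal sketch only: the field names `Hp` and `MPG` and the inline rows are illustrative assumptions, not values from a real dataset.

```python
import json

# Hypothetical rows; the field names "Hp" and "MPG" are assumptions for illustration.
rows = [
    {"Hp": 130, "MPG": 18.0},
    {"Hp": 165, "MPG": 15.0},
    {"Hp": 95, "MPG": 24.0},
]

# A Vega-Lite spec names the mark type and which field is encoded on each axis.
spec = {
    "data": {"values": rows},
    "mark": "point",
    "encoding": {
        "x": {"field": "Hp", "type": "quantitative"},
        "y": {"field": "MPG", "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))
```

The two design choices named in the notes (mark type, fields per axis) are the `mark` and `encoding` entries; everything else is left to defaults.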
formulate basic visualization of a dataset d as a set of interrelated design choices C = {c}
set of choices that result in valid visualizations = a subset of the space of all possible choices
automatically suggest a subset of design choices C_rec ⊆ C that maximize effectiveness
effectiveness can be defined by informational measures (e.g. efficiency, accuracy, memorability) or emotive measures
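The formulation above can be illustrated with a toy sketch: enumerate a small design space, keep the valid choices, and return the top-scoring ones as C_rec. Everything here is an assumption for illustration: each choice bundles mark and axes into one tuple, and `is_valid` and `effectiveness` are hypothetical stand-ins for what a real system would learn or measure.

```python
from itertools import product

# Hypothetical design space: every (mark, x, y) combination for a toy dataset.
marks = ["scatter", "line", "bar"]
columns = {"year": "temporal", "sales": "quantitative", "region": "categorical"}
all_choices = list(product(marks, columns, columns))

def is_valid(choice):
    """Toy validity rule (an assumption): x and y must be different columns."""
    _, x, y = choice
    return x != y

def effectiveness(choice):
    """Placeholder score (an assumption), not a measured or learned quantity."""
    mark, x, y = choice
    score = 0.0
    if columns[y] == "quantitative":
        score += 1.0
    if mark == "line" and columns[x] == "temporal":
        score += 1.0
    if mark == "bar" and columns[x] == "categorical":
        score += 1.0
    return score

# The valid choices form a subset of the space of all possible choices.
valid = [c for c in all_choices if is_valid(c)]

# C_rec: the k valid choices with the highest effectiveness score.
k = 2
c_rec = sorted(valid, key=effectiveness, reverse=True)[:k]
print(c_rec)
```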
Related Work
Rule-based Visualization Recommender Systems
ML-based Visualization Recommender Systems
DeepEye
Data2Vis
Draco-Learn
VizML
Pros
learning task : learn to… predict design choices
- easier to quantitatively validate
- easier to provide interpretable measures of feature importance
- easily integrated into visualization systems
DeepEye : classify and rank visualizations
Data2Vis : end-to-end generation model
Draco-Learn : soft constraints weights
data quantity : training corpus is… orders of magnitude larger than DeepEye and Data2Vis
- permits the use of large feature sets that capture many aspects of a dataset
- permits the use of high-capacity models such as deep neural network
data quality : the datasets used are…
- extremely diverse in shape, structure, and distribution
- the result of real visual analysis by analysts on their own datasets
others
- few datasets used to train
- visualisations are generated by rule-based systems
- evaluated under controlled settings
Cons
- only recommends visual encodings, not data queries
- does not create ★ an application that employs the visualization model ★
Data
Feature Extraction
distinguish the feature categories to…
- organize how to create and interpret features
- observe the contribution of different types of features
- account for some categories being less generalizable than others

Dimensions D : the number of rows in a column
Types T : whether a column is categorical, temporal, or quantitative
Values V : the statistical and structural properties of the values within a column
Names N : the column name
👉 ordered D - T - V - N
by how biased the features are expected to be towards the corpus
Method
describe each pair of columns with 30 pairwise-column features
divide into 2 categories : Values and Names
many pairwise-column features depend on the individual column types
create 841 dataset-level features by aggregating single- and pairwise-column features using 16 aggregation functions
aggregation function : convert single- and pairwise-column features into scalar values
- train separate models per number of columns
- include column features with padding
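The aggregation step can be sketched as follows: per-column feature values are collapsed into scalars, so the dataset-level vector has the same length however many columns a dataset has. This is a minimal sketch under assumptions: the feature names are hypothetical and only four aggregation functions are shown, not the 16 mentioned in the notes.

```python
import statistics

# Hypothetical single-column features for a 3-column dataset.
column_features = [
    {"length": 100, "entropy": 2.1, "is_quantitative": 1},
    {"length": 100, "entropy": 0.7, "is_quantitative": 0},
    {"length": 100, "entropy": 3.4, "is_quantitative": 1},
]

# A few aggregation functions (an illustrative subset, chosen by assumption).
aggregators = {
    "mean": statistics.mean,
    "min": min,
    "max": max,
    "std": statistics.pstdev,
}

def aggregate(features):
    """Collapse per-column feature dicts into one flat dataset-level dict."""
    out = {}
    for name in features[0]:
        values = [f[name] for f in features]
        for agg_name, fn in aggregators.items():
            out[f"{name}_{agg_name}"] = fn(values)
    return out

dataset_features = aggregate(column_features)
print(len(dataset_features))  # number of features x number of aggregators
```

The same function applies to pairwise-column features by passing one dict per pair of columns instead of one per column.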
Design Choice Extraction
extract an analyst’s design choices by…
parsing the traces, because each visualization consists of traces that associate collections of data with visual elements
Encoding-level Design Choices
mark type : scatter, etc.
column encoding : which column is represented on which axis, whether or not an X or Y column is the single column represented along that axis
by aggregating these choices…
👇
Visualization-level Design Choices
describe the type shared among all traces
determine whether the visualization has a shared axis
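A rough sketch of the extraction described above, using hypothetical trace dicts. The keys `type`, `xsrc` and `ysrc` and the exact definition of a shared axis are assumptions made for illustration, not the authors' parsing code.

```python
# Hypothetical traces: each ties data columns to a visual element.
traces = [
    {"type": "scatter", "xsrc": "date", "ysrc": "price"},
    {"type": "scatter", "xsrc": "date", "ysrc": "volume"},
]

def encoding_level_choices(traces):
    """One record per encoded column: mark type, axis, and single-column flag."""
    x_cols = {t["xsrc"] for t in traces}
    y_cols = {t["ysrc"] for t in traces}
    seen = set()
    records = []
    for t in traces:
        for axis, col, cols in (("x", t["xsrc"], x_cols), ("y", t["ysrc"], y_cols)):
            if (axis, col) in seen:
                continue  # a column shared by several traces is recorded once
            seen.add((axis, col))
            records.append({
                "column": col,
                "mark_type": t["type"],
                "axis": axis,
                "is_single_on_axis": len(cols) == 1,
            })
    return records

def visualization_level_choices(traces):
    """Aggregate trace-level choices into visualization-level ones."""
    types = {t["type"] for t in traces}
    shared_type = next(iter(types)) if len(types) == 1 else None
    # Assumed definition: several traces that all use the same x or y column.
    has_shared_axis = len(traces) > 1 and (
        len({t["xsrc"] for t in traces}) == 1
        or len({t["ysrc"] for t in traces}) == 1
    )
    return {"visualization_type": shared_type, "has_shared_axis": has_shared_axis}

records = encoding_level_choices(traces)
summary = visualization_level_choices(traces)
print(records)
print(summary)
```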
Methods
Feature Processing
apply one-hot encoding to categorical features
set numeric values above the 99th percentile or below the 1st percentile to the respective cut-offs
impute missing categorical values using the mode of non-missing values, and missing numeric values with the mean of non-missing values
remove the mean of numeric fields and scale to unit variance
randomly remove datasets that are exact duplicates of each other
remove all but one randomly selected dataset per user
👉 remove bias towards more prolific users
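The processing steps above can be sketched with NumPy. This is a minimal sketch, not the authors' pipeline: the helper names, the toy inputs, and the order in which the steps are chained are assumptions.

```python
import random
from collections import Counter

import numpy as np

def clip_to_percentiles(values, low=1, high=99):
    """Set values outside the [low, high] percentiles to the cut-offs."""
    lo, hi = np.nanpercentile(values, [low, high])
    return np.clip(values, lo, hi)

def impute_mean(values):
    """Replace NaN entries with the mean of the non-missing values."""
    values = np.asarray(values, dtype=float)
    return np.where(np.isnan(values), np.nanmean(values), values)

def standardize(values):
    """Remove the mean and scale to unit variance."""
    centered = values - values.mean()
    std = values.std()
    return centered / std if std > 0 else centered

def impute_mode(labels):
    """Replace None entries with the most common non-missing label."""
    mode = Counter(l for l in labels if l is not None).most_common(1)[0][0]
    return [mode if l is None else l for l in labels]

def one_hot(labels):
    """One-hot encode labels; columns follow sorted category order."""
    categories = sorted(set(labels))
    return np.array([[int(l == c) for c in categories] for l in labels])

def one_per_user(datasets, rng):
    """Keep one randomly selected dataset id per user."""
    by_user = {}
    for user, dataset_id in datasets:
        by_user.setdefault(user, []).append(dataset_id)
    return {user: rng.choice(ids) for user, ids in by_user.items()}

# Toy inputs (hypothetical values).
numeric = np.array([1.0, 2.0, np.nan, 4.0, 1000.0])
processed = standardize(impute_mean(clip_to_percentiles(numeric)))
encoded = one_hot(impute_mode(["a", None, "b", "a"]))
kept = one_per_user([("u1", 1), ("u1", 2), ("u2", 3)], random.Random(0))
print(processed, encoded, kept, sep="\n")
```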
Prediction Tasks
Two visualization-level prediction tasks
use dataset-level features to predict visualization-level design choices
Three encoding-level prediction tasks
use features about individual columns to predict how they are visually encoded
consider each column independently, instead of alongside other columns in the same dataset
Visualization Type & Mark Type task
2 class task : line vs. bar
3 class task : scatter vs. line vs. bar
Neural Network and Baseline Models
fully-connected feedforward neural network
with 3 hidden layers, each consisting of 1000 neurons
with ReLU activation functions
split the data into 60/20/20 (train/validation/test sets) 5 times using 5-fold cross-validation
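The architecture and split above can be sketched in NumPy as a forward pass only. Assumptions not stated in the notes: the softmax output layer, the He-style initialisation, the random seeds, and the 841-feature / 3-class shapes picked to match the dataset-level features and the 3 class task.

```python
import numpy as np

def init_mlp(n_features, n_classes, hidden=1000, n_hidden_layers=3, seed=0):
    """Create (weights, bias) pairs for a fully-connected feedforward network."""
    rng = np.random.default_rng(seed)
    sizes = [n_features] + [hidden] * n_hidden_layers + [n_classes]
    # He-style initialisation is an assumption; the notes do not specify one.
    return [
        (rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """ReLU on the hidden layers, softmax on the output layer (assumed)."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def split_60_20_20(n, seed=0):
    """Shuffle indices and cut them into train/validation/test parts."""
    idx = np.random.default_rng(seed).permutation(n)
    a, b = n * 6 // 10, n * 8 // 10
    return idx[:a], idx[a:b], idx[b:]

params = init_mlp(n_features=841, n_classes=3)
x = np.random.default_rng(1).normal(size=(4, 841))  # 4 hypothetical datasets
probs = forward(params, x)
train, val, test = split_60_20_20(10)
print(probs.shape, len(train), len(val), len(test))
```

Training (loss, optimiser, early stopping on the validation set) is left out on purpose, since the notes do not describe it.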
Features
- D
- D+T
- D+T+V
- D+T+V+N = ALL
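The nested feature sets can be built cumulatively, as in this small sketch. The feature names inside each group are hypothetical placeholders.

```python
from itertools import accumulate

# Feature groups in the D - T - V - N order; the feature names are hypothetical.
groups = {
    "D": ["length"],
    "T": ["is_categorical", "is_temporal", "is_quantitative"],
    "V": ["mean", "entropy"],
    "N": ["name_length"],
}

# Cumulative sets: D, D+T, D+T+V, D+T+V+N.
names = list(accumulate(groups, lambda a, b: f"{a}+{b}"))
columns = list(accumulate(groups.values(), lambda a, b: a + b))
feature_sets = dict(zip(names, columns))

# The full set is referred to as ALL.
feature_sets["ALL"] = feature_sets.pop("D+T+V+N")
print(list(feature_sets))
```

Training one model per entry of `feature_sets` makes it possible to see how much each added group contributes.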