Visualization Recommender System
lower the barrier to exploring basic visualizations
- by automatically generating results
- for analysts to search and select, not manually specify
Abstract
machine learning-based approach
- learns visualization design choices from a large corpus of datasets and associated visualisations
by…
- identify five key design choices
- using one million dataset-visualisation pairs
evaluation
for : generalizability and uncertainty
by : benchmark with a crowdsourced test set
result : comparable to human performance
Problem Formulation
Data visualization communicates information by representing data with visual elements
representations are specified using…
encodings: map from data to the retinal properties of graphical marks
- retinal properties: e.g. position, length, colour
- graphical marks: e.g. point, line, rectangle
e.g.
To create a scatterplot showing the relationship between MPG and Hp,
encode each pair of data points as the position of a circle on a 2D plane,
while specifying other retinal properties such as size and colour
Method
Vega-lite
- selecting mark type and fields to be encoded along the x- and y-axes
Tableau
- placing the 2 columns onto the respective column and row shelves
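As a rough illustration of the Vega-Lite route above, the scatterplot example could be written as a spec like the one below, held here as a Python dict. The inline data values and the `size` and `color` settings are assumptions for illustration only; the notes name just the two fields, MPG and Hp.

```python
import json

# Hypothetical rows; the notes only name the two fields (MPG and Hp).
rows = [
    {"Hp": 130, "MPG": 18.0},
    {"Hp": 165, "MPG": 15.0},
    {"Hp": 95, "MPG": 24.0},
]

# A minimal Vega-Lite-style spec: one mark type plus x/y field encodings.
spec = {
    "data": {"values": rows},
    # Mark type plus other retinal properties (size, colour).
    "mark": {"type": "circle", "size": 60, "color": "steelblue"},
    "encoding": {
        "x": {"field": "Hp", "type": "quantitative"},
        "y": {"field": "MPG", "type": "quantitative"},
    },
}

print(json.dumps(spec, indent=2))
```

The analyst's manual work is exactly the content of `mark` and `encoding`; these are the design choices a recommender would try to suggest.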
formulate basic visualization of a dataset d as a set of interrelated design choices C = {c}
- the set of choices that result in valid visualizations is a subset of the space of all possible choices
- automatically suggest a subset of design choices C_rec ⊆ C that maximize effectiveness
- effectiveness can be defined by informational measures (e.g. efficiency, accuracy, memorability) or emotive measures
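One way to picture this formulation: enumerate candidate design choices, keep those that yield valid visualizations, and recommend the ones that score highest. The sketch below is a toy. The validity rule and the effectiveness scores are made up for illustration and do not come from the source.

```python
from itertools import product

# Hypothetical design-choice space: mark type x which column goes on the x-axis.
marks = ["scatter", "line", "bar"]
x_columns = ["Hp", "MPG"]
all_choices = list(product(marks, x_columns))

def is_valid(choice):
    # Assumed rule for illustration: bar charts are only valid with "Hp" on x.
    mark, x_col = choice
    return not (mark == "bar" and x_col == "MPG")

def effectiveness(choice):
    # Placeholder score; a real system would use measured or learned values.
    scores = {"scatter": 0.9, "line": 0.5, "bar": 0.3}
    return scores[choice[0]]

# C: the subset of all possible choices that produce valid visualizations.
C = [c for c in all_choices if is_valid(c)]

# C_rec: the subset of C with the highest effectiveness.
best = max(effectiveness(c) for c in C)
C_rec = [c for c in C if effectiveness(c) == best]
print(C_rec)
```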
Related Work
Rule-based Visualization Recommender Systems
ML-based Visualization Recommender Systems
DeepEye
Data2Vis
Draco-Learn
VizML
Pros
learning task: learn to…predict design choices
- easier to quantitatively validate
- easier to provide interpretable measures of feature importance
- easily integrated into visualization systems
DeepEye : classify and rank visualizations
Data2Vis : end-to-end generation model
Draco-Learn : soft constraints weights
data quantity: training corpus is…orders of magnitude larger than DeepEye and Data2Vis
- permits the use of large feature sets that capture many aspects of a dataset
- permits the use of high-capacity models such as deep neural networks
data quality: the datasets used are…extremely diverse in shape, structure, and distribution
the result of real visual analysis by analysts on their own datasets
others
- few datasets used to train
- visualisations are generated by rule-based systems
- evaluated in controlled settings
Cons
- only recommends visual encodings, not data queries
- does not create ★ an application that employs the visualization model ★
Data
Feature Extraction
distinguish the feature categories to…
- organize how to create and interpret features
- observe the contribution of different types of features
- account for some categories being less generalizable than others
- Dimensions D: the number of rows in a column
- Types T: whether a column is categorical, temporal, or quantitative
- Values V: the statistical and structural properties of the values within a column
- Names N: the column name
👉 ordered D - T - V - N by how biased the features are expected to be towards the corpus
Method
describe each pair of columns with 30 pairwise-column features
- divided into 2 categories: Values and Names
- many pairwise-column features depend on the individual column types
create 841 dataset-level features by aggregating single- and pairwise-column features using 16 aggregation functions
- aggregation function: converts single- and pairwise-column features into scalar values
other approaches (instead of aggregating)
- train separate models per number of columns
- include column features with padding
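A minimal sketch of the aggregation step, assuming three toy single-column features and four aggregation functions. The actual features and the 16 aggregation functions are not listed in these notes, so every feature and function name here is a stand-in.

```python
import statistics

def column_features(values):
    # Toy single-column features: one Dimensions feature, two Values features.
    return {
        "length": len(values),
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
    }

# Stand-in aggregation functions: each turns a list of per-column values into a scalar.
AGGREGATIONS = {
    "min": min,
    "max": max,
    "mean": statistics.mean,
    "std": statistics.pstdev,
}

def dataset_features(columns):
    # Aggregate every single-column feature across columns, so the output
    # has the same size no matter how many columns the dataset has.
    per_column = [column_features(v) for v in columns.values()]
    out = {}
    for feat in per_column[0]:
        vals = [f[feat] for f in per_column]
        for agg_name, agg in AGGREGATIONS.items():
            out[f"{feat}_{agg_name}"] = agg(vals)
    return out

two_cols = {"Hp": [130, 165, 95], "MPG": [18.0, 15.0, 24.0]}
three_cols = dict(two_cols, Weight=[3504, 3693, 2372])
print(len(dataset_features(two_cols)), len(dataset_features(three_cols)))
```

Because each feature is reduced to scalars, a two-column and a three-column dataset produce feature vectors of the same length, which is what lets one model handle datasets of different widths.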
Design Choice Extraction
extract an analyst’s design choices by…
parsing the traces, because each visualization consists of traces that associate collections of data with visual elements
Encoding-level Design Choices
- mark type: scatter, etc.
- column encoding: which column is represented on which axis, and whether or not an X or Y column is the single column represented along that axis
by aggregating these choices…
👇
Visualization-level Design Choices
describe the type shared among all traces
determine whether the visualization has a shared axis
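The extraction described above could look roughly like this, assuming each trace is a dict holding a mark type and references to its x and y columns. The key names (`type`, `xsrc`, `ysrc`) and the rule used to decide on a shared axis are assumptions for illustration, not the corpus schema.

```python
def visualization_choices(traces):
    # Encoding-level choices: mark type and axis columns for each trace.
    mark_types = [t["type"] for t in traces]
    x_cols = {t["xsrc"] for t in traces}
    y_cols = {t["ysrc"] for t in traces}

    # Visualization type: the mark type shared among all traces, if any.
    vis_type = mark_types[0] if len(set(mark_types)) == 1 else None

    # Assumed rule: with several traces, an axis is shared when every trace
    # uses the same single column along it.
    multi = len(traces) > 1
    has_shared_axis = multi and (len(x_cols) == 1 or len(y_cols) == 1)
    return {"visualization_type": vis_type, "has_shared_axis": has_shared_axis}

traces = [
    {"type": "line", "xsrc": "year", "ysrc": "sales"},
    {"type": "line", "xsrc": "year", "ysrc": "profit"},
]
print(visualization_choices(traces))
```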
Methods
Feature Processing
apply one-hot encoding to categorical features
set numeric values above the 99th percentile or below the 1st percentile to the respective cut-offs
impute missing categorical values using the mode of non-missing values, and missing numeric values with the mean of non-missing values
remove the mean of numeric fields and scale to unit variance
randomly remove datasets that are exact duplicates of each other
remove all but one randomly selected dataset per user
👉 remove bias towards more prolific users
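A minimal, standard-library sketch of the per-feature steps above (clip to percentile cut-offs, impute, standardize, one-hot encode). The nearest-rank percentile rule and the order in which the steps run are assumptions; the notes do not specify either.

```python
import statistics

def percentile(sorted_vals, p):
    # Nearest-rank style percentile on already sorted values (assumed method).
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]

def process_numeric(values):
    present = sorted(v for v in values if v is not None)
    lo, hi = percentile(present, 1), percentile(present, 99)
    # Set values beyond the 1st/99th percentile to the respective cut-off.
    clipped = [None if v is None else min(max(v, lo), hi) for v in values]
    # Impute missing values with the mean of the non-missing values.
    mean = statistics.mean(v for v in clipped if v is not None)
    filled = [mean if v is None else v for v in clipped]
    # Remove the mean and scale to unit variance.
    mu, sigma = statistics.mean(filled), statistics.pstdev(filled)
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]

def one_hot(values):
    # Impute missing categories with the mode, then one-hot encode.
    mode = statistics.mode(v for v in values if v is not None)
    filled = [mode if v is None else v for v in values]
    categories = sorted(set(filled))
    return [[int(v == c) for c in categories] for v in filled]

scaled = process_numeric([1.0, 2.0, None, 4.0, 1000.0])
encoded = one_hot(["q", None, "c", "q"])
print(scaled, encoded)
```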
Prediction Tasks
Two visualization-level prediction tasks
use dataset-level features
to predict visualization-level design choices
Three encoding-level prediction tasks
use features about individual columns
to predict how they are visually encoded
consider each column independently,
instead of alongside other columns in the same dataset
Visualization Type & Mark Type task
2-class task : line vs. bar
3-class task : scatter vs. line vs. bar
Neural Network and Baseline Models
fully-connected feedforward neural network
with 3 hidden layers, each consisting of 1000 neurons
with ReLU activation functions
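To make the architecture concrete, here is a dependency-free forward pass for a fully-connected network with three ReLU hidden layers. It is only a sketch: the weight initialization, the softmax output layer, and the input size are assumptions, training is omitted, and a real implementation would use a deep learning library.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (assumed initialization).
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def build_network(n_features, n_classes, hidden=1000, n_hidden_layers=3, seed=0):
    rng = random.Random(seed)
    sizes = [n_features] + [hidden] * n_hidden_layers + [n_classes]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(network, x):
    # ReLU after every hidden layer, softmax over classes at the output.
    for layer in network[:-1]:
        x = relu(dense(x, layer))
    return softmax(dense(x, network[-1]))

# Small hidden width here to keep the demo fast; the notes use 1000.
net = build_network(n_features=8, n_classes=3, hidden=16)
probs = forward(net, [0.5] * 8)
print(probs)
```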
split the data 60/20/20 (train/validation/test sets), repeated 5 times using 5-fold cross-validation
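One way to get a 60/20/20 split out of 5-fold cross-validation is to rotate which fold serves as the test set and which as the validation set, training on the remaining three. The rotation scheme below is an assumption; the notes only give the ratios and the number of repetitions.

```python
import random

def five_fold_splits(n_items, seed=0):
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal indices round-robin into 5 folds of (nearly) equal size.
    folds = [indices[i::5] for i in range(5)]
    for k in range(5):
        val_k = (k + 1) % 5
        test = folds[k]
        val = folds[val_k]
        # The three remaining folds form the training set.
        train = [i for j, f in enumerate(folds) if j not in (k, val_k) for i in f]
        yield train, val, test

for train, val, test in five_fold_splits(100):
    print(len(train), len(val), len(test))
```

Across the five rounds every item appears in the test set exactly once, so test scores can be averaged over the whole corpus.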
Features
- D
- D+T
- D+T+V
- D+T+V+N = ALL