VizML: A Machine Learning Approach to Visualization Recommendation (Paper Review)

Visualization Recommender System
lower the barrier to exploring basic visualizations

by automatically generating results

for analysts to search and select, not manually specify

Abstract

machine learning-based approach

learns visualization design choices
from a large corpus of datasets
and their associated visualizations

by…

identify five key design choices
using one million dataset-visualisation pairs

evaluation

for : generalizability and uncertainty

by : benchmark with a crowdsourced test set

result : comparable to human performance

Problem Formulation

Data visualization communicates information by representing data with visual elements

representations are specified using…

encodings : map from data to the retinal properties of graphical marks
retinal properties
- e.g. position, length, colour
graphical marks
- e.g. point, line, rectangle

e.g.

To create a scatterplot showing the relationship between MPG and Hp,

encode each pair of data points as the position of a circle on a 2D plane
while specifying other retinal properties such as size and colour

Method

Vega-lite

selecting mark type and fields to be encoded along the x- and y-axes

Tableau

placing the 2 columns onto the respective column and row shelves
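
As a rough illustration of the Vega-Lite route above, the MPG vs. Hp scatterplot can be written as a declarative specification. The sketch below builds it as a plain Python dict; the field names `Horsepower` and `Miles_per_Gallon` and the data URL are assumptions chosen for illustration, not taken from the paper.

```python
import json

# Vega-Lite-style specification expressed as a Python dict.
# Assumption: the data URL and field names are placeholders;
# substitute those of your own dataset.
scatter_spec = {
    "data": {"url": "data/cars.json"},
    "mark": "point",  # graphical mark
    "encoding": {
        # position encodings: one field per axis
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    },
}

# Serialize to JSON, the form a Vega-Lite renderer consumes.
print(json.dumps(scatter_spec, indent=2))
```

The two design choices named above (mark type, and which fields go on the x- and y-axes) map directly onto the `mark` and `encoding` keys.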

formulate basic visualization of a dataset d as a set of interrelated design choices C = {c}

set of choices that result in valid visualizations

=== a subset of the space of all possible choices
automatically suggest a subset of design choices C_rec ⊆ C that maximize effectiveness
effectiveness can be defined by informational or emotive measures
- informational : e.g. efficiency, accuracy, memorability
- emotive : e.g. engagement
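
The formulation above (pick C_rec from the valid choices so that effectiveness is maximized) can be sketched as a filter followed by a ranking. This is only a toy illustration: `is_valid` and `effectiveness` are hypothetical stand-ins, since in practice the scoring function is exactly what a recommender has to supply (by rules or by learning).

```python
from itertools import product

def recommend(columns, mark_types, is_valid, effectiveness, k=1):
    """Return the k valid design-choice sets with the highest score."""
    # Space of all possible choices: every mark type with every (x, y) pair.
    candidates = [
        {"mark": m, "x": x, "y": y}
        for m, x, y in product(mark_types, columns, columns)
        if x != y
    ]
    # Keep only the subset that yields valid visualizations.
    valid = [c for c in candidates if is_valid(c)]
    return sorted(valid, key=effectiveness, reverse=True)[:k]

# Hypothetical validity rule and scoring function, for illustration only.
def is_valid(choice):
    return not (choice["mark"] == "line" and choice["x"] == "Name")

def effectiveness(choice):
    score = 0.0
    if choice["mark"] == "scatter":
        score += 1.0
    if choice["x"] == "Hp" and choice["y"] == "MPG":
        score += 1.0
    return score

best = recommend(["MPG", "Hp", "Name"], ["scatter", "line", "bar"],
                 is_valid, effectiveness)
print(best)
```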

Related Work

Rule-based Visualization Recommender Systems

ML-based Visualization Recommender Systems

DeepEye
Data2Vis
Draco-Learn

VizML

Pros
- learning task : learn to…
  
  predict design choices
  - easier to quantitatively validate
  - easier to provide interpretable measures of feature importance
  - easily integrated into visualization systems
  whereas…
  DeepEye : learns to classify and rank visualizations
  
  Data2Vis : learns an end-to-end generation model
  
  Draco-Learn : learns soft-constraint weights
- data quantity : training corpus is…
  
  orders of magnitude larger than DeepEye and Data2Vis
  - permits the use of large feature sets that capture many aspects of a dataset
  - permits the use of high-capacity models such as deep neural networks
- data quality : the datasets used are…
  
  extremely diverse in shape, structure, and distribution
  
  the result of real visual analysis by analysts on their own datasets
  other systems
  - few datasets used for training
  - visualizations are generated by rule-based systems
  - evaluated in controlled settings

Cons
- only recommends visual encodings, not data queries
- does not create ★ an application that employs the visualization model ★

Data

Feature Extraction

distinguish the feature categories to…

organize how to create and interpret features
observe the contribution of different types of features
capture that some categories may be less generalizable than others

Dimensions D : the number of rows in a column
Types T : whether a column is categorical, temporal, or quantitative
Values V : the statistical and structural properties of the values within a column
Names N : the column name

👉 ordered D → T → V → N by how biased the features are expected to be towards the corpus
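
To make the four categories concrete, here is a minimal sketch that computes a few single-column features and tags each with its category. The specific features (`num_unique`, `pct_missing`, `x_in_name`, and so on) and the crude type inference are assumptions for illustration; the paper's actual feature list is much longer and is not reproduced in these notes.

```python
import statistics
from datetime import datetime

def infer_type(values):
    """Crude type inference: quantitative, temporal, or categorical."""
    present = [v for v in values if v is not None]
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "quantitative"
    if present and all(isinstance(v, datetime) for v in present):
        return "temporal"
    return "categorical"

def column_features(name, values):
    """Return a dict of illustrative single-column features."""
    present = [v for v in values if v is not None]
    col_type = infer_type(values)
    features = {
        # D: dimensions
        "length": len(values),
        # T: types
        "is_quantitative": col_type == "quantitative",
        "is_temporal": col_type == "temporal",
        "is_categorical": col_type == "categorical",
        # V: statistical and structural properties of the values
        "num_unique": len(set(present)),
        "pct_missing": 1 - len(present) / len(values) if values else 0.0,
        # N: properties of the column name
        "name_length": len(name),
        "x_in_name": "x" in name.lower(),
    }
    if col_type == "quantitative" and present:
        features["mean"] = statistics.fmean(present)
        features["std"] = statistics.pstdev(present)
    return features

f = column_features("x_speed", [1, 2, 3, None])
print(f)
```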

Method

describe each pair of columns with 30 pairwise-column features
divide into 2 categories : Values and Names

many pairwise-column features depend on the individual column types
create 841 dataset-level features by aggregating single- and pairwise-column features using 16 aggregation functions
aggregation function : converts single-column features (across all columns) and pairwise-column features (across all column pairs) into scalar values
alternatives to aggregation :
1. train separate models per number of columns
2. include column features with padding
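
The point of aggregation is that datasets with different numbers of columns end up with feature vectors of the same length. A minimal sketch, assuming only four aggregation functions (the paper uses 16, which these notes do not list) and hypothetical per-column feature names:

```python
import statistics

# A handful of aggregation functions. Assumption: this selection is
# illustrative; it is not the paper's set of 16.
AGGREGATIONS = {
    "min": min,
    "max": max,
    "mean": statistics.fmean,
    "std": statistics.pstdev,
}

def aggregate(column_feature_rows):
    """Collapse a list of per-column feature dicts into one flat dict."""
    dataset_features = {}
    feature_names = sorted({k for row in column_feature_rows for k in row})
    for name in feature_names:
        values = [float(row[name]) for row in column_feature_rows if name in row]
        for agg_name, agg in AGGREGATIONS.items():
            # One scalar per (feature, aggregation function) pair.
            dataset_features[f"{name}_{agg_name}"] = agg(values)
    return dataset_features

two_cols = [{"length": 100, "num_unique": 10}, {"length": 100, "num_unique": 90}]
three_cols = two_cols + [{"length": 100, "num_unique": 50}]

a = aggregate(two_cols)
b = aggregate(three_cols)
print(len(a), len(b))  # same number of dataset-level features
```

The same function would apply to pairwise-column features by passing one dict per column pair instead of one per column.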

Design Choice Extraction

extract an analyst’s design choices by…

parsing the traces, because each visualization consists of traces that associate collections of data with visual elements

Encoding-level Design Choices

mark type : scatter, etc.
column encoding : which column is represented on which axis, whether or not an X or Y column is the single column represented along that axis

by aggregating these choices…

👇

Visualization-level Design Choices

describe the type shared among all traces
determine whether the visualization has a shared axis
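
The two levels of design choices can be sketched as a single pass over the traces. The trace format below (`type`, `x`, `y` keys holding a mark type and column names) is a simplified, hypothetical stand-in for real trace objects, and the aggregation rules are one plausible reading of the notes above rather than the paper's exact procedure.

```python
def extract_design_choices(traces):
    """Derive encoding-level and visualization-level choices from traces."""
    x_cols = {t["x"] for t in traces}
    y_cols = {t["y"] for t in traces}

    # Encoding level: one entry per column.
    encoding_level = {}
    for trace in traces:
        for axis, cols in (("x", x_cols), ("y", y_cols)):
            encoding_level[trace[axis]] = {
                "mark_type": trace["type"],
                "axis": axis,
                # True when this is the only column shown along that axis.
                "is_single_column_on_axis": len(cols) == 1,
            }

    # Visualization level: aggregate over all traces.
    mark_types = {t["type"] for t in traces}
    visualization_level = {
        # Defined only when every trace shares the same type.
        "visualization_type": next(iter(mark_types)) if len(mark_types) == 1 else None,
        "has_shared_axis": len(traces) > 1 and (len(x_cols) == 1 or len(y_cols) == 1),
    }
    return encoding_level, visualization_level

traces = [
    {"type": "line", "x": "date", "y": "sales"},
    {"type": "line", "x": "date", "y": "profit"},
]
enc, viz = extract_design_choices(traces)
print(enc)
print(viz)
```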

Methods

Feature Processing

apply one-hot encoding to categorical features
set numeric values above the 99th percentile or below the 1st percentile to the respective cut-offs
impute missing categorical values using the mode of non-missing values, and missing numeric values with the mean of non-missing values
remove the mean of numeric fields and scale to unit variance
randomly remove datasets that are exact duplicates of each other

remove all but one randomly selected dataset per user

👉 remove bias towards more prolific users
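
The processing steps above can be sketched with the standard library alone. The order of operations (clip, then impute, then standardize) follows the order of the list; the `user` key on each dataset record and the fixed random seed are assumptions for illustration.

```python
import random
import statistics
from collections import Counter

def process_numeric(values):
    """Clip to the 1st/99th percentiles, mean-impute, then standardize."""
    present = [v for v in values if v is not None]
    cuts = statistics.quantiles(present, n=100, method="inclusive")
    low, high = cuts[0], cuts[-1]  # 1st and 99th percentile cut-offs
    clipped = [None if v is None else min(max(v, low), high) for v in values]
    mean = statistics.fmean([v for v in clipped if v is not None])
    imputed = [mean if v is None else v for v in clipped]
    std = statistics.pstdev(imputed) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in imputed]

def process_categorical(values):
    """Mode-impute missing entries, then one-hot encode."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    imputed = [mode if v is None else v for v in values]
    categories = sorted(set(imputed))
    return categories, [[int(v == c) for c in categories] for v in imputed]

def one_dataset_per_user(datasets, seed=0):
    """Keep one randomly chosen dataset per user."""
    rng = random.Random(seed)
    by_user = {}
    for dataset in datasets:
        by_user.setdefault(dataset["user"], []).append(dataset)
    return [rng.choice(group) for group in by_user.values()]

z = process_numeric([1.0, 2.0, None, 3.0, 1000.0])
cats, onehot = process_categorical(["a", None, "b", "a"])
kept = one_dataset_per_user(
    [{"user": "u1", "id": 1}, {"user": "u1", "id": 2}, {"user": "u2", "id": 3}]
)
print(z, cats, onehot, kept, sep="\n")
```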

Prediction Tasks

Two visualization-level prediction tasks

use dataset-level features

to predict visualization-level design choices

Three encoding-level prediction tasks

use features about individual columns

to predict how they are visually encoded

consider each column independently,

instead of alongside other columns in the same dataset

Visualization Type & Mark Type task

2 class task : line vs. bar

3 class task : scatter vs. line vs. bar
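
One way to read the 2-class and 3-class tasks is as the same labelled examples restricted to different label sets. The sketch below assumes examples are `(features, label)` pairs; that representation and the `build_task` helper are hypothetical, not the paper's code.

```python
TASKS = {
    "2-class": ("line", "bar"),
    "3-class": ("scatter", "line", "bar"),
}

def build_task(examples, classes):
    """Keep examples whose label is in `classes`; map labels to indices."""
    index = {label: i for i, label in enumerate(classes)}
    return [
        (features, index[label])
        for features, label in examples
        if label in index
    ]

examples = [
    ([0.1, 0.2], "line"),
    ([0.3, 0.4], "scatter"),
    ([0.5, 0.6], "bar"),
    ([0.7, 0.8], "pie"),  # outside both tasks, so always dropped
]
two_class = build_task(examples, TASKS["2-class"])
three_class = build_task(examples, TASKS["3-class"])
print(two_class)
print(three_class)
```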

Neural Network and Baseline Models

fully-connected feedforward neural network
with 3 hidden layers, each consisting of 1000 neurons
with ReLU activation functions
split the data 60/20/20 into train/validation/test sets, with 5-fold cross-validation
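
A dependency-free sketch of the architecture described above: a fully connected network with three hidden layers and ReLU activations, plus a 60/20/20 split. Only the forward pass is shown; training, the initialization scheme, and the output layer returning raw scores are assumptions not covered by these notes. The demo uses narrow hidden layers so it runs quickly; pass `hidden=(1000, 1000, 1000)` for the sizes quoted above.

```python
import random

def init_mlp(n_inputs, n_classes, hidden=(1000, 1000, 1000), seed=0):
    """Create (weights, biases) for each layer of a fully connected network."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, n_classes]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = (2.0 / fan_in) ** 0.5  # He-style init (an assumption)
        weights = [[rng.gauss(0.0, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers

def forward(layers, x):
    """Forward pass: ReLU on hidden layers, raw scores on the output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

def split_60_20_20(items, seed=0):
    """Shuffle, then split into train/validation/test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 6 // 10
    n_val = len(items) * 2 // 10
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# Small demo network: 841 inputs, 3 output classes, narrow hidden layers.
demo = init_mlp(n_inputs=841, n_classes=3, hidden=(8, 8, 8))
scores = forward(demo, [1.0] * 841)
train, val, test = split_60_20_20(range(10))
print(len(scores), count_parameters(demo), len(train), len(val), len(test))
```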

Features

D
D+T
D+T+V
D+T+V+N = ALL
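
The four feature sets are cumulative, which makes it possible to observe how much each category adds. A minimal sketch of building them, assuming features are already grouped by category; the feature names in the example are hypothetical.

```python
CATEGORIES = ["D", "T", "V", "N"]

def nested_feature_sets(features_by_category):
    """Return [(label, feature names)] for D, D+T, D+T+V, and D+T+V+N."""
    names = []
    sets = []
    for i, category in enumerate(CATEGORIES):
        # Each set extends the previous one with the next category.
        names = names + features_by_category[category]
        sets.append(("+".join(CATEGORIES[: i + 1]), list(names)))
    return sets

# Hypothetical feature names, one or two per category.
features_by_category = {
    "D": ["length"],
    "T": ["is_quantitative", "is_temporal"],
    "V": ["mean", "num_unique"],
    "N": ["x_in_name"],
}
for label, names in nested_feature_sets(features_by_category):
    print(label, len(names))
```

Training one model per feature set and comparing accuracy would then show the contribution of each category.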
