VizML: A Machine Learning Approach to Visualization Recommendation (Paper Review)

Visualization Recommender System

lower the barrier to exploring basic visualizations

  • by automatically generating results
  • for analysts to search and select, not manually specify

Abstract



machine learning-based approach

  • learns visualization design choices
  • from a large corpus of datasets
  • from associated visualisations

by…

  1. identify five key design choices
  2. train neural networks to predict them using one million dataset-visualization pairs

evaluation

for : generalizability and uncertainty

by : benchmark with a crowdsourced test set

result : comparable to human performance


Problem Formulation

Data visualization communicates information by representing data with visual elements


representations are specified using…

  • encodings : map from data to the retinal properties of graphical marks

  • retinal properties

    • e.g. position, length, colour
  • graphical marks

    • e.g. point, line, rectangle

e.g.


To create a scatterplot showing the relationship between MPG and Hp,

  • encode each pair of data points as the position of a circle on a 2D plane

  • while specifying other retinal properties such as size and colour


Method

Vega-Lite

  • selecting mark type and fields to be encoded along the x- and y-axes

Tableau

  • placing the 2 columns onto the respective column and row shelves
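
As a rough illustration of the Vega-Lite route, the scatterplot can be written as a plain JSON specification that names a mark type and the fields mapped to the x- and y-axes. The sketch below builds such a spec as a Python dictionary; the inline records are made up for illustration, and only the column names MPG and Hp come from the example above.

```python
import json

# Hypothetical inline records; only the column names "MPG" and "Hp"
# come from the example in the text.
cars = [
    {"MPG": 21.0, "Hp": 110},
    {"MPG": 22.8, "Hp": 93},
    {"MPG": 18.7, "Hp": 175},
]

# A Vega-Lite specification is plain JSON: a mark type plus an encoding
# that maps data fields to visual channels such as x and y.
scatter_spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": cars},
    "mark": "point",
    "encoding": {
        "x": {"field": "Hp", "type": "quantitative"},
        "y": {"field": "MPG", "type": "quantitative"},
    },
}

# Serialize the spec so it can be pasted into any Vega-Lite renderer.
print(json.dumps(scatter_spec, indent=2))
```

The two design choices named above (mark type and axis fields) are exactly the `mark` and `encoding` entries; everything else in the spec is scaffolding.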



  1. formulate basic visualization of a dataset d as a set of interrelated design choices C = {c}

    set of choices that result in valid visualizations

    === a subset of the space of all possible choices

  2. automatically suggest a subset of design choices C_rec ⊆ C that maximize effectiveness

    effectiveness can be defined by informational or emotive measures

    • informational : e.g. efficiency, accuracy, memorability
    • emotive : e.g. engagement

Rule-based Visualization Recommender Systems


ML-based Visualization Recommender Systems

DeepEye
Data2Vis
Draco-Learn
VizML
  • Pros

    • learning task : learn to…

      predict design choices

      • easier to quantitatively validate
      • easier to provide interpretable measures of feature importance
      • easily integrated into visualization systems

      DeepEye : classify and rank visualizations

      Data2Vis : end-to-end generation model

      Draco-Learn : learns soft-constraint weights

    • data quantity : training corpus is…

      orders of magnitude larger than DeepEye and Data2Vis

      • permits the use of large feature sets that capture many aspects of a dataset
      • permits the use of high-capacity models such as deep neural networks
    • data quality : the datasets used are…

      extremely diverse in shape, structure, and distribution

      the result of real visual analysis by analysts on their own datasets

      others

      • few datasets used to train
      • visualisations are generated by rule-based systems
      • evaluated under controlled settings

  • Cons
    • only recommends visual encodings, not data queries
    • does not provide ★ an application that employs the visualization model ★

Data

Feature Extraction

distinguish the feature categories to…

  1. organize how to create and interpret features
  2. observe the contribution of different types of features
  3. account for some categories being less generalizable than others

  • Dimensions D : the number of rows in a column
  • Types T : whether a column is categorical, temporal, or quantitative
  • Values V : the statistical and structural properties of the values within a column
  • Names N : the column name

👉 ordered D → T → V → N by how biased the features are expected to be towards the corpus
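
A minimal sketch of what single-column feature extraction could look like across the four categories. The specific features (`num_unique`, `entropy`, `name_has_digit`, and so on) and the type-inference rule are illustrative assumptions, not the paper's feature list, and temporal detection is left out for brevity.

```python
import math
from collections import Counter

def infer_type(values):
    """Rough type guess (assumption): quantitative if all values are numbers."""
    if values and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        return "quantitative"
    return "categorical"

def entropy(values):
    """Shannon entropy (in bits) of the value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def column_features(name, values):
    """Return one feature dict for a column, tagged by category in comments."""
    present = [v for v in values if v is not None]
    col_type = infer_type(present)
    feats = {
        "length": len(values),                                # D
        "type": col_type,                                     # T
        "num_unique": len(set(present)),                      # V
        "has_missing": len(present) < len(values),            # V
        "entropy": entropy(present),                          # V
        "name_length": len(name),                             # N
        "name_has_digit": any(ch.isdigit() for ch in name),   # N
    }
    if col_type == "quantitative":
        feats["mean"] = sum(present) / len(present)           # V
    return feats

print(column_features("MPG", [21.0, 22.8, None, 21.0]))
```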


Method
  1. describe each pair of columns with 30 pairwise-column features

  2. divide into 2 categories : Values and Names

    many pairwise-column features depend on the individual column types

  3. create 841 dataset-level features by aggregating single- and pairwise-column features using 16 aggregation functions

    aggregation function : convert single- and pairwise-column features into scalar values

    1. train separate models per number of columns
    2. include column features with padding
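
To make the aggregation step concrete, here is a small sketch that turns a variable number of per-column feature dicts into a fixed-length set of dataset-level scalars. The four aggregators shown are an illustrative subset (the paper reports 16), and the feature names are hypothetical.

```python
import statistics

# Illustrative subset of aggregation functions (assumption); the paper uses 16.
AGGREGATORS = {
    "mean": statistics.mean,
    "min": min,
    "max": max,
    "range": lambda xs: max(xs) - min(xs),
}

def aggregate(column_feature_rows, feature_names):
    """Collapse per-column features into one scalar per (feature, aggregator)."""
    dataset_features = {}
    for feat in feature_names:
        values = [row[feat] for row in column_feature_rows if feat in row]
        for agg_name, agg in AGGREGATORS.items():
            key = f"{feat}_{agg_name}"
            dataset_features[key] = agg(values) if values else None
    return dataset_features

# Datasets with different numbers of columns map to the same feature keys.
two_cols = [{"entropy": 1.0, "num_unique": 3}, {"entropy": 2.0, "num_unique": 5}]
three_cols = two_cols + [{"entropy": 0.5, "num_unique": 2}]
names = ["entropy", "num_unique"]
print(aggregate(two_cols, names))
print(aggregate(three_cols, names))
```

The point of the design is that the output length depends only on the number of features and aggregators, never on the number of columns in the dataset.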

Design Choice Extraction

extract an analyst’s design choices by…

parsing the traces, since each visualization consists of traces that associate collections of data with visual elements


Encoding-level Design Choices

  • mark type : scatter, etc.
  • column encoding : which column is represented on which axis, whether or not an X or Y column is the single column represented along that axis

by aggregating these choices…

👇

Visualization-level Design Choices

  1. describe the type shared among all traces

  2. determine whether the visualization has a shared axis
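
The extraction described above can be sketched as follows. The trace layout (`type`, `x`, `y` keys holding a mark type and column names) is a simplified, hypothetical stand-in for real trace objects, and the rule used for "shared axis" (all traces use the same x column or the same y column) is an assumption.

```python
def extract_design_choices(traces):
    """Derive encoding-level and visualization-level choices from traces.

    Each trace is assumed to look like {"type": "line", "x": "year", "y": "sales"}.
    """
    x_cols = {t["x"] for t in traces}
    y_cols = {t["y"] for t in traces}
    types = {t["type"] for t in traces}

    # Encoding level: one record per (trace, axis) pairing of a column.
    encoding_level = []
    for t in traces:
        for axis, col, axis_cols in (("x", t["x"], x_cols), ("y", t["y"], y_cols)):
            encoding_level.append({
                "column": col,
                "mark_type": t["type"],
                "axis": axis,
                # True when this column is the only one shown on that axis.
                "is_single_on_axis": axis_cols == {col},
            })

    # Visualization level: aggregate the per-trace choices.
    visualization_level = {
        "visualization_type": next(iter(types)) if len(types) == 1 else None,
        "has_shared_axis": len(traces) > 1
        and (len(x_cols) == 1 or len(y_cols) == 1),
    }
    return encoding_level, visualization_level

traces = [
    {"type": "line", "x": "year", "y": "sales"},
    {"type": "line", "x": "year", "y": "profit"},
]
enc, viz = extract_design_choices(traces)
print(viz)
```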


Methods

Feature Processing

  1. apply one-hot encoding to categorical features

  2. set numeric values above the 99th percentile or below the 1st percentile to the respective cut-offs

  3. impute missing categorical values using the mode of non-missing values, and missing numeric values with the mean of non-missing values

  4. remove the mean of numeric fields and scale to unit variance

  5. randomly remove datasets that are exact duplicates of each other

    remove all but one randomly selected dataset per user

    👉 remove bias towards more prolific users
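
Steps 1 to 4 above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: missing values are represented as `None`, percentiles use the "inclusive" interpolation method, and categorical imputation is done before one-hot encoding for convenience.

```python
import statistics

def process_numeric(values):
    """Clip to the 1st/99th percentiles, mean-impute, then standardize."""
    present = [v for v in values if v is not None]
    cuts = statistics.quantiles(present, n=100, method="inclusive")
    low, high = cuts[0], cuts[-1]  # 1st and 99th percentile cut-offs
    clipped = [None if v is None else min(max(v, low), high) for v in values]
    mean = statistics.fmean(v for v in clipped if v is not None)
    filled = [mean if v is None else v for v in clipped]
    std = statistics.pstdev(filled)
    # Remove the mean and scale to unit variance (all zeros if constant).
    return [(v - mean) / std if std else 0.0 for v in filled]

def process_categorical(values):
    """Impute missing values with the mode, then one-hot encode."""
    present = [v for v in values if v is not None]
    mode = statistics.mode(present)
    filled = [mode if v is None else v for v in values]
    categories = sorted(set(filled))
    return [[1 if v == c else 0 for c in categories] for v in filled]

print(process_numeric([1.0, 2.0, None, 3.0, 100.0]))
print(process_categorical(["a", None, "b", "a"]))
```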


Prediction Tasks

Two visualization-level prediction tasks

use dataset-level features


to predict visualization-level design choices


Three encoding-level prediction tasks

use features about individual columns


to predict how they are visually encoded

consider each column independently,

instead of alongside other columns in the same dataset



Visualization Type & Mark Type task

2 class task : line vs. bar

3 class task : scatter vs. line vs. bar


Neural Network and Baseline Models

  • fully-connected feedforward neural network

  • with 3 hidden layers, each consisting of 1000 neurons

  • with ReLU activation functions

  • split the data 60/20/20 (train/validation/test sets) 5 times using 5-fold cross-validation
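
One way to read the 60/20/20 split with 5-fold cross-validation is to cut the data into five equal folds and rotate which fold serves as the test set and which as the validation set, leaving three folds for training. The rotation scheme below is an assumption about how the two statements fit together, not a detail confirmed by the notes.

```python
import random

def five_way_splits(n_items, n_folds=5, seed=0):
    """Yield (train, val, test) index lists, rotating folds each round."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds
        test = folds[k]
        val = folds[val_k]
        train = [
            i
            for j, fold in enumerate(folds)
            if j not in (k, val_k)
            for i in fold
        ]
        yield train, val, test

for train, val, test in five_way_splits(100):
    print(len(train), len(val), len(test))
```

With five folds, each round uses three folds (60%) for training and one fold (20%) each for validation and testing, and every item lands in the test set exactly once across the five rounds.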


Features

  1. D
  2. D+T
  3. D+T+V
  4. D+T+V+N = ALL
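
The four feature sets are nested: each one adds a category to the previous set. A small sketch of building them, with hypothetical feature names standing in for the real ones:

```python
CATEGORY_ORDER = ["D", "T", "V", "N"]

# Hypothetical feature names per category, for illustration only.
FEATURES_BY_CATEGORY = {
    "D": ["length"],
    "T": ["is_categorical", "is_temporal", "is_quantitative"],
    "V": ["entropy", "num_unique"],
    "N": ["name_length"],
}

def nested_feature_sets():
    """Build cumulative feature sets: D, D+T, D+T+V, D+T+V+N."""
    feature_sets = {}
    labels, names = [], []
    for category in CATEGORY_ORDER:
        labels.append(category)
        names.extend(FEATURES_BY_CATEGORY[category])
        feature_sets["+".join(labels)] = list(names)
    return feature_sets

for label, names in nested_feature_sets().items():
    print(label, len(names))
```

Training one model per set makes it possible to see how much each added category contributes to predictive performance.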