The core areas of my research are the analysis of multivariate statistical data and the
foundations of statistics and data analysis. The analysis of multivariate data is obviously a
wide area, as probably the best part of statistics is concerned with multivariate data. Here
are some key topics in which I was, am and will be active.
My diploma thesis was on robust linear regression, investigating various versions of robust
regression estimators theoretically. This led to my first journal publication and
a later collaboration with robustness pioneer Frank Hampel.
Cluster analysis has always been a major part of my research. In my PhD thesis I invented
a new approach for clustering data with different linear regression relationships, fixed point
clustering, that was later also applied to clustering multivariate data with outliers. I developed
asymptotic and robustness theory for fixed point clusters. This got me interested in the use of mixture models for clustering, and led to various
contributions. I investigated theoretical foundations of mixture models, namely identifiability for regression mixtures and robustness theory. Methodological
contributions in mixture modelling are a method to merge Gaussian mixture components, because often what is to be interpreted as “clusters” can be modelled by more
than a single Gaussian, and a method for outlier-robust mixture clustering (with P. Coretto). Work on estimating the number of clusters in such a robust mixture
setting is in preparation.
There are many new cluster analysis methods proposed every year, often with somewhat
dubious justifications, and at some point I decided that I wanted to focus more on research that
helps the user with the necessary decisions when carrying out a cluster analysis, including
the comparison of various clustering methods and cluster validation. I started to think
more deeply about the requirements of cluster analysis and the definition of the cluster
analysis problem (often not given in the literature), leading to my paper “What are the true
clusters?” and my closing chapter on clustering strategy in
the Handbook of Cluster Analysis, of which I am the first editor and sole author or co-author of
six chapters. In the area of cluster validation I invented a method for cluster-wise
stability assessment, some new methods for cluster visualisation,
and a bootstrap-based approach to testing the existence of a clustering and the number of clusters.
A fascinating and important issue in cluster validation is the estimation of the number of
clusters, on which I have made methodological and conceptual contributions.
Recently I have been working on a multicriterion characterisation of cluster validation that can
be adapted to different users' needs and can also be used to treat the number of clusters problem.
As a member of the IFCS Task Force for Cluster Benchmarking I am involved in research
on the systematic comparison of clustering methods and I am currently preparing
a publication on simulation studies to compare clustering methods. Earlier work includes
some sophisticated simulation studies with original elements.
There are various pre-processing steps in cluster analysis that have a potentially strong
impact on the final result, and that are often ignored in the literature. In particular, I have
been interested in and made contributions to the construction of dissimilarity measures.
An intriguing idea by Cinzia Viroli has made me come back to the design of cluster
analysis methods, to which I currently contribute the asymptotic theory. I think that my philosophical background gives me a quite unique understanding of the cluster analysis problem that has
given my work a depth that is otherwise difficult to find in the field, and it also leads to
ideas for innovative original approaches. I am also somewhat frustrated by the often very
superficial way in which new methods are claimed to be “superior” or “solutions to long
standing problems” and am therefore passionate about both theoretical ways to investigate
and compare clustering methods, and high quality benchmarking.
Categorical, mixed type, and ordinal data
Categorical, mixed type and ordinal data have become an area of interest of mine in cluster
analysis, and more recently in a more general fashion. In particular, I
believe that ordinal data are not very well understood, and standard analyses of mixed
type data can also raise issues that have not been well investigated. I plan to do some future research in this
area, particularly investigating to what extent methods for ordinal data can be expressed as
methods for interval data based on scores, and in what circumstances this is not desirable.
The handling and potential unification through scaling of mixed type data is a major issue
in data preprocessing, see above.
Supervised classification is connected to clustering and is another core area in multivariate
analysis. I have collaborated with Cinzia Viroli to provide theory underpinning a new classification approach. Earlier work concerned dimension reduction and visualisation for
classification. One potential direction in this work is the development of classification techniques that are based on having a probability model for one “homogeneous class” whereas
the rest of the data may be heterogeneous and may not follow a simple model. This is a
potential topic for a PhD thesis or my own future research. Some of my applied work also focuses on supervised classification.
I believe that data visualisation plays a very important role in statistics; many formal
methods can lead to artefacts that can only be discovered by appropriate data visualisation.
Also data visualisation is far more informative regarding model assumptions than any single
formal method. I am always interested in new revealing ways of looking at the data and how
to interpret visualisations. Work of mine concerns methods to assess the separation of
a cluster or class from the rest of the data and visualisation of the separation between pairs of
clusters. A number of publications use innovative visual displays of data, particularly when
it comes to simulation results. A current original project
is the use of cluster analysis and various visualisation techniques to explore the instability
of regression model selection with some medical applications. I am also working on
an essay on data visualisation from a philosophical point of view.
Modelling and applications of statistics
I have a wide range of collaborative work with scientists who apply statistics. Sometimes
this is very close to my main research areas such as clustering and classification. However, there is also creative work in
other areas such as modelling and parametric bootstrap testing in spatial statistics, information representation and scoring in archaeology
and pattern recognition, dimension reduction and general statistical modelling.
Furthermore there is expert advice on statistical techniques such as
random effects and multilevel models and item response theory.
Most of my applied work was prompted by requests of scientists, for which I have
always been open. From time to time in Hamburg and London I was officially responsible
for statistical advisory services, and I have advised more than 100 clients. I will continue to be open to applied collaborations and simply
giving advice, which is always a very rewarding experience. Areas of application are not
restricted; I have worked in many such areas (musicology, psychology, biogeography,
ecology, medicine, astronomy, genetics, chemistry, archaeology, market research, pattern
recognition and web-design).
Some collaborations are ongoing. In particular I work with my long standing collaborator
Bernhard Hausdorf (director of Hamburg’s Museum for Zoology) on new publications about
the effect of spatial distance on genetic variety and species delimitation, and on the analysis
of phylogenetic trees.
Foundations of statistics and data analysis
For as long as I can remember, I have had a strong interest in the foundations of statistics and data
analysis and the deeper meaning of what we are doing (or, respectively, whether it makes
sense at all). As a student I came into contact with constructivist philosophy, which left
a strong impression on me, and particularly on my thinking regarding the foundations of
statistics. I have continued since then to take part in philosophical reading groups and have
also established contact with philosophers of science (the philosophers Hasok Chang and Deborah
Mayo list me in their acknowledgements for valuable discussions in their recent books).
I have given presentations at philosophical conferences and published two book reviews of books
on foundations of statistics by philosophers. I think that my familiarity with modern philosophy of science puts me in a unique position to contribute to the current
discussions on the foundations of statistics and data analysis. In fact I have developed my
own philosophy of statistics over the years, which I hope to put together in book form in the
near future. Currently only parts of these ideas are published. I particularly highlight two
publications. When thinking about the literature on the foundations of statistics, particularly
the frequentist/Bayes divide, it occurred to me that the general role of mathematical models
is taken for granted in a way that is problematic. I therefore developed a constructivist
philosophy of mathematical modelling, published in “Foundations of Science”.
This plays a strong role in my already mentioned work on the foundations of clustering, but also in the comprehensive treatment (together with A. Gelman) of the (mis-)use of the concepts of objectivity and subjectivity in decision making in statistics, including
a discussion of the major streams in the foundations of statistics.
Another thing that is exemplary in that paper is how my philosophical interest (as well
as Gelman’s) has strong implications for the practice and application of statistics. I actually
believe that my philosophical thinking allowed me to become aware of a number of issues
and possibilities in applied statistics (such as the lack of correspondence of Gaussian mixture
components to “interpretative clusters”, principles for the construction
of dissimilarities, or the analysis of the implications of choosing loss
functions, particularly regarding robustness and outliers).