Data Profiling and Data Cleansing (WS 2014/15) - tele-TASK
Data profiling is the set of activities and processes to determine metadata about a given dataset. Profiling data is an important and frequent task for any IT professional and researcher. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute usually involve multiple columns, such as inclusion dependencies or functional dependencies between columns. More advanced techniques detect approximate or conditional properties of the dataset at hand. The first part of the lecture examines efficient detection methods for these properties.
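As a small illustration of the kinds of results described above, the following Python sketch computes a few single-column statistics (null count, distinct count, most frequent value patterns) and checks one candidate functional dependency by brute force. The toy table, the pattern abstraction, and the naive algorithms are assumptions made for illustration; they are not the efficient detection methods covered in the lecture.

from collections import Counter
import re

rows = [
    {"zip": "14482", "city": "Potsdam", "phone": "0331-5509-0"},
    {"zip": "14482", "city": "Potsdam", "phone": None},
    {"zip": "10115", "city": "Berlin", "phone": "030 123456"},
]

def pattern(value):
    # Abstract a value into a pattern: digits become 9, letters become A.
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

# Single-column statistics: null count, distinct count, frequent patterns.
for column in rows[0]:
    values = [row[column] for row in rows]
    non_null = [v for v in values if v is not None]
    print(column, {
        "nulls": values.count(None),
        "distinct": len(set(non_null)),
        "top_patterns": Counter(pattern(v) for v in non_null).most_common(2),
    })

def holds_fd(rows, lhs, rhs):
    # Brute-force check of the functional dependency lhs -> rhs:
    # every combination of lhs values must map to exactly one rhs value.
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if key in seen and seen[key] != row[rhs]:
            return False
        seen[key] = row[rhs]
    return True

print(holds_fd(rows, ["zip"], "city"))  # True in this toy table

Even this toy example hints at why efficiency matters: the number of candidate dependencies grows rapidly with the number of columns, which is why the lecture focuses on efficient detection methods.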
Data profiling is relevant as a preparatory step for many use cases, such as query optimization, data mining, data integration, and data cleansing. Many of the insights gained during data profiling point to deficiencies in the data. Profiling reveals data errors, such as inconsistent formatting within a column, missing values, or outliers. Profiling results can also be used to measure and monitor the general quality of a dataset, for instance by determining the number of records that do not conform to previously established constraints. The second part of the lecture examines various methods and algorithms to improve data quality, with an emphasis on the many existing duplicate detection approaches.
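To make the idea of duplicate detection concrete, here is a minimal sketch of a naive approach: compare every pair of records with a string similarity measure and flag pairs above a threshold. The sample records, the choice of difflib's SequenceMatcher as similarity measure, and the threshold are illustrative assumptions, not the approaches taught in the lecture.

from difflib import SequenceMatcher
from itertools import combinations

records = [
    "Hasso Plattner Institute, Potsdam",
    "Hasso-Plattner-Institut, Potsdam",
    "Humboldt University, Berlin",
]

def similarity(a, b):
    # Character-based similarity in [0, 1]; an illustrative choice of measure.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # assumed cut-off for flagging a pair as a likely duplicate

for (i, a), (j, b) in combinations(enumerate(records), 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        print(f"records {i} and {j} look like duplicates (similarity {score:.2f})")

Pairwise comparison scales quadratically with the number of records, which is one reason practical duplicate detection approaches combine better similarity measures with techniques for reducing the number of comparisons.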