The first step in the data lifecycle is data
collection.
We collect data about our world in a two-step
process:
First, we observe a phenomenon that exists
in the natural world.
This includes sensing the various qualities
of the things we’re observing and measuring
their quantities as well.
Next, we record this observation using a symbolic
representation.
In data science, this typically involves encoding
the observation in a computer as a binary
representation.
It’s important to note that data do not exist
until there has been both an observation and
a recording of the observation.
Data are created as the result of something
being observed and recorded as a signal or
set of symbols.
Prior to a recording of an observation, there
are no data, just the phenomenon that exists
in the world.
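To make the two steps concrete, here is a minimal sketch in Python of one observation being recorded as a binary representation. The temperature value and the variable names are made up for illustration, and the 8-byte floating-point encoding is just one common choice, not the only way a computer might store a measurement.

```python
import struct

# Hypothetical observation: an air temperature reading in degrees Celsius.
observed_celsius = 21.5

# Recording step: encode the value as an 8-byte big-endian IEEE 754 double.
recorded = struct.pack(">d", observed_celsius)

# The same recording, written out as the individual bits that are stored.
bits = "".join(f"{byte:08b}" for byte in recorded)

# Decoding the bytes recovers the value that was recorded.
decoded = struct.unpack(">d", recorded)[0]

print(recorded.hex())
print(bits)
print(decoded)
```

Until the call to struct.pack runs, the temperature is only a phenomenon we have observed; the bytes in recorded are the data.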
There are several ways we can observe our
world to collect data:
We can use sensors to record measurements
of observable phenomena.
For example, we can record observations of
the ambient air temperature using a digital
thermometer.
We can enter data into a transactional system
to record business transactions.
For example, we can create records for new
customers, record sales transactions, and
create medical records.
We can also record human interactions with
computer systems.
For example, we can record website visits,
advertisement clicks, and time spent browsing
a webpage.
And we can run experiments in order to generate
new data in controlled environments.
For example, we can run clinical studies to
determine the effectiveness of certain medications.
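As a rough sketch of how observations from these different sources might be recorded in a common form, the Python example below stores each one as a timestamped record. The Observation class, the record function, the source labels, and all of the values are hypothetical, chosen only to mirror the examples above.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class Observation:
    source: str       # how the observation was made (hypothetical labels)
    name: str         # what was observed
    value: float      # the measured quantity
    unit: str         # unit of measurement
    recorded_at: str  # when the observation was recorded (UTC, ISO 8601)


def record(source: str, name: str, value: float, unit: str) -> Observation:
    """Turn an observation into a record by attaching a timestamp."""
    timestamp = datetime.now(timezone.utc).isoformat()
    return Observation(source, name, value, unit, timestamp)


# One made-up observation for three of the collection methods above.
log = [
    record("sensor", "air_temperature", 21.5, "celsius"),
    record("transaction", "sale_amount", 49.99, "usd"),
    record("interaction", "time_on_page", 37.0, "seconds"),
]

# Serialize each record as one line of JSON, a common storage format.
lines = [json.dumps(asdict(obs)) for obs in log]
for line in lines:
    print(line)
```

Each line of output is a set of symbols representing one observation, so the log only contains data for the things we chose to observe and record.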
High-quality data begin with data collection,
so it’s important to know how to properly
observe phenomena and record those observations.