University of Washington open course

Okay, so I want to spend a little time on the term "Big Data," and I'm not too concerned with any sort of technical definition of the term, because one probably doesn't exist. But I want to arm you with some of the language that people use when they describe Big Data, so you know, you can speak intelligently about it when asked.
Okay?
So probably the main thing to recognize is this notion of the three V's of Big Data, which are volume, velocity, and variety.
And we talked a little bit about this in a previous segment. So just to repeat: volume is the size of the data. You measure it in bytes, or number of rows, or number of objects, or what have you, sort of the vertical dimension of the data.
Velocity, as I'll describe it here, is the latency of the data processing relative to the demand for interactivity, and that's maybe a mouthful. But what I mean by that is, you know, how fast is the data coming in relative to how fast it needs to be consumed. And there are a lot of applications for which interactive response times are increasingly important, if not strictly required.
Okay?
And so when this becomes the bottleneck, when this becomes the challenge, then velocity starts to become pretty relevant.
And the one that I think is really pretty interesting, and is near and dear to my heart and to my research, is this notion of variety. And so here the problem is, you know, an increasing number of different data sources are being applied to any particular task.
So you need to pull out, you know, ASCII files, as well as download data from the web, as well as pull data out of some database, as well as use some of those SQL systems, and so on. And the integration of all these data sources is a pretty significant problem, and can end up occupying a lot of your time.
So I made this point a couple of segments ago, about researchers who spend nearly 90% of their time, quote, "handling data." This is where a lot of that time is going: this notion of variety.
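As a rough sketch of what this integration work can look like in code (the column names, table, and sample values below are all hypothetical, and a real task would also involve web downloads and format cleanup), you might coerce an ASCII/CSV export and rows from a SQL database into one uniform set of records:

```python
import csv
import io
import sqlite3

def rows_from_csv(text):
    """Parse an ASCII/CSV export into (site, temp_c) records."""
    return [(r["site"], float(r["temp_c"])) for r in csv.DictReader(io.StringIO(text))]

def rows_from_db(conn):
    """Pull the same logical records out of a SQL database."""
    return conn.execute("SELECT site, temp_c FROM readings").fetchall()

# Hypothetical ASCII file contents (in practice, read from a file on disk).
csv_text = "site,temp_c\nA,11.5\nB,9.8\n"

# Hypothetical database (in practice, a connection to a real server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [("C", 10.2), ("D", 12.1)])

# The integration step: both sources end up in one uniform schema.
combined = rows_from_csv(csv_text) + rows_from_db(conn)
print(combined)  # [('A', 11.5), ('B', 9.8), ('C', 10.2), ('D', 12.1)]
```

The code itself is short; in practice the time goes into discovering that each source spells "site" differently, uses different units, and so on.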
Okay.
So all three of these are relevant in performing sort of data science tasks. Alright, let me give you another notion, and I'm going to go back to using science examples.
And you've seen some of these before, but if you sort of make a plot with number of bytes on the Y-axis versus number of data sources on the X-axis (maybe columns of data in a single table, or columns of data across multiple tables, or a number of distinct data sources), you can sort of map out different fields of study or different problems and see where they lie.
And so typically astronomy has been challenged by the sheer volume of data. So that's right about here: we're high on the Y-axis, but, you know, the number of actual sources in astronomy is not too high. There are the telescopes, there are the spectral imagers, and then there are the simulations of the galaxy, and that's relatively few.
In, say, the ocean sciences, and certainly in the life sciences, although I only show you, you know, one example here, the variety is really more of the challenge. The actual sheer scale is not as high as, you know, the hundreds of petabytes that could be generated by a telescope project like a large survey telescope, but the number of different types of instruments you can use to acquire data is large and ever-growing, right?
So you have the glider systems that will go out for months at a time and kind of porpoise through the water. You have autonomous underwater vehicles that are more for short-term missions. You know, there are oceanographic cruises where they deploy the conductivity-temperature-depth instruments that can take profiles of the water, right? So this is, you know, at a fixed X and Y with a varying Z and a varying T, that is, at a varying depth and a varying time, while the gliders are sort of varying in all four dimensions.
You have the simulations, which are probably one of the largest sources of information, right? So these can be at various scales: sort of on the order of the entire northern hemisphere, or the whole eastern Pacific, or there could be models of a particular bay, or an inlet or estuary connected to a river or connected to the open ocean, or a much smaller-scale thing. So there's a lot of diversity there.
And I say "stations" to mean the sort of fixed stations, where a particular set of sensors is deployed to one location and just measures across time.
ADCP stands for Acoustic Doppler Current Profiler, where they're using sound waves and recording the time that the sound waves take to bounce off particulate matter in the ocean, and from that they can measure velocity. And so this gives you an entire profile of the velocities in the ocean. And you can mount these on the sea floor pointing upwards, or you can mount them on the bottom of a boat pointing downwards, and so on.
And then there are satellite images that measure sort of the color and wave breaking as well.
Okay.
So fine.
So just a little more on the term Big Data, a quote. The notion that Mike Franklin at UC Berkeley uses, which I like, is that, you know, Big Data is really relative, right: "it's any data that's expensive to manage and hard to extract value from."
So it's not so much about a particular cutoff, you know, what makes it big: is petabyte scale big, versus terabyte scale small, or gigabyte scale very small since it fits in memory on your machine? You know, not necessarily. It depends on what you're trying to do with it, and it depends on what sort of resources and infrastructure you have to bring to bear on the problem.
And so in some sense, difficult data is perhaps what Big Data really means. It's not so much about being really big; it's about being challenging, okay? This is really important to remember: big is relative.
So let me give you a little bit of the history of the term Big Data. The earliest notion I could find was from Erik Larson in 1989, in Harper's Magazine, in a piece that eventually went into a book, where he says: "The keepers of Big Data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended." So his point was not really about technology at all. It was just the notion that data is being collected for one purpose and being reused for another.
Which is a theme that I mentioned in the very first segment of this course, and that we'll come back to over and over again. And so I think he had it right, in the sense that his real point was, you know, about consumer private data starting to be commoditized. Which was absolutely true, and fairly prescient at the time, since it's become a big issue now. And it's been especially impressive given that this predates the rise of the internet, and already sort of foreshadows very topical issues in Big Data, the ethics and privacy and sensitivity and so forth that we'll talk a little bit about. But this isn't quite what we mean by Big Data nowadays, typically, because it didn't have that technology aspect to it. It didn't talk about the challenge of actually managing these datasets, alright?
So another point of reference: more reasonably, reports from the consulting firms get credit for this notion of the three V's; that's really the original source. This was a report from Gartner in 2001, written by a guy named Doug Laney. And so he talked about volume, velocity, and variety, which we've covered, but let me just give you a chance to look at the quotes.
You know, on volume, he's really talking about sort of business-to-business. If you think about 2001, this is around the dot-com boom. And so everyone was trying to figure out what this new era of technology was going to get them, what the internet was really going to give them beyond just sort of putting up a webpage and serving it out to your customers. How were you going to be able to interact with your supply chain or your vendors and so on? Okay, and so that's what he means by this notion of e-channels.
But, you know, "up to 10x the quantity of data about an individual transaction may be collected." You know, absolutely true: this data exhaust, a point we've made a couple of times, is giving rise to a larger scale of data being collected.
You know, on velocity: e-commerce has increased the point-of-interaction speed, right? So this is that need for interactivity. It didn't used to be so required, but as the velocity of all business and all transactions has sort of increased, so have the constraints on the infrastructure used to process it.
And on variety, I like this one a lot. "Through 2003/2004," right, so he's been sort of fairly conservative about how far out he wanted to predict, "no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics." So this is great: you could have said this through 2015 and been arguably correct; this problem has not gone away.
So, another point in the history of this term Big Data: there was a series of talks, a lot of work by John Mashey, who was formerly the chief scientist at SGI, who would talk about Big Data being the next wave of "infrastress." And what he meant by infrastress was: what's really going to drive the technology forward, where we're going to feel the pain.
And his point was that the I/O interfaces were where it was tough. So in particular, disk capacities were growing incredibly fast, and still are, while the latencies are not keeping pace, right? So you can go down to a local store and buy a 3-terabyte drive for probably $200, but the rate at which you can pull data off that drive is essentially the same as it has been for many, many years. And so now it takes you hours to actually read every byte of data that you stored on that disk. And this is a problem, because it limits the actual analysis you can do: we can keep all the data, and that's really cheap, but we cannot do anything with it because the pipe is so small.
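To put rough numbers on that bottleneck (the capacity and throughput figures below are assumptions, roughly typical for a consumer spinning drive, not measurements from any particular device):

```python
# Back-of-the-envelope: time to read every byte off a commodity disk.
capacity_bytes = 3 * 10**12      # assumed: a 3 TB drive
throughput_bps = 150 * 10**6     # assumed: ~150 MB/s sustained sequential read

seconds = capacity_bytes / throughput_bps
hours = seconds / 3600
print(f"{hours:.1f} hours to scan the full disk")  # -> 5.6 hours
```

So even under generous sequential-read assumptions, a single end-to-end scan of the disk takes on the order of half a day, which is exactly the capacity-versus-bandwidth gap being described.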