University of Washington open course

Okay, so I want to spend a little time on the term "Big Data," and I'm not too concerned with any sort of technical definition of the term, because one probably doesn't exist. But I want to arm you with some of the language that people use when they describe Big Data, so you know, you can speak intelligently about it when asked.
Okay?
So probably the main thing to recognize is this notion of the three V's of Big Data, which are volume, velocity, and variety.
And we talked a little bit about this in a previous segment. So just to repeat: volume is the size of the data. You measure it in bytes, or number of rows, or number of objects, or what have you, sort of the vertical dimension of the data.
Velocity, as I'll describe it here, is the latency of the data processing relative to the demand for interactivity, and that's maybe a mouthful. But what I mean by that is, you know, how fast is the data coming in relative to how fast it needs to be consumed. And there are a lot of applications for which interactive response times are increasingly important, if not strictly required.
Okay?
And so when this becomes the bottleneck, when this becomes the challenge, then velocity starts to become pretty relevant.
And the one that I think is really pretty interesting, and is near and dear to my heart and to my research, is this notion of variety. And so here the problem is, you know, an increasing number of different data sources are being applied to any particular task.
So you need to pull out, you know, ASCII files, as well as download data from the web, as well as pull data out of some database, as well as use some of those SQL systems, and so on. And the integration of all these data sources is a pretty significant problem, and can end up occupying a lot of your time.
So I made this point a couple of segments ago, about researchers who spend nearly 90% of their time, quote, "handling data." This is where a lot of that time is going: this notion of variety.
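As a rough sketch of what this integration work can look like in code (the column names, table, and sample values below are all hypothetical, and a real task would also involve web downloads and format cleanup), you might coerce an ASCII/CSV export and rows from a SQL database into one uniform set of records:

```python
import csv
import io
import sqlite3

def rows_from_csv(text):
    """Parse an ASCII/CSV export into (site, temp_c) records."""
    return [(r["site"], float(r["temp_c"])) for r in csv.DictReader(io.StringIO(text))]

def rows_from_db(conn):
    """Pull the same logical records out of a SQL database."""
    return conn.execute("SELECT site, temp_c FROM readings").fetchall()

# Hypothetical ASCII file contents (in practice, read from a file on disk).
csv_text = "site,temp_c\nA,11.5\nB,9.8\n"

# Hypothetical database (in practice, a connection to a real server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [("C", 10.2), ("D", 12.1)])

# The integration step: both sources end up in one uniform schema.
combined = rows_from_csv(csv_text) + rows_from_db(conn)
print(combined)  # [('A', 11.5), ('B', 9.8), ('C', 10.2), ('D', 12.1)]
```

The code itself is short; in practice the time goes into discovering that each source spells "site" differently, uses different units, and so on.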
Okay.
So all three of these are relevant in performing sort of data science tasks. Alright, let me give you another notion, and I'm going to go back to using science examples.
And you've seen some of these before, but if you sort of make a plot with number of bytes on the Y-axis versus number of data sources on the X-axis (maybe columns of data in a single table, or columns of data across multiple tables, or a number of distinct data sources), you can sort of map out different fields of study or different problems and see where they lie.
And so typically astronomy has been challenged by the sheer volume of data. So that's right about here: we're high on the Y-axis, but, you know, the number of actual sources in astronomy is not too high. There are the telescopes, there are the spectral imagers, and then there are the simulations of the galaxy, and that's relatively few.
In, say, the ocean sciences, and certainly in the life sciences, although I only show you, you know, one example here, the variety is really more of the challenge. The actual sheer scale is not as high as, you know, the hundreds of petabytes that could be generated by a telescope project like a large survey telescope, but the number of different types of instruments you can use to acquire data is large and ever-growing, right?
So you have the glider systems that will go out for months at a time and kind of porpoise through the water. You have autonomous underwater vehicles that are more for short-term missions. You know, there are oceanographic cruises where they deploy the conductivity-temperature-depth instruments that can take profiles of the water, right? So this is, you know, at a fixed X and Y with a varying Z and a varying T, that is, at a varying depth and a varying time, while the gliders are sort of varying in all four dimensions.
You have the simulations, which are probably one of the largest sources of information, right? So these can be at various scales: sort of on the order of the entire northern hemisphere, or the whole eastern Pacific, or there could be models of a particular bay, or an inlet or estuary connected to a river or connected to the open ocean, or a much smaller-scale thing. So there's a lot of diversity there.
And I say "stations" to mean the sort of fixed stations, where a particular set of sensors is deployed to one location and just measures across time.
ADCP stands for Acoustic Doppler Current Profiler, where they're using sound waves and recording the time that the sound waves take to bounce off particulate matter in the ocean, and from that they can measure velocity. And so this gives you an entire profile of the velocities in the ocean. And you can mount these on the sea floor pointing upwards, or you can mount them on the bottom of a boat pointing downwards, and so on.
And then there are satellite images that measure sort of the color and wave breaking as well.
Okay.
So fine.
So just a little more on the term Big Data, a quote. The notion that Mike Franklin at UC Berkeley uses, which I like, is that, you know, Big Data is really relative, right: "it's any data that's expensive to manage and hard to extract value from."
So it's not so much about a particular cutoff, you know, what makes it big: is petabyte scale big, versus terabyte scale small, or gigabyte scale very small since it fits in memory on your machine? You know, not necessarily. It depends on what you're trying to do with it, and it depends on what sort of resources and infrastructure you have to bring to bear on the problem.
And so in some sense, difficult data is perhaps what Big Data really means. It's not so much about being really big; it's about being challenging, okay? This is really important to remember: big is relative.
So let me give you a little bit of the history of the term Big Data. The earliest notion I could find was from Erik Larson in 1989, in Harper's Magazine, in a piece that eventually went into a book, where he says: "The keepers of Big Data say they do it for the consumer's benefit, but data have a way of being used for purposes other than originally intended." So his point was not really about technology at all. It was just the notion that data is being collected for one purpose and being reused for another.
Which is a theme that I mentioned in the very first segment of this course, and that we'll come back to over and over again. And so I think he had it right, in the sense that his real point was, you know, about consumer private data starting to be commoditized. Which was absolutely true, and fairly prescient at the time, since it's become a big issue now. And it's been especially impressive given that this predates the rise of the internet, and already sort of foreshadows very topical issues in Big Data, the ethics and privacy and sensitivity and so forth that we'll talk a little bit about. But this isn't quite what we mean by Big Data nowadays, typically, because it didn't have that technology aspect to it. It didn't talk about the challenge of actually managing these datasets, alright?
So another point of reference: more reasonably, reports from the consulting firms get credit for this notion of the three V's; that's really the original source. This was a report from Gartner in 2001, written by a guy named Doug Laney. And so he talked about volume, velocity, and variety, which we've covered, but let me just give you a chance to look at the quotes.
You know, on volume, he's really talking about sort of business-to-business. If you think about 2001, this is around the dot-com boom. And so everyone was trying to figure out what this new era of technology was going to get them, what the internet was really going to give them beyond just sort of putting up a webpage and serving it out to your customers. How were you going to be able to interact with your supply chain or your vendors and so on? Okay, and so that's what he means by this notion of e-channels.
But, you know, "up to 10x the quantity of data about an individual transaction may be collected." You know, absolutely true: this data exhaust, a point we've made a couple of times, is giving rise to a larger scale of data being collected.
You know, on velocity: e-commerce has increased the point-of-interaction speed, right? So this is that need for interactivity. It didn't used to be so required, but as the velocity of all business and all transactions has sort of increased, so have the constraints on the infrastructure used to process it.
And on variety, I like this one a lot. "Through 2003/2004," right, so he's been sort of fairly conservative about how far out he wanted to predict, "no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics." So this is great: you could have said this through 2015 and been arguably correct; this problem has not gone away.
So, another point in the history of this term Big Data: there was a series of talks, a lot of work by John Mashey, who was formerly the chief scientist at SGI, who would talk about Big Data being the next wave of "infrastress." And what he meant by infrastress was: what's really going to drive the technology forward, where we're going to feel the pain.
And his point was that the I/O interfaces were where it was tough. So in particular, disk capacities were growing incredibly fast, and still are, while the latencies are not keeping pace, right? So you can go down to a local store and buy a 3-terabyte drive for probably $200, but the rate at which you can pull data off that drive is essentially the same as it has been for many, many years. And so now it takes you hours to actually read every byte of data that you stored on that disk. And this is a problem, because it limits the actual analysis you can do: we can keep all the data, and that's really cheap, but we cannot do anything with it because the pipe is so small.
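To put rough numbers on that bottleneck (the capacity and throughput figures below are assumptions, roughly typical for a consumer spinning drive, not measurements from any particular device):

```python
# Back-of-the-envelope: time to read every byte off a commodity disk.
capacity_bytes = 3 * 10**12      # assumed: a 3 TB drive
throughput_bps = 150 * 10**6     # assumed: ~150 MB/s sustained sequential read

seconds = capacity_bytes / throughput_bps
hours = seconds / 3600
print(f"{hours:.1f} hours to scan the full disk")  # -> 5.6 hours
```

So even under generous sequential-read assumptions, a single end-to-end scan of the disk takes on the order of half a day, which is exactly the capacity-versus-bandwidth gap being described.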