Multimodal Segmentation of Lifelog Data
Aiden R. Doherty1, Alan F. Smeaton1, Keansub Lee2 & Daniel P.W. Ellis2
Centre for Digital Video Processing & Adaptive Information Cluster, Dublin City University, Ireland1
LabROSA, Columbia University, New York, USA2
{adoherty, asmeaton}@computing.dcu.ie, {dpwe, kslee}@ee.columbia.edu
Abstract
A personal lifelog of visual and audio information can be very helpful as a human memory augmentation tool. The SenseCam, a passive wearable camera, used in conjunction with an iRiver MP3 audio recorder, will capture over 20,000 images and 100 hours of audio per week. If used constantly, this would soon build up to a substantial collection of personal data. To gain real value from this collection it is important to automatically segment the data into meaningful units or activities. This paper investigates the optimal combination of data sources for segmenting personal data into such activities. Five data sources were logged and processed to segment a collection of personal data, namely: image processing on captured SenseCam images; audio processing on captured iRiver audio data; and processing of the temperature, white light level, and accelerometer sensors onboard the SenseCam device. The results indicate that a combination of the image, light and accelerometer sensor data segments our collection of personal data better than a combination of all 5 data sources. The accelerometer sensor is good for detecting when the user moves to a new location, while the image and light sensors are good for detecting changes in wearer activity within the same location, as well as detecting when the wearer socially interacts with others.
Introduction
The SenseCam, developed by Microsoft Research Cambridge, is a small wearable device which incorporates a digital camera and multiple sensors including: sensors to detect changes in light levels, an accelerometer to detect motion, a thermometer to detect ambient temperature, and a passive infrared sensor to detect the presence of a person. Sensor data is captured approximately every 2 seconds and, based on these readings, the device determines when an image should be captured. For example, the light sensor will trigger the capture of an image when the wearer moves between two different rooms, as there will be a distinct change in the light level as the wearer approaches the door, opens it and moves into the new room. An image is also captured when the passive infrared sensor detects the presence of a person arriving in front of the device, indicating that the wearer has just met somebody. The accelerometer sensor is useful as a lack of motion indicates an optimal time to take a non-blurred image. If no image has been captured based on sensor activity over a predetermined time period (50 seconds), an image is automatically captured. All the sensor data is correlated with the captured SenseCam images when downloaded to a computer.
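The following is a minimal sketch of this kind of trigger logic. The sensor field names, thresholds, and rule ordering are illustrative assumptions for exposition only, not the actual SenseCam firmware:

```python
# Illustrative sketch only: thresholds and field names are assumptions,
# not the SenseCam's real trigger rules.
TIMEOUT_S = 50          # fallback: capture at least one image every 50 s
LIGHT_DELTA = 100       # assumed light-level change indicating a room change
STILLNESS = 0.05        # assumed accelerometer magnitude for "not moving"

def should_capture(prev, curr, seconds_since_last_image):
    """Decide whether to trigger an image from ~2 s sensor readings."""
    if seconds_since_last_image >= TIMEOUT_S:
        return True                                   # timer fallback
    if abs(curr["light"] - prev["light"]) > LIGHT_DELTA:
        return True                                   # e.g. moving between rooms
    if curr["pir"]:
        return True                                   # person in front of wearer
    if abs(curr["accel"]) < STILLNESS:
        return True                                   # wearer still: sharp image
    return False
```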
The iRiver T10 is an MP3 player with 1GB of flash memory, powered by a single AA battery. The device also has a built-in microphone and is capable of recording MP3 audio at 64 kbps, which means that an entire day's worth of data can easily be recorded. Figure 1 depicts an individual wearing the SenseCam in front of his chest, via a strap around his neck, and an audio recorder clipped on to the right belt strap of his trousers.
Hodges et al. (2006) detail the potential benefits of a personal visual diary such as that generated by a SenseCam or audio recorder. In preliminary experiments they found that the use of a SenseCam dramatically aided a subject suffering from a neurodegenerative disease (limbic encephalitis) to recall events that happened during her day when reviewing that day's activities using SenseCam images. From personal experience, the authors of this paper have also found improved short-term memory recall of activities experienced during days spent wearing the SenseCam device.
Figure 1. SenseCam and iRiver audio recorder
A SenseCam captures around 3,000 images on an average day, creating a sizable collection of images even within a short period of time, e.g. over 20,000 images per week, which equates to approximately one million images captured per year. Over a lifetime of wearing this passively capturing camera, an individual could reasonably expect to have an image collection of over 50 million images. No individual could ever manually retrieve images of encountered activities from such a collection with satisfactory success. This raises the issue of how to automatically reconstruct large personal image collections into manageable segments that can be easily retrieved by users.
We foresee an interface whereby a user can view, for each day, a number of keyframe images, each representing a different activity or event. To determine the activity-representative images it is therefore imperative to automatically identify the boundaries between different activities, e.g. between activities such as having breakfast, working in front of a computer, having lunch with work colleagues, travelling on a bus, attending a game of football, etc.
We use 5 different sources of information to segment the SenseCam images into distinct activities, namely: low-level image descriptors, audio, temperature, light, and movement data. We will discuss how each of the 5 sources of information is processed and fused together. We will then investigate the minimal and best combination of data sources for carrying out reliable activity segmentation.
This paper is organised as follows: the next section describes current work in this field. Thereafter, the techniques used to detect activity boundaries for each data source are described in detail. We then discuss our experimental procedure, followed by an analysis of the results from our experiments. Finally, we determine which combination of data sources performs best and detail work to be carried out in the future.
Literature Review
Several research groups have recorded personal images or audio; however, their devices generally require the user to wear a laptop carried in a bag on their back (Tancharoen, Yamasaki & Aizawa, 2006; Lin & Hauptmann, 2006), and in some cases a head-mounted camera (Tano et al., 2006). As McAtamney & Parker (2006) note in their study, both the wearer and the subject talking to them are aware of personal recording devices while holding conversations. It is therefore desirable to make a wearable device less visually obvious, to encourage more natural interactions with the wearer. The SenseCam is small and light and, from experience of wearing the device, after a short period of time it goes virtually unnoticed by the wearer.
Gemmell et al. (2004) describe the SenseCam in detail, highlighting its passively capturing nature. They explain that "…The next version of SenseCam will include audio capture, and will trigger image capture based on audio events. Eventually we would like to record audio clips surrounding image capture events…" This motivated us to also record audio with our iRiver MP3 voice recorder. In prior work we noted that audio can be a rich form of additional information that can complement visual sources of information (Ellis & Lee, 2004a).
To our knowledge no other groups have captured data for the duration of an entire day. Using the Deja View Camwear (Reich, Goldberg & Hudek, 2004), Wang et al. (2006) state that "… One of the authors carried the camwear, and recorded on average of 1 hour of video every day from May to June…" Similarly, Lin and Hauptmann (2006) record data for only between 2 and 6 hours on weekdays, while others also capture for only small periods of the day (Tancharoen, Yamasaki & Aizawa, 2005). For this paper one of the authors captured SenseCam image data for over 15 hours per day, from morning to evening. This provides a more thorough representation of an individual's whole lifestyle.
One method of reviewing images captured by the SenseCam is to use the SenseCam Image Viewer (Hodges et al., 2006). In essence this contains "…a window in which images are displayed, and a simple VCR-type control which allows an image sequence to be played slowly (around 2 images/second), quickly (around 10 images/second), re-wound and paused…" However, it takes upwards of 2 minutes to quickly play through a day's worth of SenseCam images, which translates to 15 minutes to review all the images from 1 week. We believe a one-page visual summary of a day, containing images representing encountered activities or events, coupled with the ability to search for events or similar events, is a much more useful way to manage SenseCam images. Lin and Hauptmann (2006) clearly state that "…continuous video need to be segmented into manageable units…" A similar approach is required with respect to a lifelog collection of recorded personal images or video. Wang et al. (2006) segment their video into 5-minute clips; however, activities can vary in length and more intelligent techniques are required. Tancharoen and Aizawa (2004) describe a conversation detection approach in their paper. Our work is heavily focused on investigating which individual and combined data sources yield the richest activity segmentation information.
Tancharoen et al. (2004) describe the benefits of recording various sources of personal information, including video, audio, location, and physiological data. However, they do not evaluate various combinations of these sources. Wang et al. (2006), on the other hand, investigate combining visual and audio sources to improve access to personal data and show the potential gains of using multi-modal techniques in this domain. As mentioned, they only use 2 sources of data; in this paper we investigate 5 sources of data and determine the optimal combinations of those sources.
Segmentation of Data into Events
The aim of automatic event detection is to determine boundaries that signify a transition between different activities of the wearer. For example, if the wearer was working in front of his computer and then goes to a meeting, it is desirable to automatically detect the boundary between the segment of images of him working at the computer and the segment of images of him being at a meeting, as shown in Figure 2:
Figure 2. Example of an activity boundary
We now discuss the techniques used on the various sources of data to segment each day into meaningful activities. After discussing techniques for each individual data source, we will describe our method of fusing the data sources.
Pre-Processing of Raw Data
Images were taken from the SenseCam and placed into distinct folders for each day. Use was made of the aceToolbox (AceMedia Project, 2006), a content-based analysis toolkit based on the MPEG-7 eXperimental Model (XM) (Manjunath, Salembier, & Sikora, 2002), to extract low-level image features for each and every image. The audio files recorded in parallel to the SenseCam images did not contain timestamp information, and it was therefore necessary to manually note the start time of all audio segments. While this was quite tedious, it will cease to be an issue once audio functionality is more tightly integrated into the SenseCam (Gemmell et al., 2004).
Segmentation using SenseCam Images
To process SenseCam images we used the edge histogram low-level feature, which captures the distribution of edges in an image using the Canny edge detection algorithm (Canny, 1986). Our initial intention was to use scalable colour as the low-level feature, but we discovered that the edge histogram provided a better representation of event boundaries as it was less sensitive to lighting changes. Initially, to search for event boundaries we carried out a form of video shot boundary detection (Brown et al., 2000), whereby we compared the similarity between adjacent images using the Manhattan distance metric. If adjacent images are sufficiently dissimilar, based on a predetermined threshold, it is quite probable that a boundary between events has occurred. However, this is not always the case. As the SenseCam is a wearable camera that passively captures images, it naturally captures from the perspective of the wearer. Therefore, if one is talking to a friend but momentarily looks in the opposite direction, an image may be taken by the SenseCam. More than likely the wearer will then turn back to their friend and continue talking. If only adjacent images are compared, the wearer looking momentarily in the opposite direction may trigger an event boundary, as the two images could well be quite distinct in their visual nature. This effect is illustrated in Figure 3:
Figure 3. Illustration of possible false positive events
After communication with Gaughan & Aime (2006), we address this problem by using an adaptation of Hearst's Text Tiling algorithm (Hearst & Plaunt, 1993). This effectively involves comparing two adjacent blocks of images against each other to determine how similar they are. In our work we use a block size of 5, then slide forward by 1 image and repeat the similarity calculation. If the two adjacent blocks of 5 are broadly similar, it is quite likely no event boundary has occurred; however, if the two blocks are sufficiently dissimilar, based on a defined threshold after smoothing, it is quite likely that there has been a change in the wearer's activities. Using this approach, the effect of outlier images, such as the wearer briefly changing his point of view, is less detrimental to the detection of changes in the wearer's activities, as illustrated in Figure 4, where the house and tree icons represent two different events. A sketch of this block-comparison procedure is given after Figure 4.
Figure 4. Image-based adaptation of text tiling
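The following is a minimal sketch of the block-comparison scheme, assuming edge histograms are available as fixed-length numeric vectors. The block size of 5 matches the text; the smoothing kernel and the mean-plus-k-standard-deviations threshold are illustrative assumptions rather than the exact values used in our experiments:

```python
# Sketch of text-tiling-style boundary detection over edge histograms.
import numpy as np

def block_dissimilarity(hists, block=5):
    """Manhattan distance between the mean histograms of two adjacent
    blocks, scored at every position where both blocks fit."""
    scores = []
    for i in range(block, len(hists) - block + 1):
        left = np.mean(hists[i - block:i], axis=0)
        right = np.mean(hists[i:i + block], axis=0)
        scores.append(np.abs(left - right).sum())
    return np.array(scores)

def find_boundaries(hists, block=5, k=1.0):
    scores = block_dissimilarity(hists, block)
    # Simple smoothing, then threshold at mean + k standard deviations
    smoothed = np.convolve(scores, np.ones(3) / 3, mode="same")
    thresh = smoothed.mean() + k * smoothed.std()
    return [i + block for i, s in enumerate(smoothed) if s > thresh]
```

Because each score averages over 5 images on either side, a single outlier image (such as a brief glance away) barely shifts the block means, which is exactly the robustness the text-tiling adaptation is intended to provide.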
Segmentation using Recorded Audio
Features
Unlike speech recognition approaches, which aim to distinguish audio events at a fine time scale (10 ms or 25 ms), we used long time-frame (one-minute) features that provide a more compact representation of long-duration recordings. The advantage of this is that properties of the background ambience may be better represented when short-time transient foreground events are smoothed out over a one-minute window (Ellis & Lee, 2004b). The three most useful features were the log-domain mean energy measured on a Bark-scaled frequency axis (designed to match physiological and psychological measurements of the human ear), and the mean and variance over the frame of a 'spectral entropy' measure that provided a little more detail on the structure within each of the 21 broad auditory frequency channels.
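As a rough illustration, the sketch below computes comparable long-frame features with numpy and librosa. It approximates the 21 Bark-scaled channels with a mel filterbank and computes a single spectral entropy across the band-energy distribution rather than the per-channel entropy described above, so it should be read as an approximation of the idea, not our exact feature extraction:

```python
import numpy as np
import librosa  # assumed available for the STFT and filterbank helpers

def longframe_features(y, sr, frame_sec=60.0, n_bands=21):
    """One ~23-dim feature vector per minute of audio (illustrative)."""
    n_fft = int(0.025 * sr)          # 25 ms analysis windows
    hop = int(0.010 * sr)            # 10 ms hop
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2

    # Approximate the 21 Bark-scaled bands with a mel filterbank
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_bands)
    band_e = fb @ S                  # (n_bands, n_short_frames)

    # Spectral entropy of the normalised band-energy distribution
    p = band_e / (band_e.sum(axis=0, keepdims=True) + 1e-10)
    entropy = -(p * np.log(p + 1e-10)).sum(axis=0)

    # Pool the short frames into non-overlapping one-minute windows
    per_min = int(frame_sec * sr / hop)
    feats = []
    for i in range(0, band_e.shape[1] - per_min + 1, per_min):
        sl = slice(i, i + per_min)
        feats.append(np.concatenate([
            np.log(band_e[:, sl].mean(axis=1) + 1e-10),  # 21 mean log energies
            [entropy[sl].mean(), entropy[sl].var()],     # entropy mean & variance
        ]))
    return np.array(feats)
```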
Segmentation using BIC
We used these three features to identify segment boundaries in the data using the Bayesian Information Criterion (BIC) procedure originally proposed for speaker segmentation in broadcast news speech recognition (Chen & Gopalakrishnan, 1998). BIC is a likelihood criterion penalised by model complexity, as measured by the number of model parameters. Specifically, the BIC score for a boundary at time t (within an N-point window) is:
$$\mathrm{BIC}(t) = \log\left(\frac{L(X_1^t \mid M_1)\,L(X_{t+1}^N \mid M_2)}{L(X_1^N \mid M_0)}\right) - \frac{\lambda}{2}\,\Delta\#(M)\log(N)$$
where $X_1^N$ represents the set of feature vectors over time steps $1..N$ etc., $L(X \mid M)$ is the likelihood of data set $X$ under model $M$, and $\Delta\#(M)$ is the difference in the number of parameters between the single model ($M_0$) for the whole segment and the pair of models, $M_1$ and $M_2$, describing the two segments resulting from division. Each model $M$ denotes a multivariate Gaussian distribution with mean vector $\mu$ and full covariance matrix $\Sigma$. $\lambda$ is a tuning constant, theoretically one, that can be viewed as compensating for 'inefficient' use of the extra parameters in the larger model set. When $\mathrm{BIC}(t) > 0$, we place a segment boundary at time $t$, and then begin searching again to the right of this boundary, with the search window size $N$ reset. If no candidate boundary $t$ meets this criterion, the search window size is increased, and the search across all possible boundaries $t$ is repeated. This continues until the end of the signal is reached.
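To make the search procedure concrete, here is a minimal sketch assuming full-covariance Gaussian models over the per-minute feature vectors; the initial window size, growth step, and minimum segment length are illustrative parameters, not the settings used in our experiments:

```python
import numpy as np

def gauss_loglik(X):
    """Max log-likelihood of rows of X under one ML-fit Gaussian."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)  # regularised
    _, logdet = np.linalg.slogdet(cov)
    # For an ML-fit Gaussian the data term collapses to a closed form
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + d)

def bic_score(X, t, lam=1.0):
    """BIC(t): two models on X[:t] / X[t:] versus one model on all of X."""
    n, d = X.shape
    delta_params = d + d * (d + 1) / 2   # extra mean vector + covariance
    ll_split = gauss_loglik(X[:t]) + gauss_loglik(X[t:])
    return (ll_split - gauss_loglik(X)) - 0.5 * lam * delta_params * np.log(n)

def segment(X, n0=20, grow=10, min_seg=5):
    """Greedy left-to-right search: emit a boundary when max BIC(t) > 0."""
    bounds, start, n = [], 0, n0
    while start + n <= len(X):
        W = X[start:start + n]
        cands = [(bic_score(W, t), t) for t in range(min_seg, len(W) - min_seg)]
        best, t = max(cands) if cands else (-np.inf, None)
        if best > 0:
            bounds.append(start + t)
            start, n = start + t, n0     # restart search after the boundary
        else:
            n += grow                    # no boundary found: widen the window
    return bounds
```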
In order to use the BIC approach to obtain something approximating a probability of a boundary at each time $t$, we must first calculate a single score for every time point. We can do this by fixing the windows used for the BIC calculation, and recording only the BIC score for the comparison of models based on two equal-sized windows either side of a candidate boundary point versus a single model straddling that point (denoted $\mathrm{BIC}_{\mathrm{fw}}(t)$). We can then view the BIC score as a "corrected" log likelihood ratio for that window of $N$ points, which we can normalise to a per-point ratio by taking the $N$th root. Then we can convert this to the probability whose odds ratio ($p/(1-p)$) equals that likelihood ratio:
$$P(t) = \frac{\exp\left(\mathrm{BIC}_{\mathrm{fw}}(t)/N\right)}{1 + \exp\left(\mathrm{BIC}_{\mathrm{fw}}(t)/N\right)}$$
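A small sketch of this conversion, assuming $\mathrm{BIC}_{\mathrm{fw}}(t)$ has already been computed for a fixed window of N points (natural logarithms throughout):

```python
import numpy as np

def boundary_probability(bic_fw, n_window):
    """Map a fixed-window BIC score to a boundary probability P(t)."""
    z = bic_fw / n_window            # per-point log likelihood ratio (Nth root)
    return 1.0 / (1.0 + np.exp(-z))  # logistic: P/(1-P) = exp(z)
```

Note that $\exp(z)/(1+\exp(z))$ is algebraically identical to the logistic $1/(1+\exp(-z))$ used in the code.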
Segmentation Using Temperature Readings
We are able to use the temperature sensor onboard the SenseCam to detect changes in location, as the sensor is sensitive to within one degree Fahrenheit and it is thus possible to detect changes even when moving between rooms within the one building. To achieve this, the variance of the sensor values recorded over a predetermined window size is calculated; if this is low then it is quite likely that the wearer has stayed in the same environment. However, if the variance is quite high it is probable that the wearer has changed environment, whether by changing rooms, or perhaps by going from outdoors to indoors or vice versa. A sketch of this windowed-variance test follows.
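The sketch below assumes one temperature reading roughly every 2 seconds, as with the other SenseCam sensors; the window length and variance threshold are illustrative guesses rather than tuned values:

```python
import numpy as np

def temperature_boundaries(temps, window=30, threshold=0.5):
    """Flag indices where the rolling variance of temperature readings is
    high, suggesting a change of environment (e.g. moving between rooms)."""
    bounds = []
    for i in range(len(temps) - window):
        if np.var(temps[i:i + window]) > threshold:
            bounds.append(i + window // 2)  # mark the centre of the window
    return bounds
```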