Apr.-June 2014 (21, 02) pp. 22-31
1070-986X/14/$31.00 © 2014 IEEE

Published by the IEEE Computer Society
Clustering Faces in Movies Using an Automatically Constructed Social Network
Mei-Chen Yeh, National Taiwan Normal University

Wen-Po Wu, Reallusion Corporation
  Article Contents  
  Proposed Approach  
Download Citation
Download Content
PDFs Require Adobe Acrobat
Clustering faces in movies is a challenging task because faces in a feature-length film are relatively uncontrolled and vary widely in appearance. Such variations make it difficult to appropriately measure the similarity between faces under significantly different settings. In this article, the authors develop a method that improves face-clustering accuracy by incorporating the social context information inherent among characters in a movie. In particular, they study the relation of social network construction and face clustering and present a fusion scheme that eliminates ambiguities and bridges information from two fields. Experiments on real-world data show superior clustering performance compared with state-of-the-art methods. Furthermore, their method can help incrementally build a character's social network that is similar to a manually labeled example.
To enable efficient actor searches and content management, automatic face-clustering techniques can help recognize faces in films and video footage and organize them by character names. However, despite the vast amount of research conducted on this subject, 1 4 the face-clustering task remains highly challenging, especially for feature-length films, which are relatively uncontrolled and vary widely in appearance. Numerous factors other than character identity—such as lighting conditions, facial expressions, poses, and partial occlusions—change the way a face appears. Thus, state-of-the-art face description and modeling methods have had only limited success in real-world testing.
Alternatively, social network techniques have increasingly gained attention in movie content analysis because they provide additional cues to audio-visual features for organizing the growing volumes of available movie data. A social network is a collection of relationships that demonstrate how people are socially connected to one another. 5 , 6 A weighted undirected graph can represent the social context information, where vertices denote the people and edges indicate the social closeness of two individuals. The analyses of social networks have shown some success in discovering hidden structures that cannot be directly perceived by analyzing low-level audio-visual features. 5 , 7 , 8
In this article, we study the relation of social network construction and face clustering in movies and TV. More specifically, we look at how knowledge of the characters' social activities can enhance face clustering and how clustered faces can help estimate the relationships among characters. There are several interesting connections between the two tasks. For example, an effective social network is usually built upon a robust face-clustering result. 5 Similarly, because the communities among characters in a movie are relatively limited compared with those on social networking websites, the social relationships inherent in a movie should benefit from the face-clustering task. 8 We demonstrate through experiments that our proposed framework may eliminate a certain level of ambiguity in each case and bridge information from both fields. For many real-world applications, the connection could produce more valuable information, such as a name and the social partners associated with a detected face, which in turn would result in considerable benefits to media content management.
To automate the analysis process, we need to address two technical problems:
  • How do we construct characters' social networks from a movie?
  • How do we use social networks for face clustering?
A preliminary version of the work 9 focused on constructing characters' social networks, along with a fully automatic approach for solving the problem. We extend that work here by using the social network to facilitate the face-clustering problem and performing an empirical analysis on how the two visual tasks benefit each other. Our empirical study shows that an automatically built social network both provides meaningful information that describes characters' social interactions and helps resolve ambiguities when distinguishing identities are captured with similar shooting conditions. However, the extraction of useful cues from a noisy social graph is not trivial—the discriminative power of social features for face clustering depends on the way we utilize them. We examine various design choices and study how social contexts should be utilized to truly aid in the face-clustering process.
Proposed Approach
To iteratively perform face clustering and social network construction in movies, we propose a framework to connect these tasks (see Figure 1 ). We start with a state-of-the-art appearance-based face recognition approach 10 and describe our implementation and experimental results, which are considered the baseline performance. Then, we present a fully automatic approach for constructing roles' social network from movies 9 and a new method that explores the social contexts inherent in a movie for enhancing the clustering performance on an unconstrained face set. (See the “Related Work in Movie Content Analysis” sidebar for earlier work in this field.)
The framework of bridging face clustering and social network construction. Given grouped faces, a social network that describes characters and their interactions can be automatically built. The social network is then used to extract social features that enhance face clustering.

Figure 1.The framework of bridging face clustering and social network construction. Given grouped faces, a social network that describes characters and their interactions can be automatically built. The social network is then used to extract social features that enhance face clustering.

Associate-Predict Model: The Baseline
The associate-predict model addresses the issue of large intrapersonal variations in face recognition. 10 The model is built on an extra generic identity dataset (machine memory) in which each identity contains multiple images with considerable intrapersonal variations. Unlike conventional recognition approaches that directly compare two faces, each input face is first associated with a similar generic identity, and its new appearance is predicted under the similar setting of the other face. The face matching is finally performed on the predicted new face pairs.
Following a memory construction procedure similar to that described in earlier work, 10 we built the machine memory from the Multi-PIE dataset. 11 This extra generic identity dataset contains 129 subjects, and each identity has five poses and five lighting conditions. To evaluate the effectiveness of this model for clustering faces in movies, we conducted a comparison study with a direct-matching approach. We used the OpenCV library ( http://opencv.org) to detect face regions in the full-length film The Devil Wears Prada. Each face is described by a 59-dimensional local binary pattern (LBP), 12 and we use the chi-squared distance to measure face-to-face proximity. The distances are calculated through an extra generic identity dataset in the associate-predict approach. Finally, we apply the affinity propagation (AP) clustering algorithm 13 to group faces.
Figure 2 shows a direct comparison between the associate-predict approach 10 and the direct-matching method. We set various preference values in AP to control the number of data points selected as exemplars, with low preferences leading to few clusters and vice versa. (We define the two measures we used later on.) Theresulting curves clearly show that the associate-predict method (blue dashed curve) out performs the direct-matching method (black dotted curve), with a performance gain of 2.15 percent in purity and 3.51 percent in normalized mutual information (NMI) on average. Even though we apply the state-of-the-art model, the improvement is limited when using the appearance-based features alone.
A direct comparison of the clustering accuracy between the direct-matching method, the associate-predict model, and our approach in terms of (a) purity and (b) normalized mutual information.

Figure 2.A direct comparison of the clustering accuracy between the direct-matching method, the associate-predict model, and our approach in terms of (a) purity and (b) normalized mutual information.

Constructing a Social Network
Social networks are conventionally represented by an undirected weighted graph $G = (V,{\rm{ }}E,{\rm{ }}W)$$G = (V,{\rm{ }}E,{\rm{ }}W)$, where $V = \{ v_1 ,v_2 ,{\hbox{ }} \ldots ,v_N \}$$V = \{ v_1 ,v_2 ,{\hbox{ }} \ldots ,v_N \}$ denotes the set of characters in a movie, E = { \{e_{ij} \cr |{\hbox{if }}v_i}{ \{e_{ij} \cr |{\hbox{if }}v_i}{{\hbox {and}}v_j {{\hbox {and}}v_j {\hbox {have a relationship}}\}{\hbox {have a relationship}}\}, and the element $w_{ij}$$w_{ij}$ in W represents the social closeness between $v_i$$v_i$ and $v_j$$v_j$. 5 The face clusters derived from the previous step constitute the vertices in V. To compute $w_{ij}$$w_{ij}$, most existing methods utilize coappearance to quantify the characters' interrelationships—the social closeness of two characters is measured by the number of scenes in which both characters are present. 5 , 6 These approaches require precise scene boundaries. The social network is semiautomatically constructed because we need to apply a scene-detection method to obtain initial scene boundaries and perform manual labeling to correct errors introduced by the scene-detection method.
The interrelationships of the characters may be more appropriately represented by the interaction than coappearance 9 for two reasons. First, two characters can interact even if they do not physically appear in the same space at the same time. For example, in the movie You've Got Mail, Joe Fox and Kathleen Kelly unknowingly become friends over email. Second, the fact that two people are present in a same scene does not imply that they interact with each other. For example, the camera may capture anunknown person who coincidentally appears in background. In this article, we use both the coappearance-based and interaction-based cues.
First, we implement a shot-change-detection method to determine shot boundaries. The social closeness $w_{ij}$$w_{ij}$ between two face clusters $v_i$$v_i$ and $v_j$$v_j$ is quantified as follows:$$w_{ij} = \sum {_{p \in v_i } } \sum {_{q \in v_j } } {\bf{1}}\left ({\left| {p.{\hbox{shot}} - q.{\hbox{shot}}} \right| \le 1} \right) \tag {1}$$$$w_{ij} = \sum {_{p \in v_i } } \sum {_{q \in v_j } } {\bf{1}}\left ({\left| {p.{\hbox{shot}} - q.{\hbox{shot}}} \right| \le 1} \right) \tag {1}$$

where p.shot denotes the shot ID of a face region p and 1(·) is an indicator function. That is, the relationship between two characters is quantified by the amount of coappearance and shot alternation between two face clusters. Figure 3 illustrates the social graph for the movie The Devil Wears Prada constructed using the automatic approach.
The character social networks from the movie The Devil Wears Prada. (a) Automatic approach, first iteration; (b) automatic approach, second iteration; and (c) manual approach (ground truth). Nodes that represent the same character are manually combined for better visualization.

Figure 3.The character social networks from the movie . (a) Automatic approach, first iteration; (b) automatic approach, second iteration; and (c) manual approach (ground truth). Nodes that represent the same character are manually combined for better visualization.

This method is not limited to dialog scenes involving two characters, and it can generally deal with interactions that involve more than two people. Compared with coappearance-based approaches, the method is easy to implement and requires only the shot-boundary information to fairly describe the characters' interactions. As we showed in our preliminary work, 9 the approach can be used to generate similar social graphs from either the manually labeled or automatically clustered faces.
Extracting Social Features from the Social Network
The automatically constructed social network is usually noisy. For example, multiple nodes in the social graph may correspond to the same character, and a node may contain faces of more than one character. Thus, we need to extract useful cues from the noisy social network to improve face clustering. We decompose the problem into two steps: selecting an anchor set and computing social context cues given the anchor set.
Anchor Set
To deal with the noise that inevitably exists in an automatically constructed social network, we compute social cues from a subset of the social network. This idea was inspired by the work of Peng Wu and Feng Tang, 8 who apply “significant clusters” to refine and use the social graph.
We first observe that nodes in the social network are not equally and entirely important for revealing identities. For example, if a character $v_i$$v_i$ has a relationship with every other character in the movie, co-occurring with $v_i$$v_i$ provides nearly no information. Intuitively, large clusters are usually selected because they tend to contain the faces of main characters. 8
Besides the selection of top large clusters, we also examine another strategy, selecting the clusters that have large variances in relationship magnitude. More specifically, given the social closeness matrix W, we compute the variance of each row. A cluster with a large variance implies that the corresponding character interacts with only a few particular characters in the movie. We consider those clusters as elements in the “anchor set” because the probabilities of interacting with them would be more skewed and, thus, probably discriminative.
The size of the anchor set is automatically determined as follows: Clusters are sorted by size (or magnitude variance), and the difference of values between adjacent clusters is computed. Suppose the peak of the difference curve occurs at K. We selected the top K clusters with large sizes (or magnitude variances).
Social-Based Proximity
Our main approach of face clustering using a social graph is derived from the following observations:
  • Two connected nodes tend not to be the same person.
  • Two nodes that have a similar connectivity (connection and no connection) to those in the anchor set tend to be the same person because they have consistent social behaviors.
The second observation is particularly valid in movies because characters usually form a limited number of communities in a story. Unlike those in social networking websites such as Facebook, the social context inherent in the characters' social network is not sparse.
We model each face cluster as a node and associate each node with a social feature vector ${\bf{x}}_{v_i } $${\bf{x}}_{v_i } $ = ($w_{ik}$$w_{ik}$), where $w_{ik}$$w_{ik}$ is the social closeness of $v_i$$v_i$ and $v_k$$v_k$, and k indicates the node index of those in the anchor set. That is, ${\bf{x}}_{v_i } $${\bf{x}}_{v_i } $ is a K-dimensional feature vector, assuming we have K nodes in the anchor set. The social-based similarity between two clusters $v_i$$v_i$ and $v_j$$v_j$ is given by$$s\left ({v_i ,v_j } \right) = f\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) \times g\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right), \tag {3}$$$$s\left ({v_i ,v_j } \right) = f\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) \times g\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right), \tag {3}$$

where f(·) measures the similarity of the nonzero social elements and g(·) measures that of the zero elements in two social feature vectors. More specifically, based on the idea from earlier research, 9 , 12 f(·) incorporates the observation that two clusters may correspond to a character if they have similar social connections. The similarity is determined by their cosine value adjusted by the social closeness between $v_i$$v_i$ and $v_j$$v_j$:$$f\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) = \cos \left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) \times \exp \left ({ - {{w_{ij} } \over {\left\| W \right\|_{\max } }}} \right). \tag {4}$$$$f\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) = \cos \left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) \times \exp \left ({ - {{w_{ij} } \over {\left\| W \right\|_{\max } }}} \right). \tag {4}$$

If two people interact (that is, $w_{ij}$$w_{ij}$ > 0), they cannot be the same person even if they have similar social connections to other people. Moreover, we propose that the social behavior of two people is similar if they both have no interaction with certain people. For example,$${\displaylines g\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) = {1 \over K}\left ({\sum\limits_{k = 1}^K {1\left ({{\bf{x}}_{v_i } \left (k \right) = 0,{\bf{x}}_{v_j } \left (k \right) = 0} \right)} } \right)} \times \exp \left ({ - \left ({1 - {{\min \left| {v_i } \right|,\left| {v_j } \right|} \over {\max \left| v \right|}}} \right)} \right). \tag {5}$$$${\displaylines g\left ({{\bf{x}}_{v_i } ,{\bf{x}}_{v_j } } \right) = {1 \over K}\left ({\sum\limits_{k = 1}^K {1\left ({{\bf{x}}_{v_i } \left (k \right) = 0,{\bf{x}}_{v_j } \left (k \right) = 0} \right)} } \right)} \times \exp \left ({ - \left ({1 - {{\min \left| {v_i } \right|,\left| {v_j } \right|} \over {\max \left| v \right|}}} \right)} \right). \tag {5}$$

The first term computes the number of 0-0 matches in ${\bf{x}}_{v_i } $${\bf{x}}_{v_i } $ and ${\bf{x}}_{v_j } $${\bf{x}}_{v_j } $, and the similarity is adjusted by the number of faces in each cluster, denoted by |$v_i$$v_i$|. If a face cluster has just a few faces, it is more likely to introduce 0-0 matches when the cluster is compared with others. Thus, the similarity value should be inversely proportional to the number of faces in the cluster.
Combing Facial and Social Cues
The original appearance-based clustering result may be inaccurate because of large intrapersonal variation. We believe that an improvement could be obtained by considering social cues in the clustering process. We combine the social-based measure (Equation 3) extracted from the social network with the appearance-based measure (the chi-squared distance $\chi ^2 $$\chi ^2 $ of two LBP descriptors) to determine the overall proximity of two faces p and q. The computation is simply multiplying two terms:$$d(p,{\hbox{ }}q) = \chi ^2 (p,{\hbox{ }}q) \times (1 - s(v_i ,{\hbox{ }}v_j )), \tag {6}$$$$d(p,{\hbox{ }}q) = \chi ^2 (p,{\hbox{ }}q) \times (1 - s(v_i ,{\hbox{ }}v_j )), \tag {6}$$

where $p \in v_i$$p \in v_i$ and $q \in v_j$$q \in v_j$. We did not use a linear weighting method because it is unclear how the weights should be determined to fairly reflect the true proximity. Moreover, if one of the measures reflects that two faces are unalike, we tend not to group them into the same cluster.
The refined proximity matrix is finally fed the AP clustering algorithm, and we obtain the second clustering result; based on this, the social network is again constructed. The system generates webpages that visualize the clustering results and the social network for each round. These processes are iteratively performed until the number of clusters reaches the number desired or the results are accepted by users.
Computational Complexity Analysis
Our social network construction approach quantifies the social closeness between two characters based on shot alternation and coappearance. The weight matrix W can be built by linearly examining the cluster IDs of each detected face by the order of the shot sequence. The time complexity is O( N), where N is the number of detected faces in the film. Once we obtain the social network, the computation of the social-based proximity matrix is $O\left ({|V|^2 K} \right)$$O\left ({|V|^2 K} \right)$, where | V| is the number of clusters and K is the dimension of a social feature. The most computationally expensive component in the framework is the AP clustering method, 13 which is used to group faces given a proximity matrix. The AP clustering approach is performed by exchanging messages between data points until a high-quality set of clusters gradually emerges. The approach needs $O\left ({|V|^2 } \right)$$O\left ({|V|^2 } \right)$ for each iteration, but it does not depend on the number of clusters or the dimensionality, and it has been shown to be much faster than k-means for high-dimensional data. 13 If faces are detected, the social network construction and clustering tasks take just a few seconds to process a two-hour film in our experiments.
We evaluated the proposed method on automatically detected face sets from five feature-length films and two sitcoms. The detection was performed using OpenCV on every frame, producing 320,508 face images including false positives. Table 1 shows the dataset statistics, including genre, length, and the number of faces and characters. We manually labeled each face and constructed the ground truth data, which we used to evaluate the clustering performance and the quality of social network.

Table 1.Dataset statistics.

Index Title * Year Genre Length (minutes) No. of faces No. of characters
M1 The Devil Wears Prada 2006 Comedy, drama, romance 102 69,211 15
M2 Little Fockers 2010 Comedy 91 69,077 10
M3 Taken 2008 Action, crime, thriller 87 43,741 5
M4 In Time 2011 Action, sci-fi, Thriller 97 48,604 7
M5 Seven Pounds 2008 Drama 111 37,701 7
S1 The Big Bang Theory (S5-E1) 2011 Comedy 20 17,287 8
S2 Gossip Girls (S5-E23) 2012 Drama, romance 40 34,887 9
The numbers after the TV entries indicate the season and episode numbers.

We used two metrics—purity and NMI—to evaluate the clustering performance. Purity is computed as the ratio of correctly assigned face labels and the total number of faces, where each cluster is assigned to the most frequent label in the cluster:$${\hbox{purity}}\left ({\Omega ,C} \right) = {1 \over N}\sum {_i \max _j } \left| {\omega _i \cap c_j } \right|,\tag \tag {7}$$$${\hbox{purity}}\left ({\Omega ,C} \right) = {1 \over N}\sum {_i \max _j } \left| {\omega _i \cap c_j } \right|,\tag \tag {7}$$

where $\Omega = \left\{ {\omega _1 ,\omega _2 , \ldots, \omega _m } \right\}$$\Omega = \left\{ {\omega _1 ,\omega _2 , \ldots, \omega _m } \right\}$ is the clusters that the approach discovered, $C = \left\{ {c_1 ,c_2 , \ldots ,c_n } \right\}$$C = \left\{ {c_1 ,c_2 , \ldots ,c_n } \right\}$ denotes ground truth clusters, and N is the number of faces. A large purity value indicates that the approach discovered many clusters—a purity value of 1 means each face gets its own cluster.
Alternatively, NMI considers the trade-off of the cluster number and the clustering performance:$${\hbox{NMI}}\left ({\Omega ,C} \right) = {{I\left ({\Omega ;C} \right)} \over {\sqrt {H\left (\Omega \right)H\left (C \right)} }}, \tag {8}$$$${\hbox{NMI}}\left ({\Omega ,C} \right) = {{I\left ({\Omega ;C} \right)} \over {\sqrt {H\left (\Omega \right)H\left (C \right)} }}, \tag {8}$$

where I(Ω; C) is the mutual information between the discovered and ground truth clusters and H(.) is the entropy of a clustering result. NMI(Ω, C) ranges from 0 to 1, where the value 1 means two cluster sets are identical and the value 0 means they are independent.
Effectiveness of Social Cues
We first evaluate the effectiveness of social cues for face clustering. Figure 4 shows a direct comparison of the clustering accuracy of the state-of-the-art appearance-based method—the associate-predict model 10 (blue bars)—and the proposed method (red bars). The proposed method—one iteration only in the experiment—consistently outperforms the associate-predict model for all films we tested. The gain was particularly significant when we used movie data. The number of characters in movies is usually larger than in a sitcom, and the interactions between characters are more diverse. That probably explains why our approach, which explores social relationship, achieves good performance in such cases.
Comparison of the clustering accuracy between the associate-predict model 10 and our approach. We performed one iteration only in terms of (a) purity and (b) normalized mutual information.

Figure 4.Comparison of the clustering accuracy between the associate-predict model and our approach. We performed one iteration only in terms of (a) purity and (b) normalized mutual information.

To further understand the clustering performance with respect to the number of clusters, see Figure 2 . Our approach (red solid curve) significantly outperforms appearance-based approaches across a range of clusters. For example, for the cluster number 28, the proposed method has an improvement gain of 9.5 percent in purity and 26.7 percent in NMI compared with the associate-predict model.
Our approach combines facial similarity and social information to refine the measurement of how two faces are alike. While using the associate-predict model handles intrapersonal variations, the integration of social cues improves the discriminative capability because two visually dissimilar faces of the same character can have a refined distance if their social activities are consistent.
It is worth mentioning that we also applied our social-enhanced approach on a social network built differently from the proposed approach. 6 Although that method 6 builds the social network in a different and more sophisticated manner, the social-context-enhanced clustering results are similar, as the red and the green curves in Figure 2 show. That is, we can achieve the clustering gain no matter which approach we use to build the social network, as long as it fairly captures the character relationships.
Empirical Analysis
Now we examine various design choices in the process of extracting feature from a social graph. We obtained the following experimental results using the movie The Devil Wears Prada.
Anchor Set
Figure 5 shows a comparison of the two different strategies for finding the subset of clusters that we use to identify the characters' interactions: nodes with a large degree (the green dashed curve) and nodes with a large magnitude variance (the blue dotted curve). It is interesting to see that these strategies have little effect on the overall clustering performance. Moreover, the use of the anchor set is effective, and its size should be appropriately set.
Clustering performances for two strategies of selecting an anchor set from a noisy social graph. (a) The purity strategy's performance. (b) Normalized mutual information (NMI) metrics. (c) Clustering accuracy in relation to the anchor set's size for each approach.

Figure 5.Clustering performances for two strategies of selecting an anchor set from a noisy social graph. (a) The purity strategy's performance. (b) Normalized mutual information (NMI) metrics. (c) Clustering accuracy in relation to the anchor set's size for each approach.

Figure 5 c shows how the anchor set size affects the clustering performance. The clustering accuracy decreases when many nodes (more than 40 percent) are selected in the anchor set. For example, the performance gain is 12.44 percent in purity and 25.93 percent in NMI when we use only 10 percent of the nodes over the case when we use the entire social graph. The automatic selection approach usually selects a few nodes in the anchor set—that is, K is approximately equal to one-eighth of the nodes in the testing cases. Using the anchor set not only improves the effectiveness of social features but reduces the feature dimension and the computational time for the similarity calculation in the next step.
Social Proximity
Next, we examine the effectiveness of related and unrelated social information when computing the similarity of two clusters. Intuitively, similar social behaviors mean two people have interactions with some people (related information, described in Equation 4) or neither of them interacts with certain people (unrelated information, described in Equation 5).
Figure 6 validates the usefulness of both cues. The green dashed curve is derived when only related information is used, and the blue dotted curve is obtained by using only unrelated information. The combination achieves the best performance, providing an additional improvement over using one cue alone. The result shows that the similarity measure should consider more than just the adjacent nodes because the other nodes provide complementary information necessary to describe the linking status.
Clustering performances with respect to related and unrelated information in terms of (a) purity and (b) normalized mutual information.

Figure 6.Clustering performances with respect to related and unrelated information in terms of (a) purity and (b) normalized mutual information.

Iterative Process
The automatic framework illustrated in Figure 1 bridges two visual tasks and generates two results—face clusters and social graphs. We now demonstrate that these tasks can benefit each other. We first examine the characters' social networks automatically built by the approach.
Figure 3 a shows the social network derived by the proposed method, which is built on the AP clustering result using the appearance-based features alone. Figure 3 b shows the social network obtained from the second-round AP clustering. For comparison, Figure 3 c presents the social network established on manually labeled faces. Nodes that represent the same character are combined for better visualization. It is interesting to observe that the learned social network evolves from a noisy graph into the ground truth, which implies that the automatic approach can produce fairly meaningful information that describes the characters' social relationships from an initially noisy face-clustering result. The social network derived from the second round AP has had a similar structure to that of the ground truth graph.
A social graph's fidelity has an effect on face clustering as well. Table 2 shows the clustering performance of four rounds in the iterative process. The experiment was conducted using a fixed preference value in the AP clustering algorithm. Because values in the distance matrix (Equation 6) become smaller, the number of discovered clusters decreases when more iterations are performed. Although both the purity rate and the number of discovered clusters decrease, we observe the clustering performance improves in terms of NMI. However, the gain is increasingly limited after the first round. That is, the use of social cues provides significant gain in measuring the face similarities in the beginning, but the refined social graphs may not provide much information in resolving identity ambiguities. We conclude that a practical strategy to improve clustering is to use the appearance-based approach to obtain a large number and relatively pure clusters and then apply a social-context enhanced approach to refine the clusters.

Table 2. Clustering performance in each iteration.

Iteration 0 1 2 3 4
No. of clusters 134 107 89 78 60
Purity 0.8954 0.8725 0.8586 0.8255 0.8164
NMI 0.4881 0.5111 0.5139 0.5226 0.5254

As a fully automated system, our experiment shows that the proposed method can be used to obtain a social network nearly as good as ground truth. To extend this work, we have been developing a practical movie content management system that can be used to label faces and generate relationship charts of characters, benefiting from the proposed techniques and the empirical results.
1. O. Arandjelovic and A. Zisserman, “Automatic Face Recognition for Film Character Retrieval in Feature-Length Films,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR), 2005, pp. 860–867.
2. R.G. Cinbis, J. Verbeek, and C. Schmid, “Unsupervised Metric Learning for Face Identification in TV Video,” Proc. IEEE Int'l Conf. Computer Vision (ICCV), 2011, pp. 1559–1566.
3. J. Sivic, M. Everingham, and A. Zisserman, “Who Are You? Learning Person Specific Classifiers from Video,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1145–1152.
4. W. Zhaoetal., “Face Recognition: A Literature Survey,” ACM Computing Surveys, vol. 35, no. 4, 2003, pp. 399–458.
5. C.-Y. Weng, W.-T. Chu, and J.-L. Wu, “RoleNet: Movie Analysis from the Perspective of Social Network,” IEEE Trans. Multimedia, vol. 11, no. 2, 2009, pp. 256–271.
6. P. Wu and D. Tretter, “Close & Closer: Social Cluster and Closeness from Photo Collections,” Proc. ACM Int'l Conf. Multimedia, 2009, pp. 709–712.
7. Z. Stone, T. Zickler, and T. Darrell, “Autotagging Facebook: Social Network Context Improves Photo Annotation,” Proc. IEEE Int'l Workshop Internet Vision, Computer Vision and Pattern Recognition Workshops, 2008, pp. 1–8.
8. P. Wu and F. Tang, “Improving Face Clustering Using Social Context,” Proc. ACM Int'l Conf. Multimedia, 2010, pp. 907–910.
9. M. Yeh, M.-C. Tseng, and W.-P. Wu, “Automatic Social Network Construction from Movies Using Film-Editing Cues,” Proc. IEEE Int'l Conf. Multimedia and Expo Workshops (ICMEW), 2012, pp. 242–247.
10. Q. Yin, X. Tang, and J. Sun, “An Associate-Predict Model for Face Recognition,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 497–504.
11. R. Grossetal., “Multi-PIE,” Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, 2008; http://research.microsoft.com/pubs/69512/multipie-fg-08.pdf.
12. T. Ahonen, A. Hadid, and M. Pietikäinen, “Face Description with Local Binary Pattern: Application to Face Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 12, 2006, pp. 2037–2041.
13. B.J. Frey and D. Dueck, “Clustering by Passing Messages Between Data Points,” Science, vol. 315, 2007, pp. 972–976.
Mei-Chen Yeh is an assistant professor in the Computer Science and Information Engineering Department at National Taiwan Normal University. Her research interests include multimedia computing, visual retrieval, and pattern recognition. Yeh has a PhD in electrical and computer engineering from the University of California, Santa Barbara. She is a member of IEEE. Contact her at myeh@csie.ntnu.edu.tw.
Wen-Po Wu is a research and development software engineer at Reallusion. His research interests include multimedia computing, animation, and computer graphics. Wu has an MS in computer science and information engineering from National Taiwan Normal University. Contact him at robertWu@reallusion.com.tw.
[%= name %]
[%= createDate %]
[%= comment %]
Share this:
Please login to enter a comment:

Computing Now Blogs
Business Intelligence
by Ray Major
Cloud Computing
A Cloud Blog: by Irena Bojanova
Enterprise Solutions
Enterprise Thinking: by Josh Greenbaum
Healthcare Technologies
The Doctor Is In: Dr. Keith W. Vrbicky
Hot Topics
NealNotes: by Neal Leavitt
Industry Trends
Mobile Computing
Shay Going Mobile: by Shay Shmeltzer
NGN-Insights: by Martin Nuss and Uday Mudoi
No Batteries Required: by Ray Kahn
Software Technologies: by Christof Ebert