Race, Religion and the City: Twitter Word Frequency Patterns Reveal Dominant Demographic Dimensions in the United States

Authors

  • Eszter Bokányi
  • Dániel Kondor
  • Laszlo Dobos
  • Tamas Sebok
  • József Stéger
  • István Csabai
  • Gábor Vattay
Abstract

environments are connected to real-world phenomena. Two common data sources are mobile phone networks, where user activity and aggregated measures of network utilization are recorded at the antenna level as part of regular operation [14] and online social networks (OSNs) [15], where the content publicly shared by users in many cases includes their position [16]. Some other data sources with promising application possibilities include monetary transactions [17–19], GPS traces from cars [20, 21] and other devices and public transportation usage as recorded by electronic payment systems [22, 23]. Using these data, previous research has shown that it is possible to obtain accurate and up-to-date measures of population density [9] or crowd size at sports events or in airports [10]. Furthermore, the demographic features of a city or a country can be estimated by parsing OSN user names or user profile descriptions [24, 25]. By focusing on the community structure instead of estimating features of individuals, networks of connections among mobile phone or social network users reveal geographic clustering on large scales [18, 26, 27], Twitter users’ language choice reflects different cultural communities [28], while user activity has been used on urban scales as an innovative method of land use detection [13, 29, 12, 11]. In addition to land use data, commuting and mobility patterns in the city [30, 31] and larger scale travel trends can also be investigated with the help of mobile and OSN networks [32–34]. Apart from looking at the spatio-temporal patterns, analysing the content of users posted in OSNs can provide further insights, adapting text mining methods and results which have been previously developed and obtained on the growing corpus of digital texts [35–39]. 
From predicting heart-disease rates of an area based on its language use [40], connecting health measures to photo scenicness ratings [41] or relating unemployment to social media content [42, 43] to forecasting stock market moves from search semantics [44], many studies have attempted to connect online media language and metadata to real-world outcomes. Various studies have analyzed spatial variation in the OSN messages’ texts and its applicability to several different questions, including user localization based on the content of their posts [45, 46], empirical analysis of the geographic diffusion of novel words, phrases, trends and topics of interest [47, 48], measuring public mood [49]. In these studies, either a priori models were used, or a model was built with a supervised learning method, with a focus on the specific phenomenon, meaning the exploitation of only one aspect (user name, user profile description, misspelled words, words connected to fatigue etc.), yet possibly neglecting the dataset’s other features. While being effective, there remain the following questions: (a) what are main patterns in the data in general; (b) can they be discovered without making a priori assumptions about what to look for; (c) can we relate these patterns to relevant real social phenomena. In this study our goal is to analyze in an unsupervised manner how and to what extent regional-scale demographic attributes are represented in social media posts. We approach this using geo-tagged short messages (tweets) posted on the Twitter microblogging service as a source of large-scale digital corpus. We employ a combination of Latent Semantic Analysis (LSA) [36] and Robust Principal Component Analysis (RPCA) [50, 51], which permits us the automated identification of the most significant topics and language use features with regional variation on Twitter. We use tweets posted in the USA over a 3-year period aggregated at the county-level. 
This allows comparison with census data at the same level, and thus lets us draw some hypotheses about the driving forces behind regional language dissimilarity patterns.

2 Methods

2.1 Twitter dataset

We use the data stream freely provided by Twitter through their Application Programming Interface (API), which amounts to approximately 1% of all sent messages. In this study, we focus on the part of the data stream with geolocation information. These geolocated tweets originate from users who chose to allow their mobile phones to post GPS coordinates along with a Twitter message. The total geolocated content was found to comprise only a small percentage of all tweets; therefore, with data collection focusing only on these, a large fraction of all geo-tagged tweets can be gained [52]. Our dataset includes a total of 335 million tweets from the contiguous United States of America collected between February 2012 and June 2013. These are all geotagged, that is, they have GPS coordinates associated with them. We construct a geographically indexed database of these tweets, permitting the efficient analysis of regional features [53]. Using the Hierarchical Triangular Mesh (HTM) scheme for practical geographic indexing [54, 55], we assigned a US county to each tweet. County borders are obtained from the GADM database (http://gadm.org).

2.2 Latent Semantic Indexing and Robust Principal Component Analysis

We aim to use a type of vector space model on our Twitter corpus, where documents correspond to county-level aggregated tweets. The terms we consider are raw words obtained after a tokenization process; that is, we apply a ’word-bag’ (bag-of-words) approach to our documents, effectively limiting any analysis to word frequencies and ignoring relations among words and longer phrases. We filter stop-words in several languages (the most important being English and Spanish) to remove the most common but uninformative terms from our data.
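The word-bag aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the regular-expression tokenizer, the tiny stop-word set, the function names and the county identifiers are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word set (assumption); a real run would load full
# English and Spanish lists, and possibly lists for further languages.
STOP_WORDS = {"the", "a", "and", "to", "of", "el", "la", "de", "y", "que"}

# Runs of letters, optionally followed by an apostrophe and more letters.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")


def tokenize(text):
    """Lowercase a tweet and split it into word tokens, dropping stop-words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


def county_word_counts(tweets):
    """Aggregate (county_id, text) pairs into one bag of word counts per county."""
    counts = defaultdict(Counter)
    for county_id, text in tweets:
        counts[county_id].update(tokenize(text))
    return counts


# Hypothetical input: tweets already assigned to a county identifier.
tweets = [
    ("01001", "Gotta love the pizza"),
    ("01001", "Pizza and coffee"),
    ("06037", "La playa de Malibu"),
]
bags = county_word_counts(tweets)
```

Each county ends up with a single bag of counts, so word order and phrases are discarded, which is exactly the limitation the text mentions.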
We construct a term-document matrix $W_{ij}$ as the number of occurrences of the i-th word in the j-th county. As the population density of the USA is very heterogeneous, the number of word occurrences in each county is also heterogeneous. To improve the quality of the dataset, we only include counties that contain at least 10000 occurrences of at least 500 individual words. We also limit the words used to those with at least 10000 occurrences in at least 1000 individual counties. This leaves 2800 counties and 10132 words, which form the $W_{ij}$ word occurrence matrix. We normalize $W_{ij}$ so that the elements are the relative frequencies of words in each county: $X_{ij} \equiv W_{ij} / \sum_k W_{kj}$, i.e. we normalize each element by the total number of words posted in that county; this corresponds to relative term-frequency weighting in the text-mining literature.

To identify all possible regional characteristics of language usage, we rely on techniques known from the field of natural language processing. There exist many feature or topic extraction methods, all of them aiming to reduce the dimensionality of the data by finding related or similar words and documents. A common approach is Latent Semantic Analysis (LSA) [36, 56], which applies Singular Value Decomposition (SVD) to a word-by-document matrix derived from the corpus. This method groups words together based on their semantic similarity [35], creating ’feature’ documents, of which the first few represent the concepts causing the most variation in the data. A notable advantage of LSA is that it is an unsupervised learning method, and thus provides information about the corpus without using a priori assumptions or any arbitrary preselections based on the purpose of the examination.

Owing to the nature of our dataset, several accounts generate automated messages, such as weather stations, advertisers, or tornado and earthquake advisories, which we consider noise in our investigation.
Especially in sparsely inhabited areas, these outlier messages can account for a large fraction of the dataset. Also, highly localized features, such as tourist attractions, can generate outliers of significant volume. As a result, highly localized outliers can dominate the results of the SVD, making it challenging to identify relevant structure. Applying the Robust PCA method [51, 50] allows us to preprocess the matrix before further analysis by separating it into a low-rank and a sparse part, whose principal components can then be computed and analyzed separately. This means that the original data matrix is written as a sum of two parts:

$$X = X^S + X^L, \tag{1}$$

where $X^S$ is a sparse matrix and $X^L$ contains the dense but low-rank part of the data. The mathematical condition for finding $X^S$ and $X^L$ is minimizing the sum

$$\lambda \|X^S\|_1 + \|X^L\|_\sigma, \tag{2}$$

where, for a matrix $X$ of dimensions $n_1 \times n_2$ with $n_1 \ge n_2$, $\lambda \equiv 1/\sqrt{n_1}$, and the norms are the $\ell_1$ and nuclear norms, respectively:

$$\|X\|_1 = \sum_{ij} |X_{ij}|, \qquad \|X\|_\sigma = \sum_i \sigma_i(X). \tag{3}$$

Here $\sigma_i(X)$ denotes the i-th singular value of $X$. An efficient algorithm for finding $X^S$ and $X^L$ is the inexact augmented Lagrangian method [50] (Matlab code developed by the authors of [50] implementing the algorithm is publicly available; see footnote 2). Employing this method results in the sparse part containing most of the outliers, and in true language use variations being represented in the low-rank part. Due to the structure of our data matrix and the employed Robust PCA method, we choose not to subtract averages from the data; as a consequence, average word frequencies are expected to dominate the first principal component. We further analyze only the results of the LSA of the low-rank component.
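To make the preprocessing chain concrete, the following NumPy sketch normalizes a count matrix to relative frequencies, splits it into a sparse and a low-rank part with an inexact augmented Lagrangian iteration, and then takes the SVD of the low-rank part. It is a simplified illustration, not the Matlab code referenced above: the function names, the step-size parameters (`rho`, the initial and maximal `mu`), the stopping tolerance and the synthetic input matrix are assumptions.

```python
import numpy as np


def shrink(M, tau):
    """Soft-thresholding: move every entry of M towards zero by tau."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)


def robust_pca(X, tol=1e-7, max_iter=500, rho=1.5):
    """Split X into a sparse part X_S and a low-rank part X_L (inexact ALM sketch)."""
    lam = 1.0 / np.sqrt(max(X.shape))              # lambda = 1/sqrt(n1), n1 >= n2
    spectral = np.linalg.norm(X, 2)                # largest singular value of X
    Y = X / max(spectral, np.abs(X).max() / lam)   # initial Lagrange multiplier
    mu = 1.25 / spectral                           # penalty parameter (assumed schedule)
    mu_max = 1e7 * mu
    X_S = np.zeros_like(X)
    x_norm = np.linalg.norm(X, "fro")
    for _ in range(max_iter):
        # Low-rank update: soft-threshold the singular values.
        U, s, Vt = np.linalg.svd(X - X_S + Y / mu, full_matrices=False)
        X_L = (U * shrink(s, 1.0 / mu)) @ Vt
        # Sparse update: soft-threshold the entries.
        X_S = shrink(X - X_L + Y / mu, lam / mu)
        residual = X - X_L - X_S
        Y = Y + mu * residual
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(residual, "fro") / x_norm < tol:
            break
    return X_S, X_L


rng = np.random.default_rng(0)
# Synthetic stand-in for the word-by-county count matrix W (assumption):
# a rank-3 background plus a few large localized spikes.
W = rng.random((60, 3)) @ rng.random((3, 40))
W[rng.random(W.shape) < 0.03] += 20.0

X = W / W.sum(axis=0, keepdims=True)       # relative word frequencies per county
X_S, X_L = robust_pca(X)
U, s, Vt = np.linalg.svd(X_L, full_matrices=False)  # LSA on the low-rank part
```

In this layout the columns of `U` play the role of the left singular vectors (word weights) and the rows of `Vt` the right singular vectors (county weights). How cleanly the spikes end up in `X_S` depends on the data, so the sketch makes no claim about the recovered rank.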
Footnote 2: http://perception.csl.illinois.edu/matrix-rank/sample_code.html

2.3 Demographic data

To discover possible governing factors of the geographical language variation patterns and connections between topics and their geography, we correlate right singular vectors with a variety of demographic data series: the 2010 US Census (footnote 3), the 2011 American Community Survey (ACS) 5-year estimates concerning educational attainment by county (footnote 4), county business patterns according to the North American Industry Classification System (NAICS) (footnote 5), and church adherence rates and congregation numbers per county provided by the Association of Religion Data Archives (ARDA) (footnote 6).

2.4 Boolean relationship detection

Apart from evaluating linear correlation measures with the singular vectors, we also carry out a boolean relationship detection, using the methodology of Sahoo et al. [57]. It is based on calculating a test statistic from the contingency table of the scatterplots (e.g. Fig. 3f-j; see the next section for an interpretation of the results displayed) after dividing the data into four segments with a horizontal and a vertical limit. We find the most significantly sparse segment by setting the limits so that the test statistic is maximal for the specific segment. During the calculations, we set an error bar on both sides of the limits, and points falling in this error zone are not taken into consideration when testing for sparseness. If the contingency table is

          A low    A high    Σ
B low     m00      m01       b0
B high    m10      m11       b1
Σ         a0       a1        s

the test statistic for the four segments is

$$\delta = \frac{m_{ij} - \langle m_{ij} \rangle}{\sqrt{\langle m_{ij} \rangle}},$$

where $\langle m_{ij} \rangle$ denotes the expected value in the case of independent variables,

$$\langle m_{ij} \rangle = \frac{a_i}{s} \frac{b_j}{s} \cdot s.$$

If there are some points left in the segment, they are considered as an error, and the measure of error is

$$\varepsilon = \frac{1}{2} \left( \frac{m_{ij}}{m_{i0} - m_{i1}} + \frac{m_{ij}}{a_i} \right).$$

We consider a segment significantly sparse if $\delta > 3$ and $\varepsilon < 0.2$.
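For a single pair of limits, the sparseness test can be sketched in Python as below. The sketch follows the general recipe of Sahoo et al. [57] and rests on two assumptions that may differ in detail from the formulas as printed here: the statistic is taken as expected minus observed, so that a sparse segment gives a positive δ, and the error is computed from the row and column totals of the segment. The function name, the margin parameters and the synthetic data are hypothetical.

```python
import numpy as np


def sparse_segment_stats(a, b, a_limit, b_limit, a_margin=0.0, b_margin=0.0):
    """Return {(b_level, a_level): (delta, eps)} for the four segments.

    Levels are 0 (low) and 1 (high). Points closer to a limit than the
    margin are skipped, mimicking the error zone around the limits.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    keep = (np.abs(a - a_limit) > a_margin) & (np.abs(b - b_limit) > b_margin)
    a_high = (a[keep] > a_limit).astype(int)
    b_high = (b[keep] > b_limit).astype(int)
    m = np.zeros((2, 2))
    np.add.at(m, (b_high, a_high), 1)        # m[row = B level, column = A level]
    s = m.sum()
    if s == 0:
        raise ValueError("no points left outside the error zone")
    stats = {}
    for i in (0, 1):
        for j in (0, 1):
            row, col = m[i].sum(), m[:, j].sum()
            expected = row * col / s         # count expected under independence
            if expected == 0:
                stats[(i, j)] = (0.0, 1.0)   # empty row or column: no evidence
                continue
            # Assumed sign convention: positive when the segment is sparse.
            delta = (expected - m[i, j]) / np.sqrt(expected)
            # Assumed error: mean share of the row and of the column in the segment.
            eps = 0.5 * (m[i, j] / row + m[i, j] / col)
            stats[(i, j)] = (delta, eps)
    return stats


rng = np.random.default_rng(1)
a = rng.random(2000)
# Synthetic implication "A high => B high": the (B low, A high) segment stays empty.
b = np.where(a > 0.5, 0.5 + 0.5 * rng.random(2000), rng.random(2000))
stats = sparse_segment_stats(a, b, a_limit=0.5, b_limit=0.5,
                             a_margin=0.01, b_margin=0.01)
delta, eps = stats[(0, 1)]
is_sparse = delta > 3 and eps < 0.2
```

Scanning a grid of candidate limits and keeping the pair with the largest δ among those with acceptable ε would then be a loop around this function.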
Then, over the whole range of variables A and B (using 100 steps in both directions and an error boundary of 1.5% for the skipping of points near the borders), we measure the δ and ε values, and take the segmentation with the maximum δ for the sparse areas where ε is still low enough.

Footnote 3: http://www2.census.gov/census_2010/, http://www.census.gov/support/USACdataDownloads.html
Footnote 4: http://www.census.gov/programs-surveys/acs/
Footnote 5: http://www.census.gov/econ/cbp/
Footnote 6: http://www.thearda.com

3 Results

Using a corpus of over 335 million geo-tagged tweets posted in the USA, we compile word-frequency distributions for each US county, and then apply the automatic filtering and feature selection method described above. We analyze the features found with this technique by considering the connection between geographic and semantic distances (Fig. 1.), and by plotting right singular vectors on the map (Fig. 2a-e.) and displaying left singular vectors as wordclouds (positive weights in Fig. 2f-j., negative weights in Fig. 2k-o.).

First, we find that the method applied successfully uncovers some coherent topics, especially in the first few singular vectors, where singular values are still large enough for the topic to account for a significant share of the variance of the dataset. As we deliberately choose not to subtract averages from the $X_{ij}$ matrix, the first component shows no discernible pattern, and corresponds to the most common words in the sample. From the second singular vector onwards, however, one or both ends (either negative or positive, as singular vectors can arbitrarily be multiplied by minus one) of each of the most important semantic features on the wordclouds can be related to a certain language style, concept or lifestyle. The words giving the largest contribution to the pattern of the second left singular vector (Fig. 2f.) mark a strong presence of slang in the sample.
This includes forms with alternate spelling like ’aint’, ’gotta’; swearing like ’ass’, ’hoe’, ’bitch’; abbreviations of common phrases like ’tryna’, ’imma’, ’kno’, ’yall’; OSN-specific slang such as ’oomf’ which stands for ’one of my followers’ (i.e. on Twitter); a very specific misspelling of ’goodmorning’ (instead of ’good morning’); and variations of the racial slurs ’nigga’ and ’niggas’. Swear words and abbreviations typical for online language also dominate this end of the component. The next most important feature, which can be found in the third vector (Fig. 2l.), identifies words connected to urban lifestyle like eating out (’pizza’, ’grill’), drinking coffee (’coffee’, ’cafe’, ’starbucks’), education (’university’, ’library’, ’campus’) or working out (’gym’, ’fitness’). Further dominating concepts are travel (’enjoying’, ’trip’, ’pic’, ’hotel’) in the fourth singular vector (Fig. 2h.) and religion (’lord’, ’prayers’, ’praying’, ’blessed’) alongside with positive content (’glad’, ’thankful’, ’wonderful’, ’proud’) in the negatively weighed words of the fifth singular vector (Fig. 2n.). In this case, the opposite end can also be easily interpreted: the faith-related words in the fifth component are countered by an increased usage of profanity present among words with positive weights (Fig. 2i.). This might be the consequence of people tweeting about religious topics also trying to avoid swearing; this hypothesis can also be supported with less strong swearing alternatives (’crap’, ’freaking’, ’dang’) prevailing among the negatively weighed words along the religious words. If the native language of a group is different from that of the majority, the words of this different language also stand out from the overall structure, as there is naturally a stronger correlation among words belonging to the same language. Therefore the applied method can discover languages different from that of the bulk of the sample. 
In the sixth singular vector, we can observe this phenomenon with Spanish words, which form more than a third of the positively weighed wordcloud (Fig. 2j.). The English terms ’Mexico’ and ’Mexican’ also appear in this group, which shows that concepts related to the topic are also identified even if they do not belong to the discovered language.

Similarly to topic identification, where semantically close words form topics, analyzing regional patterns reveals documents that are close to each other in the semantic space spanned by these topics. Plotting the right singular vectors on a map (Fig. 2a-e.), the most striking feature is the regional proximity of documents having close weights in the singular vectors. Document-by-document (county-by-county) Euclidean distances in the PCA subspace of the first 25 components as a function of real county-by-county centroid distances (footnote 7) illustrate this observation. In Fig. 1., mean PCA subspace distances (red dots) are plotted for each 40 km range of real county centroid distances. As a baseline, the same is done for a random permutation of counties (blue dots). It is remarkable that below 500 km, counties are closer in the semantic space than could be expected from a random realization. From 700 km to 1800 km, semantic distance is greater than it would be randomly. Geographical proximity is thus a main driving force in the similarity of language patterns in Twitter-space.

Analyzing these geographical patterns in each singular vector provides insights into the regional distribution of the individual topics. On a US map, the second component (Fig. 2a.), which is responsible for the most variance in the Twitter data, emerges as a block in the Southeastern part of the US. Apart from the big Southeastern block, Chicago and Detroit are also marked by this pattern of language usage. In the third component (Fig. 2b.), negative weights (brown patches) mark the biggest cities and the surrounding counties that belong to their agglomerations.
The most positive pattern of the fourth component (Fig. 2c.) reveals some important tourist attractions such as the centers of New York, Washington and San Francisco, the Craters of the Moon National Monument and Preserve in Idaho, the Aspen Mountain ski area in Colorado, or Hawaii. The regional pattern of the fifth component (Fig. 2d.) is less obvious, though a part of the central US and the Southeastern block is discernible in the religion-related end of the component. The sixth component distinguishes the Southwestern part and the Northwestern corner of the US (Fig. 2e.), Florida and some bigger cities such as New York or Chicago.

Footnote 7: http://cta.ornl.gov/transnet/SkimTree.htm

To discover possible governing factors of the geographical language variation patterns and their relation to demography, we calculate Pearson correlation values between right singular vectors and data obtained from the US Census Bureau described in Section 2.3. Data series that have the greatest absolute correlation values (p<0.0001, Bonferroni-corrected) with each component are shown in Table 1. The large correlation (0.872) of the second component with the population proportion of African-Americans per county indicates that the observed slang words and the blockwise regional pattern are linked to the presence of this demographic group (note, however, that we have no evidence of whether the tweets causing the variation were indeed posted by African-American people). Fig. 3a. shows the census proportions on a US map, with the regional pattern approximately corresponding to that of the singular vector. It is worth noting that apart from the large Southeastern block, Chicago and Detroit are also marked by the characteristic slang word pattern, as well as by a higher proportion of African-American population.
A sizeable correlation (0.500) with ethnicity (Hispanic or Latino origin) also arises in the case of the sixth component, as expected from the observed Spanish words and the Southwestern positive weights on the map. Fig. 3. shows the percentage of people of Hispanic or Latino origin in US counties, the distribution resembling that of the right singular vector.

The data series that show the largest correlation with the third component are resident total population rank (0.844) and rural-urban continuum code (footnote 8) (0.630). Since neither is a continuous variable, we instead show population density values in each county on the map of Fig. 3b. Densely populated areas mark the biggest cities of the US and their surrounding agglomerations, and these areas are also discernible in the brown patches of the third singular vector in Fig. 2b. This confirms that the most densely populated areas form the negative end of the third singular vector in both the words and their regional distribution. A basic feature of the Twitter corpus is thus linked simply to city lifestyle, and more generally to the associated socioeconomic status.

Correlation values show whether there exists some relation between the language patterns and demographic data (see Table 1). Analyzing scatterplots of the greatest correlations gives us some insight into the structure of these relations. Plotting the regional weights of the second and sixth singular vectors against African-American and Hispanic or Latino ethnicity percentages exhibits very similar features (Fig. 3f,3j). Correlation analysis also revealed that the prevalence of evangelical religious groups (Baptists and Methodists) is related (-0.372) to the religious content of the fifth component (see Fig. 3i); county-level rates of adherence to evangelical churches are plotted in Fig. 3d.
The existence of a virtual ’Bible Belt’ is thus confirmed in the Twitter-space, in line with earlier identification of religious groups in cyberspace [58, 59]. An opposite correlation is present with Catholic and Orthodox churches, which we speculate to be the consequence of these having a smaller attendance in counties where evangelical churches are more prominent.

Although almost all of the above-described correlations could be explained by an underlying function, a boolean implication model description seems more plausible. Boolean implications have already been used in gene expression research [57] to uncover non-symmetric relationships where correlation analysis would only partially, or not at all, measure the connection between two variables. In the case of ethnicities, if we take y values as a measure of how strongly slang (Fig. 3f) or Spanish (Fig. 3j) (see the wordclouds of Fig. 2f and Fig. 2j) is present in the Twitter messages of the counties, we can observe that below a certain ratio of ethnicity prevalence (6.0% in the second component and 7.6% in the sixth), language patterns show different levels of non-slang or non-Spanish usage. If ethnicity prevalence is greater than the threshold value, slang or Spanish usage rises steeply with growing ethnicity proportion. Above the threshold, there are very few counties with non-slang or non-Spanish language patterns. In this terminology, the two scatterplots corresponding to ethnicity prevalence can be translated to the implication ’high ethnicity rates ⇒ missing non-slang/non-Spanish words’. The limits corresponding to the best implication model were the mentioned 5.99%±1.28% and 7.65%±1.43% of prevalence for the two ethnic groups, with −0.00328±0.00123 and −0.00277±0.00209 as limits on the axes of the second and sixth components. The measures of sparseness for the lower right segments were δ = 21.941, ε = 0.036 and δ = 12.98, ε = 0.15, respectively.
A boolean implication also describes the scatterplot of the fifth component measured against evangelical adherence rates (Fig. 3i). Here the y axis represents a level of swearing (cf. the words in Fig. 2i) present in the tweets posted in a county. Thus the implication can be translated to ’high evangelical prevalence rates ⇒ low swearing level’. The pattern implies a stronger connection between the two variables than could be inferred from the symmetric correlation measure. It seems as if, above a certain adherence rate, a text with a high swearing level could not propagate further or could not find its way to broader discussion. Here the automatically detected limit was at a 19.67%±1.55% adherence level, the limit on the ’swearing’ axis lies at 0.02230±0.00177, and the lower right corner showed a significantly large sparseness with δ = 8.579 and ε = 0.013.

Footnote 8: http://www.ers.usda.gov/data-products/rural-urban-continuum-codes.aspx

4 Discussion

We can conclude that the applied unsupervised learning method successfully discovers topics and their regional patterns in the Twitter-sphere, with county weights in right singular vectors representing a distance in the semantic space along a topic given by the word weights of the left singular vectors. It is also remarkable that geographical closeness implies closeness in the semantic space, which suggests that language usage is to a certain degree bound to geographical proximity. We also find that regional patterns in language use are driven not just by geographical proximity, but also by socioeconomic and cultural similarities, like degree of urbanization, religion or ethnicity. It seems that the most important factor behind the variation in the language use of different counties is the presence of African-American ethnicity, as confirmed by the significant correlation between the census-based share of African-American population and the appropriate county weights.
Corresponding word weights mirror this observation with words representative of the typical slang use associated with this ethnicity. This type of slang use thus turns out to be the most distinguishing factor in everyday US Twitter conversation. Following ethnicity, the second most important feature found in Twitter language is related to the population density of a county. The interpretation could be that beyond ethnicity, our everyday language is largely influenced by our surroundings. Thus living in densely populated places, which means mostly living in urban areas, results in words specific to urban lifestyle appearing more frequently in user messages. The language footprints of tourism can also be captured by our method, suggesting that the effect of messages or users being on a holiday should always be considered, when trying to relate online content to real-world phenomena. Some relations are better described by a non-symmetric boolean implication model instead of the symmetric correlation measure. We find that the presence of ethnic groups above a certain threshold implies a weight greater than a certain level along the semantic axis corresponding to the component connected to this ethnic group. We also find that counties exhibiting high evangelical adherence rates show low level on the ’swearing scale’ given by the corresponding component. This is interesting, since the phenomenon cannot be observed with the two other major denominations, the Catholic and Orthodox churches. It suggests that the online presence of Evangelical churches is inherently different from that of the other denominations, and its adherents have a significant effect on the word choice on the Twitter platform. Our results suggest that online social network activity can be used effectively to monitor the spatial variation of cultural traits as represented in language use, yielding an up-to-date picture of important social phenomena. 
We believe our present study demonstrates an approach for measuring the importance of certain demographic attitudes when working with textual Twitter data. We suggest, therefore, that it could form the basis of further research focusing on the evaluation of demographic data estimation from other sources, or on the dynamical processes that result in the patterns found here. While our results were obtained using the Twitter microblogging platform, research could be further extended to investigate whether the incorporation of other metadata (e.g. user activity, user mobility, user profile descriptions etc) or the analysis of different text sources could refine or enhance our findings. Acknowledgements The authors would like to thank the partial support of the European Union and the European Social Fund through the FuturICT.hu project (Grant No.: TAMOP-4.2.2.C11/1/KONV-2012-0013), the OTKA103244, OTKA-114560, Ericsson and the MAKOG Foundation. References [1] D. Brain. From Good Neighborhoods to Sustainable Cities: Social Science and the Social Agenda of the New Urbanism. International Regional Science Review, 28(2):217–238, 2005. [2] Lincoln Quillian. Migration Patterns and the Growth of HighPoverty Neighborhoods , 1970 – 1990 1. American Journal of Sociology, 105(1):1–37, 1999. [3] Robert J Sampson. Disparity and diversity in the contemporary city: social (dis)order revisited. The British Journal of Sociology, 60(1):1–31, March 2009. 7 [4] John Iceland and Rima Wilkes. Does Socioeconomic Status Matter? Race, Class, and Residential Segregation. Social Problems, 53(2):248–273, May 2006. [5] Elizabeth E. Bruch and Robert D. Mare. Neighborhood Choice and Neighborhood Change. American Journal of Sociology, 112(3):667–709, November 2006. [6] Lúıs M a Bettencourt, José Lobo, Dirk Helbing, Christian Kühnert, and Geoffrey B West. Growth, innovation, scaling, and the pace of life in cities. 
Proceedings of the National Academy of Sciences of the United States of America, 104(17):7301–7306, 2007.

[7] David Cummings, Haruki Oh, and Ningxuan Wang. Who Needs Polls? Gauging Public Opinion from Twitter Data. 2012.

[8] Brendan O’Connor, Ramnath Balasubramanyan, Bryan R Routledge, and Noah A Smith. From tweets to polls: Linking text sentiment to public opinion time series. In ICWSM, volume 11, pages 1–2, 2010.

[9] Pierre Deville, Catherine Linard, Samuel Martin, Marius Gilbert, Forrest R Stevens, and Andrea E Gaughan. Dynamic population mapping using mobile phone data. Proceedings of the National Academy of Sciences, 111(45):15888–15893, 2014.

[10] Federico Botta, Helen Susannah Moat, and Tobias Preis. Quantifying crowd size with mobile phone and Twitter data. Royal Society Open Science, 2(5):150162, 2015.

[11] Thomas Louail, Maxime Lenormand, Oliva G Cantú-Ros, Miguel Picornell, Ricardo Herranz, Enrique Frias-Martinez, José J Ramasco, and Marc Barthélemy. From mobile phone data to the spatial structure of cities. Scientific Reports, 4:5276, 2014.

[12] Vanessa Frias-Martinez and Enrique Frias-Martinez. Spectral clustering for sensing urban land use using Twitter activity. Engineering Applications of Artificial Intelligence, 35(10):237–245, 2014.

[13] Jonathan Reades, Francesco Calabrese, Andres Sevtsuk, and Carlo Ratti. Cellular census: Explorations in urban data collection. IEEE Pervasive Computing, 6(3):30–38, 2007.

[14] Vincent D Blondel, Adeline Decuyper, and Gautier Krings. A survey of results on mobile phone datasets analysis. EPJ Data Science, 4(1):10, 2015.

[15] Alan Mislove. Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems. PhD thesis, Rice University, 2009.

[16] Zhiyuan Cheng, James Caverlee, Kyumin Lee, and Daniel Z. Sui. Exploring Millions of Footprints in Location Sharing Services. In International AAAI Conference on Web and Social Media, pages 81–88, 2011.

[17] D Brockmann, L Hufnagel, and T Geisel. The scaling laws of human travel. Nature, 439(7075):462–465, 2006.

[18] Christian Thiemann, Fabian Theis, Daniel Grady, Rafael Brune, and Dirk Brockmann. The structure of borders in a small world. PLoS ONE, 5(11):e15422, 2010.

[19] Stanislav Sobolevsky, Izabela Sitko, Remi Tachet, Juan Murillo Arias, and Carlo Ratti. Cities through the Prism of People’s Spending Behavior. PLoS ONE, 11(2):e0146291, 2016.

[20] Luca Pappalardo, Salvatore Rinzivillo, Zehui Qu, Dino Pedreschi, and Fosca Giannotti. Understanding the patterns of car travel. The European Physical Journal Special Topics, 215(1):61–73, 2013.

[21] Luca Pappalardo, Filippo Simini, Salvatore Rinzivillo, Dino Pedreschi, Fosca Giannotti, and Albert-László Barabási. Returners and explorers dichotomy in human mobility. Nature Communications, 6:8166, 2015.

[22] Camille Roth, Soong Moon Kang, Michael Batty, and Marc Barthélemy. Structure of Urban Movements: Polycentric Activity and Entangled Hierarchical Flows. PLoS ONE, 6(1):e15923, 2011.

[23] Samiul Hasan, Christian Schneider, Satish Ukkusuri, and Marta González. Spatiotemporal Patterns of Urban Human Mobility. Journal of Statistical Physics, 151(1/2):304–318, 2013.

[24] Luke Sloan, Jeffrey Morgan, Pete Burnap, and Matthew Williams. Who Tweets? Deriving the Demographic Characteristics of Age, Occupation and Social Class from Twitter User Meta-Data. PLoS ONE, 10(3):e0115545, 2015.

[25] Paul A Longley, Muhammad Adnan, and Guy Lansley. The geotemporal demographics of Twitter usage. Environment and Planning A, 47(2):465–484, 2015.

[26] Stanislav Sobolevsky, Michael Szell, Riccardo Campari, Thomas Couronné, Zbigniew Smoreda, and Carlo Ratti. Delineating geographical regions with networks of human interactions in an extensive set of countries. PLoS ONE, 8(12):e81707, 2013.

[27] Zsófia Kallus, Norbert Barankai, János Szüle, and Gábor Vattay. Spatial Fingerprints of Community Structure in Human Interaction Network for an Extensive Set of Large-Scale Regions. PLoS ONE, 10(5):e0126713, 2015.

[28] Delia Mocanu, Andrea Baronchelli, Nicola Perra, Bruno Gonçalves, Qian Zhang, and Alessandro Vespignani. The Twitter of Babel: Mapping World Languages through Microblogging Platforms. PLoS ONE, 8(4):e61981, 2013.

[29] Sébastian Grauwin, Stanislav Sobolevsky, Simon Moritz, István Gódor, and Carlo Ratti. Towards a comparative science of cities: using mobile traffic records in New York, London and Hong Kong, volume 13 of Geotechnologies and the Environment, pages 363–387. 2014.

[30] Marta C González, César A Hidalgo, and Albert-László Barabási. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 2008.

[31] Shan Jiang, Joseph Ferreira, and Marta C González. Activity-Based Human Mobility Patterns Inferred from Mobile Phone Data: A Case Study of Singapore. In Int. Workshop on Urban Computing, 2015.

[32] Eunjoon Cho, SA Myers, and Jure Leskovec. Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1082–1090, 2011.

[33] Bartosz Hawelka, Izabela Sitko, Euro Beinat, Stanislav Sobolevsky, Pavlos Kazakopoulos, and Carlo Ratti. Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science, 41(3):260–271, 2014.

[34] Filippo Simini, Marta C. González, Amos Maritan, and Albert-László Barabási. A universal model for mobility and migration patterns. Nature, 484(7392):96–100, 2012.

[35] TK Landauer and ST Dumais. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 1(2):211–240, 1997.

[36] SC Deerwester, ST Dumais, and TK Landauer. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6164):391, 1990.

[37] H Andrew Schwartz, Johannes C Eichstaedt, Margaret L Kern, Lukasz Dziurzynski, Stephanie M Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin E P Seligman, and Lyle H Ungar. Personality, gender, and age in the language of social media: the open-vocabulary approach. PLoS ONE, 8(9):e73791, 2013.

[38] Alexander M. Petersen, Joel N. Tenenbaum, Shlomo Havlin, H. Eugene Stanley, and Matjaž Perc. Languages cool as they expand: Allometric scaling and the decreasing need for new words. Scientific Reports, 2:943, 2012.

[39] M. Perc. Evolution of the most common English words and phrases over the centuries. Journal of The Royal Society Interface, 9(July):3323–3328, 2012.

[40] Johannes C Eichstaedt, Hansen Andrew Schwartz, Margaret L Kern, Gregory Park, Darwin R Labarthe, Raina M Merchant, Sneha Jha, Megha Agrawal, Lukasz A Dziurzynski, Maarten Sap, Christopher Weeg, Emily E Larson, Lyle H Ungar, and Martin E P Seligman. Psychological Language on Twitter Predicts County-Level Heart Disease Mortality. Psychological Science, 26(2):159–169, 2015.

[41] Chanuki Illushka Seresinhe, Tobias Preis, and Helen Susannah Moat. Quantifying the Impact of Scenic Environments on Health. Scientific Reports, 5:16899, 2015.

[42] Alejandro Llorente, Manuel Cebrian, and Esteban Moro. Social media fingerprints of unemployment. PLoS ONE, 10(5):e0128692, 2015.

[43] Jaroslav Pavlicek and Ladislav Kristoufek. Nowcasting Unemployment Rates with Google Searches: Evidence from the Visegrad Group Countries. PLoS ONE, 10(5):e0127084, 2015.

[44] C. Curme, T. Preis, H. E. Stanley, and H. S. Moat. Quantifying the semantics of search behavior before stock market moves. Proceedings of the National Academy of Sciences, 111(32):11600–11605, 2014.

[45] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pages 759–768, 2010.

[46] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th international conference on World wide web, pages 61–70. ACM, 2010.

[47] Emilio Ferrara, Onur Varol, Filippo Menczer, and Alessandro Flammini. Traveling trends: social butterflies or frequent fliers? In COSN ’13 Proceedings of the first ACM conference on Online social networks, pages 213–222, 2013.

[48] Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. Diffusion of Lexical Change in Social Media. PLoS ONE, 9(11):e113114, 2014.

[49] Lewis Mitchell, Morgan R Frank, Kameron Decker Harris, Peter Sheridan Dodds, and Christopher M Danforth. The geography of happiness: connecting twitter sentiment and expression, demographics, and objective characteristics of place. PLoS ONE, 8(5):e64417, 2013.

[50] Zhouchen Lin, Minming Chen, and Yi Ma. The Augmented Lagrange Multiplier Method for Exact Recovery of Corrupted Low-Rank Matrices. 2010.

[51] EJ Candès, Xiaodong Li, Y Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.

[52] F Morstatter, J Pfeffer, H Liu, and K Carley. Is the Sample Good Enough? Comparing Data from Twitter’s Streaming API with Twitter’s Firehose. In International Conference on Weblogs and Social Media, pages 400–408, 2013.

[53] Laszlo Dobos, Janos Szule, Tamas Bodnar, Tamas Hanyecz, Tamas Sebok, Daniel Kondor, Zsofia Kallus, Jozsef Steger, Istvan Csabai, and Gabor Vattay. A multi-terabyte relational database for geotagged social network data. In 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 Proceedings, pages 289–294, 2013.

[54] AS Szalay, Jim Gray, George Fekete, and PZ Kunszt. Indexing the sphere with the hierarchical triangular mesh. arXiv, (arXiv:cs/0701164), 2007.

[55] Dániel Kondor, László Dobos, István Csabai, András Bodor, Gábor Vattay, Tamás Budavári, and Alexander S. Szalay. Efficient classification of billions of points into complex geographic regions using hierarchical triangular mesh. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management SSDBM ’14, pages 1–4, New York, New York, USA, 2014. ACM Press.

[56] Y Gotoh and S Renals. Document space models using latent semantic analysis. In Proc. Eurospeech, pages 1443–1446, 1997.

[57] Debashis Sahoo, David L Dill, Andrew J Gentles, Robert Tibshirani, and Sylvia K Plevritis. Boolean implication networks derived from large scale whole genome microarray datasets. Genome Biology, 9(10):R157, 2008.

[58] Taylor Shelton, Matthew Zook, and Mark Graham. The Technology of Religion: Mapping Religious Cyberscapes. The Professional Geographer, 64(4):602–617, 2012.

[59] Matthew Zook and Mark Graham. Featured graphic: The virtual ‘bible belt’. Environ. Plann. A, 42(4):763–764, 2010.

5 Data availability statement

Owing to Twitter’s policy, we cannot publicly share the original dataset used in this analysis. The county-wide word frequency matrix and the compiled results of the LSA are available in the Dataverse repository at http://dx.doi.org/10.7910/DVN/EXWJRJ and also at http://www.vo.elte.hu/papers/2016/twitter-pca.

Journal:
  • CoRR

Volume abs/1605.02951

Publication date: 2016