Printed from Encyclopedia of Social Work. Under the terms of the licence agreement, an individual user may print out a single article for personal use (for details see Privacy Policy and Legal Notice).

Date: 09 February 2025

Data Science and Social Work

  • Woojin Jung, Rutgers University
  • Andrew H. Kim, Rutgers University
  • Charles Chear, Rutgers University

Summary

Data science presents a new and promising frontier for social work both in methodology and in ensuring data justice and equity. Within social work, text documentation and social media are popular forms of non-traditional data, but other forms, such as imagery and connectivity data, also provide new opportunities. Administrative data linkages, particularly within the realm of child welfare, are a common approach to data use. Methodologically, natural language processing and machine learning are some of the widely applied techniques; however, computer vision, combined with spatial analysis, presents areas with significant potential. Various fields or substantive areas in social work leverage data science to predict risk and utilize algorithmic decision-making. Data science has been used around the world in both data rich and data sparse countries. Social workers are called upon to take action and take part in the conversation of data justice and equitable deployment. Professionals in social work are encouraged to have a thorough understanding of and employ a diverse range of data science tools.

Subjects

  • Research and Evidence-Based Practice

Definition and History

Data science is an interdisciplinary field that combines statistics, computer science, and domain knowledge to process and analyze data (Cariceo et al., 2018). The methods used to collect and prepare data are typically domain specific, while computer science and statistics focus on data representation and modeling, respectively (Brady, 2019). Data science expands analytical interest from causal inference (theory driven) to pattern recognition (data driven) and prediction (Maass et al., 2018; Mazzocchi, 2015).

Data science has relationships with various other concepts as well. Since the early 2010s, data science has been fueled by wider availability of big data, which are characterized by enormous amounts of observations and variables (volume), unstructured-ness (variety), unprecedented speed (velocity), and procedures to ensure integrity of data (veracity; Bello-Orgaz et al., 2016; Conway & O’Connor, 2016). Computing power and cyberinfrastructure, such as high-performance computing (HPC) and cloud computing, enable efficient handling of big data.

Furthermore, data science and related artificial intelligence (AI) technologies aim to improve decision-making through automation (Kuziemski & Misuraca, 2020). These advances use machine learning (ML), which is the capability of a machine to learn and act with minimal or no instruction by humans. Additionally, computational social science, a term often used to describe the integration of data science with social sciences, refers to the use of computational techniques for analyzing complex and large-scale behavioral data gathered from humans and simulations (Lazer et al., 2009, 2020).

While interest in and application of data science, AI, and ML in social work-related research have grown significantly, there is evidence of longstanding engagement and nascent development spanning several decades. In 1977, for example, researchers at the University of Wisconsin-Madison developed and tested a predictive system to identify suicide attempters and found that the computer system was more frequently correct than medical providers (Gustafson et al., 1977). In the 1980s and 1990s, numerous papers were published on emerging technologies for decision support, expert systems, and ML, particularly in relation to child welfare assessment and intervention. Mattaini and Kirk (1991) formulated a typology of assessment approaches in which they included expert systems, described as “artificial intelligence” and “rule-based.” Millea and Mendall (1994) created the Automated Screening and Assessment Package (ASAP), a human service expert system, based on “[artificial intelligence] knowledge engineering methods” (p. 103). Little and Rixon (1998) developed and tested a decision tree “computer learning” system for risk assessment in child protection.

In the early 2020s, social work-related research involving data science has been concerned not just with application but also with implication. This is evident in papers that discuss AI for social good and the advantages and harms of using big data (Coulton et al., 2015; Tanweer et al., 2021; Vannini et al., 2020). Here, data science is defined not only by its technological function but also by a critical view of its impact on society.

Data Sources

Uncovering novel sources of data is a vital aspect of data science in social work. Massive, unstructured information requires extensive data representation, transformation, and engineering before it can be meaningfully analyzed. The process of delving into new data can be challenging, demanding computational power and creative prowess. However, the potential insights gleaned from “data diversification” can be of immense value to both research and practice. This section covers a variety of topics related to data sources, including data linkages, records and texts, social media, call detail records, remote sensing, and other nontraditional sources.

Data Linkages

Of particular interest to the social work field are data linkages. These allow social workers to access data that have already been collected; specifically, the administrative data that organizations collect for standard operation. Often, administrative data linkages are sought after to streamline services such as in child welfare. Longitudinal records from multiple sectors and at multiple levels can assess policy change or comprehensively study rare cases (Jonson-Reid & Drake, 2008). Though most studies that use data linkages are descriptive, they can also be used for studies that look at outcomes, predictive modeling, and intervention evaluation, to name a few (Soneson et al., 2023). When using data linkages, it is important to report on the linkage techniques used to increase the methodological rigor of the study or else data quality may be difficult to assess. Additionally, the opportunity to link administrative data is not limited by size. For example, the use of large-scale administrative data exists in family justice research (Broadhurst et al., 2021). Data linkages offer a promising opportunity for social work innovation, particularly with organizations that already collect administrative data.
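To make the point about reporting linkage techniques concrete, the following is a minimal sketch of deterministic (exact-match) linkage between two administrative files. The field names (`first`, `last`, `dob`, `id`), the sample records, and the choice of matching key are all hypothetical and are not drawn from any study cited here. Real linkages often use probabilistic matching and privacy-preserving identifiers instead.

```python
import unicodedata


def linkage_key(first_name, last_name, birth_date):
    """Build a normalized key: accent-stripped, lowercase, alphanumeric names plus date of birth."""
    def norm(text):
        decomposed = unicodedata.normalize("NFKD", text)
        no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return "".join(ch for ch in no_accents.lower() if ch.isalnum())
    return (norm(first_name), norm(last_name), birth_date.strip())


def link_records(left, right):
    """Link two lists of dict records on the normalized key.

    Returns the matched (left_id, right_id) pairs and the share of left
    records that found at least one match, a figure worth reporting.
    """
    index = {}
    for rec in right:
        key = linkage_key(rec["first"], rec["last"], rec["dob"])
        index.setdefault(key, []).append(rec)

    pairs, linked_left = [], 0
    for rec in left:
        matches = index.get(linkage_key(rec["first"], rec["last"], rec["dob"]), [])
        if matches:
            linked_left += 1
        pairs.extend((rec["id"], m["id"]) for m in matches)
    match_rate = linked_left / len(left) if left else 0.0
    return pairs, match_rate


# Hypothetical records from two agencies (illustrative only).
child_welfare = [
    {"id": "cw1", "first": "José", "last": "O'Neil", "dob": "2010-03-01"},
    {"id": "cw2", "first": "Ana", "last": "Lee", "dob": "2012-07-15"},
]
school = [
    {"id": "s9", "first": "jose", "last": "ONeil", "dob": "2010-03-01"},
    {"id": "s7", "first": "Ann", "last": "Lee", "dob": "2012-07-15"},
]
pairs, match_rate = link_records(child_welfare, school)
print(pairs, match_rate)
```

The second pair is missed because exact matching does not tolerate the spelling difference between "Ana" and "Ann", which is one reason the match rate and the matching rule belong in a methods section.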

Records and Texts

Vast amounts of documentation can also be harnessed as a data source. For example, case notes have been used to classify and predict a patient’s housing stability, tobacco use, and alcohol use status (Teng & Wilcox, 2022). Techniques like this are used to provide snapshots of a patient, particularly when cases with extensive history get transferred. Electronic health records (EHRs) can be an even better source of data. EHRs have become a common data source, especially in clinical prediction studies, because they provide a large volume of data and many predictors. However, EHRs are known to have problems with missing data and bias (Goldstein et al., 2017). Similarly, free-text documents, like court proceedings, have also been utilized. For example, in one research study, 5,000 social work court statements were analyzed to understand how harm and risk are experienced by children in England (Coulthard & Taylor, 2022). Other potential free-text sources include assessments, reports, and case chronologies. Social work is uniquely positioned to benefit from record and text data considering the extensive documentation that occurs in many social work practices.

Social Media

One of the most popular nontraditional sources of data in social work is social media. Social media data can help to understand what the world or a group of people thinks about a specific topic (Olteanu et al., 2019). While social media data can be acquired through public application programming interfaces (APIs), the extent of data accessibility varies widely by platform and is becoming increasingly restricted. Voluminous social media data also require extensive transformation and processing (Amaya et al., 2021).

X (formerly known as Twitter) data are popular because of the platform’s historically free and accessible APIs, which granted academic researchers access to its full-archive search. One study used 675,059 tweets to identify the topics and subtopics within parenting-focused accounts (Ryan et al., 2022). Social media data have also been used in various psychological studies, such as in determinants of depression (Thapa et al., 2021) and in understanding psychological well-being (Voukelatou et al., 2021). Another popular platform for data is Reddit. Subreddits, which cater to specific topics, enable researchers to study relevant discussion boards. For example, when there was an increase in the availability of sports betting, the activity significantly increased in a gambling subreddit (van der Maas et al., 2022).

Overall, social media platforms are a promising new source of data. However, they are subject to selection bias. The data may skew toward more educated young people living in urban areas with greater access to technology. This is in addition to biases and inaccuracies found in the source data and processing stages such as cleaning, enrichment, and aggregation (Olteanu et al., 2019).

Remote Sensing Data

Consistently available worldwide, remote sensing facilitates the collection of information at a high spatio-temporal resolution. Remotely sensed observations, such as those collected through satellite or airborne imagery, offer a way to gather regional characteristics without being restricted to spatial scales. Research in this area has evolved from using single band, nighttime data (Chen & Nordhaus, 2011) to multispectral daytime imagery (Yeh et al., 2020) to study local or global economic outputs.

Even though remote sensing data appear to be used less in social work than in other fields, there are notable examples of studies conducted by social work scholars or within social work and social welfare domains. For example, nighttime imagery has shown great promise in being used as a granular-level wealth metric in developing countries such as Myanmar. In other fields, convolutional neural networks (CNNs) extract poverty-relevant features from daytime imagery in sub-Saharan Africa (Jean et al., 2016) or South-East Asia (Jung et al., 2024). These approaches address the limitations in labeled data by transferring the weights of networks trained on larger, labeled datasets. Other studies have used high resolution satellite imagery to detect slums in India (Wurm et al., 2019). Some studies have used these methods to examine the relationship of low-economic housing and multiple facets of health (Friesen et al., 2020), while others have called for the use of satellite imagery to analyze the impact of housing on depression (Thapa et al., 2021).

Call Detail Records

Cellphone call detail records (CDRs) are valuable nontraditional sources of data. It is estimated that more than two-thirds of the world’s population carry mobile phones. Examples of CDR data are voice calls, phone tower coordinates, caller and callee information, text messages, Internet data use, and mobile money account use (Blondel et al., 2015; Lavelle-Hill et al., 2022). Mobile metadata can provide researchers with time- and location-stamped data at an individual (account holder) level. Information that can be inferred from CDRs includes but is not limited to communication frequency and timing, location, mobility, expenditures, social network structure, and financial transactions. CDRs have also been used to create a prediction model of wealth in Rwanda (Blumenstock et al., 2015), target aid in Togo (Aiken et al., 2022), and understand migration patterns in Tanzania (Lavelle-Hill et al., 2022). The social work field rarely utilizes mobile phone data, however, perhaps due to restricted access to raw CDRs and the high computational costs involved in processing massive data.

Geographic Features and Data

Geographic features refer to the attributes of locations on the Earth’s surface. They include both natural and man-made features such as water bodies, temperature, settlements, boundaries, and landmarks. Geospatial data are a combination of location details, attributes, and often temporal information (Stock & Guesgen, 2016). Spatial data are categorized into raster and vector formats. Raster data, such as remotely sensed imagery, approximate spatial attributes, with each grid cell containing a unique value. Vector data comprise points, lines, and polygons. Location, whether static or dynamic, is a key in both geographic features and geospatial data. Georeferenced data are highly valuable in artificial intelligence/machine learning analysis. One can spatially join geocoded data and combine them with other socioeconomic indicators to provide a comprehensive characterization of a region of interest. Geospatial analysis facilitates the interpolation of unobserved values (Jung, 2023), the identification of hotspots (Burke et al., 2016), and the evaluation of service accessibility using distance metrics (Bauer et al., 2015).
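As an illustration of evaluating service accessibility with a distance metric, the sketch below computes great-circle (haversine) distance between geocoded points and finds the nearest service site. The coordinates, site names, and function names are hypothetical. Great-circle distance is only one possible metric, and road-network travel time may be more appropriate in practice.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_service(household, services):
    """Return (name, distance_km) of the closest service site to a (lat, lon) household point."""
    lat, lon = household
    distances = ((name, haversine_km(lat, lon, s_lat, s_lon))
                 for name, (s_lat, s_lon) in services.items())
    return min(distances, key=lambda item: item[1])


# Hypothetical vector (point) data: one household and two service sites.
household = (40.50, -74.45)
services = {"clinic_a": (40.49, -74.44), "clinic_b": (40.74, -74.17)}
print(nearest_service(household, services))
```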

Other Data Sources

Other nontraditional sources of data include crowdsourcing, web searches, retail scanners, and the news (Behrend et al., 2011; Carlquist et al., 2017; Chan et al., 2011; Magruder, 2003; Voukelatou et al., 2021), to name a few. Crowdsourcing involves obtaining data from a large number of people who provide their data online. OpenStreetMap, for instance, allows contributors to add and modify geographic information such as point-of-interest (POI) data. Studies have used crowdsourced smartphone applications to model behavioral changes, stress, and physical symptoms during a peak influenza period (Madan et al., 2011). Web analytics, such as Google Trends, are also unique forms of data that look at the queries people have typed into their search bars. According to Kristoufek et al. (2016), searches for “major depression” and “divorce” have been found to account for 30% of the variance in suicide data. Dynamic retail scanner data correlate flu-remedy sales with physician diagnoses of respiratory conditions, showing their potential as an early warning sign for human disease (Magruder, 2003). News records have also been utilized to identify usage patterns of core vocabulary of well-being, providing a snapshot of a town, state, and country from the perspective of multiple subject domains such as arrests and political events (Carlquist et al., 2017).

Methods

There are several data science methods, ranging in complexity and popularity, that address the social problems and needs of social work. These methods encompass qualitative approaches, natural language processing (NLP), and computer vision. They also involve both supervised and unsupervised learning techniques for regression and classification tasks using various machine learning algorithms.

Natural Language Processing (NLP)

Natural language processing (NLP) combines linguistics, statistics, and computer science to analyze natural language text data (Conway et al., 2019). NLP is a popular data science method among social work researchers and human service providers seeking to use administrative data beyond descriptive purposes. The field has advanced from frequency-based methods to embedding-based representations and further to large generative models that capture contextual nuances in text. Contextual embeddings, such as the Bidirectional Encoder Representations from Transformers (BERT), show high performance in identifying immigrant-related nonprofits based on organizational names (Ren & Bloemraad, 2022). NLP includes subtypes such as topic modeling and sentiment analysis. Latent Dirichlet allocation (LDA) is a widely used topic modeling method that measures the recurrence and clustering of words to identify themes. For instance, coherent topics can be identified by LDA and then be described by sample comments, as seen in Lee et al. (2021). Another topic modeling method, BERTopic, uses attention mechanisms to interpret meaning from complex phrases and word usage. Sentiment analysis is another tool, used to classify texts as having positive, neutral, or negative valence. An example of this can be found in Victor et al.’s (2021) study, where they examined the presence or absence of domestic violence by reviewing child welfare records.
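As a small illustration of the frequency-based end of this spectrum, the sketch below computes TF-IDF weights, which score a term higher when it is frequent in one document but rare across the collection. The tokenizer, the smoothed IDF formula, and the two example notes are assumptions made for illustration. The snippet does not reproduce the method of any study cited above, and it is not an embedding or topic model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = relative term frequency * smoothed inverse document frequency,
    where idf = log((1 + N) / (1 + df)) + 1 (an assumed, commonly used variant).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Invented case-note fragments (illustrative only).
notes = ["housing unstable housing", "stable housing"]
for doc_weights in tf_idf(notes):
    print(sorted(doc_weights.items(), key=lambda kv: -kv[1]))
```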

Computer Vision

Whereas NLP handles text data, computer vision is used to algorithmically extract information from images and interpret it as real-world descriptions of a range of objects (Naik et al., 2017). A critical step in computer vision is converting images into numerical matrices for statistical or ML applications. CNNs have been a prevalent approach for image processing. Researchers in this field describe demographic characteristics in U.S. neighborhoods by deploying CNNs to classify cars in Google Street View images (Gebru et al., 2017). More recently, CNN-extracted satellite features have been used to infer the causal effect of electricity access on livelihoods in Uganda (Ratledge et al., 2022) and cash transfers on housing quality in Kenya (Huang et al., 2021). Vision Transformers, which apply self-attention mechanisms to identify the most important parts of an image, have emerged as prominent tools for image classification and feature detection tasks. Using image embeddings for prediction tasks has shown reasonable performance at scale, but these features are not interpretable. With adequate computational resources, the application of more intuitive methods, such as semantic segmentation combined with sub-meter resolution aerial tiles (e.g., Hosseini et al., 2023), can enhance the interpretability of AI models for subnational analysis. Image analysis, while it requires substantial cyberinfrastructure and expertise, provides novel opportunities for detecting regional characteristics.
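To show what treating an image as a numerical matrix looks like, here is a minimal sketch of the sliding-window operation at the core of a CNN layer. The tiny grayscale "image" and the hand-written edge filter are invented for illustration. A trained CNN learns many such filters from data rather than using a fixed one.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the operation CNN layers call convolution.

    `image` and `kernel` are lists of equal-length rows of numbers.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k_rows) for b in range(k_cols))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


# A 3x4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# A filter that responds where brightness increases from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]
feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 18, 0], [0, 18, 0]]
```

The nonzero column in the output marks the position of the dark-to-bright boundary, which is the kind of low-level feature that later layers combine into higher-level patterns.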

Multimodality

The combination of text, image, sound, and other types of data is referred to as multimodality, which serves an aim of artificial intelligence by providing the building blocks to mimic the “modes” of human experience (e.g., verbal, vocal, and visual; Lahat et al., 2015). Multimodal frameworks can outperform unimodal approaches (Suel et al., 2021). These new features can be used jointly to estimate or predict outcomes of interest, leveraging ML methods, as seen in Niu et al. (2020). Since these methods and data types involve big data, statistical approaches are required to minimize spurious relationships, overfitting, and other problems that may occur (Fan et al., 2014). Multimodal ML is infrequently used in social work research but is a promising frontier.

Qualitative Approach

Data science in social work research is sometimes considered synonymous with quantitative methods; however, qualitative inquiry is often involved. This is especially true in the task of labeling and annotating data, where human reviewers describe and name samples of texts and images to train a machine. This human involvement in the learning process, or more broadly the interaction between humans and ML algorithms, is commonly referred to as humans-in-the-loop. Conversely, domain health experts may examine results from an unsupervised learning classification task to determine its meaning or accuracy (Andreotta et al., 2019). Some social work studies will involve multiple domain health experts, rounds of labeling, and interrater reliability measures (Victor et al., 2021). Other studies recognize this step as an opportunity to address social inclusion; for example, Frey et al. (2020) used gang-involved youth as domain health experts to analyze unstructured Twitter data and classify gang and violence-related content.
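One widely used interrater reliability measure is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below is a generic implementation with invented labels. It is not taken from the cited studies, which may have used other measures.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels from two reviewers for four case notes.
reviewer_1 = ["yes", "yes", "no", "no"]
reviewer_2 = ["yes", "no", "no", "no"]
print(cohens_kappa(reviewer_1, reviewer_2))  # 0.5
```

Here the reviewers agree on three of four notes (0.75), but half of that agreement would be expected by chance, so kappa is 0.5.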

Unsupervised Learning

Whether social work researchers use NLP or computer vision methods, they must decide whether to instruct a machine on what to look for. The levels to which this occurs are classified as unsupervised or supervised learning. Unsupervised learning intends to discover hidden patterns, structures, or relationships within the data, whereas in supervised learning, the model is trained on a labeled dataset and tested on unseen cases. These two approaches can also be combined. Unsupervised learning, such as clustering and principal component analysis, can be used as a dimension reduction technique to narrow down the number of features prior to implementing supervised learning.
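Clustering can be illustrated with k-means, which groups unlabeled points around k centroids without being told what the groups mean. This is a minimal sketch of Lloyd's algorithm on invented two-dimensional points. Initializing the centroids from the first k points is a simplification chosen to keep the example deterministic, and production code would typically use a library implementation with better initialization.

```python
import math


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; returns (centroids, assignments).

    Centroids start at the first k points (a simplifying assumption).
    """
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Invented, unlabeled observations forming two visible groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```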

Supervised Learning

Supervised learning has a wide range of applications in social work that aid in making predictions. Prediction tasks can be categorized into two main types, depending on the outcome desired. Classification tasks output a class label, while regression tasks predict a continuous or discrete value. Classification is a prevalent approach in social work. For example, studies classify patterns of criminal offense or child abuse to estimate the chance of its reoccurrence (Perron et al., 2019; Travaini et al., 2022). It is not uncommon to categorize scores or scales as high or low, which converts the problem at hand into a classification task (e.g., dichotomizing the quality of life assessments in Nuutinen et al., 2023). In social work, logistic regression is frequently used for classification tasks (Byrne et al., 2019). Penalized logistic regression, for instance, is an effective method to predict adverse birth outcomes, especially when nonlinearities do not contribute to the predictive power (Pan et al., 2017). As for linear regression techniques, lasso (L1 regularization), ridge (L2 regularization), and elastic net (L1 and L2 regularization) can achieve sparsity and/or prevent overfitting.
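The relationship among the three penalties can be written compactly. The notation below is an assumption introduced for illustration and follows one common parameterization: n observations, outcome y_i, feature vector x_i, coefficients β, overall penalty strength λ ≥ 0, and mixing weight α between 0 and 1.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^{2}
\;+\; \lambda\left(\alpha\,\lVert\beta\rVert_{1}
\;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_{2}^{2}\right)
```

Setting α = 1 gives the lasso, whose L1 term can shrink some coefficients exactly to zero (sparsity). Setting α = 0 gives ridge, which shrinks coefficients toward zero without eliminating them. Intermediate values give the elastic net.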

ML Algorithms

Many other ML algorithms are available, some of which are commonly associated with classification but can still be adapted for regression. One of the simplest and most intuitive methods is a single decision tree. Random forest (RF) combines multiple independent tree predictors (Breiman, 2001) and is one of the most popular methods in social work literature. Feature importance of RF is often used to reduce dimensions of input features before running generalized linear models. RF is also known as an ensemble method (Gautam & Singh, 2020), as it combines multiple prediction models or trees to produce a final result. Another ensemble method uses boosting techniques (weighted on performance), rather than bagging techniques (majority vote) like RF. Gradient boosting, extreme gradient boosting, super learner ensembles, and component-wise boosting are found in social work-related literature. Gradient boosting, for example, adds multiple weak learners (decision trees) using gradient descent to minimize the loss function (Suchting et al., 2020). Tree models that rely on a Bayesian statistical probability (e.g., Bayesian additive regression tree in Hill et al., 2020) are not commonly employed in social work research, yet they hold potential. Other methods include support vector machine/support vector regression with various kernels and naïve Bayes (Bako et al., 2021; López-Castro et al., 2021).
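The idea of adding weak learners one at a time to reduce a loss can be sketched in a few lines. The example below is a toy gradient boosting regressor for a single numeric feature and squared-error loss, where each weak learner is a one-split tree (a stump) fitted to the current residuals. The data, function names, and default settings are invented, and library implementations include many refinements omitted here.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a prediction function built from a mean baseline plus boosted stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Invented training data with a step between x = 2 and x = 3.
xs, ys = [1, 2, 3, 4], [0, 0, 10, 10]
model = gradient_boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each round moves the predictions a fraction (the learning rate) of the way toward the remaining residuals, which is why many small steps are used instead of one large one.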

Neural networks (NNs) are well-suited for complex and nonlinear problems. The feedforward multilayer perceptron (used in, for example, Marshall & English, 2000) is a foundational architecture for other advanced neural networks such as CNNs for image analysis and recurrent NNs for sequential data. Deep NNs are known to achieve higher accuracy than simpler models such as a decision tree, but it is difficult to capture the inner workings of the algorithms. Post hoc explanations and predictive uncertainty have emerged as valuable tools for gaining insights and creating more useful decision aids (McGrath et al., 2020). Explainable AI methods include SHapley Additive exPlanations (SHAP), gradient-based methods (e.g., Saliency Maps, Class Activation Mapping, Integrated Gradients), and perturbation-based methods (e.g., Local Interpretable Model-agnostic Explanations: LIME). SHAP is well-suited for models that utilize an intuitive and limited number of features. Gradient-based methods, which are more scalable, are effective for processing images and multimodal data, as they highlight key inputs. Meanwhile, perturbation-based methods assess how changes in input affect the output. Additionally, uncertainty quantification employs statistical techniques to estimate outcome probabilities under uncertainty, for instance, through constructing prediction intervals. While NNs have yet to be extensively studied in social work, they remain an area of potential for future research. Overall, the choice of algorithm is determined by the nature of the data and by comparing prediction performance via evaluation metrics.
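The basic logic of perturbation-based explanation, changing an input and observing how the output moves, can be shown with a deliberately simple sketch. The function below replaces one feature at a time with a baseline value and records the change in the prediction. It is not an implementation of LIME (which fits a local surrogate model) or SHAP (which averages over feature coalitions). The toy model, feature names, and baseline values are hypothetical.

```python
def perturbation_effects(predict, instance, baseline):
    """One-at-a-time perturbation: effect of resetting each feature to its baseline.

    `predict` maps a {feature: value} dict to a number.
    Returns {feature: prediction(instance) - prediction(perturbed instance)}.
    """
    reference = predict(instance)
    effects = {}
    for name in instance:
        perturbed = dict(instance)
        perturbed[name] = baseline[name]
        effects[name] = reference - predict(perturbed)
    return effects


def toy_model(features):
    """A stand-in for any fitted model; the weights are invented."""
    return 0.5 * features["x1"] + 0.1 * features["x2"]


instance = {"x1": 4, "x2": 10}
baseline = {"x1": 0, "x2": 0}
print(perturbation_effects(toy_model, instance, baseline))
```

For a linear model the effects simply recover each term's contribution. For a nonlinear model with interacting features, one-at-a-time perturbation can be misleading, which is part of the motivation for the more elaborate methods named above.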

The development of ML models should also include hyperparameter tuning, which tweaks specifications to achieve a performance objective. Hyperparameters are parameters that define model architecture before training an ML model and may be used to select an algorithm to minimize the loss function (Yang & Shami, 2020). Serrano and Bajo’s (2019) study from the engineering field describes the process of optimizing NN models to diagnose chronic social isolation. Hyperparameter tuning is an area that has been underreported and is a potential area to increase the rigor of data science in social work research, as well as generally in social science research (Egger & Yu, 2022).
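A minimal sketch of hyperparameter tuning by grid search follows. The hyperparameter is the penalty strength of a one-feature ridge regression without an intercept, each candidate value is fitted on training data, and the value with the lowest error on held-out validation data is kept. The data and candidate grid are invented. Reporting the grid and the selection criterion, as this function does by returning all results, is the kind of transparency the paragraph above calls for.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Closed-form ridge slope for a single-feature model with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mean_squared_error(slope, xs, ys):
    """Average squared prediction error of y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search(train, valid, penalties):
    """Try every candidate penalty; return the best one and all (error, penalty) results."""
    results = []
    for penalty in penalties:
        slope = fit_ridge_1d(train[0], train[1], penalty)
        results.append((mean_squared_error(slope, valid[0], valid[1]), penalty))
    _, best_penalty = min(results)
    return best_penalty, results


# Invented data: the training points overstate the slope seen in validation.
train = ([1, 2], [3, 5])
valid = ([3], [6])
best_penalty, results = grid_search(train, valid, penalties=[0, 0.5, 1.5, 5])
print(best_penalty, results)
```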

Substantive Areas

As the use of data science continues to grow in social work, so too does the number of substantive areas to which it can be applied. Uptake varies considerably; some areas have had a head start in data science applications, while others are just scratching the surface. The following sections discuss the use of data science in social work broadly; child welfare and related topics; mental health; healthcare; poverty, welfare, and global development; criminal justice; homelessness; and other areas. Though these are not all the areas in social work in which data science has been utilized, these appear to be the most common areas.

Social Work

From a practice perspective, social workers will need to be aware of the challenges that come with the application of data science: issues such as data protection, bias, and how data science will impact human decision-making processes in the field (Schneider & Seelmeyer, 2019). Some authors point toward the big data revolution in business and healthcare as reference points (Zetino & Mendoza, 2019). While data science can revolutionize social work practice, revolutions in the storage, merging, and analysis of data will have to come with it (Cariceo et al., 2018). Digital technology is drastically changing the way social work is performed; however, the theoretical and ethical discourse on the digital transformation of social work is alarmingly slim (Steiner, 2021). More research needs to focus on how service delivery in social work has changed since the adoption of technology use in the field.

Child Welfare

Child welfare and related fields use data science technologies more than most other substantive areas within social work. The use of algorithms to support child welfare decision-making is often discussed, both from the application perspective and with a critical lens. Predictive risk models (PRMs) are predominantly used in child welfare. PRM studies assess the risk of placement in foster care (Chouldechova et al., 2018) and aging out of foster care (Ahn et al., 2021). In addition, a risk model for youth who run away was developed and validated using a time-to-event Cox model (Chor et al., 2022). Studies have shown that adaptive machine learning (ML) systems that allow for expert input can achieve superior performance, such as in the case of predicting recurrent child maltreatment (Han et al., 2021).

Data from social media sites such as Reddit have also been used to analyze patterns of abuse and psychological impact in survivors of mother–daughter sexual abuse (Lin et al., 2022). Similarly, Reddit data have been used to examine families’ needs and prevalent issues during COVID-19 (Lee et al., 2021). Parenting has also been analyzed through social media data such as topic modeling in Twitter (Ryan et al., 2022). While new sources of data are used to further understand the experiences of youth, it is just as imperative to understand how big data may actually miss their experiences (Fink & Brito, 2021).

Child welfare researchers have hosted discussions about the challenges and risks of data science. In an ethnographic analysis of the child welfare system, Saxena et al. (2021) found that risk assessment limits the utility of algorithms to provide high-quality services for children and support human discretion as well as bureaucratic processes. The literature also highlights issues related to subjectivity, bias, discrimination, and inaccuracies (Keddell, 2019). For example, PRMs could lead to criminalization of vulnerable and marginalized children (Sacher, 2022). Another article discusses the application of ethical principles such as fairness, accountability, and transparency to birth match policies, which automatically assess all newborns for maltreatment risk by linking birth certificate and welfare records (Lanier et al., 2020).

Healthcare

In healthcare settings, data science plays a role in enhancing patient care and advancing medical research (Villarreal et al., 2019). An RF model that incorporates social determinants can effectively predict the in-hospital mortality rate for heart failure patients, especially among African Americans (Segar et al., 2022). Similarly, ML has been found to improve the ability to predict a patient’s quality of life during breast cancer treatment (Nuutinen et al., 2023). Data science techniques have also been used to predict hepatitis C incidence among substance users (Villarreal et al., 2019).

Like child welfare, the healthcare field offers frameworks and conceptualizations of how data science impacts society. Healthcare has been used to conceptualize the “Three Pillars for Fairness,” which include transparency, impartiality, and inclusion (Sikstrom et al., 2022). Another way to advance equity in healthcare is to address power dynamics in health information systems’ standards and practices, which can be enhanced by adopting a social determinants of health perspective (Berg et al., 2022). On a wider scale, big data improves the capacity for biomedical researchers to identify patterns that can help inform clinical biomarkers, pinpoint unsuspected treatment targets, and ultimately expedite a goal of precision medicine (Hulsen et al., 2019).

Mental Health

Data science contributes to improving mental health services by enabling personalized treatments and enhancing patient engagement and compliance. In this literature, ecological momentary assessments (EMAs), which involve gathering real-time data in naturalistic environments, are used for psychological assessment (Bickman, 2020). A systematic review also reported that predictive model-based text messages have the potential to reduce no-shows for outpatient appointments (Oikonomidi et al., 2023). Similarly, iterative random forest (RF) algorithms have been adopted to select predictors of attendance when addressing high dropout rates in clinical trials for those with comorbid post-traumatic stress disorder and substance use disorder (López-Castro et al., 2021). Regarding addiction, substance issues have been detected through text mining and ML (Perron et al., 2019).

Poverty, Welfare, and Global Development

Research in poverty, social welfare, and global development has increasingly leveraged data science. Data science has been applied to creating alternative measures of wealth and poverty, particularly in data sparse contexts. For example, in regions such as Myanmar, where granular poverty data are unavailable, nightlight luminosity has shown strong predictive power for the locations of community development projects. Relatedly, document-level features from interview transcripts were used to classify who needed to be prioritized in social programs in Colombia (Muneton-Santa et al., 2022). In the same vein, a method based on Benford’s Law has been able to detect fraud in conditional cash transfer programs, such as Brazil’s Bolsa Familia (da Silva Azevedo et al., 2021). In the nongovernmental sector, community-based organizations in Macau have a crucial role in facilitating the involvement of marginalized communities in data collaboration to support Sustainable Development Goals (Thinyane et al., 2018).
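
For readers unfamiliar with Benford’s Law, the following is a minimal sketch of the general idea: in many naturally occurring sets of amounts, the leading digit d appears with probability log10(1 + 1/d), so a large departure from that pattern can flag records for closer review. It is not the procedure used by da Silva Azevedo et al. (2021); the function names are hypothetical, the amounts are synthetic, and a large statistic is only a prompt for further audit, not proof of fraud.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit proportions under Benford's Law: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Return the leading nonzero digit of a nonzero number."""
    # Scientific notation puts the leading significant digit first, e.g. '4.200000e+02'.
    return int(f"{abs(value):e}"[0])


def benford_chi_square(amounts):
    """Chi-square statistic comparing observed first digits with Benford's Law."""
    digits = [first_digit(a) for a in amounts if a != 0]
    counts = Counter(digits)
    n = len(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


# Synthetic illustration only (not program data): powers of 2 follow Benford's Law
# closely, whereas amounts between 500 and 599 all begin with the digit 5.
benford_like = [2 ** k for k in range(1, 200)]
suspicious = [500 + k % 100 for k in range(199)]

stat_benford_like = benford_chi_square(benford_like)
stat_suspicious = benford_chi_square(suspicious)

# With 8 degrees of freedom, the 5% critical value of the chi-square
# distribution is about 15.51; larger statistics suggest a poor fit.
print(stat_benford_like, stat_suspicious)
```

In this toy comparison, the second set of amounts yields a far larger statistic than the first, which is the kind of signal an auditor might use to prioritize records for manual review.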

Criminal Justice

A key topic is the potential of data science to predict recidivism risk. A systematic review of 12 studies has shown that ML techniques lead to good performance based on error metrics (Travaini et al., 2022). Some studies have offered critical views of the overall contribution of data science techniques. For instance, the popular risk assessment software, Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), was not necessarily more accurate or fair than human assessments (Dressel & Farid, 2018).

Type I errors, or false positives, have particularly detrimental consequences in legal systems. A major challenge for data science tools is their potential for racial bias and unreliability, as outlined by Angwin et al. (2016). Discussions about bias, however, are contested and nuanced. Dieterich et al. (2016) argued that COMPAS is not discriminatory toward Black defendants, considering different base rates and predictive parity of recidivism across groups. Similarly, racial disparity in the Post Conviction Risk Assessment, a tool used by U.S. court systems, was mostly attributed to criminal history, which mediated the relationship between race and arrest (Skeem & Lowenkamp, 2016).
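
The tension between these positions can be illustrated with a small numerical sketch. The confusion-matrix counts below are invented for illustration and are not COMPAS data, and the function name is hypothetical. The sketch shows how, when two groups have different base rates of the outcome, a tool can satisfy predictive parity (equal positive predictive value) while still producing unequal false positive rates.

```python
def rates(tp, fp, fn, tn):
    """Summarize one group's confusion-matrix counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives for a binary risk flag.
    """
    return {
        "base_rate": (tp + fn) / (tp + fp + fn + tn),
        "false_positive_rate": fp / (fp + tn),
        "true_positive_rate": tp / (tp + fn),
        "positive_predictive_value": tp / (tp + fp),
    }


# Hypothetical groups of 100 people each, with different base rates.
group_a = rates(tp=30, fp=10, fn=20, tn=40)  # base rate 0.50
group_b = rates(tp=12, fp=4, fn=8, tn=76)    # base rate 0.20

for name, summary in (("A", group_a), ("B", group_b)):
    print(name, summary)
```

In this constructed example, both groups have a positive predictive value of 0.75 and a true positive rate of 0.60, yet the false positive rate is 0.20 in group A and 0.05 in group B. Which of these quantities should be equalized is a normative question rather than a purely technical one.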

Data science techniques have also been used to study topics such as recidivism in juvenile justice and other youth risk, Minnesota sex offenders, correctional institutions, and drug court (Zolbanin et al., 2020). In the computer science field, spatio-temporal data have been used to cluster crime hot spots and predict crime incidents (Butt et al., 2021).

Homelessness

Studies have looked at several populations that experience homelessness, tapping into existing data, such as Veterans Affairs records, or collecting high-frequency data. According to Byrne et al. (2019), ML models are better at identifying acute homelessness in veterans than chronic housing instability. The study also highlighted the issue of class imbalance with rare outcomes, where ML models such as RF have lower specificity, despite slightly outperforming logistic regression models. In a study of homeless youth, component-wise gradient boosting (CGB) was used to estimate the odds of youth being homeless, unstably housed, or residing at a shelter (Suchting et al., 2020). The CGB algorithm was chosen to account for longitudinal data drawn from daily diary assessments via EMA. A different study employed K-means clustering, an unsupervised learning approach, to group homeless individuals served by a nongovernmental organization; the results correspond to three categories of homelessness: transitional, episodic, and chronic (Hong et al., 2018).
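
As a rough illustration of how K-means groups cases without labels, the sketch below implements the basic algorithm (alternating between assigning each point to its nearest centroid and recomputing each centroid as the mean of its points). The two features (number of shelter episodes and total months in shelter), the nine data points, and the starting centroids are all assumptions made for this example; they are not the variables or data used by Hong et al. (2018). In practice, features are usually rescaled and starting centroids are chosen randomly over several runs.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of its points; keep it if its cluster is empty."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids


def kmeans(points, init_centroids, max_iter=100):
    """Run the assign/update loop until labels stop changing or max_iter is reached."""
    centroids = list(init_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical records: (shelter episodes, total months in shelter).
points = [
    (1, 1), (1, 2), (2, 1),     # few, short stays
    (6, 3), (7, 4), (6, 4),     # many, short stays
    (2, 20), (1, 24), (2, 22),  # few, very long stays
]
labels, centroids = kmeans(points, init_centroids=[(1, 1), (6, 3), (2, 20)])
print(labels)
print(centroids)
```

The algorithm only returns numbered groups. Interpreting them as, for example, transitional, episodic, or chronic patterns is a substantive judgment made by the analyst after inspecting the cluster centroids.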

Other Areas and Populations

Population-specific data science research encompasses studies on aging, veterans, and domestic violence. Researchers in urban analytics estimate the age-friendliness of urban scenes in Australia, utilizing CNN embeddings from street images (Moradi et al., 2023). Other studies use NNs to predict the self-reported health rating of the elderly population (Qin et al., 2020). To select features, this study calculated a maximal information coefficient, accounting for nonlinear features. Research on veterans has included the development of a predictive model to assess self-reported suicide rates among veterans re-integrating into civilian life. Using panel surveys, this model employs SHAP values to illustrate feature importance and adopts the super learner ensemble method, which pools results across multiple algorithms (Stanley et al., 2022). ML has also been used in domestic violence cases. Gradient boosting models were found to be effective in identifying cognitive and neurobiological predictors of pain development in survivors of recent intimate partner violence (Lannon et al., 2021).

Applications, Ethics, and Framework

A rise in data science applications has raised ethical concerns about the use of data-driven methods and the need to understand their impact on social work. Extant social work literature has been particularly concerned with consent and ownership in data sharing and data management. In terms of data science applications, the same social work literature pays special attention to algorithmic decision-making. The emerging framework of data justice entails a new perspective of fairness given the advances of data science. Similarly, data governance protocols and considerations are an area in which social workers can contribute. There is an opportunity to move the field of data science through social work as well as move the field of social work through data science. There are several interdisciplinary, theoretical frameworks that are primed to facilitate these discussions, and social workers are uniquely trained for this interdisciplinary calling.

Deployment and Implementation in the Field

In light of the growing interest in data science, it is essential to explore how these techniques can be transferred to social domains and organizations. The literature presents case studies of the implementation of machine learning (ML) tools for high-stakes decisions. For instance, Eckerd’s Rapid Safety Feedback Program in Illinois was a proprietary system that lacked transparency and had very high error rates (Drake et al., 2020). In contrast, the Allegheny Family Screening Tool in Pennsylvania deployed a hotline screening tool for child maltreatment referral that attended to ethical concerns. Human-centered deployment requires agency leadership, transparency, ethical oversight, community engagement, and social license (Vaithianathan et al., 2021). The public sector may also find it difficult to adopt and execute data science and AI. Campion et al. (2020) have recommended utilizing existing management strategies to overcome obstacles. Strategies suggested include securing political and executive support and establishing data standardization and data-sharing agreements.

Algorithmic Decision-Making

Data science offers the benefits of automation and prediction; however, to do either, algorithmic decision-making is involved. These processes present several challenges. Issues of digitizing bureaucracy with little room for handling individual cases, bias from observations and dimension reduction, and the handling of probabilities in predictive analytics are significant (Schneider & Seelmeyer, 2019). This becomes even more intricate when considering the different paradigmatic frames or knowledge approaches in social work, including a data-driven diagnostic approach versus a participatory hermeneutic approach.

Some studies in this domain highlight the risk of algorithmic decision-making. A flawed algorithm used by the police to identify potential gang members has led to racial profiling and, once shared with schools and other third parties, resulted in lasting systemic impacts (Sacher, 2022). To reduce bias, other studies emphasize the need to disclose key data on the development of decision support tools and the strength of the correlation between predictors and the targeted phenomenon. For example, Gillingham (2021) examined child maltreatment-related decisions and noted that previous involvement with child protection services needed to be carefully defined; the type of involvement matters, as well as allowing for the possibility of change.

Discrimination, Ethics, and Fairness

The data science literature discusses the important issues of discrimination, ethics, and fairness. The social work field emphasizes distributive justice, which encompasses social, environmental, and economic justice. With the increased utilization of data, a new movement of data justice has emerged. Data justice is concerned with the societal implications of data-driven technologies, specifically the inequalities that may emerge from them (Dencik et al., 2019). Several activist-proposed conceptual frameworks suggest how to operationalize human rights in technoculture (Goldkind et al., 2021). There is an emphasis on social workers recognizing new data-driven forms of inequality across individual, organizational, and community levels, which are at risk of experiencing data harm. According to Goldkind et al. (2021), data practices that promote transparency, accountability, nondiscrimination, dignity, and participation are required. Similarly, the data feminist framework examines power differentials, such as in hidden datapoints, through the lenses of gender, race, and class (Sandberg et al., 2022).

At the same time, some critical conversations challenge contemporary notions of fairness. Some have proposed to approach algorithmic fairness from a sociotechnical view rather than purely relying on black box engineering assumptions (Dolata et al., 2022). Another critique is that studies often either measure competing definitions of fairness mathematically or recommend governance tools which reinforce a technocratic approach. A nuanced approach to fairness emphasizes transparency, impartiality, and inclusion, rather than a binary rendering of fair versus unfair (Sikstrom et al., 2022). This supports the multifaceted concept of fairness (Corbett-Davies & Goel, 2018) but introduces the challenges of meeting multiple fairness criteria (Chouldechova & Roth, 2020).

There are other key ethical issues that arise when leveraging data science in social work. In developing ML models for detecting child abuse and neglect, it is crucial to include the perspectives of primary caregivers from marginalized communities, such as Black and Latinx groups, to address cultural differences and miscommunication and to mitigate the risk of false accusations by child protection services (Landau et al., 2022). In criminal justice and, to a lesser extent, in child welfare, issues around selective labels, in which only a fraction of outcomes are observed, complicate the evaluation of prediction performance (Chouldechova et al., 2018; Kleinberg et al., 2018). Nonetheless, studies attempt to reduce predictive bias across demographic groups using various fairness metrics (Ahn et al., 2021; Dieterich et al., 2016). Overall, the literature emphasizes the importance of involving multiple on-the-ground stakeholders both in the development and preprocessing of data collection. It also highlights the need for measurement and data standardization, as well as evaluation standards for AI/ML models. Others have approached similar issues from a technical perspective. This approach seeks to develop multiobjective algorithms that optimize both fairness and accuracy simultaneously (Valdivia et al., 2021).

Data Governance

No discussion of data science would be complete without discussing data governance. While the recommendations for successful deployment and implementation of data science involve a willingness to share data (Campion et al., 2020), data governance is required to prevent the risks of harm and poor data quality. These issues have been highlighted when examining international AI governance. Regarding health data in the Global Digital Health Partnership member countries, four areas for international collaborations were identified: oversight, the entire AI pipeline from data collection to model deployment and use, standards and regulation, and stakeholder engagement (Morley et al., 2022). Other frameworks have been proposed for specific issues. The health data cooperative model offers a framework that can be particularly effective in hidden areas of study, such as immigrant health and wellness, where availability of quality data is often a major barrier (Naeem et al., 2020). This model allows members to contribute, store, and manage their health-related information, and members maintain the rights of data ownership and sharing.

Theoretical Frameworks and Tools

As the incorporation of data science in social work continues to grow, there is also an increasing need for theoretical frameworks. However, the discussion of technology in social work predominantly refers to other disciplines, without fully considering theoretical differences between the referenced discipline and social work (Steiner, 2021). This creates a need to critically examine theoretical tools from digital studies alongside those that underpin social work practice.

Several frameworks are presented in the literature. One approach combines actor-network theory with social media and power theories (Steiner, 2021). Other frameworks use theories of system-level bureaucracy, digital discretion, and artificial discretion to address the increasing integration of information and communication technology in public service (Bullock et al., 2020). As AI increases within organizations, the very nature of work itself begins to change from person-to-person to person-to-computer interactions. Additionally, organizations evolve from strictly using human discretion in decision-making to balancing AI and digital discretion with human discretion. How social work will further evolve with data science and AI is yet to be understood.

Others have applied business and computer science frameworks to examine data science techniques in social work. By applying a business analytics framework to child maltreatment, for example, social work practitioners and researchers can examine how different data science tools fall into descriptive, diagnostic, predictive, or prescriptive taxonomies (Lanier et al., 2020). Another interesting perspective that comes from both cognitive theory and business is the black swan problem, in which an event deviates too much from the norm. According to Lanier et al. (2020), it is important to consider whether social work problems such as fatal child maltreatment represent unpredictable black swan events or predictable human behavior with underlying mechanisms. Finally, ML and data science have been proposed as ways to enhance cost benefit analysis tools. These tools can learn from criminal justice interventions and make evidence-based suggestions to users (Manning et al., 2018).

Global Context

Around the world, data science in social work and related fields focuses on improving data management and analysis, predicting risk, and generating global-scale data. In the United Kingdom, for instance, child welfare studies find that poor data quality is a pervasive problem (White et al., 2022). Data silos within child welfare institutions and disparate standards pose challenges when connecting data features and sources. As global social policy increasingly uses big data and advanced computational methods to inform decision-making, there is a pressing need to identify best practices. In Australia, a study presents a six-stage data analysis pipeline to design an exemplary infrastructure for social policy (Gulliver et al., 2021).

Studies also focus on ways in which nontraditional data can help guide mental health or health outcomes. In countries such as Brazil and India, several nontraditional data sources have been identified as proxies for depression. These include social media, mobile phones, and satellite imagery, to name a few (Thapa et al., 2021). Other studies have demonstrated that decision tree models outperform traditional logistic regression in predicting neurodegenerative diseases among older Chinese adults (Hao et al., 2022).

Research in this area generates rich cross-country data. Jung (2022) analyzed discrepancies between income poverty measures and multidimensional poverty for 135 developing countries. This paper presents an ML-based method to impute poverty data, addressing a common missing data problem in social science research. Chi et al.’s (2022) study also generates global poverty maps, leveraging various data sources such as multispectral imagery and geographic data combined with phone networks. Such data generation applications can improve access to hard-to-reach populations around the world.

Call to Action

Much of the literature promotes a necessary and strong call to action for social workers to engage in data science. These calls span from social work curriculum to theory development and to the deployment of data science-involved interventions (Gillingham, 2021; Goldkind et al., 2021; Lanier et al., 2020; Steiner, 2021). Regarding social welfare/work education, both MSW and PhD curricula require innovative social work technical training with data science. While current MSW curricula include traditional hypothesis testing, data science would introduce education on discovering insights based on data acquisition, preparation, usage, and governance (Perron et al., 2022). Given the higher level of research in PhD programs, there are recommendations that social work scholars focus on developing a rigorous understanding of and ability to use big data (Brown, 2017) as well as collaborate and learn with peers from different disciplines (Papagiannidis et al., 2023). Several universities with schools of social work also host social justice-oriented data science programs. Such programs offer training in data science methods with an emphasis on ethics, which is part and parcel of social work as a research field and profession. Overall, it is imperative that social work education provides rigorous yet tailored training, equipping students to collect new data sources, utilize cyber infrastructures, and apply various ML models for downstream tasks that align with their research interests.

As powerful as data science techniques are, social workers must also engage in promoting data justice by understanding who will be left behind, who is hidden, and how equity can be ensured (Goldkind et al., 2021). The ramifications of data science adoption on hidden populations, phenomena, or societal costs are currently not well studied or documented. These ramifications may arise at the algorithm specification stage, during policy development, at the evaluation of outcomes stage, or at the unintended consequences stage (Lanier et al., 2020). The development of algorithmic decision-making has specifically been highlighted as an important call to action for social workers (Gillingham, 2021). While there is skepticism, machine learning can outperform many tools and improve the impact of social workers in various contexts. Importantly, prediction performance can be more meaningful when researchers understand the underlying assumptions (Athey, 2017), specify payoffs, and establish counterfactuals (Kleinberg et al., 2018). Overall, social workers are uniquely positioned to use their interdisciplinary training not only to apply data science frameworks and methodologies but also to develop them.

Further Reading

References