Using Machine Learning with Real-World Data to Identify Autism Risk in Children
Project Period: 3/14/22 – 2/29/24
Funding Source: NIH National Institute of Mental Health
Award Number: 1R21MH129682-01
Anticipated Award: $468,562
Delayed diagnosis and under-identification of autism spectrum disorder (ASD) have significant personal, public health, and economic impacts. Girls and Latino children are disproportionately affected, in part because clinicians are less likely to recognize ASD risk factors in them and refer them for an ASD evaluation. In this research proposal, we will use real-world data to (1) develop a computable phenotype for ASD using both structured (e.g., diagnosis codes) and unstructured (e.g., physician notes) electronic health record (EHR) data from two diverse locations (Children’s Hospital Los Angeles and the OneFlorida Data Trust), and (2) develop a machine-learning risk prediction model for ASD. This will lay the foundation for a clinical decision support tool, integrated into EHRs, that notifies a clinician when a child warrants ASD evaluation. The tool will be easily expandable, in a large-scale next-steps study, to the overall PCORnet, which provides healthcare to over 24 million children.
Abstract: Early and accurate identification of autism spectrum disorder (ASD) is important because ASD interventions can support positive long-term developmental outcomes. However, there is a delay of >2 years between the age at which children can reliably be diagnosed and the average age of diagnosis, and 1 in 4 U.S. children aged 8 with ASD have not been diagnosed. Girls and Latino children are disproportionately affected by delayed diagnosis and under-identification of ASD, in part because clinicians are less likely to recognize ASD risk factors in them and refer them for an ASD evaluation. Therefore, predicting ASD risk at a population level is needed to enhance early and accurate detection, particularly in these underserved populations. Researchers are beginning to harness clinical informatics methods to identify ASD from real-world data in electronic health records (EHRs), using both structured data (e.g., diagnosis codes) and unstructured data (e.g., physician notes). However, existing algorithms suffer from multiple major flaws, including non-representative training samples, outdated diagnosis codes and natural language processing (NLP) methods, and a lack of ‘verified’ ASD diagnoses in their gold standard datasets.
The proposed research addresses these gaps by developing a contemporary ASD risk model that uses state-of-the-art machine learning and NLP methods. Using EHR data from Children’s Hospital Los Angeles (including a gold standard dataset with ‘verified’ ASD diagnoses from the Boone Fetter Clinic) and the OneFlorida Data Trust (a Florida statewide EHR database), we will (1) develop a computable phenotype for ASD using both structured and unstructured EHR data (including parent-reported ASD discriminators and features associated with ASD that are often found in free text in children’s records), and (2) develop a machine-learning risk prediction model for ASD. This will lay the foundation for a clinical decision support tool, integrated into EHRs, that notifies a clinician when a child warrants ASD evaluation.
Such a tool has the potential to improve ASD identification in all children, and it may particularly benefit girls and Latino children, reducing sex and ethnic disparities. Further, it will be easily expandable, in a ‘next steps’ study, to the overall PCORnet, which provides healthcare to over 24 million children. By using EHRs, this proposal holds promise for future cost-effective health systems interventions that can help to correct a sociodemographic ‘imbalance’ in ASD research by reaching girls and Latino children at risk for ASD.