<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<!DOCTYPE GmsArticle SYSTEM "http://www.egms.de/dtd/2.0.34/GmsArticle.dtd">
<GmsArticle xmlns:xlink="http://www.w3.org/1999/xlink">
  <MetaData>
    <Identifier>25gmds093</Identifier>
    <IdentifierDoi>10.3205/25gmds093</IdentifierDoi>
    <IdentifierUrn>urn:nbn:de:0183-25gmds0930</IdentifierUrn>
    <ArticleType>Meeting Abstract</ArticleType>
    <TitleGroup>
      <Title language="en">Evaluating the Utility of Synthetic Data in Medical Machine Learning Using Pairwise Correlation Distance</Title>
    </TitleGroup>
    <CreatorList>
      <Creator>
        <PersonNames>
          <Lastname>Gamisch</Lastname>
          <LastnameHeading>Gamisch</LastnameHeading>
          <Firstname>John</Firstname>
          <Initials>J</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Leipzig University, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany</Affiliation>
          <Affiliation>Leipzig University Medical Center, Dept. Medical Data Science, Leipzig, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Sadeghi</Lastname>
          <LastnameHeading>Sadeghi</LastnameHeading>
          <Firstname>Sina</Firstname>
          <Initials>S</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Leipzig University, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany</Affiliation>
          <Affiliation>Leipzig University Medical Center, Dept. Medical Data Science, Leipzig, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Kirsten</Lastname>
          <LastnameHeading>Kirsten</LastnameHeading>
          <Firstname>Toralf</Firstname>
          <Initials>T</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Leipzig University, Institute for Medical Informatics, Statistics, and Epidemiology, Leipzig, Germany</Affiliation>
          <Affiliation>Leipzig University Medical Center, Dept. Medical Data Science, Leipzig, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
    </CreatorList>
    <PublisherList>
      <Publisher>
        <Corporation>
          <Corporatename>German Medical Science GMS Publishing House</Corporatename>
        </Corporation>
        <Address>D&#252;sseldorf</Address>
      </Publisher>
    </PublisherList>
    <SubjectGroup>
      <SubjectheadingDDB>610</SubjectheadingDDB>
      <Keyword language="en">Synthetic Data</Keyword>
      <Keyword language="en">Machine Learning</Keyword>
      <Keyword language="en">Classification</Keyword>
    </SubjectGroup>
    <DatePublishedList>
      <DatePublished>20251103</DatePublished>
    </DatePublishedList>
    <Language>engl</Language>
    <License license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
      <AltText language="en">This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.</AltText>
      <AltText language="de">Dieser Artikel ist ein Open-Access-Artikel und steht unter den Lizenzbedingungen der Creative Commons Attribution 4.0 License (Namensnennung).</AltText>
    </License>
    <SourceGroup>
      <Meeting>
        <MeetingId>M0631</MeetingId>
        <MeetingSequence>093</MeetingSequence>
        <MeetingCorporation>Deutsche Gesellschaft f&#252;r Medizinische Informatik, Biometrie und Epidemiologie</MeetingCorporation>
        <MeetingName>70. Jahrestagung der Deutschen Gesellschaft f&#252;r Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)</MeetingName>
        <MeetingTitle></MeetingTitle>
        <MeetingSession>V: Synthetic data and de-identification</MeetingSession>
        <MeetingCity>Jena</MeetingCity>
        <MeetingDate>
          <DateFrom>20250907</DateFrom>
          <DateTo>20250911</DateTo>
        </MeetingDate>
      </Meeting>
    </SourceGroup>
    <ArticleNo>Abstr. 327</ArticleNo>
  </MetaData>
  <OrigData>
    <TextBlock name="Text" linked="yes">
      <MainHeadline>Text</MainHeadline><Pgraph><Mark1>Introduction:</Mark1> The development of predictive models for medical applications is often hindered by the scarcity of high-quality datasets, particularly for rare diseases with few documented cases. Incorporating synthetic data (SD) generated by advanced generative models (GMs) is a promising solution to this challenge; however, the adoption of SD in the medical field remains limited due to concerns about data fidelity and practical utility <TextLink reference="1"></TextLink>. This study aims to address these uncertainties by rigorously evaluating the potential of SD for medical machine learning (ML). Specifically, we investigate whether Pairwise Correlation Distance (PCD) can serve as a reliable and practical utility metric for assessing the quality and predictive value of SD in medical ML applications. PCD quantifies the similarity between synthetic and real datasets by measuring the difference between their pairwise feature correlation matrices <TextLink reference="2"></TextLink>. </Pgraph><Pgraph><Mark1>Methods:</Mark1> The study uses the Pima Indians Diabetes Database <TextLink reference="3"></TextLink>, comprising eight laboratory features for 768 female patients, along with a binary label. We implemented three GMs, namely CopulaGAN, Conditional Tabular GAN (CTGAN), and Tabular Variational Autoencoder (TVAE) <TextLink reference="4"></TextLink>, <TextLink reference="5"></TextLink>, and tuned their hyperparameters using statistical similarity measures. SD quality and utility are evaluated through a comprehensive correlation analysis based on two complementary approaches: (1) statistical similarity measures, and (2) classifier-based assessment, where models are trained on SD and tested on real data (RD) to evaluate predictive performance. Empirically, we derive a statistical metric that is indicative of SD utility and suitable for optimizing GM hyperparameters. 
A further extensive and safeguarded evaluation of SD realism aims to support the rationale for applying SD in medical ML. </Pgraph><Pgraph><Mark1>Results:</Mark1> Table 1 <ImgLink imgNo="1" imgType="table" /> presents intermediate results that illustrate the general trend observed. The hyperparameters of the GMs are optimized based on the achieved PCD to guide model selection. The SD generated by the optimized GMs is then used to train classifiers, which are subsequently evaluated on test RD using the area under the receiver operating characteristic curve (AUROC). Classifier performance when trained on RD is reported as the <Mark2>baseline</Mark2>. We report the percentage difference in AUROC relative to the baseline (&#8710; baseline) and to the default GM configuration (&#8710; default), as well as the PCD of the default and optimized GMs (PCD&#91;default, optimized&#93;).</Pgraph><Pgraph>Both evaluated GANs in particular benefit from PCD-based hyperparameter optimization. While the presented experiment uses SD and RD of equal size, increasing the SD size tenfold consistently yields classifier performance congruent with the baseline. </Pgraph><Pgraph><Mark1>Conclusion:</Mark1> Our findings demonstrate that PCD serves as a reliable and computationally efficient utility metric for SD across all evaluated generative models, as evidenced by its strong correlation with classification performance. Furthermore, PCD-based GM optimization yields classification results that are competitive with, or even superior to, those achieved using RD. However, it is important to note that PCD primarily captures linear correlations and may be insensitive to domain-specific or nonlinear dependencies. Despite this limitation, the consistently strong classification performance and the high-quality SD produced using PCD support the viability of this approach. 
Although beyond the scope of the present study, preliminary analysis suggests that PCD may contribute to a future framework for balancing privacy and utility in SD generation.</Pgraph><Pgraph>The authors declare that they have no competing interests.</Pgraph><Pgraph>The authors declare that an ethics committee vote is not required.</Pgraph></TextBlock>
    <References linked="yes">
      <Reference refNo="1">
        <RefAuthor>Kaabachi B</RefAuthor>
        <RefAuthor>Despraz J</RefAuthor>
        <RefAuthor>Meurers T</RefAuthor>
        <RefAuthor>Otte K</RefAuthor>
        <RefAuthor>Halilovic M</RefAuthor>
        <RefAuthor>Kulynych B</RefAuthor>
        <RefAuthor>Prasser F</RefAuthor>
        <RefAuthor>Raisaro JL</RefAuthor>
        <RefTitle>A scoping review of privacy and utility metrics in medical synthetic data</RefTitle>
        <RefYear>2025</RefYear>
        <RefJournal>NPJ Digit Med</RefJournal>
        <RefPage>60</RefPage>
        <RefTotal>Kaabachi B, Despraz J, Meurers T, Otte K, Halilovic M, Kulynych B, Prasser F, Raisaro JL. A scoping review of privacy and utility metrics in medical synthetic data. NPJ Digit Med. 2025 Jan 27;8(1):60. DOI: 10.1038&#47;s41746-024-01359-3</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.1038&#47;s41746-024-01359-3</RefLink>
      </Reference>
      <Reference refNo="2">
        <RefAuthor>Goncalves A</RefAuthor>
        <RefAuthor>Ray P</RefAuthor>
        <RefAuthor>Soper B</RefAuthor>
        <RefAuthor>Stevens J</RefAuthor>
        <RefAuthor>Coyle L</RefAuthor>
        <RefAuthor>Sales AP</RefAuthor>
        <RefTitle>Generation and evaluation of synthetic patient data</RefTitle>
        <RefYear>2020</RefYear>
        <RefJournal>BMC Med Res Methodol</RefJournal>
        <RefPage>108</RefPage>
        <RefTotal>Goncalves A, Ray P, Soper B, Stevens J, Coyle L, Sales AP. Generation and evaluation of synthetic patient data. BMC Med Res Methodol. 2020 May 7;20(1):108. DOI: 10.1186&#47;s12874-020-00977-1</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.1186&#47;s12874-020-00977-1</RefLink>
      </Reference>
      <Reference refNo="3">
        <RefAuthor>National Institute of Diabetes and Digestive and Kidney Diseases</RefAuthor>
        <RefTitle></RefTitle>
        <RefYear></RefYear>
        <RefBookTitle>Pima Indians Diabetes Database</RefBookTitle>
        <RefPage></RefPage>
        <RefTotal>National Institute of Diabetes and Digestive and Kidney Diseases. Pima Indians Diabetes Database. &#91;Accessed 2025 Apr 25&#93;. Available from: https:&#47;&#47;www.kaggle.com&#47;datasets&#47;uciml&#47;pima-indians-diabetes-database</RefTotal>
        <RefLink>https:&#47;&#47;www.kaggle.com&#47;datasets&#47;uciml&#47;pima-indians-diabetes-database</RefLink>
      </Reference>
      <Reference refNo="4">
        <RefAuthor>Patki N</RefAuthor>
        <RefAuthor>Wedge R</RefAuthor>
        <RefAuthor>Veeramachaneni K</RefAuthor>
        <RefTitle>The Synthetic Data Vault</RefTitle>
        <RefYear></RefYear>
        <RefBookTitle>2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2016 Oct 17-19; Montreal, QC, Canada</RefBookTitle>
        <RefPage>399-410</RefPage>
        <RefTotal>Patki N, Wedge R, Veeramachaneni K. The Synthetic Data Vault. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2016 Oct 17-19; Montreal, QC, Canada. p. 399-410. DOI: 10.1109&#47;DSAA.2016.49</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1109&#47;DSAA.2016.49</RefLink>
      </Reference>
      <Reference refNo="5">
        <RefAuthor>Xu L</RefAuthor>
        <RefAuthor>Skoularidou M</RefAuthor>
        <RefAuthor>Cuesta-Infante A</RefAuthor>
        <RefAuthor>Veeramachaneni K</RefAuthor>
        <RefTitle>Modeling Tabular Data Using Conditional GAN &#91;Preprint&#93;</RefTitle>
        <RefYear>2019</RefYear>
        <RefJournal>arXiv</RefJournal>
        <RefPage></RefPage>
        <RefTotal>Xu L, Skoularidou M, Cuesta-Infante A, Veeramachaneni K. Modeling Tabular Data Using Conditional GAN &#91;Preprint&#93;. arXiv. 2019. DOI: 10.48550&#47;arXiv.1907.00503</RefTotal>
        <RefLink>http:&#47;&#47;dx.doi.org&#47;10.48550&#47;arXiv.1907.00503</RefLink>
      </Reference>
    </References>
    <Media>
      <Tables>
        <Table format="png">
          <MediaNo>1</MediaNo>
          <MediaID>1</MediaID>
          <Caption><Pgraph><Mark1>Table 1: Classification performance (AUROC) of classifiers trained on SD from PCD-optimized GMs and evaluated on RD</Mark1></Pgraph></Caption>
        </Table>
        <NoOfTables>1</NoOfTables>
      </Tables>
      <Figures>
        <NoOfPictures>0</NoOfPictures>
      </Figures>
      <InlineFigures>
        <NoOfPictures>0</NoOfPictures>
      </InlineFigures>
      <Attachments>
        <NoOfAttachments>0</NoOfAttachments>
      </Attachments>
    </Media>
  </OrigData>
</GmsArticle>