<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<!DOCTYPE GmsArticle SYSTEM "http://www.egms.de/dtd/2.0.34/GmsArticle.dtd">
<GmsArticle xmlns:xlink="http://www.w3.org/1999/xlink">
  <MetaData>
    <Identifier>25gmds135</Identifier>
    <IdentifierDoi>10.3205/25gmds135</IdentifierDoi>
    <IdentifierUrn>urn:nbn:de:0183-25gmds1359</IdentifierUrn>
    <ArticleType>Meeting Abstract</ArticleType>
    <TitleGroup>
      <Title language="en">Improving the detection of privacy risk in synthetic EHRs</Title>
    </TitleGroup>
    <CreatorList>
      <Creator>
        <PersonNames>
          <Lastname>Gerloff</Lastname>
          <LastnameHeading>Gerloff</LastnameHeading>
          <Firstname>Xenia F.</Firstname>
          <Initials>XF</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Gr&#246;&#223;ler</Lastname>
          <LastnameHeading>Gr&#246;&#223;ler</LastnameHeading>
          <Firstname>Michael</Firstname>
          <Initials>M</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
      <Creator>
        <PersonNames>
          <Lastname>Riemann</Lastname>
          <LastnameHeading>Riemann</LastnameHeading>
          <Firstname>Layla Tabea</Firstname>
          <Initials>LT</Initials>
        </PersonNames>
        <Address>
          <Affiliation>Institute for Applied Medical Informatics (IAM), Center for Experimental Medicine, University Hospital Hamburg-Eppendorf (UKE), Hamburg, Germany</Affiliation>
        </Address>
        <Creatorrole corresponding="no" presenting="no">author</Creatorrole>
      </Creator>
    </CreatorList>
    <PublisherList>
      <Publisher>
        <Corporation>
          <Corporatename>German Medical Science GMS Publishing House</Corporatename>
        </Corporation>
        <Address>D&#252;sseldorf</Address>
      </Publisher>
    </PublisherList>
    <SubjectGroup>
      <SubjectheadingDDB>610</SubjectheadingDDB>
      <Keyword language="en">synthetic data</Keyword>
      <Keyword language="en">privacy</Keyword>
      <Keyword language="en">electronic health records</Keyword>
      <Keyword language="en">attribute inference attack</Keyword>
    </SubjectGroup>
    <DatePublishedList>
      <DatePublished>20251103</DatePublished>
    </DatePublishedList>
    <Language>engl</Language>
    <License license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
      <AltText language="en">This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 License.</AltText>
      <AltText language="de">Dieser Artikel ist ein Open-Access-Artikel und steht unter den Lizenzbedingungen der Creative Commons Attribution 4.0 License (Namensnennung).</AltText>
    </License>
    <SourceGroup>
      <Meeting>
        <MeetingId>M0631</MeetingId>
        <MeetingSequence>135</MeetingSequence>
        <MeetingCorporation>Deutsche Gesellschaft f&#252;r Medizinische Informatik, Biometrie und Epidemiologie</MeetingCorporation>
        <MeetingName>70. Jahrestagung der Deutschen Gesellschaft f&#252;r Medizinische Informatik, Biometrie und Epidemiologie e. V. (GMDS)</MeetingName>
        <MeetingTitle></MeetingTitle>
        <MeetingSession>PS 6: Synthetic data, privacy &#38; consent</MeetingSession>
        <MeetingCity>Jena</MeetingCity>
        <MeetingDate>
          <DateFrom>20250907</DateFrom>
          <DateTo>20250911</DateTo>
        </MeetingDate>
      </Meeting>
    </SourceGroup>
    <ArticleNo>Abstr. 213</ArticleNo>
  </MetaData>
  <OrigData>
    <TextBlock name="Text" linked="yes">
      <MainHeadline>Text</MainHeadline><Pgraph><Mark1>Introduction:</Mark1> Hospitals routinely store electronic health records (EHRs), but privacy protection laws allow these data to be shared only under severe constraints. Recent research aims to solve this problem by fitting a generative model to a cohort in a secure environment and creating synthetic EHRs that share the statistical properties of the original cohort while preserving the real patients&#8217; privacy. Easy access to synthetic EHRs from different hospitals would accelerate research on rare diseases and facilitate the training of AI models that assist medical professionals, for example in deciding on treatments. However, before synthetic EHRs are shared, their privacy risk must be measured by running so-called privacy tests. Here, we propose an interpretable definition of the data&#8217;s privacy risk and a novel privacy test that is unique in its ability to make statistically precise statements.</Pgraph><Pgraph><Mark1>State of the art:</Mark1> Privacy tests usually assess how successfully an attacker can use the synthetic data set to gain information on individual patients and compare this success to a baseline. Most recent publications on synthetic EHRs <TextLink reference="1"></TextLink>, <TextLink reference="2"></TextLink>, <TextLink reference="3"></TextLink>, <TextLink reference="4"></TextLink>, <TextLink reference="5"></TextLink> tested an Attribute Inference Attack (AIA), in which the attacker knows some attributes of a real patient and infers the missing attributes from the synthetic records that match this partial knowledge. So far, there is no consensus on how to choose the known attributes. The baseline is the average success of the same attack when it is based on an independent real data set instead of the synthetic data.</Pgraph><Pgraph><Mark1>Concept:</Mark1> The ideal synthetic data set consists of independently drawn samples from the true distribution of the real data.
This ideal poses no privacy risk to any patient in the original cohort. Hence, we define privacy risk as the difference between an attack&#8217;s success based on the synthetic data and its success based on the true distribution.</Pgraph><Pgraph>We apply this definition to AIAs by setting the baseline to a lower bound on the attack&#8217;s success based on the true distribution. The bound is derived from confidence intervals computed on the real data set. Additionally, we automate the selection of known attributes: attributes are added at random until no patient has any matching synthetic record, and this process is repeated to increase the trustworthiness of the test. Our privacy test bounds the synthetic data&#8217;s deviation from the minimum privacy risk with respect to AIAs at an explicit confidence level and systematizes the selection of known attributes.</Pgraph><Pgraph><Mark1>Implementation:</Mark1> We implemented the test in a Python package (<Hyperlink href="https:&#47;&#47;github.com&#47;xeniagerloff&#47;StatAIT">https:&#47;&#47;github.com&#47;xeniagerloff&#47;StatAIT</Hyperlink>) that offers automated privacy tests with minimal hyperparameter choices and tools for the detailed analysis of the results.</Pgraph><Pgraph><Mark1>Lessons learned:</Mark1> To the best of our knowledge, we are the first to formulate a privacy test that bounds the privacy risk posed by a synthetic data set at an explicit confidence level. Our Python package enables rigorous and interpretable privacy assessment, benefiting the medical community and patients by increasing trust in synthetic EHRs. In addition to the significant benefits of facilitated data sharing, privacy-preserving synthetic data will further improve patient support for data-driven research.</Pgraph><Pgraph>The authors declare that they have no competing interests.</Pgraph><Pgraph>The authors declare that an ethics committee vote is not required.</Pgraph></TextBlock>
    <References linked="yes">
      <Reference refNo="1">
        <RefAuthor>Yuan H</RefAuthor>
        <RefAuthor>Zhou S</RefAuthor>
        <RefAuthor>Yu S</RefAuthor>
        <RefTitle>EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models &#91;Preprint&#93;</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>arXiv</RefJournal>
        <RefPage></RefPage>
        <RefTotal>Yuan H, Zhou S, Yu S. EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models &#91;Preprint&#93;. arXiv. 2023. DOI: 10.48550&#47;arxiv.2303.05656</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.48550&#47;arxiv.2303.05656</RefLink>
      </Reference>
      <Reference refNo="2">
        <RefAuthor>Sun H</RefAuthor>
        <RefAuthor>Lin H</RefAuthor>
        <RefAuthor>Yan R</RefAuthor>
        <RefTitle>Collaborative synthesis of patient records through multi-visit health state inference</RefTitle>
        <RefYear>2024</RefYear>
        <RefJournal>Proceedings of the AAAI Conference on Artificial Intelligence</RefJournal>
        <RefPage>19044-52</RefPage>
        <RefTotal>Sun H, Lin H, Yan R. Collaborative synthesis of patient records through multi-visit health state inference. Proceedings of the AAAI Conference on Artificial Intelligence. 2024;38(17):19044-52. DOI: 10.1609&#47;aaai.v38i17.29871</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1609&#47;aaai.v38i17.29871</RefLink>
      </Reference>
      <Reference refNo="3">
        <RefAuthor>Yoon J</RefAuthor>
        <RefAuthor>Mizrahi M</RefAuthor>
        <RefAuthor>Ghalaty NF</RefAuthor>
        <RefAuthor>Jarvinen T</RefAuthor>
        <RefAuthor>Ravi AS</RefAuthor>
        <RefAuthor>Brune P</RefAuthor>
        <RefAuthor>Kong F</RefAuthor>
        <RefAuthor>Anderson D</RefAuthor>
        <RefAuthor>Lee G</RefAuthor>
        <RefAuthor>Meir A</RefAuthor>
        <RefAuthor>Bandukwala F</RefAuthor>
        <RefTitle>EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>NPJ digital medicine</RefJournal>
        <RefPage>141</RefPage>
        <RefTotal>Yoon J, Mizrahi M, Ghalaty NF, Jarvinen T, Ravi AS, Brune P, Kong F, Anderson D, Lee G, Meir A, Bandukwala F. EHR-Safe: generating high-fidelity and privacy-preserving synthetic electronic health records. NPJ digital medicine. 2023;6(1):141. DOI: 10.1038&#47;s41746-023-00888-7</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1038&#47;s41746-023-00888-7</RefLink>
      </Reference>
      <Reference refNo="4">
        <RefAuthor>Theodorou B</RefAuthor>
        <RefAuthor>Xiao C</RefAuthor>
        <RefAuthor>Sun J</RefAuthor>
        <RefTitle>Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model</RefTitle>
        <RefYear>2023</RefYear>
        <RefJournal>Nat Commun</RefJournal>
        <RefPage>5305</RefPage>
        <RefTotal>Theodorou B, Xiao C, Sun J. Synthesize high-dimensional longitudinal electronic health records via hierarchical autoregressive language model. Nat Commun. 2023;14(1):5305. DOI: 10.1038&#47;s41467-023-41093-0</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1038&#47;s41467-023-41093-0</RefLink>
      </Reference>
      <Reference refNo="5">
        <RefAuthor>Das T</RefAuthor>
        <RefAuthor>Wang Z</RefAuthor>
        <RefAuthor>Sun J</RefAuthor>
        <RefTitle>TWIN: Personalized clinical trial digital twin generation</RefTitle>
        <RefYear>2023</RefYear>
        <RefBookTitle>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</RefBookTitle>
        <RefPage>402-13</RefPage>
        <RefTotal>Das T, Wang Z, Sun J. TWIN: Personalized clinical trial digital twin generation. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Long Beach CA USA: ACM; 2023. p. 402-13. DOI: 10.1145&#47;3580305.3599534</RefTotal>
        <RefLink>https:&#47;&#47;doi.org&#47;10.1145&#47;3580305.3599534</RefLink>
      </Reference>
    </References>
    <Media>
      <Tables>
        <NoOfTables>0</NoOfTables>
      </Tables>
      <Figures>
        <NoOfPictures>0</NoOfPictures>
      </Figures>
      <InlineFigures>
        <NoOfPictures>0</NoOfPictures>
      </InlineFigures>
      <Attachments>
        <NoOfAttachments>0</NoOfAttachments>
      </Attachments>
    </Media>
  </OrigData>
</GmsArticle>