Shadow AI: The hidden threat to enterprise adoption of generative AI

Generative artificial intelligence (GenAI) technologies, especially large language models (LLMs) like OpenAI's GPT-4, continue to attract interest from companies eager to gain a competitive advantage. Many businesses recognize the potential of these technologies to transform nearly every aspect of their operations. Yet despite the growing interest, enterprises remain clearly hesitant to adopt generative AI.

Data privacy is one of the biggest concerns for businesses. It is not just a problem to be managed; it is a critical element of doing business.

● 91% of organizations say they need to do more to reassure customers about how their data will be used by AI.

● 98% of organizations report privacy metrics to their board of directors.

● 94% of organizations said that their customers would not buy from them if their data was not adequately protected.

GenAI puts artificial intelligence capabilities into the hands of far more users. 92% of respondents believe that GenAI is a fundamentally different technology, with new challenges and issues that require new techniques to manage data and risk.

Additionally, we are seeing record-breaking fines imposed on businesses around the world for breaches of customer trust. For example:

● In September 2022, Instagram was fined 405 million euros (about $403 million) by the Irish Data Protection Commission (DPC) for violating the GDPR and infringing on children's privacy.

● In July 2022, Chinese ride-hailing company Didi Global Inc. (Didi) was fined 8.026 billion yuan (approximately US$1.18 billion) for violating cybersecurity and data protection laws.

● In the summer of 2021, a financial filing by retail giant Amazon revealed that Luxembourg's data protection authority had fined it 746 million euros ($877 million) for violating the GDPR.

The stakes for data privacy have never been higher.

The rise of shadow AI

As artificial intelligence continues its inexorable march into the enterprise, a threat lurks in the darkness that could undermine its widespread adoption: shadow AI.

Shadow AI is closely related to the "shadow IT" phenomenon of unauthorized software use: it refers to the deployment or use of artificial intelligence systems without organizational oversight. But the risks it poses to businesses are far greater.

Whether it happens out of convenience or ignorance, failing to properly manage AI development and use creates a ticking time bomb. As AI becomes more accessible through cloud services while remaining opaque, lax controls leave backdoors that can easily be abused.

Employees eager to gain an edge can easily paste corporate data into ChatGPT or Google Bard with good intentions, such as getting their work done faster and more efficiently. In the absence of a sanctioned, secure alternative, employees will turn to whatever tools are accessible.

Last spring, Samsung employees accidentally shared confidential information with ChatGPT on three separate occasions. The leaked material included source code and meeting minutes, which led the company to ban employees from using GenAI services.

Additionally, because GenAI APIs are easily accessible, software developers can readily integrate GenAI into their projects. This can add exciting new capabilities, but often at the expense of security best practices.

The Risks of Shadow AI

As the pressure to exploit GenAI increases, so do multiple threats.

Data breaches

The proliferation of GenAI tools is a double-edged sword. On the one hand, these tools offer exceptional capabilities for increasing productivity and promoting innovation. On the other hand, they pose significant data breach risks, especially in the absence of strong AI acceptable use policies (AUPs) and enforcement mechanisms. The ease of use of GenAI tools has led to a worrying trend: employees, driven by enthusiasm or the pursuit of efficiency, may inadvertently leak sensitive corporate data to third-party services.

It's not just regular knowledge workers using chatbots who slip up. Last year, Microsoft AI researchers accidentally exposed 38TB of data while publishing LLM training data to the developer platform GitHub. The exposed data included backups of Microsoft employees' personal workstations containing passwords and keys for Microsoft services, as well as more than 30,000 internal Microsoft Teams messages from 359 Microsoft employees.

Compliance violations

Shadow AI tools that have not been reviewed for compliance may violate regulations such as GDPR, resulting in legal consequences and fines. In addition to this, there is an increasing number of laws across multiple jurisdictions that businesses need to pay attention to.

The upcoming EU Artificial Intelligence Act makes the situation even more complicated. Failure to comply can result in fines ranging from 7.5 million euros or 1.5% of global turnover up to 35 million euros or 7% of global turnover, depending on the violation and the size of the business.

On January 29, the Italian Data Protection Authority (DPA, known as the Garante per la Protezione dei Dati Personali) notified OpenAI that it had breached data protection law. The Garante had temporarily banned OpenAI from processing data in March of the previous year; based on the results of its fact-finding exercise, the Italian DPA concluded that the available evidence showed OpenAI had violated provisions of the EU GDPR.

Demystifying Shadow Artificial Intelligence

Organizations need a privacy-preserving AI solution that bridges the gap between protecting privacy and realizing the full potential of LLMs.

Despite significant advances in AI technology, only a few AI-based applications have been successfully implemented by organizations to securely handle confidential and sensitive data. To protect privacy throughout the generative AI lifecycle, rigorous data security techniques must be in place so that every security-critical operation involving the model, and all confidential data used for training and inference, is performed safely and efficiently.

Data sanitization and anonymization are often proposed as methods to enhance data privacy. However, these methods may not be as effective as expected. Data sanitization, the process of removing sensitive information from data sets, can be undermined by the very nature of GenAI: models can often infer or reconstruct removed details from the context that remains.

Anonymization, the process of stripping personally identifiable information from data sets, also has shortcomings in the context of GenAI. Advanced artificial intelligence algorithms have demonstrated the ability to re-identify individuals within anonymized data sets. For example, research from Imperial College London has shown that machine learning models can re-identify individuals in anonymized data sets with astonishing accuracy. The study found that 99.98% of Americans could be re-identified in any given anonymized data set using just 15 characteristics, including age, gender and marital status.
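To see why so few attributes suffice, consider a simple uniqueness check over quasi-identifiers. The sketch below uses synthetic data and an illustrative choice of columns (not the study's actual dataset) to measure what fraction of records are pinned down uniquely by a handful of demographic fields.

```python
# Minimal sketch: how identifying are a few quasi-identifiers?
# Synthetic data and illustrative column choices; not the study's actual dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
records = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "gender": rng.choice(["F", "M"], n),
    "zip3": rng.integers(100, 1000, n),        # first three digits of a ZIP code
    "marital_status": rng.choice(["single", "married", "divorced", "widowed"], n),
    "birth_month": rng.integers(1, 13, n),
})

def unique_fraction(df: pd.DataFrame, quasi_identifiers: list[str]) -> float:
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    sizes = df.groupby(quasi_identifiers).size()
    return float((sizes == 1).sum() / len(df))

for cols in (["age", "gender"],
             ["age", "gender", "zip3"],
             ["age", "gender", "zip3", "marital_status", "birth_month"]):
    print(f"{len(cols)} attributes -> {unique_fraction(records, cols):.1%} of records are unique")
```

Each added attribute multiplies the number of possible combinations, so the share of records that are one of a kind climbs quickly toward 100%.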

Additionally, a study reported in MIT Technology Review highlights that individuals can be easily re-identified from anonymized databases, even if the data set is incomplete or altered. The use of machine learning models in this context demonstrates that current anonymization practices are insufficient to cope with the capabilities of modern artificial intelligence technologies.

These findings demonstrate the need for policymakers and technology experts to develop more robust privacy-preserving technologies to keep up with advances in artificial intelligence, as traditional methods such as data sanitization and anonymization are no longer sufficient to ensure data privacy in the GenAI era.

Better data privacy solutions in GenAI

Privacy-enhancing technologies (PETs) are considered the best solution for protecting data privacy in the GenAI space. By protecting data during processing while maintaining system functionality, PETs address data sharing, leakage, and privacy-regulation challenges.

Notable PETs include:

  • Homomorphic encryption: allows computations to be performed on encrypted data, with the results handled as if the inputs were plaintext. Limitations include slower performance and reduced query complexity, and data integrity risks remain.
  • Secure multi-party computation (MPC): enables multiple parties to jointly compute over encrypted data sets while keeping each party's data private. Disadvantages include performance degradation, especially in LLM training and inference.
  • Differential privacy: adds noise to data to prevent the re-identification of individual users, balancing privacy against analytical accuracy (a minimal sketch of this noise-addition idea follows this list). However, it can affect analysis accuracy and does not protect data during computation, so it needs to be used in conjunction with other PETs.
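As a concrete illustration of the differential privacy idea above, the following minimal sketch releases a noisy count using the Laplace mechanism. The data, query, and epsilon values are purely illustrative, not a production configuration.

```python
# Minimal sketch of differential privacy's core mechanism: add calibrated noise
# to a query result. Data, query, and epsilon values are purely illustrative.
import numpy as np

rng = np.random.default_rng(42)

def laplace_count(values: np.ndarray, predicate, epsilon: float) -> float:
    """Release an epsilon-differentially-private count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes the
    count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = int(predicate(values).sum())
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = rng.integers(18, 90, size=10_000)       # stand-in for a sensitive attribute
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(ages, lambda a: a > 65, epsilon=eps)
    print(f"epsilon={eps:>4}: noisy count of people over 65 ~= {noisy:.1f}")
```

Smaller epsilon means more noise and stronger privacy; larger epsilon means more accurate answers, which is exactly the privacy-versus-accuracy trade-off noted above.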

While each of the above techniques provides a way to protect sensitive data, none of them delivers the computational performance that generative AI models require. However, a newer approach called confidential computing uses hardware-based Trusted Execution Environments (TEEs) to prevent unauthorized access to, or modification of, applications and data while they are in use.

This prevents unauthorized entities such as the host operating system, hypervisor, system administrator, service provider, infrastructure owner, or anyone with physical access to the hardware from viewing or changing data or code in the environment. This hardware-based technology provides a secure environment to keep sensitive data safe.
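In practice, confidential computing deployments rely on remote attestation: the TEE produces a cryptographic measurement of the code it is running, and data owners release keys or data only if that measurement matches what they approved. The sketch below illustrates that gating pattern conceptually; the enclave names, measurements, and plain hashing are hypothetical stand-ins, not any vendor's actual attestation API, which relies on signed hardware quotes verified through an attestation service.

```python
# Conceptual sketch of attestation-gated key release, the access-control pattern
# behind confidential computing. All names and values are hypothetical stand-ins:
# real TEEs use signed hardware quotes verified by vendor attestation services,
# not a plain SHA-256 over the code.
import hashlib
import secrets

# The measurement the data owner expects: a digest of the approved enclave code.
APPROVED_ENCLAVE_CODE = b"model-serving-enclave v1.2 ...binary bytes..."
EXPECTED_MEASUREMENT = hashlib.sha256(APPROVED_ENCLAVE_CODE).hexdigest()

def release_data_key(reported_measurement: str) -> bytes | None:
    """Hand the data-decryption key to the TEE only if its measurement matches."""
    if secrets.compare_digest(reported_measurement, EXPECTED_MEASUREMENT):
        return secrets.token_bytes(32)   # key the enclave may use on the dataset
    return None                          # unknown or modified code gets nothing

# The genuine enclave reports the expected measurement and receives a key...
assert release_data_key(EXPECTED_MEASUREMENT) is not None
# ...while tampered code produces a different measurement and is refused.
tampered = hashlib.sha256(b"model-serving-enclave with a backdoor").hexdigest()
assert release_data_key(tampered) is None
```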

Confidential Computing as a Privacy-Preserving AI Solution

Confidential computing is an emerging standard in the technology industry that focuses on protecting data while in use. This concept extends data protection from data at rest and in transit to data in use, which is especially important in today's computing environments that span multiple platforms, from on-premises to cloud and edge computing.

This technology is critical for organizations that handle sensitive data such as personally identifiable information (PII), financial data, or health information, as threats to the confidentiality and integrity of data in system memory are a significant concern.

This cross-industry effort, coordinated by the Confidential Computing Consortium (CCC), is critical because of the complexity of confidential computing, which involves significant changes to hardware as well as to the structure of programs, operating systems, and virtual machines. The various projects under the CCC umbrella are advancing the field by developing open-source software and standards, which are critical for developers working to protect data in use.

Confidential computing can be implemented in different environments, including public clouds, on-premises data centers, and distributed edge locations. This technology is critical for data privacy and security, multi-party analytics, regulatory compliance, data localization, sovereignty and residency. It ensures that sensitive data is protected and compliant with local laws even in multi-tenant cloud environments.

The ultimate goal: confidential AI

A confidential AI solution is a secure platform that uses a hardware-based Trusted Execution Environment (TEE) to train and run machine learning models on sensitive data. The TEE enables training, fine-tuning, and inference without exposing sensitive data or proprietary models to unauthorized parties.

Data owners and users can run large language models (LLMs) on their data without revealing confidential information to unauthorized parties. Likewise, model owners can train their models while protecting their training data, model architecture, and parameters. In the event of a data breach, hackers can access only encrypted data, not the sensitive data protected within the TEE.

However, confidential computing alone cannot prevent a model from accidentally revealing details about its training data. Confidential computing can be combined with differential privacy to reduce this risk: the computation runs on the data inside the TEE, and differentially private noise is applied to model updates before they are published, reducing the risk of leakage during inference.
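What "applying differential privacy before publishing" can look like in practice is sketched below, in the spirit of DP-SGD: each example's gradient is clipped so no individual can dominate the update, and calibrated noise is added before the update leaves the trusted environment. The clip norm and noise multiplier here are illustrative values, not a tuned or recommended configuration.

```python
# Minimal sketch of privatizing a model update before it leaves the TEE, in the
# spirit of DP-SGD: clip each example's gradient, sum, add Gaussian noise, average.
# The clip norm and noise multiplier are illustrative, not tuned or recommended.
import numpy as np

rng = np.random.default_rng(7)

def privatized_update(per_example_grads: np.ndarray,
                      clip_norm: float = 1.0,
                      noise_multiplier: float = 1.1) -> np.ndarray:
    """Return a noisy, clipped average gradient that is safer to publish."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)

# Inside the TEE: per-example gradients computed on sensitive data (random stand-ins).
grads = rng.normal(size=(256, 10))           # 256 examples, 10 model parameters
update = privatized_update(grads)            # only this noisy average is published
print("update to publish:", np.round(update, 3))
```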

Additionally, a confidential AI platform helps LLM and data providers comply with privacy laws and regulations. By protecting confidential and proprietary data with advanced encryption and secure TEE technology, model builders and providers have far less to worry about regarding the amount and type of user data they collect.

Confidential computing technologies such as trusted execution environments lay the foundation for protecting privacy and intellectual property rights in AI systems. Confidential AI solutions combined with technologies like differential privacy and thoughtful data governance policies allow more organizations to benefit from AI while building stakeholder trust and transparency.

While there is still much work to be done, advances in cryptography, security hardware, and privacy-enhancing methods suggest that AI can be deployed ethically in the future. However, we must continue to advocate for responsible innovation and push for platforms to empower individuals and organizations to control how their sensitive data is used.