In today’s fast-paced digital landscape, enterprises are increasingly moving their data ecosystems from on-premises infrastructure to hybrid or cloud-native architectures.
This shift is driven by the need for scalability, flexibility, and cost efficiency.
A critical part of this transformation involves migrating extract, transform, load (ETL) processes from legacy platforms like Informatica PowerCenter, IBM DataStage, and Ab Initio to modern, cloud-agnostic solutions such as PySpark. PySpark, with its combination of Python’s simplicity and Apache Spark’s distributed processing power, has become a go-to choice for modern data processing.
However, this migration is far from straightforward. Legacy ETL tools rely on graphical interfaces and metadata-driven workflows, while PySpark requires programmatic implementation using Python and Spark SQL. The structural and functional differences between these technologies make ETL migration a complex and resource-intensive process.
Large language models (LLMs) can help automate this transition. These artificial intelligence (AI) models can analyse XML files exported from legacy ETL tools and generate equivalent PySpark SQL scripts.
By acting as co-pilots for developers, LLMs can streamline the migration process, enabling faster and more efficient transitions to modern data platforms.
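To make the idea concrete, the sketch below shows what such a pipeline does in miniature. The XML structure, attribute names, and mapping here are hypothetical simplifications for illustration, not the actual export schema of any of the tools named above; a real export is far richer, which is precisely where an LLM (rather than a hand-written converter) earns its keep.

```python
import xml.etree.ElementTree as ET

# Hypothetical, drastically simplified mapping export. Real PowerCenter or
# DataStage XML is far more complex and verbose.
MAPPING_XML = """
<mapping name="m_orders">
  <source table="raw_orders"/>
  <target table="curated_orders"/>
  <field source="ord_id" target="order_id"/>
  <field source="ord_amt" target="order_amount"/>
</mapping>
"""

def generate_pyspark(mapping_xml: str) -> str:
    """Emit a PySpark snippet for a simple source-to-target field mapping."""
    root = ET.fromstring(mapping_xml)
    source = root.find("source").attrib["table"]
    target = root.find("target").attrib["table"]
    selects = ", ".join(
        f'col("{f.attrib["source"]}").alias("{f.attrib["target"]}")'
        for f in root.findall("field")
    )
    return (
        "from pyspark.sql.functions import col\n"
        f'df = spark.table("{source}").select({selects})\n'
        f'df.write.mode("overwrite").saveAsTable("{target}")'
    )

print(generate_pyspark(MAPPING_XML))
```

A deterministic helper like this only covers trivial mappings; the LLM's role is to handle the long tail of transformations that resist rule-based translation.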
While LLMs hold great promise, they are not without limitations.
Runtime errors and inaccuracies often arise in the generated PySpark SQL code, stemming from the following challenges:
1. Hallucinations and incomplete code
LLMs may misinterpret the flow of ETL processes, leading to incorrect or incomplete outputs. Instruction overload (too many instructions in a single prompt), mapping flow issues, and complex column renaming logic can also result in hallucinations.
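One cheap guard against hallucinated columns is to cross-check every column referenced in the generated code against the source metadata before running it. The sketch below assumes a simple convention (column references appear as `col("...")` calls, which matches the style of code an LLM is typically prompted to produce); both the regex and the example names are illustrative assumptions.

```python
import re

def find_unknown_columns(generated_code: str, known_columns: set) -> set:
    """Return col("...") references in generated PySpark code that do not
    exist in the source metadata -- a cheap check for hallucinated columns."""
    referenced = set(re.findall(r'col\("([^"]+)"\)', generated_code))
    return referenced - known_columns

# Hypothetical example: the model invented "cust_nm", which the source
# table does not actually contain.
known = {"ord_id", "ord_amt", "cust_id"}
code = 'df.select(col("ord_id"), col("cust_nm").alias("customer_name"))'
print(find_unknown_columns(code, known))  # -> {'cust_nm'}
```

Flagged columns can then be fed back to the model for correction, or routed to a developer for review, rather than surfacing later as runtime errors.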
2. Complex transformations
Intricate logic, such as reusable transformations, custom SQL overrides, and sequence generators, can be difficult for LLMs to interpret accurately.
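Sequence generators are a good example of why: they have no direct PySpark counterpart, so the migration must pick a semantic translation. One common choice, assumed here, is a window-based row number offset by the configured start value, which a generation step can emit as a Spark SQL expression:

```python
def sequence_generator_expr(start: int = 1, increment: int = 1,
                            order_col: str = "order_id") -> str:
    """Translate a legacy sequence-generator transformation into a
    Spark SQL expression. Unlike a true sequence, this is recomputed on
    every run, so it only replaces surrogate keys that do not need to
    stay stable across loads -- a judgment call the migration must make."""
    return (f"{start} + ({increment} * "
            f"(row_number() OVER (ORDER BY {order_col}) - 1))")

print(sequence_generator_expr(start=1000, increment=1))
# -> 1000 + (1 * (row_number() OVER (ORDER BY order_id) - 1))
```

An LLM asked to perform this translation without such guidance may invent a nonexistent sequence function or silently change key semantics, which is why these constructs need explicit handling.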
3. Token size limitations
Large XML files often exceed the token size limits of LLMs, requiring additional processing techniques like recursive chunking.
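A minimal sketch of recursive chunking: split an oversized export at element boundaries, descending into children when an element is itself too large. Character count stands in for a real tokenizer here, and the flat splitting of context (parent attributes are dropped from child chunks) is a simplification a production pipeline would need to address.

```python
import xml.etree.ElementTree as ET

def chunk_xml(xml_text: str, max_chars: int) -> list:
    """Recursively split an XML document into serialised fragments no
    larger than max_chars. Leaf elements that still exceed the budget
    are returned as-is, since they cannot be split structurally."""
    def split(elem):
        serialized = ET.tostring(elem, encoding="unicode")
        if len(serialized) <= max_chars or len(elem) == 0:
            return [serialized]
        chunks = []
        for child in elem:
            chunks.extend(split(child))
        return chunks
    return split(ET.fromstring(xml_text))

# Hypothetical oversized export: 50 fields under one mapping element.
doc = ("<mapping>" + "".join(f"<field n='{i}'/>" for i in range(50))
       + "</mapping>")
chunks = chunk_xml(doc, max_chars=100)
assert all(len(c) <= 100 for c in chunks)
```

Each chunk can then be sent to the model separately, with the results stitched back together afterwards.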
4. Platform-specific issues
Generated code that looks correct in isolation can still fail at runtime, because behaviour differs between the legacy engine and the target Spark environment in which the code ultimately runs.
While LLMs offer significant potential to automate this process and make it more efficient, their limitations highlight the need for a hybrid approach.
By combining AI-driven automation with domain expertise and custom toolsets, organisations can overcome the challenges of ETL migration and unlock the full potential of modern data platforms. This approach ensures that migrated ETL processes maintain their original functionality while leveraging the scalability and flexibility of PySpark.
As LLMs continue to evolve, their capabilities will only improve, reducing the need for manual intervention and enabling even more efficient migrations. The future of ETL modernisation lies in embracing AI-driven solutions while maintaining a strategic focus on accuracy, scalability, and innovation.