How to import millions of rows from Excel into a database

Scenario analysis

This case touches on many aspects and is worth analyzing systematically. Let's first look at the problems you may encounter when reading millions of rows from Excel and inserting them into a database:

  1. Memory overflow risk

Loading such a large amount of Excel data may cause memory overflow, so you need to pay attention to memory management.

  2. Performance bottleneck

Reading and inserting millions of rows is time-consuming, so performance optimization is critical.

  3. Exception handling strategy

Many things can go wrong while reading and importing, so the various abnormal situations need to be handled properly.

Memory overflow problem

Loading millions of rows into memory at once is obviously unrealistic. The solution is to read the file as a stream and process the data in batches.

In terms of technology selection, EasyExcel is a wise choice. It is optimized for large data volumes and complex Excel files: instead of loading the entire file into memory at once, it reads from disk and parses the data row by row.
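As a quick illustration of that streaming behavior, here is a minimal read using the PageReadListener bundled with EasyExcel 3.x (the file name and DemoData model are placeholders, not from the original article):

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.read.listener.PageReadListener;

public class StreamingReadExample {

    // Placeholder model; fields are mapped to Excel columns via @ExcelProperty
    public static class DemoData {
        @ExcelProperty("name")
        private String name;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        // Rows are parsed one at a time and handed to the listener in pages
        // (100 rows by default), so the whole file never sits in memory.
        EasyExcel.read("demo.xlsx", DemoData.class,
                new PageReadListener<DemoData>(rows ->
                        System.out.println("Read a page of " + rows.size() + " rows")))
                .sheet()
                .doRead();
    }
}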

Performance issues

For millions of rows, a single thread is obviously inefficient. The key to improving performance is multi-threading.

Multi-threading comes into play in two places: reading the file with multiple threads, and inserting the data with multiple threads. Together these form a producer-consumer model, with multi-threaded reads and multi-threaded inserts maximizing overall throughput.
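A minimal sketch of that handoff, assuming a bounded BlockingQueue between the reading and inserting threads (the class and method names are illustrative; MyDataModel and MyDataService are the types used in the implementation later in this article):

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative producer-consumer wiring, not part of the final implementation
public class ProducerConsumerSketch {
    // Bounded queue: at most 10 batches buffered between readers and writers
    private final BlockingQueue<List<MyDataModel>> queue = new ArrayBlockingQueue<>(10);

    // Producer side: called by an Excel-reading thread once a batch is full
    public void publish(List<MyDataModel> batch) throws InterruptedException {
        queue.put(batch); // blocks when the queue is full, throttling the readers
    }

    // Consumer side: run by each database-writing thread
    public void consume(MyDataService myDataService) throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            List<MyDataModel> batch = queue.take(); // blocks until a batch arrives
            myDataService.batchInsert(batch);
        }
    }
}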

For the inserts themselves, multi-threading should be combined with the database's batch-insert capability to further improve speed.

Error handling

While reading the file and writing to the database you may run into many problems, such as data format errors, inconsistencies, and duplicate rows.

Handling should therefore happen in two steps: first validate the data (format checks and the like) before inserting, then handle any exceptions raised during the insert itself.

There are several options here, including transaction rollback and logging. Rolling the operation back directly is generally not recommended; instead, retry automatically, and if the insert still fails after several attempts, log the failed batch so it can be re-imported later.

Duplicate data also needs to be considered: a few Excel fields can be mapped to a unique constraint in the database. When a conflict occurs you can overwrite, skip, or report an error; choose whichever fits the actual business. In most cases, skipping the row and logging it is the reasonable choice.
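For example, assuming MySQL and a unique key on column1 (the table and column names are the placeholders used later in this article), the skip and overwrite strategies look like this:

-- Skip rows that violate the unique key:
INSERT IGNORE INTO paidaxing_test_table_name (column1, column2)
VALUES ('a', 'b');

-- Or overwrite the existing row on conflict:
INSERT INTO paidaxing_test_table_name (column1, column2)
VALUES ('a', 'b')
ON DUPLICATE KEY UPDATE column2 = VALUES(column2);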

Solutions

So, the overall plan is as follows:

Use EasyExcel to read the Excel data, since it reads row by row instead of loading the entire file into memory at once. To improve concurrency, the millions of rows are spread across different worksheets, and a thread pool with multiple threads reads those worksheets simultaneously. During reading, the data is processed with the help of EasyExcel's ReadListener.

During processing, not every row touches the database directly, to avoid putting excessive pressure on it. Instead, set a batch size: rows read from Excel are temporarily buffered in memory (a List works), and after every 1,000 rows a batch insert is performed. MyBatis makes the batch insert straightforward.

Concurrency also needs attention here. If multiple threads share one in-memory buffer, it must be a thread-safe queue such as ConcurrentLinkedQueue; in the implementation below, each sheet gets its own listener instance, so a plain List per thread is sufficient.
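If several threads do share one buffer, a minimal sketch might look like this (the class is illustrative, not part of the implementation below):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative shared buffer: reader threads offer rows into one
// thread-safe queue; a drain step pulls them out in batches.
public class SharedBuffer<T> {
    private static final int BATCH_SIZE = 1000;
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();

    // Called by reader threads; offer on ConcurrentLinkedQueue is lock-free
    public void add(T item) {
        queue.offer(item);
    }

    // Drains up to BATCH_SIZE items; returns an empty list when the queue is empty
    public List<T> drainBatch() {
        List<T> batch = new ArrayList<>(BATCH_SIZE);
        T item;
        while (batch.size() < BATCH_SIZE && (item = queue.poll()) != null) {
            batch.add(item);
        }
        return batch;
    }
}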

In testing, the above solution read and inserted 1 million rows from Excel in roughly 100 seconds, well under 2 minutes.

Implementation

To improve concurrent processing, the millions of rows are stored in different worksheets within the same Excel file, and EasyExcel then reads those worksheets concurrently.

EasyExcel provides the ReadListener interface, which allows custom processing as each batch of rows is read. We can build the batched reading on top of it.

POM dependencies

First, you need to add the following dependencies:

<dependencies>
    <!-- EasyExcel -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>easyexcel</artifactId>
        <version>latest_version</version>
    </dependency>

    <!-- MyBatis integration, used for the batch inserts below -->
    <dependency>
        <groupId>org.mybatis.spring.boot</groupId>
        <artifactId>mybatis-spring-boot-starter</artifactId>
        <version>latest_version</version>
    </dependency>
</dependencies>
Read multiple sheets concurrently

Then implement the code to read multiple sheets concurrently:

import com.alibaba.excel.EasyExcel;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

@Service
public class ExcelImporterService {

    @Autowired
    private MyDataService myDataService;

    public void doImport() {
        // Path to the Excel file
        String filePath = "users/paidaxing/workspace/excel/test.xlsx";

        // Number of sheets to read
        int numberOfSheets = 20;

        // Create a fixed-size thread pool, one thread per sheet
        ExecutorService executor = Executors.newFixedThreadPool(numberOfSheets);

        // Iterate over all sheets
        for (int sheetNo = 0; sheetNo < numberOfSheets; sheetNo++) {
            // Variables captured by a lambda must be effectively final
            int finalSheetNo = sheetNo;

            // Submit one read task per sheet to the pool
            executor.submit(() -> {
                // Read the given sheet with EasyExcel
                EasyExcel.read(filePath, MyDataModel.class, new MyDataModelListener(myDataService))
                         .sheet(finalSheetNo) // select the sheet by index
                         .doRead(); // start reading
            });
        }

        // Initiate an orderly shutdown of the pool
        executor.shutdown();

        // Wait for all tasks to finish (or until interrupted)
        try {
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        } catch (InterruptedException e) {
            // Restore the interrupt status if interrupted while waiting
            e.printStackTrace();
            Thread.currentThread().interrupt();
        }
    }
}

This code creates a fixed-size thread pool to concurrently read an Excel file containing multiple sheets. The reading of each sheet is submitted to the thread pool as a separate task.

The code uses a MyDataModelListener, an implementation of ReadListener. As EasyExcel reads each row, it automatically calls the invoke method of the ReadListener instance we passed in; that is where we define how each row is processed.

MyDataModelListener also provides the doAfterAllAnalysed method, which is called once all rows have been read. This is where you can perform cleanup or flush any remaining buffered data.
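The listener's row type, MyDataModel, is not shown in the original; a minimal sketch, assuming two string columns mapped with EasyExcel's @ExcelProperty annotation and named to match the mapper below:

import com.alibaba.excel.annotation.ExcelProperty;

// Hypothetical row model; the field and header names are assumptions
public class MyDataModel {

    @ExcelProperty("column1")
    private String column1;

    @ExcelProperty("column2")
    private String column2;

    public String getColumn1() { return column1; }
    public void setColumn1(String column1) { this.column1 = column1; }

    public String getColumn2() { return column2; }
    public void setColumn2(String column2) { this.column2 = column2; }
}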

ReadListener

Next, let's implement our ReadListener:

import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.read.listener.ReadListener;
import java.util.ArrayList;
import java.util.List;

// Custom ReadListener that processes rows read from Excel
public class MyDataModelListener implements ReadListener<MyDataModel> {
    // Number of rows to accumulate before each batch insert
    private static final int BATCH_SIZE = 1000;
    // Buffer for rows until the batch size is reached
    private List<MyDataModel> batch = new ArrayList<>();

    private final MyDataService myDataService;

    // Constructor injection of the service that performs the inserts
    public MyDataModelListener(MyDataService myDataService) {
        this.myDataService = myDataService;
    }

    // Called once for every row that is read
    @Override
    public void invoke(MyDataModel data, AnalysisContext context) {
        // Validate the row before buffering it
        if (validateData(data)) {
            batch.add(data);
        } else {
            // Handle invalid rows here, e.g. log and skip
        }

        // Flush the buffer once the batch size is reached
        if (batch.size() >= BATCH_SIZE) {
            processBatch();
        }
    }

    private boolean validateData(MyDataModel data) {
        // Duplicate check: query whether this row already exists
        int count = myDataService.countByColumn1(data.getColumn1());
        if (count == 0) {
            // Additional format and consistency checks could go here
            return true;
        }
        // Row already exists, so skip it
        return false;
    }

    // Called after all rows have been read
    @Override
    public void doAfterAllAnalysed(AnalysisContext context) {
        // Flush any remaining rows
        if (!batch.isEmpty()) {
            processBatch();
        }
    }

    // Insert one batch, retrying up to three times before logging the failure
    private void processBatch() {
        int retryCount = 0;
        while (retryCount < 3) {
            try {
                // Attempt the batch insert
                myDataService.batchInsert(batch);
                break;
            } catch (Exception e) {
                retryCount++;
                // After three failed attempts, log the error and the batch
                if (retryCount >= 3) {
                    logError(e, batch);
                }
            }
        }
        // Clear the buffer for the next batch, whether it succeeded or was logged
        batch.clear();
    }

    // Record the exception and the failed rows so they can be re-imported later
    private void logError(Exception e, List<MyDataModel> failedBatch) {
        e.printStackTrace();
        // In production, write the failed rows to a log or table for replay
    }
}

By customizing MyDataModelListener, rows can be processed while the Excel file is being read. Each row is validated and added to the buffer; once the buffer reaches 1,000 rows, a batch insert is performed. If an insert fails, it is retried; if it still fails after three attempts, the error and the failed batch are logged.

Batch insert

Here, MyBatis batch insertion is used. The mapper interface is as follows:

import org.apache.ibatis.annotations.Mapper;
import java.util.List;

@Mapper
public interface MyDataMapper {
    void batchInsert(List<MyDataModel> dataList);

    int countByColumn1(String column1);
}

And the corresponding mapper.xml file:

<insert id="batchInsert" parameterType="list">
    INSERT INTO paidaxing_test_table_name (column1, column2, ...)
    VALUES 
    <foreach collection="list" item="item" index="index" separator=",">
        (#{item.column1}, #{item.column2}, ...)
    </foreach>
</insert>

<select id="countByColumn1" resultType="int">
    SELECT COUNT(*) FROM paidaxing_test_table_name WHERE column1 = #{column1}
</select>
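The listener above calls MyDataService rather than the mapper directly. That class isn't shown either; a minimal sketch, assuming it simply delegates to MyDataMapper with one transaction per batch:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import java.util.List;

// Hypothetical service wrapper around the MyBatis mapper
@Service
public class MyDataService {

    @Autowired
    private MyDataMapper myDataMapper;

    // Each batch runs in its own transaction, so a failed batch can be
    // retried without rolling back batches that already committed
    @Transactional
    public void batchInsert(List<MyDataModel> dataList) {
        myDataMapper.batchInsert(dataList);
    }

    public int countByColumn1(String column1) {
        return myDataMapper.countByColumn1(column1);
    }
}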