Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
232 views
in Technique[技术] by (71.8m points)

Looking for a better way to process large XML file with Laravel to import portions into MySQL database

I have a fairly large (~65MB) XML file with nearly 1 million lines and I am using Laravel to parse and process the contents and then insert the new data into a MySQL database.

It is a music library that I update regularly and the software I use generates this XML file.
The code itself works fine however it takes a very LONG time. Over 30 minutes to process around 50,000 records! I'm looking for a way to speed this up. I am using Laravel 6 on Ubuntu Server running Apache if that helps.

I basically read the XML file, extract what I need, sanitize the data a little, and then insert it into my database. Here is the relevant part of my code. Can anyone suggest a better way to make this more efficient? I'm not a Laravel expert, so any feedback would be great.

                use AppMusic;
                Music::truncate(); //clear existing data
                DB::disableQueryLog(); //helps speed up queries by disabling log
                ini_set('memory_limit', '512M'); //boost memory limit
                ini_set('max_execution_time', '90'); //try to prevent time-out

                //list of files to import (I sometimes have more than 1):
                $files = [                    
                    "path/to/my/database.xml",
                    "path/to/my/database2.xml"
                ];
                $video_files = ["mp4","mov","avi","flv"]; //used to identify music videos

                foreach($files as $file){
                    $reader = new XMLReader();
                    if(!$reader->open($file)){
                        die("Failed to open xml file!");
                    }
                    $doc = new DOMDocument;
                    while ($reader->read() && $reader->name !== 'Song');

                    while ($reader->name === 'Song'){
                        $song = simplexml_import_dom($doc->importNode($reader->expand(), true));
                        if(strpos($song['FilePath'], 'netsearch://') === false && strpos($song['FilePath'], ':/DJ Tools/') === false){

                            $music = new Music;  //create new instance of model

                            foreach ($song->Tags as $tag){
                                if(($tag['Author'] != "" || $tag['Title'] != "") && ($tag['Grouping'] != "Studio")){

                                    $insert = true; //insert record or not

                                    foreach($song->Infos as $info){
                                        $music->length = gmdate("H:i:s",floatval($info['SongLength']));
                                        $music->file_date = date("Y-m-d",intval($info['FirstSeen']));
                                    }

                                    if($insert == true){
                                        $music->bpm = ($tag['Bpm'] > 0) ? round(1 / floatval($tag['Bpm']) * 60) : null; //to calculate use 1/bpm * 60 and round
                                        $music->file_path = $song['FilePath'];
                                        $music->artist = trim($tag['Author']);
                                        $music->title = trim($tag['Title']);
                                        $music->remix = trim($tag['Remix']);
                                        $music->album = trim($tag['Album']);
                                        $music->genre = trim($tag['Genre']);
                                        $music->filetype = substr($song['FilePath'],-3);
                                        $music->year = ($tag['Year'] > 0) ? intval($tag['Year']) : null;

                                        //set the kind (audio, video or karaoke):
                                        if(strpos($song['FilePath'], '/Karaoke/') !== false){
                                            $kind = "karaoke";
                                        }
                                        
                                        elseif(in_array(strtolower(substr($song['FilePath'],-3)),$video_files)){
                                            $kind = "video";
                                        }
                                        else{
                                            $kind = "audio";
                                        }
                                        $music->kind = $kind;
                                        
                                        $music->save();  //adds song to mysql

                                    }//end if insert true
                                } //end has title or author + non-studio
                            } //end for each tag

                        } //end not a netsearch file
                        $reader->next('Song');
                    } //end while
                    $reader->close();
                } //end for each files

The structure of the XML file looks something like this:

 <Song FilePath="D:/Path/To/Music/Michael Jackson/The Ultimate Collection/2-03 Thriller.mp3" FileSize="12974048">
  <Tags Author="Michael Jackson" Title="Thriller" Genre="Pop" Album="The Ultimate Collection" Composer="Rod Temperton" TrackNumber="3/11" Grouping="Halloween" Year="2004" Bpm="0.504202" Key="G#m" Flag="1" />
  <Infos SongLength="356.960363" FirstSeen="1501430558" Bitrate="282" Cover="1" />
  <Comment>Great for parties</Comment>
  <Scan Version="801" Bpm="0.506077" AltBpm="0.379569" Volume="1.101067" Key="G#m" Flag="32768" />
  <Poi Pos="17.171541" Type="beatgrid" />
  <Poi Pos="0.634195" Type="automix" Point="realStart" />
  <Poi Pos="356.051882" Type="automix" Point="realEnd" />
  <Poi Pos="17.30" Type="automix" Point="fadeStart" />
  <Poi Pos="352.750" Type="automix" Point="fadeEnd" />
  <Poi Pos="41.695057" Type="automix" Point="cutStart" />
  <Poi Pos="343.074830" Type="automix" Point="cutEnd" />
  <Poi Pos="44.289569" Type="automix" Point="tempoStart" />
  <Poi Pos="298.550091" Type="automix" Point="tempoEnd" />
 </Song>
 <Song FilePath="D:/Path/To/Music/Black Sabbath/We Sold Our Soul for Rock &apos;n&apos; Roll/09 Sweet Leaf.m4a" FileSize="10799807">
  <Tags Author="Black Sabbath" Title="Sweet Leaf" Genre="Heavy Metal" Album="We Sold Our Soul For Rock &apos;n&apos; Roll" Composer="Geezer Butler" TrackNumber="9/14" Year="1987" Key="Am" Flag="1" />
  <Infos SongLength="306.456961" FirstSeen="1501430556" Bitrate="259" Cover="1" />
  <Scan Version="801" Bpm="0.411757" AltBpm="0.617438" Volume="0.680230" Key="Am" Flag="32768" />
  <Poi Pos="1.753537" Type="beatgrid" />
  <Poi Pos="0.220590" Type="automix" Point="realStart" />
  <Poi Pos="301.146848" Type="automix" Point="realEnd" />
  <Poi Pos="0.30" Type="automix" Point="fadeStart" />
  <Poi Pos="291.50" Type="automix" Point="fadeEnd" />
 </Song>
 ...tens of thousands of more songs, nearly 1 million lines
question from:https://stackoverflow.com/questions/65948260/looking-for-a-better-way-to-process-large-xml-file-with-laravel-to-import-portio

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you create an instance and insert a record in each loop, it will create a Music instance and 1 insert query for each loop and this is not efficient. What if you save data to an array first, chunk it, and then save it to database.

For example, you have 1000 music data, if you make music instance for each loop, it will create 1000 times of Music instance and 1000 times of insert operation in database. But if you save your music data first into an array and chunk it to 20 array (each array contains 50 music data) it will execute 20 insert operations only. Kinda more efficient, isn't?

So, your codes will looks like it :

<?php

use AppMusic;
Music::truncate(); //clear existing data
DB::disableQueryLog(); //helps speed up queries by disabling log
ini_set('memory_limit', '512M'); //boost memory limit
ini_set('max_execution_time', '90'); //try to prevent time-out

//list of files to import (I sometimes have more than 1):
$files = [                    
    "path/to/my/database.xml",
    "path/to/my/database2.xml"
];
$video_files = ["mp4","mov","avi","flv"]; //used to identify music videos

//declare array of music here
$arrayOfMusic = [];

foreach($files as $file){
    $reader = new XMLReader();
    if(!$reader->open($file)){
        die("Failed to open xml file!");
    }
    $doc = new DOMDocument;
    while ($reader->read() && $reader->name !== 'Song');

    while ($reader->name === 'Song') {
        $song = simplexml_import_dom($doc->importNode($reader->expand(), true));
        if(strpos($song['FilePath'], 'netsearch://') === false && strpos($song['FilePath'], ':/DJ Tools/') === false) {

            foreach ($song->Tags as $tag) {
                if (($tag['Author'] != "" || $tag['Title'] != "") && ($tag['Grouping'] != "Studio")) {

                    $insert = true; //insert record or not

                    foreach ($song->Infos as $info) {
                        $length = gmdate("H:i:s",floatval($info['SongLength']));
                        $file_date = date("Y-m-d",intval($info['FirstSeen']));
                    }

                    if($insert == true){
                        //set the kind (audio, video or karaoke):
                        if(strpos($song['FilePath'], '/Karaoke/') !== false){
                            $kind = "karaoke";
                        } elseif (in_array(strtolower(substr($song['FilePath'],-3)),$video_files)) {
                            $kind = "video";
                        } else{
                            $kind = "audio";
                        }

                        //Fill array of music
                        $arrayOfMusic[] = [
                            'bpm'       => ($tag['Bpm'] > 0) ? round(1 / floatval($tag['Bpm']) * 60) : null, //to calculate use 1/bpm * 60 and round
                            'file_path' => $song['FilePath'],
                            'artist'    => trim($tag['Author']),
                            'length'    => $length ?? '0', //set $length to 0 if it cannot be found
                            'file_date' => $file_date ?? '0', //set $file_date to 0 if it cannot be found
                            'title'     => trim($tag['Title']),
                            'remix'     => trim($tag['Remix']),
                            'album'     => trim($tag['Album']),
                            'genre'     => trim($tag['Genre']),
                            'filetype'  => substr($song['FilePath'],-3),
                            'year'      => ($tag['Year'] > 0) ? intval($tag['Year']) : null;
                            'kind'      => $kind,
                        ];
                    }//end if insert true
                } //end has title or author + non-studio
            } //end for each tag

        } //end not a netsearch file
        $reader->next('Song');
    } //end while
    $reader->close();
} //end for each files

//Chunk the array if $arrayOfMusic is not null
if (!empty($arrayOfMusic)) {
    $arrayOfMusicChunked = array_chunk($arrayOfMusic, 30); //Chunk large array, in this example, chunked array will contains 30 items

    //loop the array and insert it use insert() function
    foreach ($arrayOfMusicChunked as $arrayOfMusicToSave) {
        Music::insert($arrayOfMusicToSave);
    }
}

Source

  1. https://www.w3schools.com/php/func_array_chunk.asp

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...