'How to improve my tables and queries for Big Data applications?

I created an API on Symfony which produces more than 1 million entries by day into one of the MySql tables. This table structure is defined this way:

enter image description here

After some weeks of use, the table has already 35 million (not trillion) of rows. And when I query this table, the response time is almost 20 seconds for a simple query like this one:

public function findAllCryptosByRank($date_minute)
{
    $query = $this->_em->createQueryBuilder()
        ->select("cm")
        ->from("APIBundle:CoinmarketcapSnapshot", "cm")
        ->where("cm.date_minute = :date_minute")
        ->orderBy("cm.rank", "ASC")
        ->setMaxResults(10)
        ->setParameters(array(
            'date_minute' => $date_minute,
        ));
    $finalQuery = $query->getQuery();
    return $finalQuery->getArrayResult();
}

When doing something more complicated, it's even worse; the queries take more than one minute. For example for something like that:

public function findAllCryptosByRank($date_minute,$date_hour,$date_day,$date_month,$date_year)
{
    $query = $this->_em->createQueryBuilder()
        ->select("cm", "c.logo", "c.title")
        ->from("APIBundle:CoinmarketcapSnapshot", "cm")
        ->where("cm.date_minute = :date_minute")
        ->andWhere("cm.date_hour = :date_hour")
        ->andWhere("cm.date_day = :date_day")
        ->andWhere("cm.date_month = :date_month")
        ->andWhere("cm.date_year = :date_year")
        ->leftJoin(
            'APIBundle:Cryptocurrency',
            'c',
            \Doctrine\ORM\Query\Expr\Join::WITH,
            'cm.cryptocurrency__id = c. coinmarketcap_id'
        )
        ->orderBy("cm.rank", "ASC")
        ->setMaxResults(10)
        ->setParameters(array('date_minute'=>$date_minute,'date_hour'=>$date_hour,'date_day'=>$date_day,'date_month'=>$date_month,'date_year'=>$date_year))
    ;
    $finalQuery = $query->getQuery();
    return $finalQuery->getArrayResult();
}

So, what can I do to hugely improve these performance? I read about Doctrine, that was not designed for Big Data use cases.

I know it's possible to improve MySQL performance by using tools such has Hadoop, or by optimising indexes for example.

But will this be enough regarding the very low performances at this moment?

I want to be sure Symfony is the good choice for this application. I am considering migrating the API to another Backend Framework, such as ASP.NET or Node.JS. What do you think?



Solution 1:[1]

First, you shouldn't treat Doctrine as an inseparable part of Symfony. You are free to remove it and switch to a different ORM implementation, or maybe skip Doctrine ORM and just use the DBAL instead. Also, sometimes removing the overhead of ORM or iterating the result sets might give you a performance upgrade.

Second, this is not about Symfony or Doctrine, this is actually about the way you organize data in your application. The question you should be asking is whether you use the right tool for the job. As suggested in several comments, you could switch the storage completely (like, for example, use ElasticSearch as storage).

And also you should definitely use the typical query optimization techniques (like, run an EXPLAIN query and see where's the bottleneck).

Solution 2:[2]

  • Learn about indexing. That particular query (which seems weird) needs `INDEX(date_minute, rank).
  • Do the math: There is no disk made that can accept 35 trillion rows in a few weeks.
  • Your INT SIGNED AUTO_INCREMENT has a limit of 2 billion. That won't support 35 trillion rows.
  • Don't split date/time values into multiple columns.
  • Don't use 4-byte INT for 1-byte quantities such as "date_minute". See TINYINT UNSIGNED.
  • Use numeric datatypes (not LONGTEXT) for numeric values (volume, market cap, etc).
  • Don't include created and updated unless you expect to actually do anything with those columns. (3rd packages seem to like generating them, but then don't use them.)

I'm harping on datatypes because the table will get very big; shrinking the datatypes where possible will help it fit into your limited disk, and will help speed up queries.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 kix
Solution 2