Basic Data Import¶
This cookbook will guide you through the process of importing data into your e-shop from an external source, e.g., from an information system.
After completing this cookbook, you should know:
- how to implement basic data import
- what are the best practices for importing data
- what are the pitfalls of importing data, and how can you deal with them
Example: Products import step by step¶
Let us create simple products import from an external source. The source is mocked with apiari.io service which returns example products data in JSON format:
[
{
"id": 1001,
"name": "Emos LED light",
"price_without_vat": "354",
"vat_percent": "21",
"ean": "12345678901",
"brand_id": 101,
"description": "light source: 1 x 3W CREE LED + 12 LED 5 mm ..."
},
{
"id": 1002,
"name": "Digital camera EasyPix GoXtreme Impact, Full HD Action 1080p, white",
"price_without_vat": "2050",
"vat_percent": "21",
"ean": "12345678902",
"brand_id": 102,
"description": "EasyPix GoXtreme - outdoor waterproof camera."
},
{
"id": 1004,
"name": "Power gril PARTY TIME Tescoma",
"price_without_vat": "2478.51",
"vat_percent": "21",
"ean": "12345678904",
"brand_id": 103,
"description": "Highly powerful mobile grill with turbo vent."
}
]
Step 1 - Add $extId
to Product entity¶
We need to store the relation between your application database and the external source of data because later, in data transfer processing,
we will decide whether to create a new product or update an existing one based on the $extId
attribute.
If you do not know how to add an attribute to an entity, take a look at the cookbook.
Step 2 - Create new cron module¶
Cron modules are the best way to handle data downloaded from external sources because they can be scheduled, run in the background, and even iterated when necessary.
2.1 - Add new ImportProductsCronModule
class that implements SimpleCronModuleInterface
¶
// src/Model/Product/ImportProductsCronModule.php
namespace App\Model\Product;
use Shopsys\Plugin\Cron\SimpleCronModuleInterface;
use Symfony\Bridge\Monolog\Logger;
class ImportProductsCronModule implements SimpleCronModuleInterface
{
/**
* {@inheritdoc}
*/
public function setLogger(Logger $logger)
{
}
public function run()
{
}
}
Warning
Cron modules are not suitable for data transfers initialized by an external source. It would be best if you implemented some Web Services for that purpose.
2.2 - Add cron configuration to cron.yaml
¶
# config/services/cron.yaml
services:
_defaults:
autowire: true
autoconfigure: true
public: false
+ App\Model\Product\ImportProductsCronModule:
+ tags:
+ - { name: shopsys.cron, hours: '*', minutes: '*' }
Note: You can schedule cron modules to run whenever you want using cron expression. This example module will be run every 3 hours.
Step 3 - Process incoming data in our new ImportProductsCronModule
¶
In this step, we will download data from the external source
and then decide whether to create or update existing product based on $extId
.
3.2 - Download external data and create or update product based on $extId
¶
If a product exists with a given $extId
we modify it otherwise, we need to create a new one.
// src/Model/Product/ImportProductsCronModule.php
// ...
use App\Model\Product\ProductFacade;
// ...
const PRODUCT_DATA_URL = 'https://private-anon-38d0154157-ssfwbasicdataimportdemo.apiary-mock.com/products';
// ...
/**
* @var \App\Model\Product\ProductFacade
*/
private $productFacade;
// ...
/**
* @param \App\Model\Product\ProductFacade $productFacade
*/
public function __construct(
ProductFacade $productFacade
) {
$this->productFacade = $productFacade;
}
public function run()
{
$externalProductsJsonData = file_get_contents(self::PRODUCT_DATA_URL);
$externalProductsData = json_decode($externalProductsJsonData, true);
$this->importExternalProductsData($externalProductsData);
}
/**
* @param array $externalProductsData
*/
private function importExternalProductsData(array $externalProductsData)
{
foreach ($externalProductsData as $externalProductData) {
$extId = $externalProductData['id'];
$product = $this->productFacade->findByExternalId($extId); // will be implemented in next steps
if ($product === null) {
$this->createProduct($externalProductData); // will be implemented in next steps
} else {
$this->editProduct($product, $externalProductData); // will be implemented in next steps
}
}
}
Note
We need to know whether the product with the given $extId
exists.
For that purpose, we will use a descendant of ProductFacade
(more about facades)
which will use a descendant of ProductRepository
(more about repositories)
that can talk to the persistence layer.
We will extend the framework classes and implement new methods in the next two steps.
3.2 - Extend ProductRepository
and implement method findByExternalId()
to be able find a Product
by an external ID¶
Create new class App/Model/Product/ProductRepository
that extends ProductRepository
from the framework.
// src/Model/Product/ProductRepository.php
namespace App\Model\Product;
use Shopsys\FrameworkBundle\Model\Product\ProductRepository as BaseProductRepository;
class ProductRepository extends BaseProductRepository
{
/**
* @param int $extId
* @return \Shopsys\FrameworkBundle\Model\Product\Product|null
*/
public function findByExternalId($extId)
{
return $this->getProductRepository()->findOneBy(['extId' => $extId]);
}
}
Add information about the class extension into the container configuration in services.yaml
.
To make the service public in TEST environment, you need to add the same line into services_test.yaml
as well.
Shopsys\FrameworkBundle\Model\Product\ProductRepository: '@App\Model\Product\ProductRepository'
3.3 - Extend ProductFacade
and implement method findByExternalId()
in order to get Product
from repository¶
Create new class App/Model/Product/ProductFacade
that extends ProductFacade
from the framework.
// src/Model/Product/ProductFacade.php
namespace App\Model\Product;
use Shopsys\FrameworkBundle\Model\Product\ProductFacade as BaseProductFacade;
class ProductFacade extends BaseProductFacade
{
/**
* @var \App\Model\Product\ProductRepository
*/
protected $productRepository;
/**
* @param int $extId
* @return \Shopsys\FrameworkBundle\Model\Product\Product|null
*/
public function findByExternalId($extId)
{
return $this->productRepository->findByExternalId($extId);
}
}
Tip
You should overwrite the protected $productRepository
annotation so IDE knows that you are using the extended ProductRepository
.
Add information about the class extension into the container configuration in services.yaml
.
To make the service public in TEST environment,
you need to add the same line into services_test.yaml
as well.
Shopsys\FrameworkBundle\Model\Product\ProductFacade: '@App\Model\Product\ProductFacade'
3.4 - Implement ImportProductsCronModule::createProduct()
and ImportProductsCronModule::updateProduct()
¶
As an entry-point for data processing in Shopsys Platform, we use facades.
In this case, our new ProductFacade
and its inherited methods create()
and edit()
.
Those methods expect ProductData
class as a parameter, you can use ProductDataFactory
to create it.
// src/Model/Product/ImportProductsCronModule.php
// ...
use App\Model\Product\ProductDataFactory;
// ...
/**
* @var \App\Model\Product\ProductDataFactory
*/
private $productDataFactory;
// ...
/**
* @param \App\Model\Product\ProductFacade $productFacade
* @param \App\Model\Product\ProductDataFactory $productDataFactory
*/
public function __construct(
ProductFacade $productFacade,
ProductDataFactory $productDataFactory
)
{
$this->productFacade = $productFacade;
$this->productDataFactory = $productDataFactory;
}
// ...
/**
* @param array $externalProductData
*/
private function createProduct(array $externalProductData) {
$productData = $this->productDataFactory->create();
$this->fillProductData($productData, $externalProductData); // will be implemented in next step
$this->productFacade->create($productData);
}
/**
* @param \App\Model\Product\Product $product
* @param array $externalProductData
*/
private function editProduct(Product $product, array $externalProductData) {
$productData = $this->productDataFactory->createFromProduct($product);
$this->fillProductData($productData, $externalProductData); // will be implemented in next step
$this->productFacade->edit($product->getId(), $productData);
}
3.5 - Implement ImportProductsCronModule::fillProductData()
¶
Finally, we can implement the private method for filling data object
ProductData
with external source data.
// src/Model/Product/ImportProductsCronModule.php
// ...
use App\Model\Pricing\Vat\VatFacade;
use Shopsys\FrameworkBundle\Component\Money\Money;
use Shopsys\FrameworkBundle\Model\Product\ProductData;
use Shopsys\FrameworkBundle\Model\Product\ProductInputPriceDataFactory;
// ...
const DOMAIN_ID = 1;
const LOCALE = 'en';
const PRICING_GROUP_ID = 1;
// ...
public function __construct(
// ...
private readonly VatFacade $vatFacade,
private readonly ProductInputPriceDataFactory $productInputPriceDataFactory,
) {
}
/**
* @param \App\Model\Product\ProductData $productData
* @param array $externalProductData
*/
private function fillProductData(ProductData $productData, array $externalProductData)
{
$productData->name[self::LOCALE] = $externalProductData['name'];
$productData->productInputPricesByDomain[self::DOMAIN_ID] = $this->productInputPriceDataFactory->create(
$this->vatFacade->getVatByPercent($externalProductData['vat_percent']), // will be implemented in next step
[self::PRICING_GROUP_ID => Money::create($externalProductData['price_without_vat']),
);
$productData->ean = $externalProductData['ean'];
$productData->descriptions[self::DOMAIN_ID] = $externalProductData['description'];
$productData->extId = $externalProductData['id'];
}
Note
Data from external sources contain only integer value for vat percent information, but we need Vat
object
in ProductData
.
So we will extend VatRepository
and VatFacade
and implement appropriate methods.
Tip
Money::create()
can be used only for integers and numeric strings.
If you use floats in your $externalProductData
array, you should always use Money::createFromFloat()
and specify scale explicitly.
Read more about monetary values in How to Work with Money.
3.6 - Extend VatRepository
and implement method getVatByPercent()
in order to load Vat
by percent¶
// src/Model/Pricing/Vat/VatRepository.php
namespace App\Model\Pricing\Vat;
use Shopsys\FrameworkBundle\Model\Pricing\Vat\VatRepository as BaseVatRepository;
class VatRepository extends BaseVatRepository
{
/**
* @param int $percent
* @return \Shopsys\FrameworkBundle\Model\Pricing\Vat\Vat
*/
public function getVatByPercent($percent)
{
$vat = $this->getVatRepository()->findOneBy(['percent' => $percent]);
if ($vat === null) {
throw new \Shopsys\FrameworkBundle\Model\Pricing\Vat\Exception\VatNotFoundException('Vat with ' . $percent . '% not found.');
}
return $vat;
}
}
Add information about the class extension into the container configuration in services.yaml
Shopsys\FrameworkBundle\Model\Pricing\Vat\VatRepository: '@App\Model\Pricing\Vat\VatRepository'
Danger
The method throws an exception when Vat
object is not found by the given percent value.
Do not forget to handle it (e.g., skip the product data processing and log the exception).
3.7 - Extend VatFacade
and implement method getVatByPercent()
in it¶
// src/Model/Pricing/Vat/VatFacade.php
namespace App\Model\Pricing\Vat;
use Shopsys\FrameworkBundle\Model\Pricing\Vat\VatFacade as BaseVatFacade;
class VatFacade extends BaseVatFacade
{
/**
* @var \App\Model\Pricing\Vat\VatRepository
*/
protected $vatRepository;
/**
* @param int $percent
* @return \Shopsys\FrameworkBundle\Model\Pricing\Vat\Vat
*/
public function getVatByPercent($percent)
{
return $this->vatRepository->getVatByPercent($percent);
}
}
Tip
You should overwrite the protected $vatRepository
annotation so IDE knows that you are using the extended VatRepository
Add information about the class extension into the container configuration in services.yaml
.
To make the service public in TEST environment, you need to add the same line into services_test.yaml
as well.
Shopsys\FrameworkBundle\Model\Pricing\Vat\VatFacade: '@App\Model\Pricing\Vat\VatFacade'
Step 4 - Everything is done, you can run it now¶
4.1 Run php phing cron-list
in console¶
You will get a list of all available cron modules as an output.
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Component\Error\ErrorPageCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Cart\Item\DeleteOldCartsCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Feed\FeedCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Pricing\Vat\VatDeletionCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Product\Availability\ProductAvailabilityCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Product\Pricing\ProductPriceCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Product\Elasticsearch\ProductExportCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Product\ProductVisibilityImmediateCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Product\ProductVisibilityMidnightCronModule"
php bin/console shopsys:cron --module="Shopsys\FrameworkBundle\Model\Sitemap\SitemapCronModule"
php bin/console shopsys:cron --module="Shopsys\ProductFeed\HeurekaBundle\Model\HeurekaCategory\HeurekaCategoryCronModule"
php bin/console shopsys:cron --module="App\Model\Product\ImportProductsCronModule"
4.2 - Find your module and run the appropriate console command¶
php bin/console shopsys:cron --module="App\Model\Product\ImportProductsCronModule"
Hint
More information about what Phing targets are and how they work can be found in Console Commands for Application Management (Phing Targets).
Best practices¶
- Validate all incoming data before putting them into data objects.
- You will avoid SQL errors (e.g., when incoming text is longer than the database column allows).
- Transfer modifications only (if possible).
- Persist the last successful transfer start datetime so next time you can import changes done afterward.
- It means less data to process.
- Transfer deleted entities too.
- It is often omitted, but it is necessary if you want to keep consistency between your application database and the external source.
- Use database transactions.
- Use
EntityManager
methodsbeginTransaction()
,commit()
androllback()
. - Do not forget to clear the identity map when doing a rollback (so even entity modifications are reverted).
IteratedCronModuleInterface
offers a way to implement longer processes that can span over many iterations.
- Use
- Disable entity manager SQL logging (use
SqlLoggerFacade::temporarilyDisableLogging()
).- By default, every executed SQL query is logged, and that slows down the process and consumes much memory when there are many iterations.
- Do not forget to re-enable the logging after the processing is done (use
SqlLoggerFacade::reenableLogging()
).
- Logging of some key functions might be helpful for future debugging , but it is not a good idea to log "everything". Too much information in logs might be counterproductive.
- Clear entity manager identity map once in a while because
EntityManager::flush()
searches for changes in all mapped entities, and after time, it consumes a huge amount of resources to persist a single entity.- Call
EntityManager::flush()
with parameter (i.e. entity or array of entities you want to flush) anytime it is possible.
Warning: Flushing is not a cascade operation, i.e. when you flush an entity that contains any other entities (e.g., translations), these are not flushed automatically. You should not forget to flush them as well. - You should load any entity again after clearing the identity map because any attempt to flush the old one will result in an exception.
- Call
- Use streamed input for XML and JSON.
- So you do not load huge files at once (can lead to memory overflow).
- Store external source credentials in environment variables.
- Storing credentials in the local configuration instead of hard-coding them in source code prevents the accidental corrupting of production data.
- Restrict editing of the transferred fields in administration.
- At least mark them as transferred to avoid confusion when an administrator changes the field value and then data import overrides the value.
- Transfer overview in administration can be very useful for both the administrator and developer of an e-shop.
- It is handy to know which transfers are currently in progress, which are scheduled, which failed, etc.
- Be careful with the order of your data transfers.
- For example, products have an association with their categories, so first, you want to transfer products and then their relation to categories.
Pitfalls¶
- External data source often sends null values as empty strings, so be very careful with validating the incoming data. (
$value !== null
might not be sufficient in that case). - When processing a large amount of data, you can use native queries instead of using ORM, which might save you a certain amount of SQL queries. On the other hand, you have to handle all related logic manually then.
- It is necessary to mark transferred products for recalculations (e.g., price, visibility, etc.).
- In the current state, e-shop can not handle transferring huge amount (thousands) of categories.
- Categories use the "nested set" structure (
@Gedmo\Tree(type="nested")
) which produces many SQL queries.
- Categories use the "nested set" structure (
Conclusion¶
Now, you know how to implement simple data transfer to your e-shop from an external source.
You know why you should persist external ID and how to decide whether to create new entities or update existing ones.
You learned about Shopsys Platform cron modules, how to create and run them. You also know how to get desired objects based on external data
(e.g., Vat
object based on vat percent).
You are familiar with best practices for implementing data transfers,
what pitfalls you can encounter with, and what are the ways of dealing with them.